The Data Stack Show - 121: Materialize Origins: Breaking Down Data Flow Layers with Arjun Narayan and Frank McSherry
Episode Date: January 11, 2023
Highlights from this week's conversation include: Defining data flow (2:31); Are there limitations in timely data flow operation and/or building operators? (8:20); Areas of incremental computation that are having an impact today (17:10); Building a library vs building a product (24:06); Combining delight and empathy into a focus (27:52); Final thoughts and takeaways (32:42).
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. This is part two of our long conversation
with Frank and Arjun from Materialize. Brooks was out when we recorded this one, so we went
90 minutes, over 90 minutes, and Brooks made us split it into two episodes. In the first episode,
which, if you haven't listened, you absolutely need to go back and listen to, we heard about the backstory of Materialize and actually the individual
backstories of Frank, who has an incredible history building all sorts of interesting things
and has an academic paper that has an unreal number of citations. And then also Arjun, who was studying databases at the PhD level, and how they came
together.
And it was an amazing conversation.
So definitely check that one out.
In this episode, we dig into the technical details.
So Kostas, give us a little teaser of what we tackle in part two.
Oh, that's hard.
So we are going to get deeper into what timely data flow is.
By the way, we also have like different flavors.
Well, like we have differential and timely data flow.
We also get into that, and we will understand and learn more about why Frank got into building this and what the relationship is with MapReduce.
Yep.
And also like what it takes from building a model that can do, theoretically at least,
some amazing things, to reach the point where these can be used by users.
So, it's going to be super, super interesting.
Much more technical than the previous part.
So, yeah, I don't want to say more.
Let's just let the experts talk, right?
Buckle up.
Let's dive in.
All right.
Let's talk about Naiad and let's talk about data flow.
Okay.
I heard Arjun mentioning two types of data flow.
He used the terms differential and timely.
Yes.
Yeah, that's a good point.
Why do we have two terms here?
What's the difference?
So it's a good,
it's a good question.
So data flow,
first of all,
just for, you know,
folks watching together
on the same page, right,
is the idea that you might
describe your computer program
or, you know,
what you need to do
as paths of data through various places that you're going to do some work.
Right.
And, you know, this is sort of like assembly line building of things back at, you know,
a hundred years ago, but with data now, right.
Data move around.
And as data show up at a particular place, you'd say like, oh, as data comes here, I
need to go and canonicalize it in the following way, or I need to join with some other data
with everything that I receive. But it's a way of
describing your program using usually
a directed graph with little arrows and
circles so that
you get the answers out that you want, but
you're not too prescriptive about exactly what the computer
has to go and do in any particular moment.
It lets us spill that work across lots of
different computers.
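As a rough sketch of that picture, a tiny dataflow graph in the open-source timely Rust crate looks something like this; the specific records, routing, and closures are invented purely for illustration:

```rust
use timely::dataflow::operators::{Exchange, Filter, Inspect, Map, ToStream};

fn main() {
    // One or more workers each build the same dataflow graph; records are
    // spread across them and flow along the arrows.
    timely::execute_from_args(std::env::args(), |worker| {
        worker.dataflow::<u64, _, _>(|scope| {
            (0u64..10)
                .to_stream(scope)                      // source: records enter the graph
                .exchange(|x| *x)                      // route each record to a worker by its value
                .map(|x| x * 2)                        // a little bit of code run where the data shows up
                .filter(|x| x % 3 == 0)
                .inspect(|x| println!("saw: {}", x));  // sink: observe what comes out
        });
    })
    .unwrap();
}
```

Running the same binary with more workers (timely's argument parsing accepts a worker count, e.g. `-w 2`) builds the same graph on each worker, and the exchange edge routes records between them, which is the "spill the work across lots of different computers" point.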
There's two flavors. Lots of things get called
data flow. This is fine. So we had two of them, timely data flow and differential data flow. The right way to think
about them or a way to think about them is that timely data flow is sort of analogous to an
operating system and differential data flow is more analogous to a database. This is a terrible
analogy. It's a terrible analogy, but I'm going to finish it anyhow, just real quick.
Timely data flow is the sort of layer that is, I would say, is unopinionated
about what you're planning on doing
with the data moving around.
It just says, hi, I will move data from here to there, amazing.
You want to run this little bit of code over there?
I will do that for you.
Why are you doing this?
I have no idea,
but I will make sure that it happens.
In a similar sort of way that an operating system does, like,
you want to run a program?
Great.
What's it going to do?
We'll find out.
A database has a lot more opinions and says, like, before you get to run anything, you have to get it past me first. I've got some opinions on what you're allowed to run, and I also know what the correct answer is going to be, and I'm not just going to let you go and make a mess out of things.
And this is where differential data flow sort of differs from timely data flow. It says, I believe that you're talking about collections of data. I believe that you're going to communicate how those collections of data change.
And the only thing that I'm going to let you do with them is communicate how the answers
to your operations would change in response to the input data.
And you could do some crazier stuff than that in timely data flow, but differential data flow is sort of liberating by saying, I'm only going to help you do this part, but we're
going to do it really well.
Okay.
I think a restatement that is simpler: timely data flow is a generic data flow system, right? I like the assembly line analogy: you create a directed graph of operators, but in timely data flow you can have, you know, arbitrary operators that you write from scratch, you know, like a thingamajoodoodad recombinator.
What is that?
I don't know.
It's a black box.
Stuff goes in, thingamajoodads come out the other side.
Great.
And you can write a whole variety of these and some people do and that's great.
Differential data flow is simply a set of elegantly written operators that are opinionated, that we believe or Frank believes or differential data flow people believe that you might want.
So one of them, for instance, is called join.
You might be interested in that one.
One of them is called reduce, I think.
Yeah, yeah, no, that's a reduce.
And these are familiar operators.
They are also opinionated about the shapes of their inputs and their outputs.
Right.
They believe in timestamped diffs of data.
Right.
So the inputs are very different. You could imagine MapReduce as a directed graph of timestamped data, right?
You give it data, it gives you output data. Differential data flow deals in diffs of data, right? And of course, if you have data and no diffs, the diff is: start from zero, here's all the data. So it's sort of a generalization of batch compute. And a lot of care and thought has been put into
very performant implementations of those operators. So it's a library that uses timely
data flow underneath. Timely data flow is the underlying execution engine. Differential data
flow is a bunch of opinionated implementations of operators, of data flow operators, that is still very surprisingly
general and useful. On top of that, you know, you could put another layer, which is the SQL,
I'm going to take a SQL statement and convert it into a differential data flow program.
Now, this is what in fact Materialize says, except Materialize sits one layer even above,
which is like, I am going to run many timely data flow computers for you. Every time you type the words create cluster, I will create another timely data flow-shaped box. And then every time you say create view or select or something of that sort, a create materialized view or a select statement that requires doing some computation, I am going to translate that, perhaps optimize it and do a bunch of transformations, and then come up with a
differential data flow program, which then gets installed, run to completion, and then turned off
or sort of run continuously and kept running on that timely data flow cluster.
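As a rough sketch of that layering, a small differential dataflow program, using the open-source differential-dataflow and timely crates, might look something like the following; the "manages" relation and the self-join query are invented for illustration, and per the description above this is roughly the shape of program that the SQL layer would generate and install:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Join;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // A handle through which we feed timestamped diffs of a (manager, person) collection.
        let mut input: InputSession<u64, (u64, u64), isize> = InputSession::new();

        // The "installed" dataflow: join the collection with itself to pair each
        // person's manager with each of their reports.
        worker.dataflow::<u64, _, _>(|scope| {
            let manages = input.to_collection(scope);
            manages
                .map(|(mgr, person)| (person, mgr))
                .join(&manages)
                .inspect(|x| println!("observed: {:?}", x));
        });

        // Load data at time 0: person p reports to p / 2.
        input.advance_to(0);
        for person in 0u64..10 {
            input.insert((person / 2, person));
        }

        // Change the input at time 1: person 5 moves to a new manager.
        // Only the resulting diffs flow through the operators downstream.
        input.advance_to(1);
        input.remove((2, 5));
        input.insert((0, 5));
    })
    .unwrap();
}
```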
All right. So differential data flow builds on top of timely data flow.
And timely data flow is much more generic, like an operating system, as you said, Frank.
So what about expressivity, I don't know if that's the right term,
but are there limits in the things that I can do with timely data flow in terms of what I can compute?
Sure.
I mean, let me say there's two answers.
It's like, yes, there's some limits.
Absolutely.
And the other answer is no, there's no limits.
Let me try to explain.
Like timely data flow forces you to write your programs in a certain way.
And those ways tie your hands a little bit.
And sometimes that might be frustrating.
You know, it compels you to write structured programs.
You can't just, you know, one data flow graph you can make is just a little self-loop.
And it's just like, I'm going to do whatever I want.
Just send data back to myself and do whatever I want.
Screw you.
And it's not very helpful when you do that.
When you express a data flow,
sorry, a computation as a data flow, you get some cool
abilities from the system. The system is actually more helpful to you
at this point. We can start to distribute work once you've
actually broken things apart into different little pieces.
You could have always written whatever you wanted as
sort of one monolithic timely data flow operator that just doesn't
really benefit from expressing stuff as data flow, but as soon as you
break it apart and describe these interoperating pieces, you start to get
some benefits.
Yeah.
You start to get concurrency, data parallelism, all sorts of stuff like that.
The not-flip answer, sorry, there's a "yes, you can do everything" answer, which is not flip, which is that the thing that Naiad added, that timely data flow added on top of existing systems, was loops. And loops were sort of the thing that was missing from big data systems to make them fully general.
There are various models of computation. There's this PRAM model of computation, parallel RAM
model of computation, where you need three
fundamental things.
You need to be able to read from memory.
You need to be able to write back to memory.
You need to be able to go over and over again based on what you see.
So it turns out, if you scratch your head and turn your head sideways enough, joins
are reads.
If you join two things together, you're saying, hey, go find me the stuff that has this address, let's call it the key, you know, go look up some stuff. Great. Reduce is the write. So that's the thing that says, we've got a bunch of folks who think that they belong at a particular address, the key; go figure out what the right answer is. And once you get loops put in there,
you now have the ability to write programs, just generally. You can take an algorithm off the shelf and say,
how would I write this in a timely
data flow, for sure? Often differential
data flow, and many of its
advances, many of the reasons that it goes
fast and beats up on people, it's because you can
just take a smarter algorithm for the same problem.
There are a bunch of dumb algorithms
for problems that
are dumb, and people know that they're dumb, but they fit
in MapReduce.
And you spend 10 times more compute than you really need to, but that's fine because you've rented a hundred times as much. With Naiad and timely data flow, the cool thing that we were
able to do was use the smart algorithms and be more, just more performant, just do less work.
Not because raw system building,
but because you could transport intelligent ideas
that other people had come up with.
We're not inventing these algorithms.
We're just transporting existing known algorithms
into the big data space.
So you can implement.
I'm not aware of fundamental limitations.
Sorry, I'm sure they exist, and I'm sure once you put this online, there'll be a long list of things that people point out.
But it was definitely like a quantum step up over the MapReduce style models,
which did not have loops, which are just straight line data flow graphs.
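As a hedged sketch of what loops add, differential dataflow exposes an iterate operator, so something like transitive closure can be written directly rather than driving the loop from outside, which is what a MapReduce-style straight-line graph would force; the edge data here is invented for illustration:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut edge_input: InputSession<u64, (u64, u64), isize> = InputSession::new();

        worker.dataflow::<u64, _, _>(|scope| {
            let edges = edge_input.to_collection(scope);

            // Transitive closure: repeatedly extend known paths by one edge until
            // nothing changes. `iterate` runs the loop body to a fixed point.
            let closure = edges.iterate(|paths| {
                let edges = edges.enter(&paths.scope());
                paths
                    .map(|(src, dst)| (dst, src))          // key each path by its endpoint
                    .join(&edges)                          // extend it with an outgoing edge
                    .map(|(_mid, (src, dst))| (src, dst))
                    .concat(&edges)                        // keep the one-hop paths too
                    .distinct()                            // set semantics: each pair once
            });

            closure.inspect(|x| println!("reachable: {:?}", x));
        });

        // A small chain 0 -> 1 -> 2 -> 3; the fixed point adds (0,2), (0,3), (1,3).
        edge_input.advance_to(0);
        for (src, dst) in [(0u64, 1u64), (1, 2), (2, 3)] {
            edge_input.insert((src, dst));
        }
    })
    .unwrap();
}
```

The loop body is itself a little dataflow, and because it is differential, later edge changes only push diffs around the loop rather than recomputing the closure from scratch.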
Yeah, that's interesting.
So question here.
So you said, okay, I can go to differential data flow, which is, I have like some operators
there that I can use, right?
And I can either like have a monolith data flow, right?
Which, okay, it will execute fine.
But the real value comes from like, I mean, obviously you want to parallelize that so
you can scale, right?
And you want to do that, like, as a developer, you want to use the primitives
and not have to worry about how this thing is going to be parallelized.
And if the parallelization is going to be consistent and sound and all that stuff.
Are there limitations in terms of the operators? Like, is there an operator that I can build that turns the dataflow into something that cannot be parallelized?
So this is, let me, this is a great question.
Let me actually back up just a moment, because you said you use this
language so that you can parallelize and it's actually more complicated
than that, or better than that.
Because not only do you get to parallelize, that's why you would
use MapReduce or Spark or so.
The reason differential data flow wants you to do it
is because they automatically incrementalize as well.
So all of this parallelism that you got,
let's imagine that you spread the work out
across 10 computers or even a million,
you don't have a million computers, but let's pretend.
And if the input to only one of them changes,
you only need to redo the work in that one location.
So the real advantage actually, in my mind, for differential data flow is that by using this programming model, for the same reasons that the operators parallelize, they happen to incrementalize as well.
So these operators that we've forced you to use, joins and reduces, maps, filters, stuff like that, tricked you into writing your program in an automatically incrementalizable form.
You could always write a cruddy one.
You can write a reduce function that says,
there's only one key, true, or something like that.
Please give me all of my gigabytes of data.
I'll do the function on it, and we'll see what happens.
And you can write that in differential data flow.
Unfortunately, you'll be disappointed to find out that
if any of your input gigabyte changes,
we will show you the
gigabyte again, slightly changed and say, what's the answer now? Because we don't know what you're
going to do with it. You might be computing a hash of this, in which case the answer is totally
different and we really can't help you out. If on the other hand, you were to say, well, yeah,
I'm computing a Merkle tree or something like that. What I really want to do is break apart
my data into a bunch of different pieces, hash each of the pieces, put those hashes together, and then get an answer at the bottom.
If any one bit of data changed, we'd only need to reflow the changes to the hashes down the tree,
and you'd have now an efficiently updatable thing.
You can write the cruddy version as well.
You just won't be delighted either by its parallelization or by its incrementalization.
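A rough sketch of that contrast, again with the open-source differential-dataflow crate and invented data: the first aggregation is keyed, so a changed input record only re-runs work for its key, while the second funnels everything to a single key and has the whole group re-presented on any change:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::{Count, Reduce};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input: InputSession<u64, (String, u64), isize> = InputSession::new();

        worker.dataflow::<u64, _, _>(|scope| {
            let events = input.to_collection(scope);

            // Keyed aggregation: updates only re-run work for keys whose input changed.
            events
                .map(|(user, _amount)| user)
                .count()
                .inspect(|x| println!("per-user count: {:?}", x));

            // Degenerate version: everything funnels to one key. Still correct, but
            // any change re-presents the whole group to this closure.
            events
                .map(|_record| ((), ()))
                .reduce(|_key, inputs, output| {
                    // Count every record under the single key (multiplicities included).
                    let total: isize = inputs.iter().map(|(_value, diff)| *diff).sum();
                    output.push((total, 1isize));
                })
                .inspect(|x| println!("global count: {:?}", x));
        });

        input.advance_to(0);
        input.insert(("alice".to_string(), 3));
        input.insert(("bob".to_string(), 5));

        // Only alice's key needs re-reducing in the keyed version; the single-key
        // version re-examines its whole group.
        input.advance_to(1);
        input.insert(("alice".to_string(), 2));
    })
    .unwrap();
}
```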
Okay, that's great. Wow, that's great.
And my understanding, correct me if I'm wrong, but these operators that we are
talking about that have been like implemented as part of differential data
flow, they feel a little bit more, let's say, focused on processing data, right?
Like we have joins, map, reduce. We're talking about datasets and trying to run some aggregations, probably, on top of them, all that stuff. Are there other types of, let's say, operators out there that have been made that don't only have to do with aggregations and joins and the stuff that we usually use?
Well, for sure in the space of
incremental computation, there are different approaches
to how you might go
and try to convince someone
to write an incremental program
or how you might elicit from them
stuff. And
differential data flow uses a technique called change propagation
which basically says,
let's see what the program is, change your data, we'll see what happens. It's very data-centric.
It's about moving the data through the computation, seeing what happens differently.
There are other approaches based, for example, on memoization. So you have things like
Matthew Hammer's Adapton and Umut Acar's various, I guess, a few different approaches in different languages, that are based more on memoizing, on incrementalizing control flow systems.
So these are, you know, if you write a program that has a lot more ifs and elses and wheres
and whatnot like that, well, I guess I've been writing SQL too much, then they're going to respond much better to that. Versus, one of the things that is sort of obvious when I say it out loud, but one of the downsides of a data flow program generally is that
the data flow graph is locked down. Like you write that and that's what happens to the data.
You don't decide halfway through the execution that really it should look different or something
like that. If you have two things you want to do and choose between them, you write both of them.
You have a little switch node up front, but you have to write both of them.
And that's super gross if there's a hundred thousand different ways that you could do work. It's really handy if there are five ways to do work and you have a hundred thousand bits of data. But these other systems are going to be much more appropriate for control-flow-heavy work.
Just turns out that data processing is pretty popular at the moment.
So, of course.
Yeah, it makes sense.
What other areas do you see this incremental computation having an impact on today, or do you think we're going to see more of it happen?
There's a bunch.
Let's see.
I mean, these are like application areas that you could drop down.
Arjun just loaded up on the side SDNs, which is one, sorry, software-defined networking.
Yeah.
Where you use logic to describe where in the world little bits of packets should go.
Sorry, I might've just stolen Arjun's thunder, by the way.
No, you know this better than I do.
I was actually looking up publicly citable sources so that I could, I wanted to check
if I was allowed to speak about it.
I see.
Yeah.
And I am because yes.
Yeah.
Sorry.
So VMware is happily using differential data flow as well, in prototype, for various software-defined networking, where your goal is to describe the configuration state of networks, you know, other systems generally, let's say, but like in VMware's case, networks. Packets
seem to go from A to B to wherever.
You really want the property that as soon as, it's not super data intensive,
actually their control plane necessarily, but you want the property that as soon
as something changes as fast as possible, no joke, you get to the right new answer.
And no glitches either.
Don't screw up.
So a good way to think about it is when a VM is moved,
the host networking address has changed,
and you want to precisely cut over all the streams of TCP
packets that were going to the old hardware host to the new hardware host.
And you don't want to actually duplicate any packets. You don't want to, actually, TCP might be fine because it might layer over you and fix the errors, but these may not be TCP packets, maybe UDP packets. You want the control plane to sort of do that Indiana Jones swap perfectly.
Makes sense.
There's plenty of other places, like there's lots of applications, especially now with things adjacent to data.
I mean, actually, in the heart of data, but maybe one level up, you have all sorts of machine learning, various serving tasks and stuff like that.
Machine learning, I think, often actually is another example of a different way to do incremental stuff.
Like a lot of machine learning is based on stirring a pot for a while until you get the
answers.
And if the data change, like, great, stir some more.
And, you know, sorry, this is a funny mental image, but the idea there is that your models are confluent in the sense that as you put whatever data in, you'll get to the right answer.
So it's totally fine to throw in a little bit more data and you'll keep going there.
But it's a different approach to incrementalization.
There's a whole bunch of incremental work going on in things like parsing. You know, if you have your 10,000-line source file open and you go and change a curly brace somewhere, you don't want to rescan the entire file and rerun that sort of thing.
So it creeps up.
Some bits of differential data flow were used, I think not anymore, but were used in Rust's type checker internally, for example, to try to determine whether someone has written a valid program or not. Again, I think that it being incremental is kind of handy, on account of re-analyzing an entire code base, both, yeah, to compile it, but also, you know, lints and stuff like that, just re-checking a code base. A lot of people, essentially a lot of organizations, are CI-bound, right? Like, you can't land the next bit of code until 30 minutes have gone through where someone has gone and reanalyzed all of your stuff and you've checked a bunch of random nonsense.
And if you can turn that into one minute instead of 30, that's a great feeling.
I have a question about Rust because I know that you have also like
contributed some stuff there, like for the compiler.
Kind of, but ask away.
Not as much as you might think.
But I think it's very interesting.
And it's very interesting because I think it's important for people to
understand how general this architecture, this whole model of computation, is, right?
And we talk about data here, but I think bringing an example from something that
might feel alien enough from data, which is like compiler and using a similar
technique there to perform something, I think makes people understand the
expressivity of these things that we are talking about.
I would take the contrary position, because one of the jokes I like to make here is that we will be successful when we have users who are delighted by Materialize, but all they know about it is: me have SQL, me SQL slow, me use Materialize, me SQL go fast. Right? And that's important because, again, back to the academia point, you have to earn the right to take up the user's time, to care, to understand all of this stuff that's below the iceberg, below the waterline. Right? Like, all this stuff is important, we've got to sweat the details, but by no means can your pitch to the user be, look at all this wonderful deep compiler tech. It's not that people are dumb, it's that people are busy, right?
They have business problems.
They don't have enough time and you have to approach them in the data stack that
they have with the queries that they already have and say, Hey, in five
minutes, you can auto incrementalize this dbt model and have it be real time.
And then they're paying attention. Now they're like, how did you do that? I might be interested in doing more things like this. And that's a good time to start talking about some of the things that we've started talking about.
Yeah, yeah. The point here, actually, is that it's definitely great if you start and show someone, I can keep your counts up to date really fast. That's cool and maybe eye-catching. But the scary experience is certainly, all right, I'm going to do counts anywhere. I'm going to make it a little harder.
Getting whatever it takes to get the confidence there with people that actually the horizon for how much you could potentially do with this is quite large. One of the things that we've not yet put into Materialize, because we're busy, is recursive computation. It's a thing that, I think, no one else out there is prepared to put, recursive SQL especially, into a view maintenance engine. I won't say it's easy, that's wrong, but 100%, the compute plane is prepared for that. And it's in many ways nice to know that Materialize isn't going to be out of date in a year or two when people realize that they could benefit from some recursive rules.
Because all the software-defined networking stuff uses Datalog and has recursion in it.
Does that mean that you won't be able to materialize to wherever your application takes you next?
Unfortunately not.
Unfortunately, it's broad and expressive.
Yeah.
So one question about that and okay, I'll, I'll skip the question about like the
combine and I'll get more like back to my, I am like a SQL Neanderthal here and I
just want like, you know, like things to be easy.
So you build a library, right?
Frank, so you've built something there and I'm saying that because part of like
the conversation at the beginning with Arjun was like, yeah, like it's cool.
You build this thing over there, like academics can probably use it, but from that to making it accessible to everyone out there, there are things that need to happen.
And like you like SQL for example, right?
Which is something that like more people speak.
So what's like, what's your experience on that?
You build like the library in a very specific mindset where you were coming from.
And then you started seeing like the steps and like the things that need to be built on top of it, like to make it like even more accessible.
So how different is it, and how much work is needed, and how many people are needed to find the right way to do that?
So there's a big difference, was my conclusion, between building a library and building a product.
The library got built certainly with the help of colleagues that I had throughout the years, but I would say, you know, timely data flow and differential data flow together are about 15,000 lines of code or something like that. They're not large.
My experience has been that when you build libraries, one of the things that's valuable is your opinion.
You know, you get to tell people what the rules are when they show up. You get to tell people, here's how to correctly use the thing that I've built.
And if I think what you're trying to do is dumb, I'll find some way to rule it out
because I think like, it's not gonna work out well for you.
Yeah.
When you're building a product, you have to do quite the opposite, which is people are going to come to you and tell you, here's what I'm planning on doing.
And if you want to do business, you need to make sure that is accommodated.
You know, I would love to delete various parts of the SQL spec because I think they're misfeatures.
Not allowed to do that.
And I, you know, have been dragged to the opinion that I'm not allowed to do this and I need
to instead figure out how to interpret
the weirdest things that people wrote
down in SQL and turn
them into meaningful computation
that behaves itself.
That's not easy.
Like, there are plenty of other people in the org who are better
at that than I am, and it's an interesting
technical challenge to figure out how to translate
again,
cunning ideas here into
more pre-chewed and easy-to-use
packets.
But yeah, very different experiences.
One of them, the library is
very inward-focused.
I'm going to do a thing that I know how to use; it works great for me. Transitioning to more of an outward focus: how do I make a thing that brings what we have that's cool to as many people as possible?
All right.
I have over-monopolized the conversation and we are all over our time, but I think it would be a shame to stop the conversation because it was super, super interesting and I learned a lot.
But Eric, to you for the last question.
So we actually, I get to make the rules and y'all are awesome.
So I'm super excited about that
but so that Brooks doesn't
quit
when we send him this file
I'll end on a question that
has really intrigued
me throughout this conversation
and
wow I have learned so much but
one thing that
both of you continually bring up is empathy.
And it's very clear in the way that both of you describe even very deeply technical concepts that you have a very high level of empathy.
And both of you use the word delightful a lot.
And you're very descriptive and sort of describing experiences. I'm so interested in where that
comes from because you're very aligned on that. And I think it's very rare, actually,
especially when discussing deeply technical topics to have delight as such a foundational value.
But that's really something I've heard, you know, throughout the last 90 minutes repeatedly. So I'd just love to know where that comes from and how,
and maybe for our listeners, have you learned anything about how to develop or maintain that
focus? I have to be totally honest. I have, I think, an intellectual appreciation for empathy and, you know, I'm practicing it, but it's, it's, you know, it's not where things started for me.
I mean, I think, well, let me just say, I think if you have a variety of experiences, like I went from being in academia to being unemployed to, to eventually being in a startup. And like, one of the things that was sort of cool about that was getting to bump into a
whole bunch of different people doing different things,
different levels of background,
you know,
going from talking with academics to going and talking to people who were,
you know,
as smart,
but really quite busy and being asked to do dumb things that you agree are
dumb.
And like you realize,
wow,
okay.
It's not,
everyone had the same experiences that I had.
Uh, and then you have some of those experiences yourself
with a bunch of PRs that people file against your library.
I don't, you know, just having access to a broader and broader,
if you can manage it, variety of experiences in life,
definitely hammers home how many different people
are coming from different places
and what's actually worth doing
to make as many of these people happy as you can.
I think a large part of it is so...
I forget where I heard this framing of it.
It's not original to me.
It comes from somewhere.
I just don't...
I'm forgetting where.
But a thing I continually remind myself is,
imagine you're sitting down with some people, and
these are incredibly, you know, you have to not think about it as dumbing down what your
contributions are because the audience isn't smart enough.
And I think a lot of people make the mistake of trying to dumb down things.
It's not about dumbing things down.
Imagine you're sitting down with a bunch of incredibly intelligent folks who
have been absolutely so swamped that they have had no time to think about your problem. So they're
fully capable of understanding it. Let's say you've got three Nobel laureates in biology,
chemistry, and physics in front of you, right? They are very busy people because they are
consumed with very hard problems.
And that is what they think about every single day. And now you have a shared problem. You have
to explain it to them. Again, it's not that they're not smart enough. It's that they have
zero, devoted zero minutes or seconds to it. How would you explain things? And I think that goes a very long way to setting a tone, which is: you never really talk down, you educate, because people are busy. And that's exactly the case in the data ecosystem, right?
Like most people writing SQL queries have shit to do, which is why they're
writing these SQL queries.
We can nerd out a lot about SQL and query languages and microservices,
but you will lose your audience, not because they can't handle it, but because you need to first give them an experience where they are getting value.
And ideally, in a world where they don't actually need to dig through all of the various details,
they might have to get into one or two specifics if it so pertains to the specific business problem that they have in front of them.
But if you start from the premise of they first have to wrap their heads around your entire field before they can make progress in their field, then I think you're pretty doomed.
Wonderful advice to end on.
Thank you again for giving us so much of your time.
This will be our first double episode
which I'm super excited about.
And we'll definitely have you back on
again in the future
to hear even more about what you're building.
So thank you again.
Thank you.
Thank you very much.
It's really fun.
Kostas, that whole conversation, I know we released it in two parts, but it was over 90 minutes and it really felt more like 20 minutes, I would say, and was just such an enjoyable episode. You know, doing this for two years, it's definitely going to be one of the ones, I think, that sticks out. My big takeaway, I think,
from the conversation is actually something that we discussed right at the very end.
And it was remarkable to me how both Frank and Arjun, really independently and, I think, authentically, because they were talking about very different things, used the word delightful.
and when you're talking about
heavy duty technology
building on timely data flow
and streaming SQL
and all of the crazy stuff we talked about.
Delightful is not a word that you would expect.
And it gave me so much respect for the way that they think about the people using the
technology that they're building and how they're keeping that at the forefront.
You know, even in the face of, you know, some really heavy duty
technology that's doing really cool stuff.
And that to me was just a personal lesson and reminder about that being such a key
ingredient of building something truly great.
Yeah.
And something that I want to keep from both parts of the conversation, and I
think Frank mentioned that numerous times, is how many different people with different skills are needed to turn, let's say, a scientific paper into something, a product that everyone can go and use and get value out of.
And I think, I mean, you know, many times we hear on the news about scientific breakthroughs, and usually that's in other fields, not in computer science that much. And we hear about breakthroughs and people think, oh, okay, this has been achieved, so now, I don't know, suddenly we are going to have infinite energy or, you know, we will be off to other galaxies and stuff like that. But the true thing is that from the point that something has been achieved for the first time, or something has been described or proposed, right, to get to the point where this can be used by everyone out there takes a lot of human effort.
A lot of human effort.
And yeah, like building a company, it's exactly that, like bringing all these different people together to do that.
Even marketing people.
There, I said it.
Marketing people.
Yes.
Even marketing people.
I couldn't have said it better myself.
No, I think you're totally right. And I would say we got a full end-to-end picture of not only what it takes to get the technology itself to a place where end users can use it, but
a really good look at how you build a team that can actually do that
work. So what a special conversation. We'll take the wheel from Brooks more often
on behalf of our listeners, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.