The Data Stack Show - 121: Materialize Origins: Breaking Down Data Flow Layers with Arjun Narayan and Frank McSherry
Episode Date: January 11, 2023
Highlights from this week's conversation include: Defining data flow (2:31); Are there limitations in timely data flow operation and/or building operators? (8:20); Areas of incremental computation that are having an impact today (17:10); Building a library vs building a product (24:06); Combining delight and empathy into a focus (27:52); Final thoughts and takeaways (32:42).
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. This is part two of our long conversation
with Frank and Arjun from Materialize. Brooks was out when we recorded this one, so we went
90 minutes, over 90 minutes, and Brooks made us split it into two episodes. In the first episode,
which, if you haven't listened, you absolutely need to go back and listen to, we heard about the backstory of Materialize and actually the individual
backstories of Frank, who has an incredible history building all sorts of interesting things
and has an academic paper that has an unreal number of citations. And then also Arjun, who was studying databases at the PhD level, and how they came
together.
And it was an amazing conversation.
So definitely check that one out.
In this episode, we dig into the technical details.
So Kostas, give us a little teaser of what we tackle in part two.
Oh, that's hard.
So we are going to get deeper into what timely data flow is.
By the way, we also have like different flavors.
Well, like we have differential and timely data flow.
We also get into that, and we will understand and learn more about why Frank got into building this and what the relationship is with MapReduce.
Yep.
And also like what it takes from building a model that can do, theoretically at least,
some amazing things, to reach the point where these can be used by users.
So, it's going to be super, super interesting.
Much more technical than the previous part.
So, yeah, I don't want to say more.
Let's just let the experts talk, right?
Buckle up.
Let's dive in.
All right.
Let's talk about Naiad and let's talk about data flow.
Okay.
I heard Arjun mentioning two types of data flow.
He used the terms differential and timely.
Yes.
Yeah, that's a good point.
Why do we have two terms here?
What's the difference?
So it's a good,
it's a good question.
So data flow,
first of all,
just for, you know,
folks watching together
on the same page, right,
is the idea that you might
describe your computer program
or, you know,
what you need to do
as paths of data through various places that you're going to do some work.
Right.
And, you know, this is sort of like assembly line building of things back at, you know,
a hundred years ago, but with data now, right.
Data move around.
And as data show up at a particular place, you'd say like, oh, as data comes here, I
need to go and canonicalize it in the following way, or I need to join with some other data
with everything that I receive. But it's a way of
describing your program using usually
a directed graph with little arrows and
circles so that
you get the answers out that you want, but
you're not too prescriptive about exactly what the computer
has to go and do in any particular moment.
It lets us spill that work across lots of
different computers.
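As a rough sketch of that picture, a tiny dataflow graph in the open-source timely Rust crate looks something like this; the specific records, routing, and closures are invented purely for illustration:

```rust
use timely::dataflow::operators::{Exchange, Filter, Inspect, Map, ToStream};

fn main() {
    // One or more workers each build the same dataflow graph; records are
    // spread across them and flow along the arrows.
    timely::execute_from_args(std::env::args(), |worker| {
        worker.dataflow::<u64, _, _>(|scope| {
            (0u64..10)
                .to_stream(scope)                      // source: records enter the graph
                .exchange(|x| *x)                      // route each record to a worker by its value
                .map(|x| x * 2)                        // a little bit of code run where the data shows up
                .filter(|x| x % 3 == 0)
                .inspect(|x| println!("saw: {}", x));  // sink: observe what comes out
        });
    })
    .unwrap();
}
```

Running the same binary with more workers (timely's argument parsing accepts a worker count, e.g. `-w 2`) builds the same graph on each worker, and the exchange edge routes records between them, which is the "spill the work across lots of different computers" point.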
There's two flavors. Lots of things get called
data flow. This is fine. So we had two of them, timely data flow and differential data flow. The right way to think
about them or a way to think about them is that timely data flow is sort of analogous to an
operating system and differential data flow is more analogous to a database. This is a terrible
analogy. It's a terrible analogy, but I'm going to finish it anyhow, just real quick.
Timely data flow is the sort of layer that is, I would say, is unopinionated
about what you're planning on doing
with the data moving around.
It just says, hi, I will move data from here to there, amazing.
You want to run this little bit of code over there?
I will do that for you.
Why are you doing this?
I have no idea,
but I will make sure that it happens.
In a similar sort of way that an operating system does, like,
you want to run a program?
Great.
What's it going to do?
We'll find out.
A database has a lot more opinions and says, like, before you get to run anything, you have to get it past me first. I've got some opinions on what you're allowed to run, and I also know what the correct answer is going to be, and I'm not just going to let you go and make a mess out of things.
And this is where differential data flow sort of differs from timely data flow. It says, I believe that you're talking about collections of data. I believe that you're going to communicate how those collections of data change.
And the only thing that I'm going to let you do with them is communicate how the answers
to your operations would change in response to the input data.
And you could do some crazier stuff than that in timely data flow, but differential data flow is sort of liberating by saying, I'm only going to help you do this part, but we're
going to do it really well.
Okay.
I think a restatement that is simpler: timely data flow is a generic data flow system, right? I like the assembly line analogy: you create a directed graph of operators, but in timely data flow you can have, you know, arbitrary operators that you write from scratch, you know, like a thingamajoodoodad recombinator.
What is that?
I don't know.
It's a black box.
Stuff goes in, thingamajoodads come out the other side.
Great.
And you can write a whole variety of these and some people do and that's great.
Differential data flow is simply a set of elegantly written operators that are opinionated, that we believe or Frank believes or differential data flow people believe that you might want.
So one of them, for instance, is called join.
You might be interested in that one.
One of them is called reduce, I think.
Yeah, yeah, no, that's a reduce.
And these are familiar operators.
They are also opinionated about the shapes of their inputs and their outputs.
Right.
They believe in timestamped diffs of data.
Right.
So the inputs are very different. You could imagine MapReduce as a directed graph of timestamped data, right?
You give it data, it gives you output data. Differential data flow deals in diffs of data, right? And of course, if you have data and no diffs, the diff is: start from zero, here's all the data. So it's sort of a generalization of batch compute. And a lot of care and thought has been put into
very performant implementations of those operators. So it's a library that uses timely
data flow underneath. Timely data flow is the underlying execution engine. Differential data
flow is a bunch of opinionated implementations of operators, of data flow operators, that is still very surprisingly
general and useful. On top of that, you know, you could put another layer, which is the SQL,
I'm going to take a SQL statement and convert it into a differential data flow program.
Now, this is what in fact Materialize says, except Materialize sits one layer even above,
which is like, I am going to run many timely data flow computers for you. Every time you type the words create cluster, I will create another timely data flow-shaped box. And then every time you say create view or select or something of that sort, a create materialized view or a select statement that requires doing some computation, I am going to translate that, perhaps optimize it and do a bunch of transformations, and then come up with a
differential data flow program, which then gets installed, run to completion, and then turned off
or sort of run continuously and kept running on that timely data flow cluster.
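As a rough sketch of that layering, a small differential dataflow program, using the open-source differential-dataflow and timely crates, might look something like the following; the "manages" relation and the self-join query are invented for illustration, and per the description above this is roughly the shape of program that the SQL layer would generate and install:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Join;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // A handle through which we feed timestamped diffs of a (manager, person) collection.
        let mut input: InputSession<u64, (u64, u64), isize> = InputSession::new();

        // The "installed" dataflow: join the collection with itself to pair each
        // person's manager with each of their reports.
        worker.dataflow::<u64, _, _>(|scope| {
            let manages = input.to_collection(scope);
            manages
                .map(|(mgr, person)| (person, mgr))
                .join(&manages)
                .inspect(|x| println!("observed: {:?}", x));
        });

        // Load data at time 0: person p reports to p / 2.
        input.advance_to(0);
        for person in 0u64..10 {
            input.insert((person / 2, person));
        }

        // Change the input at time 1: person 5 moves to a new manager.
        // Only the resulting diffs flow through the operators downstream.
        input.advance_to(1);
        input.remove((2, 5));
        input.insert((0, 5));
    })
    .unwrap();
}
```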
All right. So differential data flow builds on top of timely data flow.
And timely data flow is much more generic, like an operating system, as you said, Frank.
So what about expressivity, I don't know if that's the right term,
but are there limits in the things that I can do with timely data flow in terms of what I can compute?
Sure.
I mean, let me say there's two answers.
It's like, yes, there's some limits.
Absolutely.
And the other answer is no, there's no limits.
Let me try to explain.
Like timely data flow forces you to write your programs in a certain way.
And those ways tie your hands a little bit.
And sometimes that might be frustrating.
You know, it compels you to write structured programs.
You can't just, you know, one data flow graph you can make is just a little self-loop.
And it's just like, I'm going to do whatever I want.
Just send data back to myself and do whatever I want.
Screw you.
And it's not very helpful when you do that.
When you express a data flow,
sorry, a computation as a data flow, you get some cool
abilities from the system. The system is actually more helpful to you
at this point. We can start to distribute work once you've
actually broken things apart into different little pieces.
You could have always written whatever you wanted as
sort of one monolithic timely data flow operator that just doesn't
really benefit from expressing stuff as data flow, but as soon as you
break it apart and describe these interoperating pieces, you start to get
some benefits.
Yeah.
You start to get concurrency, data parallelism, all sorts of stuff like that.
The not-flip answer, sorry, there's a "yes, you can do everything" answer, which is not flip, which is that the thing that Naiad added, that timely data flow added on top of existing systems, was loops. And loops were sort of the thing that was missing from big data systems to make them fully general.
There are various models of computation. There's this PRAM model of computation, parallel RAM
model of computation, where you need three
fundamental things.
You need to be able to read from memory.
You need to be able to write back to memory.
You need to be able to go over and over again based on what you see.
So it turns out, if you scratch your head and turn your head sideways enough, joins
are reads.
If you join two things together, you're saying, hey, go find me the stuff that has this address, let's call it the key, you know, go look up some stuff. Great. Reduce is the write. So that's the thing that says, we've got a bunch of folks who think that they belong at a particular address, the key; go figure out what the right answer is. And once you get loops put in there,
you now have the ability to write programs, just generally. You can take an algorithm off the shelf and say,
how would I write this in a timely
data flow, for sure? Often differential
data flow, and many of its
advances, many of the reasons that it goes
fast and beats up on people, it's because you can
just take a smarter algorithm for the same problem.
There are a bunch of dumb algorithms
for problems that
are dumb, and people know that they're dumb, but they fit
in MapReduce.
And you spend 10 times more compute than you really need to, but that's fine because you've rented a hundred times as much. With Naiad and timely data flow, the cool thing that we were
able to do was use the smart algorithms and be more, just more performant, just do less work.
Not because raw system building,
but because you could transport intelligent ideas
that other people had come up with.
We're not inventing these algorithms.
We're just transporting existing known algorithms
into the big data space.
So you can implement.
I'm not aware of fundamental limitations.
Sorry, I'm sure they exist, and I'm sure once you put this online, there'll be a long list of things that people point out.
But it was definitely like a quantum step up over the MapReduce style models,
which did not have loops, which are just straight line data flow graphs.
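As a hedged sketch of what loops add, differential dataflow exposes an iterate operator, so something like transitive closure can be written directly rather than driving the loop from outside, which is what a MapReduce-style straight-line graph would force; the edge data here is invented for illustration:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut edge_input: InputSession<u64, (u64, u64), isize> = InputSession::new();

        worker.dataflow::<u64, _, _>(|scope| {
            let edges = edge_input.to_collection(scope);

            // Transitive closure: repeatedly extend known paths by one edge until
            // nothing changes. `iterate` runs the loop body to a fixed point.
            let closure = edges.iterate(|paths| {
                let edges = edges.enter(&paths.scope());
                paths
                    .map(|(src, dst)| (dst, src))          // key each path by its endpoint
                    .join(&edges)                          // extend it with an outgoing edge
                    .map(|(_mid, (src, dst))| (src, dst))
                    .concat(&edges)                        // keep the one-hop paths too
                    .distinct()                            // set semantics: each pair once
            });

            closure.inspect(|x| println!("reachable: {:?}", x));
        });

        // A small chain 0 -> 1 -> 2 -> 3; the fixed point adds (0,2), (0,3), (1,3).
        edge_input.advance_to(0);
        for (src, dst) in [(0u64, 1u64), (1, 2), (2, 3)] {
            edge_input.insert((src, dst));
        }
    })
    .unwrap();
}
```

The loop body is itself a little dataflow, and because it is differential, later edge changes only push diffs around the loop rather than recomputing the closure from scratch.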
Yeah, that's interesting.
So question here.
So you said, okay, I can go to differential data flow, which is, I have like some operators
there that I can use, right?
And I can either like have a monolith data flow, right?
Which, okay, it will execute fine.
But the real value comes from like, I mean, obviously you want to parallelize that so
you can scale, right?
And you want to do that, like, as a developer, you want to use the primitives
and not have to worry about how this thing is going to be parallelized.
And if the parallelization is going to be consistent and sound and all that stuff.
Are there limitations in terms of the operators? Like, is there an operator that I can build that turns the dataflow into something that cannot be parallelized?
So this is, let me, this is a great question.
Let me actually back up just a moment, because you said you use this
language so that you can parallelize and it's actually more complicated
than that, or better than that.
Because not only do you get to parallelize, that's why you would
use MapReduce or Spark or so.
The reason differential data flow wants you to do it
is because they automatically incrementalize as well.
So all of this parallelism that you got,
let's imagine that you spread the work out
across 10 computers or even a million,
you don't have a million computers, but let's pretend.
And if the input to only one of them changes,
you only need to redo the work in that one location.
So the real advantage actually, in my mind, for differential data flow is that by using this programming model, for the same reasons that the operators parallelize, they happen to incrementalize as well.
So these operators that we've forced you to use, joins and reduces, maps, filters, stuff like that, tricked you into writing your program in an automatically incrementalizable form.
You could always write a cruddy one.
You can write a reduce function that says,
there's only one key, true, or something like that.
Please give me all of my gigabytes of data.
I'll do the function on it, and we'll see what happens.
And you can write that in differential data flow.
Unfortunately, you'll be disappointed to find out that
if any of your input gigabyte changes,
we will show you the
gigabyte again, slightly changed and say, what's the answer now? Because we don't know what you're
going to do with it. You might be computing a hash of this, in which case the answer is totally
different and we really can't help you out. If on the other hand, you were to say, well, yeah,
I'm computing a Merkle tree or something like that. What I really want to do is break apart
my data into a bunch of different pieces, hash each of the pieces, put those hashes together, and then get an answer at the bottom.
If any one bit of data changed, we'd only need to reflow the changes to the hashes down the tree,
and you'd have now an efficiently updatable thing.
You can write the cruddy version as well.
You just won't be delighted either by its parallelization or by its incrementalization.
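A rough sketch of that contrast, again with the open-source differential-dataflow crate and invented data: the first aggregation is keyed, so a changed input record only re-runs work for its key, while the second funnels everything to a single key and has the whole group re-presented on any change:

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::{Count, Reduce};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input: InputSession<u64, (String, u64), isize> = InputSession::new();

        worker.dataflow::<u64, _, _>(|scope| {
            let events = input.to_collection(scope);

            // Keyed aggregation: updates only re-run work for keys whose input changed.
            events
                .map(|(user, _amount)| user)
                .count()
                .inspect(|x| println!("per-user count: {:?}", x));

            // Degenerate version: everything funnels to one key. Still correct, but
            // any change re-presents the whole group to this closure.
            events
                .map(|_record| ((), ()))
                .reduce(|_key, inputs, output| {
                    // Count every record under the single key (multiplicities included).
                    let total: isize = inputs.iter().map(|(_value, diff)| *diff).sum();
                    output.push((total, 1isize));
                })
                .inspect(|x| println!("global count: {:?}", x));
        });

        input.advance_to(0);
        input.insert(("alice".to_string(), 3));
        input.insert(("bob".to_string(), 5));

        // Only alice's key needs re-reducing in the keyed version; the single-key
        // version re-examines its whole group.
        input.advance_to(1);
        input.insert(("alice".to_string(), 2));
    })
    .unwrap();
}
```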
Okay, that's great. Wow, that's great.
And my understanding, correct me if I'm wrong, but these operators that we are
talking about that have been like implemented as part of differential data
flow, they feel a little bit more, let's say, focused on processing data, right?
Like we have joins, map, reduce. We're talking about datasets and trying to run some aggregations, probably, on top of them, all that stuff. Are there other types of, let's say, operators out there that have been made that don't only have to do with aggregations and joins and the stuff that we usually use?
Well, for sure in the space of
incremental computation, there are different approaches
to how you might go
and try to convince someone
to write an incremental program
or how you might elicit from them
stuff. And
differential data flow uses a technique called change propagation
which basically says,
let's see what the program is, change your data, we'll see what happens. It's very data-centric.
It's about moving the data through the computation, seeing what happens differently.
There are other approaches based, for example, on memoization. So you have things like
Matthew Hammer's Adapton and Umut Acar's various, I guess, a few different approaches in different languages, that are based more on memoizing, on incrementalizing control flow systems.
So these are, you know, if you write a program that has a lot more ifs and elses and wheres
and whatnot like that, well, I guess I've been writing SQL too much, then they're going to respond much better to that. Versus, one of the things that is sort of obvious when I say it out loud, but one of the downsides of a data flow program generally is that
the data flow graph is locked down. Like you write that and that's what happens to the data.
You don't decide halfway through the execution that really it should look different or something
like that. If you have two things you want to do and choose between them, you write both of them.
You have a little switch node up front, but you have to write both of them.
And that's super gross if there's a hundred thousand different ways that you could do work. It's really handy if there are five ways to do work and you have a hundred thousand bits of data. But these other systems are going to be much more appropriate for control-flow-heavy work.
Just turns out that data processing is pretty popular at the moment.
So, of course.
Yeah, it makes sense.
What other areas do you see this incremental computation having an impact on today, or do you think we're going to see more of it happen?
There's a bunch.
Let's see.
I mean, these are like application areas that you could drop down.
Arjun just loaded up on the side SDNs, which is one, sorry, software-defined networking.
Yeah.
Where you use logic to describe where in the world little bits of packets should go.
Sorry, I might've just stolen Arjun's thunder, by the way.
No, you know this better than I do.
I was actually looking up publicly citable sources so that I could, I wanted to check
if I was allowed to speak about it.
I see.
Yeah.
And I am because yes.
Yeah.
Sorry.
So VMware is happily using differential data flow as well, in prototype, for various software-defined networking, where your goal is to describe the configuration state of networks, you know, other systems generally, let's say, but like in VMware's case, networks. Packets
seem to go from A to B to wherever.
You really want the property that as soon as, it's not super data intensive,
actually their control plane necessarily, but you want the property that as soon
as something changes as fast as possible, no joke, you get to the right new answer.
And no glitches either.
Don't screw up.
So a good way to think about it is when a VM is moved,
the host networking address has changed,
and you want to precisely cut over all the streams of TCP
packets that were going to the old hardware host to the new hardware host.
And you don't want to actually duplicate any packets. You don't want to, actually, TCP might be fine because it might layer over you and fix the errors, but these may not be TCP packets, maybe UDP packets. You want the control plane to sort of do that Indiana Jones swap perfectly.
Makes sense.
There's plenty of other places, like there's lots of applications, especially now with things adjacent to data.
I mean, actually, in the heart of data, but maybe one level up, you have all sorts of machine learning, various serving tasks and stuff like that.
Machine learning, I think, often actually is another example of a different way to do incremental stuff.
Like a lot of machine learning is based on stirring a pot for a while until you get the
answers.
And if the data change, like, great, stir some more.
And, you know, sorry, this is a funny mental image, but the idea there is that your models are confluent in the sense that as you put whatever data in, you'll get to the right answer.
So it's totally fine to throw in a little bit more data and you'll keep going there.
But it's a different approach to incrementalization.
There's a whole bunch of incremental work going on in things like parsing. You know, if you have your 10,000-line source file open and you go and change a curly brace somewhere, you don't want to rescan the entire file and rerun that sort of thing.
So it creeps up.
Some bits of differential data flow were used, I think not anymore, but were used in Rust's type checker internally, for example, to try to determine whether someone has written a valid program or not. Again, I think that it being incremental is kind of handy, on account of re-analyzing an entire code base, both, yeah, to compile it, but also, you know, lints and stuff like that, just re-checking a code base. A lot of people, essentially a lot of organizations, are CI-bound, right? Like, you can't land the next bit of code until 30 minutes have gone through where someone has gone and reanalyzed all of your stuff and you've checked a bunch of random nonsense.
And if you can turn that into one minute instead of 30, that's a great feeling.
I have a question about Rust because I know that you have also like
contributed some stuff there, like for the compiler.
Kind of, but ask away.
Not as much as you might think.
But I think it's very interesting.
And it's very interesting because I think it's important for people to
understand how general this architecture, this whole model of computation, is, right?
And we talk about data here, but I think bringing an example from something that
might feel alien enough from data, which is like compiler and using a similar
technique there to perform something, I think makes people understand the
expressivity of these things that we are talking about.
I would take the contrary position, because one of the jokes I like to make here is that we will be successful when we have users who are delighted by Materialize, but all they know about it is: me have SQL, me SQL slow, me use Materialize, me SQL go fast. Right? And that's important because, again, back to the academia point, you have to earn the right to take up the user's time, to care, to understand all of this stuff that's below the iceberg, below the waterline. Right? Like, all this stuff is important, we've got to sweat the details, but by no means can your pitch to the user be, look at all this wonderful deep compiler tech. It's not that people are dumb, it's that people are busy, right?
They have business problems.
They don't have enough time and you have to approach them in the data stack that
they have with the queries that they already have and say, Hey, in five
minutes, you can auto incrementalize this dbt model and have it be real time.
And then they're paying attention. Now they're like, how did you do that? I might be interested in doing more things like this. And that's a good time to start talking about some of the things that we've started talking about.
Yeah, yeah. The point here, actually, is that it's definitely great if you start and show someone, I can keep your counts up to date really fast. That's cool and maybe eye-catching. But the scary experience is certainly, all right, I'm going to do counts anywhere. I'm going to make it a little harder.
Getting whatever it takes to get the confidence there with people that actually the horizon for how much you could potentially do with this is quite large. One of the things that we've not yet put into Materialize, because we're busy, is recursive computation. It's a thing that, I think, no one else out there is prepared to put, recursive SQL especially, into a view maintenance engine. I won't say it's easy, that's wrong, but 100%, the compute plane is prepared for that. And it's in many ways nice to know that Materialize isn't going to be out of date in a year or two when people realize that they could benefit from some recursive rules.
Because all the software-defined networking stuff uses Datalog and has recursion in it.
Does that mean that you won't be able to materialize to wherever your application takes you next?
Unfortunately not.
Unfortunately, it's broad and expressive.
Yeah.
So one question about that and okay, I'll, I'll skip the question about like the
combine and I'll get more like back to my, I am like a SQL Neanderthal here and I
just want like, you know, like things to be easy.
So you build a library, right?
Frank, so you've built something there and I'm saying that because part of like
the conversation at the beginning with Arjun was like, yeah, like it's cool.
You build this thing over there, like academics can probably use it, but from that to making it accessible to everyone out there, there are things that need to happen.
And like you like SQL for example, right?
Which is something that like more people speak.
So what's like, what's your experience on that?
You build like the library in a very specific mindset where you were coming from.
And then you started seeing like the steps and like the things that need to be built on top of it, like to make it like even more accessible.
So how different is it, and how much work is needed, and how many people are needed to find the right way to do that?
So there's a big difference, was my conclusion, between building a library and building a product.
The library got built certainly with the help of colleagues that I had throughout the years, but I would say, you know, timely data flow and differential data flow together are about 15,000 lines of code or something like that. They're not large.
My experience has been that when you build libraries, one of the things that's valuable is your opinion.
You know, you get to tell people what the rules are when they show up. You get to tell people, here's how to correctly use the thing that I've built.
And if I think what you're trying to do is dumb, I'll find some way to rule it out
because I think like, it's not gonna work out well for you.
Yeah.
When you're building a product, you have to do quite the opposite, which is people are going to come to you and tell you, here's what I'm planning on doing.
And if you want to do business, you need to make sure that is accommodated.
You know, I would love to delete various parts of the SQL spec because I think they're misfeatures.
Not allowed to do that.
And I, you know, have been dragged to the opinion that I'm not allowed to do this and I need
to instead figure out how to interpret
the weirdest things that people wrote
down in SQL and turn
them into meaningful computation
that behaves itself.
That's not easy.
Like, there are plenty of other people in the org who are better
at that than I am, and it's an interesting
technical challenge to figure out how to translate
again,
cunning ideas here into
more pre-chewed and easy-to-use
packets.
But yeah, very different experiences.
One of them, the library is
very inward-focused.
I'm going to do a thing that I know how to use; it works great for me. Transitioning to more of an outward focus: how do I make a thing that brings what we have that's cool to as many people as possible?
All right.
I have over-monopolized the conversation and we are all over our time, but I think it would be a shame to stop the conversation because it was super, super interesting and I learned a lot.
But Eric, to you for the last question.
So we actually, I get to make the rules and y'all are awesome.
So I'm super excited about that
but so that Brooks doesn't
quit
when we send him this file
I'll end on a question that
has really intrigued
me throughout this conversation
and
wow I have learned so much but
one thing that
both of you continually bring up is empathy.
And it's very clear in the way that both of you describe even very deeply technical concepts that you have a very high level of empathy.
And both of you use the word delightful a lot.
And you're very descriptive and sort of describing experiences. I'm so interested in where that
comes from because you're very aligned on that. And I think it's very rare, actually,
especially when discussing deeply technical topics to have delight as such a foundational value.
But that's really something I've heard, you know, throughout the last 90 minutes repeatedly. So I'd just love to know where that comes from and how,
and maybe for our listeners, have you learned anything about how to develop or maintain that
focus? I have to be totally honest. I have, I think, an intellectual appreciation for empathy and, you know, I'm practicing it, but it's, it's, you know, it's not where things started for me.
I mean, I think, well, let me just say, I think if you have a variety of experiences, like I went from being in academia to being unemployed to, to eventually being in a startup. And like, one of the things that was sort of cool about that was getting to bump into a
whole bunch of different people doing different things,
different levels of background,
you know,
going from talking with academics to going and talking to people who were,
you know,
as smart,
but really quite busy and being asked to do dumb things that you agree are
dumb.
And like you realize,
wow,
okay.
It's not,
everyone had the same experiences that I had.
Uh, and then you have some of those experiences yourself
with a bunch of PRs that people file against your library.
I don't, you know, just having access to a broader and broader,
if you can manage it, variety of experiences in life,
definitely hammers home how many different people
are coming from different places
and what's actually worth doing
to make as many of these people happy as you can.
I think a large part of it is so...
I forget where I heard this framing of it.
It's not original to me.
It comes from somewhere.
I just don't...
I'm forgetting where.
But a thing I continually remind myself is,
imagine you're sitting down with some people, and
these are incredibly, you know, you have to not think about it as dumbing down what your
contributions are because the audience isn't smart enough.
And I think a lot of people make the mistake of trying to dumb down things.
It's not about dumbing things down.
Imagine you're sitting down with a bunch of incredibly intelligent folks who
have been absolutely so swamped that they have had no time to think about your problem. So they're
fully capable of understanding it. Let's say you've got three Nobel laureates in biology,
chemistry, and physics in front of you, right? They are very busy people because they are
consumed with very hard problems.
And that is what they think about every single day. And now you have a shared problem. You have
to explain it to them. Again, it's not that they're not smart enough. It's that they have
zero, devoted zero minutes or seconds to it. How would you explain things? And I think that goes a very long way to setting a tone, which is: you never really talk down, you educate, because people are busy. And that's exactly the case in the data ecosystem, right?
Like most people writing SQL queries have shit to do, which is why they're
writing these SQL queries.
We can nerd out a lot about SQL and query languages and microservices,
but you will lose your audience, not because they can't handle it, but because you need to first give them an experience where they are getting value.
And ideally, in a world where they don't actually need to dig through all of the various details,
they might have to get into one or two specifics if it so pertains to the specific business problem that they have in front of them.
But if you start from the premise of they first have to wrap their heads around your entire field before they can make progress in their field, then I think you're pretty doomed.
Wonderful advice to end on.
Thank you again for giving us so much of your time.
This will be our first double episode
which I'm super excited about.
And we'll definitely have you back on
again in the future
to hear even more about what you're building.
So thank you again.
Thank you.
Thank you very much.
It's really fun.
Kostas, that whole conversation, I know we released it in two parts, but it was over 90 minutes and it really felt more like 20 minutes, I would say, and was just such an enjoyable episode. You know, doing this for two years, it's definitely going to be one of the ones, I think, that sticks out. My big takeaway, I think,
from the conversation is actually something that we discussed right at the very end.
And it was remarkable to me how both Frank and Arjun, really independently and, I think, authentically, because they were talking about very different things, used the word delightful.
and when you're talking about
heavy duty technology
building on timely data flow
and streaming SQL
and all of the crazy stuff we talked about.
Delightful is not a word that you would expect.
And it gave me so much respect for the way that they think about the people using the
technology that they're building and how they're keeping that at the forefront.
You know, even in the face of, you know, some really heavy duty
technology that's doing really cool stuff.
And that to me was just a personal lesson and reminder about that being such a key
ingredient of building something truly great.
Yeah.
And something that I want to keep from both parts of the conversation, and I
think Frank mentioned that numerous times, is how many different people with different skills are needed to turn, let's say, a scientific paper into something, a product that everyone can go and use and get value out of.
And I think, I mean, you know, many times we hear on the news about scientific breakthroughs, and usually that's in other fields, not in computer science that much. And we hear about breakthroughs and people think, oh, okay, this has been achieved, so now, I don't know, suddenly we are going to have infinite energy or, you know, we will be off to other galaxies and stuff like that. But the true thing is that from the point that something has been achieved for the first time, or something has been described or proposed, right, to get to the point where this can be used by everyone out there takes a lot of human effort.
A lot of human effort.
And yeah, like building a company, it's exactly that, like bringing all these different people together to do that.
Even marketing people.
There, I said it.
Marketing people.
Yes.
Even marketing people.
I couldn't have said it better myself.
No, I think you're totally right. And I would say we got a full end-to-end picture of not only what it takes to get the technology itself to a place where end users can use it, but
a really good look at how you build a team that can actually do that
work. So what a special conversation. We'll take the wheel from Brooks more often
on behalf of our listeners, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.