The Data Stack Show - 46: A New Paradigm in Stream Processing with Arjun Narayan of Materialize

Episode Date: July 28, 2021

Highlights from this week’s episode include:

Introducing Arjun and how he fell in love with databases (2:51)
Looking at what Materialize brings to the stack (5:28)
Analytics starts with a human in the loop and comes into its own when analysts get themselves out and automate it (15:46)
Using Materialize instead of the materialized view from another tool (18:44)
Comparing Postgres and Materialize and looking at what's under the hood of Materialize (23:16)
Making Materialize simple to use (32:33)
Why Materialize doubled down on writing 100% in Rust (35:43)
The best use case to start with (42:03)
Lessons learned from making Materialize a cloud offering (44:22)
Keeping databases in the cloud for low latency (48:31)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. Today, we get to talk with the founder of a company building a database product. The company is called Materialize, and Arjun is the founder of the company. And I'm super
Starting point is 00:00:42 interested to talk to him. I think, as I think about our audience, Costas, the biggest question that comes to mind is, what are the immediate use cases from a tool like Materialize that at a foundational level can take jobs with data that are generally considered batch and happen over a long period of time with a lot of latency and essentially turn them into real-time jobs. Analytics is absolutely a use case that I think makes a ton of sense, but I'm sure that people are doing all sorts of other interesting things. So that's going to be my big question is, as far as use cases, analytics is obvious, but what else can you do when you go from batch to real-time in the context of a database? Costas, you love materializing. I cannot wait to hear what your burning questions are. Yeah, yeah. I mean, okay, first of all, what you have in your mind, I think it's a great question. Materialize is a very, let's say, novel way of interacting with data and consuming data. So it's very interesting to see what people
Starting point is 00:01:43 are doing with it. So absolutely, I'm really looking forward to hear about the use cases. I have a lot of questions myself, to be honest. I don't know how much we'll manage to cover today. Most of them are going to be technical. I want to learn more about the technology, like the secret sauce, let's say, behind Materialize as a database. And also, apart from technology, it's also a very interesting product. Like the ergonomics that this database has is very, very interesting.
Starting point is 00:02:13 So I have quite a few different questions that will help us understand better the technology behind it and also some choices that the team has made in building this new database system. Okay. Well, let's jump in and talk with Arjun. Let's do it. Arjun, welcome to the Data Stack Show. We are very excited to talk with you because there's just so many data topics that we could cover in this conversation,
Starting point is 00:02:40 and we probably won't have time to get through all of them, but welcome. Thank you very much. I'm excited to be on the show. Let's just start like we always do with, we'd love to know your background. I'm Arjun Narayan. I'm the co-founder and CEO of Materialize. Materialize is a streaming database for real-time applications and analytics. It allows you to get extremely complicated and complex analytics answers in real time on top of streams of data as opposed to once a day on top of batch data. It looks and feels exactly like a SQL database. I started Materialize a little over two and a half years ago. Before that, I worked in a different field of databases. I was
Starting point is 00:03:26 a software engineer at Cockroach Labs working on CockroachDB, which is an OLTP scale out, horizontally scalable database. And before that, I did a PhD in distributed systems and big data processing. I've sort of lived, breathed, and been in data for a while, and a little bit by accident. I didn't intend to fall in love with databases, but as I learned more and more about how they power most of our applications and experiences that we deal with computers, they just became endlessly fascinating to me. And I've spent a decade looking at databases at this point. I love that.
Starting point is 00:04:09 With a PhD in anything related to databases, I would think that you have a lot of technical acumen. But, and I love the sentence, I didn't mean to fall in love with databases. I feel like that's the beginning of a novel that may have a very specific readership. Okay, Materialize, super interesting. I feel like that's the beginning of a novel that may have a very specific readership. Okay. Materialize, super interesting. I think a lot of our audience is very familiar with working sort of in and around your traditional database data warehouse, right?
Starting point is 00:04:38 So Postgres, the usual suspects when it comes to data warehouses, you have Redshift, BigQuery, Snowflake is obviously taking over the market. And there are really common paradigms within that, you can run SQL, you can create views, et cetera. The syntax and stuff is a little bit different depending on the warehouse. But for our average listener who maybe, let's just take an example, they are a data engineer. They do a lot of work getting data into Snowflake. They create views. They create different use cases for analytics teams, et cetera.
Starting point is 00:05:15 For that person who may not be familiar with Materialize, could you just paint a picture of if you introduce Materialize into the stack, what does that look like? And what are the key benefits that it brings? That's a great question. I think it helps to break down a standard paradigm of where most databases fit in, in the traditional worldview. And then we'll introduce how Materialize sort of brings some new capability that's different from what's currently in the market. So databases, and this is going back, say, several decades at this point, traditionally fall into two large buckets. There's the transactional databases and the analytics databases. So transactional databases
Starting point is 00:05:55 are your Oracle, your Postgres, your MySQL. They're generally speaking focused on processing lots of transactions that may potentially be conflicting. They're sort of the point that decides what events are allowed to happen. So they reject some transactions, they accept some other ones, and then they're very good at writing those transactions down. So they're very focused on avoiding data losses. It's something you really, really want from your transactional database. Then you have your analytics databases, like your BigQuery, your Redshift, your Snowflake. Your analytics databases are more focused on enabling far more powerful compute. Typically, in SQL databases, people use SQL in both settings, in the transactional setting and the analytical setting.
Starting point is 00:06:47 But if you take some of these complex queries, say it's joining eight tables together over at least some of these tables are very, very large. Those queries, if you ran them on a transactional database, the transactional database would A, most likely fall apart. And B, if it didn't fall apart, it would probably greatly slow down your other concurrent transactions. So there's a reason people mostly separate these systems. If an analyst types some large analytical query about last quarter sales, you don't want all your cart checkouts to triple in latency, right? So it makes sense. It makes perfect architectural sense to separate these concerns and then also build separate
Starting point is 00:07:21 systems that are optimized for these different classes of workloads. The big, big thing that most people give up today is your analytics query runs on a dump of the data that is somewhat stale. So this is feeding your batch data warehouse with a once a day ETL. I mean, this is really ETL. ETL, really extract transform load is about getting data out of the transactional system and putting it in the analytics system. It's getting less painful, but it used to be an extremely painful process. You would run it overnight, once a day. Some folks are now running this on a more multiple times a day, but it is still fundamentally a batch operation, which means there's a large of analytics or analytical style queries that are incredibly valuable to have in real time, which don't make sense around a transactional
Starting point is 00:08:34 database, but existing analytical databases or data warehouses are not equipped to do because they're fundamentally built in this batch paradigm. Materialized flips the setting a little bit, which is instead of computing your answer off of a data set from scratch when the query is presented to you, it pre-materializes some set of questions that you've pre-registered with Materialize. And this is why the companies have been named Materialize. So if you might be familiar with the term Materialized Views, the entire point of a Materialized View is you tell the database, hey, I'm interested in asking this question on a repeated basis. Can you please pre-compute it for me as the data changes?
Starting point is 00:09:20 In the past, most Materialized view support in most databases has been highly restricted, right? So you can do it for fairly simple queries, but if the query gets fairly complex, the database really wants you to ask it and then it'll go ahead and do the work rather than doing a whole bunch of redundant work that has to be immediately thrown away the moment the data changes. So under the hood, Materialize is an incremental query processor. And we can talk a little bit more about the technology because this is a thing, I don't think I'm describing anything that people haven't wanted for a very long time. The unique thing that we bring is a novel set of underlying research and technologies that allow
Starting point is 00:10:01 this to happen in an elegant fashion. But Materialize allows you to ask these complex analytical queries on a sub-second, say a single-digit millisecond latency, even when these queries are very, very complex. This is more than just about taking some analytics query that you've asked once a day and making it a dashboard. Now, absolutely, a lot of our users start by taking something that they computed once a day, some very valuable metric, and making that into a dashboard so they could see it on a more real-time basis, especially in, say, the financial and the trading use cases. They can never have things fast enough. But the more interesting thing happens when you start to put these live changing data and taking automated actions off of them.
Starting point is 00:10:51 So you could think alerting, you could think personalization in an application as you get real-time data, as opposed to realizing that somebody was a customer that should be segmented a certain way and then doing an email marketing campaign the next day by the time your OLAP job finished. There's a wide variety of uses where when you can action while a user is on your website or while a transaction is still pending before it has been authorized to clients, if it's a card transaction, it's much more valuable to make a precise judgment as to the quality of that user or that transaction within, say, a 10, 100 millisecond budget versus doing that overnight and reacting to it the next day. Absolutely. I mean, this is fascinating. And
Starting point is 00:11:40 we've had several conversations with different businesses where this is where they're heading with their architecture. And e-commerce comes to mind just because it's a situation where you have a lot of data. A lot of it needs to be enriched or combined with other data. So data from transactions or ML models and all of that's happening in some sort of database. And the challenge has been we're creating all this value of the data that we have. And it's very difficult to deliver that with speed, right? And in e-commerce, if you want to send a personalized coupon right after purchase or something like that, that needs to happen very quickly. But the latency has
Starting point is 00:12:26 been really high just due to technology. But that's changing. And that's really, really exciting. So super, super interesting. Absolutely. One of the things that we see is the amount of folks who are putting in the capabilities, and we're very much in the early stages of this architectural transform, because folks are pretty much just putting in place the streaming infrastructure to move the data at low latencies and at high volumes. So this is doing change data capture out of their transactional databases on an ongoing basis so that milliseconds after a transaction commits in Postgres or MySQL, it is present in a Kafka topic that can be used for
Starting point is 00:13:06 these downstream consumers or downstream applications. And the early adopters have gone ahead and built these manual microservices, right? So the absolute earliest adopters have adopted this microservice pattern, which comes at a huge cost, right? So not to mention just the development cost of building these manual not to mention just the development cost of building these manual microservices, but the ongoing maintenance and upkeep costs that these microservices introduce when you want to just say,
Starting point is 00:13:34 change a little bit of business logic, right? So changing business logic sometimes takes a full quarter because you have to shut down or upgrade these microservices in a controlled fashion. And perhaps something that would be very simple in a database, like joining against another stream, ends up introducing a massive amount of architectural shift, as you now have to build and manually maintain an extra set of states that is introduced by adding on that third topic. So these are the sort of costs that people currently pay that we want to reduce.
Starting point is 00:14:07 So we think that building these streaming microservices, streaming applications right on top of the stream should be as easy as building a CRUD app using a MySQL database. Today, it's not, but with Materialize, it is. Yeah. Well, I want to dig into some of the technical details because there are a lot of questions that Kostas and I talked about. But before we get there, you mentioned something around moving just beyond the basic analytics use case. And that's something I want to talk about briefly. People use the term digital transformation, which is a buzzword, but on the spectrum of digital transformation, you have companies who have figured out the analytics thing and they're relying on technology that is doing the batch,
Starting point is 00:15:06 you know, is relying on the batch load paradigm, maybe with outdated tech. What are you seeing? Or, I mean, there are a lot of companies who I think could just benefit from the analytics use case in and of itself. But the real, the use cases that really move the needle are the ones where you're actually delivering personalization or other really dynamic customer experiences. But I'd just love to know what you're saying as you talk with your customers and people who are interested in adopting something like Materialize, what's the balance? Are a lot
Starting point is 00:15:35 of companies still trying to figure out the analytics use case or are there more companies than we think who are actually doing some really interesting things around the customer experience. That's an excellent question. To me, a large part of this comes from where your analytics team is. One of the amazing things that has been happening in the industry is analytics teams have become progressively more empowered to do more and more and create more value for their organizations and now are starting to get into building these applications or building part something that ends up being surfaced in the core application. The way I think about this is analytics pretty much starts with a human in the loop and then analytics starts really coming into a zone where once the analysts
Starting point is 00:16:22 themselves are trying to figure out how to get themselves out of the loop, right? And how to make these things automated. So I think a lot of the analytics journey to real time and streaming begins with augmenting the human capability by giving them a more live, but where it truly comes into its own is when we start doing automated actions directly off that analytics pipeline. There's a huge benefit to everyone in the organization, whether it's the application or the analyst speaking the same, speaking a common language in terms of defining the metrics that they've been thinking about in the exact same way. DBT is, of course, absolutely the leader in creating an ecosystem where an entire company's or an organization's data is modeled using a single unified paradigm. And starting from the
Starting point is 00:17:16 analyst and then going towards the application, I think, is the correct way to do things. I absolutely encourage most folks to take their first steps by moving, say, a once a day refreshed dashboard into real time because, A, it's an enabler of a lot more things. And it's a good way to ensure that all the application and the real time in application experiences are fundamentally based on the exact same vocabulary that is already part of the analytical organization. Arjun, this is great. Actually, before I start asking my questions, I have to tell you, I really enjoyed your
Starting point is 00:17:54 introduction. I think it was one of the best descriptions of the difference between the two database paradigms that we have, which is pretty common. Many people are asking about why do we need to have an analytics database and a transactional database. But that was amazing. If you haven't written a blog post or something about that, please go and do it. I think many people are going to thank you about it.
Starting point is 00:18:19 But I have a couple of more technical questions that I want to make. And let's start, a little bit more technical questions that I want to make. Let's start with the materialization. You mentioned that you also chose the name because of the concept of materialized use. Why someone would use materialize and not just keep using the materialized use of transactional database for example offers? Excellent. Well, thank you so much, Costas. I appreciate it. I should write a blog post. This is a great question in terms of why not just use the materialized view
Starting point is 00:18:52 in, say, Postgres or MySQL? Well, the first answer is if your materialized view becomes the slightest bit complicated, you'll lose the ability to incrementally update it. So it's really about what is the update strategy for this materialized view? Because for a complex materialized view, let's say you're joining four tables together, you have some subquery in there, you have some non-trivial aggregation, maybe some max and some group by or something of that sort. The first thing in OLTP or even an OLAP database is going to tell you is you have to manually tell me when to refresh the materialized view. And then when you do that, I will essentially run the equivalent of a select query and then
Starting point is 00:19:37 stash the result in a table. So for you to query. So it's not in any way, it gains you almost nothing compared to repeatedly issuing select queries. The hard part, the technologically hard part is the reuse of previously computed results to efficiently update the materialized view. A good way to think about it is you want to do work proportional to the changes, not proportional to the query load. So if somebody asks a select query and very little has changed, you shouldn't force your database to do a massive quantity of work. Data has changed, but does not affect the result. You want that to essentially be suppressed as early as possible, so a good example of this is if I'm, if I'm summing a bunch
Starting point is 00:20:26 of rows and then somebody added a bunch of zeros, we should quickly detect that and not, not, not, not throw all our results out and recompute everything from scratch. A large amount of analytics workloads that happen in data warehouses today are fundamentally redundant queries where we are mostly recomputing the same answer. So if you have terabytes of data, most of this data is historical, right? Like big data is absolutely real, but it's primarily a phenomenon related to the amount of data we have collected. You don't have big data every second. Well, Google might have, but most organizations today, the amount of data that is coming in second by second is not that voluminous.
Starting point is 00:21:08 But when your queries are fundamentally nonlinear, they're joining a bunch of different things, the database sort of looks at it and goes, well, I don't know what's changed. I kind of have to throw it all out and start over from scratch. And that's fundamentally the paradigm that we want to get away from. That's great. Another question on that, why I would like to have incrementally updated views instance of having something like a caching layer and cast the results of a view? Well, the hard part is deciding when to invalidate your cache, right? So what you get from an incrementally updated materialized view is this logic is handled
Starting point is 00:21:47 correctly, perfectly, without the user having to do anything more than think. One of the cute taglines we use internally is think declaratively, but execute incrementally. So it allows you to still think in terms of what's fundamentally the select query I'm trying to run. And then we think through all the hard parts of what is the data flow query I'm trying to run. And then we think through all the hard parts of what is the data flow that has to happen under the hood, which parts of these are stateful, which are stateless, which ones invalidate cache. If you're building a microservice, you're going to have to reason about all of this yourself, build a microservice, a stateful microservice. And this is hard and you might get it wrong. And if you get it wrong,
Starting point is 00:22:25 it's really subtle to debug. It's difficult. Generally speaking, most people use databases because inventing half a database that you happen to need for this particular use case is a risky thing to do and very hard to validate if you did it correctly. So we also find a solution to one of the hardest problems in computer science, right? When to invalidate the cache. So we also find like a solution to one of the hardest problems in computer science, right? When to invalidate the cache. So that's great. Yeah, exactly.
Starting point is 00:22:50 It's data naming things. Yeah, yeah. All right. So what's the secret sauce? What's the magic? Like what is different in materialized converting? Like what Postgres is doing, which is, I don't know, probably one of the most complex databases ever built.
Starting point is 00:23:09 We built it for like the past 30 years or something, right? So what's new and what is different with materialize? That's an excellent question. Before, I don't want to talk negative about Postgres, I'm actually going to take the flip of the question. It's like, what does Postgres do that we can't do, right? So Postgres is a great OLTP database. In fact, we love it very much in the engineering team at Materialise because Materialise speaks as close as possible wire compatible Postgres. So for an application
Starting point is 00:23:36 that's talking to Materialise, you use Postgres client drivers, you use the Postgres native language bindings and it'll all just work. So we're huge fans of Postgres. Postgres is a great OLTP database. What Postgres does very well that we don't do is transaction isolation and concurrency control. So if you have, say, a unique index or a primary key field and you have two people racing to commit transactions, Postgres will ensure that only one of them succeeds, right? It's great. It's great at this conflict resolution and consistency aspects of the asset properties that you want from a database. What we're very good at is computing these
Starting point is 00:24:17 denormalizations, these complex views, and keeping them incrementally up to date. And we actually work very, very well downstream of Postgres. So one way that some of our users deploy Materialize is they have Materialize essentially acting as a read replica, right? So Materialize connects directly to Postgres, the transactions, all the writes land in Postgres, and then get immediately replicated within a millisecond or a few to Materialize. And then Materialize gets to maintain all these rich analytical indexes that essentially are kept incrementally updated as soon as the data comes in. This way, the writes offload to Postgres and then the complicated reads, essentially it offloads compute from Postgres. Now, how do we actually do this? So under the hood, Materialize is built on this state-of-the-art
Starting point is 00:25:05 stream processing platform called Timely Dataflow. Now, Timely Dataflow was invented or co-invented by my co-founder, Frank McSherry, who has done a lot of stream processing research for, I think, coming up on seven to eight years now. Timely data flow is a fully horizontally scalable stream processing framework on which we've built query planning and data flow planning such that we can take an arbitrary SQL statement or a SQL view definition and convert it down into a persistent data flow that is horizontally scaled out on this timely data flow cluster. We do have some, there are some folks who use timely data flow directly as a stream
Starting point is 00:25:54 processing library. It's an open source project, but most people don't want to do this, right? You don't want to write, so timely data flow is written in Rust. You don't necessarily want to build and write Rust data flows and manually orchestrate them. So we think there's a large market for people who want those benefits of that incrementally updated high performance scale out, blah, blah, blah. for several decades, which is they write and define SQL queries, and these SQL queries just stay alive, and they don't really think about it, and these things just stay alive forever as the data changes. That's very interesting. So how is timely data flow different compared to other solutions out there like Flink, Databricks, and the rest of all the streaming processing platforms that we have seen in the
Starting point is 00:26:47 market until now. That's great. So first off, I'm going to do sort of a bad job answering this, but there's a wonderful research paper called NIAID, a timely data flow system, which won several academic awards that lays the foundational case for timely data flow and how it's novel. There's a few things, not all of which we currently take advantage of in Materialise today, but a good example is timely data flow is capable of reasoning about cyclic data flows, whereas most other data flow models are purely acyclic. It is extremely expressive, almost to a fault. So driving timely data flow around is hard and
Starting point is 00:27:26 something that we take a lot of pains to do correctly at Materialize in the Materialize database layer. It is data parallel across a sharded data flow graph in a way that most other data flow engines are not. So today, most data flow systems, say Flink or Spark streaming, the primary way in which they scale across to use many more compute resources is by taking various operators of the graph and placing them on dedicated CPU resources and flowing data from a data flow node to another data flow node. So if you think about that, so a good way to get intuition for this is let's say you have two sources of data, each of which have some map operation, and then there's a join operation and then there's some subsequent map or map or filter or things like that.
Starting point is 00:28:16 These each things form this graph of computation and each one of those nodes gets their own dedicated compute resource. Timely data flow is sharded in a very, very different model that results in a very, very higher performance, particularly in cases where you have very, very large data flow. So let's say you have a SQL query that has eight different input streams, complex sub queries, things like that. The actual execution graph of this may actually be hundreds of nodes. You as a user may not care. You just want that SQL to be incrementally updated.
Starting point is 00:28:54 Getting that data flow graph to get high performance in some of these other stream processing systems is very, very hard. Whereas with timely data flow, because of the way it scales up and has this shared cooperatively scheduled data flow execution model, makes it far, far more performant. For more details, I would point you to the research paper, because I'm struggling a little bit to convey some of the nuances without the reference to some diagrams and some slides. Yeah, yeah, makes sense, makes sense. I mean, I was aware about the NIAD paper and also the timely data flow model.
Starting point is 00:29:27 But I think it's something that people out there are not the community out there, are not that aware of. So I think the more we can communicate and talk about it, I think the better it is for everyone to start understanding, thinking in new terms, right? Because as you said, timely data flow is like a different paradigm of how you can process.
Starting point is 00:29:49 And whenever we introduce a new paradigm, it takes a lot of repetition from the people who know about it and they evangelize this to help the people out there understand. And actually, it's very interesting because we had an episode pretty recently with CockroachDB. And one of the topics
Starting point is 00:30:09 that we were discussing was how important it is today for the engineers out there to start thinking more into getting some distributed elements from distributed computing and start incorporating them in the way you think as an engineer or as a developer, right? And I think this is one of the values that we, as people here, are sitting together and discussing about interesting technical topics that we can offer to our audience out there, how we can give them some guidance of, yeah, you know something, there's a different way that data can be processed out there.
Starting point is 00:30:44 Maybe you should start also trying to think into this or yeah you might be like a web developer or like a front-end developer but still if you start thinking and using some of like the patterns that come from like distributed systems probably it can help you with your work and also can help you work much better with the back ends that probably are distributed behind the scenes. So that's why I find it always very, very valuable to discuss a little bit more technical details. Absolutely. I strongly agree. I think it's very important for developers building and using systems like this to understand
Starting point is 00:31:21 and appreciate what the right principles are. One, so they can choose the right technologies to work with or the appropriate technologies for the problems that they're solving. But one of the things we maybe struggle with, and I appreciate you pushing a little bit on this, is to what extent should we encapsulate and hide the complexity versus unwrap and show the complexity? So one of the big advantages of Materialize is you don't have to know, you just write SQL, but there's a sort of inherent tension where,
Starting point is 00:31:48 you know, actually, A, everyone is interested and definitely wants to know, and B, maybe understanding will get you the right intuitions for what computations you can even execute and how to go about choosing the right architecture to build which systems you can incorporate and not incorporate in your architecture.
Starting point is 00:32:07 Absolutely. I totally agree. So, Arjun, you mentioned that by incorporating this new timely data flow processing model, Materialize achieves to be very performant compared to the rest of the solutions out there for streaming processing. What kind of resources someone who wants to start using it today should consider about setting up the open source version of Materialize?
Starting point is 00:32:33 So we aim to make Materialize very simple to use. So you go to our website or our GitHub, you click the download button, and you can run this on a single node. You can scale up this node to, to, to handle. In fact, in fact, if you, if you get the, the large, the larger sized, uh, VM and you run materialize on it, you can ingest millions of messages, a million messages a second. You can, you can install dozens of views and so on before even needing to consider
Starting point is 00:33:02 whether you need a multi-machine setup as part of making it easy to graduate beyond this. In fact, you know, you will be very productive on a single node database. We really go to great lengths to make it as easy to use as a database, right? So you run it on a single node, you connect to it using a SQL shell or a SQL driver in your language. The lived experience is very much like Postgres, right? Like this is how most people run Postgres is they run brew install Postgres or app get
Starting point is 00:33:29 install Postgres and they run it and then it's living in a VM by itself in a cloud for years of uptime. So that's really the easiest way to get started. We are building a cloud service, which we are launching publicly next month, which allows folks to get even more advanced features. So some of the features that we will be shipping in our cloud product is horizontal scalability, where you have these very, very large data volumes, well north of a million messages per second, for instance. And you do need multiple machines in a horizontally scaled setup to absorb that data volume.
Starting point is 00:34:04 And then two for having replication, right? So if you have extremely high availability needs, you're going to want multiple servers set up in an automatic failover capacity. And that's something that our cloud product will, not next month, but down the road, also support. That's great. And I'm very excited to hear that you are launching a cloud version of the product.
Starting point is 00:34:26 And I want to ask you more about this. But before we go there, because we are going to spend some time on it, I have a question that I don't want to forget to ask. And that's about, you mentioned at some point that Timely Dataflow is implemented in Rust. So how did you decide to use Rust? What's the reason behind that? I think the original reason was Frank, when he started coding Timely Dataflow, he had recently left. He had just left Microsoft Research
Starting point is 00:34:56 and he had been coding for a while in the sort of.NET ecosystem. He wanted to try something new and Rust was a beta programming language at the time, a very risky thing, but he was just playing around. I think a lot of these open source projects, they start that way. So timely data flow was coded in Rust. Now I think for highly data intensive applications, the best choices are Rust or C++ because the manual memory management and control is quite important for predictable low latency experience. I think there are some places that have gotten good
Starting point is 00:35:35 at writing in Go. Go is a garbage collected language and not manual memory management. So I had some experience because I was suffering from a cockroach. Cockroach DB is written in Go. We struggled with it a little bit. I don't think it's impossible. I think you can definitely,
Starting point is 00:35:52 with enough sweat and effort, essentially drive the garbage collector around to do the kinds of things that you would have wanted to do in a manually managed environment. There's pluses and minuses. Rust, we doubled down on Rust when we built Materialize
Starting point is 00:36:05 because one of the things we could have done is we could have left timely data flow as a Rust underlying engine layer, and then built the Materialize database management layer in a different language. And when we looked at that design decision, we thought about it a little bit and we came to the conclusion that Rust was actually pretty great. And we were quite happy to build it on Rust at all layers of the stack. So Materialize is 100% written in Rust. And we're quite happy with that. I mean, I'm happy to go into like, more detail as to our experience building in Rust and maybe contrasting a little bit to the Cockroach experience in Go as well. Yeah, that's very interesting. And I'm asking you because Rust is
Starting point is 00:36:41 like a pretty young language language but it's gaining a lot of traction lately and it's a very interesting language also like from a let's say research perspective in terms of like what kind of primitives they've added there in order to do like this kind of memory management it's very interesting it's of course like very interesting to see that it starts to be used for systems out there that get in production and in products that are delivered out there so that's production and in products that are delivered out there. So that's why I was very interested to hear your opinion about Rust. And something that it's about Rust again, but from the perspective of being a founder and building teams, right? So
Starting point is 00:37:16 how easy it is today to find developers out there that can write in Rust or who are willing to write in Rust? Right. So we don't expect our engineers to know Rust when they join, although many of them do, certainly not all. We find that it takes a reasonable amount of time on the order of a few months to get productive in Rust. This is probably the biggest cost that we pay as an organization for building a product in Rust is there is a bit of a ramp up time that we have to pay, but that's fine. It is not difficult to find people who want to work in Rust. In fact, I would say it's a significant attraction to several engineers who maybe if they've written
Starting point is 00:37:56 C++ code and they've lost so many weeks of their life to chasing down some memory leak or some manual memory management bug and they want to move to a language or an environment where they get the benefits of manual memory management, the performance, and they also don't have to deal with that class of bugs. So we find quite a few people are very excited to work in Rust, although we do have to take some time to let them ramp up. And what is the reason that it takes a couple of months to start being productive in Rust? And that's probably also the...
Starting point is 00:38:31 Sorry for interrupting you, but I think this is one probably of the main contrast with Go, because one of the benefits that I hear, at least from engineering months, about Go is that it doesn't take that much time to be productive in Go. But why Rust has that? It takes five to six months to get productive. I wouldn't go so far as to say five to six.
Starting point is 00:38:54 I think it's more like two to three months, assuming we have an experienced software engineer who has been building, which is the backend or distributed systems, which pretty much all the engineers that we hire fit that mold. The primary difficulty, and by the way, having worked in Go and at Cockroach Labs, most people can be productive in Go in under one week. It's a truly incredibly concise language
Starting point is 00:39:18 to get productive in. It's sort of, I would almost say, optimized for productivity. The primary difficulty with Rust is that it is, most folks have a little bit of an adversarial engagement with the compiler. It can be a little bit frustrating to essentially what you are doing when you're writing a Rust program is you are giving it sufficient type annotations that it is able to prove that certain classes of memory bugs are provably absent. So it's a little bit of you are guiding a not very smart computer because
Starting point is 00:39:53 it's not a human to follow a proof. And there's a little bit of it's too dumb to see that the code you've written does not have a memory leak. This is often called fighting the borrow checker. So the borrow checker is a part of the compiler that yells at you. And there's this standard failure mode of like fighting the borrow checker for a while until you fully internalize the limited ways in which the borrow checker thinks. And then you know, oh, this is where I should probably add this annotation or do this thing or use this pattern in order to do the compilation step. The other thing I didn't mention is, and this is a place where I say, given the novelty of Rust, this is a negative. There's just not that much libraries and pre-existing tools that you can draw off a rich sort of open source ecosystem.
Starting point is 00:40:44 It's very different from Go. And Go, like pretty much if you're looking for some compatibility to some driver or some library or some parsing library or some security thing, like it's a very rich, mature ecosystem compared to Rust where oftentimes we've had to, there's at least a couple instances I can think of where we had to write a library from scratch, whereas if we were writing in Go, we would have used an off-the-shelf one. Yeah, makes sense.
Starting point is 00:41:11 Although from my limited experience with Rust, I have to say that Cargo is a very nice experience for package management. So yeah, there's always trade. Docents, it's a young language, right? It takes time for the community to build everything there. But with Attraction, I think it will catch up pretty fast. For sure.
Starting point is 00:41:30 And also some of these things that I'm saying, they're not going to be downsides for people coming after because there will be more software engineers who are already fluent in Rust. And hopefully we are a contributor as well, adding some of these libraries that we've open sourced and other people as well. So a year from now, it'll be even easier.
Starting point is 00:41:45 So these are just growing pains. Yep, absolutely. You'd asked about how to get started with Materialize. And I just wanted to jump in really quickly because we talked about, obviously, the open source offering and then super exciting that you're launching cloud. Arjun, one quick question. I'm just thinking about our audience here. What use case would you encourage them to start with? within a matter of a couple of days and really validate that the technology is capable of taking arbitrary SQL that you have, business logic in your organization, and move it to real time. And then that's a position from which
Starting point is 00:42:33 we can think through the more complex things like actioning or integrating this into a pipeline that sort of is part of an application experience. But getting this value in as short time as possible is what I would encourage folks. And that pretty much means some pre-existing business logic or a pre-existing DBT model. Since Materialize has a DBT plugin,
Starting point is 00:42:56 you should be able to take your pre-existing DBT model and make it work on Materialize with ideally in a single day. Oh, very cool. Wow, I mean, that's extremely fast time to value. And then just one more quick tactical question for our listeners. Is there, just go to materialize.com to get notified about the launch of the cloud product? Yes, that will be front and center on our homepage. And in the meanwhile, you can download the source available free product from there as well.
Starting point is 00:43:26 Sure. Great. Okay. Sorry, Costas. I know we're close to time. We have another. But I just I just I constantly think about our listeners and I love learning about new technologies. And I just want I just want them to get the fastest way to understand how I can get in and kick the tires on it. Yeah, absolutely. And it was very good that you asked these questions, Eric, because it's time to spend a little more time on the cloud version of Materialize. for a product or like a framework, like materialize, like things that you expected beforehand and didn't happen and things you didn't expect, but they happened.
Starting point is 00:44:13 Like anything interesting that you can share with us about this process of turning this amazing piece of technology into a cloud offering. Absolutely. The first one I would say is the biggest reason why we're building a cloud product is
Starting point is 00:44:27 by far, we talk to our users, we talk to prospective users, we talk to basically everyone in the industry, there's a wide consensus that everyone wants to use a managed cloud offering of pretty much all of the technologies that they use because running and upgrading and manually maintaining these things is not something that most people are interested in doing, particularly as things get more and let me put it this way, you much rather have somebody else carrying a pager than you carry a pager. The more mission critical this gets, the less you want to be in charge of carrying
Starting point is 00:45:02 that pager when that system might go down. In terms of building a cloud service, one of the things that's very exciting, and this is particularly true for companies like ours, where we're building this from day one, knowing this, that the cloud product is the predominant way in which we are going to be successful as a business, is you get to think in terms of atomic components that are cloud native. A very, very good example of this is separating storage from compute. So storage in this infinitely scalable, extremely low cost service, namely S3 or the S3 equivalents on the other major clouds, is available and extremely high durability, extremely strong guarantees that you get from these services is a building block that you can build, say, a database around that means that there's an entire class of problem
Starting point is 00:46:00 that you don't have to engineer for, namely data loss or data corruption or replication or things like that. You can rely on this atomic unit of an S3 bucket being the principal storage layer for the vast majority of your data. And what this means is, of course, you get to use your engineering budget instead of solving the same problem that everyone has had to solve pre-cloud. You get to use this to solve new problems. Another one that you get is the ability to other services that are cloud native, save for other components. So a good example of this is going back to Postgres, materialized cloud uses highly available
Starting point is 00:46:38 Postgres nodes under the hood for certain classes of metadata and things like that. Whereas otherwise, if we were building a fully on-premise piece of software, getting this highly available would be a long engineering challenge. At the same time, we love users who just want to use the source available product or they want to use it and deploy it in their own premises. The key distinction I would make is we've designed materialized cloud such that the best place to get the highest number of nines of availability is materialized cloud. So things like active, active replication, automatic failover, load balancing, these are built using cloud native services and owned and operated by us as part of materialized cloud that are not part of the downloadable on-premise offering.
Starting point is 00:47:29 And that's because fundamentally these things are designed using cloud services that they're not portable, right? Like you can't take, you don't have S3 on your laptop. You can, and yes, you can emulate it for testing, but that's not how you would run a production service. Absolutely, absolutely. Operating a software and building a software are two different things. So I have a question about the cloud offering compared to the experience that you described
Starting point is 00:47:56 about Materialize from the beginning. And it has to do with latency, right? You said that Materialize is a system that you can expect a single number of digit latency when it comes to the queries that you execute and the updates that you have. My intuition says that in order to achieve that, if I'm consuming data on Materialize from a database system that I have, I have to have my materialized nodes as close as possible to my database. How can I do that when I use the cloud offering? So the first point I'd make is you're absolutely correct. You want this to be very close to your database. But the other thing I'll observe is most of the databases are in the cloud. So if you want to be close to the databases, you have to be in a
Starting point is 00:48:42 cloud instance by definition to be close to the databases that are in the cloud. The important part of this is co-locating them as closely as possible. And it usually would come down to region, availability zone, co-location, and things like that. You almost certainly don't want to move this data across clouds, right? So our cloud service is launching next month on AWS, but eventually we want to fast follow to Azure and Google Cloud as well, because if your database is in one of these other clouds and you will have too much latency going between two clouds.
Starting point is 00:49:18 The other thing I would say is the clouds have gotten, the hyperscaler, the three cloud companies have gotten very good at laying extremely high bandwidth, low latency network connections. So as long as you're in the same region and spinning up your materialized instance in a VM that is in the same region and perhaps even the same availability zone as your database, they've done a very good job making sure that those actual packets that are going across this virtual network will go over a fairly small physical distance. That's great. One last question from me, Arjen, and then I'll give it to Eric so we can also conclude this episode. You mentioned about co-location and all that stuff, and you mentioned
Starting point is 00:50:02 also about S3. So for the people out there who are interested in using the cloud version of Materialize when it's launched, is this going to be on one cloud provider like AWS? Yes. Next month, we're rolling it out on AWS, and then a few quarters later, we will be loading it out on other clouds. Okay. So people can expect that in the next couple of months, if they are a GCP shop, the materialized will also be available there, Azure and at least the major cloud providers are there. I can't commit to a specific timeline, but one thing I will say is that there always is the option of running Materialize in a VM, the downloadable source available product,
Starting point is 00:50:46 in a VM in an Azure region or data center. That's great. I think we need to have at least another episode because I have more questions to make. But I have completely monopolized this conversation and I need to give at least some time to Eric. This has been really fun. I really appreciate the questions, Costas.
Starting point is 00:51:07 Thank you. Yeah, it's great. I think we're close to the buzzer, but we've talked about Materialize a lot just as a team and Costas and I, because we love discovering new technologies and it really is a true joy just to get to talk with you and just hear about the inner workings in many ways.
Starting point is 00:51:25 And I hope this has been a really fun conversation for our listeners. Arjun, this has been such a wonderful conversation. We'll definitely have to have you back on. And congrats on the cloud launch. That's going to be great. Encourage all of our listeners to go to materialize.com and check it out. And we'll have you back on the show maybe in another six months or so after the cloud products been live and hear how it's going. I would love to do that. This is an
Starting point is 00:51:51 absolute pleasure of a conversation. Thank you both. Thank you, Eric. Thank you, Costas. This is a wonderful show you have over here. Well, Costas, I think one of the big takeaways I have, and this won't be my takeaway from the content of the show, is that you and Arjun are incredibly intelligent when it comes to very deep concepts around databases and languages that you use to build technologies. And so it was a real joy for me to hear two very intelligent people reason around some of the decisions that they're making. I think the big takeaway actually relates to my big question on the front end. Analytics is a really obvious use case, but all the other interesting things you can do
Starting point is 00:52:36 when you enable real-time, I think are just going to open up a lot of really creative solutions to problems that are low-level plumbing problems in the stack currently. And that's very exciting. I mean, coming from a marketing background, I think about enriched profiles and automation and other things like that. And the ability to have this stuff in real time from a database, I think it will actually be a very big driver of creativity in the way that people are building experiences. Absolutely. You're absolutely right. I mean, the closer you get to real time, the more use cases you open. And I think we are just at the beginning of like seeing what people can come up with
Starting point is 00:53:20 technologies like materialize. And I'm pretty sure that like, if we talk again with Arjun, like in six months from now, he will probably have even more use cases to share with us. So yeah, absolutely. Materialize is a new technology, a new paradigm. There's many new, let's say, patterns that we have to learn and understand from there and experiment with. It might take some time for people to figure out
Starting point is 00:53:44 how to use it, but my feeling is that we are going to see very exciting things coming from this technology. I have to say though that Arjen is also an amazing, amazing speaker. He was amazing in explaining really complex concepts, so I really enjoyed the conversation. I was really happy to hear about all the technology that they are using to make materialized products. And I'm also very excited to see what's going to happen with the cloud version of the product. It's also very exciting for me to hear that regardless of like the technology that someone is building, how this technology is delivered and is used, it's very important. And cloud is probably the best delivery model
Starting point is 00:54:31 that we have at this point for this kind of product. So yeah, hopefully in a couple of months from now, we'll chat again with him and learn even more. Yeah, absolutely. I think as I reflect on the conversation, a lot of really paradigm-shifting technologies take something extremely complex and make the experience very simple. And there are lots of examples of that, but being non-technical, but working with you closely enough to understand when you talk about anything real-time related to a database, from a technical perspective, that's an extremely complex problem to solve. And I think if materialize can simplify that, I mean,
Starting point is 00:55:11 that's pretty paradigm shifting. So it'll be really fun. And I think if they can accomplish that, that'll, that'll be huge. Awesome. Well, thank you for joining us on the show. Lots of really good episodes coming up this fall. We're actually about to wrap up season two. So you'll see that wrap up coming up in the next couple of weeks. And then we have a great lineup for season three. And until then, we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:55:53 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
