The Data Stack Show - 46: A New Paradigm in Stream Processing with Arjun Narayan of Materialize
Episode Date: July 28, 2021

Highlights from this week's episode include:

Introducing Arjun and how he fell in love with databases (2:51)
Looking at what Materialize brings to the stack (5:28)
Analytics starts with a human in the loop and comes into its own when analysts get themselves out and automate it (15:46)
Using Materialize instead of the materialized view from another tool (18:44)
Comparing Postgres and Materialize and looking at what's under the hood of Materialize (23:16)
Making Materialize simple to use (32:33)
Why Materialize doubled down on writing 100% in Rust (35:43)
The best use case to start with (42:03)
Lessons learned from making Materialize a cloud offering (44:22)
Keeping databases close in the cloud for low latency (48:31)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the show. Today, we get to talk with the founder of a company building a database
product. The company is called Materialize, and Arjun is the founder of the company. And I'm super
interested to talk to him. As I think about our audience, Costas, the biggest question that comes to mind is: what are the immediate use cases for a tool like Materialize, which at a foundational level can take jobs with data that are generally considered batch, that happen over a long period of time with a lot of latency, and essentially turn them into real-time jobs?
Analytics is absolutely a use case that I think makes a ton of sense, but I'm sure that people
are doing all sorts of other interesting things. So that's going to be my big question is,
as far as use cases, analytics is obvious, but what else can you do when you go from batch to
real-time in the context of a database? Costas, you love Materialize.
I cannot wait to hear what your burning questions are. Yeah, yeah. I mean, okay, first of all,
what you have in your mind, I think it's a great question. Materialize is a very, let's say,
novel way of interacting with data and consuming data. So it's very interesting to see what people
are doing with it. So absolutely, I'm really looking forward to hear about the use cases.
I have a lot of questions myself, to be honest.
I don't know how much we'll manage to cover today.
Most of them are going to be technical.
I want to learn more about the technology,
like the secret sauce, let's say, behind Materialize as a database.
And also, apart from technology, it's also a very interesting product.
Like the ergonomics that this database has is very, very interesting.
So I have quite a few different questions that will help us understand better the
technology behind it and also some choices that the team has made in
building this new database system.
Okay. Well, let's jump in and talk with Arjun.
Let's do it.
Arjun, welcome to the Data Stack Show.
We are very excited to talk with you because there's just so many data topics
that we could cover in this conversation,
and we probably won't have time to get through all of them, but welcome.
Thank you very much. I'm excited to be on the show.
Let's just start like we always do with, we'd love to know your background.
I'm Arjun Narayan. I'm the co-founder and CEO of Materialize. Materialize is a streaming database
for real-time applications and analytics. It allows you to get extremely complicated and complex analytics answers
in real time on top of streams of data as opposed to once a day on top of batch data. It looks and
feels exactly like a SQL database. I started Materialize a little over two and a half years
ago. Before that, I worked in a different field of databases. I was
a software engineer at Cockroach Labs working on CockroachDB, which is an OLTP scale out,
horizontally scalable database. And before that, I did a PhD in distributed systems and big data
processing. I've sort of lived, breathed, and been in data for a while, and a little bit by accident.
I didn't intend to fall in love with databases, but as I learned more and more about how they
power most of our applications and the experiences we have with computers, they just became
endlessly fascinating to me.
And I've spent a decade looking at databases at this point.
I love that.
With a PhD in anything related to databases, I would think that you have a lot of technical
acumen.
But, and I love the sentence, I didn't mean to fall in love with databases.
I feel like that's the beginning of a novel that may have a very specific readership.
Okay. Materialize, super interesting.
I think a lot of our audience is very familiar with working sort of in and around your traditional
database data warehouse, right?
So Postgres, the usual suspects when it comes to data warehouses, you have Redshift, BigQuery,
Snowflake is obviously taking over the market. And there are really common paradigms within that,
you can run SQL, you can create views, et cetera. The syntax and stuff is a little bit
different depending on the warehouse. But for our average listener who maybe, let's just take an
example, they are a data engineer.
They do a lot of work getting data into Snowflake.
They create views.
They create different use cases for analytics teams, et cetera.
For that person who may not be familiar with Materialize, could you just paint a picture
of if you introduce Materialize into the stack, what does that look like? And what are the key
benefits that it brings? That's a great question. I think it helps to break down a standard paradigm
of where most databases fit in, in the traditional worldview. And then we'll introduce how Materialize
sort of brings some new capability that's different from what's currently in the market.
So databases, and this is going
back, say, several decades at this point, traditionally fall into two large buckets.
There's the transactional databases and the analytics databases. So transactional databases
are your Oracle, your Postgres, your MySQL. They're generally speaking focused on processing lots of transactions that may potentially be conflicting.
They're sort of the point that decides what events are allowed to happen. So they reject
some transactions, they accept some other ones, and then they're very good at writing those
transactions down. So they're very focused on avoiding data losses. It's something you really,
really want from your transactional
database. Then you have your analytics databases, like your BigQuery, your Redshift, your Snowflake.
Your analytics databases are more focused on enabling far more powerful compute. Typically,
in SQL databases, people use SQL in both settings, in the transactional setting and the analytical setting.
But if you take some of these complex queries, say it's joining eight tables together where at least some of these tables are very, very large.
Those queries, if you ran them on a transactional database, the transactional database would A, most likely fall apart.
And B, if it didn't fall apart, it would probably greatly slow down your other concurrent transactions.
So there's a reason people mostly separate these systems.
If an analyst types some large analytical query about last quarter sales, you don't
want all your cart checkouts to triple in latency, right?
So it makes sense.
It makes perfect architectural sense to separate these concerns and then also build separate
systems that are optimized for these different classes of workloads.
The big, big thing that most people give up today is your analytics query runs on a dump of the data that is somewhat stale.
So this is feeding your batch data warehouse with a once a day ETL. I mean, this is really what ETL is. ETL, extract
transform load, is about getting data out of the transactional system and putting it in the
analytics system. It's getting less painful, but it used to be an extremely painful process. You
would run it overnight, once a day. Some folks are now running this multiple times a day,
but it is still fundamentally a batch operation, which means there's a large class of analytics or analytical-style queries
that are incredibly valuable to have in real time, which don't make sense around a transactional
database, but existing analytical databases or data warehouses are not equipped to do because
they're fundamentally built in this batch paradigm. Materialize flips
the setting a little bit, which is instead of computing your answer off of a data set from
scratch when the query is presented to you, it pre-materializes some set of questions that you've
pre-registered with Materialize. And this is why the company has
been named Materialize. You might be familiar with the term materialized views:
the entire point of a materialized view is you tell the database, hey, I'm interested in asking
this question on a repeated basis. Can you please pre-compute it for me as the data changes?
In the past, materialized view support in most databases has been highly
restricted, right? So you can do it for fairly simple queries, but if the query gets fairly
complex, the database really wants you to ask it and then it'll go ahead and do the work rather
than doing a whole bunch of redundant work that has to be immediately thrown away the moment the
data changes. So under the hood, Materialize is an incremental query processor.
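To make that concrete: since Materialize looks and feels like a SQL database and speaks the Postgres wire protocol, "pre-registering a question" can be sketched roughly as in the Rust snippet below, which uses the community postgres crate. This is only an illustration: the connection string, port, and the orders and customers tables are hypothetical placeholders, not a prescribed setup.

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    // Materialize presents a Postgres-compatible interface, so an ordinary
    // Postgres client can connect. Host, port, and user are placeholders.
    let mut client = Client::connect("host=localhost port=6875 user=materialize", NoTls)?;

    // "Pre-register the question": define the view once, and the database
    // keeps the answer incrementally up to date as the underlying data changes.
    client.batch_execute(
        "CREATE MATERIALIZED VIEW orders_by_customer AS
         SELECT c.name, count(*) AS order_count
         FROM orders o
         JOIN customers c ON o.customer_id = c.id
         GROUP BY c.name",
    )?;

    // Reads return the maintained result instead of recomputing it from scratch.
    for row in client.query("SELECT name, order_count FROM orders_by_customer", &[])? {
        let name: String = row.get(0);
        let count: i64 = row.get(1);
        println!("{name}: {count}");
    }
    Ok(())
}
```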
And we can talk a little bit more about the technology because this is a thing,
I don't think I'm describing anything that people haven't wanted for a very long time.
The unique thing that we bring is a novel set of underlying research and technologies that allow
this to happen in an elegant fashion. But Materialize allows you to ask
these complex analytical queries on a sub-second, say a single-digit millisecond latency, even when
these queries are very, very complex. This is more than just about taking some analytics query that
you've asked once a day and making it a dashboard. Now, absolutely, a lot of our users start by taking something that they computed once a day, some very valuable
metric, and making that into a dashboard so they could see it on a more real-time basis,
especially in, say, the financial and the trading use cases. They can never have things fast enough.
But the more interesting thing happens when you start to take this live, changing data
and take automated actions off of it.
So you could think alerting, you could think personalization in an application as you get
real-time data, as opposed to realizing that somebody was a customer that should be segmented
a certain way and then doing an email
marketing campaign the next day by the time your OLAP job finished. There's a wide variety of
uses where you can take action while a user is on your website or while a transaction is still
pending, before it has been authorized, if it's a card transaction. It's much more valuable to make a precise judgment as to the
quality of that user or that transaction within, say, a 10 to 100 millisecond budget versus doing
that overnight and reacting to it the next day. Absolutely. I mean, this is fascinating. And
we've had several conversations with different businesses where this is where they're heading with their architecture.
And e-commerce comes to mind just because it's a situation where you have a lot of data.
A lot of it needs to be enriched or combined with other data.
So data from transactions or ML models and all of that's happening in some sort of database.
And the challenge has been we're creating all this value of the data that we have.
And it's very difficult to deliver that with speed, right?
And in e-commerce, if you want to send a personalized coupon right after purchase or something like that, that needs to happen very quickly.
But the latency has
been really high just due to technology. But that's changing. And that's really, really exciting.
So super, super interesting. Absolutely. One of the things that we see is the number of folks who
are putting in the capabilities, and we're very much in the early stages of this architectural
transformation, because folks are pretty much just
putting in place the streaming infrastructure to move the data at low latencies and at high
volumes. So this is doing change data capture out of their transactional databases on an ongoing
basis so that milliseconds after a transaction commits in Postgres or MySQL, it is present in
a Kafka topic that can be used for
these downstream consumers or downstream applications. And the early adopters have
gone ahead and built these manual microservices, right? So the absolute earliest adopters have
adopted this microservice pattern, which comes at a huge cost, right? So not to mention just the
development cost of building these manual microservices,
but the ongoing maintenance and upkeep costs
that these microservices introduce
when you want to just say,
change a little bit of business logic, right?
So changing business logic sometimes takes a full quarter
because you have to shut down or upgrade these microservices
in a controlled fashion.
And perhaps something that would be very simple in a database, like joining against another stream,
ends up introducing a massive amount of architectural shift,
as you now have to build and manually maintain an extra set of states that is introduced by adding on that third topic.
So these are the sort of costs that people currently pay that we want to reduce.
So we think that building these streaming microservices, streaming applications right
on top of the stream should be as easy as building a CRUD app using a MySQL database.
Today, it's not, but with Materialize, it is.
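For a sense of what the hand-rolled alternative looks like, here is a deliberately simplified sketch of the kind of stateful microservice being described, written in Rust against the rdkafka crate. The broker address, group id, topic names, and the idea of keying both topics by customer id are all assumptions made for illustration; real services also need snapshotting, recovery, and deployment logic that is omitted here.

```rust
use std::collections::HashMap;
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::Message;

fn main() {
    // Broker address, group id, and topic names are placeholders.
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "order-enricher")
        .create()
        .expect("consumer creation failed");

    consumer
        .subscribe(&["customers", "orders"])
        .expect("could not subscribe");

    // The manually maintained join state: customer key -> latest customer payload.
    // Adding a third topic means another map like this, plus the code to keep it current.
    let mut customers: HashMap<Vec<u8>, Vec<u8>> = HashMap::new();

    loop {
        let Some(Ok(msg)) = consumer.poll(Duration::from_millis(100)) else {
            continue;
        };
        match msg.topic() {
            "customers" => {
                // Keep the state current; a restart loses it unless you also
                // build snapshotting and recovery yourself.
                if let (Some(key), Some(payload)) = (msg.key(), msg.payload()) {
                    customers.insert(key.to_vec(), payload.to_vec());
                }
            }
            "orders" => {
                // Join each order against the customer state and act on it
                // (alert, enrich, write to another topic, and so on).
                if let Some(customer) = msg.key().and_then(|k| customers.get(k)) {
                    println!("order for a known customer ({} bytes of state)", customer.len());
                }
            }
            _ => {}
        }
    }
}
```

The equivalent in Materialize would be a single SQL view joining the two streams, which is the contrast being drawn here.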
Yeah.
Well, I want to dig into some of the technical details because
there are a lot of questions that Kostas and I talked about. But before we get there,
you mentioned something around moving just beyond the basic analytics use case. And that's something
I want to talk about briefly. People use the term digital transformation, which is a buzzword, but on the spectrum of digital transformation, you have companies who have figured out the analytics thing, and they're relying on technology
that is doing the batch, you know, relying on the batch load paradigm, maybe with outdated tech.
What are you seeing?
Or, I mean, there are a lot of companies who I think could just benefit from the analytics
use case in and of itself.
But the real, the use cases that really move the needle are the ones where you're actually
delivering personalization or other really dynamic customer
experiences. But I'd just love to know what you're seeing as you talk with your customers and
people who are interested in adopting something like Materialize, what's the balance? Are a lot
of companies still trying to figure out the analytics use case, or are there more companies
than we think who are actually doing some really interesting things around the customer experience?
That's an excellent question. To me, a large part of this comes from where your analytics team is.
One of the amazing things that has been happening in the industry is analytics teams have become progressively more empowered
to do more and more and create more value for their organizations and now are
starting to get into building these applications, or building part of something that ends up being
surfaced in the core application. The way I think about this is analytics pretty much starts with a
human in the loop, and then analytics really comes into its own once the analysts
themselves are trying to figure out how to get themselves out of the loop, right? And how to make these things automated. So I think a lot of the
analytics journey to real time and streaming begins with augmenting the human capability by
giving them a more live view, but where it truly comes into its own is when we start doing automated
actions directly off that analytics pipeline.
There's a huge benefit to everyone in the organization, whether it's the application
or the analyst, speaking a common language in terms of defining the metrics
that they've been thinking about in the exact same way. DBT is, of course, absolutely the leader in creating an ecosystem where an entire company's
or an organization's data is modeled using a single unified paradigm. And starting from the
analyst and then going towards the application, I think, is the correct way to do things. I
absolutely encourage most folks to take their first steps by moving,
say, a once a day refreshed dashboard into real time because, A, it's an enabler of a lot more
things, and B, it's a good way to ensure that the real-time, in-application
experiences are fundamentally based on the exact same vocabulary that is already part
of the analytical organization.
Arjun, this is great.
Actually, before I start asking my questions, I have to tell you, I really enjoyed your
introduction.
I think it was one of the best descriptions of the difference between the two database
paradigms that we have, which is a pretty common question. Many people ask why we need to have an analytics database
and a transactional database.
But that was amazing.
If you haven't written a blog post or something about that,
please go and do it.
I think many people are going to thank you for it.
But I have a couple of more technical questions that I want to ask.
Let's start with materialization.
You mentioned that you also chose the name because of the concept of materialized views.
Why would someone use Materialize and not just keep using the materialized views that a transactional
database, for example, offers?
Excellent. Well, thank you so much, Costas. I appreciate it. I should write a blog post.
This is a great question in terms of why not just use the materialized view
in, say, Postgres or MySQL? Well, the first answer is if your materialized view becomes
the slightest bit complicated, you'll lose the ability to incrementally update it.
So it's really about what is the update strategy for this materialized view? Because
for a complex materialized view, let's say you're joining four tables together,
you have some subquery in there, you have some non-trivial aggregation, maybe some max and some
group by or something of that sort. The first thing an OLTP or even an OLAP
database is going to tell you is you have to manually tell me when to refresh the materialized
view. And then when you do that, I will essentially run the equivalent of a select query and then
stash the result in a table for you to query. So it gains you almost
nothing compared to
repeatedly issuing select queries. The hard part, the technologically hard part, is the reuse of
previously computed results to efficiently update the materialized view. A good way to think about
it is you want to do work proportional to the changes, not proportional to the query load.
So if somebody asks a select query and very little has changed, you shouldn't force your
database to do a massive quantity of work. If data has changed but does not affect the result, you
want that to essentially be suppressed as early as possible. So a good example of this is if I'm summing a bunch
of rows and then somebody added a bunch of zeros, we should quickly detect that and
not throw all our results out and recompute everything from scratch. A large amount of
analytics workloads that happen in data warehouses today are fundamentally redundant queries where
we are mostly recomputing the same answer. So if you
have terabytes of data, most of this data is historical, right? Like big data is absolutely
real, but it's primarily a phenomenon related to the amount of data we have collected. You don't
have big data every second. Well, Google might have, but most organizations today, the amount of data that is coming in
second by second is not that voluminous.
But when your queries are fundamentally nonlinear, they're joining a bunch of different things,
the database sort of looks at it and goes, well, I don't know what's changed.
I kind of have to throw it all out and start over from scratch.
And that's fundamentally the paradigm that we want to get away from.
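The intuition about doing work proportional to the changes can be illustrated with a toy example. The sketch below is not how Materialize is implemented internally; it only shows the difference between maintaining SUM by applying deltas versus rescanning everything from scratch.

```rust
// Toy incremental maintenance of SUM(x): work is proportional to the deltas
// that arrive, not to the size of the underlying data.
struct IncrementalSum {
    total: i64,
}

impl IncrementalSum {
    fn new() -> Self {
        IncrementalSum { total: 0 }
    }

    // An inserted row contributes +value, a deleted row contributes -value.
    fn apply_delta(&mut self, delta: i64) {
        // A delta of zero (for example, somebody inserted a bunch of zeros)
        // cannot change the result, so it is suppressed as early as possible.
        if delta != 0 {
            self.total += delta;
        }
    }

    fn read(&self) -> i64 {
        self.total
    }
}

fn main() {
    let mut sum = IncrementalSum::new();
    for delta in [5, 0, 0, -2, 7] {
        sum.apply_delta(delta);
    }
    // Five updates arrived, but only the three non-zero ones caused any work,
    // and no update required rescanning the historical data.
    println!("current sum: {}", sum.read());
}
```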
That's great. Another question on that: why would I want to have incrementally updated
views instead of having something like a caching layer and caching the results of a view?
Well, the hard part is deciding when to invalidate your cache, right? So what you get from an
incrementally updated materialized view is this logic is handled
correctly, perfectly, without the user having to do anything more than think. One of the cute
taglines we use internally is think declaratively, but execute incrementally. So it allows you to
still think in terms of what's fundamentally the select query I'm trying to run. And then we think through
all the hard parts of what is the data flow that has to happen under the hood, which parts of these
are stateful, which are stateless, which ones invalidate cache. If you're building a microservice,
you're going to have to reason about all of this yourself, build a microservice, a stateful
microservice. And this is hard and you might get it wrong. And if you get it wrong,
it's really subtle to debug. It's difficult. Generally speaking, most people use databases
because inventing half a database that you happen to need for this particular use case is a risky
thing to do and very hard to validate if you did it correctly. So we also find a solution to one of
the hardest problems in computer science, right?
When to invalidate the cache.
So that's great.
Yeah, exactly.
That, and naming things.
Yeah, yeah.
All right.
So what's the secret sauce?
What's the magic?
Like, what is different in Materialize compared to,
like, what Postgres is doing, which is, I don't know, probably
one of the most complex databases ever built.
It's been built over the past 30 years or something, right?
So what's new and what is different with Materialize?
That's an excellent question.
I don't want to talk negatively about Postgres, so I'm actually going to take the flip
side of the question.
It's like, what does Postgres do that we can't do, right?
So Postgres is a great OLTP database. In fact, we love it very much in the engineering team at Materialize,
because Materialize speaks, as close as possible, wire-compatible Postgres. So for an application
that's talking to Materialize, you use Postgres client drivers, you use the Postgres native
language bindings, and it'll all just work. So we're huge fans of Postgres.
Postgres is a great OLTP database.
What Postgres does very well that we don't do is transaction isolation and concurrency
control.
So if you have, say, a unique index or a primary key field and you have two people racing to
commit transactions, Postgres will ensure that only one of them succeeds, right? It's great at this conflict resolution and the consistency aspects of
the ACID properties that you want from a database. What we're very good at is computing these
denormalizations, these complex views, and keeping them incrementally up to date.
And we actually work very, very well downstream of Postgres. So one way that some of our users deploy Materialize is they have Materialize
essentially acting as a read replica, right? So Materialize connects directly to Postgres,
the transactions, all the writes land in Postgres, and then get immediately replicated within a
millisecond or a few to Materialize. And then Materialize gets to maintain all these
rich analytical indexes that essentially are kept incrementally updated as soon as the data comes in.
This way, the writes go to Postgres, and then the complicated reads essentially offload
compute from Postgres. Now, how do we actually do this? So under the hood, Materialize is built on this state-of-the-art
stream processing platform called Timely Dataflow. Now, Timely Dataflow was invented
or co-invented by my co-founder, Frank McSherry, who has done a lot of stream processing research
for, I think, coming up on seven to eight years now. Timely data flow is a fully horizontally scalable
stream processing framework on which we've built query planning and data flow planning such that we
can take an arbitrary SQL statement or a SQL view definition and convert it down into a persistent
data flow that is
horizontally scaled out on this timely data flow cluster.
We do have some folks who use timely dataflow directly as a stream
processing library.
It's an open source project, but most people don't want to do this, right?
Timely dataflow is written in Rust, and
you don't necessarily want to build and write Rust data flows and manually orchestrate them. So we think there's a large market for people who want those benefits of that incrementally updated, high performance, scale out, blah, blah, blah engine, but who want to interact with it the way they have interacted with databases for several decades, which is they write and define SQL queries, and these SQL queries just stay alive, and they don't really think about it,
and these things just stay alive forever as the data changes.
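For readers curious what driving timely dataflow directly looks like, the crate's basic usage is roughly along these lines. This is a minimal, hedged sketch of a hand-written Rust dataflow, not anything Materialize itself generates.

```rust
use timely::dataflow::operators::{Inspect, Map, ToStream};

fn main() {
    // Build a dataflow once; data then flows through it. Materialize plans
    // far more elaborate dataflows like this from SQL so users don't have to.
    timely::example(|scope| {
        (0..10)
            .to_stream(scope)
            .map(|x| x * 2)
            .inspect(|x| println!("seen: {:?}", x));
    });
}
```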
That's very interesting.
So how is timely dataflow different compared to other solutions out there,
like Flink, Databricks, and the rest of the stream processing platforms that we have seen in the
market until now? That's great. So first off, I'm going to do sort of a bad job answering this,
but there's a wonderful research paper called Naiad: A Timely Dataflow System,
which won several academic awards, that lays the foundational case for timely dataflow and how
it's novel. There's a few things, not all of which we currently take advantage of in Materialize
today, but a good example is timely data flow is capable of reasoning about cyclic data
flows, whereas most other data flow models are purely acyclic.
It is extremely expressive, almost to a fault.
So driving timely data flow around is hard and
something that we take a lot of pains to do correctly at Materialize in the Materialize
database layer. It is data parallel across a sharded data flow graph in a way that most other
data flow engines are not. So today, most data flow systems, say Flink or Spark streaming,
the primary way in which they scale across to use many more compute resources is by taking
various operators of the graph and placing them on dedicated CPU resources and flowing data from
a data flow node to another data flow node. So a good way to get intuition for this is let's say you
have two sources of data, each of which has some map operation, and then there's a join
operation, and then there's some subsequent map or filter or things like that.
Each of these things forms this graph of computation, and each one of those nodes gets its own
dedicated compute resource.
Timely dataflow is sharded in a very, very different
model that results in much higher performance, particularly in cases where you
have very, very large data flow. So let's say you have a SQL query that has eight different input
streams, complex sub queries, things like that. The actual execution graph of this may actually
be hundreds of nodes.
You as a user may not care. You just want that SQL to be incrementally updated.
Getting that data flow graph to get high performance in some of these other stream processing systems is very, very hard. Whereas with timely data flow, because of the way it
scales up and has this shared cooperatively scheduled data flow execution model,
makes it far, far more performant.
For more details, I would point you to the research paper,
because I'm struggling a little bit to convey some of the nuances
without the reference to some diagrams and some slides.
Yeah, yeah, makes sense, makes sense.
I mean, I was aware of the Naiad paper and also the timely dataflow model.
But I think it's something that people out there,
the community out there, are not that aware of.
So I think the more we can communicate and talk about it,
I think the better it is for everyone to start understanding,
thinking in new terms, right?
Because as you said,
timely data flow is like a different paradigm
of how you can process.
And whenever we introduce a new paradigm,
it takes a lot of repetition
from the people who know about it
and they evangelize this
to help the people out there understand.
And actually, it's very interesting
because we had an episode pretty recently
with CockroachDB. And one of the topics
that we were discussing was how important it is today for the engineers out there to start
thinking more about distributed computing and start
incorporating elements of it in the way you think as an engineer or as a developer,
right?
And I think this is one of the values that we, as people here, are sitting together and
discussing about interesting technical topics that we can offer to our audience out there,
how we can give them some guidance of, yeah, you know something, there's a different way
that data can be processed out there.
Maybe you should start also trying to think about this. Or, yeah, you might be like a web developer or
a front-end developer, but still, if you start thinking with and using some of the patterns that
come from distributed systems, it can probably help you with your work, and it can also help you work
much better with the back ends that probably are distributed behind the scenes. So that's why I find it always very, very valuable to discuss a little bit more technical
details.
Absolutely.
I strongly agree.
I think it's very important for developers building and using systems like this to understand
and appreciate what the right principles are.
One, so they can choose the right technologies to work with or the appropriate technologies
for the problems that they're solving.
But one of the things we maybe struggle with, and I appreciate you pushing a little bit
on this, is to what extent should we encapsulate and hide the complexity versus unwrap and
show the complexity?
So one of the big advantages of Materialize is you don't have to know, you just write SQL,
but there's a sort of inherent tension where,
you know, actually, A, everyone is interested
and definitely wants to know,
and B, maybe understanding will get you
the right intuitions for what computations
you can even execute
and how to go about choosing the right architecture
to build which systems you can incorporate
and not incorporate in your architecture.
Absolutely. I totally agree.
So, Arjun, you mentioned that by incorporating
this new timely data flow processing model,
Materialize manages to be very performant
compared to the rest of the solutions out there
for stream processing.
What kind of resources should someone who wants to start using it today
consider for setting up the open source version of Materialize?
So we aim to make Materialize very simple to use.
So you go to our website or our GitHub, you click the download button,
and you can run this on a single node.
You can scale up this node to handle a lot.
In fact, if you get the larger sized VM
and you run Materialize on it, you can ingest millions of messages,
a million messages a second.
You can install dozens of views and so on before even needing to consider
whether you need a multi-machine setup, as part of
making it easy to graduate beyond this.
In fact, you know, you will be very productive on a single node database.
We really go to great lengths to make it as easy to use as a database, right?
So you run it on a single node, you connect to it using a SQL shell or a SQL driver in
your language.
The lived experience is very much like Postgres, right?
Like, this is how most people run Postgres: they run brew install postgres or apt-get
install postgres, and they run it, and then it's living in a VM by itself in a cloud for
years of uptime.
So that's really the easiest way to get started.
We are building a cloud service, which we are launching publicly next month, which allows
folks to get even more advanced features.
So some of the features that we will be shipping in our cloud product is horizontal scalability,
where you have these very, very large data volumes, well north of a million messages per second, for instance.
And you do need multiple machines in a horizontally scaled setup to absorb that data volume.
And then two for having replication, right?
So if you have extremely high availability needs,
you're going to want multiple servers set up in an automatic failover capacity.
And that's something that our cloud product will,
not next month, but down the road, also support.
That's great.
And I'm very excited to hear that you are launching
a cloud version of the product.
And I want to ask you more about this. But before we go there, because we are going to spend some
time on it, I have a question that I don't want to forget to ask. And that's about, you mentioned
at some point that Timely Dataflow is implemented in Rust. So how did you decide to use Rust?
What's the reason behind that?
I think the original reason was Frank,
when he started coding Timely Dataflow,
he had just left Microsoft Research,
and he had been coding for a while
in the sort of .NET ecosystem.
He wanted to try something new
and Rust was a beta programming language at the time,
a very risky thing, but he was just playing around. I think a lot of these open source
projects, they start that way. So timely data flow was coded in Rust. Now I think
for highly data intensive applications, the best choices are Rust or C++ because the manual memory management and control is quite important
for predictable low latency experience. I think there are some places that have gotten good
at writing in Go. Go is a garbage collected language and not manual memory management.
So I had some experience with this
because I was suffering
from it at Cockroach.
CockroachDB is written in Go.
We struggled with it a little bit.
I don't think it's impossible.
I think you can definitely,
with enough sweat and effort,
essentially drive the garbage collector
around to do the kinds of things
that you would have wanted to do
in a manually managed environment.
There's pluses and minuses.
Rust, we doubled down on Rust
when we built Materialize
because one of the things we could have done is we could have left timely data flow as a Rust
underlying engine layer, and then built the Materialize database management layer in a
different language. And when we looked at that design decision, we thought about it a little bit
and we came to the conclusion that Rust was actually pretty great. And we were quite happy
to build it on Rust at all layers of the stack. So Materialize is 100%
written in Rust. And we're quite happy with that. I mean, I'm happy to go into like, more detail as
to our experience building in Rust and maybe contrasting a little bit to the Cockroach
experience in Go as well. Yeah, that's very interesting. And I'm asking you because Rust is
a pretty young language, but it's gaining
a lot of traction lately, and it's a very interesting language also from a, let's say,
research perspective, in terms of what kind of primitives they've added there in order to do
this kind of memory management. It's very interesting. And it's of course very interesting
to see that it starts to be used for systems out there that get into production, and in products that
are delivered out there. So that's why I was very interested to hear your opinion about Rust. And something that
is about Rust again, but from the perspective of being a founder and building teams, right? So
how easy it is today to find developers out there that can write in Rust or who are willing
to write in Rust? Right. So we don't expect our engineers to
know Rust when they join, although many of them do, certainly not all. We find that it takes a
reasonable amount of time on the order of a few months to get productive in Rust. This is probably
the biggest cost that we pay as an organization for building a product in Rust is there is a bit
of a ramp up time that we have to pay,
but that's fine. It is not difficult to find people who want to work in Rust. In fact,
I would say it's a significant attraction to several engineers who maybe if they've written
C++ code and they've lost so many weeks of their life to chasing down some memory leak or some manual memory management bug
and they want to move to a language or an environment where they get the benefits of
manual memory management, the performance, and they also don't have to deal with that class of
bugs. So we find quite a few people are very excited to work in Rust, although we do have
to take some time to let them ramp up.
And what is the reason that it takes a couple of months
to start being productive in Rust?
And that's probably also the...
Sorry for interrupting you, but I think this is probably one
of the main contrasts with Go,
because one of the benefits that I hear,
at least from engineers, about Go is that
it doesn't take that much time to be productive in Go.
But why does Rust have that,
that it takes five to six months to get productive?
I wouldn't go so far as to say five to six.
I think it's more like two to three months,
assuming we have an experienced software engineer who has been building backend or distributed systems,
which pretty much all the engineers that we hire fit that mold.
The primary difficulty,
and by the way, having worked in Go
and at Cockroach Labs,
most people can be productive in Go in under one week.
It's a truly incredibly concise language
to get productive in.
It's sort of, I would almost say,
optimized for productivity.
The primary difficulty with Rust is that most folks have a little bit of an adversarial
engagement with the compiler.
It can be a little bit frustrating. Essentially, what you are doing when you're writing a Rust
program is giving it sufficient type annotations that it is able to prove that certain classes of memory bugs
are provably absent. So it's a little bit of you are guiding a not very smart computer, because
it's not a human to follow a proof. And there's a little bit of it's too dumb to see that the code
you've written does not have a memory leak. This is often called fighting the borrow
checker. So the borrow checker is a part of the compiler that yells at you. And there's this
standard failure mode of like fighting the borrow checker for a while until you fully internalize
the limited ways in which the borrow checker thinks. And then you know, oh, this is where I
should probably add this annotation or do this thing or use this pattern in order to do the compilation step.
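A tiny, hedged example of the kind of proof the borrow checker is asking for: the commented-out ordering below is rejected because an immutable borrow would still be live across a mutation, while the reordered version compiles.

```rust
fn main() {
    let mut names = vec![String::from("ada")];

    // This ordering is rejected by the borrow checker: `first` immutably
    // borrows `names`, the push needs a mutable borrow that could reallocate
    // the vector, and `first` is still used afterwards.
    //
    //     let first = &names[0];
    //     names.push(String::from("grace")); // error: cannot borrow `names`
    //     println!("{}", first);             // as mutable while borrowed
    //
    // Ending the borrow before the mutation satisfies the proof and compiles:
    println!("{}", names[0]);
    names.push(String::from("grace"));
    println!("{} names", names.len());
}
```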
The other thing I didn't mention is, and this is a place where, given the novelty of Rust, I'd say this is a negative:
there just aren't that many libraries and pre-existing tools that you can draw on from a rich sort of open source ecosystem.
It's very different from Go.
And Go, like pretty much if you're looking for some compatibility to some driver or some
library or some parsing library or some security thing, like it's a very rich, mature ecosystem
compared to Rust where oftentimes we've had to, there's at least a couple instances I
can think of where we had to write a library from scratch,
whereas if we were writing in Go,
we would have used an off-the-shelf one.
Yeah, makes sense.
Although from my limited experience with Rust,
I have to say that Cargo is a very nice experience
for package management.
So yeah, there are always trade-offs.
Of course, it's a young language, right?
It takes time for the community to build everything there.
But with the traction it has, I think it will catch up pretty fast.
For sure.
And also some of these things that I'm saying,
they're not going to be downsides for people coming after
because there will be more software engineers
who are already fluent in Rust.
And hopefully we are a contributor as well,
adding some of these libraries that we've open sourced
and other people as well.
So a year from now, it'll be even easier.
So these are just growing pains. Yep, absolutely. You'd asked about how to get started with
Materialize. And I just wanted to jump in really quickly because we talked about,
obviously, the open source offering and then super exciting that you're launching cloud.
Arjun, one quick question. I'm just thinking about our audience here.
What use case would you encourage them to start with? I would encourage folks to start with something they can get running within a matter of a couple of days and really validate that the technology is capable of taking arbitrary SQL that you have,
business logic in your organization,
and moving it to real time.
And then that's a position from which
we can think through the more complex things
like actioning or integrating this into a pipeline
that sort of is part of an application experience.
But getting this value in as short a time as possible
is what I would encourage folks to do.
And that pretty much means some pre-existing business logic
or a pre-existing DBT model.
Since Materialize has a DBT plugin,
you should be able to take your pre-existing DBT model
and make it work on Materialize, ideally in a single day.
Oh, very cool.
Wow, I mean, that's extremely
fast time to value. And then just one more quick tactical question for our listeners.
Should listeners just go to materialize.com to get notified about the launch of the cloud product?
Yes, that will be front and center on our homepage. And in the meanwhile, you can download
the source available free product from there as well.
Sure. Great. Okay. Sorry, Costas.
I know we're close to time, and we have more to get to.
But I just constantly think about our listeners, and I love learning about new technologies. And I just want them to get the fastest way to understand how they can get in and kick the tires on it.
Yeah, absolutely. And it was very good that you asked these questions, Eric, because it's time to spend a little more time on the cloud version of Materialize. So what were the lessons learned from building a cloud offering for a product, or like a framework,
like Materialize?
Like, things that you expected beforehand and didn't happen,
and things you didn't expect,
but they happened.
Like anything interesting
that you can share with us
about this process
of turning this amazing piece of technology
into a cloud offering.
Absolutely.
The first one I would say
is the biggest reason why we're building a cloud product is
by far, we talk to our users, we talk to prospective users,
we talk to basically everyone in the industry, there's a wide consensus
that everyone wants to use a managed cloud offering
of pretty much all of the technologies that they use
because running and upgrading and manually maintaining these things
is not something that most people are interested in doing, particularly as things get more mission critical. And
let me put it this way: you'd much rather have somebody else carrying a pager than carry
a pager yourself. The more mission critical this gets, the less you want to be in charge of carrying
that pager when that system might go down. In terms of building a cloud service, one of the things that's very exciting, and this
is particularly true for companies like ours, where we're building this from day one, knowing
this, that the cloud product is the predominant way in which we are going to be successful as a business, is you get to think in
terms of atomic components that are cloud native. A very, very good example of this is separating
storage from compute. So storage is this infinitely scalable, extremely low cost service,
namely S3 or the S3 equivalents on the other major clouds. The extremely
high durability and extremely strong guarantees that you get from these services are a building block
that you can build, say, a database around, which means that there's an entire class of problems
that you don't have to engineer for, namely data loss or data corruption or replication
or things like that. You can rely on this atomic unit of an S3 bucket being the principal storage
layer for the vast majority of your data. And what this means is, of course, you get to use
your engineering budget instead of solving the same problem that everyone has had to solve pre-cloud.
You get to use this to solve new problems.
Another one that you get is the ability to use other services that are cloud native
for other components.
So a good example of this is, going back to Postgres, Materialize Cloud uses highly available
Postgres nodes under the hood for certain classes of metadata and things like that. Whereas otherwise, if we were
building a fully on-premise piece of software, getting this highly available would be a long
engineering challenge. At the same time, we love users who just want to use the source available
product, or they want to use it and deploy it on their own premises. The key distinction I would make is we've designed
Materialize Cloud such that the best place to get the highest number of nines of availability
is Materialize Cloud. So things like active-active replication, automatic failover, load
balancing, these are built using cloud native services and owned and operated by us as part
of Materialize Cloud, and they are not part of the downloadable on-premise offering.
And that's because fundamentally these things are designed
using cloud services, so they're not portable, right?
Like, you don't have S3 on your laptop.
You can, and yes, you can emulate it for testing,
but that's not how you would run a production service.
Absolutely, absolutely.
Operating a software and building a software are two different things.
So I have a question about the cloud offering compared to the experience that you described
about Materialize from the beginning. And it has to do with latency, right? You said that
Materialize is a system where you can expect single-digit millisecond latency
when it comes to the queries that you execute and the updates that you have.
My intuition says that in order to achieve that, if I'm consuming data on Materialize
from a database system that I have, I have to have my materialized nodes as close as possible to my database.
How can I do that when I use the cloud offering? So the first point I'd make is you're absolutely
correct. You want this to be very close to your database. But the other thing I'll observe is most
of the databases are in the cloud. So if you want to be close to the databases, you have to be in a
cloud instance by definition to be close to the databases that are in the cloud.
The important part of this is co-locating them as closely as possible.
And it usually would come down to region, availability zone, co-location, and things
like that.
You almost certainly don't want to move this data across clouds, right?
So our cloud service is launching next month on AWS, but
eventually we want to fast follow to Azure and Google Cloud as well, because if your database
is in one of these other clouds, you will have too much latency going between two clouds.
The other thing I would say is the clouds, the hyperscalers, the three big cloud companies,
have gotten very good at laying extremely high bandwidth, low latency network connections.
So as long as you're in the same region and spinning up your materialized instance in
a VM that is in the same region and perhaps even the same availability zone as your database,
they've done a very good job making sure that those actual packets that are
going across this virtual network will go over a fairly small physical distance.
That's great. One last question from me, Arjun, and then I'll give it to Eric so we can also
conclude this episode. You mentioned about co-location and all that stuff, and you mentioned
also about S3. So for the people out there who are interested in using the cloud version of Materialize when it's launched,
is this going to be on one cloud provider like AWS? Yes. Next month, we're rolling it out on AWS,
and then a few quarters later, we will be rolling it out on other clouds.
Okay.
So people can expect that in the next couple of months, if they are a GCP shop, Materialize
will also be available there, and on Azure, at least the major cloud providers.
I can't commit to a specific timeline, but one thing I will say is that there always
is the option of running Materialize in a VM, the downloadable source available product,
in a VM in an Azure region or data center.
That's great.
I think we need to have at least another episode
because I have more questions to ask.
But I have completely monopolized this conversation
and I need to give at least some time to Eric.
This has been really fun.
I really appreciate the questions, Costas.
Thank you.
Yeah, it's great.
I think we're close to the buzzer,
but we've talked about Materialize a lot just as a team
and Costas and I,
because we love discovering new technologies
and it really is a true joy just to get to talk with you
and just hear about the inner workings in many ways.
And I hope this has been a really fun conversation for our listeners.
Arjun, this has been such a wonderful conversation.
We'll definitely have to have you back on.
And congrats on the cloud launch.
That's going to be great.
Encourage all of our listeners to go to materialize.com and check it out.
And we'll have you back on the show maybe in another six months or so
after the cloud product's been live and hear how it's going. I would love to do that. This is an
absolute pleasure of a conversation. Thank you both. Thank you, Eric. Thank you, Costas. This
is a wonderful show you have over here. Well, Costas, I think one of the big takeaways I have,
and this won't be my takeaway from the content of the show, is that you and Arjun are incredibly intelligent when it comes to very deep concepts around
databases and languages that you use to build technologies. And so it was a real joy for me to
hear two very intelligent people reason around some of the decisions
that they're making.
I think the big takeaway actually relates to my big question on the front end.
Analytics is a really obvious use case, but all the other interesting things you can do
when you enable real-time, I think are just going to open up a lot of really creative
solutions to problems that are low-level plumbing problems in the stack
currently. And that's very exciting. I mean, coming from a marketing background, I think about
enriched profiles and automation and other things like that. And the ability to have this stuff in
real time from a database, I think it will actually be a very big driver of creativity
in the way that people are building experiences.
Absolutely. You're absolutely right. I mean, the closer you get to real time, the more use cases
you open. And I think we are just at the beginning of seeing what people can come up with using
technologies like Materialize. And I'm pretty sure that, like, if we talk again with Arjun,
like in six months from now, he will probably have
even more use cases to share with us.
So yeah, absolutely.
Materialize is a new technology, a new paradigm.
There's many new, let's say, patterns that we have
to learn and understand from there and experiment with.
It might take some time for people to figure out
how to use it, but my
feeling is that we are going to see very exciting things coming from this technology. I have
to say though that Arjun is also an amazing, amazing speaker. He was amazing in explaining
really complex concepts, so I really enjoyed the conversation. I was really happy to hear
about all the technology that they are using to build the Materialize product. And I'm also very
excited to see what's going to happen with the cloud version of the product. It's also very
exciting for me to hear that, regardless of the technology that someone is building, how this technology
is delivered and used is very important. And cloud is probably the best delivery model
that we have at this point for this kind of product. So yeah, hopefully in a couple of
months from now, we'll chat again with him and learn even more. Yeah, absolutely. I think as I
reflect on the conversation, a lot of really paradigm-shifting
technologies take something extremely complex and make the experience very simple. And there
are lots of examples of that. Being non-technical, but working with you closely
enough to understand, I know that when you talk about anything real-time related to a database,
from a technical perspective, that's an extremely
complex problem to solve. And I think if Materialize can simplify that, I mean,
that's pretty paradigm shifting. So it'll be really fun. And I think if they can accomplish
that, that'll be huge. Awesome. Well, thank you for joining us on the show.
Lots of really good episodes coming up this fall. We're actually about to wrap up season two.
So you'll see that wrap up coming up in the next couple of weeks. And then we have a great lineup
for season three. And until then, we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.