The Infra Pod - Databases are becoming commodity, what's next? Chat with Chris from Materialized View
Episode Date: March 19, 2024. Ian and Tim sat down with Chris Riccomini (ex-Distinguished Engineer at WePay, co-creator of Apache Samza) to talk about the spicy takes he has been sharing on his blog Materialized View, and even further takes on the future of what the data system world will look like!
Transcript
Hey, welcome back to yet another Infra Deep Dive podcast. I know it's been a little bit
since we had published our last one because I was on paternity leave, I think.
It's part of the reason. But we're back. Just a little introduction. Tim from Essence VC and
yeah, let's go. And I'm Ian Livingston, trying to sneak into a platform at the moment, doing a little angel investing.
And also, obviously, I spent some time with you.
I'm so excited for your addition to your family, Tim. That's exciting.
And also, I'm super excited to be joined by Chris.
Chris, why don't you introduce yourself? Tell us a little about yourself.
You're a very interesting character and have some really great opinions. I can't wait to dig in.
Yeah, thanks for having me.
So, yeah, my name is Chris Riccomini.
I guess historically I've been a software engineer. The two main companies I worked at were LinkedIn
and a company called WePay, which is like a fintech company.
I spent about seven years at each,
also a brief stint at PayPal early on in my career.
And I spent most of that time on data science, data infrastructure, and service infrastructure as well. I got involved in open source. I wrote the initial pass at Apache Samza, which was like an early stream processing system, a la Flink and Spark Streaming.
Then while I was at WePay, I wrote a book called The Missing Readme. Basically, I got tired of
saying the same thing over and over again in one-on-ones with new software engineers. I was
like, ah, there should be a manual.
So Dmitriy Ryaboy, who's an ex-Twitter engineer, runs in the biotech space now.
But we got together and basically wrote this sort of manual for new software engineers.
What is a sprint? How do you go on call? That kind of stuff.
And I also got involved in Apache Airflow while I was at WePay and was a mentor navigating it through the Apache Incubator.
Left WePay about two and a half years ago.
And since then have been doing this kind of some combination of investing, advising and writing.
So I've got this little newsletter that I use now, Materialized View, where I write a lot about infrastructure and sort of my take on different stuff.
I also work on a couple of little open source projects here and there. I got a bee in my bonnet and wanted to write an almost embedded data catalog. It was just annoying to me: at WePay, we were trying to install a data catalog for our data infrastructure team, our engineering team. And it was like,
step one is install Neo4j and MySQL and Kafka and Elasticsearch. And I was like, oh man,
like this is going to take us like three months to get our SREs to get this footprint up.
And all we really wanted was like
basically key value schema information.
So that eventually morphed into something called Recap,
which is sort of a type system
that's meant to model both, you know,
online, nearline and offline data schemas.
And there are a number of companies that are kind of using that now. One of them is like a data contract company. Another one is doing something more in the event pipeline space. So that combination of investing, advising, a little bit of coding, and writing is what I'm up to now.
Amazing. And also like writing hot takes about the future of the data space.
Yeah. Yeah. I try to write stuff that I am thinking about. I got to be
careful. I, you know, maintain control of myself. I tend to get riled up. But yeah, I definitely
have opinions. Good. And we're gonna do our best, between Tim and I, to try and extract those from you. You know, one of the things I'd love to just kick it off with is the state of data today. When I look at it, we've gone from this world where, you know, in the early 2000s you've got MySQL, in the mid-2000s you've got Hadoop, you've got maybe some Postgres.
And the data world in terms of tooling and infrastructure is very small.
And over the last 10 to 15 years there's been this explosion into all of these different data tools.
And you wrote this blog post recently about how
you basically see the data stack disassembling and all these different components. Can you help
Tim and I understand your view? What do you think is going on and why? And what's that look like?
What's happened and why is it happening? So the current decomposition of data systems is actually something that I think has been going on for a long time.
And I kind of tried to trace the history of that in my blog post.
You mentioned Hadoop.
And to me, one of the starting points for a lot of this is going back to Hadoop.
Pre-Hadoop and pre-MapReduce databases, I think, were in general a little more integrated.
So you had your Teradata's and your Vertica's and stuff like that. And then with MapReduce, you started to separate compute and
data, right? You had MapReduce and then you had HDFS. And then things kind of just kept evolving
from there. And so you started to evolve a data catalog like the Hive Metastore. And then you started to evolve a query engine on top of MapReduce. So you had Hive or Pig or systems like that.
And so once you started from the MapReduce building block,
which the instigation there was big data, right?
Like essentially, hey, we need a way
to process large amounts of data.
This is going to involve an architectural change.
And so that's where MapReduce came from.
But then it turns out everyone likes SQL.
And so you end up basically rebuilding
the database on top of this.
But because I think
Hadoop had already, it was A, open source, and B, already had kind of disaggregated compute and
storage, things kind of evolved from there. So that's sort of like one path that I kind of just
traced through. And you started seeing storage formats and Parquet and stuff. Everything was
decomposed in the offline data world. The second thread that I kind of was thinking,
I've been thinking about more recently is the evolution of Postgres. Historically, Postgres
kind of got beaten up for performance reasons. And I think it just lacks some features that I
think MySQL had. I think it was a little bit late to the game with replication protocol and CDC and
stuff. But a lot of that's been solved, right? It's got a really robust extension system, right?
And so again, you can start to see where, okay, you can plug stuff in. And so people are starting
to build Postgres extensions for OLAP, for graph database, for vector search, for GIS.
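To make the extension route concrete, here's a minimal sketch in Python, assuming a Postgres server with the pgvector extension available; the connection string, table, and data are placeholders, not anything Chris described:

```python
# A minimal sketch of the extension route, assuming pgvector is installed
# on the server. Connection string, table, and data are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # load pgvector
    cur.execute(
        "CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, embedding vector(3))"
    )
    cur.execute("INSERT INTO docs (embedding) VALUES ('[1, 2, 3]')")
    # <-> is pgvector's distance operator; plain Postgres now does
    # nearest-neighbor search without a separate vector database.
    cur.execute("SELECT id FROM docs ORDER BY embedding <-> '[1, 1, 1]' LIMIT 5")
    print(cur.fetchall())
```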
So on the query engine side, you've got Postgres there. And then underneath that,
you have people starting to play with the storage layer of it as well. I just love talking about Neon.
I find their architecture so fascinating.
But Neon is this open-source Postgres project that ripped out Postgres' storage engine
and replaced it with a distributed write-ahead log that's backed by object storage.
You can start to see the disaggregation in the Postgres world as well.
I think, again, it's some combination of scalability, like Neon's bottomless storage,
and also flexibility in terms of data processing,
different kinds of queries.
I think a lot of people just view Postgres
as a pluggable query engine
that they can just leverage to focus only on the area
where they can provide the most value.
So like, what is it, Hydra, the OLAP system, doing OLAP, or ParadeDB doing the Elasticsearch-killer competitor kind of stuff.
I think they kind of view it as a jumping off point
for, you know, being able to not have to rebuild
an entire database, right?
I think the third strand, I didn't really write,
I guess I wrote about it in the blog post,
but I didn't really tie it into these two narratives,
is just what's happening in the Arrow ecosystem. And to me, Arrow is sort of like the modern Hadoop. Like when I look at the
ecosystem that's being built around Arrow, it's sort of like, you know, stemming from Pandas and
machine learning and integration there. But then they've started to just layer on more and more
really interesting projects,
cross-language compatibility and DataFusion, which is this decomposable query engine. And then they've got obviously Arrow, the in-memory storage format. And so I think it's driven
initially by the sort of data frame, PySpark-ish kind of use cases. But once you have these
building blocks, you can imagine building all kinds of
really interesting databases. You know, I mentioned Arrow and Data Fusion, but it goes beyond this,
right? There's Velox from Facebook, which is sort of an execution engine. So it's the bottom layer
of a query engine, which is a thing that's meant to take sort of logical query plans and actually
translate them into physical query plans and execute them on a runtime like Spark or Flink
or Presto or whatever.
Then you have Substrait,
which is like essentially a way to model
these logical query plans.
Point being, there are a whole bunch of projects
in this space, but it kind of looks to me
like Hadoop circa, you know, 2009, 2010,
where there's this sort of Cambrian explosion
of like all this new stuff.
In a way, I kind of look at it as like rebuilding
a lot of the Hadoop ecosystem for the modern era.
And so I think those three threads all sort of inform how I look at the decomposition.
I mean, that's a great explanation about why this is happening.
Immediate question I have.
Do we think there's another consolidation phase where we kind of end up with a new stack that you can describe with four letters?
And it's kind of like this Cambrian explosion
followed by natural selection.
We pick the best for the
task. That results in a deep
consolidation. We end up back with a new stack.
Or are we entering a world
where, you know, a good example is the
Postgres ecosystem, where we have this common core, right? We have this common core that you can kind of plug building blocks into, and it's actually easy to swap stuff.
And we're just trying to find which are the best pieces, and then we'll forget the other ones?
Or is there actually a world where we can have this big, beautiful ecosystem of lots of plugins?
It kind of looks more like JavaScript's npm or Python's pip, like a package ecosystem.
Or is it really like there's just not that many use cases?
We just want to figure out which is best.
Yeah. Oh, boy. Okay, so you hit the nail on the head. My most recent blog post is kind of thinking through some of this. And then I have another blog post, which I have not published, which is
comparing and contrasting. If I were writing a database, would I write it as a Postgres
extension or would I write it as a Postgres compatible protocol? If I unpack your question
a little bit, it's sort of like, is Postgres going to be like the OLTP platform that everyone
builds off of and they use query engines and they kind of like modify it? Or is it going to be the
case that, you know, we're going to have a bunch of different systems? And I think in that world,
what's likely is that everyone will congregate around the Postgres protocols. And by Postgres
protocol, I actually mean three things.
I mean, like the Postgres wire protocol, which is like super simple,
the SQL dialect, i.e. Postgres SQL,
and then the backend replication protocol,
the streaming replication protocol that they have.
Those are kind of the three things that I think,
if you have compatibility on, you know, pick any two,
like you're probably going to be able to slot into
most ecosystems pretty well. And I think if you look at systems like CedarDB, which is this new HTAP PG-compatible system, they have a very nice story around, oh, if you want to use us, we're very easy to drop in, and you can drop us in for only analytics first. And then
as you get comfortable, you can start using this for OLTP workloads and transactionality as well, if you like, or you don't have to, right?
So the question is, are we going to start to unify around a query engine or some piece of
infrastructure? And I think the one in my mind that would be most likely would be the Postgres
query engine, right? Having talked to a number of people that have built actual PG extensions,
I very much initially came down on
the side of, I think extensions are pretty interesting and like a nice way to start things
off. After talking to people, I've definitely come around to the viewpoint that I think the future is
most likely going to be Postgres compatible replication and front ends and stuff, and that
the databases will probably be built from scratch.
The reasoning there is sort of twofold. One is eventually you want to get more control over
Postgres than it's going to give you from the extension library. So you're going to start to
get straitjacketed in. And so you're going to have to start essentially moving off of Postgres
or forking it. And so if you look at Neon, actually, they had to fork Postgres because
in order to make the WAL API remote, they had to change it.
And they're saying they're hoping they're going to be able to get the patch merged back in. I actually pulled this from a blog post a year or two ago.
I don't know if they did manage to get it back in or not.
I haven't checked.
I just view that as one sample of like, okay, they had to fork it.
The other thing that I find kind of interesting is I have thought that from an operational
standpoint, it would be appealing to be able to just run Postgres and then install these extensions
and operationally that might be a little bit easier. But I think in practice, that's probably
not true. For one thing, like running Postgres as an OLAP database is like just fundamentally
very different from running it as an OLTP database. And this is sort of like the common criticism of HTAP.
So it's unlikely that you're going to be running one Postgres that is both OLAP and OLTP or GIS or whatever.
As you scale that workload, you're probably going to end up having to shard or partition your Postgres by their workload.
And so it's like, okay, so your operational footprint is already going to look different.
The operational characteristics
are already going to look different.
How much more of a leap is it
to just install the purpose-built thing
that is Postgres compatible
and can like replicate off of the OLTP database
and handle GIS workloads
or vector search or whatever it is.
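As a sketch of what "replicate off of the OLTP database" can look like in practice, here's a minimal example of consuming Postgres' logical replication stream with psycopg2; it assumes wal_level=logical and the wal2json output plugin are set up, and the connection string and slot name are made up:

```python
# A minimal sketch of consuming Postgres' logical replication stream,
# assuming wal_level=logical and the wal2json plugin are set up.
# Connection string and slot name are placeholders.
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.create_replication_slot("demo_slot", output_plugin="wal2json")
cur.start_replication(slot_name="demo_slot")

def on_change(msg):
    # Each message is a JSON description of a committed change.
    print(msg.payload)
    # Acknowledge, so the server can discard WAL up to this point.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(on_change)  # blocks, streaming changes as they commit
```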
And so I've come around more and more
to believing that the future
is probably Postgres-compatible
protocol systems that are built from the ground up. And I think another, you know, sort of tailwind
for that is that it's much easier to build databases now than it was 15 years ago. Like,
there's all these building blocks that you can grab and build out. There's RocksDB, and there's,
you know, Valox, and all the stuff that I've mentioned that makes it not only easier to build
databases, but makes it easier to build trustworthy databases that are going to work.
So my opinion is probably the future is a bunch of different Postgres-compatible databases
that all plug into each other over this common API and protocol.
So I'm actually quite curious about how usually a data system gets widely adopted.
Because I think over time, you worked at LinkedIn.
LinkedIn has a bunch of teams working on a lot of systems that satisfy LinkedIn's need.
Looking at your background, started working at LinkedIn and went to WePay,
I'm sure you see a big difference in terms of the culture and how people do things.
I saw the Hadoop adoption, how it happened.
You know, we saw Spark, you know, we also saw Kafka.
A lot of things usually start with internal projects and then get spread through almost these very interesting religious moments of a conference talk, a blog post, and then a bunch of people shouting, this is the future. And somehow it got adopted. I'm actually very curious, because I think to talk about what becomes the new system, I'm also very curious: have you seen the way people adopt and evangelize data systems change, or has it been pretty much the same? I just wonder how you view the next explosion, because either it's Postgres, you know,
being a front end or not, or, you know, there's so much vendors now that are all fighting to
become the next future. Who do you think will be the first users of one of those ecosystems?
And do you see the same Hadoop-type religious moments happening the same way with how people talk about the DataFusion era type of stuff?
Or do we not know yet? Maybe give us your take on how things will take off.
Yeah, that's a really good question.
So I think the Hadoop adoption
story was one of necessity, right? Essentially, Hadoop was not anybody's first choice.
Like at LinkedIn, when I joined, one of the main issues we were having in the data side of things
was around this people you may know algorithm, which was this little recommendation system we
had of who you should follow. And the most important signal in that was this thing we called triangle closing,
which is essentially friends of friends.
So it turns out to compute friends of friends,
you have to do a self-join on the connections to get a second degree out.
That's an insanely expensive operation on a traditional database.
So when I joined, we were doing this on Oracle.
And it got to the point where we were
just running the query and it would just never return. Days would go by and it would just not
return. We then tried Greenplum. We tried AsterData. We tried this weird OSGI in-memory thing. So we
literally tried four other things before we tried Hadoop. None of them worked. Literally none of them worked in any reliable way.
So we landed on Hadoop because nothing worked. And the big aha moment for me is like,
I wrote the code and we ran it and I got a percent bar, percent complete, like the number of MapReduce tasks completed. I could see the progress, because with all these other relational databases,
it's just like you run the query and you have no idea. Like, is it going to finish now or in 10 weeks?
Like, I don't know.
It's just doing stuff, right?
And so it ran, I could see the percent,
I could go get a cup of coffee and see it progress
and I could see, okay, it's going to finish.
And like, you know, unless it runs out of memory,
it's going to finish.
And it did.
And so we adopted it.
And, you know, fast forwarding to now,
scale is, I won't say it's solved,
but it's much more well understood.
So the growth story, I think scale is almost table stakes.
If you're going to compete in the OLTP world, you're going up against TiDB and Yugabyte and Spanner and Aurora and CockroachDB.
They all scale, right?
They all scale on the OLTP side.
And if you go on the data warehousing side, it's the same story with BigQuery and Snowflake, and now you've got data lakes.
So on the one hand, scalability is pretty much table stakes.
On the other hand, you have the DuckDB crew singing the you-don't-need-scale tune: you're going to be fine with a single query engine running locally on your desktop, just doing some basic pushdown into Parquet.
That story, I think, is more or less done.
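A minimal sketch of that single-node pattern, assuming the duckdb Python package; the Parquet path and columns are placeholders (S3 paths would additionally need DuckDB's httpfs extension):

```python
# A minimal sketch of the DuckDB "you don't need scale" pattern: one
# in-process query engine pushing filters down into Parquet scans.
# The file path and columns are placeholders.
import duckdb

con = duckdb.connect()  # in-process; nothing to deploy or operate
rows = con.execute(
    """
    SELECT user_id, count(*) AS orders
    FROM read_parquet('orders/*.parquet')
    WHERE order_date >= DATE '2024-01-01'  -- pushed down into the scan
    GROUP BY user_id
    ORDER BY orders DESC
    LIMIT 10
    """
).fetchall()
print(rows)
```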
So in terms of differentiation and how these vendors are going to compete,
I have three different ideas.
This is something I just wrote about on how vendors could compete.
I think one of them is building platforms.
And I think the second is building more verticalized, very hyper-specific
databases. And I think the third is finally cracking the HTAP nut or the multi-model database,
which is something that has been, you know, sort of the panacea that never came to be.
And so on the platform side of things, the way I see this playing out is essentially just that
the database is a commodity. Everybody has all the same features. They're built off the same open source libraries.
There's no real exciting difference between them. But the surrounding ecosystem
around managing that database, managing database schema migrations, indexes,
query optimization, snapshotting, CDC,
data integration, all that kind of stuff is as or more
important than the database technology
you pick. And a lot of that still is like this hodgepodge of, well, I'm going to run Debezium and I'm going to have Flink in there. And for database migrations, who knows what, are you doing Alembic or... it's just a complete mess. There's also really interesting
work going on for new features like forking and branching that companies like PlanetScale are
doing. And so I think one avenue you could go down as a vendor is this platform route, right?
I'm not going to make the claim that these three things are mutually exclusive.
In fact, I think the winners will probably adopt more than one of these.
So in the vertical database world, or what TigerBeetle calls the domain-specific database,
I'm sure that's probably been around for a while.
But in any case,
these are essentially databases that are built for a very specific use case. And the two I
cite often are Nile, which is this new database that's built for SaaS providers. So it understands
the idea that, hey, you are a SaaS provider, you have a bunch of different customers or tenants,
and they all have different needs. And so they kind of can build stuff into the database that makes your life easier.
Disclaimer: I have investments in both Nile and TigerBeetle. TigerBeetle is a financial database
that's purpose-built for essentially double entry bookkeeping. They don't have a generic
relational data model. They've got like, this is a ledger. You have credits and debits. And
they're very opinionated about what that looks like. And then they're very hyper-focused on
transactionality and consistency, which coming from my prior FinTech experience, we had to build
essentially the same system and it was a complete nightmare to get right. It's very hard to do that.
So being able to pull that off the shelf is super interesting. Point being, there's probably a lot
of room for other domain-specific databases in other verticals, you know, healthcare and biotech and on and on and on.
And then the third way I think database vendors can really kind of differentiate is around
this HTAP stuff and multi-model databases. In the world I previously described, you say, okay,
I had Elasticsearch and I had Neo4j and I had Postgres and I had BigQuery.
Now I have Postgres, I have BigQuery with a PG-compatible protocol. I have Elasticsearch
with a PG-compatible protocol. I have Neo4j with a PG-compatible protocol. I'm still running four
completely different systems. That's incrementally better, and the data integration story is better. But, you know, operationally, it still kind of
sucks. What I would really like is to have fewer systems, right? That I can all operate in a little
bit of an easier way. And so I think if vendors can figure out how to make HTAP or multi-model
systems work, then the architecture fundamentally changes for the better. Because I
have fewer systems, I have less operational overhead, less cost, less complexity. And I think
the if in that sentence is doing a lot of the heavy lifting there. But I'm somewhat optimistic
just due to the way cloud storage is rolling out. We've got NVMe and we've got all these new storage engines and stuff. I actually do think it's possible. And again, citing CedarDB
and Umbra,
which is the sort of academic system
that they are built off of or came from,
there's signal, like there's signal
that this looks like it could be possible this time around.
And if that's the case,
like whoever nails that is gonna be a real big winner.
And so you can start to imagine,
well, like, okay, you know,
we've got some vendors that are going after multi-model; there's SingleStore and Unistore and so on.
We've got some vendors that are going after verticals.
And then eventually they're going to inevitably have to start building platforms.
And you can already see some of this with like Nile, where they're building platform-like features around forking of schemas and slowly rolling out schema changes across tenants and so on.
And so I think it's likely that you'll see some spillover.
But I think that's how vendors are going to have to compete is on the surrounding stuff.
I'm really curious to dig in.
That was a master class, by the way.
But one of the things you just said that I'm curious to dig in on is you talked a little bit about the object store.
I think this is something Tim and I have talked to multiple people about: why the future of data is object stores. I'm curious to get your take on how disruptive you think this is
in the context of, you know,
we had WarpStream on
and they're rebuilding Kafka
on top of the object store.
Like, how do you think the object store
accelerates both this sort of
a Cambrian explosion in many ways
or the rebuilding of the data ecosystem
to the default being, you know,
an S3 bucket or a GCS bucket?
And the other part of it is,
what does that do in terms of
the way we build our data ecosystems?
Yeah, I've been thinking about this quite a bit too.
So I have two thoughts.
The first one is, yes,
I'm definitely a huge proponent of object store
as primary store.
And I think there's sort of two ways that that pans out.
One is like
literally all your data is stored in the object store and the software is stateless. The classic example is something like WarpStream, which you mentioned. Caveat: I do have some money in WarpStream as well. Their agents are literally stateless, and all the data goes into the object store, right? S3 or GCS or whatever.
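Here's a minimal sketch of that stateless pattern, assuming boto3; the bucket, key layout, and record shape are illustrative, not WarpStream's actual format:

```python
# A minimal sketch of "object store as primary store": a stateless writer
# that batches records into immutable segment objects. Assumes boto3;
# bucket name and key layout are illustrative, not WarpStream's format.
import json
import time
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-log-segments"  # hypothetical bucket

def flush_segment(records):
    # One immutable object per batch; the writer holds no local state,
    # so any instance can crash and be replaced without recovery.
    key = f"segments/{int(time.time() * 1000)}-{uuid.uuid4().hex}.jsonl"
    body = "\n".join(json.dumps(r) for r in records)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode())
    return key

flush_segment([{"topic": "orders", "offset": 0, "value": "hello"}])
```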
The other pattern is something more like Neon, where the object store is sort of primary storage, but there's this
write-ahead log. And it's not part of the object storage, but it gives you multi-region
transactionality and consistency and lower latency and stuff. I think it's likely over the long term
that, with stuff like S3 Express, especially as that goes multi-region, which I'm sure it will, even the write-ahead logs will move into the object storage as well.
And then the main prohibitive issue there is just going to be cost.
I'm not sure how that's going to play out.
There may still be some requirement for caching outside of the object store.
My hope is that competition between, you know,
GCS, S3, and so on helps keep costs reasonably low.
Maybe not, maybe not, right?
We'll see.
So I'm not sure whether we're going to continue
to need off-object storage caching
in order to mitigate the cost issues.
It probably depends a bit on use case.
This is somewhat of a tangent,
but I've been kind of noodling about building Redis
on top of S3 and what that would look like. And I was doing some back-of-the-envelope math about, okay, how much would this cost if I'm doing a bunch of key-value gets and puts on S3 Express? And at low volume it's reasonable, and at high volume it gets more pricey.
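As a rough illustration of that back-of-the-envelope math, here's a sketch; the per-request prices are assumptions for illustration, so check current S3 Express pricing:

```python
# Rough back-of-the-envelope math for key-value traffic on S3 Express.
# Per-request prices are assumptions for illustration; check current
# AWS pricing. The point: request charges dominate at high volume.
PUT_PRICE = 0.0025 / 1000  # assumed $ per PUT
GET_PRICE = 0.0002 / 1000  # assumed $ per GET

def monthly_request_cost(gets_per_sec, puts_per_sec):
    seconds = 30 * 24 * 3600  # one month
    return seconds * (gets_per_sec * GET_PRICE + puts_per_sec * PUT_PRICE)

print(f"${monthly_request_cost(100, 10):,.2f}/mo at 100 gets/s + 10 puts/s")       # ~$116.64
print(f"${monthly_request_cost(10_000, 1_000):,.2f}/mo at 10k gets/s + 1k puts/s")  # ~$11,664.00
```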
But yeah, I think it's a big deal using object storage. The big deal for me really is that it, out of the box,
gives you super high durability, multi-region replication,
essentially for nothing.
And those are the hardest things to get right when you're building a system: getting multi-region replication, getting high durability.
To build the consensus algorithm and the
replication protocol and like all the stuff you need to do is just a nightmare. And so moving
into a world where we don't really have to think about that and it's just sort of commoditized by
the object store is like a huge, huge deal. This is going back like 15, 20 years, but I have this
vivid memory where I was working on Samza at the time, the stream processing system. And I was
sitting next to Jay
Kreps, who's the original author of Kafka, and he
turned to me at one point and he was just like,
if we had a distributed file system
that actually worked, we wouldn't
need to do any of this. I was like,
wow, that's mind-blowing.
And I think while we're not there yet
with S3 and
stuff, we're getting
closer and closer and closer, right?
You can start to imagine a world where,
you know, you get S3 Express,
you start getting multi-region S3 Express,
and maybe you get optimistic concurrency control on writes,
which is something GCS can provide that S3 doesn't.
Like, you can start to get really, really close
to, like, what you would need to do
a lot of highly consistent operations.
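To illustrate that optimistic concurrency primitive, here's a minimal sketch with the google-cloud-storage client, using a generation precondition as a compare-and-swap; bucket and object names are placeholders:

```python
# A minimal sketch of optimistic concurrency control on GCS: a
# generation precondition turns an object write into a compare-and-swap.
# Assumes the google-cloud-storage client; names are placeholders.
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-state-bucket").blob("leader-lock")

blob.reload()                 # read the current generation number
current = blob.generation
# Succeeds only if nobody wrote in between; otherwise the server
# returns 412 Precondition Failed and we would retry.
blob.upload_from_string("epoch=42", if_generation_match=current)
```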
And so I just feel like the architectural advantage
of not having to
worry about any of that stuff is just astronomically high. So I'm a big proponent of it.
With respect to your second question about what is the implication for the sort of data space
writ large, I don't have a good answer, but something I've been thinking a lot about is
if I imagine a world in which all of our primary store data is living in object storage, what does that do for the data integration story?
Let's imagine your MySQL slash Postgres data is in S3. Let's imagine that your streaming data is
in S3. Let's imagine that you've got your data lake there with your data warehouse. I think one
thing that it definitely does in my mind is it definitely adds sort of a challenge
for the Kafka world,
because Kafka, I think, by and large,
one of its major use cases is data integration.
And the idea is, hey, we're going to tie together
all these disparate systems
and move the data from one to the other.
And if all the data is already sitting in S3,
it's like, well, what does the data integration story
look like there?
It probably looks something more like Flink or Spark,
like reading these files and transforming them
and loading them into the data lake or whatever.
I'm not sure what it does for the data integration streaming story
that we've kind of had for the past 10 years or so with Kafka.
I don't have a clear vision on what that looks like
because on the one hand, it seems very natural that you would want to do some kind of ETL on top of S3 to load data from your primary stores into your data lake and so on, or into your search engine, or what have you. But each of these systems is going to have its own storage format. And not only that, they're also going to have their own directory structure and semantics about when the files are going to get deleted and when it's safe to read the files, and so on, right? And so
it's not clear to me how that's going to shake out.
Like, is everyone going to be reading these raw files
or is there going to be some like agreed upon format?
Maybe the format is just the data lake.
And so everyone shovels their data in and integrates with, you know,
Iceberg or Delta Lake or whatever.
I'm not sure what that looks like,
but it seems to me no matter how it shakes out
that it poses a challenge
for the streaming data integration story.
So let's jump into what we call the spicy future.
I think you already alluded to a lot of possible future states in our last two conversations of sorts. But I think to make it even more interesting,
take a stance.
What do you believe should happen in this data system world in the next, let's just pick, three to five years? Like, you mentioned a bunch of object storage and Arrow and DataFusion, there's a lot of stuff.
What do you believe should become the norm in data systems?
I think I'm going to make some compromise between my dream world,
which is just one Postgres system that does everything,
and we figured out multi-model and HTAP on one system.
I think the more realistic and likely scenario is that we will shrink our operational footprint somewhat
by having a bunch of different
Postgres-compatible databases
that serve the standard use cases
that we have, Vector Search and, you know,
Text Search and GIS
and all the stuff I've mentioned.
And there will be fewer than there are now
because some of those workloads
will get folded in.
You know, PG may be able to service vector search.
I think there's a very good article,
I forget who it was from, maybe the OtterTune folks, about whether vector search is just a feature or a database.
So some of them could get folded into Postgres.
Some of them could get folded into the data warehouse.
And so I think we'll see some slight shrinking
in terms of the operational footprint
of the data systems that we run.
I think they'll all be stitched together
by Postgres and Postgres-compatible protocols.
I think you can probably swap the MySQL and Postgres protocols interchangeably in that statement, but I think, by and large, Postgres seems to be winning out. And then I think these systems will,
for the most part, be built on top of object storage. And again, most of them will probably
use some form of caching layer on top of the object storage, but a few will not, right?
So I think two that definitely aren't right now, I can cite examples, are TurboPuffer, which is this vector search serverless system, straight to object store.
And I think WarpStream is the other one.
And so I think that's what the database world will look like.
And then I think on the data integration side, we're going to start seeing a lot more Spark, Flink, and related systems. I
think there's some new up-and-coming ones that are going to be servicing sort of the data integration,
data lakehouse, maintenance side of things. So that's kind of my take on how I think
things will shake out. How do you think the evolution of LLMs plays in the data space? Because I mean, it's interesting, right? Like, you alluded a little bit to the vector store collapsing in. A lot of the discussion in the space around LLMs, from a data standpoint, has been like, how do I get data into the vector store?
How do you think about evolution?
Because it is interesting, and I'm also curious,
how does moving data to the edge play into this?
I guess if the object store is there, then the object store will go to the edge, and you run this at the edge.
I'm curious, have you given any of that thought where that kind of fits?
Do you have a take on it?
Oh, boy.
So edge is a whole other world that we can talk about.
On the LLM front, I will just say flat out, I am probably not qualified to talk a lot about AI and the future of LLMs. I have somewhat deliberately stayed away from it because I just think it's a lot right now.
It's very chaotic, and I just haven't put in the effort to really track it.
On the vector search and sort of RAG front, this is just anecdotal.
I saw this company, OneGrep, that's essentially doing RAG search on your code base,
and they're providing the ability to ask questions about your code. But one of the things I noticed
is that they're not just dumping your code into the vector search stuff. They're doing a whole
bunch of pre-processing first. And they're parsing the AST, and they're doing documentation
generation before they put it into the RAG engine. And so there's kind of this almost ETL-ish looking
thing, this transformation thing that's happening before it even goes into RAG. And so there's kind of this almost ETL-ish looking thing, this transformation
thing that's happening before it even goes into RAG. And so that's something that kind of struck
me as interesting that I think is kind of related to what you're asking about how data is getting
into these systems. But I don't, I just, I'm not qualified to give you a good answer. So I'm going
to back away from that one. On the edge and embedded world, very interesting. So the object
store stuff, I think helps a lot with that
because there's buckets in every region and you can ship stuff around. And again, going back to
Nile, this is something they offer where at the tenant level, you can be like, to use DoorDash
as an example, not a customer of Nile as far as I know, but if the restaurant is in Oakland,
then they can place the tenant's information in the bucket and,
you know, US West one or whatever. That's helpful. I think as you go beyond the server and object
store into client side, that's where it gets even more interesting. And the area that I've kind of
been poking at there recently is around CRDT and how you manage to give the developers a reasonable
experience when they're developing with, I don't want to say local first, but just areas where you're spanning client and server.
And then it's like a whole bunch of different takes on this.
But I think that's something that we need to figure out sort of collectively in industry,
because my experience with databases in the past has been that developer ergonomics matter a lot.
And, you know, going back to, you know, the Voldemort key value store that we built at LinkedIn,
which was this
Dynamo, not DynamoDB, but Dynamo clone, which is sort of this eventually consistent kind of
database that used vector clocks. We actually exposed the vector clocks to the application
engineers. They were just all like, oh, what's this? Ignore. And so that was a lesson that kind
of left an indelible mark on me. It was like, yeah, you kind of can't do that. It has to work
well. And so when I look at CRDT, a lot of what I see is like, oh boy, this is like super complicated
for the application engineers to think through. And basically the moment you ask them to think
about anything beyond last-writer-wins, they just are like, yeah, it's not going to work.
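For a sense of why last-writer-wins is the ceiling for most application engineers, here's a minimal sketch of an LWW register, about the simplest CRDT there is; the names and values are made up:

```python
# A minimal sketch of a last-writer-wins register, about the simplest
# CRDT there is: merge just keeps the later write. Richer CRDTs push
# merge semantics onto the application engineer, which is the problem.
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float  # a hybrid logical clock in a real system
    node_id: str      # tie-breaker so merges are deterministic

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Commutative, associative, idempotent: replicas converge no
        # matter what order they exchange states in.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            return other
        return self

a = LWWRegister("draft-1", 10.0, "laptop")
b = LWWRegister("draft-2", 12.5, "phone")
print(a.merge(b).value)  # draft-2 -- the later write wins
```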
So I think Edge is a really interesting space. I think we really need to figure it out, but I don't
know that we've necessarily cracked it yet. I think the other thing I would say about edge, just getting back to sort of the three ways vendors can make money and the verticalized approach:
I think there's a lot of different ways to tackle this. I think if you look at the company like
Ditto, which a friend of mine just joined as their VP of engineering, they're an Edge company,
but they're hyper-focused on payments. And so their product is not just the Edge database,
but also this networking layer that's designed to create a mesh and use Bluetooth low energy
and handle these really spotty internet things. And so they were purpose-built for this low
connectivity environment. And I think if you look at other Edge databases, they're built for
reactivity. And so they're thinking more about, well, I have a React front end. And so every time
somebody changes something, it needs to propagate out and update the UI and stuff.
So I think these hyper-specific, verticalized edge databases are a thing that's going to be increasingly common.
I think there are also generalized edge databases that are trying to figure this out in more detail, more generalizable.
I think the ones there are Turso, ElectricSQL, these kinds of companies. Caveat: I have a little bit of money in ElectricSQL, but I think they're a little more general purpose, and they're trying to provide semantics that feel more ergonomic. So ElectricSQL has done all this work around TCC+ and trying to, as best as possible, make developers
feel comfortable with running these local-first databases. But it's a challenge. I'm rambling now, I will stop. Super fun. Well, Chris, I think maybe the last question before you go: talk about what's next for you personally, because you've been doing investing, advising. I guess, is this like your full-time gig? Maybe you don't know yet, but I'm just curious too, like, what is next for you? Yeah, boy, I don't have a good answer to that question. I thought I would get bored by now and want to go get a job.
I kind of explored the VC space a while back and was looking at like all these different approaches
to venture capital, joining a firm, starting a fund, doing a rolling fund, doing a scout fund.
And none of those really suited what I wanted to do, so I kind of backed off that. And so I'm sort of just in exploration mode right now, I guess, for lack of a better term. I'm fortunate it's unbounded. I'm pulling at the threads I want to pull at, talking to people I want to talk to. It's a lot of fun. It's been great conversations, being able to connect people and talk with different people, and I'm learning a lot. And so, I don't know, I guess I'm just sort of opportunistically and greedily grabbing at
whatever pops up that's interesting. So it's not a good answer, but it's sort of like, you know,
a ronin wandering the hillside, a masterless samurai. That's when we know you're going to be on a great
path of something you've never seen and never heard before. That is actually even more exciting, right?
Cool. Well, how do people subscribe to Materialized View, and how do people find you on social media?
Yeah, materializedview.io is the newsletter. I'm on Twitter at C and then my last name, Riccomini. On Bluesky, I am chris.blue. Those are probably the
three best ways to get ahold of me. My DMs are open. I'm free to chat about all interesting things,
and I appreciate you guys having me on. This is a great way to kick the morning off for me on a
cloudy Thursday. Well, we are so happy to have you. This is so much fun. And I think we're all
kindred spirits of interest in what's happening here. I learned a lot, but also I think this is a great conversation; a lot of people are learning a lot of good stuff, and we look forward to having you back. I'm sure, as the space evolves, we're gonna come back at this over and over again and be like, let's talk about this new thing that happened. So it was a great time. Yeah, awesome, appreciate it.