The Infra Pod - Databases are becoming commodity, what's next? Chat with Chris from Materialized View

Episode Date: March 19, 2024

Ian and Tim sat down with Chris Riccomini (ex-Distinguished Engineer at WePay, co-creator of Apache Samza) to talk about the spicy takes he has been sharing on his blog Materialized View, and even further takes on what the future of the data system world will look like!

Transcript
Starting point is 00:00:00 Hey, welcome back to yet another Infra Deep Dive podcast. I know it's been a little bit since we published our last one, because I was on paternity leave. I think that's part of the reason. But we're back. Just a little introduction: Tim from Essence VC, and yeah, let's go. And I'm Ian Livingston, trying to sneak into a platform at the moment, doing a little angel investing. And also, obviously, I spend some time with you. I'm so excited for the addition to your family, Tim. That's exciting. And also, I'm super excited to be joined by Chris. Chris, why don't you introduce yourself? Tell us a little about yourself.
Starting point is 00:00:38 You're a very interesting character and have some really great opinions. I can't wait to dig in. Yeah, thanks for having me. So, yeah, my name is Chris Riccomini. I guess historically I've been a software engineer. The two main companies I worked at were LinkedIn and a company called WePay, which is like a fintech company. I spent about seven years at each, also a brief stint at PayPal early on in my career. And I spent most of that time in data science, data infrastructure, and service infrastructure as well.
Starting point is 00:01:03 I got involved in open source. I wrote the initial pass at Apache Samza, which was like an early-on stream processing system, a la Flink and Spark Streaming. Then while I was at WePay, I wrote a book called The Missing README. Basically, I got tired of saying the same thing over and over again in one-on-ones with new software engineers. I was like, ah, there should be a manual. So Dmitriy Ryaboy, who's an ex-Twitter engineer and runs in the biotech space now, and I got together and basically wrote this sort of manual for new software engineers. What is a sprint? How do you go on call? That kind of stuff.
Starting point is 00:01:40 And I also got involved in Apache Airflow while I was at WePay and was the mentor that navigated them through the Incubator. I left WePay about two and a half years ago. And since then have been doing this kind of some combination of investing, advising, and writing. So I've got this little newsletter that I run now, Materialized View, where I write a lot about infrastructure and sort of my take on different stuff. I also work on a couple of little open source projects here and there. I got a sort of a bee in my bonnet and wanted to write almost an embedded data catalog. It was just annoying to me: at WePay, we were trying to install a data catalog for our data infrastructure team, our engineering team. And it was like, step one is install Neo4j and MySQL and Kafka and Elasticsearch. And I was like, oh man,
Starting point is 00:02:22 like this is going to take us like three months to get our SREs to get this footprint up. And all we really wanted was like basically key value schema information. So that eventually morphed into something called ReCap, which is sort of a type system that's meant to model both, you know, online, nearline and offline data schemas. And I got a number of companies
Starting point is 00:02:42 that are kind of using that now. One of them is like a data contract company. Another one is doing something more in like event pipeline. So that combination of investing, inviting a little bit of coding and writing is what I'm up to now. Amazing. And also like writing hot takes about the future of the data space. Yeah. Yeah. I try to write stuff that I am thinking about. I got to be careful. I, you know, maintain control of myself. I tend to get riled up. But yeah, I definitely have opinions. Good. And we're gonna we're gonna do our best between Tim and I to try and extract those from you. I you know, one of the things I'd love to just kick it off is the state of data
Starting point is 00:03:23 today. When I look at it, we've gone from this world where, you know, early 2000, you've got MySQL, mid-2000, you've got Hadoop, you've got maybe some Postgres. And the data world in terms of tooling and infrastructure is very small. And over the last 10 to 15 years of this explosion into all of these different data tools. And you wrote this blog post recently about how you basically see the data stack disassembling and all these different components. Can you help Tim and I understand your view? What do you think is going on and why? And what's that look like? What's happened and why it's happening? So actually, the current decomposition of data systems is actually something that I think has been going on for a long time.
Starting point is 00:04:07 And I kind of tried to trace the history of that in my blog post. You mentioned Hadoop. And to me, one of the starting points for a lot of this is going back to Hadoop. Pre-Hadoop and pre-MapReduce databases, I think, were in general a little more integrated. So you had your Teradata's and your Vertica's and stuff like that. And then with MapReduce, you started to separate compute and data, right? You had MapReduce and then you had HDFS. And then things kind of just kept evolving from there. And so you started to evolve a data catalog like the Hive Metastore. And then you started to involve query engine on top of MapReduce. So you had Hive or Pig or systems like that.
Starting point is 00:04:45 And so once you started from the MapReduce building block, which the instigation there was big data, right? Like essentially, hey, we need a way to process large amounts of data. This is going to involve an architectural change. And so that's where MapReduce came from. But then it turns out everyone likes SQL. And so you end up basically rebuilding
Starting point is 00:05:02 the database on top of this. But because I think Hadoop had already, it was A, open source, and B, already had kind of disaggregated compute and storage, things kind of evolved from there. So that's sort of like one path that I kind of just traced through. And you started seeing storage formats and Parquet and stuff. Everything was decomposed in the offline data or admin world. The second thread that I kind of was thinking, I've been thinking about more recently is the evolution of Postgres. Historically, Postgres kind of got beaten up for performance reasons. And I think it just lacks some features that I
Starting point is 00:05:35 think MySQL had. I think it was a little bit late to the game with replication protocol and CDC and stuff. But a lot of that's been solved, right? It's got a really robust extension system, right? And so again, you can start to see where, okay, you can plug stuff in. And so people are starting to build Postgres extensions for OLAP, for graph database, for vector search, for GIS. So on the query engine side, you've got Postgres there. And then underneath that, you have people starting to play with the storage layer of it as well. I just love talking about Neon. I find their architecture so fascinating. But Neon is this open-source Postgres project that ripped out Postgres' storage engine
Starting point is 00:06:12 and replaced it with a distributed write-ahead log that's backed by object storage. You can start to see the disaggregation in the Postgres world as well. I think, again, it's some combination of scalability, like Neon's bottomless storage, and also flexibility in terms of data processing, different kinds of queries. I think a lot of people just view Postgres as a pluggable query engine that they can just leverage to focus only on the area
Starting point is 00:06:40 where they can provide the most value. So like, what is it, Hydra, the OLAP system, or ParadeDB is doing OLAP and Elasticsearch kind of killing competitor kind of stuff. I think they kind of view it as a jumping off point for, you know, being able to not have to rebuild an entire database, right? I think the third strand, I didn't really write,
Starting point is 00:06:59 I guess I wrote about it in the blog post, but I didn't really tie it into these two narratives, is just what's happening in the Arrow ecosystem. And to me, Arrow is sort of like the modern Hadoop. Like when I look at the ecosystem that's being built around Arrow, it's sort of like, you know, stemming from Pandas and machine learning and integration there. But then they've started to just layer on more and more really interesting projects, cross-language compatibility and data fusion, which is this decomposable query engine. And then they've got obviously Arrow, the in-memory storage format. And so I think it's driven initially by the sort of data frame, PySpark-ish kind of use cases. But once you have these
Starting point is 00:07:42 building blocks, you can imagine building all kinds of really interesting databases. You know, I mentioned Arrow and Data Fusion, but it goes beyond this, right? There's Velox from Facebook, which is sort of a execution engine. So it's the bottom layer of a query engine, which is a thing that's meant to take sort of logical query plans and actually translate them into physical query plans and execute them on a runtime like Spark or Flink or Presto or whatever. Then you have Substrate, which is like essentially a way to model
Starting point is 00:08:09 these logical query plans. Point being, there are a whole bunch of projects in this space, but it kind of looks to me like Hadoop circa, you know, 2009, 2010, where there's this sort of Cambrian explosion of like all this new stuff. In a way, I kind of look at it as like rebuilding a lot of the Hadoop ecosystem for the modern era.
Starting point is 00:08:27 And so I think those three things are all sort of inform how I look at the decomposition. I mean, that's a great explanation about why this is happening. Immediate question I have. Do we think there's another consolidation phase where we kind of end up with a new stack that you can describe with four letters? And it's kind of like this Cambridge explosion followed by natural selection. We pick the best for the task. That results in a deep
Starting point is 00:08:51 consolidation. We end up back with a new stack. Or are we entering a world where, you know, a good example is the Postgres ecosystem where we have this common spire, right? We have this common core that you can kind of plug building blocks in Node.gov and it's actually easy to swap stuff. And we're just trying to find which are the best pieces, and then we'll forget the other ones?
Starting point is 00:09:13 Or is there actually a world where we can have this big, beautiful ecosystem of lots of plugins? It kind of looks more like the NPM, like JavaScript's NPM, or Python's PIP, like package ecosystem. Or is it really like there's just not that many use cases? We just want to figure out which is best. Yeah. Oh, boy. Okay, So you hit the nail on the head. My most recent blog post is kind of thinking through some of this. And then I have another blog post, which I have not published, which is comparing and contrasting. If I were writing a database, would I write it as a Postgres extension or would I write it as a Postgres compatible protocol? If I unpack your question a little bit, it's sort of like, is Postgres going to be like the OLTP platform that everyone
Starting point is 00:09:49 builds off of and they use query engines and they kind of like modify it? Or is it going to be the case that, you know, we're going to have a bunch of different systems? And I think in that world, what's likely is that everyone will congregate around the Postgres protocols. And by Postgres protocol, I actually mean three things. I mean, like the Postgres run-in protocol, which is like super simple, the SQL dialect, i.e. Postgres SQL, and then the backend replication protocol, the streaming replication protocol that they have.
Starting point is 00:10:17 Those are kind of the three things that I think, if you have compatibility on, you know, pick any two, like you're probably going to be able to slot into most ecosystems pretty well. And I think if you look at systems like Cedar DB, which is this new HTAP PG compatible system, like they have a very nice story around like, oh, if you want to use us, like we're very easy to drop in and you can drop us in and for only analytics first. And then as you get comfortable, you can start using this for OLTP workloads and transactionality as well, if you like, or you don't have to, right? So the question is, are we going to start to unify around a query engine or some piece of
Starting point is 00:10:53 infrastructure? And I think the one in my mind that would be most likely would be the Postgres query engine, right? Having talked to a number of people that have built actual PG extensions, I very much initially came down on the side of, I think extensions are pretty interesting and like a nice way to start things off. After talking to people, I've definitely come around to the viewpoint that I think the future is most likely going to be Postgres compatible replication and front ends and stuff, and that the databases will probably be built from scratch. The reasoning there is sort of twofold. One is eventually you want to get more control over
Starting point is 00:11:31 Postgres than it's going to give you from the extension library. So you're going to start to get straight jacketed in. And so you're going to have to start essentially moving off of Postgres or forking it. And so if you look at Neon, actually, they had to fork Postgres because in order to make the wall API remote, they had to change it. And they're saying they're hoping they're going to be able to get the patch merge back in. I actually pulled this closer. This is from a blog post a year or two ago.
Starting point is 00:11:52 I don't know if they did manage to get it back in or not. I haven't checked. I just view that as one sample of like, okay, they had to fork it. The other thing that I find kind of interesting is I have thought that from an operational standpoint, it would be appealing to be able to just run Postgres and then install these extensions and operationally that might be a little bit easier. But I think in practice, that's probably not true. For one thing, like running Postgres as an OLAP database is like just fundamentally very different from running it as a OLTP database. And this is sort of like the common criticism of HTAP.
Starting point is 00:12:26 So it's unlikely that you're going to be running one Postgres that is both OLAP and OLTP or GIS or whatever. As you scale that workload, you're probably going to end up having to shard or partition your Postgres by their workload. And so it's like, okay, so your operational footprint is already going to look different. The operational characteristics are already going to look different. How much more of a leap is it to just install the purpose-built thing that is Postgres compatible
Starting point is 00:12:52 and can like replicate off of the OLTP database and handle GIS workloads or vector search or whatever it is. And so I've come around more and more to believing that the future is probably Postgres-compatible protocol systems that are built from the ground up. And I think another, you know, sort of tailwind for that is that it's much easier to build databases now than it was 15 years ago. Like,
Starting point is 00:13:15 there's all these building blocks that you can grab and build out. There's RocksDB, and there's, you know, Valox, and all the stuff that I've mentioned that makes it not only easier to build databases, but makes it easier to build trustworthy databases that are going to work. So my opinion is probably the future is a bunch of different Postgres-compatible databases that all plug into each other over this common API and protocol. So I'm actually quite curious about how usually a data system gets widely adopted. Because I think over time, you worked at LinkedIn. LinkedIn has a bunch of teams working on a lot of systems that satisfy LinkedIn's need.
Starting point is 00:13:55 Looking at your background, started working at LinkedIn and went to WePay, I'm sure you see a big difference in terms of the culture and how people do things. I saw the Hadoop adoption, how it happened. You know, we saw Spark, you know, we also saw Kafka. A lot of things usually start with internal projects and then gets spreaded through almost like this very interesting religious moments of a content, a blog post, and then a bunch of people shouting this is the future. And somehow it got adopted. I'm actually very curious because I think to talk about what becomes the new system, I'm also very curious, have you seen a trending of how people adopt and evangelize data systems have changed or not, or just been pretty much the same. I just wonder how you view the next explosion, because either it's Postgres, you know, being a front end or not, or, you know, there's so much vendors now that are all fighting to
Starting point is 00:14:57 become the next future. Who do you think will be the first users of one of those ecosystems? And do you see the same Hadoop type of religious moments happen the same way of how people talk about the data fusion era type of stuff? Or we don't know yet. Maybe this is my take on how things will take off. Yeah, that's a really good question. So I think the Hadoop adoption story was one of necessity, right? Essentially, Hadoop was not anybody's first choice. Like at LinkedIn, when I joined, one of the main issues we were having in the data side of things
Starting point is 00:15:37 was around this people you may know algorithm, which was this little recommendation system we had of who you should follow. And the most important signal in that was this thing we called triangle closing, which is essentially friends of friends. So it turns out to compute friends of friends, you have to do a self-join on the connections to get a second degree out. That's an insanely expensive operation on a traditional database. So when I joined, we were doing this on Oracle. And it got to the point where we were
Starting point is 00:16:05 just running the query and it would just never return. Days would go by and it would just not return. We then tried Greenplum. We tried AsterData. We tried this weird OSGI in-memory thing. So we literally tried four other things before we tried Hadoop. None of them worked. Literally none of them worked in any reliable way. So we landed on Hadoop because nothing worked. And the big aha moment for me is like, I wrote the code and we ran it and I got a percent bar, percent complete, like the number of map reduced tasks. I could see the progress because with all these other relational databases, it's just like you run the query and you have no idea. Like, is it going to finish now or in 10 weeks? Like, I don't know. It's just doing stuff, right?
Starting point is 00:16:49 And so it ran, I could see the percent, I could go get a cup of coffee and see it progress and I could see, okay, it's going to finish. And like, you know, unless it runs out of memory, it's going to finish. And it did. And so we adopted it. And, you know, fast forwarding to now,
Starting point is 00:17:02 scale is, I won't say it's solved, but it's much more well understood. So the growth story, I think scale is almost table stakes. If you're going to compete on the OLTP world, you're going up against TIDB and Yuggabyte and Spanner and AuroraDB and CockroachDB. They all scale, right? They all scale on the OLTP side. And if you go on the data warehousing side, and same story with BigQuery and Snowflake,
Starting point is 00:17:27 and now you've got data lakes. So on the one hand, scalability is pretty much table stakes. On the other hand, you have the DuckDB crew singing the you-don't-need scale, like you're going to be fine with a single query engine running locally on your desktop, and we're going to just do some basic push down into Parquet. That story, I think, is more or less done.
Starting point is 00:17:48 So in terms of differentiation and how these vendors are going to compete, I have three different ideas. This is something I just wrote about on how vendors could compete. I think one of them is building platforms. And I think the second is building more verticalized, very hyper-specific databases. And I think the third is finally cracking the HTAP nut or the multi-model database, which is something that has been, you know, sort of the panacea that never came to be. And so on the platform side of things, the way I see this playing out is essentially just that
Starting point is 00:18:22 the database is a commodity. Everybody has all the same features. They're built off the same open source libraries. There's no real exciting difference between them. But the surrounding ecosystem around managing that database, managing database schema migrations, indexes, query optimization, snapshotting, CDC, data integration, all that kind of stuff is as or more important than the database technology you pick. And a lot of that still is like this hodgepodge of like, well, I'm going to run to BZM and I'm going to have Flink in there. And for database migrations, who knows what,
Starting point is 00:18:54 are you doing a Lumbic or like, it's just a complete mess. There's also really interesting work going on for new features like forking and branching that companies like PlanetScale are doing. And so I think one avenue as a vendor vendor you could go down is this platform route, right? I'm not going to make the claim that these three things are mutually exclusive. In fact, I think the winners will probably adopt more than one of these. So in the vertical database world, or what Tiger Beetle calls the domain-specific database, I'm sure that's probably been around for a while. But in any case,
Starting point is 00:19:25 these are essentially databases that are built for a very specific use case. And the two I cite often are Nile, which is this new database that's built for SaaS providers. So it understands the idea that, hey, you are a SaaS provider, you have a bunch of different customers or tenants, and they all have different needs. And so they kind of can build stuff into the database that makes your life easier. Disclaimer, I have investments both in Nile and Tiger Beetle. Tiger Beetle is a financial database that's purpose-built for essentially double entry bookkeeping. They don't have a generic relational data model. They've got like, this is a ledger. You have credits and debits. And they're very opinionated about what that looks like. And then they're very hyper-focused on
Starting point is 00:20:08 transactionality and consistency, which coming from my prior FinTech experience, we had to build essentially the same system and it was a complete nightmare to get right. It's very hard to do that. So being able to pull that off the shelf is super interesting. Point being, there's probably a lot of room for other domain-specific databases in other know, healthcare and biotech and on and on and on. And then the third way I think database vendors can really kind of differentiate is around this HTAP stuff and multi-model databases. In the world I previously described, you say, okay, I had Elasticsearch and I had Neo4j and I had Postgres and I had BigQuery. Now I have Postgres, I have BigQuery with a PG-compatible protocol. I have Elasticsearch
Starting point is 00:20:55 with a PG-compatible protocol. I have Neo4j with a PG-compatible protocol. I'm still running four completely different systems. That's incrementally better And like the data integration story is better. And, you know, operationally, it still kind of sucks. What I would really like is to have fewer systems, right? That I can all operate in a little bit of an easier way. And so I think if vendors can figure out how to make HTAP or multi-model systems work, then the architecture fundamentally changes for the better. Because I have fewer systems, I have less operational overhead, less cost, less complexity. And I think the if in that sentence is doing a lot of the heavy lifting there. But I'm somewhat optimistic just due to the way cloud storage is rolling out. We've got NVMe and we've got all these
Starting point is 00:21:40 three engine systems and stuff. I actually do think it's possible. And again, citing CedarDB and Umbra, which is the sort of academic system that they are built off of or came from, there's signal, like there's signal that this looks like it could be possible this time around. And if that's the case, like whoever nails that is gonna be a real big winner.
Starting point is 00:21:57 And so you can start to imagine, well, like, okay, you know, we've got some vendors that are going after multimodal, there's a single store and Unistore and so on. We've got some vendors that are going after multimodal. This is single store and unit store and so on. We've got some vendors that are going after verticals. And then eventually they're going to inevitably have to start building platforms. And you can already see some of this with like Nile, where they're building platform-like features around forking of schemas and slowly rolling out schema changes across tenants and so on. And so I think it's likely that you'll see some spillover.
Starting point is 00:22:23 But I think that's how vendors are going to have to compete is on the surrounding stuff. I'm really curious to dig in. That was a master class, by the way. But one of the things you just said that I'm curious to dig in on is you talked a little bit about the object store. I think this is something Tim and I have talked to multiple people about is why the future of data is object stores. I'm curious to get your take and how disruptive you think this is in the context of, you know, we had WarpStream on
Starting point is 00:22:49 and they're rebuilding Kafka on top of the object store. Like, how do you think the object store accelerates both this sort of a Cambrian explosion in many ways or the rebuilding of the data ecosystem to the default being, you know, an S3 bucket or a GCS bucket?
Starting point is 00:23:04 And the other part of it is, what does that do in terms of the way we build our data ecosystems? Yeah, I've been thinking about this quite a bit too. So I have two thoughts. The first one is, yes, I'm definitely a huge proponent of object store as primary store.
Starting point is 00:23:21 And I think there's sort of two ways that that pans out. One is like literally all your data is stored in the object store and the software is stateless. Otherwise, the classic example is something like WarpStream, which you mentioned, which caveat, I do have some money in WarpStream as well. And their agents are literally stateless and all the data goes into the object store, right? S3 or GCS or whatever. The other pattern is something more like Neon, where the object store is sort of primary storage, but there's this write-ahead log. And it's not part of the object storage, but it gives you multi-region
Starting point is 00:23:56 transactionality and consistency and lower latency and stuff. I think it's likely over the long term that even the write-ahead logs with stuff like S3 Express, especially as that goes multi-region, which I'm sure it will, that the walls will move into the object storage as well. And then the main prohibitive issue there is just going to be cost. I'm not sure how that's going to play out. There may still be some requirement for caching outside of the object store. My hope is that competition between, you know,
Starting point is 00:24:25 GCS, S3, and so on helps keep costs reasonably low. Maybe not, maybe not, right? We'll see. So I'm not sure whether we're going to continue to need off-object storage caching in order to mitigate the cost issues. It probably depends a bit on use case. This is somewhat of a tangent,
Starting point is 00:24:42 but I've been kind of noodling about building Redis on top of S3 and what that would look like. And I was doing some back of the envelope math about, okay, how much would this cost if I'm doing a bunch of key value get puts on S3 Express? And at low volume, it's reasonable and at high volume, it gets more pricey. But yeah, I think it's a big deal using object storage. The big deal for me really is that it, out of the box, gives you super high durability, multi-region replication, essentially for nothing. And those are the hardest things to get right with one building system,
Starting point is 00:25:19 is getting multi-region replication, getting high durability. To build the consensus algorithm and the replication protocol and like all the stuff you need to do is just a nightmare. And so moving into a world where we don't really have to think about that and it's just sort of commoditized by the object store is like a huge, huge deal. This is going back like 15, 20 years, but I have this vivid memory where I was working on SAMSA at the time, the stream processing system. And I was sitting next to Jay Kreps, who's the original author of Kafka, and he
Starting point is 00:25:47 turned to me at one point and he was just like, if we had a distributed file system that actually worked, we wouldn't need to do any of this. I was like, wow, that's mind-blowing. And I think while we're not there yet with S3 and stuff, we're getting
Starting point is 00:26:03 closer and closer and closer. You can start to imagine a world where you get S3 Express stuff, we're getting closer and closer and closer, right? You can start to imagine a world where, you know, you get S3 Express, you start getting multi-region S3 Express, and maybe you get optimistic concurrency control on rights, which is something GCS can provide that S3 doesn't. Like, you can start to get really, really close to, like, what you would need to do
Starting point is 00:26:19 a lot of highly consistent operations. And so I just feel like the architectural advantage of not having to worry about any of that stuff is just astronomically high. So I'm a big proponent of it. With respect to your second question about what is the implication for the sort of data space writ large, I don't have a good answer, but something I've been thinking a lot about is if I imagine a world in which all of our primary store data is living in object storage, what does that do for the data integration story? Let's imagine your MySQL slash Postgres data is in S3. Let's imagine that your streaming data is
Starting point is 00:26:58 in S3. Let's imagine that you've got your data lake there with your data warehouse. I think one thing that it definitely does in my mind is it definitely adds sort of a challenge for the Kafka world, because Kafka, I think, by and large, one of its major use cases is data integration. And the idea is, hey, we're going to tie together all these disparate systems and move the data from one to the other.
Starting point is 00:27:19 And if all the data is already sitting in S3, it's like, well, what does the data integration story look like there? It probably looks something more like Flink or Spark, like reading these files and transforming them and loading them into the data lake or whatever. I'm not sure what it does for the data integration streaming story that we've kind of had for the past 10 years or so with Kafka.
Starting point is 00:27:38 I don't have a clear vision on what that looks like because on the one hand, it seems very natural that you would want to do some kind of utl on top of s3 to load data from your primary stores into your data lake and so on or in your search engine or what have you but each of these systems is going to have its own storage format and not only that they're also going to have their own like directory structure and semantics about when the files are going to get deleted and when you it's safe to read the files and so on right and so it's not clear to me how that's going to shake out.
Starting point is 00:28:05 Like, is everyone going to be reading these raw files or is there going to be some like agreed upon format? Maybe the format is just the data lake. And so everyone shovels their data and integrates with, you know, Iceberg or Delta Lake or whatever. I'm not sure what that looks like, but it seems to me no matter how it shakes out
Starting point is 00:28:22 that it poses a challenge for the streaming data integration story. So let's jump into what we call the spicy future. I think you already alluded to a lot of possible future states in our last two conversations of sorts. But I think to make it even more interesting, take a stance. What do you believe it should happen in this data system world in the next sort of like, let's just pick three to five years.
Starting point is 00:28:55 Like you mentioned a bunch of object storage and error fusion, there's a lot of stuff. What do you believe should become the norm in data systems? I think I'm going to make some compromise between my dream world, which is just one Postgres system that does everything, and we figured out multi-model and HDAP on one system. I think the more realistic and likely scenario is that we will shrink our operational footprint somewhat by having a bunch of different
Starting point is 00:29:26 Postgres-compatible databases that serve the standard use cases that we have, Vector Search and, you know, Text Search and GIS and all the stuff I've mentioned. And there will be fewer than there are now because some of those workloads will get folded in.
Starting point is 00:29:42 You know, PG may be able to service VectorSearch. I think there's a very good article, I forget who it was from, maybe the OtterToon folks, about whether VectorSearch is just a feature or a database. So some of them could get folded into Postgres. Some of them could get folded into the data warehouse. And so I think we'll see some slight shrinking in terms of the operational footprint
Starting point is 00:30:01 of the data systems that we run. I think they'll all be stitched together by Postgres and Postgres-compatible protocols. I think you can probably swap MySQL and Postgres protocol interchangeably in that statement, but I think most light Postgres seems to be winning out. And then I think these systems will, for the most part, be built on top of object storage. And again, most of them will probably use some form of caching layer on top of the object storage, but a few will not, right? So I think two that definitely aren't right now, I can cite examples, are TurboPuffer, which is this vector search serverless system, straight to object store.
Starting point is 00:30:34 And I think WarpStream is the other one. And so I think that's what the database world will look like. And then I think on the data integration side, we're going to start seeing a lot more Spark, Flink, and related systems. I think there's some new up-and-coming ones that are going to be servicing sort of the data integration, data lakehouse, maintenance side of things. So that's kind of my take on how I think things will shake out. How do you think the evolution of LLMs play in the data space? Because I mean, it's interesting, right? Like you alluded a little bit to vector store collapsing in. A lot of the space has been in discussion around LLMs
Starting point is 00:31:12 from a data place. It's been like, how do I get data into the vector store? How do you think about evolution? Because it is interesting, and I'm also curious, how does moving data to the edge play into this? I guess if the object store is there, then the object store will get the edge. You run this with the edge.
Starting point is 00:31:31 I'm curious, have you given any of that thought where that kind of fits? Do you have a take on it? Oh, boy. So edge is a whole other world that we can talk about. On the LLM front, I will just say flat out, I am probably not qualified to talk a lot about AI and the future of LLMs. I have somewhat deliberately stayed away from it because I just think it's a lot right now. It's very chaotic, and I just haven't put in the effort to really track it. On the vector search and sort of RAG prime, this is just anecdotal. I saw this company, OneGrep, that's essentially doing RAG search on your code base,
Starting point is 00:32:04 and they're providing the ability to ask questions about your code. But one of the things I noticed is that they're not just dumping your code into the vector search stuff. They're doing a whole bunch of pre-processing first. And they're parsing the AST, and they're doing documentation generation before they put it into the RAG engine. And so there's kind of this almost ETL-ish looking thing, this transformation thing that's happening before it even goes into RAG. And so there's kind of this almost ETL-ish looking thing, this transformation thing that's happening before it even goes into RAG. And so that's something that kind of struck me as interesting that I think is kind of related to what you're asking about how data is getting into these systems. But I don't, I just, I'm not qualified to give you a good answer. So I'm going
Starting point is 00:32:37 to back away from that one. On the edge and embedded world, very interesting. So the object store stuff, I think helps a lot with that because there's buckets in every region and you can ship stuff around. And again, going back to Nile, this is something they offer where at the tenant level, you can be like, to use DoorDash as an example, not a customer of Nile as far as I know, but if the restaurant is in Oakland, then they can place the tenant's information in the bucket and, you know, US West one or whatever. That's helpful. I think as you go beyond the server and object store into client side, that's where it gets even more interesting. And the area that I've kind of
Starting point is 00:33:17 been poking at there recently is around CRDT and how you manage to give the developers a reasonable experience when they're developing with, I don't want to say local first, but just areas where you're spanning client and server. And then it's like a whole bunch of different takes on this. But I think that's something that we need to figure out sort of collectively in industry, because my experience with databases in the past has been that developer ergonomics matter a lot. And, you know, going back to, you know, the Voldemort key value store that we built at LinkedIn, which was this Dynamo, not DynamoDB, but Dynamo clone, which is sort of this eventually consistent kind of
Starting point is 00:33:49 database that used vector clocks. We actually exposed the vector clocks to the application engineers. They were just all like, oh, what's this? Ignore. And so that was a lesson that kind of left an indelible mark on me. It was like, yeah, you kind of can't do that. It has to work well. And so when I look at CRDT, a lot of what I see is like, oh boy, this is like super complicated for the application engineers to think through. And basically the moment you ask them to think about anything beyond last ride or wins, they just are like, yeah, it's not going to work. So I think Edge is a really interesting space. I think we really need to figure it out, but I don't know that we've necessarily cracked it yet. I think the other thing I would say about Edge is that there's just
Starting point is 00:34:27 getting back to sort of the three ways vendors can make money and the verticalized approach. I think there's a lot of different ways to tackle this. I think if you look at the company like Ditto, which a friend of mine just joined as their VP of engineering, they're an Edge company, but they're hyper-focused on payments. And so their product is not just the Edge database, but also this networking layer that's designed to create a mesh and use Bluetooth low energy and handle these really spotty internet things. And so they were purpose-built for this low connectivity environment. And I think if you look at other Edge databases, they're built for reactivity. And so they're thinking more about, well, I have a React front end. And so every time
Starting point is 00:35:04 somebody changes something, it needs to propagate out and update the UI and stuff. So I think these hyper-specific verticalized edge databases are a thing that are going to be increasingly so. I think there are also generalized edge databases that are trying to figure this out in more detail, more generalizable. I think the one there is Terzo, Electric SQL, these kind of companies. Caveat, I have a little bit of money in Electric SQL, but I think they're a little more general purpose and they're trying to provide semantics that feel more ergonomic. So Electric SQL has done all this work around TCC Plus and trying to like, as best as possible, make developers feel comfortable with running these local first databases but it's a
Starting point is 00:35:45 challenge i'm rambling now i will stop super fun well chris i think maybe the last question before you go talk about what's next for you personally because you've been doing investing advising i guess is this like your full-time gig maybe you don't know yet but i'm just curious too like what is next for you yeah boy uh i don't have a good answer to'm just curious too, like what is next for you? Yeah, boy, I don't have a good answer to that question. I thought I would get bored by now and want to go get a job. I kind of explored the VC space a while back and was looking at like all these different approaches to venture capital, joining a firm, starting a fund, doing a rolling fund, doing a scout fund. And like none of those really suit what I wanted to do so i kind of backed off
Starting point is 00:36:26 that and so i'm sort of just in exploration mode right now i guess for lack of a better term and it's unbound i'm fortunate it's unbounded i'm sort of pulling at the threads i want to pull at talking to people i want to talk to it's a lot of fun it's been a great conversations being able to connect people and talk with different people and i'm learning learning a lot. And so, I don't know, I guess I'm just sort of opportunistically and greedily grabbing at whatever pops up that's interesting. So it's not a good answer, but it's sort of like, you know, Ronan wandering the hillside, masterless samurai. That's when we know you're going to be on a great path of something you've never seen and never heard before. That is actually even more exciting, right? Cool. Well, how do people subscribe to Materialize View and how do people find you on social media?
Starting point is 00:37:15 Yeah, materializeview.io is the newsletter. I'm on Twitter at C and then my last name, Cricomini. On Blue Sky, I am Chris.blue. Those are probably the three best ways to get ahold of me. My DMs are open. I feed a chat about all interesting things and I appreciate you guys having me on. This is a great way to kick the morning off for me on a cloudy Thursday. Well, we are so happy to have you. This is so much fun. And I think we're all kindred spirits of interest in what's happening here. i learned a lot but also i uh i think this is a great conversation a lot of people are learning a lot of good stuff and look forward to having you back so i'm sure we're gonna get you know as the space evolves we're gonna come back at this over and over again
Starting point is 00:37:58 and be like let's talk about this new thing that happened you know so it was a great time yeah awesome appreciate it
