The Infra Pod - The Materialized View infra influencer is back again! Chat with Chris Riccomini

Starting point is 00:00:00 Well, welcome to the Infra Pod. This is Tim from Essence and Ian, let's go. This is Ian Livingston. Super excited. We have a special guest today. Chris Riccomini, author of many things, inventor of some things, and general-purpose smart dude about infrastructure. Chris, how are you doing?

Starting point is 00:00:24 How excited are you to chat about all your spicy takes today? I'm doing well. I've got my coffee here and I'm ingesting as quickly as I can. Yeah, I'm excited. Ready to give you my spicy takes on all things. I think just in the pre-roll, we were doing all kinds of ridiculous stuff.

Starting point is 00:00:43 What did we cover? Durable execution, BYOC, S3, clones. Yeah, the list was long, so let's do it. Yeah, Tim and I were talking how excited we were to have you back, because every time we talk, it just is like this really awesome conversation. You know, one of the places I wanted to start was you wrote this blog post.

Starting point is 00:01:01 It was now back in September, which by the way, is when we originally reached out to get you back on. But you're so busy, so hard to book. But you were writing about WarpStream, BYOC, Confluent, and what that represented. And this is such a hot topic in infrastructure. Generally speaking, if you're a company building infra to sell to other people, there's this huge discussion about, well, how do I package this infrastructure up and make it possible for you to consume it? And the answers to this question are all over the place.

Starting point is 00:01:30 So give us a little like thought process on your view of BYOC. And we'd love to dive into this amongst all the other topics you mentioned. Yeah. So on the BYOC front, it really resonated for me as a use case going back to when I was at WePay, which is a payment processor that I worked at circa 5-10 years ago. And payment processors in general and fintech tend to be very security focused. And so the idea that you could store your data in your cloud, but still get some of the advantages of a traditional SaaS, traditional,

Starting point is 00:02:08 I guess what we would call a traditional SaaS company, seemed really appealing from a security perspective. Then again, while I was at WePay, from an operations perspective, being able to run stuff in object storage the way that WarpStream, I would say, pioneered, seemed really appealing as well. We had this graph database that we had built, and we built it the same way we built all our other distributed infrastructure, which was, you know, like

Starting point is 00:02:33 traditional sort of semi-tightly coupled compute and storage. And the ops guys essentially said no, they weren't going to run that. And, you know, this was coming off of us giving them a distributed ledger, sort of a write-ahead log kind of thing to manage credits and debits, which they were struggling to run. It was this kind of quorum-based Kafka-esque thing. So they said, no, we're not going to run this graph database thing. You have to store the data in object storage.

Starting point is 00:02:57 And I was like, oh, that's really interesting at the time. So those two things were kind of rattling around in my head. And then I came across WarpStream and that's where everything clicked. And I was like, oh, this is like, obviously the right thing to do. I wrote a lot about that. And I think, you know, time has shown that it's definitely a competitive advantage to be able to go into an organization and say, hey, you know, you can run on your own infrastructure and then we'll manage the control plane for you.

Starting point is 00:03:25 You get sort of a SaaS feel. And what's really enabled that in my opinion is this separation of compute and storage that allows the agents to be pretty stateless. And so when Confluent bought them, I was initially like, really, this looks like Freight to me, which they had just launched. And it also kind of looked like Kora to me.

Starting point is 00:03:44 The messaging at the time was no, no, WarpStream is for BYOC, right? So yeah, they were saying we bought this for the BYOC advantages and I was initially skeptical of that. The more I thought through it, the more I was like, okay, I could actually buy this. I think over the long run Freight probably is the same thing as WarpStream, and you know, they should probably just get rid of Freight and replace it with WarpStream. The sort of official party line for them

Starting point is 00:04:09 with respect to Kora was, hey, you know, we need low latency stuff. So we're going to have Kora be kind of the low latency Kafka thing and WarpStream be the BYOC Kafka thing. So that was sort of part of the BYOC excitement for me. And I think the other part was I was talking with the Daft guys who are working on sort of a Spark query engine replacement.

Starting point is 00:04:32 And the fascinating thing talking to them was they were and are working on making it work embedded. So you can take this library that is a distributed, Spark-looking thing, but then you can also run it embedded in your own process like DuckDB, right? And to be able to scale up and scale down like that was something that kind of struck me as really interesting

Starting point is 00:05:00 and it makes it quite portable. And so those were kind of two things floating around in my head. It's like you need to be able to run everywhere. So you need the Confluent approach. You need to be able to do SaaS, and you need to be able to do BYOC, and you need to be able to do on-prem. And to extend that even further, you

Starting point is 00:05:17 need to be able to do in-process. And so these are all, in my mind, just different deployment models. You can deploy in-process, you can deploy in data center, you can deploy in SaaS, you can deploy in BYOC. And when I look at the modern software and infrastructure that's being built now,

Starting point is 00:05:34 it seems to me that for a lot of these pieces of infrastructure, the eventual end state is going to be systems that can scale up and down and be deployed sort of everywhere. And so I think WarpStream and Daft are both sort of early signals about where we're headed. I think if you look at it from the opposite direction, what MotherDuck is trying to do with DuckDB is the same thing, where they're taking something that was embedded and they're saying, okay, well now we're going to tier it and make it cloud-based and it's going to run on your client side,

Starting point is 00:06:02 but it's also going to run in our centralized servers and so on. So I think that deployment flexibility is just becoming really critical in order to compete and get into the various enterprises that are buying. You know, I always think of this as a duality here, to be honest, between startups and early stage companies and what they need to do from a packaging perspective in order to have buyers. And then what late stage infrastructure companies can do. A good example is if you go and talk to Databricks, which originally kind of was a BYOC

Starting point is 00:06:35 deploy in your own cloud, now they want you to run inside their cloud. And their whole argument is, well, we can be faster and cheaper this way, right? And so there's also this duality. I think the other component is what this also creates for these infrastructure companies, and this challenge is: how do I build the infrastructure to actually be able to deploy and manage it inside your cloud?

Starting point is 00:06:55 There isn't an operating system for these patterns today. I'm really interested to get your perspective on that duality and also where you see this going. And so, you know, on the BYOC front with Databricks in particular, and the argument that like, hey, it's going to be cheaper if we run it in cloud. And I think there have been a number of companies that have said like, if we were doing this from scratch, we would have just done purely SaaS and not BYOC. And I think the reality is some vendors are going to want SaaS

Starting point is 00:07:38 and some are going to want BYOC. And the most comical version of this is like, I think Jack Vanlightly wrote a pretty good blog post where he kind of described, before they bought WarpStream, why BYOC was maybe not the best in all situations and you can do cost sharing and blah, blah, blah, right? And then they turned around and bought WarpStream and then the blog post was like,

Starting point is 00:07:58 well, BYOC is pretty amazing and you can do all this great stuff. And I think he's right in both blog posts, right? So there are certainly situations where you're small and it's pretty nice to be able to run in your own cloud so that you can take advantage of those credits, right? So like there are arguments, I think, for both and I think there are arguments for both at all sizes, right? So if you're a huge enterprise, you may want to run in SaaS or you may want to run BYOC for various reasons. If you're a little tiny startup, you may want to run SaaS or you may want to run BYOC for various reasons. And so I don't think there's a right answer.

Starting point is 00:08:41 I think, again, as a vendor, you're going to need to support both. Right. And then when it comes to the second point, this was around sort of the operating system for BYOC and stuff. Right. Like it's really early days. I think the self-serving answer would be, I've been talking a lot with John Morehouse over at Nuon, which is a company that I invested in. And I think up to now, what companies like Nuon and Omnistrate and, is it Replicate? I think another one is Replicated. They've been focusing a lot on deployment. And I think that's fairly commoditized. Like a lot of people that I talk,

Starting point is 00:09:16 vendors that I talk to, like they don't really care that much. They either have an AMI or they've got like a Docker image or a Terraform script. And they kind of give that to the customers and it's good enough. What's really missing right now is actually, John calls it day-two operations stuff,

Starting point is 00:09:32 but essentially it's like the ability as a vendor to get visibility into your customers' deployments so that you can, you know, in the control plane, make the right decisions and debug things. And so an example would be, if you're BYOC deployed ... information. Like, there's not really a good way to do that right now. The customers themselves have, you know, Datadog or, you know, some Prometheus instance or whatever. But there's kind of like, there's no, to borrow from the data warehouse world, there's no

Starting point is 00:10:13 like data clean room where you can kind of both work together to see what the data looks like, or rather what the metrics and logs and observability stuff looks like for the deployment so that you can work on it. And so a lot of people are kind of like rolling their own, or they're doing a Zoom share where the customer is like sharing their Datadog dashboards so they can see what's going on. I think there's a lot of room for improvement there.

Starting point is 00:10:34 And that's something that the Nuon guys are working with. I think longer term, they also have a vision around sort of being a platform where there's some kind of a standard way to deploy and monitor this stuff across vendors so that you can have multiple vendors all kind of using the same tool stack, i.e. Nuon in their vision. And it makes it easier on the users or customers because they have like one tool that manages all their BYOC infrastructure kind of in a similar vein to like an AWS console or something, right? I think we're far away from that and you know, maybe it comes to fruition, maybe it doesn't,

Starting point is 00:11:08 but just nailing, you know, like tenant observability for your customers would be a pretty big win. So I don't know if that answers your question, but that's where my mind goes. Yeah, yeah. You know, and disclaimer, I'm an investor in Nuon. Oh, you are. All right. There you go. We're team Nuon. Yeah.

Starting point is 00:11:26 And so, I mean, I've been excited about BYOC, but of course I don't want to like just promote them or promote BYOC alone. Cause I think there is this contention. It's not like everyone's default choice is always BYOC. For example, Modal, one of the companies we backed, there's no on-prem deployment option. And I think for Eric, he's almost like resisting it as much as possible, for as long a time as possible.

Starting point is 00:11:51 You know, and I think like, you know, we can point to Datadog and point to some other like enterprise data type of products where like there is no BYOC option at all. And so there seems to be this choice. You know, you can either choose not to deploy in anybody's enterprise or delay it until you have to. It's like year seven or year eight. Or try to do it early. And I don't know, since you talked to some of our companies,

Starting point is 00:12:16 I have to usually have these discussions with folks. What kind of patterns have you seen? Like, hey, these are folks that should embrace BYOC much earlier, or folks that just shouldn't? Yeah, that's a good question. I think there's a couple different ways to approach it. The one that I think is most clear cut is, like, if you're a startup and you're selling into the enterprise, you know, you raised Modal as an example,

Starting point is 00:12:38 like, they're very bottoms up, you know, sort of product-led. Let's just let the users sign up and use our stuff and grow. I mean, not to say that they're not selling to the enterprise, but I think early on, their ethos was much more around like a PLG kind of sales motion. I've talked to other infrastructure companies where they're like, yep, the thing that we're building, we are selling to Fortune 1000 companies. And in that world, if you show up and you're like, I'm going to sell you my SaaS, and they're like, ah, you know, here's SOC 2, and here's all the compliance things

Starting point is 00:13:08 and go talk to the security team and, you know, the sales cycle's going to take a year. Like the BYOC approach is just way easier in that world. So I think there's a bit of a question around who your initial customers are, and you know, how you're going to sell your first, you know, X million in revenue. I also think there's a question around how reasonable it is to build for BYOC upfront. I think some infrastructure is going to lend itself a little bit easier. It's going to be a little bit easier to build infrastructure in a BYOC friendly way versus others.

Starting point is 00:13:42 Specifically with the control plane data plane split, something like WarpStream or TurboPuffer is another example. TurboPuffer is not BYOC as far as I know right now, but I think they very easily could be. That's a vector search database for those that aren't aware. But for systems like WarpStream where maybe latency is a little more tolerable and you could simplify the architecture

Starting point is 00:14:01 and truly make your agent stateless from the get-go, that's a very appealing architecture because you can run that as a SaaS solution as well yourself. And in fact, that's what they were doing prior to the acquisition, starting to work on SaaS-based hosting as well. There's other pieces of infrastructure

Starting point is 00:14:17 that just fundamentally it's much more challenging to build a BYOC solution. And in those worlds, maybe you don't want to make that initial investment upfront. If I'm remembering correctly, that was one of Jack Vanlightly's critiques of my post, was like, it is expensive to build BYOC. Actually, no, it was Ram at Nile that was commenting on this.

Starting point is 00:14:38 That it actually can be expensive to build BYOC upfront. And it's a risk if you don't have product market fit and you're investing in this. And he's working on Nile, which is like this serverless Postgres thing, and that's an example of something where building serverless Postgres on BYOC upfront is probably more challenging because, you know, to look at something like Neon's architecture, for example, which is another serverless PG offering, they've got like this whole Safekeeper and Pageserver infrastructure that they need to run

Starting point is 00:15:05 that is not, it's state, right? It's a service that is persisting data outside of object storage in order to manage transactionality and keep latency low. That's an example of a piece of infrastructure that maybe you don't want to ship as BYOC on day one. So I think there's sort of an architectural, how easy is it to do kind of question, which is why I pointed at Daft earlier. I think that's one where I'm not saying it's easy,

Starting point is 00:15:26 But the fact that it's a query engine means that to sort of get it to work in process versus out of process, you know, you're not thinking as much about running a whole bunch of microservices, you know, as a fleet inside somebody's cloud account. Yeah, yeah, I think this choice of like the cost of things, and like I definitely am sort of believing in Nuon lowering the cost enough to like get everybody to like not worry about it, right? So. Yeah, well, I guess we're maybe talking about two different things because there's, I think there's a

Starting point is 00:15:58 cost of like deploying and managing the BYOC infrastructure that you've built right which is really I think what Nuon's after. But what I'm talking about is more about just fundamental cost of building your infrastructure so that it can work with BYOC in a friendly way. Which is not so much about observability and deployment and sort of the operations aspect of it. It's more about the compute data plane split and how you manage latency and caching and transactionality.

Starting point is 00:16:22 And for some systems where that's really important and you need really low latency and high transactionality, OK, that's a much more sophisticated system. Both WarpStream and TurboPuffer that I mentioned, the transactionality and latency stuff, they've made trade-offs there. So in the case of WarpStream, they were working on Kafka transactionality.

Starting point is 00:16:38 But the latency is just a fundamental thing where they've sacrificed some amount of latency in order for a simpler architecture. In the case of TurboPuffer, you know, because it's sort of like secondary data vector search stuff, like they don't want to lose data, but like, you know, they can sort of atomic batch write into S3 in order to keep the data durable, but they don't have to think about it the way that like, again, Postgres or, you know, the ledger thing I mentioned earlier that we were working on. Just very different use cases that I think increase the cost of actually building your own software independent of how it's deployed.
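To make the trade-off above concrete, here is a minimal sketch of an agent that buffers records and writes each batch as a single object. It is only an illustration of the idea as described in the conversation, not how WarpStream or TurboPuffer actually work. `InMemoryObjectStore`, `BatchingAgent`, the key layout, and the batch size are all hypothetical names and values chosen for the example, and the in-memory batch counter stands in for coordination a real control plane would have to provide.

```python
from dataclasses import dataclass, field


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store such as S3 (not a real client)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.put_count = 0

    def put(self, key: str, body: bytes) -> None:
        # A single put is treated as all-or-nothing for the whole batch.
        self.objects[key] = body
        self.put_count += 1


@dataclass
class BatchingAgent:
    """Agent that keeps no durable state itself; durability lives in the store."""

    store: InMemoryObjectStore
    batch_size: int = 3
    buffer: list[bytes] = field(default_factory=list)
    # Assumption: a real system would get batch ordering from a control plane.
    next_batch_id: int = 0

    def append(self, record: bytes) -> bool:
        """Buffer a record; return True only once the batch holding it is stored."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Write all buffered records as one object, then clear the buffer."""
        if not self.buffer:
            return
        key = f"batches/{self.next_batch_id:08d}"
        self.store.put(key, b"\n".join(self.buffer))
        self.next_batch_id += 1
        self.buffer.clear()
```

In this sketch a record is only durable once its batch is flushed, so a caller waits longer for an acknowledgement than it would with a local disk write. That is the latency given up in exchange for an agent that can be replaced or scaled without moving data.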

Starting point is 00:17:12 Yeah, yeah, it's amazing. And so let's switch to something we've been chatting about just before this. Like, you know, we mentioned about waves, right? About like, you're seeing certain companies show up within certain time periods. And so what is the current wave that's kind of like dominating all your conversations right now when it comes to all the companies you're talking with? The data lake, Iceberg, I don't even know how to classify it. Like the catalog wave is like everywhere. And it's been fascinating to watch how quickly it's gone from like, you know, the investors wanting to find these things to invest in to all of a sudden it's like,

Starting point is 00:17:51 it's completely saturated and all the investors are saying no now, and there's like 20 companies that are all raising money. So that's definitely something I've seen. Examples of companies that were sort of early on in this area would be something like Bauplan, where they're kind of building like a data lake in a box thing. And yeah, there's a lot of these now. So Tabular I think would be another obvious example of something in this area that was unique in that they were a lot of committers and creators from Iceberg, but also early on. Another wave I think that I've been seeing lately is really just the, like,

Starting point is 00:18:29 I don't even know what to call it, but client side database slash replication engine thing that's happening where we're putting databases on clients, you know, a lot of DuckDB, but more recently PGlite, Turso is doing it with SQLite. I think, you know, some of those guys have been around for a little while, but there's really kind of a surge in that area right now. I think a lot of people are spending a lot of time working on that, which has been interesting to watch. The third one that we were talking about beforehand is there's like 80 durable execution companies now. That's an exaggeration, but probably not by much. Yeah, those are some of the waves that I'm seeing.

Starting point is 00:19:07 I mean, I think they're all broadly driven by just a few key underlying trends, right? And I think the interesting thing with durable compute, like we were talking before the podcast recording, we were talking about the difference between a Temporal versus a Reboot versus an Inngest versus a Restate and how all these things are at the core trying to solve the same basic problem, which is how do I coordinate transactionality across many different systems, right, and at the end of the day end up with the same result or have an ability to roll back and manage all these different complexities that come from the fact that we now live in this highly connected world where compute is coordinated across many different components, and how do you make that really

Starting point is 00:19:47 easy to consume? And it seems like, oh, this is actually just like a problem that occurs everywhere across the computing landscape. So you end up with many different companies with many different opportunities to basically repackage the same idea and the same solution, but for a different audience. Yeah, I buy that. The way that I think about it is maybe a little bit less about transactionality and a little bit more about, I guess, object storage is sort of what made it

Starting point is 00:20:12 click for me, but essentially how we're building database and data systems. And I think there's sort of two things. One of them I think of as horizontal and the other is vertical. And so if you think about a traditional database, right, on the storage layer, I see that just we're stretching the storage all the way, right? So on the far end of storage, cold storage, right? We have object storage. And on the other end of it, we have like client-side storage, right?

Starting point is 00:20:36 We're starting to put data along this entire spectrum. A lot of it's going into object storage, and then we have caching, you know, in the mid-tier, and we have caching in CDNs and edge, and then we have caching on the local disk. And in some cases, source of truth is actually local disk, and we're doing CRDT or sync engines or whatever to get it back.
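One way to picture that spectrum is a chain of read-through tiers, where each tier answers from its own copy if it has one and otherwise asks the next tier toward cold storage. The sketch below is a toy model under that assumption only. The `Tier` class and the tier names are hypothetical, and it covers only the read path where object storage is the source of truth, not the local-first case with CRDTs or sync engines.

```python
class Tier:
    """One storage tier that reads through to a colder parent tier on a miss."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # next tier toward cold storage, or None
        self.data = {}
        self.hits = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            return self.data[key]
        if self.parent is None:
            # Coldest tier and the key is absent: nothing left to ask.
            raise KeyError(key)
        value = self.parent.get(key)
        self.data[key] = value  # keep a copy so the next read stays local
        return value


# Hypothetical chain, coldest first: object storage, mid-tier, edge, local disk.
object_storage = Tier("object storage")
mid_tier = Tier("mid-tier cache", parent=object_storage)
edge = Tier("CDN / edge cache", parent=mid_tier)
local_disk = Tier("local disk", parent=edge)

object_storage.data["user:42"] = b"profile"
first_read = local_disk.get("user:42")   # walks all the way to object storage
second_read = local_disk.get("user:42")  # served from the local copy
```

Only the first read travels to object storage; later reads are answered by the closest tier that holds a copy. Invalidation and write paths are deliberately left out, and those are where most of the difficulty discussed here sits.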

Starting point is 00:20:53 But there's sort of like this stretching of persistence from cold storage all the way to client. So like very global stretching of persistence. And then vertically, if you take the database and you look at the query engine, this is what Wes's whole thing is like, we're decomposing the query engine into a bunch of separate components.

Starting point is 00:21:12 And so we've stretched horizontally the data plane, and then vertically the query engine is just being ripped apart. So we have like all these components and everything's everywhere. And I think that's where kind of your point fits in around like, okay, well now everything's everywhere, but we like still want transactions, right? Yeah. And so how do you

Starting point is 00:21:28 manage this? And you know, I had a really interesting conversation with Charles, I'm going to mispronounce his last name, but it's like Zwilowski or something. But he was an ex-Cloudera guy that was a CTO at Temporal. We talk fairly frequently, but his insight to me was actually that he's thinking about it more from network, which is like now that we've stretched everything, right? So we put all the data everywhere and we've taken apart the compute engine,

Starting point is 00:22:00 there's like network everywhere, and I think that's really where like durable execution comes into play. And he was talking about it, you know, under the context of Skip Labs and some of the stuff they're doing sort of in the, for lack of a better term, durable execution space. But they're actually kind of working in a space that ties these two things together, durable execution and sort of the client side sync engine stuff. So I think it's a good example of what you're talking about combined with what I'm talking about.

Starting point is 00:22:43 My entire time I've been an engineer, every time you try and hide the network, it's like a disaster. But maybe this time is different, I don't know. Yeah, it's super interesting because one of the things I often think about is just the broad discussion we're having. The way we do our applications is changing. It's been always morphing. It was simplest when it was three-tiered on-prem. You had a data layer, like, figured out how to present that data.

Starting point is 00:23:05 And you probably had one model of database. And, you know, that was conceptually very simple. Obviously it didn't work for lots of other reasons, let's say, scale, which is why we ended up where we are today. But I think, like, thinking about it broadly, navigating the space from a developer's perspective is actually quite difficult

Starting point is 00:23:21 because basically what we're saying is all these things that used to be resolved for you in three-tier architecture. Hey, you have one giant database. It's Oracle. Maybe it's MySQL. Maybe you were cool and running Postgres back in those days. But as a developer, all you do is write some SQL and spit it on over to your Oracle instance or whatever you're running, and things like transactions are covered for you, right? You didn't really have to understand these components, but now we have this vastly heterogeneous world with all these different concepts. The problems are really the same, like from a computer science perspective,

Starting point is 00:23:56 it's really the same problems, but now we've brought them up a level so that the user, the developer, has to really think about these things in a way they never did before. I'm kind of curious to think about, to get your opinion on, like, what do you think usability of these systems actually ends up looking like?

Starting point is 00:24:11 And how do you think AI fits into it? Because I have to add that into the question. Yeah. You got to be able to put AI in the synopsis now. I'm just trying to get like a quippy thing for Twitter, you know, Chris? Like, I'm just trying to get some hot spicy takes here. Yeah. Setting AI aside for the moment, the thing that you said that kind of, I think is really interesting to me,

Starting point is 00:24:42 the optimal developer experience is really, you know, like the LAMP stack, right? It's like, we're building a monolithic thing. We have one database, we have one code base, everybody kind of works together and it just works. And when we call begin and commit, like we know what's going on, ha ha, modulo what MySQL is actually doing under the hood. And so what does something like that look like

Starting point is 00:25:02 in a distributed world? This is a really good question. And I think when you look at something like Temporal or Restate, it seems to me like only half the puzzle, right? It's essentially only the storage layer. It's only the database. We need the framework, right? You could pick up your Rails, or you could pick up your Django, or you could pick up your Vercel, or whatever it is. So that's sort of for app developers. And I think for infrastructure developers, things like Temporal and Restate are probably the right answer because you want more control, you have a fine-grained understanding of your needs,

Starting point is 00:26:14 and what kind of transactionality you want, and so on. And so putting my use cases out there, historically around financial transactions and stuff, it's like I want to be able to interact directly with Restate or Temporal. I get that it's harder and like, it's really a bit more of a challenge, but like, I'm willing to pay that because, you know, we're dealing with important data and, you know, if we lose one transaction, it might be a hundred thousand dollar transaction.

Starting point is 00:26:37 So my sense is that it's going to continue to be balkanized and sort of split. And I believe the TAM is large enough, because we're essentially talking about all app development. So I believe the TAM is large enough that you're going to be able to sustain multiple winners that are implementing things in different ways. You'll probably have a series of companies that are competing for front-end engineers that

Starting point is 00:27:01 need some amount of durable execution stuff. You'll probably have a separate set of startups that are working with traditional, what I would call application engineers that are working on the mid-tier microservice world. And then you'll probably have a set of companies, a la Temporal and Restate, that are working more on the backend infrastructure type customers. And so I think each one of those can justify at least one winner.

Starting point is 00:27:27 Fundamentally, the use cases are different enough and the right product is different enough that in order to win one, you're probably gonna have to cut corners or make decisions that are gonna prevent you from being the best user experience in the other ones. And so that's sort of my read of the situation. It just strikes me that like the biggest thing we don't have

Starting point is 00:27:46 is a simplified model for how to think about these architectures anymore. It's just so complex. And I mean, at the end of the day, it's because we've given way to Conway's game of life for how we architect things. And we're just like, the organization can't manage clean architecture, so we're going to manage the chaos in some guardrails. Yeah, I think on the difficulty front with durable execution, that's always been my critique with Temporal. It's just too complicated for most people to really work with. And I think they have a lot of

Starting point is 00:28:18 luck and growth in that there's usually a team that asks to use it. The payments team, the FinTech team, the team dealing with like health data or whatever it is, or doing some kind of like really important transactional stuff. They have no other choice. Right. And then once it's in, it's like, Oh, okay, well we can do some cron based stuff and you know, as they sort of land and expand, but that's because it's already in there.

Starting point is 00:28:38 Like, a front-end engineer is not going to grab Temporal, I believe, most of the time as their first choice when they need to do some very lightweight stuff. We were talking about Inngest before; I think that's an example of something that's maybe a bit lighter weight. And so I've been very critical of that. I think Restate is kind of going after that as well and sort of saying, like, we're going to be Temporal, but actually usable. The citation I always give, and not to dump on Temporal too much, because I think they've been doing a phenomenal job and essentially category creating with this: when I was reading their docs, it was like,

Starting point is 00:29:10 I was going through their quick start and stuff. I think I was digging around their Python docs and it was like introduction. And then the second page was like this really complicated dissertation on non-determinism and clocks. And the fact that, like, if you reorder your method calls in your code, suddenly you're not going to work properly anymore because it's non-deterministic and the whole idempotency thing breaks in their wall. And I was like, oh my God, this is just not gonna work. So that's sort of how I come down on the complexity thing
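The replay problem described above can be illustrated with a toy sketch. This is not Temporal's API or implementation; the `Replayer` class, the `step` method, and both workflow functions are hypothetical names invented for this example. It assumes a model where each step's result is recorded in a history and a restarted workflow replays that history in order instead of re-running side effects.

```python
class NonDeterminismError(Exception):
    """Raised when replayed code asks for a step the history did not record next."""


class Replayer:
    """Toy durable-execution context (hypothetical, for illustration only)."""

    def __init__(self, history=None):
        self.history = list(history or [])  # recorded (step_name, result) pairs
        self.cursor = 0                     # position of the next step to replay

    def step(self, name, fn):
        if self.cursor < len(self.history):
            recorded_name, result = self.history[self.cursor]
            if recorded_name != name:
                # The code no longer matches what was recorded.
                raise NonDeterminismError(
                    f"expected step {recorded_name!r}, got {name!r}"
                )
            self.cursor += 1
            return result  # reuse the recorded result, skip the side effect
        result = fn()      # first execution: run the side effect and record it
        self.history.append((name, result))
        self.cursor += 1
        return result


calls = []  # tracks which side effects actually ran


def charge_card():
    calls.append("charge_card")
    return "charged"


def send_receipt():
    calls.append("send_receipt")
    return "sent"


def workflow_v1(ctx):
    ctx.step("charge_card", charge_card)
    ctx.step("send_receipt", send_receipt)


def workflow_v2(ctx):
    # Same steps, reordered: harmless-looking, but it breaks replay.
    ctx.step("send_receipt", send_receipt)
    ctx.step("charge_card", charge_card)


first_run = Replayer()
workflow_v1(first_run)                    # runs and records both steps

workflow_v1(Replayer(first_run.history))  # replay: nothing is re-executed

try:
    workflow_v2(Replayer(first_run.history))
    reordered_ok = True
except NonDeterminismError:
    reordered_ok = False
```

In this sketch, replaying the unchanged workflow reuses recorded results, so the card is not charged twice. Replaying the reordered version fails on its first step, because the history says `charge_card` came first. That mismatch is the kind of constraint the docs page in question has to explain to new users.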

Starting point is 00:29:40 is like, it has to get better. That was the origin. I wrote a post last November, I think, about this, where I was like, basically, in order to expand the TAM, you have to change the way that this works. And I think what we're seeing now is companies doing that. Well, now we're coming to the pick-your-poison section, the spicy future. What do you believe right now that you think most people haven't gotten to or don't agree with you on yet? Let's see.

Starting point is 00:30:10 So I think I'm going to pick on the analytics engineer role. This is a post I wrote recently where I basically think that role shouldn't exist. And I think that there are a cohort of people in that role that are very vocal and very loud and are trying to justify the work they do and why what they do is so important. But the reality is I don't think they impact the bottom line the way they think they do

Starting point is 00:30:34 and we're not in a zero interest rate environment anymore. That role should get folded back in with data engineering. And so, maybe another way to say this is also that I think the data engineering role shouldn't exist. But essentially, I think we've made a mistake by taking what used to be one role and splitting it into like three roles. We have, you know, analytics engineers, we have data engineers, we have ML engineers, we have data scientists, we have business analysts.

Starting point is 00:30:57 Well, I just named five different roles. You guys aren't providing enough revenue to the company to justify five different roles. Like, I'm sorry. So it needs to get combined back. And I think the lowest hanging fruit areas to start with is probably analytics engineer, business analyst, and data engineer. Take three of those and just make two.

Starting point is 00:31:16 So that's my spicy take. I'm so curious. I mean, that is a spicy take and I love it, and it makes a lot of sense. One would say that we've reached peak data role and now we're looking for the consolidation, right? We've explored all of the space and now it's time to figure out how we can do it effectively and efficiently. And also how can we reuse those dollars for whatever the business wants to do, other investments.

Starting point is 00:31:40 But what are the enablers for that transition? We ended up with the analytics engineer because it was like, the data team couldn't handle the inbound requests for what they needed to do. They're focused on other parts of the system, maybe it's maintaining the system, whatever's going on. So I'm kind of curious, do you believe this is true just purely from a numbers perspective, or do you also see things that are occurring in industry that you're like, well, when this all lands in this sort of way,

Starting point is 00:32:00 you're going to look at this and be like, I don't need 20 people for this, I need 10 people or five people. How do you view it? ... or on the Airflow front or Prefect front or Dagster front, you had to write a bunch of these hooks, these connections, these integrations that would talk to all the various systems and you needed engineers to build those things. That's all been done. Basically, Debezium connects with a ton of things now.

Starting point is 00:32:36 You can pay a SaaS vendor, like Decodable, to host the pipelines and so on. And so the pitch I have is number one, take the analytics engineers and make them own the data pipelines as well. And they should have the tooling now to be able to do that. And so sort of get rid of the data engineering role as a unique thing that has to exist for that. And then I think number two on sort of the business analytics and analytics engineer front is I just think you need to make the analytics

Starting point is 00:33:05 engineers do business analytics stuff as well. I don't think there's necessarily as much of a tooling thing around that or a technical driver; it's just like saying now you own the dashboards too, not just the data models. But I think the data pipeline part is one thing, the commoditization of the data pipeline. I think number two also is we have begun to seep SDLC stuff into the analytics engineer and business analytics workflow. So it first started with dbt, now you're seeing it with DLT,

Starting point is 00:33:33 which is an interesting product. But as we force them, sort of subversively under the hood, to adopt standard best practices for how to ship reliable code slash workflows slash transformations, I think that will make it easier for us to just depend on, you know, this role. So maybe the inverse of what I'm saying is also true. Rather than eliminating the analytics engineering role,

Starting point is 00:33:57 let's just call it analytics engineering role so that the loud people can continue to be happy. But let's just fold the other roles into it, data engineering and business analytics. So I guess data pipelines, more tools adopting the SDLC stuff and meeting them where they are. I'm seeing a ton of startups right now, to the earlier discussion around waves and stuff, a ton of startups that are doing sort of a verticalized tool that combines ETL, data transformation and ML

Starting point is 00:34:21 into like one Python thing that meets them in the coding IDE language that they're comfortable with, but enables them to do everything from let's ingest the data into the system, let's transform the data, and then let's train models and, hey, let's deploy the models back into production. There are more and more of these systems that are coming online, so I think that trend will help here as well because they'll be able to own sort of the entire life cycle of the data from ingest all the way to egress. So those are three, two and a half examples: SDLC, verticalized tooling, and commoditization of pipelines. Cool. Well, where can people find you

Starting point is 00:34:58 if they somehow live in a place on earth that just doesn't have internet and don't even know Materialized View? What's your brand new sir? It's a little embarrassing. So let's see, materializedview.io is my newsletter. And then I'm chris.blue on Bluesky. I still lurk on Twitter a bit. So I'm criccomini on Twitter. Although, like, honestly,

Starting point is 00:35:22 it's been kind of a ghost town there lately. I just haven't been getting as much good stuff. So I've mostly moved over to Bluesky. Materialized View Capital is my fund. I've got a book that I wrote with Dmitriy Ryaboy that you can grab on Amazon for the low, low price of, actually, I don't know how much it is, but The Missing README is the name of the book.

Starting point is 00:35:39 It's a guide for new software engineers. And then keep an eye out for the second edition of Designing Data-Intensive Applications, which is Martin Kleppmann's book. I'm helping him out a little bit with the second edition. At various coffee shops too. And SlateDB too, right? Oh right, shoot, yeah, SlateDB. At least talk about SlateDB in the 30 seconds. I forgot, I forgot. Yeah, so slatedb.io, I'm on the Discord as well there.

Starting point is 00:36:15 We're working on an LSM built on object storage. Just think of it like RocksDB, but on object storage. Come say hi in the Discord or on the GitHub repo. PRs are welcome. Yeah, yeah. Well, when you start flipping up the pricing page, I think we're all going to be like, all right, Chris, you made it. Yeah, no, I can tell you, books are not a great way to earn a living. They're a lot of work. They can be fun, and they are almost never a good way to make money.
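For readers unfamiliar with the "LSM built on object storage" idea mentioned above, here is a minimal sketch of the general technique. It is not SlateDB's API, on-disk format, or design; the class `ToyObjectStoreLSM`, its methods, the JSON encoding, and the object naming scheme are all assumptions made for illustration. A plain dictionary stands in for an object-storage bucket.

```python
import json


class ToyObjectStoreLSM:
    """Toy LSM tree whose sorted runs live in an object store (hypothetical)."""

    def __init__(self, flush_threshold=2):
        self.object_store = {}  # stand-in for a bucket: object name -> bytes
        self.memtable = {}      # recent writes, held in memory
        self.sst_names = []     # flushed sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable as one immutable, sorted object."""
        if not self.memtable:
            return
        name = f"sst-{len(self.sst_names):06d}.json"
        payload = json.dumps(sorted(self.memtable.items())).encode("utf-8")
        self.object_store[name] = payload  # a real system would upload this
        self.sst_names.append(name)
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for name in reversed(self.sst_names):
            table = dict(json.loads(self.object_store[name].decode("utf-8")))
            if key in table:
                return table[key]
        return None


db = ToyObjectStoreLSM(flush_threshold=2)
db.put("a", "1")
db.put("b", "2")  # reaching the threshold flushes one sorted object
db.put("a", "3")  # newer value for "a" stays in the memtable for now
```

Writes only ever create new immutable objects, which is what makes the pattern a natural fit for object storage. A real implementation would also need compaction, a manifest, caching, and durability for unflushed writes, none of which this sketch attempts.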

Starting point is 00:36:47 I'm sure, I'm sure sir. Well thanks for being on sir, thanks for being on our pod. We should have you almost like a recurring thing, but yeah. No, I'd be happy to, just let me know. It's a good way to kick off a Thursday morning coffee and an infrastructure rant. It's a good time for all of us. Thanks so much Chris.

Starting point is 00:37:02 Yeah, take care.
