The Infra Pod - Will all future data infra products be built on cloud storage? Chat with WarpStream

Episode Date: October 23, 2023

Ian and Tim invited the cofounders of WarpStream (serverless Kafka built on S3) to talk about the tradeoffs of building a new streaming product designed from the ground up to run on top of cloud storage like S3, and what that implies for the future.

Transcript
Starting point is 00:00:00 Welcome back to yet another Infra Deep Dive podcast. As usual, it's Tim at Essence VC and Ian. Let's go. Today, I'm Ian, as always, actually, to be honest. I have never changed my name. I helped Snyk turn into a platform, and I'm super excited to be joined by Richard and Ryan, who are building some really incredible new streaming tech. Richard, please introduce yourself and pass it off to your co-founder
Starting point is 00:00:33 and tell us what you're working on. Yeah. Hey, everyone. I'm Richie. Ryan and I have known each other for actually over four years now, I think. The company we're building right now is called WarpStream Labs. WarpStream is a Kafka protocol-compatible streaming system that runs directly on top of object storage. You know, it pretends to look exactly like Kafka, basically from an API perspective, but there's no local disks anywhere in the system. And the two main reasons we set out to do that were to reduce interzone networking costs in cloud environments.
Starting point is 00:01:06 I think maybe even more importantly than that, just to make Kafka a lot easier to manage. At Datadog, we worked on a system called Husky, which is a columnar store for observability data. So think like, whatever, I don't know, Snowflake, but heavily optimized for logs. And we spent a long time at Datadog building that system, migrating everything to it
Starting point is 00:01:24 over the course of three and a half years. And then when we were done, we had this cool auto-scaling data lake thing that was really nice and cost-effective, but we were still getting paged for Kafka. Still had issues with brokers running out of disk, not being able to scale them up fast enough. Issues with partition balancing and stuff like that.
Starting point is 00:01:41 And it just felt really old, I guess, compared to the new S3-based system. So that was somewhat the impetus for the company. Yeah, this is Ryan, my co-founder and CTO. Richie and I, before we worked on this and before Datadog, were working on another system called WarpTable, which was a hacky prototype of the thing
Starting point is 00:02:03 that we eventually built at Datadog. And before that, I tried to convince him to do this, the same basic idea about building Kafka on top of S3. But he thought it was too boring back then because he hadn't yet experienced the pain of using Kafka. The joys. Yeah, the pain and the joys. Mostly pain. After three and a half years of interacting with Kafka in a real production, high-scale environment, he thought the problem was a lot more interesting. We co-founded the company in May of this year, and we've been heads down working on the product ever since.
Starting point is 00:02:42 So what was it, Ryan and Richie, when you were working on Husky, was there some insight or some moment where you said, okay, this is the moment to go build this company? What was it that got you over the line to think that, hey, this is a really good idea? That's a really good question.
Starting point is 00:03:00 Yeah, I think I can get us started a little bit because I don't think there's really one answer to this. There's definitely going to be more than one. I think I can get us started a little bit because I don't think there's really one answer to this. There's definitely going to be more than one. I think what made me ready was the fact that we were basically done with the Husky project. Obviously, it's still being maintained and stuff, but we set out to migrate all of the products using the legacy storage system at Datadog to Husky. And that finished early in 2023. We definitely hit our goal there.
Starting point is 00:03:36 And obviously, that makes it a good time to look around and see what's next. Yeah, I mean, we'd also just, we'd been there a while. And I think, you know, Ryan and I are both kind of entrepreneurial by nature too. You know. We ended up at Datadog trying to start a database company, basically an observability company, and realizing that we had a lot to learn, I think, about doing a B2B product and stuff. I mean, I had been working at Uber. I'd never even worked at a B2B company.
Starting point is 00:03:57 So Datadog was really good to us. We had a lot of fun there, but it always felt like a bit of a compromise, right? Like we wanted to go start our own company, basically. The amount of things that we did to just avoid introducing Kafka into new places in our architecture was also extremely frustrating. Whenever we thought about companies that we might want to start, a lot of them involved needing something like Kafka.
Starting point is 00:04:22 And we're like, I don't want to actually go figure that out again from scratch with no help. A lot of big data problems, if you want to use Kafka for them, it's not cost effective at all, basically, to use a vendor. It just is margin destroying for some businesses. So it just felt like a necessary precursor for a lot of other interesting problems to be solved, basically. So I don't know. It just felt like a good time to do it.
Starting point is 00:04:46 So I want to talk about the history you're getting here, because I think Husky definitely is more like the new next-gen log management story you guys were building for it. I remember there's definitely a lot of history based on the work you guys did for Foundation DB back in the day. So I would assume the most straightforward path for you guys to start a company is basically doing exactly what you did for Husky, basically, like a log story type tool. And looking at the blog post of Husky, you're talking about sketches and better ways to be able to integrate different algorithms directly into storage engines. So it's a lot more logs and probably some more observability type of data.
Starting point is 00:05:22 But now looking at the new Kafka, you're sort of just assuming any data streams, right? But just built a fundamental different way. Like why go to Kafka, but why not just do the Husky newborn? What was the thought process for you to like, hey, this is way more important or this is something we should do instead? Yeah, that's a totally fair question. I mean, we're joking about fundraising a little bit before this. I think like that would have been a really easy fundraise. So I think that would have for sure been an easy path for us. I just think that, honestly, we spent four years solving that problem,
Starting point is 00:05:53 and I just didn't want to solve it again. It's really hard for me to get excited about watching the same movie twice, basically. Whereas the Kafka stuff felt... I've kind of said this before, too. I feel like it's kind of understood now. If you think about what Husky is, it's basically an observability data lake. It's a solved problem in the sense that people know how to build it.
Starting point is 00:06:14 I don't think there's a clear leader in the open source space or anything, but people know how to build these systems now. You ingest data, you buffer it for a little while, you dump it into object storage in a columnar format, you track metadata in some external store, you have a scatter gather query engine, you add caching, etc, etc. People know how to do this and there's a lot of systems that do that. Doing quote-unquote low latency streaming on top of a high latency storage medium like S3, I don't really think that's a super well-solved problem. And so if I'm looking at
Starting point is 00:06:45 the next four to 10 years of my life, that felt a lot more interesting. Cool. Let's talk more about this Kafka space, right? Because obviously Kafka has been there forever. I worked on Kafka actually before, and I saw how Red Panda started back in the day. So we've seen other entrants in this space before. Now you're coming into it, I think there's definitely a lot of decisions you can make to go after the space. And your Kafka is dead. Blockpost definitely went viral as well. So let me talk about the thought process here. Building a new Kafka, what does it look like? What do you want to make sure you're able to talk about? Because the biggest thing on the front page is cost. Are you comparing how much it costs to run different
Starting point is 00:07:25 versions of it and your 10x way more cheaper? So it sounds like cost is like a number one hook, but I'm sure there's way more thoughts around this. We basically talk about two things, cost and ease of operations or ease of use. Those are our two main value propositions. And cost, I think, is actually a lot more important than I think people realized in the beginning. Because when you're trying to solve a new and interesting data problem, especially at the frontiers of what is possible in computing today, you may self-select out of different kinds of architectures based on how much you think they will cost. If you're going to be doing streaming analytics on video, you probably wouldn't back that with
Starting point is 00:08:10 Kafka. It just wouldn't function from a cost perspective. You would do something like writing video files to S3 and then maybe tracking them in Kafka, like pointers to them. It would be nice to just use Kafka to do it because there's no differentiating value as the person who's making this video streaming analytics thing on doing the engineering behind, I'm going to write video files to S3, and then I'm going to track pointers to it in Kafka. The cost thing is not just about taking existing workloads and lowering their cost it's about making sure that people starting today with new workloads that maybe are at the edge of what you would reasonably choose kafka for or even well beyond it can choose the system that they want
Starting point is 00:08:58 based on the features and the ease of use rather than purely about the cost model. And we think that we can deliver on that. We think that we can build something that would be essentially as efficient for a high data volume use case like doing stream analytics on video. We're not building anything in particular for it. We're just building exactly the correct, from an architecture perspective, way to move a lot of data through S3 through a Kafka-like interface. The other side of that is the ease of operations. Amazon and any other hyperscaler cloud vendors
Starting point is 00:09:31 have put an inordinate amount of engineering time behind making object storage scalable. And if you look at the right pattern of Kafka, ignoring the latency for a second. It's not as if Kafka requires writing a bunch of tiny objects to S3. I think most developers could imagine a way that you could make it work.
Starting point is 00:09:54 There's a whole bunch of metadata aspects of it that you have to think a lot harder about, but just from the data side, I think most developers could figure out a way to do that. And once you shift that burden onto the cloud provider, you have so, so many more options
Starting point is 00:10:06 about how you build the rest of the system. Like one thing that you would get rid of right away is partition rebalancing. If you're an open source user of Kafka, you have to have some kind of tool to move partitions from one broker to another, which is like the absolute most archaic thing. When you're coming from other systems
Starting point is 00:10:24 that automatically shuffle data around for you behind the scenes, having to think about that for Kafka is just very funny, which is why most people either use some fancy tool for it that automates it, or they pay a vendor like Confluent or Amazon to do it for them. But it'd be nice to just never have to think about that again.
Starting point is 00:10:42 That's one thing that Workstream gives you is you just run a stateless Docker container in whatever your container orchestrator of choice is. We've tested it in both ECS and Kubernetes. You point it at an S3 bucket and you get Kafka on the other side. There's no partition rebalancing. So you can scale it just as easily as you'd scale Nginx. You add more containers based
Starting point is 00:11:05 on the CPU usage, and that's basically it. All of that ease is provided by the fact that you use the object storage. Object storage is such a fundamental game changer in terms of the ability to design high-scale systems that if you're starting today building a new system like WarpStream, you just don't have another option if you want to build something that works at high scale and is easy to use. Yeah, I think that's a point I want to emphasize. I don't think it's possible for big data systems to remain competitive if they're not object storage based. If you're running in a cloud environment and you're dealing with large volumes of data, you have to move to a completely object storage based architecture or your system will just not be cost-effective
Starting point is 00:11:48 or effective in other ways long-term. I don't think you'll be able to essentially keep up. The cost differences there are too huge and the economies of scale there, it doesn't matter how good your software is, you can't accomplish them any other way in those environments. And the primary reason for that is that you have
Starting point is 00:12:05 these cloud providers like Amazon and ObjectStore where they're investing all the money to create a great abstraction to abstract away spinning disks from you and the complexities of dealing with hardware failures and all these other different complexities and have hidden it behind APIs. Is that the primary reason? That's part of it. There's also
Starting point is 00:12:22 the millions of engineering hours that have gone into making that extremely reliable and dealing with hotspots and basically allowing you to burst temporarily and all the stuff that they do. But there's also just some physics to it, right? If you think about the way S3 works, the kind of trick there is that
Starting point is 00:12:39 if you're an object storage provider, you can go out and buy these disks. They actually get slower every year, but their capacity just grows continuously. And they have so many disks that are just sitting filled with data that are completely idle, right? It's data at rest that's not being read, that's not being written. And basically, the number of disks that you would need to buy to get the amount of IOPS you can get out of S3 or object storage by just signing up with your personal credit card and not asking for any quotas,
Starting point is 00:13:08 the number of SSDs you would need to accomplish that and how much that would cost you, it's ridiculous. I remember Ryan and I did some cost modeling a while back. We were trying to estimate how much it would cost, basically, and what it would take to build a single layer of redundancy against object storage and cloud environments just for a couple of hours, it's unimaginably expensive.
Starting point is 00:13:27 You can't compete with the economies of scale that they have from those giant arrays of disks. And also, part of this is just the cloud tax of where you just get hammered on these inner-zone networking fees. If you ever have to move data across availability zones, large volumes of data, and you don't go through, essentially, the object storage APIs, you get crushed.
Starting point is 00:13:48 To send a gigabyte of data from one availability zone to another in AWS costs the same as storing an S3 for two months. So it's like two months of data storage versus copying it over the network once. It's crazy. It's completely imbalanced in terms of cost. There are plenty of vendors who are building new systems that are not object storage based, at least not entirely. They have some local disk layer in the front of them because they have a poorly engineered application that doesn't know how to work well against object storage. They built it assuming local disks exist.
Starting point is 00:14:26 And there are a bazillion IOPS that you can get on the local disk at a four kilobyte size and that you don't have to pay for put and get requests. But those systems just, they'll work at toy scales. Like if they have a lot of tenants that are doing a bunch of tiny workloads that don't spike. But if you want to deliver this to a giant bank
Starting point is 00:14:46 or other Fortune 500 companies with a diverse range of applications, a diverse range of sizes of those applications, and run it in a multi-tenant environment, you're not going to be able to do that with local disks effectively. It doesn't even matter how good you are at writing software against the local disk.
Starting point is 00:15:01 The economies of scale will crush you, fortunately or unfortunately. I don't necessarily think object storage is a good thing in the abstract. It's just, if you're in the cloud, there's no other way to meet the cost goals, at least on the analytics side. If you're on the operational database side, there are potentially different choices you can make there. But if you're anywhere close to the analytics side, you just can't do it.
Starting point is 00:15:22 Yeah, I think it's actually bringing up really good points where if you're fundamentally designed on top of object storage and there's a cloud, like the partition rebalancing and all that kind of stuff that you have to do before, the reason we had to do it before is because all the data is sitting on a disk somewhere. And there's a bunch of limitations,
Starting point is 00:15:38 right? How many topics can you actually write? How many partitions can you actually write? Because you have Zookeeper, you have all this stuff, right? I think over time, Kafka has changed quite a bit. And trying to add a bunch of stuff to existing architecture has always been a challenge because they added like exactly one writing and transactions, all that stuff has been harder. There's like a huge benefits if you're fundamentally designed differently with object storage behind the scenes. But there's trade-offs too, right?
Starting point is 00:16:02 Once you have object storage as your main storage, there's actually the downsides. What are the trade-offs of being object storage-based? What are things you have to do more carefully or harder that doesn't have to be done in the past? The obvious answer is there's a latency trade-off that we make, right? We can talk about that in more detail too, because there's other types of storage besides object storage, but there is a latency trade-off. You're never going to be able to ensure that something is durable in a millisecond. You can do that with an NVMe SSD. That's not going to happen when you're using S3. The other kind of thing I think you have to think about is that a lot of these systems, I see what people do is they take software that was written for SSDs and then they lift and shift it into the cloud.
Starting point is 00:16:46 They put it in Kubernetes and then they copy data asynchronously to S3 and page it in when they need it later. And there's tons of systems that do this. That works in the sense that you can now store way more data on a much more limited set of hardware, but you miss out on a ton of the potential of what the system could really be capable of. Really to take advantage of object storage properly,
Starting point is 00:17:08 you basically have to rewrite all the software from the ground up around object storage. For example, even just forgetting WarpStream for a second and looking at Husky, every design vision made in Husky, if you root cause it to the bottom, it's because it's S3 based to the point where there's no local buffering
Starting point is 00:17:24 and ingestion either. And everything kind of calls out of that. The file format falls out of that. The query engine style falls out of that. The data structure, every single data structure in the system falls out of that. It's the same thing with WarpStream. It's Kafka protocol compatible in the sense that it implements the semantics and protocol of Kafka, but literally nothing about the internals of it looks anything like traditional open-source Kafka. There's very little you can reuse. You really just have to be willing to start from scratch and start from first principles.
Starting point is 00:17:54 If I had a disk and every time I tried to write to it, the P99 was like 400 milliseconds, what would my software look like? And so you have to design for massive amounts of parallelism and minimizing sequential operations in your storage system. You have to design around larger IOPS. You have to think about caching a lot more intently. You have to think about the fact that basically your entire file system is immutable. Metadata always becomes a huge thing. That's, I think, actually probably the main point of leverage when you're designing around object storage
Starting point is 00:18:25 is how well can you handle the metadata that's required to make the object store look like something else, basically. So for Husky, we use FoundationDB and with WarpStream, we had to do something significantly more custom because of the semantics of the Kafka protocol. I mean, those are pretty big trade-offs.
Starting point is 00:18:42 You just kind of have to start over and the latency. The things that you start over, after you decide you're going to start over, a lot of these things aren't unknowns. If you go read papers about file systems and databases and stuff, the tips and tricks basically that you need to build a system
Starting point is 00:18:59 on top of object storage, they're all there. They're just kind of left behind now because they're the same things that you would use on spinning disks from like the 1980s. The fundamentals are actually not that different. It's just people have completely forgotten how to write software in an environment like that, where the latency of an IO is very high and variable. But once you do decide to start over, it's not as if you have to invent whole new fields in computer science to solve the problems. A lot of stuff is reusable. What's a good example for a 1980s that's left behind?
Starting point is 00:19:40 It's not specifically about the time period for the whole system. It's just spinning disks used to be even slower than they are, and they used to have lower IO bandwidth. and you'd get a lot of IOPS and you'd get a lot of throughput out of that, but the latency was not great, especially when you're oversubscribed in terms of how many IOPS the system has versus the applications you're running across it. The latency for an individual IO can be high and variable. The difference between the median and the P99 might be really large, which is exactly the way that S3 works.
Starting point is 00:20:22 But it is not the way that an NVMe SSD works. And most people that are writing software today assume that if they are going to use the file system, they can just read and write anywhere inside their application code, maybe even not without putting the I.O. onto a thread pool or doing it asynchronously. They're just like, I'm going to make the write happen right here
Starting point is 00:20:41 because the operating system is going to cache the file system for me, and I can mostly just not think about it and just tell the user make sure that the working set of the application fits in memory and then nothing will ever have to happen. But if the way that you design the system is every I.O. is going to be uncached and every I.O. is going to take a really long time And it costs money. It actually costs money to do these IOs.
Starting point is 00:21:06 Once you constrain yourself into that interface, which most people have left behind long ago, over time, even spinning disks, the latency went down a little bit. Now it's going the other way again in the last few years as the technologies had to get more and more obscure technologies to build spinning disks around.
Starting point is 00:21:25 Outlier control is a good example, I think. When I worked on M3DB at Uber, and we ran it on top of real disks and SSDs, if it took a second or a couple seconds to write some data to the disk, the disk was broken. That machine was getting ripped out immediately because the system did not work when individual IOs started taking seconds.
Starting point is 00:21:45 The system's designed for individual IOs to take milliseconds. Whereas that's a completely normal occurrence when you're programming against object storage. Writes and reads, I think stream outliers will just take a second or two. And that's normal and has to be designed around. One really dumb example that you can do that's extremely effective, at least with S3, and I've done this in various things, is you can monitor the P99 latency of reads or writes, or the P99, basically, and just automatically retry beforehand
Starting point is 00:22:18 when you detect that it's taking too long. It's not like the data in S3 is on one disk. It's like there's multiple different sets of disks that they can read from to answer the same request. Or you just got unlucky, you hit one of the thousands of the machines in the cluster that was slightly overloaded or whatever. So stuff like that that just doesn't happen
Starting point is 00:22:35 with an NVMe SSD unless you've absolutely worn it to the ground. And you rewrote it 50 times a day, every day for years, and it's starting to malfunction. It's a thing that we've done. So don't do that. I mean, you both have been through the ringer
Starting point is 00:22:52 on both running the hardware, building software for running the hardware, and now focus on your new approach using object storage and what you learned at Datadog and BuildHusky. I'm curious, there are a lot of different use cases for things like Kafka. I'm kind of curious to get your perspective on what use cases this most makes sense to use for today. And then also,
Starting point is 00:23:12 are there use cases where it doesn't make any sense? And is there a path to actually making it work for there? And then maybe it's just a building, a lot of software to write. Something has to change in the cloud providers or something else. The obvious use case
Starting point is 00:23:24 where it's really great today is just like, we've been calling it analytics, just really anything that looks like high volume streams of telemetry data. That's the most obvious use case because you just get hit so hard. If you're like, I have a bunch of telemetry data, I need to dump it somewhere, I need to stream it somewhere,
Starting point is 00:23:41 you get whacked twice because first you just get hammered on the inner zone networking costs. So you're like, why is my bill so high? I'm already stream it somewhere. You get whacked twice because first you just get hammered on the interzone networking costs. So you're like, why is my bill so high? I'm already running this myself. I already got priced out of all the vendor solutions. It's like a gigabyte per second is like $1.7 million per year in interzone networking. And a lot of people have much bigger workloads than that.
Starting point is 00:23:59 But then it's also like a gigabyte per second workload is not a trivially managed, self-hosted Kafka cluster. And so both of those problems just kind of disappear if you can offload most of the storage problems to object storage. So that's where it shines, I think, today. Long-term retention can also be a really good use case too. There's a bunch of stuff happening right now around open source Kafka with tiered storage, which is really not the same thing.
Starting point is 00:24:23 And I can get into the difference between what we're doing and tiered storage too, but that does help a little bit. And then obviously the place where it's not good today is just like, I've yet to find a ton of people who actually have really solid use cases for extremely low latency Kafka, but there are some.
Starting point is 00:24:39 You can get a really well-tuned Kafka cluster that's over-precisioned. You can get it down so writes are consistently finishing under 20 milliseconds, right? If you're willing to spend money on that. That's just not a thing you can do with object storage in basically any cloud environment. You could do things like you could run Minio.
Starting point is 00:24:57 There are other options too. At some point, we'll offer a low-latency version of WarpStream, but there's other trade-offs there. And so that's the place where today it just doesn't work very well. All right, so I guess we've got to jump into what we call the spicy future. Very simply, spicy future is to talk about what you believe should happen in the future. And also we want to just present it as facts, right? Just like, hey, this is what we believe will happen in the next, you know,
Starting point is 00:25:26 two to three years. And I think in this context, we'll be interesting to kind of talk about like, what do we believe the ecosystem will look like from a data infrastructure perspective? Like if projects are going to be rewritten on top of S3, or do you see every project rewritten on S3 and what does the side effects or the downstream effects of things happening? So however you guys want to start, what do you believe the next two, three years should look like?
Starting point is 00:25:48 What do you see happening? There are definitely a lot of things that I think are going to change. I think for developer infrastructure products, the scale to zero is going to be even more prevalent than it is today. And the reason for this is not just people being cheap. I think scale to zero, once you have it everywhere in your stack, the way that you think about building development environments and CI environments changes completely. If you can set up a complete development environment that has its own copy of all of its database dependencies
Starting point is 00:26:23 and Kafka cluster dependencies and application server dependencies, the reliability of CI will greatly increase relative to the battle days of everybody sharing one gigantic staging database, one gigantic staging Kafka cluster. And databases, I think, have solved this problem a little bit better than most other infrastructure pieces
Starting point is 00:26:46 because you could at least afford to create a different schema in Postgres for each CI run. But once you get outside of that very narrow universe, there is not much out there. You have to build your own hacks around multi-tenancy into things. That translates also into building multi-tenant systems. If you can have a multi-tenant system whose dependencies are also scaled to zero multi-tenant, you can build it all the way up the
Starting point is 00:27:09 stack. I think that's going to become table stakes, basically, in most of these products. Not to talk about Warpstream's competitors too much, but if you look at the competitors' products, most of them have some fixed cost element that does not go away if you want to have a coffee cluster. I think that there are other developer features type stuff that will become more prevalent, like branching and snapshots or developer infrastructure stuff. If you're both on top of S3, this is a lot easier than it is for systems based around local storage. If you think of your system as a copy on write type model, you can just leave all the old files around and make a much smaller copy of the metadata for the system to share between the branches or snapshots, so to speak. Snowflake has had that feature for a long time, and I think it's going to filter its way down into a lot more
Starting point is 00:28:03 developer products. Again, not just because it's cool, but because I think that there are actual use cases for it. Like if you were going to build in CI, you could branch your Kafka cluster and start it from a data set that already exists, a snapshot of some data that you're always using in CI, instead of spending the first 30 seconds of every CI run writing a bunch of data to Kafka. It could just already be there right from the beginning. Or to validate changes against a production database, you could spin up your application server
Starting point is 00:28:36 against a clone of it, run your tests, and assert that, hey, I didn't break all of the data in my database. I think it'll make development a lot safer instead of just shipping your stuff to prod and hoping it works. Those are my two biggest ones, I think, related to object storage.
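The copy-on-write branching idea described above can be sketched in a few lines. This is a hypothetical illustration, not WarpStream's actual implementation: the data files are immutable and stay shared in object storage, and creating a branch for a CI run only duplicates the small metadata map that points at them.

```python
import copy


class Cluster:
    """Toy model of a log whose data files live in object storage.

    Data files are immutable and shared; only the metadata (which
    files make up which topic) is copied when branching.
    """

    def __init__(self, metadata=None):
        # topic -> list of object-storage keys (immutable segment files)
        self.metadata = metadata or {}

    def append(self, topic, file_key):
        self.metadata.setdefault(topic, []).append(file_key)

    def branch(self):
        # Copy-on-write: duplicate only the (small) metadata map,
        # never the underlying data files.
        return Cluster(copy.deepcopy(self.metadata))


prod = Cluster()
prod.append("orders", "s3://bucket/orders/000001.seg")

ci = prod.branch()  # cheap: metadata copy only, data files shared
ci.append("orders", "s3://bucket/orders/ci-000002.seg")

print(prod.metadata["orders"])  # branch writes don't touch prod
```

The CI branch starts out seeing exactly the production snapshot, and anything it writes lands only in its own metadata, which is why a branched cluster can begin a test run with data "already there" instead of repopulating it.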
Starting point is 00:28:54 I think at least the open source implementation of Kafka has just had a stranglehold on the streaming industry. If you want my spiciest take, I think it's been holding everything back. I think if you compare what's happened in the batch and warehousing worlds,
Starting point is 00:29:09 what people have managed to build basically over the last 10 years, especially all the modern ones that take this object storage approach, where storage is solved. Your job is now to provide value above the layer of just copying bytes around. A lot of really interesting
Starting point is 00:29:23 and innovative stuff has happened. We're just starting to see that in the streaming space. If you look at just the number of streaming compute engines and databases that have emerged in the last two to three years, basically, it's crazy. I can probably name seven off the top of my head. And I think that'll keep happening. I think that'll keep growing.
Starting point is 00:29:44 And I don't think that's because people necessarily like streaming compute tools. I think it's because they dislike Kafka. And they dislike programming against Kafka. SQL has evolved enough now that most people understand at least the basics of a tumbling window or something like that when they're going to do streaming analytics. And now that you can express most of those things in SQL, that's what people want to do a lot of the time. Kafka has to be a pipe for them
Starting point is 00:30:14 in order to write the SQL query that they want. It doesn't translate into every application, but so many of them can be solved with a relatively straightforward SQL query. And Kafka has definitely been holding people back because they write an application directly against Kafka. And it's an absolute nightmare. It's really easy to ship the first basic thing into production. It is next to impossible to maintain a long-lived, evolving application that has local state that is an aggregation of some of the data in Kafka.
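The tumbling window mentioned above is easy to see in plain SQL: every event falls into exactly one fixed-size, non-overlapping time bucket. Streaming SQL engines generalize this to unbounded input and emit a row as each window closes; as a hedged sketch, the same bucketing can be computed in batch with SQLite, assigning events to 10-second windows by integer-dividing the timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10), (4, 20), (11, 5), (13, 5), (25, 7)],
)

# Tumbling window: (ts / 10) * 10 maps each timestamp onto the start
# of its 10-second bucket; GROUP BY then aggregates per window.
rows = conn.execute(
    """
    SELECT (ts / 10) * 10 AS window_start,
           SUM(amount)    AS total
    FROM events
    GROUP BY ts / 10
    ORDER BY window_start
    """
).fetchall()

print(rows)  # [(0, 30), (10, 10), (20, 7)]
```

A streaming engine runs the same logical query continuously over a Kafka topic instead of a table, which is the "Kafka as a pipe for the SQL query" pattern described above.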
Starting point is 00:30:48 I don't believe there are actually a significant number of applications in the real world that do interesting things directly against the Kafka protocol. They're either doing something extremely basic, or they're using Kafka as a pipe to get their data into what is essentially a batch system. Yeah. And I think the area that's going to be really, really interesting is bringing those two worlds closer together. And then also just this idea of streaming batch that you see a lot of people talking about right now. Basically, the ability to write an application
Starting point is 00:31:19 that looks like a batch processing job, but is actually running and updating something every couple of seconds, basically. And I think the line between those two things will start to blur. And I think that'll be a really good thing for everyone. With our last two minutes here, I'm going to ask the uncomfortable question that I'm sure you're asked all the time. Programming against S3, GCS, and Azure is basically the same thing; they're relatively
Starting point is 00:31:41 the same stuff. When someone suggests that's just another avenue for vendor capture, that you're getting trapped into S3, how do you think about and rationalize it? Object storage is the biggest commodity
Starting point is 00:31:57 in the cloud, man. It's like VMs and then object storage. You've got to put servers in your basement at that point if that's what you're worried about. The APIs are standardized. We wrote both Husky and WarpStream. We use a single library to talk to object storage.
Starting point is 00:32:15 We ran Husky in AWS, GCP, Azure, GovCloud. It's the same stuff. Now, Amazon's object storage implementation is the best for sure, but the other ones work well enough. I view object storage as an absolute commodity. I think it's bordering on the POSIX API at this point. It's available everywhere.
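The "single library" point is easy to see in code: the slice of the S3 API that most storage-on-object-storage systems need is tiny, so one thin interface can front AWS, GCS, Azure, or an open source implementation. A hedged sketch follows; the class and method names are illustrative, and real code would wrap boto3 or each provider's SDK behind this same interface rather than use an in-memory dict.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """The handful of operations most such systems actually need."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list(self, prefix: str) -> list[str]: ...


class InMemoryStore:
    """Test double; a boto3/GCS/Azure-backed class slots in for prod."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str) -> list[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def write_segment(store: ObjectStore, topic: str, seq: int, data: bytes) -> str:
    # Application code is identical on every cloud: it only sees
    # put/get/list, never a provider-specific SDK.
    key = f"{topic}/{seq:06d}.seg"
    store.put(key, data)
    return key


store = InMemoryStore()
write_segment(store, "orders", 1, b"hello")
print(store.list("orders/"))  # ['orders/000001.seg']
```

Because the interface is this small, swapping providers is a configuration change rather than a rewrite, which is the commodity argument in practice.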
Starting point is 00:32:36 There are multiple open source implementations. That one doesn't worry me at all. Now, for a lot of the other stuff, I do think if you use Spanner, you're getting locked in. But you know, that's a very different story. Cool. Well, we can probably ask a whole lot more questions,
Starting point is 00:32:51 but I think that's actually a great point to kind of segue into wrapping up. So thanks a lot, guys. And thanks for being on our pod. Yeah, thanks for having us. It was awesome. Thank you so much.
