The Data Stack Show - 163: Simplifying Real-Time Streaming with David Yaffe and Johnny Graettinger of Estuary
Episode Date: November 8, 2023
Highlights from this week's conversation include:
Johnny and David's background in working together (1:56)
The background story of Estuary (4:15)
The challenges of ad tech and the need for low latency (5:44)
Use cases for moving data at scale (10:35)
Real-time data replication methods (11:54)
Challenges with Kafka and the birth of Gazette (13:54)
Comparing Kafka and Gazette (20:22)
The importance of existing streaming tools (22:28)
Challenges of managing Kafka and the need for a different approach (23:40)
The role of compaction in streaming applications (26:54)
The challenge of relaxing state management (34:01)
Replication and the problem of data synchronization (36:48)
Incremental Backfills and Risk-Free Production Database (46:03)
Estuary as a Platform and Connectors (47:45)
The challenges of real-time streaming (57:56)
Orchestration in real-time streaming (1:00:51)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Now, this week's recording is with Johnny
and Dave from estuary.dev, and I think this is going to be a really fun conversation.
It's a topic that we've actually covered quite a bit on the show, which is streaming,
you know, in particular real-time streaming. But this is really in the context of, I think,
what you use streaming for.
And we really dig into sort of the Kafka side of the conversation,
which we haven't covered in depth a ton.
But part of the estuary story is really reacting to real-time streaming needs,
evaluating Kafka, and seeing some pretty
severe shortcomings, which is why they built Estuary.
Now, what's really interesting to me is, in many ways, they don't talk about Estuary as
a streaming service.
You know, they kind of talk about it almost as real-time ETL, which is fascinating.
There's some open-source technology under the hood.
And this is really, I think, going to be an interesting conversation because streaming is obviously a hot topic.
And there are multiple technologies.
So really interested to see what the Estuary team has built.
Yeah, 100%.
It was a very fascinating conversation, actually,
for many different reasons.
First of all, it was pretty technical,
and not only in terms of talking about Estuary itself.
Actually, we had a very deep dive into Kafka,
how Kafka is built, and some of the issues there
that Estuary is addressing
from the perspective of the architecture of the system. For example, we were talking about how compute and storage in Kafka are very
tightly coupled and how this changes with something like Estuary,
and what that means in terms of managing the system and what type of use cases it enables.
So we had a very interesting architectural conversation
around this type of system.
Anyone who is interested in understanding better
how Kafka and these types of streaming systems work
should definitely listen to that.
And then we talked a lot about also some important concepts like CDC, right?
And why CDC is important, how we use it and how they implemented it.
Because the standard out there is pretty much like using something like Debezium,
but the folks at Estuary actually implemented everything from scratch.
And they have some really good reasons why they did that.
And they are talking through these things.
So amazing people, both Johnny and Dave, very deep expertise in this type of technology.
And we had an amazing conversation ranging from the technical side of things up to the
business side of things.
So I think everyone should listen to them.
And hopefully we're going to have them again in the future because I don't think one hour
was enough to go through all the different topics when it comes to streaming.
Totally agree.
Well, let's dig in and talk about streaming with the Estuary team.
Dave, Johnny, welcome to the Data Stack Show.
We are so excited to chat about lots of things.
So thanks for giving us some of your time.
Yeah, thanks for having us.
Thanks for having us.
All right, we'll start where we always do.
You have a great history of working together
on multiple different things, doing multiple startups,
selling different companies. So I'll let you choose. Who wants to start with the background story?
I'll go for it. So I'm Dave Yaffe, co-founder of Estuary. I've been working with Johnny for
about 15 years. We started working together back in probably 2008 or 2009 at an ad tech company. And we built a lot of technology around those use cases,
which helped us meet customer needs.
We ended up building a platform called Arbor,
which was a data platform that facilitated transactions
between publishers and advertisers.
And that required very low latency.
We built a streaming system to facilitate that.
And we ended up selling that to a public company back in 2016.
So the streaming system we built was called Gazette.
Gazette was a pretty novel piece of technology, which Johnny will dive a lot deeper into.
But we cared about low latency because when someone's purchasing an item on a website, they are expressing interest in a
product. And the key is to get back in front of them really quickly in order to use that data as
efficiently as possible. So we wanted to make sure we were able to do that. We built that system.
And now we've been applying it for the past four years at Estuary to the world of data
infrastructure. Yeah.
So hi, I'm Johnny.
Dave mentioned we've been working together for quite a while now.
And yeah, whatever your feelings may be
on the advertising ecosystem,
in some respects, we're like recovering
from the ad tech ecosystem.
But as an engineer in the space,
there are pretty fascinating data challenges
just because of the speed at which
you're having to make decisions, which is typically in like the 100 millisecond type
timeframe.
And the amount of data that you need to bring to bear to make those decisions, because you're
consulting various indices of data to figure out like whether to show an advertisement
and how much to pay for it and that kind of thing.
And just from a data management perspective too, because you're talking about gobs of this stuff
coming in, you need to sort of connect it together, build various graphs that sort of allow
you to understand how these data points fit together, project information through it, get it
back out again as quickly as possible. So there are a lot of sort of very interesting engineering
challenges and like wrangling this constant influx of data into useful data products and doing it really quickly.
So as Dave mentioned, we ended up building some technology to help us do that for the last company that we built called Arbor that we open sourced as a project called Gazette. And we are now continuing to build on that project today with our current company, Estuary, which focuses on essentially sort of making it really easy to connect the
various systems that you have that contain data that you care about. So whether that's an OLTP
database or a PubSub system, or even a Google Sheet, or various SaaS APIs, being able to connect
that data to where you want it to
be, which could be another database or an analytics warehouse or an Elasticsearch cluster.
So there are generally all kinds of places where you have data,
and we make it a lot easier to kind of move it around and transform it along the way.
Very cool. So this is just a personal curiosity about ad tech because I'm
somewhat familiar, not definitely not as intimately as each of you, but it sounds like
the APIs for delivery of ads were pretty robust because you're talking about sort of collecting
this data, determining, you know, so someone's like browsing products on a site, you want to sort of retarget them.
And so you build Gazette to sort of facilitate collecting that data and syncing it to these
other systems.
But then you also have all these mechanisms to deliver those ads.
I mean, is that, you know, because you're to some extent, even if you can move the data
with extremely low latency, you're limited by the API that
you're delivering it to, right?
I'll go quickly.
I'm sure Johnny will add a lot of context, but there's a thing in the advertising industry
called real-time bidding, right?
And real-time bidding is effectively a protocol which enables publishers or exchanges of publishers
to share opportunities to partake in an auction. And what they'll do is
they'll have an auction which goes out to tens or hundreds or even more different potential buyers.
And they'll say, here's the opportunity. Do you want to be part of it? That's a protocol that
existed back in 2008, 2007. And it still exists today. It's how almost all of the transactions that you see online,
the ads that you see online are transacted. And when I left
that space in 2014, it was happening 16 million times a second across the United States. So it's really,
really robust.
Yeah. And some of the data challenges are, if you're a demand-side platform, you're operating a bidder that's trying to make a decision.
There's information that comes in that real-time bidding request, and you're using that information
to consult various indices you have to figure out whether or not you're going to make a purchase decision and how much you're going to pay for that.
So those auxiliary sort of indexes of data that you're using to make those decisions need to be up to date and ready to go in order to make that decision.
So it's not just handling the request coming in.
It's also handling all of the data that you're using to make decisions in terms of how to respond to that request. So I'd love to hear about a couple more of those
types of use cases. So I mean, that makes a ton of sense, right? Where you
need to move a massive amount of data in near real time so that people can sort of get these highly contextualized ads.
What are some other use cases where moving data at that scale, you know,
people are using Gazette or Estuary to facilitate?
Yeah, scaling down a little bit, because of course most companies are not in the ad tech
space. That was mostly just to give a little bit of our journey, in terms of how
we got into this space and why we do what we do. But another example I'll kind of reach for is a
lot of companies have a database, like an OLTP database that they're using to manage sort of
their business data and process transactions. You very often want to wire up caches to that,
or another use case that we see a lot is search.
You want to enable sort of fancier forms of search, whether that's like elastic search
or vector databases like Pinecone, which are really just a variant of like semantic search.
So you have all of these different systems that you might want to use to power search
for your product or whatever it is that's your kind
of business domain.
And you don't necessarily want all of that just going through your one database.
So a fairly common bread and butter use case is taking that data from your primary database
and then pulling it out, transforming it a little bit and making it useful in these various
backends for search like Pinecone or Elastic and others. And there, it's a caching problem. So it's
important that it be up to date and reflect what's in your database right now.
Yep. That makes total sense. And I mean, a number of sort of ways that people solve this currently
come to mind, but what are the patterns that you see? So if
you're not using Gazette or Estuary, what are people doing? Is it just a much slower sort of
database replication process? I mean, a lot of people would even use Kafka to just
ingest the change log and do some sort of transformation so that they can populate
some other downstream system. What are the popular ways that you see people do this today?
Yeah, there's, if you truly care about real time, it's probably Kafka plus Debezium,
plus engineers to manually manage a lot of that stuff, because there is some manual aspect to it.
And if you don't care about real time, which is also common, you might be using a system like Fivetran or Stitch to be able to sync those assets together.
Sometimes you'll see people who just query their database and simply do it all in-house without using something like Debezium.
But lots of different possibilities on how you could potentially sync that data. Our goal is to make it as easy to work with streaming data
that's truly updated in real time
as it is to use any of those other methods.
So really just like configuration,
get it from your source, get it to your destination.
If you want to do transformations,
try to make it as simple as SQL
and kind of make the whole thing much more point and click
and scalable than it would be otherwise.
Yep. Okay. And I have to ask this question because, I mean, I don't know, maybe our
listeners aren't asking this, but it's in my head. So I have to believe that you
looked at using or used Kafka to solve some of these problems early in your journey pre-Gazette,
right? I mean, there weren't a lot
of other systems that I think would be able to handle the type of scale that you were describing
sort of in the ad tech industry. Is that true? Yeah. So the decision to start Gazette was made
back in 2014. I was evaluating Kafka at the time
and we were sort of building this business
that focused on data within the advertising ecosystem.
And we knew that latency was really important
for the business.
It was important that we'd be able to sort of capture
this stuff and transform it
and move it as quickly as possible.
So Kafka is the obvious game in town for doing that.
And that was the case in 2014 as
well. There was one sort of architectural limitation of Kafka that I just couldn't get
past, which is the central reason for building Gazette, which is the coupling of compute and
storage that Kafka does. So when you have Kafka brokers,
you are writing data into Kafka topics. And those Kafka brokers are basically managing that data on
the disk of the Kafka broker. And as you expand the cluster or grow it, it's basically moving
that data around, but you are fundamentally limited to the disks that are attached to your Kafka brokers. But even worse
than that, there's this tension that exists within Kafka and downstream usages of Kafka, where
your brokers are responsible for accepting the real-time writes that are coming into your system,
basically like the data that's coming in that you cannot lose. So the number one most important role of those brokers is getting that data down, getting it replicated, making sure you're not going to lose it so that it's available.
But at the same time, you have these downstream use cases that are wanting to pull from these topics of data and do something with it, process that data, turn it into some kind of new data product.
And if you imagine for a moment
that you're bringing up a new application, you're a startup, perhaps in the ad tech space,
you're bringing up some kind of new consumer of data that's got to basically backfill over all
this historical data that you might've seen so far. Now that application, you want it to basically
slurp up data as quickly as possible from your streaming system because you want to process all that backlog of historical data, weeks of data, months, maybe even years of data.
There's an inherent tension where that application wants to read data as quickly as possible from brokers and the same disks that are also responsible for your real-time writes coming into the system that you cannot lose and you want to be fast.
So that coupling of storage and compute is the primary reason for starting Gazette:
solving that problem by decoupling them.
So what Gazette ends up doing differently is that the brokers are really just focused
on the really near real-time stuff, like the writes that have just happened. They're focused on getting that
replicated, getting it down, making sure it's not lost. Then very quickly after that, they persist it
into cloud storage. Cloud storage is actually the durable representation of that data. That's where it actually lives. After the near moment has passed
and it's now in the historical record, it lives only in cloud storage. And what's really neat
about that is that applications can basically, you can stand up a new streaming application
that is going to process months of historical data and also is going to catch up
and stream in real time once it processes that backlog. And when that new application comes up,
rather than needing to go hammer the brokers and pull as much data through them as they can,
what they're able to do instead is basically just ask the brokers occasionally,
hey, where do I go find the next gigabyte of data? And then just go read it directly
out of cloud storage and then sort of seamlessly transition to real-time streaming as they get to
the present. So that's a capability that we built out in Gazette and leaned on a lot for building
all kinds of sort of downstream applications that are consuming quite a bit of data. Obviously,
that decoupling of storing things
in cloud storage means that there's really no retention limit on how much you can keep in that
historical record. So in a sense, I had one person tell me it let us be fearless when building
architecture and figuring out how to put these pieces together, because we really just didn't have
to worry about whether this new downstream use case was going to bring down production.
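To make that read pattern concrete, here is a minimal Go sketch of the idea; the types and method names (Fragment, Broker, NextFragment, LiveTail) are hypothetical stand-ins, not Gazette's actual API. The consumer backfills by asking the broker only for pointers to historical fragments, reads those directly from cloud storage, and then switches to tailing the broker for the near-real-time writes.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Fragment points at a chunk of historical journal data that lives in
// cloud storage, not on the broker's local disk. Hypothetical type.
type Fragment struct {
	Offset int64  // journal offset where this fragment begins
	URL    string // e.g. an S3/GCS object holding the fragment bytes
}

// Broker is a stand-in for the metadata API: it only answers "where is
// the next fragment?" and serves the live tail; it never replays history.
type Broker interface {
	NextFragment(journal string, fromOffset int64) (Fragment, bool)
	LiveTail(journal string, fromOffset int64) io.Reader
}

// fetch simulates reading fragment bytes directly from cloud storage.
func fetch(f Fragment) io.Reader {
	return strings.NewReader(fmt.Sprintf("historical data of %s\n", f.URL))
}

// consume backfills from cloud storage, then switches to the broker's
// live tail once no more historical fragments remain.
func consume(b Broker, journal string, process func(io.Reader)) {
	var offset int64
	for {
		frag, ok := b.NextFragment(journal, offset)
		if !ok {
			break // caught up with history
		}
		process(fetch(frag)) // the broker's disks are never touched here
		offset = frag.Offset + 1
	}
	process(b.LiveTail(journal, offset)) // now stream the near-real-time writes
}

// --- a toy in-memory broker so the sketch runs end to end ---

type toyBroker struct{ frags []Fragment }

func (t *toyBroker) NextFragment(_ string, from int64) (Fragment, bool) {
	for _, f := range t.frags {
		if f.Offset >= from {
			return f, true
		}
	}
	return Fragment{}, false
}

func (t *toyBroker) LiveTail(_ string, from int64) io.Reader {
	return strings.NewReader(fmt.Sprintf("live data from offset %d\n", from))
}

func main() {
	b := &toyBroker{frags: []Fragment{
		{Offset: 0, URL: "s3://bucket/journal/0000"},
		{Offset: 1, URL: "s3://bucket/journal/0001"},
	}}
	consume(b, "example/journal", func(r io.Reader) {
		data, _ := io.ReadAll(r)
		fmt.Print(string(data))
	})
}
```

The point of the sketch is that the backfill path never competes with the brokers' disks, which is exactly the tension described above.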
Interesting. Okay. One more question for me, but actually, Kostas, I'm sure you have a ton
of questions on Gazette. So I want to hear your questions on Gazette, but did you,
when you started building Gazette, did you know you were going to open source it or was it just, you know, you were
trying to solve your own problem? How did the open source story for Gazette come about?
Yeah, it was always the intention. But when we were going through the acquisition process with
our last company, we kind of made sure of it. It was something, you know, honestly,
it was a negotiated thing with our acquirer where as we were going through that conversation and the diligence process that we wanted it to be
open sourced and they were good enough to let us do that and follow through.
Very cool. Yeah. If you develop technology inside of a company,
getting it actually open source, especially through an acquisition, certainly seems like one of the hardest parts.
So yeah, Costas, Gazette,
I'm interested in your questions.
Yeah, yeah.
Yeah, it's very interesting, actually,
what you were sharing all this time.
So, okay, question.
Why would anyone like use Kafka today
if Gazette is out there?
And the reason I'm asking this,
it sounds like a little bit of a provocative question,
but the reason I'm asking is because
when you're engineering systems at that scale
and precision, let's say,
like the operations that are supposed to be getting done there,
there are always trade-offs, right?
Kafka made some trade-offs. I'm pretty
sure that you've also made
some different trade-offs
to deliver the technology
there.
I think learning about these trade-offs is
very important
for any engineer because
it gives you a very deep insight into why the system
works the way it works, and then we can also make a connection there with the problems
that are getting solved, because different problems usually require different trade-offs.
So why would I today go and use Kafka or Redpanda or any other similar
solution that we hear about out there
instead of Estuary?
Just from a purely
business standpoint, Kafka...
The way I look at Kafka is it's the protocol
more than anything else at this point.
Kafka, of course,
there's an open source project and a whole bunch
of stuff around it, but the protocol is the
thing that really matters, right?
Redpanda is a totally new implementation, which happens to use the Kafka protocol.
And Gazette currently does not use the Kafka protocol, but soon it will, right?
Well, it will have the option to support the Kafka protocol as well.
So you could think of Gazette as Kafka just with some different architectural decisions that make it function differently.
So from my point of view, from a business standpoint, what you really want is the Kafka
protocol.
That's all you really care about, right?
Because then you have all the integrations that they've built.
You have the ability to use Kafka Connect and the whole ecosystem that sits around it.
So that's my point of view.
But Johnny, what would you say?
Yeah, I think a lot of it is the ecosystem. When we were starting
Estuary, we spent a bunch of time kind of surveying, because, you know, we had just
sort of finished our stay at the acquiring company and we were leaving, Gazette
had been open sourced, and we were kind of figuring out what's next. And we spent some time, the obvious thing is like, okay, well, if you squint and look at it from a distance, there are a lot of similarities with Kafka.
Should we make a go at promoting this as an alternative to Kafka?
And we spent a bunch of time surveying Kafka users within various companies.
And there are a couple of answers that kind of came out of that.
One is that, of course, the ecosystem matters a lot.
So the existing Kafka Connect ecosystem matters.
The existing applications that already speak this protocol matter.
And if you're not using Kafka today, maybe you're using something
like Kinesis or you're using Google PubSub or tools like that. So there are all kinds of different
tools in the streaming space. Like why does anybody really need another streaming protocol?
And honestly, I can't necessarily say that they do. Like I'm not here promoting Kafka, or sorry, promoting Gazette,
and saying, you should go pick this up and use it today. One major limitation is we only ever
implemented clients in the Go programming language. So in some respects, our realization was that
what we think the industry really needs is more of an up-leveling. Like focusing on the streaming broker is the wrong place to be focusing.
Like it's, we think of it more as an implementation detail,
honestly, of the real problem,
which is like, I've got these different systems
that I want to connect together
and I want to sort of describe what I want to have happen
and then I want you to go do it for me.
So yes, we happen to use Gazette under the hood,
but it's an implementation detail
and it
honestly doesn't even really matter that much from our customers' perspectives. But there's another
answer here too, which is that I vividly remember conversations we had with teams that were
managing Kafka within large companies. And I remember one engineer telling us about the work that they
had done to basically impose these various quota constraints on other teams so that they weren't
overloading their Kafka brokers and that they were able to keep things nice and stable. And
they were presenting this as a good thing. And all the while, while I'm listening in this
conversation, I'm thinking, okay, so you pushed this problem down onto the rest of your organization and limited your other
team's abilities to actually use this infrastructure. But here you have an entire team who's
like, this is their job. It's their fiefdom. Managing Kafka is what they do. So going in and
telling them you're doing the wrong thing is not a great message to be
delivering to them. So that kind of turned us off from trying to sort of directly take on Kafka as
like, we want to build something different here or introduce something different here that's
roughly the same shape and fits into the same box. So those are some high-level answers.
In terms of engineering trade-offs, honestly, there aren't that many. A central one probably
is that Gazette manages immutable logs. So one key difference between Gazette and Kafka is that
Gazette is writing immutable logs where you can add new content to that log.
And that log can be as big as you want.
And it's backed by cloud storage.
So you really don't have to worry about its size.
And you can read it very efficiently because you're pulling that log directly out of cloud storage.
But it does not do what are called compactions out of the box.
Like the broker is not responsible for compacting that
log over time where Kafka is. So if you have multiple instances of a key, Kafka will compact
prior versions of that key away. Now, we can get into why that actually doesn't really matter that
much as the next follow-up question, but I don't want to steal too much time for this one answer.
Yeah, actually, let's do that, and let's reverse the
question. Let me ask you why it is important to do compaction. I'm thinking, okay,
let's say if I was an engineer working on Kafka, implementing compaction doesn't sound
like the easiest thing to do, right?
Especially with these throughputs that you have
to go and maintain
there. It's definitely going to add
complexity to the management of the
system anyway, right?
So why, let's say,
did they decide to do that?
What's the use case for compaction
out there? And as a follow-up to that,
why, from the Estuary point of view, is it not that important to do the compaction?
Yeah, I would say compaction is very important, but there are different ways to do it. So I'm
going to get a little bit technical in this answer. That's okay. But let's say you're building a streaming application,
and very often streaming applications need to manage state.
So if you're trying to do like a stateful join or something like that,
you've got two different streams of data,
and you need to sort of index one side of that stream
so that you can query that index and look up values
when you see instances of the other side of that
stream. So you need some kind of stateful database around in order to do that. So in the Kafka
streams ecosystem, that's typically RocksDB. It's also RocksDB within Gazette. If you're building
like an application on top of Gazette, you're generally using RocksDB to do it. We use RocksDB to do it as well under the hood.
Now, Kafka brokers are doing compactions of Kafka topics. So when you write multiple instances of a key, it's eventually sort of pruning out older versions of that key. And that's important just
from a disk usage standpoint as well. You don't want that topic to be unbounded in size. But
the kind of funny thing that happens is when you're building this streaming application,
this stateful streaming application that's using RocksDB, and you're consuming from one topic
stream, and you're indexing those values, RocksDB is also doing compaction.
RocksDB is what's called a log-structured merge tree.
Part of the way it works, it's basically an append-only sort of,
it's managing this append-only tree of files
that represent the content of the database.
And as you write new content,
it's sort of taking some of those files,
compacting them together,
writing new files, and then deleting files.
So one of the insights we had is this RocksDB, it's already doing this compaction.
Why are we doing compaction twice?
Why are we doing it in these two different places?
What if we instead sort of leverage the compaction work that RocksDB is already doing, harness
that, and then we don't really need to worry about this problem within the broker?
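A rough sketch of that insight, with a plain Go map standing in for RocksDB (the event and store types here are invented for illustration): because the index only ever retains the latest value per key, the state store is effectively doing the compaction, and the broker's journal can remain an immutable, append-only log.

```go
package main

import "fmt"

// ChangeEvent is one record in an immutable, append-only journal.
type ChangeEvent struct {
	Key   string
	Value string
}

// index applies a keyed stream to an embedded state store. Here a map
// stands in for RocksDB: newer values simply replace older ones, so the
// store retains only the latest version of each key. In RocksDB the same
// pruning falls out of its LSM-tree compactions.
func index(store map[string]string, events []ChangeEvent) {
	for _, ev := range events {
		store[ev.Key] = ev.Value // older versions of ev.Key are superseded
	}
}

func main() {
	// The journal itself is never rewritten or compacted by the broker;
	// it can hold every historical version, backed by cheap cloud storage.
	journal := []ChangeEvent{
		{"user:1", "v1"},
		{"user:2", "v1"},
		{"user:1", "v2"}, // supersedes the first user:1 event in the index
	}

	store := map[string]string{}
	index(store, journal)

	fmt.Println(store) // map[user:1:v2 user:2:v1] — compaction happened in the store
}
```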
That's interesting.
Okay.
Follow-up question to that because you mentioned
the term state there.
I want to ask that because
Confluent had their conference
pretty recently,
right? And there was
also the acquisition that they did a couple
of months ago, I think,
from a company that is working with Flink. If there's state management somehow part of the
system itself, why do we also need a system like Flink that does, again, processing. And the main value that it adds, from my understanding at least,
is that it is a stateful query engine on the end,
like a stateful streaming query engine.
So there's some kind of state management in so many different places.
And I think it's hard for people to understand
why we need all these different, complicated systems to actually go and work with each other in order to achieve something.
So I'd like to hear, learn a little bit more about that.
Yeah, well, I'll attempt to tease this apart a little bit, sort of within the Kafka ecosystem.
Because we're talking about a couple
of different things here. We have Kafka streams, and Kafka streams can have as its backend
sort of a state management, like a database for indexing state, and that can be RocksDB,
for example. But Flink is actually managing entirely separate states. So when you have state within Flink,
what it's doing is it's typically persisting
checkpoints of state to cloud storage
and then recovering those checkpoints of state
from cloud storage.
So they're completely separate state stores
and this is a little confusing.
Like the streaming ecosystem is confusing
because state ends up being a very hard problem.
It's a hard problem to solve well and to scale well.
So we see a bunch of different sort of competing solutions for how to best do it.
So I'm not sure that's directly answering your question, honestly,
but I just wanted to kind of articulate a little bit
these different flavors of tools for managing sort of state
within a streaming context. One of the trade-offs I guess is for really low latency, the way Kafka
streams does it is going to be faster than Flink, because Flink basically needs to accrue
a window of data and then run a transaction where it's
computing all of its effects and the state updates, and then snapshot that and persist it
into cloud storage.
And then it's releasing the results.
So there's a little more latency involved in the process when you're using Flink for
that reason.
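As a rough illustration of where that latency comes from, here is a hedged Go sketch of the sequence described above; it is not Flink's API, just the accrue, compute, snapshot, then release loop with a simulated checkpoint write.

```go
package main

import (
	"fmt"
	"time"
)

// persistSnapshot stands in for writing a checkpoint of operator state to
// cloud storage; in a real system this round trip dominates the latency.
func persistSnapshot(state map[string]int) {
	time.Sleep(50 * time.Millisecond) // simulated object-store write
}

func main() {
	state := map[string]int{}
	events := []string{"a", "b", "a", "c", "a", "b"}

	const window = 3
	for i := 0; i < len(events); i += window {
		// 1. Accrue a window of events and apply their state updates.
		end := i + window
		if end > len(events) {
			end = len(events)
		}
		for _, key := range events[i:end] {
			state[key]++
		}

		// 2. Snapshot the state before releasing any results. Downstream
		//    consumers only see output after this durable checkpoint,
		//    which is where the extra latency comes from.
		persistSnapshot(state)

		// 3. Release the results for this window.
		fmt.Printf("window ending at event %d: %v\n", end, state)
	}
}
```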
Makes sense.
So you mentioned with Estuary that there is also processing that someone can do
with the system, right?
How is that different, if it is different,
from using Kafka Streams, or Kafka Streams with Flink,
compared to Estuary?
Yeah, so a major thing that we are, just from a vision standpoint,
when you work with existing tools within the streaming ecosystem, whether that's
Kafka Streams or also Flink or Spark, and you're trying to do some kind of stateful computation
where you need a join or something where you need state in order to facilitate that computation,
all of these frameworks basically require that you adapt your modeling of state into their mechanism
for persisting that state and restoring that state.
So from a vision standpoint, where we're trying to get to really is that you as a user, as
an author of these transformations, should not have to do that.
You should be able to bring your chosen tool to bear in terms of managing state,
whether that's a Python script, that's keeping something in memory, you know,
the same way that you would write it out where you're just like slurping in
something from a file and then, you're computing, computing something in memory
and then spitting out an output, like why can't you do that in a streaming context
as well?
So from a vision standpoint, what we're really after is
basically Unix processes and files and pipes scaled up to the cloud, where you can basically
bring your own program, which can manage state however it likes, and run that in the context
of a transformation within the platform we're building. So we're not quite there.
We have more to build before we fully realize that vision.
But the major goal we have is to sort of break that coupling of the user-defined state and
the way that the framework needs that state in order to manage its persistence.
Yeah, that's super interesting because I would assume that, okay, imposing a framework on
how to manage states allows the system designer to reason about how to effectively do the
state management, right?
While when you relax that, many things can go wrong, especially when someone brings their
own Python code or something else.
So I will be super excited to see how this works, because it is a great opportunity,
actually, to open this platform to all these different people to go and write code for that
without having to know deeply about all the intricacies of
these distributed systems. I think it is
super important if we would like to
bring more people into the streaming
ecosystem.
It's also a very hard problem. From an engineering
perspective, it's a very hard problem.
I'd love to see
how you do that as
you are materializing your vision.
I'll drop this nugget, which is that basically the VM ecosystem
and the ability to snapshot what processes are doing
has gotten a lot better.
And you can think about incorporating that
into an overall streaming transactional flow.
Yeah, that's interesting.
I mean, I was looking a lot into these
micro-VM
kinds of systems, like Firecracker
and some others that
are really good at
doing this snapshotting and
even moving
these snapshots around.
My question always was,
okay, getting a snapshot
of the VM is one thing, but if this VM also, I don't know, has a couple of gigabytes there in its memory that need to be snapshotted, it will take time, right?
Like it's latency that is added there.
Like you can't like avoid that.
So it's very interesting to see how, in the context of streaming processing, this might be different than, let's say, an OLAP system, where you might have a large part of the state being in a node and how to move that around.
But anyway, that's a very interesting conversation we can have at some point.
You mentioned at the beginning also the term of caching. And you were describing a use case,
where you have all these different systems, you have
Elasticsearch, you have Redis, you can have Pinecone, you can have Redshift, whatever.
And using something like Estuary actually to replicate the state of an input, let's
say, system to all these different systems. Now,
it's commonly said that one of the hardest problems is cache invalidation, right? Figuring
out when the state between the two systems is actually not in sync and doing something about
that. And I would assume that each system has different requirements around that, but
how do you see this problem from your point of view, as the system that needs to be
able to support that? What have you seen in the industry out there in terms of
problems around that, and how do you solve the problem?
Yeah. In some respects, this is sort of an organizational problem that companies have to decide, what are the authoritative systems for data sets that we care about? Where are we making
changes? What system are we issuing writes to, essentially? And I'm not sure there's any tool out there,
Estuary included,
that's going to be able to make that decision for you,
but we can build tools
that will support the decisions that you make.
So the primary answers are basically around discovery,
like being able to understand
that there's kind of an interesting concept
that's grown up somewhat in parallel
to what we've been doing, which is the data mesh concept.
And for those who aren't aware of it, data mesh is fundamentally the idea of trying to engender a more self-service, internal ecosystem within a company of different systems where data lives, and democratized access to discover where that
data is, discover the various data products that exist, and use them for different downstream
purposes to make different downstream data products.
And this concept out of ThoughtWorks is kind of very much in parallel to our own thinking
and what we were building at the time.
But it's something we strongly agree with, that a big part of this is, it's one thing to sort of build the pipe that's
going to move data. But if you can't communicate what those pipes are, and what they're doing to
the, you know, the broader cohort of users within your organization, then that's only sort of a
partial solution. So a lot of, you know, part of the answer is basically just building
good tools for people to understand what the data flows are, where data is coming from,
how it was built, like what went into building a data product, where it came from, et cetera.
Yeah. Yeah. I love that you brought up data mesh, because that triggered an
interesting question, I think. So data mesh has a very interesting concept of,
okay, you have the data living in all these different places,
and you should be able to query it and reason about that
in the place where the data is.
So, two approaches, I would say, technically, to realize that.
One is actually replicate the state that is needed between the different systems,
something that I would assume you can do with something like Estuary, right?
I have my production OLTP database, some part of the state there,
I would replicate it to my OLAP system so I can go and do the analytics that I need. But there's also the concept of,
instead of moving the data around, moving the logic around, to do query federation.
Instead of trying to move the data and solve the hard problem of replicating state and reasoning
about the state in a distributed manner, let's just
ask the questions directly at the source, right?
Yeah.
I have my own reasons here to share what are the pros and cons with each one of them, but
I'd love to hear your take on that.
Two immediate thoughts. One is, of course, like if you're talking about your
production database, you might not want your analytical query load going to your production
database. So that's problem number one. But another problem is you might not care about the
data as it is right now. So a really common use case that we see is, you know, because we're doing change data
capture from databases, which means that we have essentially this immutable log of all
the individual changes that have been made to your tables over time.
And that can certainly be used to maintain a materialized view of the current state of
the table.
And that's what we call like a standard materialization.
But there are actually a couple of other use cases that fall out of this too, because another thing
people commonly want is auditing. It's an audit table of what changes have been made to this
operational table over time. So that's something that you can, it's essentially a byproduct of
having an immutable log of all the change events of your database.
So we see customers who do both.
They actually want both a history, but they also want the real-time view in different systems.
And yet another one is sort of point-in-time recovery, where I might care about what the state of this table was last Thursday.
And asking the current table is not going to tell me that,
but I can ask the immutable log of the change events
and I can basically rebuild that view
as it existed at some previous point in time.
So the current table,
if you can just issue a query to the current table, great.
But there's often all these kinds of related data products
that you want that the operational table can't tell you.
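A small Go sketch of that idea, with invented types: the same immutable change log can be replayed up to different cutoffs, so one data set serves the current materialized view, the audit history, and a point-in-time reconstruction. It assumes the log is ordered by time.

```go
package main

import (
	"fmt"
	"time"
)

// Change is one CDC event from the immutable change log.
type Change struct {
	At    time.Time
	Key   string
	Value string // an empty value marks a delete, purely for this sketch
}

// viewAsOf replays the change log up to a cutoff and returns the table
// state as it existed at that moment. Passing time.Now() yields the
// current materialized view; the log itself also doubles as the audit trail.
func viewAsOf(log []Change, cutoff time.Time) map[string]string {
	view := map[string]string{}
	for _, c := range log {
		if c.At.After(cutoff) {
			break // log is ordered by time
		}
		if c.Value == "" {
			delete(view, c.Key)
		} else {
			view[c.Key] = c.Value
		}
	}
	return view
}

func main() {
	day := 24 * time.Hour
	now := time.Now()
	log := []Change{
		{now.Add(-10 * day), "order:1", "pending"},
		{now.Add(-7 * day), "order:1", "shipped"},
		{now.Add(-2 * day), "order:1", ""}, // archived/deleted
		{now.Add(-1 * day), "order:2", "pending"},
	}

	fmt.Println("as of last week:", viewAsOf(log, now.Add(-5*day)))
	fmt.Println("current view:   ", viewAsOf(log, now))
}
```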
Yeah.
Yeah.
Makes a lot of sense.
And by the way, being able to time travel is especially important
for ML reasons.
It's a very common thing, especially with feature engineering, to be honest,
to be able to calculate a
feature at different snapshots in time for the same user. So it arises as a problem
in many different use cases, which makes it super interesting.
All right, cool. So you mentioned CDC, and it was mentioned that in Kafka,
you have the ecosystem there, Kafka together with Debezium, right? From Red Hat.
I always wondered why we haven't seen more attempts in the open source for something like
Debezium, to be honest, or why these things are not even more standardized in a way.
So I'd like to hear your thoughts there.
First of all, do you use something like Debezium?
How do you do it?
We actually do.
Yeah, we tried to, and we were dragged kicking and screaming
into basically building our own implementation from soup to nuts.
Okay.
And there are a couple of reasons why.
So the biggest probably is that our focus is on being able to handle
very large databases and do it efficiently.
So databases that are terabytes in size.
And one of the big challenges with Debezium
and a lot of sort of naive replication or change data capture from a database is that it's very easy to grab the ongoing feed of changes, but you also need the existing rows in your table, because your database is not keeping around
an infinite write-ahead log with all of its change events; it's constantly compacting it.
So you really need to both process the write-ahead log and then also query the table for all of the
existing state in that table. And it's important to do it correctly as well. It's very easy to
implement this incorrectly where you're getting kind of logically inconsistent
results.
So one of the big problems, like the naive solution for change data capture, which
certainly at the time when we were starting and looking at this problem, this is what
Debezium was doing, is basically starting a write-ahead log reader, but not actually
reading it. So you basically just create this write-ahead log reader
and you're sort of pinning the write-ahead log.
And then you start scanning all the table content
of the table that you want to backfill.
And then once you've, you know,
you're basically issuing like a select star for the table.
And then once you've read all of that,
you begin reading that write-ahead log again.
But a major issue when you start getting into larger databases that this has is that, first of all, you have to think about what happens if that select star statement crashes partway through.
So that's one issue.
You might need to reissue that a few times, a bunch of times.
It might have to restart some of that work.
But another even bigger issue is that when you have created that write-ahead log reader,
but you're not actually consuming it, you're not actually reading from it,
what the database is doing under the hood is it's basically saying,
well, someone wants to read this, so I can't delete these write-ahead log segments.
And what commonly happens is that people's disks fill up on their production database,
which is not good.
So one of the goals we had from very early on was we want to do really incremental backfills where we are always reading the write-ahead log and never putting the production database
at risk while also being capable of doing many multi-terabyte backfills.
So there's a paper out of
Netflix, actually, called DbLog.
And we essentially
kind of looked at that and
had some ideas from it and ran with it.
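Here is a heavily simplified Go sketch of that incremental-backfill shape; it is inspired by the interleaving idea rather than being the DBLog algorithm itself (the paper uses watermarks written into the log, where this toy version just tracks keys it has already seen), and the Source interface is invented for illustration. The write-ahead log is drained continuously, so nothing is ever pinned, while the existing table is scanned in small primary-key chunks in between.

```go
package main

import "fmt"

// WALEvent is one change pulled from the write-ahead log.
type WALEvent struct {
	Key   int
	Value string
}

// Source is a stand-in for the database: it serves WAL batches and small
// primary-key-ordered chunks of the existing table. Hypothetical interface.
type Source interface {
	PollWAL() []WALEvent                      // always drained, so the DB can recycle WAL segments
	ScanChunk(afterKey, limit int) []WALEvent // next slice of existing rows
}

func capture(src Source, emit func(WALEvent)) {
	seen := map[int]bool{} // keys already emitted from the WAL during backfill
	afterKey := 0
	for {
		// 1. Keep the WAL drained before every chunk; never pin it.
		for _, ev := range src.PollWAL() {
			seen[ev.Key] = true
			emit(ev)
		}
		// 2. Backfill the next small chunk of pre-existing rows.
		chunk := src.ScanChunk(afterKey, 2)
		if len(chunk) == 0 {
			return // backfill complete; a real capture would keep tailing the WAL
		}
		for _, row := range chunk {
			if !seen[row.Key] { // a newer WAL event supersedes the scanned row
				emit(row)
			}
			afterKey = row.Key
		}
	}
}

// --- toy source so the sketch runs ---

type toyDB struct {
	rows []WALEvent
	wal  [][]WALEvent
}

func (db *toyDB) PollWAL() []WALEvent {
	if len(db.wal) == 0 {
		return nil
	}
	batch := db.wal[0]
	db.wal = db.wal[1:]
	return batch
}

func (db *toyDB) ScanChunk(afterKey, limit int) []WALEvent {
	var out []WALEvent
	for _, r := range db.rows {
		if r.Key > afterKey && len(out) < limit {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	db := &toyDB{
		rows: []WALEvent{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}},
		wal:  [][]WALEvent{{{2, "b-updated"}}, {{5, "e-inserted"}}},
	}
	capture(db, func(ev WALEvent) { fmt.Printf("emit %d=%s\n", ev.Key, ev.Value) })
}
```

Because each chunk is small and the position is durable, a crash mid-backfill only restarts the current chunk rather than the whole table scan.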
Yeah, that's very interesting
because we
had the author of that
at the podcast at some point.
And I think the Bayesian,
in a more recent version,
they have implemented something similar
with the watermarking that they do there
so you can do it in parallel.
I think it was again inspired by the same publication.
But that would have been my next question, to be honest,
because I've seen that in practice.
And it is a problem with how CDC has been implemented
in systems like with Fivetran, for example.
Because you have these huge databases
that we need to replicate the state from there
before you can start consuming the change log.
And it's always a hard problem to solve.
So, okay.
Is this something that's open source?
Is it part of Estuary,
or is it a separate project?
How is the CDC part being handled?
Yeah.
Yeah. So we model Estuary as essentially two things: it's the platform itself for running these data flows, and then it's the connectors, which are
really applications that sit on top of the platform for talking to different systems for
doing things like CDC and doing materializations which are sort of pushing data into systems and
keeping it up to date there, as well as transformations.
So the fundamental architecture is that we have the platform, which is sort of facilitating
data movement between these pieces.
And then these applications, which are basically Docker images that speak a protocol over standard
in and standard out for doing sort of streaming capture and streaming materialization. So the short answer is yes:
these are source available on GitHub. We package them as Docker images that
speak a capture protocol. We ended up, kind of kicking and screaming again,
designing our own protocol for captures and materializations of data. So that part is a little bit bespoke,
but it's a fairly accessible protocol
and they are source available.
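To give a feel for the shape of that design, here is a minimal Go sketch of a connector-like program; the JSON message format and field names are made up for illustration and are not the actual capture protocol. It reads requests as JSON lines on standard input and writes captured documents as JSON lines on standard output, which is the kind of program that can be packaged as a Docker image.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// Request is a made-up control message the platform might send on stdin.
type Request struct {
	Action string `json:"action"` // e.g. "capture"
	Stream string `json:"stream"` // which collection/table to capture
}

// Document is a made-up captured record the connector writes to stdout.
type Document struct {
	Stream string            `json:"stream"`
	Fields map[string]string `json:"fields"`
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := json.NewEncoder(os.Stdout)

	for in.Scan() {
		var req Request
		if err := json.Unmarshal(in.Bytes(), &req); err != nil {
			fmt.Fprintln(os.Stderr, "bad request:", err)
			continue
		}
		if req.Action != "capture" {
			continue
		}
		// A real connector would talk to the external system here (a
		// database WAL, a SaaS API, ...). We emit one canned document.
		_ = out.Encode(Document{
			Stream: req.Stream,
			Fields: map[string]string{"id": "1", "status": "hello"},
		})
	}
}
```

Fed a line like {"action":"capture","stream":"orders"} on stdin, it emits one JSON document per request on stdout.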
Okay, that's super interesting.
Okay, one last question from me,
and then I'll give the microphone back to Eric.
You mentioned materialization and streaming processing.
In the streaming landscape, there's also this new model of computation that's based on timely or differential data flow.
I think the most known product using this kind of approach is materialized.
How is it different than using something like Estuary, right?
If you can go and do, let's say, this incremental materialization.
And is there space for both solutions there?
Different use cases?
Help us understand. Again,
I'm asking these questions because I feel like the streaming landscape is always
complex. There are some very heavy buzzwords, like differential data flow, and,
yeah, you know, not everyone is a systems
engineer that wants to get that deep
into how things work, right?
But they do have to interact with
these systems, and it is important for them to understand
why they exist, why
they are out there, and what we are trying to solve.
So I'd love to hear from you
if there is
some competitive element there or
they are just like
complementary solutions at the end.
Yeah, we actually work with Materialize a little bit at least.
We're trying to do so more and more from a team standpoint.
Materialize does not focus on connectors.
They focus on streaming SQL
and making that a really first-class thing.
And also trying to up-level it and making it exactly the same protocol as standard SQL.
That's not what we're doing, right?
That's a very different problem than what we're doing.
We're more focused on getting data from your systems, acting as the pipes that can kind of get into other places.
So getting it into Materialize is a very, you know,
core thing that we want to be doing with them.
And so at a very high level,
we don't see ourselves as competitors at all.
But I'm sure Johnny will go into technical details
on the differences between differential data flow
and our specific processing model for SQL, streaming SQL.
Yeah, and I'd say kind of my quip answer is like, Materialize is a really cool system for sort of
streaming SQL-based compute and getting answers out of it. And then Flow you can look at as a
system that makes all of your other databases real time. So in some respects, you can think of Materialize, and another company that comes to mind is
RisingWave,
which occupies a fairly similar
position in the space.
These are streaming compute engines for doing streaming transformation based
around SQL. From a product philosophy perspective, our view is more is better,
you know, we kind of want to meet users where they are, where they want to be, in
terms of using whatever sort of streaming compute they would like for computing data products.
So another example of this is a company called Bytewax, which has taken the differential data flow library and wrapped it with Python bindings to make it something that you can write Python programs around.
Like, well, we'd like to be able to offer that as a way of doing transformation using Flow.
So I think there are a lot of interesting different solutions.
And I think this stuff is going to
play out for quite a while because there is no silver bullet for streaming. It's a legitimately
very hard problem, and they've got some really interesting takes on how to solve it.
It's a great answer, to be honest. I think it would be nice at some point to have a
couple of people from the streaming processing space, who are working on different
ways of solving the problem, have a conversation all together, because
I think it would be a service, let's say, to the industry. I'd like to help people understand why all these products and technologies are built,
and why in the end it's so confusing.
There are good reasons.
No one is trying to confuse anyone here.
Actually, it's the opposite.
It's just a very hard problem.
The term real-time, many times, the semantics of real time are so different depending on who you ask.
Right.
That makes things hard.
So anyway, I'd love to have this kind of panel discussion at some point.
Maybe we should also reach out to the folks at Materialize and try to get you all together and discuss these things.
I think it would be super interesting.
But I have to stop now. So, Eric, the microphone is yours again.
Yeah, just to follow on, though, I really appreciate that perspective. I think it's really easy to look at similar technologies
and, you know, I don't know, just naturally,
I think there can be defensiveness
or disagreements about different ways of doing things.
But I think, at least from my perspective,
I totally agree more is better, right?
I mean, this type of data flow is something
that we want to be more broadly accepted
by a much larger portion of the market.
Which actually brings me to one of my final questions,
which is, Dave, you mentioned something earlier,
and I'm interested in both of your perspectives on this.
You know, if you want to do real-time,
you have sort of this set of options. If you want to do,
you know, sort of batch, you have sort of the Fivetran, the traditional batch ETL flow.
What is keeping companies from just moving to this kind of, let's say, streaming ETL, real-time ETL?
Honestly, nothing anymore.
In a lot of ways, we're working with companies that don't necessarily even have real-time problems.
They're just using us for analytics
because we've made it as simple as working with a batch system
to execute a data flow.
And that's our goal, really, to kind of take the gap away.
Of course, things get more complicated when you start doing streaming
transformations.
But if you're just doing effectively
point-to-point
or even one-point-to-many-points
data transfer
solution, they really
are as easy as each other at this
point. So I don't think there's really
massive differences
from the UX there.
There used to be, right?
Like most of the times when you look at a streaming system,
you're having to do stuff like manage your own schema evolution.
So if you really want to bridge the gap,
you have to do stuff like that.
That was a core problem for us that we wanted to make sure
that we tackled really well.
But yeah, in general, I think that those are essentially closing.
And one followed, actually, Johnny, I said I wanted both of your perspectives.
So before I have a follow up, your thoughts?
Yeah, sorry, editor, I was wool gathering there a little bit.
So, the question...
And the question is, what is holding companies back from... That's a gross oversimplification. But yeah, I mean, I don't know.
I have in my mind that there are multiple advantages
of moving to more of a streaming architecture,
but why aren't more companies doing it?
Maybe they are and we don't see it.
I think they are and we don't acknowledge it, actually.
So if you are using a tool like Fivetran
and you're using sort of incremental syncs from your database to an analytics warehouse, and you're running that every hour or something like that, guess what?
You're streaming.
That is streaming.
You are moving incremental data from one place to another and avoiding moving the whole data set.
So that is already streaming.
It's just a matter of like, what is the latency involved?
So in some respects, what we're talking about is just a dramatically tighter latency, but
functionally doing the same thing.
So in many respects, streaming is already happening, and people have kind of cobbled
it into batch systems, whether that's
Fivetran or another tool where you're doing sort of an incremental updating resolution of new data
and moving it from one place to another. And then, you know, you have tools like dbt where
you're building incremental models where you're using a cursor to figure out
how to go update some downstream data product based on just the new data.
So people are already doing this; it already exists in the ecosystem. And in some respects,
we're just talking about making it faster and simpler and tightening the latency that's available.
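For concreteness, a minimal Go sketch of that cursor pattern; the table, the cursor column, and the destination callback are made up. Each scheduled run asks the source only for rows whose cursor value is newer than the last one persisted, which is batch tooling quietly doing streaming with a wide latency.

```go
package main

import "fmt"

// Row is one record in the hypothetical source table.
type Row struct {
	ID        int
	UpdatedAt int64 // e.g. a unix timestamp column used as the cursor
}

// syncIncrement pulls only rows whose cursor column is newer than the last
// value we persisted, applies them downstream, and returns the new cursor.
func syncIncrement(source []Row, cursor int64, apply func(Row)) int64 {
	next := cursor
	for _, r := range source {
		if r.UpdatedAt > cursor { // newer than the last persisted cursor
			apply(r)
			if r.UpdatedAt > next {
				next = r.UpdatedAt
			}
		}
	}
	return next
}

func main() {
	table := []Row{{1, 100}, {2, 150}, {3, 200}}
	var cursor int64 // would be loaded from durable state between runs

	// First scheduled run: everything is new.
	cursor = syncIncrement(table, cursor, func(r Row) { fmt.Println("upsert", r) })

	// A later run after one row changed: only the delta moves.
	table[1].UpdatedAt = 250
	cursor = syncIncrement(table, cursor, func(r Row) { fmt.Println("upsert", r) })

	fmt.Println("cursor is now", cursor)
}
```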
One more add-on to that, if you don't mind.
Oh yeah, go for it.
Yeah, I think that classically people have been burned by streaming, right? So you'll talk to a whole group of practitioners who hear the
word streaming and they say, oh no, I don't want that. That's something that we found
out when we first started the business, and a lot of the time we won't lead with streaming anymore.
We'll lead with incremental data, low latency, and that actually opens it up to a whole group of people who wouldn't otherwise be interested, right?
If you're doing analysis, a lot of the times you're going to think streaming is very expensive because classically it used to be.
It's not anymore, but that's something that we definitely heard a lot.
Such an interesting point, Dave.
I think there's a lot of Kafka baggage there. And, you know, we talked about some of that earlier where you face this severe trade-off that is actually a pretty
interesting, like, you know, almost like DevOps problem that doesn't even relate to data, you
know, like the difficulty of operating a system. I've also heard a lot of companies try to move to
event-driven architecture in terms of the way that they're building their stuff. And that can take you down a rabbit hole. One question I do have though, and Brooks,
I'm sorry I'm going long, but this is me punishing Costas by asking more interesting
questions that he's not here to enjoy. Because Costas had to drop off for another meeting.
For our listeners, that's the context. But how does this impact orchestration? Because when you think
about streaming, one of the interesting things when you come from a world where, let's say you're
running multiple batch jobs, you have multiple downstream dependencies. And just because of the
nature of data, you require an orchestration level because
these jobs take different times to run. They need to be sequenced differently, and the downstream
dependencies obviously are impacted. But when you move to a real-time streaming setup,
at least on the surface, it seems like it solves some of the chronological
orchestration problems.
Yeah, for sure.
I get asked routinely when I'm demoing our products, how do I orchestrate this?
And the answer is you don't.
Your data, when it's available, it goes there.
And that's how it works.
But yeah, it's something that takes a little while to understand.
And it's also complex when it comes to transformations,
because if you're thinking about transformations in terms of events,
it's a little bit of a different thought process
than it is in terms of full data sets that you're transforming.
Makes total sense.
Johnny, anything to add on the orchestration side?
Yeah, that's fundamentally right.
It's not necessarily like either or.
If you're in our world and products,
you don't orchestrate it, as Dave said.
Users will sometimes use us to do some pre-transformation
to sort of reduce their cost within the analytics warehouse.
Like they still want to work with it in Snowflake,
but they're using us to do some deduplication
or to pre-join some data
that they commonly kind of want to access
in a joined fashion.
And then they might process it
from there downstream with dbt.
So it's, you know, you can marry these together
and there are quite good reasons to do that.
Yeah, makes total sense. Okay, last question. Sorry for going long, Brooks.
Where did the name estuary come from? I mean, I know from a
geographical standpoint, what it means, but is there any relation?
Yeah. Fundamentally, you know, we're in streaming data, even if we've kind of de-emphasized that and no longer lead with streaming
quite so much. But evocatively, sort of where streams meet, that is what an estuary is. It's
where sort of the ocean and rivers are meeting together. So there's some evocative, you know,
evocative imagery that we're going after there. Otherwise, it's a name.
I love it. You have your real-time data, and you have your data lake.
The ocean is where it's all done.
We are effectively a real-time data lake.
That's what Gazette enables.
That's what the idea was.
It's lost on a lot of people.
I like it.
It's a calming, you know, like an estuary seems like a place
you want to be.
I like it.
Cool.
Well, thank you
for letting me
throw a couple
of additional questions
at you to satisfy
my curiosity
around orchestration
and would love
to have you
on a streaming panel
and have you back
on the show.
Awesome.
Thanks for having us.
Yeah, thanks for having us.
We hope you enjoyed
this episode
of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.