The Data Stack Show - Data Council Week (Ep 1) - The Evolution of Stream Processing With Eric Sammer of Decodable

Episode Date: April 23, 2023

Highlights from this week’s conversation include:
Eric’s journey to becoming CEO of Decodable (0:20)
Does real time matter? (2:12)
Differences in stream processing systems (7:57)
Processing in motion (13:04)
Why haven’t there been more open source projects around CDC? (20:34)
The Decodable experience and future focuses for the company (24:31)
Stream processing and data lakes (32:54)
Data flow processing technologies of today (39:01)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. All right, we are in person at Data Council Austin, and we are able to sit down with Eric Sammer. He's the CEO at Decodable. I'm Brooks. I'm filling in for Eric. He got in a biking accident. He's fine, but wasn't able to make
Starting point is 00:00:39 it to the conference. So I'm coming out from behind the curtain here and excited to chat with Eric. We've got Costas here, of course. But Eric, to get started, could you just kind of give us your background and what led to you now kind of becoming CEO at Decodable? Yeah, absolutely. First of all, thanks for having me. It's always a pleasure to get a chance to talk about this stuff. Yeah, I mean, so my background, you know, I've been doing this now for 25-something years, really focusing around data infrastructure. So I lovingly refer to myself as an infrastructure monkey: while people are doing fancy math and cool stuff with data, I'm moving bytes around and, you know, flipping zeros to ones. So I spent a lot of time working on things
Starting point is 00:01:23 like SQL query engines and stream processing infrastructure, which has really taken up the last decade or so of my life. I built a bunch of systems internally for mostly marketing and advertising applications. And then sometime around 2010, late 2009, 2010, I wound up being an early employee at Cloudera and spent like four years working on sort of the first generation of big data stuff, and then wound up creating a company that eventually was acquired by Splunk, and spent a bunch of time there working on real-time infrastructure, stream processing, and just cloud platforms in general for observability data.
Starting point is 00:02:08 And then about two years ago, broke out and started Decodable, which is a stream processing platform as a cloud service. We could get into the details if that's interesting. Really focused around being able to acquire data from all of the fun and interesting sources, process that data in real time and get it into all the right destination systems in the format that's like ready for use in analytics. Cool. One thing we were chatting about just before we hit record that you kind of brought up is just the idea of does real time matter? Could you unpack that for us and just kind of talk about what you mean there? Yeah. I know there are different camps who would probably argue different things. Yeah, I mean, you know, I think there is a segment of the market who,
Starting point is 00:03:00 I mean, people probably break down into three groups. There are people who are very sure of what they get out of real-time data. And by real-time specifically, I mean low-latency, sub-second availability of data, either for analytics or for driving online applications and systems and those kinds of things. There's one group who fully understand it, know exactly what they're talking about, and have a strong opinion about it. There's a group of people who say, well, it depends on the use case. And like some use cases demand real time, and some use cases don't. And then there's a third group of people who say nothing really matters.
Starting point is 00:03:43 Like real time is never important and those kinds of things. And, you know, I think like, you know, selection bias, of course, but like we talk to the second and third group, you know, first and second group, sorry, you know, most of the time. And I would say like the biggest thing that we hear from some people is, you know, my use case doesn't require real time. And like the interesting thing there is that like at some level, I don't disagree. The thing I would point out is that like, if you asked three years ago, whether or not you needed to know exactly whether or not your food had been picked up
Starting point is 00:04:20 from the restaurant and where it was in between the restaurant and your house, everybody would have gone, who really cares? And then COVID hit, and now everybody fully expects up-to-the-second visibility into where their fried chicken is, right? So what winds up happening is the use case, I would argue, doesn't require real time until someone decides to do it and changes the expectation. And I think companies like Grubhub or Netflix
Starting point is 00:04:53 or YouTube content recommendations or any of these other things have changed the expectations, and as a result it now either saves money or generates revenue. And, like, one go-to use case for me is, I don't know about you guys, but I don't have a lot of loyalty to retailers around certain things. So if I need, you know, I can't think of a good example,
Starting point is 00:05:26 if I need a mop, I don't care where I buy it. I care that it's in stock, and I care who can get it to me fastest. And if that's, hypothetically, Amazon, Walmart, Target, you know what I mean, I will get it from any one of them. So I care about inventory being up to date. I care about who has the lowest price. And all these things are things that are responsive to inventory arriving at a loading dock, or dynamic pricing logic to adjust prices based on competitive sales, and those kinds of things. So my argument is everything's real time, either in potentia or something that winds up being real time because a competitor has driven it in that direction. And, you know, I'm sort of interested if you guys agree with that or not, but that's my take on the world. Yeah, that's great. Do you agree? Yeah, I do agree.
Starting point is 00:06:32 I mean, I think there is a reason we have all this infrastructure out there and all this technology being built. I don't think it's just because, like, you know, geeks want to, you know, have the equivalent of a fast car, so they build Kafka, right? Like, something similar. But at the same time, yeah, I think the problem with real time is that real time is a very relative term, and, like, the semantics vary a lot. So if you talk, like, to a marketer, like, what is real time, and you talk to someone who is responsible for fraud detection, you're probably going to get a bit of a difference.
Starting point is 00:07:06 Yes. Not only the definition of what real-time is, but also the importance of real-time, right? Yeah, like if my campaign, let's say, runs like five minutes later, eh, okay, probably nothing will happen. Although I will probably be frustrated because I have to, right? But if someone gets, I don't know, like a report on fraud a day after, that's not fun, right?
Starting point is 00:07:30 But let's talk a little bit more about the technology, right? I remember some of the first, like, real-time processing, let's say, pieces of infrastructure that came out of Twitter. It was a definition of real-time, right? Yep. Back then. You had technologies like Samza. What was the name of the Twitter one?
Starting point is 00:07:55 They had a platform. Yeah, so LinkedIn had Samza. Twitter had initially Storm. Storm, yeah. And then they built Heron, which was another one. And then there was Spark streaming that came out of the Spark ecosystem and Apache Flink. And so there's been a couple of these things
Starting point is 00:08:14 And I want to talk about this and also compare it with something like Kafka, to understand what's the difference between Kafka and a system like Flink or Samza, right? Yeah, absolutely. So I think, like, let's pick apart sort of Kafka just for a second. Kafka, really, there are four main components or projects that people talk about when they talk about Kafka, maybe even five. One is the actual messaging broker itself, right? And that's the part that I think of as, like, Kafka. Then there's KStreams, which was the Java library for actually doing stream processing.
Starting point is 00:08:57 And KSQL, which was the SQL interface built on top of KStreams. Then there's Kafka Connect, which was the connector layer. And then there was the schema registry. You know, some of these things are under Apache licenses. Some of these things are under the Confluent Community License, if I remember correctly. But, you know, so when I think about Kafka, I think of the broker proper, which is really just about PubSub messaging or eventing, you know, so like really about just the transport of data, with no real processing capabilities beyond just moving it from A to B. And so I think that the processing systems that we're talking about, Storm, Samza, KStreams, KSQL, Flink, which is the one that I'm probably most familiar with. That's what we're based on at Decodable.
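A minimal Java sketch of the broker-level picture described here: a plain Kafka producer and consumer that just move records from A to B, with no processing in between. The broker address, topic, and payload are invented for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KafkaTransportOnly {
    public static void main(String[] args) {
        // Producer: publish an event to a topic. The broker just stores and forwards it.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"sku\":\"mop\",\"qty\":1}"));
        }

        // Consumer: read the same bytes back. Any transformation is the application's job,
        // which is where frameworks like Flink, KStreams, etc. come into play.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("orders"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
        }
    }
}
```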
Starting point is 00:09:48 You know, various other systems like that run on top of those Kafka topics, right? And many of them support not just Kafka, but Kafka-like systems, including some of the cloud provider stuff like Kinesis and GCP PubSub and those kinds of things. Okay, we have, like, the processing, right? I would argue, let's say, and let's forget KSQL, KStreams and all that stuff. Okay, I have a producer, I have a consumer, I can write business logic there, I can do processing on top of Kafka. What's the difference between that and having a system like Flink? Yeah. So in general, you could argue that anything that writes to a Kafka topic or reads
Starting point is 00:10:33 is effectively doing stream processing at some level. It might just be doing minimal transformation. It might be doing sophisticated transformation, those kinds of things. I think that the difference is really, like, the stream processing frameworks are just that. They're frameworks, right? So they're going to give you a bunch of capabilities, including an execution engine, typically, that's optimized and sort of understands things like predicates and aggregation operations and window functions and all these other kinds of things. They typically also understand schema and event serialization and deserialization. They typically understand state management: where am I in the stream? What happens when I fail, and how do I recover to achieve either at-least-once or exactly-once processing of data, you know, getting rid of duplicates, those kinds of things, or not producing them to begin with. And also some higher order
Starting point is 00:11:31 concepts like a notion of event time and watermarking and all of these other sort of more sophisticated things that help achieve correctness in processing data. So, you know, in that sense, you should think about stream processing systems the same way you would think about a database, in the sense that, not that they necessarily work the same way, but that rather than just have files on disk and reinvent Postgres on top of that, it behooves you to take advantage of the fact that people have put in a lot of work to get the correctness and the processing and that kind of stuff.
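A rough Flink DataStream sketch of the capabilities listed above (event time, watermarks, windows, and checkpointed state), counting clicks per user per minute from a Kafka topic. The topic, record format, and class names are assumptions for illustration, not Decodable's implementation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class ClicksPerMinute {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Periodic checkpoints back the at-least-once / exactly-once recovery story.
        env.enableCheckpointing(60_000);

        // Source: a hypothetical Kafka topic of "userId,epochMillis" strings.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("clicks")
                .setGroupId("clicks-per-minute")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source,
                        // Event time + watermarks: tolerate events up to 5 seconds out of order.
                        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((line, recordTs) -> Long.parseLong(line.split(",")[1])),
                        "clicks-source")
                .map(line -> line.split(",")[0])                        // keep just the user id
                .keyBy(userId -> userId)                                // keyed, fault-tolerant state
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))  // one-minute event-time windows
                .aggregate(new CountPerWindow())
                .print();

        env.execute("clicks-per-minute");
    }

    /** Counts the records that fall into each window. */
    public static class CountPerWindow implements AggregateFunction<String, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(String value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }
}
```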
Starting point is 00:12:11 Does that make sense? It makes absolutely sense. But I have a follow-up question on that. The way that you describe it is how I visualize it. I have data in motion, and I'm applying aggregation, any kind of data processing, as the data is still in motion. A couple of years ago, let's say after 2015 or so, we started hearing a lot about the concept of ELT instead of ETL. Because what you are describing sounds more like ETL, right? You extract the data.
Starting point is 00:12:46 Somehow the data is, like, still in motion. Like, I transform the data, and then, like, I'm going to do something with whatever I produce from there, right? Yeah. But then we had this whole concept of, like, you don't have to do that anymore. Let's just extract and load the data, and after the data is loaded you can go, with extreme scale, and process the data. Okay, assuming, let's say, I have Kafka there, the latencies are low, theoretically at least, so I can get close, let's say, to real time. And in some cases, let's say I have something like Pinot or ClickHouse, I can have real
Starting point is 00:13:27 time. Yes. Okay. So what's the difference there? Why do we still need to have these complicated systems? Because they are complicated, right? Like, Samza is not, like, the easiest thing to go and operate and do this processing in motion. Yeah. I mean, this is a really interesting question. And I think it's a philosophical debate. So, you know, you're right if you look at this through the lens of being, for instance, like a Snowflake user. Like, from your perspective, you have many sources of data.
Starting point is 00:14:03 You want to get them into Snowflake. You want to do your processing there. And why on earth would you ever do any kind of transformation? So a couple of things. One that comes up, and a lot of the quote-unquote ELT tools do this under the hood, they are actually doing things like mapping data types. They are doing processing, but it's de minimis processing. It's not business logic processing and someone explained
Starting point is 00:14:27 it to me is that the thing about ELT is that it's not actually that it doesn't do transformation, it does; it's that the majority of the business logic is pushed to the target system. And that definition made sense to me. So it's actually E-T-L-T. Yes. Right? You know, there's two Ts in there, which is okay. Where it becomes interesting, so a couple things: one, you're actually using your costly CPU to do the processing if you do that.
Starting point is 00:14:58 You know, there's latency characteristics and those kinds of things. But I actually think that the more interesting angle on this is that if you zoom out and you think about other places that data wants to go, you start to go, like, okay, so it's going to go to S3. It's also going to go to Snowflake. It's also going to go to ClickHouse or Pinot or Rockset or wherever it's going to wind up going, Druid. It's also going to go back into operational systems like Elasticsearch, so that you can provide online product search, or Algolia, or
Starting point is 00:15:31 like whatever people are using these days. It might also get cooked in various ways and go to a bunch of microservices. And so it's not so much that you want to push all of your business logic in the world into the stream; it's that you want to have the capabilities to do impedance matching between all those systems. Some of them aren't allowed to have PII data. Some of them don't want certain records. Some of them need quality fixed before it lands in those systems where you can't do updates and mutations and those kinds of things. And so I think I would think about stream processing the way you think about, I use networking as an example, but like packet mangling on a network. You know, stream processing is the equivalent of your load balancer, right? It allows you to do some amount of processing before the packets land in the target system.
Starting point is 00:16:29 And I think when you think about it from a holistic perspective, you kind of go like, oh, then it actually makes sense because you're not tightly binding the schemas and the structure of the data between the source system and the target system. And one of the biggest challenges that I hear is that if you're doing ELT into a system like Snowflake and somebody makes a schema incompatible change, you've broken your target system. And you're very tightly coupled to those operational systems. So I think that when you start talking about data contracts and larger organizations, like being able to do these things and pave over those problems, I think stream processing is one way you can start to cut into that. Yeah, 100%.
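As a sketch of the impedance-matching idea, masking PII and dropping unwanted records in the stream before they land in a target system, here is what a small Flink SQL job might look like when expressed through Flink's Java TableEnvironment. The topics, fields, and sink configuration are invented for the example.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ImpedanceMatchingJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source: raw user events from Kafka (topic, fields, and schema are invented).
        tEnv.executeSql(
                "CREATE TABLE user_events (" +
                "  user_id STRING," +
                "  email   STRING," +
                "  event   STRING," +
                "  ts      TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'user_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'," +
                "  'scan.startup.mode' = 'earliest-offset'" +
                ")");

        // Sink: a downstream system that must never see raw PII
        // (assumes the Elasticsearch SQL connector is on the classpath).
        tEnv.executeSql(
                "CREATE TABLE search_index (" +
                "  user_id    STRING," +
                "  email_hash STRING," +
                "  event      STRING," +
                "  ts         TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'elasticsearch-7'," +
                "  'hosts' = 'http://localhost:9200'," +
                "  'index' = 'user_events'" +
                ")");

        // The impedance matching: drop records the target doesn't want, mask PII in flight.
        tEnv.executeSql(
                "INSERT INTO search_index " +
                "SELECT user_id, SHA256(email) AS email_hash, event, ts " +
                "FROM user_events " +
                "WHERE event <> 'internal_test'");
    }
}
```

The same source table could feed several INSERT statements, one per destination, each with its own filtering and masking, which is the "many destinations, one stream" shape being described.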
Starting point is 00:17:15 And I think I asked this question not because, like, I totally agree with you. And I get also, like, why people might wonder about these things. Sure. And I think there's always like a huge gap between what theoretically can be achieved and what like in practice is happening. Right. And usually that's okay. That's where engineering comes from. Right.
Starting point is 00:17:40 Like, that's why we need engineering. That's why we engineer these systems. Right. Because there are always, like, trade-offs. And each one of them has, like, unique trade-offs. Like, yeah, sure. Why not just use only ClickHouse, right?
Starting point is 00:17:55 And do everything. Theoretically, you should be able to do it. Yeah. Have you tried to do a lot of joins there, for example? Or how is it to change the schema on something like Pinot?
Starting point is 00:18:21 There are always trade-offs, and that's why there is, at the end, wisdom in the industry. It's not like these things are just because crazy VCs and founders want to push their agendas and build stuff. That's what I'm usually saying. But I would like to go one step before that, the processing. Because I know that another important component of Decodable has to do with CDC. And CDC is one of these interesting things that everyone kind of talks about, says it's important, it's a very good idea, all these things. At the same time, if you think about it, outside of Debezium, I don't think there's any other mature, at least, framework to attach to an operational database, like an OLTP database, and turn it into a stream, right? I think not in the open
Starting point is 00:19:16 source world, certainly. There's a bunch of commercial systems that have been around for a very long time in sort of various forms. You know, I think GoldenGate is probably one of the more well-known. HVR, which was acquired by Fivetran, does this kind of thing. So there's those kinds of things. But I think in the open source world, I actually don't have a great sense of, like, Airbyte adoption these days. And I think Airbyte actually uses Debezium. It does. Yeah.
Starting point is 00:19:43 Okay. So Debezium is the one that I know best, and we know best at Decodable, because again, we're based on parts of that. But I think you're right. I think, you know, one thing that is interesting is that it's not just about lower latency to getting the changes, but there's this whole host of applications, especially, like, on the operational side of the business versus the analytical side of the business
Starting point is 00:20:13 that can use change data capture data as, effectively, triggers to kick off a bunch of really interesting stuff. You know, we were talking earlier about inventory: it gets updated, and maybe you want to make only things that are in stock searchable. Yeah. You know, and you want to play with search relevance, for instance for an e-commerce site, based on inventory. So that's the kind of thing. Or marketing campaigns. When PlayStation 5s come back in stock, I want to alert everybody who has one on their wish list. Right?
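A hedged sketch of the CDC-as-trigger pattern using Debezium's embedded engine (no Kafka Connect involved), watching a hypothetical Postgres inventory table and reacting to each change event. The connection details, table name, and reaction logic are all assumptions for illustration; a real deployment needs the debezium-embedded and Postgres connector dependencies on the classpath.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InventoryCdcTrigger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "inventory-cdc");
        // Postgres-specific connector and log-reading plugin; other databases need
        // their own connector classes and settings.
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("plugin.name", "pgoutput");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "repl");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "shop");
        props.setProperty("topic.prefix", "shop");
        props.setProperty("table.include.list", "public.inventory");
        // The engine keeps its own offsets, since there is no Kafka Connect here.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/inventory-offsets.dat");

        DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        .notifying(event -> {
                            // Each event is a JSON change record; treat it like a trigger,
                            // e.g. re-index the product or notify wish-list subscribers.
                            System.out.println("change: " + event.value());
                        })
                        .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}
```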
Starting point is 00:20:52 Like, those are the kinds of things that I think we can enable with CDC beyond just database replication, which is a core use case, of course. Yeah. Why do you think, like, we haven't seen more open source projects around CDC? Because it's really hard. Because every database system implements exposure of the bin log and the transaction log a different way. And some of them don't have, there really hasn't been a single good way of exposing this. So Postgres, MySQL, Oracle, Mongo, they all have like just different database level
Starting point is 00:21:30 specific substrate, you know, substrates for those kinds of things. And I think it just takes a special kind of person to commit themselves to going and solving that kind of problem. You know, we are very lucky to have Gunnar Morling, who was the project lead for Debezium at Red Hat for a long time, at Decodable. So, like, Gunnar spends a lot of
Starting point is 00:21:51 time thinking about these kinds of things, to his credit. But it's, I don't want to say it's thankless, because I think people appreciate it, but it is really hard. You know, it is really hard. Yeah, it makes sense. And, like, one thing that I always found interesting, both in a good and in a bad way, about Debezium is, let's say, its attachment to Kafka, right? It is a project that, I mean, technically, like, you have, in a way, to run it with Kafka Connect, at least. Like, the moment you decide to not have Kafka there, you start being, like, very hacky with it, right? Yeah. Do you see, like, I mean, and I'm asking you not because, like, okay, obviously I'm not, like, a committer or anything like that to Debezium, but you work with it, right? Like, it is part of your stack. Like, do you see this changing, and also why?
Starting point is 00:22:47 Like, why does it have to be so attached to Kafka? Yeah, I mean, you're absolutely right. I mean, there's multiple layers there, like, in the implementation. And so, even internally inside of Decodable, we wind up using Debezium without Kafka in certain places, like for certain use cases, more as a library to access certain things. It's definitely tricky, you gotta know the internals, quite frankly. And there is,
Starting point is 00:23:15 again, I'm not an expert in what's happening in the community on this, so, excuse me, please take it with a grain of salt, but my understanding is that there's a long-term feature request inside of the Debezium community to support running without Kafka there. I think this is, like, a trap that open source projects fall into, is that there's always this "well, why don't we make it configurable" thing, which explodes the complexity of these projects pretty significantly. You know, my sense is that the upside is you could potentially remove the Kafka dependency; the downside is that it only makes it more complicated. I mean, this is like a plug, but, you know, one of the things that we focus on
Starting point is 00:24:00 is just making Debezium less complicated, and Flink is part of that for us as well. So, like, if you don't know or care about Flink and Kafka and Debezium, we try and create a platform where you can define a connection to Postgres and get the result in Pinot or in Kafka or in Kinesis or in any other system that we wind up supporting, without having to deal with the guts of this stuff. So to some degree, that's the value, or part of the value, that we deliver there. So, next provocative question. I heard you saying that you have three pieces of technology that you are using as part of Decodable,
Starting point is 00:24:40 like Debezium, Kafka, and Flink. Each one of them is an operational nightmare. Yes, that is not controversial. I'll take that. That is true. Like, whenever I had to deal with any of these, okay, it wasn't fun. Let's put it this way.
Starting point is 00:24:58 Like, you need, like, a very, like, I don't know, like, a special type of person who enjoys working with these things. Yes. So, I'm scared. I don't know like a special type of person who enjoys working with these things yes so I'm scared why would I come to Decodable when I know that like there's like
Starting point is 00:25:12 all these complexity there why I would do that I mean that's what pays the bills at Decodable is that like the people reason people come to us
Starting point is 00:25:20 all this complexity there? Why would I do that? I mean, that's what pays the bills at Decodable, is that, like, the reason people come to us
Starting point is 00:25:48 is because they want the capabilities, but they don't want the operational overhead. And so, you know, Flink alone has a couple hundred configuration parameters, if I remember correctly. It's, yeah, it's sizable. Our goal is to, like, make that disappear. So, like, we try and offer what I think is the right user experience, which is largely serverless. You can give us a connection, you know, a bunch of connection information for your database, or a SQL query,
Starting point is 00:26:24 and you don't have to know that it's Debezium and Flink and all these other kinds of things under the hood. If you do care, we give you the right trapdoors to give us a Flink job, if that's what you want and you don't want to give us SQL or something like that. We'll handle that. But it's funny, because there's just, like, this Goldilocks zone where, like, if it's so complicated, people don't want to adopt the technology at all, no matter how much a vendor paves over it, and if it's so easy, then no one needs us, right? So, like, obviously, you know, that said, I do think
Starting point is 00:26:54 we always want to make it easier, and we do spend some time upstream trying to, like, you know, do some work there to make this stuff easier to use. But the reality is that all of the options, all of the "well, I don't want to use S3 as my state store, I want to use this other thing," and all that pluggability, all that optionality, makes it more like a toolkit for stream processing and less like a solution
Starting point is 00:27:10 I mean, it's a real concern, the complexity of any disaggregated system. I think there's been some good discussions about disaggregation and like the modern data stack and those kinds of things. It generates complexity. Yeah, 100%, 100%. And, okay, I know there have been also some very interesting announcements
Starting point is 00:27:29 about the product lately. You mentioned the modern data stack, and I know that one of these has to do with IT. So would you like to share a little bit more about some interesting things that are happening with the product right now? Yeah. The two kinds of users that we see in decodable are data engineers you know who are ingesting data or sort of like making it ready for ml pipelines and analytics stuff like that and then application developer who are building these more like online applications you know you know real-time
Starting point is 00:28:00 applications same underlying tech stack so for the data engineers you know out real-time applications. Same underlying tech stack. So for the data engineers, you know, out there, what we wanted to do was allow people who know Snowflake, DBT, and Airflow to be productive stream processing people without having to take on the Debezium, Kafka, Flink stack. And so for them, we announced earlier today support for a DBT adapter. We now support DBT. You can use DBT to build your stream processing jobs in SQL with the same tool set and the workflow that you know. And the other thing that we're super excited about that we announced today is support, first-class support for Snowflake's Snowpipe Streaming API. Now, without spinning up a warehouse,
Starting point is 00:28:48 you can ingest data in real time into Snowflake with no S3 bucket configured, with no SNS queues, with no IAM policy stuff. Just tell us what the data warehouse is, the data warehouse name, and we will ingest. And it turns out that that snowflake has made this incredibly cost effective so you're not paying for warehouse time there's a small amount of money you know that you end up paying in terms of credits but it is substantially more cost effective
Starting point is 00:29:19 to ingest data into snowflake and it shows up in real time, which is incredibly interesting. So when you say real time, because the last time that I worked with Snowpipe, I think the end-to-end, and when I say end-to-end, I mean from the moment that the event hits Snowpipe to when you see it on the view, when it gets materialized inside your data warehouse. We're talking about a span of three minutes, two minutes. Is this something that has changed with Snowflake? Yeah.
Starting point is 00:29:52 So this is what the Snowpipe streaming API does. You're actually writing. I don't know the implementation. My understanding is that you're basically running into Snowflake's internal formats there. So you're skipping a lot of the batch load steps. And so we've seen on the order of seconds and even less than a second, I think. So you can actually run select statements and watch records change.
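For reference, the Snowpipe Streaming path described here is exposed through Snowflake's Java ingest SDK, roughly along these lines: rows are written through a client-side channel rather than staged in object storage and batch-loaded. This is a sketch of the documented flow, not verified code; the connection values and table are placeholders, and exact class and property names should be checked against Snowflake's docs.

```java
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class SnowpipeStreamingSketch {
    public static void main(String[] args) throws Exception {
        // Connection properties (placeholders); Snowpipe Streaming authenticates with a key pair.
        Properties props = new Properties();
        props.put("url", "https://myaccount.snowflakecomputing.com:443");
        props.put("user", "INGEST_USER");
        props.put("private_key", "<private key PEM body>");
        props.put("role", "INGEST_ROLE");

        try (SnowflakeStreamingIngestClient client =
                     SnowflakeStreamingIngestClientFactory.builder("demo_client")
                             .setProperties(props).build()) {

            // A channel is opened against a specific table; no warehouse, stage, or COPY job involved.
            SnowflakeStreamingIngestChannel channel = client.openChannel(
                    OpenChannelRequest.builder("orders_channel")
                            .setDBName("ANALYTICS")
                            .setSchemaName("PUBLIC")
                            .setTableName("ORDERS")
                            .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                            .build());

            // Each row becomes queryable in the table within seconds.
            Map<String, Object> row = new HashMap<>();
            row.put("ORDER_ID", 42);
            row.put("STATUS", "PICKED_UP");
            channel.insertRow(row, "offset-42"); // offset token lets the writer resume after failures

            channel.close().get();
        }
    }
}
```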
Starting point is 00:30:19 It's incredible. Yeah, because I think the previous implementation, the first implementation of Snowpipe, was more of a micro-batching architecture, but was still using S3 under the hood to stage the data. But it was obviously optimized in a way to reduce the latency there as much as possible. But again, when you have object storage in there, you add another layer of latency, you cannot avoid that. So that's very interesting, that's something I should definitely also check for myself. I knew that Snowflake was working on
Starting point is 00:30:55 the streaming capabilities that they had, so it's very interesting to see what they've done. And I'm also looking forward to seeing what Spark and Databricks are going to be doing on that. Because I think Spark Streaming, it shows its age. Sorry to say that, but I don't think that anyone really loves working with Spark Streaming. It's very inflexible. It's really hard. So I'm very curious to see what the response is going to be there. Yeah, I, you know, I've had the privilege of working with some of the people who are now working on Spark Streaming, Structured Streaming, yeah, and that was hard for me to say, Structured Streaming, and those kinds of things. I'll say this, I mean,
Starting point is 00:31:43 I won't claim to be an expert on the internals of Spark. I'll say that they have really smart people, they are working on this. You know, we of course are super biased, we think that Flink has effectively won. You know, out of all of the open source projects, it has the most robust and sort of battle-tested stream processing engine. But I'm interested to see what the team at Databricks does. They have a fantastic team over there, and my guess is that it's actually going to be very hard to make the kinds of changes that I think they need to make without breaking anyone who's already using it today, and I think that's gonna create a challenge for, you know, for them. Yeah, definitely. I think
Starting point is 00:32:31 it's going to be interesting, because, like, I think it also touches, like, the fundamental concepts behind, like, Spark itself, on how it has the guarantees it has with, like, micro-batching and all that stuff. So it will be very interesting to see what they come up with. But they will definitely come up with something. Like, for example, the Auto Loader features that they have, like, on Databricks, it's pretty good, actually. Like, okay, it doesn't have the simplicity that Snowflake has, but at the same time it's very robust in both performance and also the capabilities that it gives you.
Starting point is 00:33:09 And, okay, Databricks will always be a more configurable product than Snowflake. They have a completely different product thesis in terms of the user experience, which makes total sense. One other thing I wanted to ask you is about data lakes specifically. And there are two reasons for that. One is because streaming data and data lakes, they make a ton of sense together because usually streams provide the volume of data
Starting point is 00:33:36 that makes having the data lake viable. That's one of the reasons. The other reason is because all the table formats, like Delta, for example, and Iceberg is also working on that, I'm not so sure about Hudi, but I'm pretty sure they also have something similar, there is a concept of CDC there, right? They propagate changes when you do something with a table, and you can have a feed to listen to these changes, which is kind of interesting to see happening in systems that are supposed to be more slow-moving, right?
Starting point is 00:34:11 By definition, in a way. So what's your experience so far in terms of streaming processing and data lakes, both from consuming and pushing data into them? Yeah, you know, I think it's funny, I mean, we say, we use three words: continuous, streaming, and real time. And I actually think that
Starting point is 00:34:33 continuous processing is what Hudi and Iceberg and Delta Lake, and I think Delta Live Tables specifically with Databricks, are all trending towards. And I actually think that's a positive thing, right? It's really about the propagation of change throughout the dependencies, like the downstream processes.
Starting point is 00:34:56 I think that, like, on the whole, this will increasingly remove the need for sort of out-of-band processing on a lot of these kinds of things, which is, I think, a net positive. You know, I think anything that simplifies the lives of data engineers is a good thing. You know, it's just way too complicated to do relatively simple stuff. I think continuous processing and better primitives for continuous processing are going to be seen in the data lake. But I agree that these things are compatible, because again, I view the data lake as one destination for streaming data. And again, that's my bias, because a lot of our customers have these online systems as well that need different cuts of data and stuff like that. So I think that this is actually a natural continuum of this change-based or continuous change-based processing that can now extend into the data lake,
Starting point is 00:35:54 which historically has been immutable, which has always been complicated, even as far back as the Hadoop days with HDFS and stuff like that. Yeah, 100%. All right, and one last question from... So we talked at the beginning of our conversation about the history behind, like, the streaming processing frameworks, right? And they go back, like, pretty much, like,
Starting point is 00:36:17 the same, like, the Hadoop era, right? Now, since then, and, like, this whole, like, big data movement, we've seen companies IPO-ing. Kafka became Confluent, and it IPO'd. Okay, Databricks, let's say they IPO-ed. They haven't, but let's say they did. They're on their way. Yeah. We have Snowflake. There's a lot of, let's say, value created from data-related technologies. But we haven't really seen any of these streaming processing frameworks creating a company that is like the Snowflake or the Kafka or the Confluent of companies out there.
Starting point is 00:37:03 Why do you think this is the case? Especially after seeing the industry investing so much in them, right? Because we are talking about, like, super complex distributed systems that someone has to build, and there are, like, many attempts on them. It's not just Flink. Yeah, I think it's a really good question. I think, you know, a couple of things. Well, one, to their credit, I would say that Confluent has done some of this, right? Like, even though we probably overlap with them a little bit, you know, we'd like to partner with them, I think, more closely than we do sometimes. But, you know, I think to their credit,
Starting point is 00:37:37 they are probably the closest thing to a publicly traded company that is based on that kind of capability, but not a pure stream processing kind of solution. You're absolutely right. I think a couple of things. One, I really think that the use cases have finally caught up to the technology. I think in a lot of cases, even back in 2015, people weren't as bullish on what they could get out of, I don't think that people fully understood what they could get out of, lower latency, you know, sort of higher throughput
Starting point is 00:38:14 data on the processing side. I think people are starting to get that now. When you see, if you're a logistics company, you see Grubhub or you see FedEx, you start to get it. So I think that's been part of it. I also think that the tooling was nowhere near as mature and as sophisticated as it is today. We talked about a bunch of different systems. I think each generation of, at least, the open source stream processing systems, excuse me, have gotten incrementally higher throughput, higher performance, but also just more stable, more correct under failures, easier to reason about.
Starting point is 00:38:54 And, you know, quite frankly, got SQL support. And I think, as much as we malign SQL, people know it and it works and people get it, and I think that gaining SQL support is actually a big accelerator for stream processing in general. That's super cool. Now, okay, one more question. I have to ask that too, sorry, I just thought about it, so yeah, sorry, sorry, I can't not do that. So talking about streaming processing, right? We also have a new family of technologies based on timely dataflow
Starting point is 00:39:33 and the dataflow family of processing out there. Materialize is one of them, but there are more. What's your take on that, and how is this different compared to something like Flink? Yeah, I mean, I think the answer is that there's a lot of differences in implementation, for sure. Timely dataflow and what the Materialize team have done there, I mean, it's actually really interesting and exciting technology. I think that they are tackling the next, at least, you know, from my vantage point, they're trying to make streaming more intuitive by attacking some of
Starting point is 00:40:14 the consistency stuff that tends to crop up in stream processing. I think it's really interesting. I think that tech is probably less, and again, I'm biased, I sort of have to put that on the table as a disclaimer every time I start to say something, but I do think it's less mature than things like Flink and all these other kinds of things. So, warts and all, you know, I think that, like, you know, Flink is incredibly robust and it's sort of well understood in these kinds of cases. But I think anything that pushes stream processing forward is a good thing, and so differential dataflow and timely dataflow are exciting projects. I think it'll be really interesting to see what Materialize and that generation of companies do with that technology. I still think it has a little bit of a ways to go,
Starting point is 00:41:02 but, like, you know, I think that's a place where I'm sure Frank over at Materialize would disagree with me, so it's an interesting conversation to have. And in fact, here at the conference there is a presentation on timely dataflow, so it'll be interesting. Yep, 100%. Alright, so that's all from me for today, so we should conclude before I come up with more questions.
Starting point is 00:41:27 These are great. I love these questions. We have gotten past the buzzer, I think, Kostas, because you have so many great questions. But Eric, before we sign off here, if folks listening want to find out more about Decodable, where should they go? Yeah, they should go to decodable.co
Starting point is 00:41:43 and they can sign up for a free account, get started right away. There's a free tier there that allows people to get up and running with both Flink APIs as well as SQL. Awesome. Eric, thanks so much for coming on. Guys, thank you so much for having me.
Starting point is 00:41:59 It was a real pleasure. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
Starting point is 00:42:14 That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
