The Data Stack Show - 91: The Future of Streaming Data with Stripe, Deephaven, Materialize, and Benthos
Episode Date: June 15, 2022. Highlights from this week’s conversation include: How we should think about batch versus streaming (6:02); Defining “streaming ETL” (9:34); A brief history of streaming processing platforms (22:07); The birth and evolution of Benthos (28:41); What led Jeff to build a new tool (34:29); Why you shouldn’t share all the data (37:23); Making streaming technologies approachable to engineers (42:09); Breaking out of traditional terminology (52:58). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
I love our panel discussions because they bring so many different perspectives.
And in the live stream today, the panel is full of streaming experts. So we have people who have
built streaming projects, open source, multiple people, actually, someone who's built streaming
infrastructure at Stripe and Netflix, and then several really cool streaming tools, the founders of these streaming tools, which is super cool.
Here's the burning question that I want to ask. Streaming is becoming increasingly popular and
the technology around it has grown significantly, right? So if you think, you know, sort of beyond
Kafka, right? Kafka is kind of a de facto, right? But there are so many new technologies that emerge.
And so what I want to ask these people
is how we think about the difference
between batch versus streaming.
Is that even a helpful paradigm,
you know, as we look out into the future,
you know, of how data should flow through a system?
So that's what I'm going to ask. What about you?
Yeah, actually, I want to ask,
like, what the title of this panel is, what's the future
of streaming?
The industry has been producing and developing a large number of streaming
platforms for the past decade or so.
So what's next?
What do we need to build, and what kind of use cases are we addressing that we couldn't
address efficiently
in the past or in the present?
Because these guys are building some amazing new tools, bringing some
new paradigms to how we do stream processing.
And I'd love to learn the why and the how, and how this connects with the previous
generation of stream processing platforms.
Well, let's do it.
Welcome to the Data Stack Show.
I have been so excited about having this group of panelists
because we had such fun conversations with all of you about streaming.
And today we get to talk about all sorts of stuff as it relates to streaming,
the future of streaming, etc.
So before we get going, let's just do some
quick intros for those people who have not listened to the episodes with all of you. So
I'll just call your name out and go in the order that you are in the Zoom box. So Jeff,
do you want to kick us off? Thanks. Hey, everybody. Good morning. Good afternoon.
I'm Jeff. I'm an engineer at Stripe. I work on stream processing systems. Right now I lead change data capture at Stripe.
Before Stripe, I worked at Netflix for a number of years
where I also worked on stream processing systems,
worked on an open source project called Mantis,
and also worked on other infrastructure
such as Splunk and Kafka.
Awesome, thanks so much.
Pete.
Hi everybody, I'm Pete Goddard,
the CEO of Deephaven Data Labs.
My history is in the capital markets.
So think of Wall Street meets systems meets, you know,
quantitative math and things like that.
I run Deephaven, which is an open source query engine
that's built from the ground up to be very good with real-time data
on its own and in conjunction with historical or static stuff.
It's also a series of APIs and integrations and experiences that create a framework because
working with real-time data is pretty young. So we want people to be productive. Nice to
be here today. Thanks so much. All right, Arjun. Hi, everyone. I'm Arjun Narayan. I'm the co-founder
and CEO of Materialize. Materialize is a streaming
database that allows you to build and deploy streaming applications and analytics with
just standard SQL. Materialize is built on top of an open source stream processor called
Timely Dataflow. I bring a SQL partisan perspective to this panel. Before this, I was an engineer at
Cockroach Labs, working on CockroachDB, which is a scalable, scale-out SQL OLTP database. And before
that, I did a PhD in distributed systems, being in the distributed systems and database world for
quite a while, and excited to be here. Awesome. Love that you're getting the bias out up front. It's great, a SQL partisan.
I think people needed the disclosure early on before I got fiery.
Love it. All right, Ashley.
Hi, everybody. I'm Ashley. I work full-time on an open source streaming service, which I think we'll call Benthos today. And that includes writing code, pull requests,
building emojis and mascots. And I've been doing that for about five years. And before that,
my life had no meaning. Emojis are the most important part of the project.
Pretty much. That's where you know your project is a success.
That's exactly right. The swag is actually great. Well, Ashley, I actually want to come back to something you said: streaming ETL, which is a really interesting phrase. And so what I want to start off with is a discussion on how we should think about batch versus streaming. There's a lot of new streaming technology out there and people are working on
all sorts of different data flows.
And sort of the standard stack has a lot of batch
and has maybe implemented
some sort of streaming capabilities.
But I'd love to hear from each of you,
your perspective on like how to think about
batch versus streaming sort of in this new world
where we have a lot of streaming tools.
And Jeff, I think I'd actually like to start with you because you've built a lot of these systems at large companies and you're sort of a practitioner doing the work inside of a company.
So what does that look like for you in the work that you're doing?
Yeah, it's come a long way, but it's still quite nascent, it being stream processing.
For me, I like to segment it into two parts.
One is the developer experience, and then the other is the user experience.
So in terms of developer experience, people are sort of converging around this SQL-like
tooling.
There's also declarative tooling as well, the tooling to get you started on writing
stream processing applications, and then deploying them
out, maintaining, observing them, etc. So that's still got quite a ways to go. It's certainly a
lot better than it has been in the past, especially compared to batch, given that batch has had more
community involvement in terms of bringing that up in the open source and improving on the tooling.
The second part is user experience. So from the user's perspective, they just want to
access their data, whether it's from streaming or batch worlds, whether they're doing streaming or
batch computations. The challenges are in that the semantics are a bit different. And so the
output or the behaviors of streaming applications might not necessarily be intuitive to the end
user.
So for example, in streaming, things happen in more of a real-time fashion,
so windowing comes into play, whereas in batch,
the windows are much larger.
That plays into the last part of my thinking, which is I think
about it in terms of use cases.
So specifically with SLOs as requirements for these use cases, SLOs being freshness,
correctness, coverage, and throughput.
And then so as an example at Stripe,
I work in the finance, the payment space.
So correctness is pretty important.
Freshness, less so originally,
but it became more important because expectations only go up.
So originally Stripe had a lot of systems in batch,
but we were recomputing the world every day or every so often.
And it just got very expensive.
So then you try to go the angle of reducing the cost for that use case.
After the cost is, OK, well, we want things quicker.
Expectations only go up.
So we want more freshness.
Oh, but the freshness is there.
But now we want things to be correct.
And so now we're doing that.
And Netflix is a little bit different.
I worked on a system called Mantis, where we were basically trying to make sure that
the systems, the streaming systems could stay up when you hit play, that it works and it
works well, the bit rate is there as it should be.
And so Mantis took a different angle:
it used a SQL-like user experience,
but made a certain set of trade-offs that we can go into another time or later
to make sure that we can compute all of these streams
in a very cost-effective manner
without losing the ability to gather insights.
And Arjun knows exactly what I'm talking about here
with materialize.
So that's it.
So developer experience, user experience,
use cases, use cases led by SLOs,
which is throughput, correctness,
freshness, and coverage.
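To make the four SLOs Jeff lists concrete, here is a minimal Python sketch of how they might be expressed and checked for a pipeline. The names, fields, and thresholds are illustrative assumptions, not anything from Stripe's actual systems.

```python
from dataclasses import dataclass
import time

@dataclass
class PipelineSLO:
    """Illustrative SLO targets for a streaming pipeline."""
    freshness_seconds: float  # max acceptable event-to-output lag
    correctness_pct: float    # outputs matching an authoritative recomputation
    coverage_pct: float       # share of input events represented downstream
    throughput_eps: float     # minimum sustained events per second

def freshness_ok(slo: PipelineSLO, last_output_event_ts: float) -> bool:
    """True if the newest processed event is within the freshness budget."""
    return (time.time() - last_output_event_ts) <= slo.freshness_seconds

# Hypothetical targets: a payments use case weights correctness highest,
# while an observability use case would push freshness_seconds way down.
payments_slo = PipelineSLO(freshness_seconds=300, correctness_pct=99.99,
                           coverage_pct=100.0, throughput_eps=10_000)
print(freshness_ok(payments_slo, time.time() - 60))  # True: 60s lag < 300s
```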
Yeah, totally.
Well, I love the perspective of the user demand, right?
And sort of streaming being driven by user demand
for data that is more real time,
which is interesting, right? And that's sort
of a self-fulfilling cycle where you get data faster and then you want it faster and then you
want it faster, which is interesting. Ashley, let's jump to you. Do you want to explain what
you mean by streaming ETL? And I think for a lot of people, those are two separate terms,
right? As they just think about their practical day-to-day work, right? Well, ETL handles,
you know, we have ETL jobs that run our batch stuff and then we have streaming stuff,
but you put the two terms together. Can you dig into why you did that?
Yeah. So when I, so my introduction to data engineering was streaming because we had data
volumes that were too big to run as a batch. We were basically selling that data on and it needed to be dealt with
continuously. So to do that at the levels that we had, in order to execute that as some batch job,
we would need a stupid amount of machines to be running it. So it was kind of like a defensive
move, which I think is why some people early on resorted to streaming. And it was about that time that Kafka sort of arrived
and we started seeing that in the wild.
But the main takeaway: imagine an ETL job run as a batch.
It is a process that I would describe as one where you can have humans in the loop.
So it's something that runs at some cadence and has an execution time
that means it's realistic
to have a human watching it happen and maybe even able to intervene.
And it's therefore something that the tooling doesn't necessarily need to be that reliable.
It doesn't need to be as self-fixing, self-correcting.
Whereas when you're in a space where the data is continuously moving and it's moving at a volume that you can't deal with in a batch context anymore, you don't have
the resources for dealing with a backlog that's so large. A day's worth of data now is a massive
cost to the business to process offline. Now you're in a position where you can't always
have a human watching it happen. And if something does occur that is an issue, they're not going to be able to deal with it in a way that doesn't cost the business a huge amount of money.
So the requirement for the tooling to deal with that problem is now, it's much more important that it's able to keep that process running for as long as possible with as little intervention as possible.
And when things do go wrong, it needs to be self-fixing to an extent where it's not going to take you a silly amount of time. And I think what's kind of
happened over time is that there were people like in my position who had to do streaming and they
basically had to reinvent all these ETL tools because that's all we were doing. We were just
taking data, transforming it, enrichments, that kind of thing, filter, maybe a bit of mapping,
that kind of stuff. And then we're just writing it somewhere else. So that is an ETL job, which traditionally,
you wouldn't have so much data that you can just do that once a day and it would go for an hour,
all that kind of thing. But then we were in a position where that wouldn't work. So we had
to reinvent all these tools. And essentially in the process of doing that, which I would imagine
my colleagues here are familiar with, is this idea that you have to make your tool that much
better at dealing with
various edge cases and problems because you can't expect a human to be able to just come in
and fix everything for you. And I think one of the repercussions of that is because the tools
are becoming so much more autonomous and able to deal with all those problems. People who have
batch workloads, they don't necessarily need to do things in a streaming way. They can now look at
these tools and go, oh, that looks a lot easier. That actually looks like I have to do less stuff
in order to have that do my workload. So I kind of call it a streaming ETL because I see a lot
of people using Benthos who, they have a batch workload, right? They don't have a volume of
data that requires continuous real-time flow. They'd be quite happy with just an hour-long job run daily or even
weekly. And they're choosing to use Benthos just because they know that they can just leave
it running. And if there's a problem, it'll fix itself. It's not going to require some sort of
intervention. And they don't need to think, oh, have we checked that it executed today and check
the status of that. They just know that they'll get alerts if something's gone wrong. And if something has gone wrong, they probably just need to restart it or add more
nodes, that kind of thing. So I kind of feel like it's, I call it that because I don't want people
to look at a streaming project and think, oh, well, that's not for me because I don't have
real-time payloads. I don't have requirements that would necessitate something that complicated.
Because really at the end of
the day, it ends up becoming operationally simpler in a lot of ways than some sort of large batch
workflow tool. Super helpful perspective. Okay, Arjun, I want to jump to you and I'm going to
add a modification to the question. So with that context where you don't necessarily have to have
a streaming service running something because the basic parameters or sort of SLAs around batch are fine,
but it simplifies a lot of things.
Would love for you to speak to that,
but also with the added question of like,
do you think there's a future where batch kind of goes away
because of those techniques, right?
The tooling gets way better
and it's actually just easier to have continuous streams.
So I think it's worth thinking about the use case carefully in that there's a big difference between human in the loop versus
no human in the loop. So when there's a human in the loop, there is a much higher latency that we
can tolerate because there's only so much, there's only so frequently that a human's going to like look at a dashboard or react to a new data point, right? And humans
go home, they sleep, you can run the batch job then. There's a big difference in a phase shift
that comes about when you start doing things in an automated fashion, where powering automated
actions off of batch workloads starts to immediately feel too slow because you have
the data, you know what the data point is, you've collected the events, say a customer
is on your website, they've done an action, and then it's going to take you something
like 10 hours for the ETL to crank through, for the batch workload to finish, and then
for you to take some action.
That immediately feels like an unacceptable amount of latency.
And that's oftentimes the big differentiator between streaming and batch.
It's when you're doing these automated actions.
And most companies get introduced to this workflow because they start to do email marketing.
So they have some data, they have some actions, they segment their users, they decide they
want to do some email marketing.
And that's actually fine in batch.
But you quickly realize the difference between taking action when somebody is on your website versus the next day.
I'm sure we've all had that experience where you were searching for a mattress, you go online, you spend about two hours, there's a two-hour window, you're trying to find the perfect mattress.
And then you find the perfect mattress, you hit checkout.
The next day you come back on the internet, it's like mattresses, mattresses, mattresses. It's like, it's too late, right? Like it's because
all those batch jobs just finished overnight. And then they decided you're in the market for
mattresses. And your perspective is like, I was, and I'm not buying another mattress for another
decade. Please go away. And then eventually, you know, they realized that the moment of this has
moved on. And it's these automated actions, this sort of personalization, this segmenting of the customer, these things where streaming really delivers outsized returns.
We founded Deephaven because we thought we saw the future. Our team comes from the capital markets. And unlike most of the people that are in this space,
who frankly came from a batch world and are now moving to streaming,
in the capital markets we know that the opposite is true. We know that ever since the capital markets went electronic,
in just the late 90s in Western Europe and in the US, you actually make all of your money in streaming. There is no such thing as trading an hour ago. There's no such
thing as I want to buy a stock yesterday. I can not do it. I can only react to things right now.
So all of Wall Street has been automated for a couple of decades around real-time data first.
And it's done that with, of course, you need context.
Of course, you need history.
Of course, you need state.
But that real-time streaming technology has been married to historical static batch data
sets as well.
I think the change in the last 10 years, and certainly the change that Deephaven is trying to be a part of, is how do you move that from being bespoke code that is written by very, very sophisticated developers and, frankly, the elite players in the tech groups across Wall Street?
How do you migrate it from being those custom solutions to instead being general purpose streaming technologies.
And I think many of the technologies that you guys are mentioning here are relevant.
There's the transport buses that obviously have become quite popular with Kafka-compatible stuff
and ZeroMQ and RabbitMQ and Solace and Chronicle Queue and all of these types of things.
All you're doing is you're really taking what we've been doing on Wall Street
for a long time with our custom solutions
and we're making it open source, general form, et cetera.
And we think that this is the future
for many other industries,
that it will be what all of you are saying,
which is, you know, what we agree with.
And that is, you know,
your real-time data is the most valuable stuff.
Whether real-time to you means a millisecond, real-time to you means 10 minutes, I don't care. Let's call
that real-time. And, you know, what Jeff said about that data being available to a user, the
same as any historical data, is entirely relevant. And so what we think is important is that the same
methods need to work on streaming and dynamic data as they do on batch
and static data. And that good technologies will make it so that the user doesn't have to really
care. They can just use the same stuff and it will work on both. It will all merge, it can all join,
and you can just get to work on data and stop thinking about streaming versus batch.
And that's the way Deephaven organizes itself.
Pete, if I may jump in,
I love the background that you just gave us, because you're absolutely right
that Wall Street has been doing streaming
before anybody else has been doing streaming.
I mean, if you look back,
before Kafka and RabbitMQ, there was TIBCO.
And then before any of these stream processing frameworks,
there was kdb+.
And these were sort of the pioneering ways in which to express
computation and move data around. But the flip side of it is that it required these extremely
specialized skill sets, right? Like you have all these banks where there were, you know, three
kdb programmers, and they were, you know, writing this almost hieroglyphic programming that nobody
can understand, and everything sort of runs
through them. And what I love about Deephaven is it's really bringing the best of, you know,
the modernized data science, Python, R communities, the communities that have not had access to
streaming. They have been living in a pure batch world and bringing the best of both worlds together. And I obviously bring a very different sort of set of backgrounds to streaming.
I mean, I've never been on Wall Street, never worked there, but I have the greatest respect, because
those guys really were first to implementing these technologies and pushing them to production.
And whenever we do interact with them, they bring a wealth of expertise and experience
and speak with much more technical precision
than I think is typical in Silicon Valley,
including sort of the deep theoretical understanding
of like the bi-temporal, multi-temporal queries
and how one can express those computations.
Yeah, well, it's sweet of you to say.
I think the, you know, I'm not a computer scientist.
I'm not a, you know, I'm not well-versed on the history
of how all this has come to be in the last few decades.
So the battle between streaming and batch is somewhat lost on me.
It seems like semantics to a certain extent.
Because when we think about architecture, you know, we think of
first principles of architecture: fundamentally, we just think that data
changes, and your architecture should expect data to change.
I know that you, Arjun, very much embrace something similar there where, you know, rather than taking
snapshots of data that you then need to recompute on as a whole new batch set comes in, you have your architecture organized in such a way that you know new stuff is going to come in.
You know, that's the architecture you should put in place.
If that means you want to use the word stream for that, fantastic.
But you could use whatever terminology you want. Just fundamentally, we think
you should embrace that the
first principle that your data is going to change and you
should be organized accordingly.
So guys, I think that's a great
opportunity to have
a little bit of a history
lesson, let's say. So it would be
great to hear from you
how you have experienced, let's say, the
evolution of stream
processing platforms. I remember, for example, at the beginning of the previous decade,
we had Spark Streaming at some point. Then we had Twitter coming out with Storm.
There was some kind of explosion, at least in open source projects, around stream processing.
New architectures were coming out there, like the Lambda architecture, for example, where people were trying to put batch together with streaming.
Then we had the Kappa architecture.
I think that was trying to merge everything into a streaming platform and also give some primitives that are more native to batch, and obviously there's a lot
of value delivered through all that stuff.
Obviously some technology survived, some are obsolete today, but we have like
companies like Confluent going public and people making money obviously out
of all that stuff, which is always good. So what's happened this past decade, let's say,
and also what do you feel like it's next?
Because all of you, one way or another,
you're working on building the next iteration of streaming platforms.
So it would be great to have this connection.
Let's start with Arjun.
Thank you. Yeah, I think what Pete articulated is the destination or the place we all want to get to where batch and streaming are really sort of implementation details. I completely agree with
that 100%. I think we're about 10% of the way there, right? It's been a multi-decade journey.
Some of the projects you brought up,
I think Storm was sort of the earliest open source project. And the way I would articulate it is
the things you could do in streaming were a tiny, tiny, tiny, tiny subset of what you could do in
batch back then, right? So precisely on the implementation level, you couldn't do any
stream processing that required that
your stream processors maintain a lot of state, right? So you couldn't effectively do lookups to
historical data. You were sort of building these, and in the algorithms world, there's this notion of
streaming algorithms, precisely those algorithms that don't maintain very much state, in a
big O sense. And that was really what was enabled by Apache Storm, right? So as a user, you had this big trade-off: do I care about low latency
and am I willing to limit what I do computation-wise? Or do I need that complex computation
and I'm willing to give up low latency and do it in batch processing? So that was sort of the
trade-off you had to navigate. In terms of a Venn diagram of, you know,
the use cases were, you know, pretty, pretty separated.
Like there was this giant circle,
which is the things you can do in batch,
and this tiny, tiny circle off to the side
of what you can do in streaming.
And over time, as we've had more capable stream processors
become, you know, available,
that Venn diagram has
started to have a little bit of overlap, right?
And that is where I would say the Kappa architecture comes in, which I think
the Kafka folks articulated. Before that, we even had the Lambda architecture.
So zooming even further back, people didn't want to make this trade
off, right?
They were like, I care about low latency, but I also want the fancy computation.
So maybe what I can do, to sort of get the best of both, is I'll run a batch computation
side-by-side with the streaming computation. I'll do the fancy stuff in batch and I'll do some sort
of approximation of the computation that I want in streaming. And I'll keep periodically running
the batch computation because I'm going to diverge from the true computation that I want
because the streaming is an approximation.
And then I'll sort of revert back to the batch and then reset and then recontinue.
And this was hideously complex, right?
Now you're running like three different systems.
You've got the batch, the streaming, and then the little thing putting it all together.
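As a rough illustration of the complexity being described here, this is a hedged Python sketch of the Lambda pattern: a fast approximate streaming path, a periodic authoritative batch path, and the glue that resets one from the other. All names are hypothetical.

```python
# Lambda architecture sketch: three moving parts to build and operate.

running_total = 0.0  # speed layer state: fast but allowed to drift

def on_event(amount: float) -> None:
    """Speed layer: update an approximation with low latency."""
    global running_total
    running_total += amount  # e.g. may miscount late or duplicate events

def batch_recompute(all_events: list) -> float:
    """Batch layer: recompute the true answer from full history."""
    return sum(all_events)

def reconcile(all_events: list) -> None:
    """The 'little thing putting it all together': reset stream state to truth."""
    global running_total
    running_total = batch_recompute(all_events)
    # ...streaming then resumes from this checkpoint until the next batch run.
```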
And the Kappa architecture was the simplification saying, why don't you just run everything
through the stream processor? The problem with this, as Ashley has brought up,
is the developer experience is terrible, right? Like everything in streaming is manual and writing
lots of code and maintaining tons of infrastructure. Whereas in batch, the lived experience
that people have in batch is like, I just write a little SQL, or, you know, in the data science world, I write this little Python program,
and I get to harness the power of a scale out horizontally scalable, reliable, massively
parallel cloud architecture that works on terabytes and terabytes of data for me,
you don't have any of that in streaming, right? So streaming today, it's like, well, that's great.
Now you're on the hook for your schema changes, you're on the hook for, you know, errors in your data stream.
You're on the hook for expressing the implementation. So the nice thing about SQL or, and it's not just
SQL, there's other sort of languages for sure in which you can express computation declaratively,
but in streaming, you haven't had that luxury. You've had to build out your own implementation languages.
And I think over time, and certainly our goal at Materialize, is to take that subset of batch computation, which we think is a pretty large subset of SQL, and make it available
to people in streaming so that they can be streaming with just SQL.
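For a flavor of what streaming with just SQL can look like, here is a small sketch. Materialize speaks the PostgreSQL wire protocol, so a standard driver works; the connection string and the orders table are made-up placeholders.

```python
import psycopg2  # Materialize is PostgreSQL wire-compatible

conn = psycopg2.connect("postgresql://materialize@localhost:6875/materialize")
conn.autocommit = True
with conn.cursor() as cur:
    # Declare the computation once; it is kept up to date incrementally
    # as new order events stream in.
    cur.execute("""
        CREATE MATERIALIZED VIEW revenue_by_customer AS
        SELECT customer_id, sum(amount) AS revenue
        FROM orders
        GROUP BY customer_id
    """)
    # Reading the always-fresh result is just a query.
    cur.execute("SELECT * FROM revenue_by_customer WHERE revenue > 1000")
    print(cur.fetchall())
```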
This does not cover all of the use cases, right?
So just as you have Snowflake and Databricks, right? There's a whole world out there that's not SQL for data science and
machine learning. And absolutely, we'll need solutions in that side of the space as well.
But I think we're maybe 10, maybe 20% of the way there, such that we're getting to that promised
land. The way I would rephrase what Pete said, it's like, we need to get to the point where streaming is a superset of what you can do in batch, right?
So that eventually we think of batch as a subset of the capabilities that we can do in stream.
Does this mean the batch processes are going to go away?
No, I think there'll absolutely be some times where the same computation you can sort of choose.
Do I want to use a batch runner because I have all that infrastructure already set up. Maybe it's a little
bit cheaper or do I want a streaming runner? You should be able to move computation back and forth
between sort of underlying infrastructures. I don't think, I don't think batch is going away
entirely, but I want the batch versus streaming debate, decades
from now, to be as much of an implementation detail as,
hey, are you running this on a column- or row-oriented execution engine?
It's like actually, you know, like 99% of people do not care what the answer to that
question is, right?
They just experience it as like, oh, this is great.
I'm having a really pleasant developer experience.
Yeah, makes total sense.
So, okay, Ashley, I'd like to ask you.
So you said that you had, I mean,
you were working like in a company, you had like to work with a lot of data.
You had like to deliver the data, like in a real time fashion.
Why didn't you use just, I don't know, Flink and Kafka, right?
Or whatever, Storm, like whatever was like available back then out there.
And you decided to go and like build your own solution
that ended up becoming Benthos.
It was because of the types of work we wanted to do.
So I think one of my early frustrations with stream processing tools
was that there was this focus on the really high fruit in the tree,
which was the actual queries. I was interested in single message
transforms, which was we're taking data, we're doing a small modification to it. You can do that
stuff with those tools, but we had latency requirements that it just didn't work out.
But the other bit that was missing was being able to orchestrate a network of enrichments.
So we had lots of different types of data science-y enrichment that we needed to add
to these payloads as they were passing through our platform.
And it was non-trivial, the mappings between them.
So we needed them both; one was a requirement of the other.
So we ended up with this DAG, a directed acyclic graph, of these enrichments that we needed
to execute in a streaming context, which meant as quickly as possible with as much
parallelism as possible.
And none of those tools, you could do it with those tools as a framework, but we needed
something that was going to execute those things.
And what we were looking for was declarative because we needed to slowly iterate
on those enrichments, change where they were, how they operated, what kind of data they were
giving us back and requiring because they were being actively worked on and slowly change the
graph as new requirements came in. And the idea of having to compile a code base every time that
happened, it just wasn't realistic. So we wanted to offload some of that work to non-engineers, but have it still be owned by engineers.
So we kind of went down this declarative path.
And that's kind of where Benthos essentially started: iterating on those key principles.
But I mean, all of those tasks are something that you could do with a batch system right now. And as we were getting
these streaming tools that were kind of like an alternative to batch for the analytics part,
we just didn't have any options for the actual pipelining stuff. There's like logging tools,
because almost all engineers have logging as a streaming problem. It's basically a stream
processor,
but then you don't get the delivery guarantees
and the observability and those kinds of things baked into it.
So yeah, it felt like for me, in my position,
that was the biggest gap.
That's the bit that I kind of focused in on
was integrations with services, enrichments,
transformations, that kind of stuff.
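To illustrate the shape of the declarative approach Ashley is describing, here is a toy Python sketch, not Benthos's actual configuration format: the enrichment graph lives in data rather than code, so reordering or swapping steps means editing config, not recompiling.

```python
# Stand-in enrichments; in practice these would call HTTP services, etc.
def add_geo(msg): msg["geo"] = "unknown"; return msg
def add_score(msg): msg["score"] = 0.5; return msg
def drop_internal(msg): return None if msg.get("internal") else msg

STEPS = {"geo": add_geo, "score": add_score, "filter_internal": drop_internal}
PIPELINE = ["filter_internal", "geo", "score"]  # the graph, expressed as data

def process(msg):
    for name in PIPELINE:  # execute the declared steps in order
        msg = STEPS[name](msg)
        if msg is None:
            return None  # message was filtered out
    return msg

print(process({"user": "a"}))                    # enriched and passed through
print(process({"user": "b", "internal": True}))  # dropped: None
```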
How did Benthos evolve as a project, especially from the perspective of, let's say, the use cases, right?
From when it started inside the company, up to today, right?
Where you're maintaining a project that is used by so many people outside the company it was initially intended for. What's your experience there? How did you see
things changing in terms of taking a stream processor built for a very specific use case
that you had there, and having it adopted, I would assume, for different use cases too? So can you
So I'd already had a bit of practice with making something like that generalized.
So having like a configuration schema
for expressing enrichments
as integrations to arbitrary things.
So the idea of, you know,
you have language that can communicate
interacting with an HTTP service
or a Lambda function
or a database query
or, you know, this and that and this.
So I'd kind of dabbled with it.
I could see where the problems were in generalizing that stuff.
So you end up just biting off a bigger chunk of how much you're going to abstract over that stuff
while still being user-friendly.
Once I'd gotten that, you reach that nice Goldilocks position of being super easy
for people to pick up quickly and run with.
But also when somebody comes to you and says, hey, I have this thing that it can't do yet.
Can you make it do that?
You can really quickly add that in.
So you're not immobilized: it's not so generic and generalized that you can't make
things easy for people, but it's also not so locked into being easy and intuitive for its existing use
cases that you can't slot something else in. That's mostly just the config. It's basically the way that I've kind of decided to
chop the concept of batching and what a message is, what a batch is inside the process itself,
and what the responsibilities of the various component types are. What is a processor? What is
a transform? What is an output? How do you compose those things? But the core concepts haven't really changed that
much since day one. And that's because I'd already had practice with a few attempts at
generalizing that kind of thing. And then after that point, it literally just becomes
a case of five years of people going, oh, but what if it could just do this extra bit
as well? And I'm like, okay, let's do it, let's put that in. Or sometimes saying no.
That's cool. So, okay, Jeff, your experience at Netflix, again, with all
these tools out there, right?
Why did you end up building Mantis?
What was the reason that made you say, okay, we need something new?
Like whatever tools are out there, they do not fill the gaps that we have here.
So Mantis fundamentally is a stream processor, but the use case that started
it all was basically trying to reduce downtime, or MTTI, mean time to insight: understanding
why something's happening, like something negative
happening to the service. And so you have your traditional telemetry stacks, like metrics,
logs, and traces. And certainly we at Netflix at the time did use that, but after some point it
just got too expensive. If we wanted the most granular
insight possible, we would have to log everything all the time, aggregate it, store it, hopefully TTL
it out the other end. But we needed something that could get us the cost
effectiveness while upholding the granular insights. And so to give an example, petabytes of data
were going through the system
and then most people would filter out
like 99.9% of it.
And then so one of the great use cases
is like, I'll give you all an example.
So Mantis gives you the ability to say,
if I'm looking at,
if there's a playback issue,
like someone's having a problem
playing a title, we can get that insight into which country they're coming from, which title, which episode, which device.
And then like versions of the device, and then you can exactly see like what's going on.
And so that's very, very targeted query to like a specific thing.
And so you don't really pay, you don't pay the cost of that query
if it's not in use basically.
So it's more of like a reactive model,
like the reactive streams model.
And so another example of like
when we were decommissioning a Netflix app
for some devices,
we can see exactly which users are using that over time.
And then maybe send them a message, like, hey, we're going to decommission
this app for this device, please go do something else.
And so it was more of a real-time use case: what's happening here.
Now, another example is when we do regional failovers,
which happens quite regularly, like you can see the number of people hitting
play, like dip in one region and increase in another region.
And then when we switch back, it goes the other way.
And so it's used for a lot of tooling.
But then the other feedback we've gotten was like,
okay, well, what if I don't want real time?
What if I want to look at something like in a batch case
or in like a snapshot of data
or join that like historical context
with the streaming set of data to make some sort of decision later on.
So then our answer to that was, okay, we would just build in sinks.
So sink it out somewhere else.
And then you would have to do basically a stream-table join later on.
But that's another story.
So it all started with observability basically.
Hey, Jeff, by chance, I've heard you speak before,
and one of the things that you say that I think is really interesting
and very important, actually, for the streaming community
is this idea of not sharing all of the data.
I think that that's one of the founding principles of Mantis.
And I think for stream processing players
that are used to just receiving a firehose queue
and then having to do things about it.
You're approaching that a little bit differently. And I think that's an important concept because
you can all of a sudden be moving a lot less data around your system or your mesh,
which is a principle also that I think Arjun holds dear. So can you talk to us about how that came
out as a principle and how you've executed that in your system? Yeah, there's three parts. So first is we encourage developers to publish all of the data that they can.
So like an event might have hundreds of fields
and it might be like very large in size, basically.
But we also have it so that the infrastructure
doesn't consume all of that data by default.
Like you have the ability to do that, you being the user, but you have to be very intentional
in doing so.
So certainly there are some long running jobs which basically perform a select star, 100%
sample, but it has to be, maybe it's a low volume stream or something like that.
But most people would either do like a sample or select some of the fields or even filter
some events out based off of a condition.
So the first is like publish everything, but be very
intentional about what you consume.
And then the second is like reusing like subscriptions basically.
So if multiple people are asking for the same set of data, and if you have that
already in memory, just don't go all the way up to the source of the data,
these applications; just send what you have down to the people that are subscribed with the
same pattern of data that they're asking for.
And so really you're getting like a lot of reuse.
You're getting a lot of like intentionality, like consume only what you need.
Because in reality, you don't want all of the data.
And if you do, then you can do that if you want.
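A toy Python sketch of the publish-everything, consume-intentionally idea: producers emit full events, while each subscription declares a filter, a projection, and optionally a sample rate. The names here are illustrative, not Mantis APIs.

```python
import random

def subscribe(events, fields, predicate=lambda e: True, sample=1.0):
    """Consume intentionally: filter, sample, and project per subscriber."""
    for event in events:
        if predicate(event) and random.random() <= sample:
            yield {f: event.get(f) for f in fields}  # only requested fields

events = [
    {"country": "BR", "title": "T1", "device": "tv", "error": "PLAYBACK"},
    {"country": "US", "title": "T2", "device": "ios", "error": None},
]

# A very targeted query: playback issues only, three fields, no raw firehose.
for row in subscribe(events, fields=("country", "title", "device"),
                     predicate=lambda e: e["error"] == "PLAYBACK"):
    print(row)  # {'country': 'BR', 'title': 'T1', 'device': 'tv'}
```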
Yeah, we hold the same things dear.
We use the phrase, you know, we move data around the system in lazy fashion, right?
That the producer, our APIs allow the producer to have quite a bit of information from the consumer
in terms of what it wants from the table and at what update frequency it wants,
because there's different use cases.
Some of them are throughput, some of them are latency, and some of them are consistency
driven. And then we also sort of hold on to what you just talked about in regards to sharing work
products. You know, we memoize data so that if you have a few consumers that want the same
thing, you're able to share it, because oftentimes, you know, you think it's scale-out; it might be a few
thousand consumers. And so there's a lot less work that needs to be done in these
types of use cases that are really evolving quickly in the streaming
world, we think.
So guys, one of the things that surfaced through the conversations that we have had is
that something that's quite important around stream processing
is the developer experience.
It's something that still needs a lot of work, and probably one of
the things that the previous wave of technology out there didn't pay as
much attention to as it should.
And my question is, and I'll start with you, Pete, because you are offering a product that exposes an interface
to the users that's very, let's say, familiar to a very specific group of people that are
not necessarily extremely technical, in the sense of the engineering of these systems.
So I want to ask how we can take streaming and make it, in the end, more accessible to engineers. With a database, you don't think in these terms,
right? And a database is what most engineers, most developers, have
been exposed to.
So how can we bridge, let's say, all these primitives and the new
language that streaming is bringing to the table with what is most
commonly known out there to developers?
And it's one of the questions that we got from one of the attendees as well.
That is, okay,
you can, let's say for example, exchange the terms extract
for publish, load for consume.
But at the end, is this what is needed?
Do we need to change our language and educate people, or can we do better?
So I have a bit of a radical answer to that.
It's a little bit self-serving, so I hope you'll pardon my indulgence here.
But we actually introduced a new primitive that we think is important.
So we think event streams and message queues are vital.
And we understand that that is the foundational building block of the conversation we're having today.
We could not embrace it more heavily.
However, we don't think it's the whole story. We think that the most intuitive object for people to work with
for data-driven applications, for data science, for data analysis is the table, the data frame.
I'm not making this up; this isn't just Pete Goddard's opinion. Walk around, see how many people
use Excel, go survey SQL users, look at R data frames, look at Python pandas.
It's not an accident.
They're all tables or something darn close to the table.
So what we've done at Deephaven and we put it out in open source is we've really
tried to create a dynamic table.
So tables should not be static.
They should be things that can change, right? That should be
true at the API level. That should be true at the engine level. And that should be true at the user
experience level and the API, the JavaScript APIs that are supporting that. So we think that that is
a very important proposal to the world or to the community in regards to how do I make it easier
to work with streams?
So when you're using Deephaven, you might integrate with whatever, you'll copy paste
something we have from our documentation to onboard a Kafka stream or a Redpanda stream,
or you'll integrate Solace or something like that, right? And now you, or our enterprise customers, will, you know, push binary
logs from Java or C++ applications of theirs, right?
But when it hits Deephaven, it really hits
as a table structure, and then we keep track of changes in the table.
And I think making it so that we can do really high performance compute, because we're only doing calculations on changes instead of on the actual data, which means less data and faster compute, is powerful. But it really works because we let people think in terms of tables, and we let people use tables. And oh, you want to do machine learning? Well, chances are
you're using tensors, which are arrays, which are, you know, chunks or columns of tables. This construct
of pushing a table around, a dynamic table, as a first primitive is really, really helpful for
the user experiences you're talking about. In addition, one of the things that this group is well aware of: as soon
as you get out of batch, think of how much money, meaning market cap, is involved
in the batch world, right?
So just think of BI, right?
BI, you have Tableau and Looker and, you know, Power BI and Qlik and, you know, 12 other
competitors, right?
Give any of them a stream and your experience is pretty much gonzo.
So there needs to be significant investment to support those types of development, exploratory, interactive, and dashboarding experiences with real-time data.
We've been very involved with that for a number of years, but we obviously will bring it.
It will need a community to get it all the way there.
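A minimal Python sketch of the dynamic-table idea, the concept only, not Deephaven's actual API: consumers register a computation once, and the engine reapplies it to changes as they arrive rather than rescanning the whole table.

```python
class DynamicTable:
    """A table that notifies listeners of row-level changes."""
    def __init__(self):
        self.rows, self.listeners = [], []

    def append(self, row):
        self.rows.append(row)
        for listener in self.listeners:
            listener(row)  # compute on the delta, not the whole table

# A running aggregate stays correct by consuming only the changes.
totals = {}
def on_change(row):
    totals[row["sym"]] = totals.get(row["sym"], 0.0) + row["qty"]

trades = DynamicTable()
trades.listeners.append(on_change)
trades.append({"sym": "AAPL", "qty": 100})
trades.append({"sym": "AAPL", "qty": -40})
print(totals)  # {'AAPL': 60.0}
```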
Yeah.
So Jeff, you mentioned for Mantis specifically that you implemented more of a reactive model there, right?
Reactive programming is a paradigm that is, I guess, quite well known to people doing, say, front-end development, right? So what's your opinion on how we can make
streaming technologies more approachable to more engineers out there?
Yeah, that's a good question. In my mind, there's the user and
then there's the developer. User is a fun story. But what I've been thinking about a lot lately is, I actually wrote a
stream processor very early on in my computing career.
It looks like: take a list, call stream, call flat map, call reduce.
Right.
So that's like an API that's very standard in a lot of programming languages.
Or take another example.
You have a list and then you have a for loop and then you go through and then you do something with it.
And so, kind of like reactive streams, you have an API that has, you know, your flat maps and all that other stuff.
Like I'm wondering from a developer perspective,
that's kind of just what I want to do.
I know how to write a for loop.
I know how to write a stream, flat map, reduce, et cetera.
And if I could just give that to somebody
and then they'll parallelize it,
they'll manage all the state,
throw it in RocksDB and all that other stuff.
Like that's good for me
because I know what a list is
and I know what a for loop is.
And that works for me.
Another example is like in some of these stream processors, there are great APIs.
You have windowing and triggers and all sorts of stuff.
But if you want the ultimate flexibility, there's things like just the process function or just basically a block of code that you write, you run, you put all sorts of variables in there.
And that's pretty nice.
Like that's kind of just what I want as a developer.
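Jeff's point, in code: the same computation written as the for loop every developer knows, and as the stream-shaped flat map and reduce chain he mentions. The pitch of a stream processor is that you write the second form and it handles the parallelism and state. Plain Python, purely illustrative.

```python
from functools import reduce
from itertools import chain

orders = [[1, 2], [3], [4, 5]]  # e.g. batches of order amounts

# The familiar for-loop version:
total = 0
for batch in orders:
    for amount in batch:
        total += amount * 2

# The stream-shaped version: flatten, map, reduce.
total_streamy = reduce(
    lambda acc, x: acc + x,
    (amount * 2 for amount in chain.from_iterable(orders)),  # flat map + map
    0,
)
assert total == total_streamy == 30
```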
Yeah.
Makes sense.
Ashley, what about you?
You have built a tool from scratch, and you are interacting
a lot with developers out there.
So what is missing?
What do you think is missing from streaming infrastructure
for it to become more approachable?
No, I think the first thing is,
if you take a developer that's innocent
and hasn't suffered at the hands of stream processing
and you invite them into your world of,
hey, why don't you do that with streams?
I think the immediate repulsion to that is,
oh, that looks like a lot of hassle.
That looks like it's going to wake me up at night
and I'm going to have to do all kinds of stuff to get that back up. And I think answering that anxiety,
for me, that's been one of the, I wouldn't say it's a struggle. It's more that when people get to
me, they've already overcome that somehow. So I'm kind of seeing this relief: oh, actually, this isn't that bad. But I think
the operational side is still a nightmare. And it's still, there's a lot of moving parts still.
There's all the different services. You imagine like an architecture of stream processing,
a team that's decided that they're going to lean into it heavily and everything's going to be on the low latency side. The number of moving parts is vast if you want everything. And every component
wants its own state. It wants its own claim to the disk. It wants its own thing that can fail.
And the idea of being woken up at 3am and one of your disks is corrupt and you've got to sort out,
okay, well, which services in my streaming platform now,
you know, have been suffering? How have they been suffering? How do I recover them from that state?
It's, it's overwhelming, I think for a lot of people and it's very comfortable to just say,
okay, well, we don't need that yet. We don't, we don't need that kind of thing. I think it's kind
of similar to, to, you know, Kubernetes and things like that, where it's exciting to some people, but to
a lot of people, it's too much hassle, feels like it's not worth the investment.
And I think it's something that is improving over time.
There's obviously alternatives to Kafka now.
There's companies like Redpanda.
Lots of tools are coming along that try to simplify.
A lot of the products here are nice and simple to use and just kind of slot into existing
systems, which is great.
So I think that's kind of what I'm hopeful for in the future is that a lot of those moving
parts become less moving and more simple.
So do you feel like, I mean, isn't the data itself and the use case itself driving this?
I'm just unfamiliar, right? I don't come from the same background as you.
So I'm just unfamiliar of, oh, this is a thing that's working in batch.
Should we try it in streaming?
That doesn't have much resonance with me. But hey, there's a new type of
data, it drives a new type of value to me.
I need to keep up with my competition.
I need to be responsive in a
shorter amount of time, or there are new data sets that are coming in where transactionality
might not be as important. Therefore, I can embrace a new way of doing it. To me,
from the people you talk to, is that oftentimes driving the migration to dynamic data, real-time data, streaming technologies?
Or is it really just, oh, you know, what you cited at first, which was they want to do, you wanted to do batch, but you need to do it at such scale that you kind of had to do it all the time?
Or what really drives people into this dynamic data stuff?
It's a bit of both. So there's some people who have to, and they used to have something
that they wrote themselves,
and it's dodgy,
and they wanted something that's not.
And then there's a lot of people
who are just, you know,
they're using Benthos,
which is a stream processor
to basically just read a batch workload
because you can read like SFTP files,
and they just,
they like the fact
that they can just run this thing
and it's always on.
It's always, you know, they only need this data like once a day to be refreshed.
But, you know, they just like the fact that this thing will just always run.
It just sorts itself out and seemingly doesn't require much effort.
It's got a nice little config that, you know, three people can be working on in a source control.
And to them, it's a simpler world. And the job isn't
that complicated. It's just hitting some endpoints and maybe has a fallback with some dead letter
queue or something on an alert. So I mean, there's those two extremes and then there's
everything in between as well. There's people who feel like we probably ought to make this
more efficient or streamy, but we don't have to yet until we find the thing
that, you know, suits our particular requirements. It's kind of hard to say, but yeah, I see all of
them. I see people from all walks of life coming and discovering, actually, it's not that bad.
That's the main take is actually, it's not that bad.
Yeah, that's great. Okay. Well, we are right at time here, but I want to get in this question really quickly
and we'll see how many of you can answer.
Arjun, we'll start with you.
Chris asked a great question.
He said, coming from the traditional batch ETL world,
I found using new vocabulary to describe things
can help with thinking about how to do things in new ways.
Examples: extract versus publish, or load versus consume. Has there been any terminology
that you use or have heard that you think has been helpful in terms of breaking out of the way
that we talk about these things traditionally? I love this question. I'm going to take the
completely contrarian point of view, which is that no, we should stop doing this. There's so
much amazing vocabulary and intuition in batch. I mean, it's the entire
reason we named the company Materialize, right? Which is, what are you doing? What are you trying
to do with the vast majority of these streaming pipelines? You are trying to build a materialized
view that stays up to date over changing data and relating the difficult newness of streaming to the concepts
that people are familiar with from batch, I think has a tremendous amount of value.
I know this is a bit of a contrarian hot take.
It is very much a reaction to how needlessly complicated streaming has been for so long, and to trying to simplify it in the terms that are most relatable to the audience.
And we really see the audience as people, you know,
the 100x the amount of developers who have not tried streaming,
have never, you know, have never poked at it
and are maybe even a little bit rightly so afraid of poking that bear.
So I'm going to take the contrarian take.
Maybe I should have been the last to speak.
I love it.
I love it.
Jeff, what say you?
Yeah, no, I'm actually plus one on that.
Just keep it simple.
So I sit in the change data capture space right now.
Really, you're just capturing changes from databases, aka extracting something from a
database.
Someone wants to transform that and someone wants to load that somewhere else. So just keep it simple, I think. Ultimately, for me,
are we talking about the technology or are we talking about the use case? I think ETL as a use
case, yeah, you're going to extract something, you're going to transform it, you're going to
load it. As a technology, there are many things under the hood which have other definitions.
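For readers new to change data capture, here is a sketch of the kind of change event Jeff is describing, in the before/after envelope style popularized by tools like Debezium; exact field names vary by tool and are illustrative here.

```python
# One row update captured from a database log, as a CDC event:
change_event = {
    "op": "u",  # operation: c=create, u=update, d=delete
    "source": {"db": "payments", "table": "invoices"},
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "paid"},
}

def apply_change(state, event):
    """Loading downstream is just replaying changes onto a copy."""
    if event["op"] == "d":
        state.pop(event["before"]["id"], None)
    else:
        state[event["after"]["id"]] = event["after"]

table = {}
apply_change(table, change_event)
print(table)  # {42: {'id': 42, 'status': 'paid'}}
```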
Yep, that's great. All right, Ashley, you invented a new term
called streaming ETL earlier on the call.
So what do you think?
I did not invent that.
I've seen, I promise you,
I've seen that somewhere else.
And I thought, oh, maybe that fits.
I'm not good to ask about this stuff.
I'm really bad with it.
Because I mean, I didn't think
I was a data engineer for ages. I discovered data engineering way late. I had a data engineering
tool and I didn't know what data engineering was. So all this stuff I've had to learn,
what do they mean when they talk about ETL? Because to me that was just processing.
Every program is ETL. It's reading something, doing something and then putting something out. So, I mean, yeah, I feel like all these terms are super vague and kind of conveniently unspecific anyway. So I just kind of adopted them for whatever I had going on.
Yeah, I don't know.
Just wait till you hear ELT.
The marketers are trying to confuse us.
All right, Pete, with the last word.
You know, I think my answer is similar. Mostly, I just don't feel qualified to dictate language
and nomenclature to people. Certainly, I come from a space where pub-sub systems feel pretty natural:
you're moving data from one server to another, and there are publishers and subscribers, and that
forms a data mesh.
I don't think that that's a sophisticated concept to embrace and might be an easy way
to think of things.
As others have said, I think the best thing we can do as a group is make it so that we
can talk in English that a sixth grader can understand, right?
So I am moving data here.
This thing wants this data.
It is going to do that with it, et cetera.
And whatever words you choose to coalesce your team around, that'll work for us. The only term I would introduce is the one that I mentioned earlier, which is an invention of ours: this streaming table. Look, it's got the same attributes as streams, plus a lot more use cases that you can handle semantically with that primitive.
But it's a table that changes.
And probably when I say, Hey, it's a table that changes to a sixth grader,
they can generally understand probably what I mean. And that is exactly what it is.
Yep. Love it. Love the push for simplicity across the board. All right. Well, thank you so much,
everyone, for joining. We're a little over time. So thank you to the listeners and to the panelists.
We really appreciate your time and have learned so much.
Thank you.
Thanks very much, everybody.
Cheers.
Me and Kostas, we get to talk to a lot of really smart people.
You know, I think my takeaway was that through all of the different complexities that we talked through, right?
So use case complexities, technology complexities, different opinions on that, was that at the end
of the day, when we asked them how to describe these things, you know, with the user question,
everyone said, keep it really simple, right? And Ashley even said, you know,
when I first started, I didn't know what ETL meant, right? I just called it processing. I'm taking data from
here and moving it over here.
I was just processing data, right? And I really appreciated that because I think it's good
with all of the new technology, sort of the new ways of thinking about things.
I think it's really healthy for us to step back and say, you know, at the end of the day,
we're moving data. The technology can do some really cool stuff, but the fundamentals
are actually not that complicated. And I think the other sub-point, I guess I'm
forcing a two-for-one here, is that Jeff made the point that the user, the end user who wants
the data, could care less. So those are just really, really good reminders for me. How about you?
Yeah, absolutely.
I think that's what I'm going to keep from this conversation that we've had
is that the experience matters a lot
when you are working with these tools.
And there are like two levels of the experience.
One is the experience that the developer has
who's going to build whatever on top of these technologies.
And then there's also the user experience,
which is the consumer of the data that comes out of these systems.
And both of them, if we want to move forward
and increase the adoption of these tools,
we need to make sure that they don't have to learn new terminology,
they don't have to learn new ways of thinking and designing systems,
and we need to keep things familiar and simple as much as we can.
Yep.
Well, another amazing live stream episode of the panel.
We have more of these on the schedule, so be sure to keep your eye out.
We'll let you know when they're coming and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.