The Data Stack Show - 184: Kafka Streams and Operationalizing Event Driven Applications with Apurva Mehta of Responsive

Episode Date: April 3, 2024

Highlights from this week's conversation include:

- Apurva's background in streaming technology (0:48)
- Developer experience and Kafka Streams (2:47)
- Motivation to bootstrap a startup (4:09)
- Meeting the Confluent founders and early work at Confluent (6:59)
- Projects at Confluent and transition to engineering management (10:34)
- Overview of Responsive and event-driven applications (12:55)
- Defining event-driven applications (15:33)
- Importance of latency and state in event-driven applications (18:54)
- Low latency and stateful processing (21:52)
- In-memory storage and evolution of Kafka (25:02)
- Motivation for KSQL and Kafka Streams (29:46)
- Category creation and a database-like interface (34:33)
- Developer experience with Kafka and Kafka Streams (38:50)
- Kafka Streams functionality and operational challenges (41:44)
- Metrics and tuning configurations (43:33)
- Architecture and decoupling in Kafka Streams (45:39)
- State storage and the transition from RocksDB (47:48)
- Future of event-driven architectures (56:30)
- Final thoughts and takeaways (57:36)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. We are here with Apurva from Responsive, and we are so excited to chat about many topics today: event-based applications, and all the infrastructure, both underneath and on top of them, needed to run them. So Apurva, thank you so much for joining us on the show today.
Starting point is 00:00:47 Yeah, great. Thank you for having me. Excited to be here. Give us just your brief background. You spent a lot of time in the world of streaming. You're a multi-time entrepreneur, but give us the story. How did you get into the world of streaming? Yeah, thanks. I've been at LinkedIn, Confluent, and now I'm doing Responsive. But I first got exposed to streaming at LinkedIn.
Starting point is 00:01:09 I wrote a Samza job back in the day. And through LinkedIn, got to know the founders of Confluent, moved on to Confluent, worked on Kafka,
Starting point is 00:01:20 KSQL, Kafka Streams for years, basically. And Responsive continues the line: we are building a platform to manage these event-driven applications in the cloud. So yeah, it began at LinkedIn, I would say, in 2013. That's awesome. Well, I have a couple of, actually, I have many questions that I'd like to ask you. Hopefully we'll have the time to go through all of that stuff,
Starting point is 00:01:43 but some of like the high level topics that I'm really interested in, first of all, I'd love to hear from you. What is Kafka? And what are, as a technology, as a protocol, or as a standard, probably, and what's happening with the products that are built around it and the next thing that is like very interesting is and inspired from what you are doing like today about developer experience and like things around like kafka streams right like what's what it means like to build why we want to build things with Kafka streams, right?
Starting point is 00:02:26 And what can be done better for the engineers out there that they are using Kafka streams and Kafka, obviously, to work with that. So that's like a few of the things that I'd love to get deeper into. But how about you? What are some things that you're excited about like discussing today yeah all of that that sounds and you know those are all topics i'm very passionate about i think it's good to tell the story right like what you said kafka as a tech standard as a technology the differentiation there i think i would also say it's talking a bit about like this world on top
Starting point is 00:03:03 of kafka right like stream processing i think is a very overloaded world. There are many use cases, many different ways to slice that space up, right? Many different technologies in that space. You know, I think talking about the space itself and how, you know, how, you know, sharing my point of view on how to think about it and then maybe getting into Kafka Streams and how the technology is different with a focus on Kafka Streams, I think would be a very interesting,
Starting point is 00:03:30 I think, an educational conversation. So I'm excited about doing that. Yeah, 100%. Let's do it. What do you think, Eric? Yeah, let's dig in. This is going to be great. All right, well, you gave us a brief
Starting point is 00:03:46 introduction about your time at LinkedIn and how you sort of got into streaming and how you met some of the Confluent people. I want to dig into that. I want to dig into that story, especially the beginnings at LinkedIn. But when we were chatting before the show, you talked about bootstrapping your own startup. And I just love to ask you a couple of questions. I know we have several entrepreneurs who are on the show. What motivated you to bootstrap your own app and company? It was way back in 2011.
Starting point is 00:04:21 So I had just started before that at Yahoo, actually in the cloud platform team, you know, in 2008 was my first job. Yeah, and I think I was almost three years at Yahoo. And then, you know, I was like, I don't know, I just had the age, right? Like, I think I've always been a startup person, right? I always wanted to do it. And back then the App Store, Apple was new, right? Everyone was writing an app right cloud services were new right yeah so it was like the thing to do you know that you know and i still think but even back then there was a big thing around bootstrapping companies being a very great way to have a
Starting point is 00:04:57 good life and you know so that i was also very young right you're very impressionable so i mean it's what you do right like i saved some money from yahoo i had some experience building stuff at a good company and you know i had this i also played music and you know this music recording sharing music recordings annotating music recordings it was not no products for that right and i said why don't we build something that's cloud native mobile native that allows you know groups to kind of share rehearsals with each other, discuss rehearsals. And it was definitely a problem I had personally. And that was also standard advice, solve a problem you personally have. That's a great way to start a company. So that's kind of the genesis. Yeah, yeah. Very cool. And well, what instrument or
Starting point is 00:05:41 instruments do you play? I stopped since my son was born several years ago, but I played the tabla for a long time. Like it's an Indian drums, Indian classical drums. Oh, cool. So I played that for 10, 12 years pretty intelligently. I built an app to help with my lessons and my practices and rehearsals also. So yeah, but yeah, it's just a lovely,
Starting point is 00:06:03 I hope to get back into it. Yeah, that's great. Okay, just one more question. When's the last time you went back and looked at the code you wrote for that first app that you built? Oh. I have many years. I don't know. I guess once I went
Starting point is 00:06:19 to LinkedIn, you know, family, life, it was very busy. It's been, probably when I stopped, I never looked at it again. Yeah. The reason I ask is I was looking for a file the other day and I ran across this old projects folder, you know, and I was like, whoa, you know, this has like 10 year old stuff in it. And you look at it and you're like, wow, that seemed like so awesome at the time.
Starting point is 00:06:44 And okay. So LinkedIn. So tell us about meeting. up in it and you look at it and you're like, wow, that seemed like so awesome at the time. Okay. So LinkedIn. So tell us about meeting, how did you meet the Confluent founders at LinkedIn? And, you know, it was somewhat related to the work that you were doing there, but give us that a little bit deeper peek into that story. Yeah. So at LinkedIn, I was working, I did two projects. One was on like the graph database, you know, so like LinkedIn has like this, I don't know if it's still true, but back then it was, you know, like one in memory graph database that serves all your connections. Like basically it's a very fast, low latency lookup that, you know, who's my, who's your connections.
Starting point is 00:07:16 And that kind of is used by the whole website to do all sorts of things, right? Like, can you even see someone's profile? Can you, in your search results, what's the ranking, right? Like there's you even see someone's profile? Can you, in your search results, what's the ranking, right? Like there's a lot of calls made to that. At least it used to be, now I can't speak to what it is today, but basically like getting the statistic back then is for every call to linkedin.com, it would be 500 calls to this graph database through the whole service tree. Wow. And often they'd stack up. So like, you know, it was not very efficient.
Starting point is 00:07:45 I would say the bottom layer would call and then it'll send layers up and the upper layer would also call back. And so it was not, you know, like it's like a massively distributed microservices architecture, which, you know, obviously no team actually, no one actually optimized it top to bottom. But anyway, but the net effect is if there was any latency at the bottom layer, it could basically halt LinkedIn.com. Like basically,
Starting point is 00:08:10 there'll be a cascading timeouts through the service stack and LinkedIn.com could be down. Like, and we're talking microseconds. If the microsecond lookup went into like a millisecond, LinkedIn would, and this is true.
Starting point is 00:08:22 There was a Saturday back in 2013 where LinkedIn was down for eight hours. Wow. And the reason was LinkedIn.com just wouldn't load. And the reason was this graph database was slightly slow. And so that's kind of basically, it was a P0 for the company to figure it out. And, but I started digging into it and ultimately it led deep into basically the configuration of Linux on LinkedIn's hardware.
Starting point is 00:08:54 I don't know what they had in the data center. So they had like what you call multi-socket systems and basically the memory banks were kind of divided across the two CPU sockets. And then Linux was set up to keep each bank only pinned. So if you have a thread running on one socket, it cannot read memory from another. So if by chance, like you had 64 gigs of memory, that's what we had that time in RAM. Even if 32 was filled, they were paging out data because of the setting.
Starting point is 00:09:21 It was some NUMA interleave setting. I forgot what it's called. But basically it was a BIOS configuration plus a Linux configuration that caused us to use no more than half the memory. So we had tons of memory, but we were paging out at half. And because it was a page out now, your in-memory lookup is no longer in-memory lookup. It's a disk lookup.
Starting point is 00:09:40 That's a latency spike and that kind of cascades up. And that was a root cause of our latency. I wrote a blog post about this. We can put it in the show notes. I can send it to you. But basically, having discovered that, it also affected Kafka because Kafka is an in-memory system. It expects for its performance to be not hitting disk.
Starting point is 00:09:59 And they were also getting paged out. It was a problem for them too. And this helped them and other teams. But basically because of this, I kind of got to know the founders of Confluent. And since then, we've been in touch and that's how Confluent happened later, I would say. Very cool. And just tell us a little bit, you said there were a couple of things you worked on at Confluent. Can you talk about, were those areas of interest for you?
Starting point is 00:10:28 You joined Confluent pretty early, so maybe they were just key areas of product development, but tell us a little bit about that. Yeah. I started at Confluent in 2016, summer. And I think the first project was to add transactions. Basically, this exactly one semantics where, you know, you basically, when you produce a message to Kafka,
Starting point is 00:10:49 it is persisted only once, right? And then a transactional system on top of that, where, you know, you can also consume it like a batch of messages exactly one time. So I can commit a batch of messages across multiple topics as a unit
Starting point is 00:11:02 and then only make sure the committed messages are are consumed so that was like a major upgrade to the kafka protocol and kafka capabilities so i worked on that project so there's three of us it was probably the most fun i've had as an engineer because two really good two really good engineers it's a really hard problem streaming exactly once is kind of like it was very innovative, I would say. Very hard from an engineering perspective. We actually made the performance better.
Starting point is 00:11:31 Even though, or like at least no loss, even though you had transactions, because we actually optimized so much else to get the performance there. So it was a great engineering thing. And I think it's just a lot of fun doing it with that team back, you know, in a small like one room in Palo Alto, you know, the whole company. It was a lot of fun. So yeah, so that was my first project. Then I moved on to the KSQL team.
Starting point is 00:11:56 KSQL just launched around the time we launched this transaction. It was actually very popularly received, like it's streaming SQL on Kafka. And then the company wanted to invest more. So I moved to the KSQL team. And then I did a technical project to add joins to KSQL. And then Confluent was growing really fast, right? So they had only like two or three engineering managers for like 50, 60 engineers at some point in 2018. Wow. And so they asked you, do I want to manage this KSQL team? I said, sure, why not? And then since then, I kind of became a manager of KSQL, Kafka Streams, all Groovy, Launch
Starting point is 00:12:31 the Cloud product. But that would say the last four years, that's what I was doing at Confluent, learning how to be an engineering manager. Yeah. And you have started Responsive, so can you just give us a quick overview because it obviously works with Kafka as a first-class citizen but is a layer on top for building
Starting point is 00:12:53 event-based applications. Correct, yeah. This idea for Responsive came about in 2022. What we saw at Confluent was these developers are building all sorts of truly mission critical apps. You know, like there's a real distraction and we can probably get into that in the show.
Starting point is 00:13:12 I guess that's a big topic of conversation. But essentially, maybe I'll keep it short for now and then we can just dig into it, you know, as we go rather than. But basically, yeah, I mean, what I saw at Confluent was, you know, this category of like, you know, you have these kind of backend applications that are not in the direct, like you're not sending, loading a website and sending a request, right? But to an application that's serving from a database, but basically you're starting off something in the back. Like it could be fulfilling an order. It could be, you know, like, you know, fulfilling a prescription prescription. There's actually a pharmacy that uses Kafka Streams to fulfill prescriptions. It could be
Starting point is 00:13:49 things in your warehouse, logistics, very common. All the applications that make sure things get to the right place at the right time in general are very good fit for these event-driven patterns, back-end event-driven patterns. backend event-driven patterns, right?
Starting point is 00:14:06 You know, so something happened, you have to react to it, you have to keep state around what's happening to make the right decision. And this has to be programmed, right? So there's application developers building these systems. And I thought that was a tremendous, it's actually very popular. Surprised, like so many of these apps exist,
Starting point is 00:14:23 whether they're directly written with the producers or consumers or written with Kafka streams or whatever. But my thing was that, look, it's a massive market. The pattern, the architecture makes a lot of sense for a lot of people. Kafka, the data source is there. But there's no real focus on that, making that really easy and good to do, right? So that was where Responsive, like, can we be in the cloud,
Starting point is 00:14:50 assuming that Karka is ubiquitous, you know, become the platform, you know, the true foundation for writing these apps that make it scalable, easy to test, easy to run, you know, all of it. So that's kind of, we just, I mean, for me, the option is there and responsive is kind of founded to capitalize and kind of add value in that space. Yeah, makes total sense. Well, I know, I want to dig into that. Costas probably wants to dig into that even more just around, you know, sort of operationalizing event driven apps. But before we dig into the specifics, can we just talk briefly about
Starting point is 00:15:27 what you consider an event-driven app, right? I mean, that can mean a ton of different things. Like you said, there is a sense in which it's ubiquitous, right? If there's any sort of process happening in the real world that needs to be represented digitally to accomplish some, you know, process or logistical flow within an organization, you know, those are technically sort of all potentially event-based applications. But how do you think about that? Are there parameters around how you would define it? Do you think a lot of people would have different definitions?
Starting point is 00:16:02 Yeah, I think, I mean, a very reductive definition would be, right, if you have an application, like, and this probably describes all applications, so it's kind of irrelevant. But basically, like, ultimately, like, you know, like, it's more like a pattern, right? Like, are you writing some functions which are basically waiting for triggers to kind of execute and, you know and potentially triggering other functions downstream of them. And in totality, achieving some outcome, right? Like basically it's kind of like reductively
Starting point is 00:16:31 like functional programming is eventual, right? You just function invocations, a chain of them achieves an outcome. I think in the context of what I said about responsive, these are like more, I would say, backend. Like basically, you could have, if you think of it like even a web server is event-driven, right? You're waiting for a request to come in and you're responding to it.
Starting point is 00:16:50 The request is the event and the response is, you know, the output. And in a service mesh, service-oriented architecture, there's a stack of these services calling each other. And like that's basically synchronous and event-driven, right? But I would say the asynchronous event-driven is kind of the space Kafka, the space on event-driven, right? But I would say the asynchronous event-driven is kind of the space Kafka, the space on top of Kafka, right? Like where you logged an event to Kafka
Starting point is 00:17:11 and now that event is being propagated to many apps and they are all reacting to it in their own time in some sense and doing different things with it, right? So yeah, I would say like, there's this asynchronous backend event-driven apps, you know, have basically a source like Kafka, right? I would say like, I don't know if that's a useful categorization, but that's really the categorization in which, for example, responsive, you know, that space of like, you have these apps that are built off of these event sources like Kafka. And we are there to kind of, you know, we are focusing on that area, basically, if that makes sense.
Starting point is 00:17:46 I'm happy to have a discussion, honestly, like how do you all see it? Right. And it, you know, there's another way to see it is that there's different things people do even in that categorization, right? Like you could do some sort of streaming analytics, streaming ETL, you can actually build like these, you know, true, you know, kind of workflow like things on top of it. But, you know, there's so many ways to slice that too. Yeah. Yeah. Maybe I'm interested, Kostas, in your opinion as well. I mean, one question I have for both of you is this may be
Starting point is 00:18:16 a very oversimplified, you know, way to look at it or even an unfair question, but you said, you know, functional programming, if you're extremely reductive, is essentially sort of an event-driven application. But one thing that's interesting about some of the use cases that you mentioned around logistics and other things is that there's a timestamp that represents something happening in the physical world, right? Or like an input from something outside of the application creating a trigger, how important do you think that is in terms of defining an event-driven application? And Kostas would love your thought as well.
Starting point is 00:18:54 I think, and I'm going to talk from my experience with hearing people talking about bots and stream processing, primarily analytical workloads, where it's kind of like people... It feels like it's the holy grail of merging the two. You can hear people saying, you know what? Bots is just a subset of streaming at the end. And if we can make everything streaming, why we would need bots, right?
Starting point is 00:19:32 I think, and we forget marketing here and like trying to, you know, sell new products and like all these things and go back to the fundamentals. I think there are like a couple of, like a very small set of dimensions that really matter here. And I think one is latency that is associated with, let's say, the use case that you have. I mean, how fast do you have to respond to something?
Starting point is 00:19:55 Yeah. Because, yeah, like, sure, like in the general case, let's say we can say if everything was free in this universe, yeah, if I can have an instant answer like any question i have like would be great but our universe does not work like that like it's
Starting point is 00:20:09 right so there are use cases where low latency matters let's say i'm swapping my credit card and someone has to go and do like i don't know like credit check or go and do something like a pro detection right sure by the way i would ask people next time that they pay online for something and they put their credit card, like just count the time it takes and they will see that like there is like some time there, right? It's not like loading the Google landing page, for example, right? And we are okay with that. We are happy with this trade-off, right?
Starting point is 00:20:44 Like to wait a little bit longer, but make sure that we are okay with that we we are happy with this trade-off right like to wait a little bit longer but make sure that we are safe like no one's going like to steal our identity or like our credit and the other thing is state and when i mean by state i mean like how much and how complicated the information that we need to act upon has to be right if for example we need to act upon has to be, right? If, for example, we needed to go and keep track of every possible state, every possible interaction that the user has done in the past couple of years, right? It's just impossible to be like, hey, I'm going to process this on the fly
Starting point is 00:21:20 and make sure that we have it in a sub-second latency back. So this is how, at least the mental model that I have built when I'm thinking about trying to figure out what is needed out there and at the end where is the right trade-offs. Because there are always trade-offs made. I don't know, Apoorva, if you agree with that mental model or you would add something
Starting point is 00:21:51 or remove something from that. Yeah, I think that's a good way of thinking about it. I think essentially what you're saying is if you need low latency, sophisticated stateful processing, event-driven backend,
Starting point is 00:22:03 event-driven apps probably are a good fit. Like fraud detection was a common, it's a most common example, right? Like it's complicated to get it right. It needs to be done with low latency. So having like, you know, something that's running that reacts to your swipe and, you know, detecting it for fraud before bringing back that it's good to go. That's a perfect example of an
Starting point is 00:22:25 event-driven architecture being a good use right i think that's and then you know i don't know if you know you know walmart has a talk about this but they use kafka streams when you check out on all their walmart.com jet.com properties it's sitting there you know evaluating your stuff for fraud giving you recommendations while you're on the shopping cart like all of those are very low latency, right? You know, it's all done on, you know, Kafka Streams is part of that thing. Anyway, that's a good categorization. I think to Eric's question, I think around, yeah, I think if there's a, yeah, I think you mentioned timestamp in the message.
Starting point is 00:22:58 I think one important property outside of latency, I think is also replayability. I think the general, like, unless like you're actually responding to a request that I do a click, and in the case of credit card, you know, you go to another screen and you're waiting, and then it, you know, it brings back and you're good to go kind of thing. But I think, and so then in that case, there's a very synchronous aspect to it, but you're kind of made to wait. But the other way you generally don't like you're clicking a button, you want a response, right? So those, I think,
Starting point is 00:23:27 you're never going to use Kafka-based systems to serve your click, right? Unless it's like for this very complicated, fraught kind of use cases. But I think there's a lot where you don't actually have that really need to ping back and then actually logging an event to something
Starting point is 00:23:43 like Kata, which makes it durable, and then actually logging an event to something like kata which makes it durable and then writing backends to kind of process them actually has many nice properties right like even when you talk about batch versus streaming like streaming is incremental processing it's actually cheaper over time to do streaming right because you're not reprocessing the same thing again and again by definition and then are properties like this where if you have a durable event written to a log, there's auditability to it. There's, you know, there's a lot of, and you know, you can, if you find a bug in your app,
Starting point is 00:24:12 you can replay the event. Potentially, if you build your system right and redo the output, these are all opportunities you don't have in like a synchronous request response world because once the response is gone, it's gone. But with event, if you actually build an event system, right, you actually have very nice properties around,
Starting point is 00:24:30 you know, making it better, right? Like it's actually a better way to build. I think I'm obviously biased. So I think that's a very nuanced aspect of, you know, of when to choose the architecture, right? You know, you might even want to do it for reasons outside of core latency reasons. So like, you know, it could be more efficient.
Starting point is 00:24:50 It's more operable. It's more auditable. It's more like, you know, you can eventually get better correctness over time, right? It's also complicated, but the architecture lends itself to those properties. Yeah, I have a question on that. So that's great, right? Like being able to, let's say,
Starting point is 00:25:08 keep track of all the interactions that change like the state out there and being able like to replicate and even do things like time travel, like let's say, want to see how like my user looked like five hours ago, like whatever. It's great. It's amazing. But my question here is, you said that at some point, Kafka is an in-memory system. And my question here is that what we're describing here is a system that will infinitely grow in terms of the data it has to store. As we create more interactions, we have to keep them there and make sure we can go back and like access all that. So with a system that has to live in memory to be performant and provide also that low
Starting point is 00:25:55 latency, let's say guarantees that everyone is looking for like in Kafka, how you can also make sure that you can store everything in there, right? How did it work from the beginning? And if there's evolution in how Kafka is treating these, I'd love to hear the story. Yeah, that's a great question.
Starting point is 00:26:17 I think this is a super interesting one. I mean, I did say Kafka expects to be in memory, right? i think the original design i think it's still true right like like it used the linux page cache significant you know very heavily to kind of make sure the segments were in memory so that almost every read is just hitting memory the outside optimizations where you could copy from the page guys to the network buffer directly so it doesn't go through the jvm and all that stuff so there was highly optimized for low latency efficient scale you know and i think the original
Starting point is 00:26:48 originally right and it's still to a large extent a lot of traffic in kafka it's kind of metrics right like you're logging all like at linkedin that was what it was used the primary one of the big original use cases was the entire monitoring of linkedin.com was through a metrics lock to Kafka and then served on the monitoring dashboards and whatever. So I think there are use cases where you have high volume writes and a large number
Starting point is 00:27:16 of reads of recent data that expect super low latency where they have to be in memory. And that's kind of getting into like, that's a class of use cases, right? So single in a high volume of writes, and then there's a notion of fan art,
Starting point is 00:27:31 like for one write, how many reads on that event, right? Like, is it five? Is it 10? Is it 20? There's a notion of how soon after the write is the reads happening, right? Is it very soon after? Like, are you tailing the end of the log basically, right?? Are you tailing the end of the log, basically?
Starting point is 00:27:48 Are many consumers tailing the end of the log? And Kafka, I would say, is optimized for that. You have a high volume of writes and many readers tailing the log. And a lot of systems are built where that basically just looks. And it's highly optimized for that. And I think that's a very good use of Kafka. It's kind of metrics, logging, use cases, right?
Starting point is 00:28:09 Maybe there's some apps that need, you know, like that kind of transactional data. It's generally not that high volume. That's generally the point, right? Like, you know, transactional data, you're not like every click on your website is not the same as every checkout, right? It's a completely many different orders of magnitude.
Starting point is 00:28:26 So Kafka started with that. And I think over time, now they added compacted topics. So what that means is that for every key that you write, you can just keep the latest version of that key. So you kind of significantly reduce the data you store. Then they have tiered storage now that tiers off older data to S3,
Starting point is 00:28:46 but you can still read it through the same protocol. It's just the latency will be slower, right? So I think with things like compacted topics, you know, tiering, you could keep it
Starting point is 00:28:54 as the, you know, the system of record for the evolution of the data in your company, right? And that opens up different use cases. And this is kind of what
Starting point is 00:29:03 they're starting to get into, right? Like Kafka is amazingly not that old, right? As systems go, Confluent itself is going to be 10 years old this year, right? Like it's not that old, you know, for a new category of data systems. So I would say the original was this high volume, high fan out, low latency use cases around, you know, metrics distribution and that kind of thing, which is still very popular. And then you have more of these transactional application use cases, system of record use cases that are actually there. Like banks use it, like many really sophisticated organizations use Kafka in that capacity.
Starting point is 00:29:38 It's evolving in that direction, right? So both those things are true. And the latter is more new, I would say, relatively speaking. Yeah. And you said, okay, like it started like that. And then you mentioned also there are like two things that were built like on top of that, right? Like one was like KSQL and also like Kafka streams. So first of all, what was the reasoning behind building that? Why in the system that initially was, let's say, this very resilient and performance, almost like a queue. I know it's not a queue,
Starting point is 00:30:17 but something where you write data and then you have consumers on the other side that they can really fast read and tail the log, why get into systems that are much more complicated in terms of the logic that you can build? An example of that is SQL itself. It's even a completely different way, like a model, like a relational model, like a way that's like you interact with data instead of
Starting point is 00:30:47 having events one after the other in there. So what was the reason behind that? What was the motivation to build this? And where these systems are today, right? How they are used and what they're started? Yeah, that's a great question. I think
Starting point is 00:31:03 so, by the way, just chronologically, like Kafka Streams at Confluent came first, right? It was launched in 2015 or 16 and then KSQL was launched in 2017. KSQL is actually built on Kafka Streams. But I would say like the motivations for either kind of is little even before that, right? I would say like if you look
Starting point is 00:31:21 at the early internet companies, right? Like the ones that did open source like twitter like linkedin so i can speak to linkedin right like there was kafka and the same group roughly who did kafka also built samsa right samsa is kind of a stream processor right yeah and what are the use cases motivating use cases of samsa like i would say like for example we used it to ingest into the graph database. So we're getting the exhaust of like, you connected a change log, someone created a connection on LinkedIn. That's a log logged in Kafka. And then there was a SAMHSA job that would
Starting point is 00:31:56 take that and write it in the format that could be indexed in the graph database, for example. So it's kind of like an ETL job, real-time ETL job. Same for search. There was like, in search, we used, eventually after I left, they used SAMHSA to do search ingestion. You need to build a live index of recent activity.
Starting point is 00:32:17 The common thing is that if I connect with you, the first thing I do is search for you. And you want that search to show up, not some random person, if you have a common name, but the connection, right? So that needs to be indexed. That needs to be indexed quickly.
Starting point is 00:32:30 So that was a SAMHSA job, right? So there is this class of, you know, sophisticated processing you need to do on these event streams for things like this real-time ETL or things like real-time fraud detection, like real-time analytics kind of use cases, right?
Starting point is 00:32:46 So that's the motivation. So basically, you could do it, just consumer events, write the state manager, write the fault tolerance, write the load distribution, write fault. These are highly elastic, scalable, stateful operations. If you do it on your own, you have to bear the scaling, load distribution, fault availability
Starting point is 00:33:09 detection, state management yourself. Right? And SAMHSA and Kafka Streams and KSQL and others all are trying to solve exactly those problems of load management,
Starting point is 00:33:25 liveness detection, false tolerance protocols, state management with all those properties and give it to you in a consumable API that as a developer, you're just thinking of processing your events.
Starting point is 00:33:40 I get an event, I need to process it. I have state available in processing that event and everything else just works. That's the goal of those systems. That's the motivation for them to exist. And there's a huge class of use cases
Starting point is 00:33:50 that actually are stateful, are scalable, are high scale, need high availability that people want to build on these event streams. And you need technology for that. It's not something you could do it on your own, right? But most teams will fail to do it well. Yeah, that makes sense. And okay, so you said KSQL was built
Starting point is 00:34:12 like on top of Kafka streams, right? So why put like SQL there as the interface? Like instead of like something more primitive. And primitive is not a bad thing here, what I'm saying, right? It's just like different API at the end, like to describe your business logic that you want to execute on top of the data. I mean, I can't speak definitively that like KSQL, I kind of joined that project after it was already launched in public review and whatever.
Starting point is 00:34:41 So I wasn't there at the origins of the of you know why you know the genesis of those conversations but i think in general i can say you know that you know in general this whole date category of events streaming right like this whole like if you i mean honestly like i'm proud to have been at confluent because they kind of created this category like this data in motion is now something you understand. It's different from data at rest, right? It's, you know, like this, like it's hard to imagine that before then, what is this? There's no category, right?
Starting point is 00:35:16 You know, like this whole space didn't exist. And they created a category of like these event-driven data streaming, data in motion systems that are distinct from your database request response, data address systems, right? And I think this is what I'm trying to say is that category creation is hard. Oh yeah.
Starting point is 00:35:35 Right. You know, and I think if you can make, like, you know, if you're selling a database and this is something that, you know, Jay at Confluent, the CEO founder says, like you're selling a database, you know, you you're selling a database, and this is something that, you know, Jay at Confluent, the CEO founder says, like, you're selling a database, you know, you're not creating a new category, you just compete on why your database is better than their database, right? People understand they need a database. Now, do they need your database is a much simpler question, right? In many cases to answer as a company selling something.
Starting point is 00:36:00 If you're selling an event streaming system, like, what is it? I don't know. Do I need it? Is it like streaming video to me? Is it like streaming a live event online? People don't know the words, right? And so I think the idea was that if you can make it look like a database,
Starting point is 00:36:17 then you have expanded the number of people who can get what you're doing. It expanded the people you can, it basically is the bet to grow adoption which is played out like ksql was you know many more people tried ksql than tried to focus things i didn't you know so so i think that was the i can i think now i'm i was again caveated i wasn't there to genesis conversations for ksql specifically but that was i would say pretty confidently,
Starting point is 00:36:45 that was the idea. Like, can you broaden the market by making it look like a database, giving it in a language that people already know, SQL, right? That would be the main, I would think the main reason to do it. Yeah, yeah, 100%. And I think it's a good way for me to ask the next question
Starting point is 00:37:02 because I think one of the reasons that people try to introduce, let's say, new APIs for interacting with the system is because you want to provide different experience to the user, right? And by that, you might improve the experience that they have and also expand the user base, right? As you said, like there are obviously many more people that they can write SQL. And then there are people that can write Kafka streams, right? And it makes sense. So, and with that in mind, I'd like to ask you about,
Starting point is 00:37:38 and I'm talking also like as a person who, like I built production top of Kafka, like my first company like Blendo, was actually... Kafka was a very core part of the platform that we built there. Back then, at least, I'm talking about 2015, 2014, managing in production Kafka was not the easiest thing. Right? managing in production Kafka was not the easiest thing, right? It was like a system that was promising a lot,
Starting point is 00:38:11 it could deliver a lot, but you really had to know a lot about the system itself and the internals of the system in order to manage it properly. And obviously that makes it hard for more people to work with the technology, right? So have you seen, let's say, the developer experience evolving around Kafka these years? And if you could share a little bit of around, let's say, what makes it hard to work with Kafka and what are some ways of solving these and abstracting them? So you mean Kafka or Kafka Streams or both?
Starting point is 00:38:53 Both, actually. I just want to start from Kafka because Kafka Streams is built on top of Kafka, right? So I'm sure that there are things that are inherited from there, right? Actually, it's very different. But yeah, we can start with Kafka. So I think Kafka, like, you know, I mean, for sure, right? Like you have these massively stateful brokers, right? It was, you know, built for a company like LinkedIn, which is very good engineers, right? It's very hard to kind of massage all of that into something you can just take and run
Starting point is 00:39:26 beyond the thing beyond a certain scale and beyond a certain you know level of mission criticality you have to learn a ton of stuff right there's so much inside Kafka right that so many norms so much tuning so much stuff you have to learn to run it well so it's not surprising it was problematic for most people in 2014. I think now, I mean, there's so many good management services, right? Like, you know, like Confluent, we use Confluent Cloud in our company now. And, you know, like, honestly, like, you don't, honestly, we don't have to care. Like, it basically mostly just works, right, on the backend. And so I think that's a huge thing right like you don't
Starting point is 00:40:06 think you start from basically zero and you know pay for what you use and your low volume pay little and it just keeps scaling it for you it's a you basically now just care about that protocol right and it just like you're it's not going to it's not going to break on you from a performance perspective so i think today in 2024 it's a completely different world from 2014. There's many managed services for Kafka offering different tiers of service. You know, like Amazon basically runs it on some host for you. You'd love to learn a lot. And you have like these fully managed serverless offerings on the other end of the spectrum
Starting point is 00:40:41 and many people competing in that space. So I think most people today, and then if you're running it on your own there's so many other companies that do management tooling and like observability tooling and like there's like at least five companies doing that if you run it on your own you know there are many more people actually know like there's courses and trainings and certifications for kafka operators so you know they have stream z and all these other people who are shipping Kubernetes operators that you can run on OpenStack or whatever.
Starting point is 00:41:08 So there's so much around it that regardless of where you are in terms of your requirements, you probably won't have as much of a problem as you did in 2014, right? So I think it's come along. I'm sure there's a lot to do, but I think there's so many more options now. And what about Kafka Streams? What it
Starting point is 00:41:27 means to operate Kafka Streams? And what's the developer experience there? And what's the space for improvement? What can be done to make Kafka Streams a much better experience for someone to develop and manage?
Starting point is 00:41:44 Yeah, I think i don't think we've talked about what it is actually right like i think kafka streams you know it's a library that ships with kafka it's open source it's part of apache kafka product and you know it kind of gives you these really rich apis like you know that you can write a function that reacts to an event. It has a state that you can build up over time that is maintained for you, right? Kafka Streams, you take this library, build this app, deploy the app, and now Kafka Streams can scale you out to new nodes, scale you in when you scale down,
Starting point is 00:42:16 and it kind of detects node failures, rebalances stuff onto other nodes. It maintains state across all that, right? Moves state around for you. And if you think about it, it's just the library running in your app. You're running an app like you're on onto other nodes, it maintains state across all that, right? Moves state around for you. And I think about it, it's just the library running in your app. You're running an app like you're on any other app, but it does all this stuff for you on the backend, right?
Starting point is 00:42:32 So that's what the library does in build. And it's great. People love it for that flexibility. You literally are an application team running your app. You own your pager, you have your tools. And it feels like other services you operate and people love it because it allows them to feel that
Starting point is 00:42:49 as opposed to submitting a job in someone else's cluster, which is a very different mindset. But I think then coming to your question, so that's an intro for Kafka Streams, but I think the problem is that because it's a library now you as an application team are basically running a sharded distributed replicated database if you have a stateful application, right?
Starting point is 00:43:13 And now you have to learn about the... It's kind of like what you have to do with Kafka in 2014, right? You have to learn about all the internals, about how it works to solve problems when you hit them and you will hit them, right? And the default Kafka streams, you have to collect the metrics somehow. There's JMX creeper, there's JM beans, whatever. You have to get them out.
Starting point is 00:43:33 Which metrics? There are like 700 metrics. Which ones should you care about? There are like 500 tuning configurations. Which ones should you tune, right? And then you have to do all that. You have to collect the logs at the right level for the right classes. And then you do all that you have to collect the logs at the right level for the
Starting point is 00:43:45 right classes and then you do all that to solve problems if you hit them in production with your stateful applications right most companies they start it's super nice it's great it works and then when it doesn't work i've been on calls you write support calls where they didn't have metrics they didn't have logs but they needed to work right now. Yeah. Right? So you're very unprepared a lot of the time because you don't know. It's so magical to start with. It hides a lot of complexity,
Starting point is 00:44:13 but, you know, it only hides it so far. Like, you would not deploy an application today with a co-located database. It's totally all co-located, distributed database and run it in production
Starting point is 00:44:23 and expect it to work. Right? But really, that's what people are doing with Kafka Streams Kafka streams right and that's the root cause of a lot of issues so I think you know like what we are doing for example it's kind of you know allowing people to eat their cake and have it do it keep the form factor it's still a library running in your application keep all the the great things, right? But delegate all the hard things to us, right? So that you don't have to solve it. We give you clean SLAs, clean interfaces. And that kind of is a big step forward for operations, right? Like, you know, you're running it like you would other production services where you're separated state from compute. You're letting teams, experts run the distributed stateful stuff for you behind clean
Starting point is 00:45:06 and SLAs. I would say that's the biggest problem. This coupling is great to start. It's horrible when you hit the limits of it. And I think that's the biggest problem I would say to solve. And I think if you solve that well,
Starting point is 00:45:22 you know, like more people will write more of these apps, because they're really compelling. There's a lot you can do with it, and it's really easy to start with. So that's kind of, that's an answer to your question. That's the reason to do, just to focus on those operational problems. Yeah. And how do you do this decoupling? I'm sure it's a hard problem, but but like, can you take us through like, let's say the architecture that you have to like build at the end to achieve this decoupling?
Starting point is 00:45:55 And with the decoupling, like, okay, then offer all the additional benefits that you talked about. Yeah. I mean, I would say, first of all, we have the experts right on our company, right? I think it matters that, you know, a lot of the people we have built the system. Like, so we know how to do it.
Starting point is 00:46:12 I think the other big important thing is Kafka Streams is always kind of written as this kind of application framework, right? Like it is meant to link into your app. So you could write your app anywhere you want. It was also built so that the underlying systems are very componentized, very all, like the state store is an interface. You can implement whatever state store you want. It was also built so that the underlying systems are very componentized. The state store is an interface.
Starting point is 00:46:28 You can implement whatever state store you want. You can implement any assignment logic you want. You can implement anything you want. You can implement whatever client you want. So basically, you could plug in everything at the bottom. It was built so that we could do it. What we are really doing, and obviously, that was a design concept.
Starting point is 00:46:44 Implementation is much harder. But it was always designed with that in mind. So what we're doing is, okay, we're taking advantage of that design. And we're plugging into points where it's clean to plug into already by design. So, you know, like you actually have to rewrite zero in your app. Like our customers have moved in 10 minutes, you know. Completely offload state, completely offload management, but you don't need to change anything.
Starting point is 00:47:07 So I think it's both. It was designed like that, and we know the system and the two things combined allow us to kind of deliver this with a very true, like it's basically Kafka Streams. Nothing changes in your application.
Starting point is 00:47:19 Yeah, yeah. So I was going through like a very high level like the documentation of responsive, and I saw there that a very high level documentation of responsive. And I saw there that, and correct me if I'm wrong, I'm not an expert in Kafka streams, but Kafka streams, because we are talking here about managing state, right? It has to store the state somewhere.
Starting point is 00:47:40 If I'm not wrong, the state is stored in a RocksDB instance. Is that correct? Yeah, correct. One of the things that you change with Responsive is what's the data store where the state is stored. So you have ScalaDB there and also you mentioned MongoDB. Can you tell us a little bit about that? Why you decided to move away from RocksDB into these systems?
Starting point is 00:48:11 And what's next after that? Because I think, again, if I understood correctly from what I've read, that you're also working on building your own store ads there. Tell me a little bit about that, and then I have one last question on this topic, but I don't want to just put too many questions together. So, yeah, tell us a little bit about that. Yeah, no, so
Starting point is 00:48:34 yeah, I think the reason to move, yeah, you're right, it's RocksDB by default, right? Status materialized in RocksDB. The reason to move, again, is if you're running a replicated Rocks, sharded RocksDB in the app context of your application, it hurts operations a lot, right? Like like if you're running a like replicated rocks sharded rocks db in the app context of your application it hurts operations a lot right like if you lose a node now basically have to reread from a kafka topic all the state to materialize a new rocks db and since the ticket
Starting point is 00:48:56 takes away elasticity right and then there are a lot of solutions on the protocols on top to help you manage that but they complicate those protocols and often you get into vicious loops of you know you're stuck trying to restore state but your group keeps rebalancing and your things keep in the infinite restore loop unless you have the right tunings it just doesn't work there's a lot of complexity with doing it that way so that's why you know i mean honestly like like most app teams don't want to run a stateful app. They want to run stateless apps that are autoscaled, right? So you have to remove storage. It's almost like a precondition for good operations, in my opinion.
Starting point is 00:49:34 That's a strong view we have as a company as well. So that's kind of the answer to why remove RoskDB. I think it's just operationally really hard once you hit a certain scale. And we have heard so many stories of people, don't touch the system. remove RoskDB, right? I think it's just operationally really hard once you hit a certain scale. You know, and we have heard so many stories of people, don't touch the system. Because if you touch it, when it breaks, you can't fix it. And that kind of scares, you know, like,
Starting point is 00:49:53 kind of feel. But yeah, so we solved that by removing it. And then why is still our Mongo and our own eventually is basically, you still need a transactional store, right? Like, you need basic, you know, like Kafka does, like I mentioned, we do transactions, you know, Kafka Streams does allow you to kind of read,
Starting point is 00:50:11 modify, write sequences of events. It's extremely powerful, primitive to write correct applications. And so you need your store now to be transactional too, right? So that's why we had to pick some of these kinds of stores that, you know, that kind of have these, you couldn't just dump it to S3, you couldn't just be something naive about it. So transactions is a big requirement and ability to transaction from the right with the right
Starting point is 00:50:32 performance, right? Like that's another problem. The semantics have to be there. The performance also has to be there. So we've done a lot of work to make these transactional systems in Mongo and sell out work with Kafka streams, right? At good performance. Excellent performance, I would say.
Starting point is 00:50:50 So I think the other thing is also just in terms of form factor, right? Like Mongo especially is everywhere. Like all the big companies use it. They already are contracted out and many of them have a requirement that things can't leave their data center. They can't leave the network, right? You bring your own Mongo, it's already blessed. You really have very little security risk. So from an adoption
Starting point is 00:51:07 perspective, especially because Kafka Streams is used in most major enterprises, and they're very locked down, just saying, you bring your own Mongo and we will make it work really well, is extremely good from a procurement and
Starting point is 00:51:23 security perspective, which is usually a blocker for startups, right? 'We can't give you our data, too bad.' We're saying: you don't have to give us your data, you keep your data, and we'll just make it work for you, right? So that's kind of the other reason to go with these vendors.
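Mechanically, this bring-your-own-store model rides on a real Kafka Streams extension point: any materialized state store can be swapped for a custom implementation via a KeyValueBytesStoreSupplier. The supplier below is a made-up stand-in for whatever a vendor actually ships; the point is that the topology is unchanged, and the state simply lives in your own MongoDB instead of local RocksDB, so the data never leaves your network.

```java
// Hypothetical supplier returning a KeyValueStore that writes to your own
// MongoDB cluster. KeyValueBytesStoreSupplier is the real Kafka Streams
// interface; MongoKeyValueStoreSupplier is illustrative, not a real class.
KeyValueBytesStoreSupplier mongoStore =
    new MongoKeyValueStoreSupplier("order-counts", "mongodb://mongo.internal:27017");

builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
       .groupByKey()
       .count(Materialized.<String, Long>as(mongoStore));
```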
Starting point is 00:51:41 And then, you know, that comes to your question about our own store: these off-the-shelf stores are also not optimized for this, right? We already have a system that works. In the default Kafka Streams setup you have RocksDB on EBS, typically, if you're on AWS, replicated over Kafka, right? We could pull that out and replicate over S3 instead, like some companies are doing with Kafka itself: make it tiered to S3 and keep a really cheap intermediate cache
Starting point is 00:52:04 to serve quick reads and to take quick buffered writes, and make it cheaper than the default, but with infinite scalability, because you have S3, and with no management, right? So that's kind of the long-term plan, and that will further
Starting point is 00:52:18 open up the market for us, right? You can have really high-state, high-scale apps without worrying about cost, with great operations. So that's the reason to do it. It's still a long way off, though. Honestly, Scylla and Mongo work really well.
Starting point is 00:52:41 And for a lot of people who are building transactional apps and have strict data governance and data residency requirements, it's much easier to just use something that's already blessed in their company. Yeah, makes sense. And the final part of the equation: why not Kafka itself? Because you mentioned it, and you talked a little bit to the why, but I just want to make it more explicit, because, as you say, you build transactional semantics
Starting point is 00:52:57 on top of Kafka. So why not just use Kafka itself and simplify the architecture, right? Store the state there as well? I mean, Kafka is basically a log. Kafka Streams by default does store the state in Kafka, and then rebuilds it into RocksDB when a new node comes online or whatever.
Starting point is 00:53:16 It's called the restore process. I think the whole problem is that if you want real elasticity, this process of rebuilding state, or even sometimes re-downloading it from S3, still takes time if there's a lot of state. So having any kind of local state that needs to sit on a local disk before you can do any processing can be a big problem, especially at scale. And even if it's not at scale, if it takes a long time to rebuild from Kafka, you're waiting a long time to scale out. And that could be deadly for some people, right? So I think that's the biggest issue.
Starting point is 00:53:50 I mean, I think it's debatable whether that's the right architecture if you have a really mission-critical, highly available app, and especially if you need elasticity. Just the time waiting to build up a terabyte of state from Kafka could be a day. Are you going to wait a day?
Starting point is 00:54:04 Right? And I think that's really the question. And people do it. Honestly, people do it. There are many companies successfully using it; they've configured it, and they can work with it. But most people hit problems, right?
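For reference, the tunings people lean on in stock Kafka Streams are real configs, though, as he says, they soften the restore problem rather than remove it. A sketch:

```java
// Keep a warm copy of each task's state on another instance, so failover
// promotes a standby instead of replaying the whole changelog from Kafka.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

// When scaling out, warm up state on new instances in the background
// before tasks move to them (KIP-441's "smooth scaling" behavior).
props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);
```

Standbys trade extra disk, network, and compute for faster failover; they still leave you with node-local state, not the stateless, autoscaled app he is arguing for.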
Starting point is 00:54:16 And I think having the choice is great. Makes sense. Okay, cool. I'll give the microphone back to Eric. So Eric, all yours again. I have a feeling we need to arrange another one of these discussions, because I have some very interesting questions that were triggered by all the things we talked about. I'm really looking forward to having you back again. But Eric, all yours. Yeah, I agree. Well, we're right at the buzzer, as Brooks likes to say. But, you know, we have talked. Your knowledge of streaming systems is so deep, Apurva. It's really incredible.
Starting point is 00:54:56 But I can't help but ask: if you had to go solve some other technical problem that didn't have anything to do with streaming, you know, if you sort of had a blank check to go solve a problem in a different area, what interests you outside of streaming? What do you think about when you're not thinking about streaming? Oh, man.
Starting point is 00:55:21 Or is it possible, after so many years, that you just think in streams now? I would say the latter is true. If it's not streaming and not this, it's my kids, my family, anything but work. You see everything as a Kafka topic. No, I mean, I think it's not that, right?
Starting point is 00:55:40 I mean, there are obviously so many great things to do, but for me, genuinely, having been at Confluent for so long... I would say even at Confluent it took a couple of years to get into the mindset. And once you're in the mindset of these event-driven architectures, this durable log that's the source of truth of all the data in your whole company (that's the vision at Confluent; the central nervous system is another phrase they use), I just think there's honestly so much that can be done in this space, right? So many possibilities: the apps of the future, the kinds of things you could do with them, how fast you could deliver great new capabilities. I think we're at the very early stages. I fundamentally believe
Starting point is 00:56:19 it, right? And that's genuinely exciting. These are hard technical problems, first of all, and as an engineer and technologist, that's exciting. Many of them are unsolved. And then from a business and use case perspective, I think there's so much you could do to grow the market and innovate. And pricing, right? Pricing, how do you package the product?
Starting point is 00:56:40 So many things. Nobody makes money in stream processing; I shouldn't say that, since our company is in the space, but it's just a fact, right? I think there's a lot of value there too. And honestly, you have hard technical problems and, I think, a clear market, because people are doing these things, right? It's just that it's hard to do it and capture the value. Honestly, it's perfect, right? And I have a background in it, right? It's not like I'm just coming off the street reading a book and getting excited. It's a deep excitement. It's not like
Starting point is 00:57:13 a super chill excitement. So honestly, to your question, I mean, maybe it's a disappointing answer, but no, not at this point, right? I'm pretty excited about it. Yeah. I think deep excitement is a really good way to describe it. And that's very palpable just talking with you. So very excited for what you're building. And thank you again for teaching us so much with your time on the show. Thank you so much for having me.
Starting point is 00:57:37 It was a lot of fun. I hope I didn't take too long to answer or run us over time. But yeah, happy to do this again, you know; it was a lot of fun, if you want to continue the conversation. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by
Starting point is 00:58:08 Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
