The Data Stack Show - 184: Kafka Streams and Operationalizing Event Driven Applications with Apurva Mehta of Responsive
Episode Date: April 3, 2024

Highlights from this week's conversation include:
- Apurva's background in streaming technology (0:48)
- Developer experience and Kafka Streams (2:47)
- Motivation to bootstrap a startup (4:09)
- Meeting the Confluent founders and early work at Confluent (6:59)
- Projects at Confluent and transition to engineering management (10:34)
- Overview of Responsive and event-driven applications (12:55)
- Defining event-driven applications (15:33)
- Importance of latency and state in event-driven applications (18:54)
- Low Latency and Stateful Processing (21:52)
- In-Memory Storage and Evolution of Kafka (25:02)
- Motivation for KSQL and Kafka Streams (29:46)
- Category Creation and Database-like Interface (34:33)
- Developer Experience with Kafka and Kafka Streams (38:50)
- Kafka Streams Functionality and Operational Challenges (41:44)
- Metrics and Tuning Configurations (43:33)
- Architecture and Decoupling in Kafka Streams (45:39)
- State Storage and Transition from RocksDB (47:48)
- Future of Event-Driven Architectures (56:30)
- Final thoughts and takeaways (57:36)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com. We are here with Apurva from Responsive,
and we are so excited to chat about many topics today, event-based applications,
and all the infrastructure, both underneath and on top, to run them. So Apurva,
thank you so much for joining us on the show today.
Yeah, great. Thank you for having me. Excited to be here.
Give us just your brief background. You spent a lot of time in the world of streaming.
You're a multi-time entrepreneur, but give us the story. How did you get into the world of streaming?
Yeah, thanks. I have. I've been at LinkedIn, then Confluent, and now I'm doing Responsive. But I first got exposed to streaming at LinkedIn; I wrote a Samza job back in the day. And through LinkedIn, I got to know the founders of Confluent, moved on to Confluent, and worked on Kafka, KSQL, and Kafka Streams for years, basically. And Responsive continues the line: we are building a platform to manage these event-driven applications in the cloud. So it began at LinkedIn,
I would say, in 2013.
That's awesome. Well, I actually have many questions I'd like to ask you, and hopefully we'll have the time to go through all of that stuff. Some of the high-level topics that I'm really interested in, first of all, I'd love to hear from you:
What is Kafka, as a technology, as a protocol, or perhaps as a standard? And what's happening with the products that are built around it?
The next thing that is very interesting, inspired by what you are doing today, is developer experience and things around Kafka Streams: what it means to build with Kafka Streams, why we would want to build things with it, and what can be done better for the engineers out there who are using Kafka Streams, and Kafka obviously, to work with that.
So that's like a few of the things that I'd love to get deeper into.
But how about you?
What are some things that you're excited about discussing today?
Yeah, all of that sounds great, and those are all topics I'm very passionate about. I think it's good to tell the story, right? Like what you said: Kafka as a tech standard, as a technology, the differentiation there. I would also say it's worth talking a bit about this world on top of Kafka, right? Stream processing, I think, is a very overloaded term. There are many use cases, many different ways to slice that space up, right? Many different technologies in that space. I think talking about the space itself, sharing my point of view on how to think about it, and then maybe getting into Kafka Streams and how the technology is different, with a focus on Kafka Streams, would be a very interesting and educational conversation.
So I'm excited about doing that.
Yeah, 100%.
Let's do it.
What do you think, Eric?
Yeah, let's dig in.
This is going to be great.
All right, well, you gave us a brief
introduction about your time at LinkedIn and how you sort of got into streaming and how you met
some of the Confluent people. I want to dig into that. I want to dig into that story, especially
the beginnings at LinkedIn. But when we were chatting before the show, you talked about bootstrapping your own startup.
And I just love to ask you a couple of questions.
I know we have several entrepreneurs
who are on the show.
What motivated you to bootstrap your own app and company?
It was way back in 2011.
So before that I had started at Yahoo, actually in the cloud platform team; 2008 was my first job. Yeah, and I think I was almost three years at Yahoo.
And then, you know, I was like, I don't know, I just had the itch, right? I think I've always been a startup person; I always wanted to do it. And back then the App Store and Apple were new, right? Everyone was writing an app. Cloud services were new. So it was like the thing to do. And even back then there was a big thing around bootstrapping companies being a great way to have a good life. I was also very young, right? You're very impressionable. So it's what you do, right? I had saved some money from Yahoo, I had some experience building stuff at a good company, and I also played music. For recording music, sharing music recordings, annotating music recordings, there were no products, right? And I said, why don't we build something that's cloud native, mobile native, that allows groups to share rehearsals with each other, discuss rehearsals. And it was definitely a problem I had personally. And that was also standard advice: solve a problem you personally have. That's a great way to start a company. So that's kind of the genesis.
Yeah, very cool. And what instrument or
instruments do you play? I stopped since my son was born several years ago,
but I played the tabla for a long time.
They're Indian drums, Indian classical drums.
Oh, cool.
So I played that for 10, 12 years pretty intensively.
I built an app to help with my lessons
and my practices and rehearsals also.
So yeah, it's just lovely. I hope to get back into it.
Yeah, that's great. Okay, just one
more question. When's the last time you went back
and looked at the code you wrote for that
first app that you built?
Oh.
It's been many years. I don't know. I guess once I went to LinkedIn, you know, family, life, it was very busy. Probably from when I stopped, I never looked at it again.
Yeah.
The reason I ask is I was looking for a file the other day and I ran across this old projects
folder, you know, and I was like, whoa, you know, this has like 10 year old stuff in it.
And you look at it and you're like, wow, that seemed like so awesome at the time.
Okay. So LinkedIn. Tell us how you met the Confluent founders at LinkedIn.
And, you know, it was somewhat related to the work that you were doing there, but give us a little bit deeper peek into that story.
Yeah. So at LinkedIn, I did two projects. One was on the graph database. LinkedIn has, and I don't know if it's still true, but back then it was, one in-memory graph database that serves all your connections. Basically, it's a very fast, low-latency lookup of who your connections are.
And that kind of is used by the whole website to do all sorts of things, right?
Like, can you even see someone's profile? In your search results, what's the ranking, right? There are a lot of calls made to that. At least there used to be; I can't speak to what it is today. But the statistic back then was that for every call to linkedin.com, there would be 500 calls to this graph database through the whole service tree.
Wow.
And often they'd stack up. So it was not very efficient. The bottom layer would call, and then it would go up the layers, and the upper layers would also call. So it's like a massively distributed microservices architecture, which, you know, no one actually optimized top to bottom.
But anyway, the net effect is that if there was any latency at the bottom layer, it could basically halt LinkedIn.com. There would be cascading timeouts through the service stack, and LinkedIn.com could be down.
And we're talking microseconds. If the microsecond lookup went up to like a millisecond, LinkedIn could go down. And this is true: there was a Saturday back in 2013 where LinkedIn was down for eight hours.
Wow.
And the reason was LinkedIn.com just wouldn't load.
And the reason was this graph database was slightly slow.
So basically, it was a P0 for the company to figure it out. I started digging into it, and ultimately it led deep into the configuration of Linux on LinkedIn's hardware, whatever they had in the data center.
They had what you call multi-socket systems, and basically the memory banks were divided across the two CPU sockets. And Linux was set up to keep each memory bank pinned to its socket, so a thread running on one socket could not read memory attached to the other. So if, say, you had 64 gigs of RAM, which is what we had at the time, even if only 32 was filled, it was paging out data because of that setting. It was some NUMA interleave setting; I forgot what it's called. But basically, it was a BIOS configuration
plus a Linux configuration
that caused us to use no more than half the memory.
So we had tons of memory,
but we were paging out at half.
And because of the page-outs, your in-memory lookup is no longer an in-memory lookup.
It's a disk lookup.
That's a latency spike and that kind of cascades up.
And that was a root cause of our latency.
I wrote a blog post about this.
We can put it in the show notes.
I can send it to you.
But basically, having discovered that, it also affected Kafka
because Kafka is an in-memory system.
It expects, for its performance, to not be hitting disk.
And they were also getting paged out.
It was a problem for them too.
And this helped them and
other teams. But basically because of this, I kind of got to know the founders of Confluent.
And since then, we've been in touch and that's how Confluent happened later, I would say.
Very cool. And just tell us a little bit, you said there were a couple of things you worked
on at Confluent. Can you talk about, were those areas of interest
for you?
You joined Confluent pretty early, so maybe they were just key areas of product development,
but tell us a little bit about that.
Yeah.
I started at Confluent in the summer of 2016. And I think the first project was to add transactions. Basically, these exactly-once semantics where, when you produce a message to Kafka, it is persisted only once, right? And then a transactional system on top of that, where you can also consume a batch of messages exactly one time. So I can commit a batch of messages across multiple topics as a unit, and then make sure only the committed messages are consumed. So that was a major upgrade to the Kafka protocol and Kafka's capabilities, and I worked on that project. There were three of us. It was probably the most fun I've had as an engineer, because there were two really good engineers, and it's a really hard problem. Streaming exactly-once was very innovative, I would say.
Very hard from an engineering perspective.
We actually made the performance better, or at least there was no loss, even though you had transactions, because we optimized so much else to get the performance there.
So it was a great engineering thing.
And it was just a lot of fun doing it with that team back then, you know, the whole company in a small one-room office in Palo Alto.
It was a lot of fun.
So yeah, so that was my first project.
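For readers who want to see the shape of that work, here is a minimal sketch of Kafka's transactional producer API; the broker address, transactional id, and topic names are hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1"); // hypothetical id
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            // Writes across multiple topics commit or abort as a single unit; consumers
            // running with isolation.level=read_committed see only committed batches.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));    // hypothetical topic
            producer.send(new ProducerRecord<>("audit-log", "order-42", "created")); // hypothetical topic
            producer.commitTransaction();
        }
    }
}
```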
Then I moved on to the KSQL team.
KSQL had just launched around the time we shipped these transactions, and it was very popularly received; it's streaming SQL on Kafka. The company wanted to invest more, so I moved to the KSQL team, and I did a technical project to add joins to KSQL. And then Confluent was growing really fast, right? At some point in 2018, they had only two or three engineering managers for 50, 60 engineers.
Wow.
And so they asked, did I want to manage this KSQL team? I said, sure, why not? And since then, I became the manager for KSQL, Kafka Streams, all of it, and launched the Cloud product. I would say that's what I was doing at Confluent for the last four years: learning how to be an engineering manager.
Yeah.
And you have started Responsive, so can you just give us a quick overview
because it obviously works
with Kafka as a first-class citizen
but is a layer on top for building
event-based applications.
Correct, yeah.
This idea for Responsive came about
in 2022.
What we saw at Confluent was
these developers are
building all sorts of truly mission critical apps.
You know, there's real traction there, and we can probably get into that in the show. I guess that's a big topic of conversation. But essentially, maybe I'll keep it short for now, and then we can dig into it as we go.
But basically, what I saw at Confluent was this category of backend applications that are not in the direct path. You're not loading a website and sending a request to an application that's serving from a database; you're kicking off something in the back. It could be fulfilling an order. It could be fulfilling a prescription. There's actually a pharmacy that uses
Kafka Streams to fulfill prescriptions.
It could be things in your warehouse, logistics, very common. All the applications that make sure things get to the right place at the right time are, in general, a very good fit for these backend event-driven patterns, right?
You know, so something happened, you have to react to it,
you have to keep state around what's happening
to make the right decision.
And this has to be programmed, right?
So there's application developers building these systems.
And I thought that was tremendous. It's actually very popular; I'm surprised how many of these apps exist, whether they're written directly with the producers and consumers, or written with Kafka Streams, or whatever.
But my thing was that, look, it's a massive market.
The pattern, the architecture makes a lot of sense for a lot of people.
Kafka, the data source is there.
But there's no real focus on that,
making that really easy and good to do, right?
So that was where Responsive came in: can we, in the cloud, assuming that Kafka is ubiquitous, become the platform, the true foundation for writing these apps, making them scalable, easy to test, easy to run, all of it? For me, the opportunity is there, and Responsive was founded to capitalize on and add value in that space.
Yeah, makes total sense. Well, I want to dig into that. Kostas probably wants to dig into that even more, just around, you know, sort of operationalizing event-driven apps. But
before we dig into the specifics, can we just talk briefly about
what you consider an event-driven app, right? I mean, that can mean a ton of different things.
Like you said, there is a sense in which it's ubiquitous, right? If there's any sort of process
happening in the real world that needs to be represented digitally to accomplish some, you know, process
or logistical flow within an organization, you know, those are technically sort of all
potentially event-based applications.
But how do you think about that?
Are there parameters around how you would define it?
Do you think a lot of people would have different definitions?
Yeah, I mean, a very reductive definition, and this probably describes all applications, so it's almost too broad, would be that it's a pattern: are you writing some functions which are basically waiting for triggers to execute, potentially triggering other functions downstream of them, and in totality achieving some outcome? Reductively, functional programming is event-driven, right? You just have function invocations, and a chain of them achieves an outcome.
I think in the context of what I said about Responsive, these are more, I would say, backend. If you think about it, even a web server is event-driven, right?
You're waiting for a request to come in and you're responding to it.
The request is the event and the response is, you know, the output.
And in a service mesh, service-oriented architecture,
there's a stack of these services calling each other.
And like that's basically synchronous and event-driven, right?
But I would say the asynchronous event-driven is kind of the space Kafka, the space on event-driven, right? But I would say the asynchronous event-driven
is kind of the space Kafka,
the space on top of Kafka, right?
Like where you logged an event to Kafka
and now that event is being propagated to many apps
and they are all reacting to it in their own time
in some sense and doing different things with it, right?
So yeah, I would say these asynchronous backend event-driven apps basically have a source like Kafka, right? I don't know if that's a useful categorization, but that's really the categorization in which Responsive operates: you have these apps that are built off of event sources like Kafka, and we are focusing on that area, if that makes sense.
I'm happy to have a discussion, honestly, like how do you all see it?
Right.
And another way to see it is that there are different things people do even within that categorization, right? You could do some sort of streaming analytics or streaming ETL, or you can actually build these true workflow-like things on top of it. But there are so many ways to slice that too.
Yeah. Maybe I'm interested, Kostas, in your opinion as well. I mean, one question I have for both of you, and this may be
a very oversimplified, you know, way to look at it or even an unfair question, but
you said, you know, functional programming, if you're extremely reductive,
is essentially sort of an event-driven application. But one thing that's interesting about some of the
use cases that you mentioned around logistics and other things is that there's a timestamp
that represents something happening in the physical world, right? Or like an input from
something outside of the application creating a trigger,
how important do you think that is in terms of defining an event-driven application?
And Kostas would love your thought as well.
I think, and I'm going to talk from my experience hearing people talk about batch and stream processing, primarily analytical workloads, it feels like merging the two is the holy grail. You can hear people saying, you know what? Batch is just a subset of streaming in the end. And if we can make everything streaming, why would we need batch, right?
But if we set aside marketing and trying to sell new products and all these things, and go back to the fundamentals, I think there is a very small set of dimensions that really matter here. One is the latency that is associated with, let's say, the use case that you have. I mean, how fast do you have to respond to something?
Yeah.
Because, sure, in the general case, if everything were free in this universe, then having an instant answer to any question I have would be great. But our universe does not work like that, right? So there are use cases where low latency matters. Let's say I'm swiping my credit card, and someone has to go and do, I don't know, a credit check, or do something like fraud detection, right? Sure. By the way, I would ask people, next time they pay online for something and they put
their credit card, like just count the time it takes and they will see that like there
is like some time there, right?
It's not like loading the Google landing page, for example, right?
And we are okay with that. We are happy with this trade-off, right? To wait a little bit longer, but make sure that we are safe, that no one is going to steal our identity or our credit. And the other thing is state. What I mean by state is how much and how complicated the information that we need to act upon has to be, right? If, for example, we needed to go and keep track
of every possible state, every possible interaction
that the user has done in the past couple of years, right?
It's just impossible to be like,
hey, I'm going to process this on the fly
and make sure that we have it in a sub-second latency back.
So this is at least the mental model that I have built when I'm trying to figure out what is needed out there and, in the end, where the right trade-offs are. Because there are always trade-offs made.
I don't know, Apurva, if you agree with that mental model, or if you would add something to it or remove something from it.
Yeah, I think that's a good way of thinking about it. I think essentially what you're saying is, if you need low-latency, sophisticated stateful processing, event-driven backend apps are probably a good fit.
Fraud detection is the most common example, right?
Like it's complicated to get it right.
It needs to be done with low latency.
So having something running that reacts to your swipe and checks it for fraud before coming back that it's good to go, that's a perfect example of an event-driven architecture being a good fit. And I don't know if you've seen it, but Walmart has a talk about this: they use Kafka Streams when you check out on all their walmart.com and jet.com properties. It's sitting there evaluating your order for fraud, giving you recommendations while you're in the shopping cart. All of those are very
low latency, right?
You know, it's all done on, you know, Kafka Streams is part of that thing.
Anyway, that's a good categorization.
To Eric's question: you mentioned the timestamp in the message. I think one important property, outside of latency, is also replayability.
Generally, unless you're actually responding to a request, like I do a click, and in the credit card case you go to another screen and you're waiting, and then it comes back and you're good to go; in that case there's a very synchronous aspect to it, but you're made to wait. The other way, you generally don't want to wait: you're clicking a button, you want a response, right? So you're never going to use Kafka-based systems to serve your click, right? Unless it's for these very complicated fraud kinds of use cases.
But I think there's a lot where you don't actually have that real need to ping back. And logging an event to something like Kafka, which makes it durable, and then writing backends to process it, actually has many nice properties, right? Even when you talk about batch versus streaming: streaming is incremental processing. It's actually cheaper over time to do streaming, because you're not reprocessing the same thing again and again, by definition. And then there are properties like this: if you have a durable event written to a log,
there's auditability to it.
And if you find a bug in your app, you can replay the events, potentially, if you built your system right, and redo the output. These are all opportunities you don't have in a synchronous request-response world, because once the response is gone, it's gone. But if you actually build an event system, you have very nice properties around making it better, right? It's actually a better way to build.
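Replay, in practice, amounts to rewinding consumer offsets on the durable log. A minimal sketch; the broker address, group id, topic, and partition are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-reprocessor"); // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0); // hypothetical topic
            consumer.assign(List.of(partition));
            // The events are still in the log, so a fixed-up app can simply rewind and reprocess.
            consumer.seekToBeginning(List.of(partition));
            consumer.poll(Duration.ofSeconds(5))
                    .forEach(record -> System.out.println(record.key() + " -> " + record.value()));
        }
    }
}
```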
I think I'm obviously biased.
So I think that's a very nuanced aspect of, you know,
of when to choose the architecture, right?
You know, you might even want to do it
for reasons outside of core latency reasons.
So like, you know, it could be more efficient.
It's more operable.
It's more auditable.
It's more like, you know,
you can eventually get better correctness over time, right?
It's also complicated,
but the architecture lends itself to those properties.
Yeah, I have a question on that.
So that's great, right? Being able to keep track of all the interactions that change the state out there, being able to replay, and even do things like time travel, like, say, wanting to see how my user looked five hours ago, whatever. It's great. It's amazing. But my question here is, you said at some
point, Kafka is an in-memory system. And my question here is that what we're describing here
is a system that will infinitely grow in terms of the data it has to store. As we create more
interactions, we have to keep them there and make sure we can
go back and like access all that.
So with a system that has to live in memory to be performant and provide the low-latency guarantees that everyone is looking for in Kafka, how can you also make sure that you can store everything in there, right?
How did it work from the beginning? And if there's been evolution in how Kafka treats this, I'd love to hear the story.
Yeah, that's a great question.
I think this is a super interesting
one. I mean, I did say Kafka
expects to be in memory,
right? I think in the original design, and I think it's still true, it used the Linux page cache very heavily to make sure the segments were in memory, so that almost every read is just hitting memory. There were also optimizations where you could copy from the page cache to the network buffer directly, so it doesn't go through the JVM and all that stuff. So it was highly optimized for low latency and efficiency at scale. And originally, and it's still true to a large extent, a lot of the traffic in Kafka is metrics, right? At LinkedIn, one of the big original use cases was that the entire monitoring of linkedin.com went through a metrics log to Kafka and was then served on the monitoring dashboards and whatever.
So I think there are use cases where you have high-volume writes and a large number of reads of recent data that expect super low latency, where the data has to be in memory. And that's a class of use cases, right?
So one dimension is a high volume of writes. Then there's a notion of fan-out: for one write, how many reads are there on that event, right? Is it five? Is it 10? Is it 20? And there's a notion of how soon after the write the reads happen, right? Is it very soon after? Are you tailing the end of the log, basically?
Are many consumers tailing the end of the log?
And Kafka, I would say, is optimized for that.
You have a high volume of writes and many readers tailing the log.
And a lot of systems are built where that's basically what the workload looks like. Kafka is highly optimized for that.
And I think that's a very good use of Kafka.
It's the metrics and logging kind of use cases, right? Then maybe there are some apps that need that kind of transactional data. That's generally not as high volume; that's generally the point, right? With transactional data, every click on your website is not the same as every checkout, right? They're many orders of magnitude apart.
So Kafka started with that.
And I think over time,
now they added compacted topics.
So what that means is that for every key that you write,
you can just keep the latest version of that key.
So you kind of significantly reduce the data you store.
Then they have tiered storage now, which tiers older data off to S3, but you can still read it through the same protocol; the latency will just be higher, right? So I think with things like compacted topics and tiering, you could keep it as the system of record for the evolution of the data in your company, right? And that opens up different use cases.
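As an aside, compaction is a per-topic setting. A minimal sketch of creating a compacted topic with Kafka's AdminClient; the topic name, partition count, and replication factor are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the latest record per key, so the
            // topic can serve as a durable "latest state" table rather than a feed.
            NewTopic topic = new NewTopic("user-profiles", 6, (short) 3) // hypothetical sizing
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```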
And this is kind of what
they're starting to get into, right? Like Kafka is amazingly not that old, right? As systems go,
Confluent itself is going to be 10 years old this year, right? Like it's not that old,
you know, for a new category of data systems. So I would say the original use cases were these high-volume, high fan-out, low-latency ones around metrics distribution and that kind of thing, which is still very popular.
And then you have more of these transactional application use cases, system of record use
cases that are actually there.
Like banks use it, like many really sophisticated organizations use Kafka in that capacity.
It's evolving in that direction, right?
So both those things are true.
And the latter is newer, I would say, relatively speaking.
Yeah. And you said, okay, it started like that. And then you mentioned there were two things that were built on top of that, right? One was KSQL and
also Kafka Streams. So first of all, what was the reasoning behind building those? The system initially was, let's say, very resilient and performant, almost like a queue. I know it's not a queue, but it's something where you write data, and then you have consumers on the other side that can really quickly read and tail the log. Why get into systems that are much more complicated in terms of the logic that you can build?
An example of that is SQL itself. It's a completely different model, a relational model, a different way to interact with data, instead of having events one after the other. So what was the reason behind that? What was the motivation to build these? And where are these systems today, right? How are they used, and how did they get started?
Yeah, that's a great question. I think,
so, by the way, just chronologically,
like Kafka Streams at Confluent came first, right?
It was launched in 2015 or 16
and then KSQL was launched in 2017.
KSQL is actually built on Kafka Streams.
But I would say like the motivations for either
kind of is little even before that, right?
I would say like if you look
at the early internet companies, right?
Like the ones that did
open source like twitter like linkedin so i can speak to linkedin right like there was kafka and
the same group roughly who did kafka also built samsa right samsa is kind of a stream processor
right yeah and what are the use cases motivating use cases of samsa like i would say like for
example we used it to ingest into the graph database.
So we're getting the exhaust of like, you connected a change log, someone created a
connection on LinkedIn. That's a log logged in Kafka. And then there was a SAMHSA job that would
take that and write it in the format that could be indexed in the graph database, for example.
So it's kind of like an ETL job, real-time ETL job.
Same for search. Eventually, after I left, they used Samza to do search ingestion. You need to build a live index of recent activity.
The common thing is that if I connect with you,
the first thing I do is search for you.
And you want that search to show up,
not some random person,
if you have a common name,
but the connection, right?
So that needs to be indexed.
That needs to be indexed quickly.
So that was a Samza job, right?
So there is this class of,
you know, sophisticated processing
you need to do on these event streams
for things like this real-time ETL
or things like real-time fraud detection,
like real-time analytics
kind of use cases, right?
So that's the motivation.
So basically, you could do it yourself: just consume events, write the state management, write the fault tolerance, write the load distribution. These are highly elastic, scalable, stateful operations. If you do it on your own, you have to bear the scaling, load distribution, failure detection, and state management yourself, right? And Samza and Kafka Streams and KSQL and others are all trying to solve exactly those problems of load management, liveness detection, fault-tolerance protocols, and state management, and give it to you in a consumable API, so that as a developer, you're just thinking about processing your events: I get an event, I need to process it, I have state available in processing that event, and everything else just works.
That's the goal of those systems.
That's the motivation for them to exist.
And there's a huge class of use cases
that actually are stateful, are scalable,
are high scale, need high availability
that people want to build on these event streams.
And you need technology for that.
It's not that you couldn't do it on your own, right? But most teams would fail to do it well.
Yeah, that makes sense.
And okay, so you said KSQL was built on top of Kafka Streams, right? So why put SQL there as the interface, instead of something more primitive? And primitive is not a bad thing here, right? It's just a different API, at the end, for describing the business logic that you want to execute on top of the data.
I mean, I can't speak definitively about KSQL; I joined that project after it was already launched in public preview. So I wasn't there at the genesis of those conversations. But in general, I can say this about the whole category of event streaming. Honestly, I'm proud to have been at Confluent, because they kind of created this category. This data in motion is now something you understand. It's different from data at rest, right?
It's hard to imagine, but before then, what is this? There was no category, right? This whole space didn't exist. And they created a category of these event-driven, data streaming, data-in-motion systems that are distinct from your database request-response, data-at-rest systems, right?
And what I'm trying to say is that category creation is hard.
Oh yeah.
Right.
And if you're selling a database, and this is something Jay at Confluent, the CEO and founder, says: if you're selling a database, you're not creating a new category; you just compete on why your database is better than their database, right? People understand they need a database. Now, do they need your database? That's a much simpler question to answer as a company selling something.
If you're selling an event streaming system, like, what is it?
I don't know.
Do I need it?
Is it like streaming video to me?
Is it like streaming a live event online?
People don't know the words, right?
And so I think the idea was that
if you can make it look like a database,
then you have expanded the number of people
who can get what you're doing.
It basically was the bet to grow adoption, and it played out; many more people tried KSQL than other things, you know. So, again with the caveat that I wasn't there for the genesis conversations for KSQL specifically, I would say pretty confidently that was the idea.
Like, can you broaden the market
by making it look like a database,
giving it in a language that people already know, SQL, right?
That would be, I would think, the main reason to do it.
Yeah, yeah, 100%.
And I think it's a good way for me to ask the next question
because I think one of the reasons that people try to introduce,
let's say, new APIs for interacting with a system is because you want to provide a different experience to the user, right? And by that, you might improve the experience that they have and also expand the user base. As you said, there are obviously many more people who can write SQL than people who can write Kafka Streams, right?
And it makes sense.
So, and with that in mind,
I'd like to ask you about, and I'm talking also as a person who has built production systems on top of Kafka: at my first company, Blendo, Kafka was a very core part of the platform that we built. Back then at least, and I'm talking about 2014, 2015, managing Kafka in production was not the easiest thing, right?
It was like a system that was promising a lot,
it could deliver a lot,
but you really had to know a lot about the system itself and the internals of the system in order to manage it properly.
And obviously that makes it hard
for more people to work with the technology,
right? So how have you seen, let's say, the developer experience around Kafka evolving over these years? And could you share a little bit about what makes it hard to work with Kafka, and what are some ways of solving and abstracting these problems?
So you mean Kafka, or Kafka Streams, or both?
Both, actually. I just want to start from Kafka because Kafka Streams is built on top of Kafka,
right? So I'm sure that there are things that are inherited from there, right?
Actually, it's very different.
But yeah, we can start with Kafka.
So with Kafka, for sure, right? You have these massively stateful brokers. It was built for a company like LinkedIn, which has very good engineers, right? It's very hard to massage all of that into something you can just take and run. Beyond a certain scale and beyond a certain level of mission criticality, you have to learn a ton of stuff. There's so much inside Kafka, so many knobs, so much tuning, so much stuff you have to learn to run it well. So it's not surprising it was problematic for most people in 2014. I think now there are so many good managed services, right?
Like Confluent: we use Confluent Cloud in our company now, and honestly, we don't have to care. It basically just works on the backend. And I think that's a huge thing: you start from basically zero, pay for what you use, and if you're low volume you pay little, and it just keeps scaling for you. You basically now just care about the protocol, right? It's not going to break on you from a performance perspective. So I think today, in 2024, it's a completely different world from 2014.
There are many managed services for Kafka offering different tiers of service. On one end, Amazon basically runs it on some hosts for you, and you still have to learn a lot. And you have these fully managed serverless offerings on the other end of the spectrum,
and many people competing in that space.
So I think most people today are covered. And if you're running it on your own, there are so many other companies that do management tooling and observability tooling; there are at least five companies doing that. Many more people actually know it now; there are courses and trainings and certifications for Kafka operators. You have Strimzi and all these other people who are shipping Kubernetes operators that you can run on OpenStack or whatever.
So there's so much around it that regardless of where you are
in terms of your requirements, you probably won't have
as much of a problem as you did in 2014, right?
So I think it's come a long way. I'm sure there's a lot left to do, but I think there are so many
more options now.
And what about Kafka Streams? What does it mean to operate Kafka Streams? What's the developer experience there, and what's the space for improvement? What can be done to make Kafka Streams a much better experience for someone to develop with and manage?
Yeah, I don't think we've talked about what it actually is, right? Kafka Streams is a library that ships with Kafka. It's open source, part of the Apache Kafka project. And it gives you these really rich APIs: you can write a function that reacts to an event, and it has state that you can build up over time
that is maintained for you, right?
Kafka Streams, you take this library, build this app,
deploy the app, and now Kafka Streams can scale you out
to new nodes, scale you in when you scale down,
and it kind of detects node failures,
rebalances stuff onto other nodes.
It maintains state across all that, right?
Moves state around for you.
And if you think about it, it's just a library running in your app. You're running an app like any other app,
but it does all this stuff for you on the backend, right?
So that's what the library gives you built in.
And it's great.
People love it for that flexibility.
You literally are an application team running your app.
You own your pager, you have your tools.
And it feels like the other services you operate, and people love it because of that, as opposed to submitting a job to someone else's cluster, which is a very different mindset.
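To make that concrete, here is a minimal sketch of a stateful Kafka Streams application, the classic word count; the application id, broker address, and topic names are hypothetical. The count is state the library maintains for you, backed by a local store and a changelog topic, and scaling out is just starting more instances of the same app:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Each record is an event; the running count is stateful and is maintained
        // by the library in a local state store backed by a changelog topic in Kafka.
        KTable<String, Long> counts = builder.<String, String>stream("sentences") // hypothetical topic
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)
                .count();
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long())); // hypothetical topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start(); // scaling out = starting more instances of this same app
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```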
But I think then coming to your question,
so that's an intro for Kafka Streams, but I think
the problem is that because it's a library
now you as an application team are basically running a sharded distributed replicated database
if you have a stateful application, right?
And now you have to learn about the...
It's kind of like what you have to do with Kafka in 2014, right?
You have to learn about all the internals, about how it works to solve problems when
you hit them and you will hit them, right?
And with default Kafka Streams, you have to collect the metrics somehow. There are JMX scrapers, there are MBeans, whatever; you have to get them out. Which metrics? There are like 700 metrics. Which ones should you care about? There are like 500 tuning configurations. Which ones should you tune, right? And then you have to do all that. You have to collect the logs at the right level for the right classes. And you do all that to solve problems when you hit them in production with your stateful applications, right?
then when it doesn't work i've been on calls you write support calls where they didn't have metrics
they didn't have logs but they needed to work right now. Yeah. Right? So you're very unprepared
a lot of the time
because you don't know.
It's so magical to start with.
It hides a lot of complexity,
but, you know,
it only hides it so far.
Like, you would not deploy
an application today
with a co-located database.
It's totally all co-located,
distributed database
and run it in production
and expect it to work.
Right? But really, that's what people are doing with Kafka Streams, and that's the root cause of a lot of issues. So what we are doing, for example, is allowing people to have their cake and eat it too: keep the form factor (it's still a library running in your application), keep all the great things, right? But delegate all the
hard things to us, right? So that you don't have to solve it. We give you clean SLAs,
clean interfaces. And that kind of is a big step forward for operations, right? Like,
you know, you're running it like you would other production services, where you've separated state from compute. You're letting expert teams run the distributed stateful stuff for you behind clean SLAs. I would say that's the biggest problem: this coupling is great to start, and it's horrible when you hit the limits of it. I think that's the biggest problem to solve. And if you solve that well, more people will write more of these apps, because they're
really compelling. There's a lot you can do with it, and it's really easy to start with.
So that's an answer to your question. That's the reason: to focus on those operational problems.
Yeah. And how do you do this decoupling? I'm sure it's a hard problem, but can you take us through the architecture that you had to build, in the end, to achieve this decoupling, and with it, offer all the additional benefits that you talked about?
Yeah.
I mean, I would say, first of all, we have the experts at our company, right? I think it matters that a lot of the people we have built the system, so we know how to do it. I think the other big important thing is that Kafka Streams was always written as an application framework, right? It is meant to link into your app, so you can write your app anywhere you want. It was also built so that the underlying systems are very componentized. The state store is an interface. You can implement whatever state store you want. You can implement
any assignment logic you want. You can implement anything
you want. You can implement whatever client you want.
So basically, you could plug in everything
at the bottom.
It was built so that we could do it.
What we are really doing,
and obviously, that was a design concept.
Implementation is much harder.
But it was always designed with that in mind.
So what we're doing is, okay, we're taking advantage of that design.
And we're plugging into points where it's clean to plug into already by design.
So you actually have to rewrite zero code in your app. Our customers have moved over in 10 minutes, you know.
Completely offload state, completely offload management,
but you don't need to change anything.
So I think it's both.
It was designed like that, and we know the system, and the two things combined allow us to deliver this. It's basically still Kafka Streams; nothing changes in your application.
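A minimal sketch of that pluggability: the store behind a stateful operation is supplied through an interface, so swapping implementations does not touch the topology. Here the bundled in-memory supplier stands in for any custom store; the topic and store names are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
import org.apache.kafka.streams.state.Stores;

public class PluggableStoreSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // The store behind a stateful operation is an interface. Swapping the
        // default RocksDB-backed store for another KeyValueBytesStoreSupplier
        // (here the bundled in-memory one) leaves the topology code untouched.
        KeyValueBytesStoreSupplier supplier = Stores.inMemoryKeyValueStore("counts-store");

        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String())) // hypothetical topic
               .groupByKey()
               .count(Materialized.<String, Long>as(supplier)
                       .withKeySerde(Serdes.String())
                       .withValueSerde(Serdes.Long()));
    }
}
```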
Yeah, yeah.
So I was going through, at a very high level, the documentation of Responsive. And I saw there, and correct me if I'm wrong, I'm not an expert in Kafka Streams, but because we are talking here about managing state, Kafka Streams has to store the state somewhere.
It has to store the state somewhere.
If I'm not wrong, the state is stored in a RocksDB instance.
Is that correct?
Yeah, correct.
One of the things that you change with Responsive is what's the data store
where the state is stored.
So you have ScyllaDB there, and you also mentioned MongoDB.
Can you tell us a little bit about that?
Why did you decide to move away from RocksDB to these systems?
And what's next after that?
Because I think, again, if I understood correctly from what I've read, you're also working on building your own store there.
Tell me a little bit about that, and then I have
one last question on this topic, but
I don't want to just put too many questions
together. So, yeah, tell us a little bit
about that.
Yeah, so I think, you're right, it's RocksDB by default. State is materialized in RocksDB. The reason to move, again, is that if you're running a replicated, sharded RocksDB in the context of your application, it hurts operations a lot, right? If you lose a node, you now basically have to reread all the state from a Kafka topic to materialize a new RocksDB, and that takes away elasticity, right? There are a lot of solutions in the protocols on top to help you manage that, but they complicate those protocols, and often you get into vicious loops: you're stuck trying to restore state, but your group keeps rebalancing, and you stay in an infinite restore loop. Unless you have the right tunings, it just doesn't work. There's a lot of complexity with doing it that way.
So that's why. Honestly, most app teams don't want to run a stateful app. They want to run stateless apps that are autoscaled, right? So you have to remove storage. It's almost a precondition for good operations, in my opinion. That's a strong view we have as a company as well. So that's the answer to why we removed RocksDB: it's just operationally really hard once you hit a certain scale. And
we have heard so many stories of people saying: don't touch the system, because if you touch it, when it breaks, you can't fix it. And that kind of scares people. But yeah, so we solved
that by removing it. And then, why Scylla and Mongo, and eventually our own? Basically, you still need a transactional store, right?
Like I mentioned, Kafka does transactions, and Kafka Streams allows you to do read-modify-write sequences of events. It's an extremely powerful primitive for writing correct applications. And so you need your store to be transactional too, right? That's why we had to pick these kinds of stores. You couldn't just dump it to S3; you couldn't do something naive. So transactions are a big requirement, and so is the ability to transact with the right performance, right? That's another problem: the semantics have to be there, and the performance also has to be there. So we've done a lot of work to make these transactional systems in Mongo and Scylla work with Kafka Streams at good performance. Excellent performance, I would say.
So I think the other thing is also just in terms of form factor, right?
Like Mongo especially is everywhere.
Like all the big companies use it.
They already have contracts, and many of them have a requirement that things can't leave their data center. They can't leave the network, right?
You bring your own Mongo, it's already blessed.
You really have very little
security risk. So from an adoption
perspective, especially because Kafka Streams
is used in
most major enterprises, they're very locked
down. And just saying that you bring your own Mongo and we will make it work really well is extremely good from a procurement and security perspective, which is usually a blocker for startups, right? 'We can't give you our data.' Too bad.
We're saying, you don't have to give us your data,
you keep your data.
We'll just make it work for you, right?
So that's kind of the other reason to go with these vendors.
And then that comes to your question about our own store: these are also not fully optimized, right? We already have a system that works. In default Kafka Streams, you have RocksDB on EBS, typically, if you're on AWS, replicated over Kafka, right? We could pull it out and replicate over S3, like some companies are doing with Kafka itself: make it tiered to S3, keep a really cheap intermediate cache to serve quick reads and to take quick buffered writes, and make it cheaper than the default.
But with infinite scalability, because you have S3, and with no management, right? So that's the long term; that will further open up the market for us. You can have really high-state, high-scale apps without worrying about cost, with great operations. So that's the reason to do it. It's still a long way off. Honestly, Scylla and Mongo work really well.
And for a lot of people who are building transactional apps who have strict data governance and data residency requirements, it's much easier for them to just use something that's blessed in their company.
Yeah, makes sense.
And the final part of the equation: why not Kafka itself? You talked a little bit to the why, but I just want to make it more explicit, because as you say, you built transactional semantics on top of Kafka. So why not just use Kafka itself and simplify the architecture, right? Store the state there too?
I mean, Kafka is basically a log.
Kafka Streams by default does store the state in Kafka
and then rebuilds it into RocksDB
when a new node comes online or whatever.
It's called a restore process.
I think this whole problem of
if you want real elasticity,
this process of rebuilding state or even sometimes re-downloading from S3, it still takes time if it's a lot of state.
So having any kind of local state that needs to be there locally on a local disk before you can do any processing can be a big problem, especially at scale.
And even if it's not at scale, if it takes a long time to rebuild from Kafka, you're waiting a long time to scale out.
And that could be deadly for some people, right?
So I think that's the biggest,
I mean, I think it's debatable
whether that's the right architecture
if you have a really mission-critical,
high-available app,
and especially if you need elasticity, right?
Just the time spent waiting to build up a terabyte of state from Kafka could be a day. Are you going to wait a day? And I think that's really the question.
And people do it.
Honestly, people do it.
Like there are many companies successfully using it
and they've configured it
and they can work with it.
But most people, right, hit problems, right?
And I think having the choice is great.
Makes sense.
Okay, cool.
I'll give the microphone back to Eric.
So Eric, all yours again. I have a feeling we need to arrange another one of these discussions, and I have some very interesting questions that were triggered by all the things that we talked about. I'm really looking forward to having you back again. But Eric, all yours.
Yeah, I agree.
Well, we're right at the buzzer, as Brooks likes to say. But, you know, we have talked.
Your knowledge of streaming systems is so deep, Apurva.
It's really incredible.
But I can't help but ask: if you had to go solve some other technical problem that didn't have anything to do with streaming, if you sort of had a blank check to go solve a problem in a different area, what interests you, you know, sort of outside of streaming?
Like, what do you think about
when you're not thinking about streaming?
Oh, man.
Or is it possible after so many years for you,
you think in streams now?
I would say the latter is true.
If it's not streaming and not this,
it's like my kids, my family,
it's like anything but book.
You see everything as a Kafka topic.
No, I mean, I think it's not that, right? There are obviously so many great things to do, but for me, having been at Confluent for so long, I would say it took a couple of years even at Confluent to get into the mindset. And once you're in the mindset of these event-driven architectures, this durable log that's the source of truth of all data in your whole company (that's the vision at Confluent; the central nervous system is another thing they say), I just think that honestly there's so much that can be done in this space. There are so many ways: the apps of the future, the kinds of things you could do with them, how fast you could deliver great new capabilities. I think we're in the very early stages, I fundamentally believe that, and that's genuinely exciting. It's hard technical problems, first of all.
As an engineer and technologist, that's exciting.
Many of these are unsolved.
And then from a business and use case perspective,
I think there's so much you could do
to kind of grow the market and innovate.
And pricing, right?
Pricing, how do you package the product?
So many things. Nobody makes money in stream processing; I shouldn't say that since our company is in the space, but it's just a fact, right? I think there's a lot of value there also, right?
And honestly, there are hard technical problems and, I think, a clear market, because people are doing things, right? It's just that it's hard to do it and capture the value. Honestly, it's perfect for me, and I have a background in it; it's not like I'm just coming off the street after reading a book. It's a deep excitement, not a superficial excitement. So maybe it's a disappointing answer, but honestly, no, not at this point, right? I'm pretty excited about it.
Yeah. I think deep excitement is a really good way to describe it.
And that's very palpable just talking with you.
So very excited for what you're building.
And thank you again for teaching us so much
with your time with us on the show.
Thank you so much for having me.
It was a lot of fun.
I hope I didn't run too long with my answers. But yeah, happy to do this again, you know, if you want to continue the conversation. It was a lot of fun.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by
Rudderstack, the CDP for developers.
Learn how to build a CDP
on your data warehouse at
rudderstack.com.