The Data Stack Show - Data Council Week (Ep 1) - The Evolution of Stream Processing With Eric Sammer of Decodable
Episode Date: April 23, 2023

Highlights from this week's conversation include:

- Eric's journey to becoming CEO of Decodable (0:20)
- Does real time matter? (2:12)
- Differences in stream processing systems (7:57)
- Processing in motion (13:04)
- Why haven't there been more open source projects around CDC? (20:34)
- The Decodable experience and future focuses for the company (24:31)
- Streaming processing and data lakes (32:54)
- Data flow processing technologies of today (39:01)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
All right, we are in person at Data
Council Austin, and we are able to sit down with Eric Sammer. He's the CEO at Decodable. I'm Brooks.
I'm filling in for Eric. He got in a biking accident. He's fine, but wasn't able to make
it to the conference. So I'm coming out from behind the curtain here and excited to chat with Eric.
We've got Costas here, of course. But Eric, to get started, could you just kind of give us your
background and what led to you now kind of becoming CEO at Decodable? Yeah, absolutely. First of all,
thanks for having me. It's always a pleasure to get a chance to talk about this stuff.
Yeah, I mean, so my background, you know, I've been doing this now for 25-something years, really focusing around data infrastructure. So I lovingly refer to myself as an infrastructure monkey: while people are doing fancy math and cool stuff with data, I'm moving bytes around and, you know, flipping zeros to ones. So I spent a lot of time working on things like SQL query engines and stream processing infrastructure, which has really taken up the last decade or so of my life. I built a bunch of systems internally for mostly marketing and advertising applications. And then sometime around 2010, late 2009, 2010,
I wound up being an early employee at
Cloudera and spent like four years working on sort of the first generation of big data
stuff and then wound up creating a company that eventually, you know, we were
acquired by Splunk and spent a bunch of time there working on real time infrastructure,
stream processing and just like cloud platforms in general for observability data.
And then about two years ago, broke out and started Decodable, which is a stream processing
platform as a cloud service. We could get into the details if that's interesting.
Really focused around being able to acquire data from all of the fun and interesting sources, process that data in real time and get it into all the right destination systems in the format that's like ready for use in analytics.
Cool. One thing we were chatting about just before we hit record that you kind of brought up is just the idea of does real time matter?
Could you unpack that for us and just kind of talk about what you mean there?
Yeah.
I know there are different camps who would probably argue different things.
Yeah, I mean, you know, I think there is a segment of the market who,
I mean, people probably break down into three groups.
There are people who are very sure of what they get out of real-time data.
And by real-time specifically, I mean low-latency, sub-second availability of data,
either for analytics or for driving online applications and systems and those kinds of things.
There's one group who fully understand
it, know exactly what they're talking about, and have a strong opinion about it. There's a group
of people who say, well, it depends on the use case. And like some use cases demand real time,
and some use cases don't. And then there's a third group of people who say nothing really matters.
Like real time is never important and those kinds of things.
And, you know, I think like, you know, selection bias, of course, but like we talk to the second
and third group, you know, first and second group, sorry, you know, most of the time.
And I would say like the biggest thing that we hear from some people is, you know, my
use case doesn't require real time.
And like the interesting thing there is that like
at some level, I don't disagree. The thing I would point out is that like, if you asked three years
ago, whether or not you needed to know exactly whether or not your food had been picked up
from the restaurant and where it was in between the restaurant and your house, everybody would have gone, like, who really cares? And then COVID hit, and now everybody fully expects up-to-the-second visibility into where their fried chicken is, right? So what winds up happening is the use case, I would argue, doesn't require real time until someone decides to do it and changes the expectation.
And I think companies like
Grubhub or Netflix
or YouTube content
recommendations or any of
these other things have changed
the expectations
and
that, as a result, now either saves them money or generates revenue.
And one go-to use case for me is, I don't know about you guys, but I don't have a lot of loyalty to retailers around certain things. If I need a mop, I don't care where I buy it. I care that it's in stock, and I care who can get it to me fastest. And if that's hypothetically Amazon, Walmart, or Target, you know what I mean, I will get it from either one of them. So I care about inventory being up to date. I care about who has the lowest price. And all of these things are responsive to inventory arriving at a loading dock, or to dynamic pricing logic that adjusts prices based on competitive sales, and those kinds of things. So my argument is everything's real time, either in potentia or, you know, something that winds up being real time because a competitor has driven it in that direction. And, you know, I'm sort of interested if you guys agree with that or not, but that's my take on the world.
Yeah, that's great. Do you agree?
Yeah, I do agree.
I mean, I think there is a reason we have all this infrastructure out there and all this technology being built.
I don't think it's just because, like, you know,
geeks want to, you know, like, have the equivalent of a fast car.
They build Kafka, right?
Like, something similar. But at the
same time, yeah, I think the problem with real time is that real time is a very relative term, and the semantics vary a lot. So if you ask a marketer what real time is, and you ask someone who is responsible for fraud detection, you're probably going to get a bit of a difference.
Yes.
Not only in the definition of what real-time is, but also in the importance of real-time, right?
Yeah, like if my campaign, let's say, runs like five minutes later, eh, okay.
Probably nothing will happen.
Although I will probably be frustrated because I have to, right?
But if someone gets, I don't know, like a report on fraud a day after, that's not fun, right?
But let's talk a little bit more about like the technology, right?
We, I remember like some of the first like real-time processing,
let's say pieces of infrastructure that came out of Twitter.
It was a definition of real-time, right?
Yep.
Back then.
You had technologies like Samza.
What was the name of the Twitter?
They had a platform.
Yeah, so LinkedIn had Samza.
Twitter had initially Storm.
Storm, yeah.
And then they built Heron, which was another one.
And then there was Spark streaming
that came out of the Spark ecosystem and Apache Flink.
And so there's been a couple of these things
that have grown up over time.
Yeah.
And I want to talk about this
and also compare it with something like Kafka
to understand what's the difference between Kafka and a system like Flink or Samza, right?
Yeah, absolutely. So I think, like, let's pick apart sort of Kafka just for a second. There are really four main components or projects that people talk about when they talk about Kafka, maybe even five. One is the actual messaging broker itself, right? And that's the part that I think of as, like, Kafka.
Then there's KStreams, which was the Java library for actually doing stream processing.
And KSQL, which was the SQL interface built on top of KStream. Then there's Kafka Connect,
which was the connector layer. And then
there was the schema registry. You know, some of these things are under Apache licenses. Some of
these things are under the Confluent Community License, if I remember correctly. But, you know,
so when I think about Kafka, I think of the broker proper, which is really just about
PubSub messaging or eventing, you know, so really just about the transport of data, with no real processing capabilities beyond moving it from A to B.
And so I think that the processing systems that we're talking about, Storm, Samza, KStreams, KSQL, Flink, which is the one that I'm probably most familiar with.
That's what we're based on at Decodable.
You know, various other systems like that run on top of those Kafka topics, right?
And many of them support not just Kafka, but Kafka-like systems, including some of the cloud provider stuff like Kinesis and GCP PubSub and those kinds of things.
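A minimal sketch of that distinction in Python, using the kafka-python client (the broker address, topic, and event fields are hypothetical): the broker durably moves events from a producer to a consumer, but any processing beyond that is left entirely to application code.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Move an event from A (the producer) ...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "url": "/pricing"})
producer.flush()

# ... to B (the consumer). The broker gives durable, ordered delivery per
# partition, but no aggregation, windowing, or state management of its own.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # any "stream processing" here is hand-written
    break
```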
Okay, we have like the processing, right?
I would argue, let's say, and let's forget KSQL, KStreams and all that stuff.
Okay, I have a producer, I have a consumer, I can write business logic there,
I can do processing on top of Kafka.
What's the difference between that and having a system like Flink?
Yeah. So in general, you could argue that anything that writes to a Kafka topic or reads
is effectively doing stream processing at some level. It might just be doing minimal
transformation. It might be doing sophisticated transformation, those kinds of things. I think that the difference is really, like the stream processing frameworks are just that.
They're frameworks, right?
So they're going to give you a bunch of capabilities, including an execution engine, typically, that's optimized and sort of understands things like predicate analysis and aggregation operations and window functions and all these other kinds of things. They typically also understand schema and event serialization and deserialization. They typically understand state management: where am I in the stream? What happens when I fail, and how do I recover to achieve either at-least-once or exactly-once processing of data, you know, getting rid of duplicates, those kinds of things, or not producing them to begin with? And also some higher-order concepts like a notion of event time and watermarking and all of these other sort of more sophisticated things that help achieve correctness in processing data. So,
you know, in that sense, you should think about stream processing systems the same way you would think about a database in the sense that,
not that they necessarily work the same way, but that rather than just have files on disk and like reinvent Postgres on top of that,
it behooves you to take advantage of the fact
that people have put in a lot of work
to get the correctness and the processing
and those kinds of stuff.
Does that make sense?
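As a rough, hedged illustration of what a framework provides here, a PyFlink Table API sketch (not Decodable's implementation; the topic, fields, and connection settings are hypothetical, and the Flink Kafka SQL connector jar is assumed to be available): the engine, not the application, handles event time, watermarks, windowed aggregation, and the state needed to recover correctly after failures.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of JSON order events, with event time and a watermark
# declared so the engine can handle late data correctly.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# A one-minute tumbling-window aggregate; the framework manages the window
# state and recovers it on failure instead of us tracking offsets by hand.
result = t_env.sql_query("""
    SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           COUNT(*)    AS order_count,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE)
""")
result.execute().print()
```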
It absolutely makes sense.
But I have a follow-up question on that.
The way that you describe it is how I visualize it.
I have data in motion, and I'm applying aggregation,
any kind of data processing, as the data is still in motion.
A couple of years ago, let's say after 2015 or so, we started hearing a lot about the concept of ELT instead of ETL. Because what you are describing sounds more like ETL, right?
You extract the data.
Somehow the data is, like, still in motion.
Like, I transform the data, and then, like, I'm going, like, to do something with whatever I produce from there, right?
Yeah.
But then we had, like, this whole concept of, like, you don't have to do that anymore.
Let's just, like, extract and load the data, and after the data is loaded, you can go and, with extreme scale, process the data. Okay, assuming, let's say, I have Kafka there, the latencies are low; theoretically at least, I can get close, let's say, to real time. And in some cases, let's say I have something like Pinot or ClickHouse, I can have real time.
Yes.
Okay. So what's the difference there? Why do we still need to have these complicated systems? Because they are complicated, right? Like, Samza is not, like, the easiest thing to go and operate, and to do this processing in motion.
Yeah.
I mean, this is a really interesting question.
And I think it's a philosophical debate.
So, you know, you're right if you look at this through the lens of being, for instance, like a Snowflake user.
Like from your perspective, you have many sources of data.
You want to get them into Snowflake.
You want to do your processing there.
And why on earth would you ever do any kind of transformation?
So a couple of things.
One that comes up, and a lot of the quote-unquote ELT tools do this under the hood,
they are actually doing things like mapping data types.
They are doing processing, but it's de minimis processing.
It's not business logic processing. And the way someone explained it to me is that the thing about ELT is not that it doesn't do transformation, it does; it's that the majority of the business logic is pushed to the target system. And that definition made sense to me. So it's actually E-T-L-T.
Yes.
Right?
You know, there's two Ts in there, which is okay.
When it becomes interesting, so a couple things.
One, you're actually using your costly CPU to do the processing if you do that.
You know, there's latency characteristics and those kinds of things. But I actually think that the more interesting angle on this is that if you zoom out
and you think about other places that data wants to go,
you start to go like, okay, so it's going to go to S3.
It's also going to go to Snowflake.
It's also going to go to ClickHouse or Pinot or Rockset or whatever,
wherever it's going to wind up going, Druid.
It's also going to go back into operational systems like Elasticsearch, so that you can provide online product search, or Algolia, or whatever people are using these days. It might also get cooked in various ways and go to a bunch of microservices. And so it's not so much that you want to push all of your business logic in the world into the stream; it's that you want to have the capabilities to do impedance matching between all those systems. Some of them aren't allowed to have PII data, some of them don't want certain records, some of them need quality fixed before it lands in those systems where you can't do updates and mutations and those kinds of things. And so I would think about stream processing the way you think about, I use networking as an example, packet mangling on a network. You know, stream processing is the equivalent of your load balancer, right? It allows you to do some amount of processing before the packets land in the target system.
And I think when you think about it from a holistic perspective, you kind of go like, oh, then it actually makes sense because you're not tightly binding the schemas and the structure of the data between the source system and the target system. And one of the biggest challenges that I hear is that if you're doing ELT into a system
like Snowflake and somebody makes a schema incompatible change, you've broken your target
system.
And you're very tightly coupled to those operational systems.
So I think that when you start talking about data contracts and larger organizations, like
being able to do these things and pave over those problems, I think stream processing
is one way you can start to cut into that.
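To make the impedance-matching idea concrete, here is a toy Python sketch; the field names and destination rules are invented, and in practice this logic would live in a stream processing job rather than in application code like this. The same upstream event gets a PII-stripped cut for the warehouse and a filtered, trimmed cut for the search index.

```python
from typing import Optional

def cut_for_warehouse(event: dict) -> dict:
    # The analytics warehouse gets everything except raw PII.
    clean = dict(event)
    clean.pop("email", None)
    clean["email_domain"] = event.get("email", "").split("@")[-1]
    return clean

def cut_for_search(event: dict) -> Optional[dict]:
    # The product search index only wants in-stock items, and only a few fields.
    if event.get("quantity", 0) <= 0:
        return None
    return {"sku": event["sku"], "title": event["title"], "price": event["price"]}

event = {"sku": "A1", "title": "Mop", "price": 19.99,
         "quantity": 3, "email": "buyer@example.com"}
print(cut_for_warehouse(event))
print(cut_for_search(event))
```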
Yeah, 100%.
And I think I asked this question not because I disagree; like, I totally agree with you.
And I also get, like, why people might wonder about these things.
Sure.
And I think there's always like a huge gap between what theoretically can be achieved and what like in practice is happening.
Right.
And usually that's okay.
That's where engineering comes from.
Right.
Like that's why we need engineering.
That's why we engineer these systems.
Right.
Because there are always, like, trade-offs.
And each one of them has, like, unique trade-offs.
Like, yeah, sure.
Why not just use only ClickHouse, right?
And do everything.
Theoretically, you should be able to do it.
Yeah.
Have you tried, like, to do a lot of joins there, for example? Or how is it to change the schema on something like Pinot?
There are always trade-offs, and that's why there's, at the end,
wisdom in the industry.
It's not like these things exist just because crazy VCs and founders want, like, to push their agendas and build stuff.
That's what I'm usually saying.
But I would like to go one step before that, the processing.
Because I know that another important component of Decodable has to do with CDC.
And CDC is one of these interesting things that everyone kind of talks about it, says
it's important, it's a very good idea, all these things.
At the same time, if you think about it, outside of Debezium, I don't think there's any other mature framework, at least, to attach to an operational database, like an OLTP database, and turn it into a stream, right?
I think not in the open source world, certainly. There are a bunch of commercial systems that have been around for a very long time in sort of various forms. You know, I think GoldenGate is probably one of the more well-known.
HVR, which was acquired by Fivetran, does this kind of thing.
So there's those kinds of things.
But I think in the open source world,
I actually don't have a great sense of like Airbyte adoption these days.
And I think Airbyte actually uses Debezium.
It does.
Yeah.
Okay.
So Debezium is the one that I know best, and that we know best at Decodable, because again, we're based on parts of that. But I think you're right. I think, you know, one thing that is interesting is not just about
lower latency to getting the changes,
but there's this whole host of applications,
especially like on the operational side of the business
versus the analytical side of the business
that can use change data capture data
as effectively triggers
to like kick off a bunch of really interesting stuff.
You know, we were talking earlier about inventory getting updated; maybe you want to make only things that are in stock searchable.
Yeah.
You know, and you want to play with search relevance, you know, for instance for an e-commerce site, based on inventory. So that's the kind of thing. Or marketing campaigns. When PlayStation 5s come back in stock,
I want to alert everybody who has one on their wish list.
Right?
Like, those are the kinds of things that I think we can enable with CDC
beyond just database replication, which is a core use case, of course.
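A hedged sketch of that trigger pattern in Python, assuming an event shaped roughly like a Debezium change envelope (before/after/op); the table fields and the downstream helpers are hypothetical.

```python
def notify_wishlist_subscribers(sku: str) -> None:
    # Hypothetical helper: alert everyone with this item on their wish list.
    print(f"alerting wish-list subscribers: {sku} is back in stock")

def mark_searchable(sku: str, searchable: bool) -> None:
    # Hypothetical helper: toggle visibility in the product search index.
    print(f"search index: {sku} searchable={searchable}")

def handle_inventory_change(change: dict) -> None:
    before = change.get("before") or {}
    after = change.get("after") or {}
    # Only react when an update takes an item from out-of-stock to in-stock.
    if (change.get("op") == "u"
            and before.get("quantity", 0) == 0
            and after.get("quantity", 0) > 0):
        notify_wishlist_subscribers(after["sku"])
        mark_searchable(after["sku"], searchable=True)

handle_inventory_change({
    "op": "u",
    "before": {"sku": "PS5", "quantity": 0},
    "after":  {"sku": "PS5", "quantity": 12},
})
```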
Yeah.
Why do you think, like, we haven't seen more open source projects around CDC?
Because it's really hard.
Because every database system implements exposure of the bin log and the transaction log a different way.
And some of them don't have, there really hasn't been a single good way of exposing this. So Postgres, MySQL, Oracle, Mongo,
they all have like just different database level
specific substrate, you know,
substrates for those kinds of things.
And I think it just takes a special kind of person
to commit themselves to going and solving
that kind of problem.
You know, we are very lucky to have,
you know, Gunnar Morling, who was the project lead for Debezium at Red Hat for a long time, at Decodable. So, like, Gunnar spends a lot of time thinking about these kinds of things, to his credit. But it's, I don't want to say it's thankless, because I think people appreciate it, but it is really hard, you know, it is really hard.
Yeah, it makes sense. And like, one thing that I always found interesting,
both in good and in a bad way, about Debezium, is its, let's say, attachment to Kafka, right? It is a
project that, I mean, technically, like, you have, in a way, like, to run it with Kafka Connect, at least.
Like, the moment you decide, like, to not have Kafka there, you start being, like, very hacky with it, right? Do you see, and I'm asking you, not because, okay, obviously I'm not, like, a committer or anything like that on Debezium, but you work with it, right? Like, it is part of your stack. Do you see this changing, and also why? Like, why does it have to be so attached to Kafka?
Yeah, I mean, you're absolutely right. I mean,
there's multiple layers there like in the implementation and so like in even internally
inside of Decodable we wind up using Debezium without Kafka in certain places, like for certain use cases
more as a library
to access certain things
it's definitely tricky, you gotta know the
internals, quite frankly. And there is,
again, I'm not an expert in
what's happening in the community on this
so, excuse me, please take it with
a grain of salt, but my understanding
is that there's a long term feature request inside of the Debezium community to support running without Kafka there.
I think this is, like, a trap that open source projects fall into: there's always this, like, well, why don't we make it configurable thing, which explodes the complexity of these projects pretty significantly. You know, my sense is that the upside is you could potentially remove the Kafka dependency; the downside is that it only makes things more complicated. I mean, this is like a plug, but you know, one of the things that we focus on is just making Debezium less complicated, and Flink is part of that for us as well. So, like, if you don't know or care about Flink and Kafka and Debezium, we try and create a platform where you can define a connection to Postgres and get the result in Pinot or in Kafka or in Kinesis or in any other system that we wind up supporting there, without having to deal with the guts of this stuff.
So to some degree, that's the value or part of the value that we deliver there.
So next provocative question.
I heard you saying that you have three pieces of technology that you are using as part of Decodable,
like Debezium, Kafka, and Flink.
Each one of them, it's an operational nightmare.
Yes, that is not controversial.
I'll take that.
That is true.
Like, I had, like, whenever I had, like, to do with any of these,
okay, it wasn't fun.
Let's put it this way.
Like, you need, like, a very, like, I don't know, like,
a special type of person who enjoys working with these things.
Yes.
So, I'm scared.
Why would I come to Decodable when I know that, like, there's all this complexity there? Why would I do that?
I mean, that's what pays the bills at Decodable, is that, like, the reason people come to us is because they want the capabilities, but they don't want the operational overhead. And so, you know, Flink alone has a couple hundred configuration parameters, if I
remember correctly, yeah, it's sizable. Our goal is to, like, make that disappear. So, like, we try and offer what I think is the right user experience, which is largely serverless: you can give us a connection, you know, a bunch of connection information for your database,
or a SQL query,
and you don't have to know that it's Debezium and Flink
and all these other kinds of things under the hood.
If you do care, we give you the right trapdoors
to give us a Flink job if that's what you want.
if you don't want to give us SQL or something like that.
We'll handle that.
But it's funny, because there's just, like, this Goldilocks zone where, like, if it's so complicated, people don't want to adopt the technology at all, no matter how much a vendor paves over it. And if it's so easy, then no one needs us, right? So, like, obviously, you know, that said, I do think we always want to make it easier, and we do spend some time upstream trying to, like, you know, do some work there to make this stuff easier to use. But the reality is that all of the options, all of the, well, I don't want to use S3 as my state store, I want to use this other thing. And all that pluggability, all that optionality
makes it
more like a toolkit
for stream processing and less like
a solution
for stream processing. And so
you know,
there's value in that, but that
cuts both ways, right?
And so, I don't know. I'm biased, but I like to think that we solve this problem for people.
But you're right.
I mean, it's a real concern,
the complexity of any disaggregated system.
I think there's been some good discussions
about disaggregation and like the modern data stack
and those kinds of things.
It generates complexity.
Yeah, 100%, 100%.
And, okay, I know there have been also some very interesting announcements
about the product lately.
You mentioned the modern data stack, and I know that one of these has to do with dbt.
So would you like to share a little bit more about some interesting things
that are happening with the product right now?
Yeah.
The two kinds of users that we see at Decodable are data engineers, you know, who are ingesting data or sort of, like, making it ready for ML pipelines and analytics, stuff like that, and then application developers, who are building these more, like, online applications, you know, real-time applications. Same underlying tech stack. So for the data engineers, you know, out
out there, what we wanted to do was allow people who know Snowflake, DBT, and Airflow to be
productive stream processing people without having to take on the Debezium, Kafka, Flink stack.
And so for them, we announced earlier today support for a DBT adapter. We now support
DBT. You can use DBT to build your stream processing jobs in SQL with the same tool set
and the workflow that you know. And the other thing that we're super excited about that we
announced today is support, first-class support for Snowflake's Snowpipe Streaming API.
Now, without spinning up a warehouse,
you can ingest data in real time into Snowflake
with no S3 bucket configured, with no SNS queues,
with no IAM policy stuff.
Just tell us what the data warehouse is,
the data warehouse name, and we will ingest.
And it turns out that Snowflake has made this incredibly cost-effective. So you're not paying for warehouse time; there's a small amount of money, you know, that you end up paying in terms of credits, but it is substantially more cost-effective to ingest data into Snowflake, and it shows up in real time, which is incredibly interesting.
So when you say real time, because the last time that I worked with Snowpipe,
I think the end-to-end, and when I say end-to-end,
I mean from the moment that the event hits Snowpipe
to when you see it on the view, when it gets materialized inside your data warehouse.
We're talking about a span of three minutes, two minutes.
Is this something that has changed with Snowflake?
Yeah.
So this is what the Snowpipe streaming API does.
You're actually writing.
I don't know the implementation.
My understanding is that you're basically writing into Snowflake's internal formats there.
So you're skipping a lot of the batch load steps.
And so we've seen on the order of seconds
and even less than a second, I think.
So you can actually run select statements and watch records change.
It's incredible.
Yeah, because I think the previous implementation,
the first implementation of Snowpipe, was
more of a micro-batching architecture, but was still using S3 under the hood to stage
the data.
But it was obviously optimized in a way to reduce the latency there as much as possible.
But again, when you have object storage in there, you add another layer of latency, you cannot avoid that. So that's very interesting,
that's something I should definitely also check for myself. I knew that Snowflake was working on
the streaming capabilities that they had, so it's very interesting to see what they've done. And
I'm also looking forward to see what Spark and Databricks are going to be doing on that.
Because I think Spark streaming, it shows its age.
Sorry to say that, but I don't think that anyone really loves working with Spark Streaming.
It's very inflexible. It's really hard.
So I'm very curious to see what the response is going to be there.
Yeah. I, you know, I've had the privilege of working with some of the people who are now working on Spark Streaming, Structured Streaming, and that was hard for me to say, Structured Streaming, and those kinds of things. I'll say this, I mean, I won't claim to be an expert on the internals of Spark. I'll say that they have really smart people working on this. You know, we of course are super biased; we think that Flink has effectively won, you know, out of, like, all of the open source projects, it has the most robust and sort of battle-tested stream processing engine. But I'm interested to see what the team at Databricks does. They have a fantastic team over there, and my guess is that it's actually going to be very hard to make the kinds of changes that I think they need to make without breaking anyone who's already using it today. And I think that's gonna create a challenge for, you know, for them.
Yeah, definitely. I think
it's going to be interesting because like I think it also touches like the
fundamental concepts behind like Spark itself on how like it has the guarantees
it has with like micro-batching and like all that stuff. So it will be very interesting
to see, like, what they come up with. But they will definitely come up with something. Like, for example, the Auto Loader features that they have on Databricks, it's pretty good, actually. Okay, it doesn't have, like, the simplicity that Snowflake has, but at the same time it's very robust in both performance and also the capabilities that it gives you.
And, okay, Databricks will always be a more configurable product than Snowflake.
They have a completely different product thesis in terms of the user experience, which makes
total sense.
One other thing I wanted to ask you is about data lakes specifically.
And there are two reasons for that.
One is because streaming data and data lakes,
they make a ton of sense together
because usually streams provide the volume of data
that makes having the data lake viable.
That's one of the reasons.
The other reason is because all the table formats,
like Delta, for example, and Iceberg is also working on that.
I'm not so sure about Hudi, but I'm pretty sure they also have like something similar.
There is a concept of CDC there, right?
They propagate changes when you do, like, something with a table, and you can have, like, a feed to listen to these changes, which is kind of interesting to see happening in systems that are supposed to be more slow-moving, right?
By definition, in a way.
So what's your experience so far in terms of stream processing and data lakes,
both from consuming and pushing data into them?
Yeah, you know, I think
it's funny, I mean, we use three words: continuous, streaming, and real time. And I actually think that
continuous processing
is what
Hudi and
Iceberg and Delta Lake are, and
I think Delta Live Tables specifically
with Databricks are all trending towards.
And I actually think that's a positive thing, right?
It's really about the propagation of change throughout the dependencies, like the downstream processes.
I think that, like, on the whole, this will increasingly remove the need for sort of out-of-band processing on a lot of these kinds of things, which is, I think, a net positive. You know, and I think anything that simplifies the lives of data engineers is a good thing; it's just, like, way too complicated to do relatively simple stuff. I think continuous processing and better primitives for continuous processing are going to be seen in the data lake. But I agree that these things are compatible, because again, I view the data lake as one destination for streaming data. You know, and again, that's my bias, because a lot of our customers have these online systems as well that need different cuts of data and stuff like that. So I think that this is actually a natural continuum of this change-based
or continuous change-based processing that can now extend into the data lake,
which historically has been immutable, which has always been complicated,
even as far back as the Hadoop days with HDFS and stuff like that.
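As one concrete instance of the change-feed idea in table formats, here is a hedged PySpark sketch using Delta Lake's Change Data Feed; it assumes a Delta-enabled Spark session, a hypothetical table named inventory, and that the change data feed has been enabled on that table.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cdf-reader").getOrCreate()

# Read row-level changes committed since version 0; each row carries
# _change_type, _commit_version, and _commit_timestamp metadata columns.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("inventory")
)

# For example, keep only the post-update images of changed rows.
changes.filter(F.col("_change_type") == "update_postimage").show()
```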
Yeah, 100%.
All right, and one last question from...
So we talked at the beginning of our conversation
about the history behind, like,
the streaming processing frameworks, right?
And they go back, like, pretty much, like,
the same, like, the Hadoop era, right?
Now, since then, and, like, this whole, like,
big data movement, we've seen companies IPO-ing.
Kafka became Confluent, which IPO'd.
Okay, Databricks, let's say they IPO-ed. They haven't, but let's say they did.
They're on their way.
Yeah. We have Snowflake. There's a lot of, let's say, value created from data-related technologies.
But we haven't really seen any of these streaming processing frameworks creating a company that is like the Snowflake or the Kafka or the Confluent of companies out there.
Why do you think this is the case?
Especially after seeing the industry investing so much in them, right?
Like because we are talking about like super complex distributed systems that someone has to build and there are like many attempts on them
It's not just Flink
Yeah, I think it's a really good question. I think you know a couple of things
Well, one, to their credit, I would say that Confluent has done some of this, right? Like, even though we probably overlap with them a little bit, you know, we'd like to partner with them, I think, more closely than we do sometimes. But, you know, I think to their credit, they are probably the closest thing to a publicly traded company that is based on that kind of capability, but not a pure stream
processing kind of solution.
You're absolutely right.
I think a couple of things.
One, I really think that the use cases have finally caught up to the technology.
I think in a lot of cases, even back in 2015, people weren't as bullish on what they could get out of it. I don't think that people fully understood what they could get out of lower-latency, you know, sort of higher-throughput data on the processing side. I think people are starting to get that now. When you see, if you're a logistics company, you see Grubhub or you see FedEx, you start to get it.
So I think that's been part of it. I also think that the tooling was nowhere near as mature and as sophisticated as it is today.
We talked about a bunch of different systems.
I think each generation of at least the open source stream processing systems, excuse me,
have gotten like incrementally higher throughput, higher performance,
but also just like more stable, more correct under failures,
easier to reason about.
And, you know, quite frankly, got SQL support.
And I think like as much as we malign SQL,
like, people know it and it works and people get it. And I think that, like, gaining SQL support is actually a big accelerator for stream processing in general.
That's super cool. Now, okay, one more question. I have to ask that too, sorry, like, I just thought about it. So yeah, sorry, sorry, like, I can't not do that.
So talking about streaming processing, right?
We also have a new family of technologies based on timely data flow
and the data flow family of processing out there.
Materialize is one of them, but there are more.
What's your take on that and how this is different
compared to something like
Flink?
Yeah, I mean, I think the answer is that there's a lot of differences in implementation, for sure. Timely dataflow and, like, what the Materialize team have done there, I mean,
it's actually really interesting and exciting technology I think that they are tackling the next, at least, you know,
from my vantage point, they're trying to make streaming more intuitive by attacking some of
the consistency stuff that like tends to crop up in stream processing. I think it's really
interesting. I think that tech is probably less, and again, I'm biased, I sort of have to put that
on the table as a disclaimer every time I start to say something. But I do think it's less mature
than things like Flink and all these other kinds of things. So warts and all, you know, I think that
like, you know, Flink is incredibly robust and it's sort of well understood in these kinds of cases.
But I think anything that pushes stream processing forward is a good thing. And so differential dataflow and timely dataflow are exciting projects. I think it'll be really interesting to see what Materialize and that generation of companies do with that technology. I still think it has a little bit of a ways to go. But, like, you know, I think that's a place where I'm sure Frank over at Materialize would disagree with me,
so it's like an interesting conversation to have
and in fact here at the conference
there is a presentation on timely dataflow,
so it'll be interesting
Yep, 100%
Alright, so that's all from me for today
so we should conclude before I come up with more questions.
These are great.
I love these questions.
We have gotten past the buzzer, I think, Kostas,
because you have so many great questions.
But Eric, before we sign off here,
if folks listening want to find out more about Decodable,
where should they go?
Yeah, they should go to decodable.co
and they can sign up for a free account,
get started right away.
There's a free tier there
that allows people to get up and running
with both Flink APIs as well as SQL.
Awesome.
Eric, thanks so much for coming on.
Guys, thank you so much for having me.
It was a real pleasure.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.