The Data Stack Show - 91: The Future of Streaming Data with Stripe, Deephaven, Materialize, and Benthos
Episode Date: June 15, 2022. Highlights from this week’s conversation include: How we should think about batch versus streaming (6:02); Defining “streaming ETL” (9:34); A brief history of streaming processing platforms (22:07); The birth and evolution of Benthos (28:41); What led Jeff to build a new tool (34:29); Why you shouldn’t share all the data (37:23); Making streaming technologies approachable to engineers (42:09); Breaking out of traditional terminology (52:58). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
I love our panel discussions because they bring so many different perspectives.
And in the live stream today, the panel is full of streaming experts. So we have people who have
built streaming projects, open source, multiple people, actually, someone who's built streaming
infrastructure at Stripe and Netflix, and then several really cool streaming tools, the founders of these streaming tools, which is super cool.
Here's the burning question that I want to ask. Streaming is becoming increasingly popular and
the technology around it has grown significantly, right? So if you think, you know, sort of beyond
Kafka, right? Kafka is kind of a de facto, right? But there are so many new technologies that emerge.
And so what I want to ask these people
is how we think about the difference
between batch versus streaming.
Is that even a helpful paradigm,
you know, as we look out into the future,
you know, of how data should flow through a system?
So that's what I'm going to ask. What about you?
Yeah, actually, I want to ask,
like, what the title of this panel is, what's the future
of streaming?
The industry has been producing and developing a large number of streaming
platforms for the past decade or so.
So what's next?
What do we need to build, and what kind of use cases are we addressing that we couldn't
address efficiently
in the past or in the present?
Because these guys are building some amazing new tools, bringing some
new paradigms to how we do stream processing.
And I'd love to learn the why and the how, and how this connects with the previous
generation of stream processing platforms.
Well, let's do it.
Welcome to the Data Stack Show.
I have been so excited about having this group of panelists
because we had such fun conversations with all of you about streaming.
And today we get to talk about all sorts of stuff as it relates to streaming,
the future of streaming, etc.
So before we get going, let's just do some
quick intros for those people who have not listened to the episodes with all of you. So
I'll just call your name out and go in the order that you are in the Zoom box. So Jeff,
do you want to kick us off? Thanks. Hey, everybody. Good morning. Good afternoon.
I'm Jeff. I'm an engineer at Stripe. I work on stream processing systems. Right now I lead change data capture at Stripe.
Before Stripe, I worked at Netflix for a number of years
where I also worked on stream processing systems,
worked on an open source project called Mantis,
and also worked on other infrastructure
such as Splunk and Kafka.
Awesome, thanks so much.
Pete.
Hi everybody, I'm Pete Goddard,
the CEO of Deephaven Data Labs.
My history is in the capital markets.
So think of Wall Street meets systems meets, you know,
quantitative math and things like that.
I run Deephaven, which is an open source query engine
that's built from the ground up to be very good with real-time data
on its own and in conjunction with historical or static stuff.
It's also a series of APIs and integrations and experiences that create a framework because
working with real-time data is pretty young. So we want people to be productive. Nice to
be here today. Thanks so much. All right, Arjun. Hi, everyone. I'm Arjun Narayan. I'm the co-founder
and CEO of Materialize. Materialize is a streaming
database that allows you to build and deploy streaming applications and analytics with
just standard SQL. Materialize is built on top of an open source stream processor called
Timely Dataflow. I bring a SQL partisan perspective to this panel. Before this, I was an engineer at
Cockroach Labs, working on CockroachDB, which is a scalable, scale-out SQL OLTP database. And before
that, I did a PhD in distributed systems, being in the distributed systems and database world for
quite a while, and excited to be here. Awesome. Love that you're getting the bias out up front. It's great, a SQL partisan.
I think people needed the disclosure early on before I got fiery.
Love it. All right, Ashley.
Hi, everybody. I'm Ashley. I work full-time on an open source streaming service, which I think we'll call Benthos today. And that includes writing code, pull requests,
building emojis and mascots. And I've been doing that for about five years. And before that,
my life had no meaning. Emojis are the most important part of the project.
Pretty much. That's where you know your project is a success.
That's exactly right. The swag is actually great. Well, Ashley, I actually want to come back to something you said: streaming ETL, which is a really interesting phrase. And so what I want to start off with is a discussion on how we should think about batch versus streaming. There's a lot of new streaming technology out there and people are working on
all sorts of different data flows.
And sort of the standard stack has a lot of batch
and has maybe implemented
some sort of streaming capabilities.
But I'd love to hear from each of you,
your perspective on like how to think about
batch versus streaming sort of in this new world
where we have a lot of streaming tools.
And Jeff, I think I'd actually like to start with you because you've built a lot of these systems at large companies and you're sort of a practitioner doing the work inside of a company.
So what does that look like for you in the work that you're doing?
Yeah, it's come a long way, but it's still quite nascent, it being stream processing.
For me, I like to segment it into two parts.
One is the developer experience, and then the other is the user experience.
So in terms of developer experience, people are sort of converging around this SQL-like
tooling.
There's also declarative tooling as well, the tooling to get you started on writing
stream processing applications, and then deploying them
out, maintaining, observing them, etc. So that's still got quite a ways to go. It's certainly a
lot better than it has been in the past, especially compared to batch, given that batch has had more
community involvement in terms of bringing that up in the open source and improving on the tooling.
The second part is user experience. So from the user's perspective, they just want to
access their data, whether it's from streaming or batch worlds, whether they're doing streaming or
batch computations. The challenges are in that the semantics are a bit different. And so the
output or the behaviors of streaming applications might not necessarily be intuitive to the end
user.
So for example, in streaming, things happen in more of a real-time fashion,
so windowing comes into play, whereas in batch,
the windows are much larger.
That plays into the last part of my thinking, which is I think
about it in terms of use cases.
So specifically with SLOs as requirements for these use cases, SLOs being freshness,
correctness, coverage, and throughput.
And then so as an example at Stripe,
I work in the finance, the payment space.
So correctness is pretty important.
Freshness, less so originally,
but it became more important because expectations only go up.
So originally Stripe had a lot of systems in batch,
but we were recomputing the world every day or every so often.
And it just got very expensive.
So then you try to go the angle of reducing the cost for that use case.
After the cost is, OK, well, we want things quicker.
Expectations only go up.
So we want more freshness.
Oh, but the freshness is there.
But now we want things to be correct.
And so now we're doing that.
And Netflix is a little bit different.
I worked on a system called Mantis, where we were basically trying to make sure that
the systems, the streaming systems could stay up when you hit play, that it works and it
works well, the bit rate is there as it should be.
And so Mantis took a different angle:
it used a SQL-like user experience,
but made a certain set of trade-offs that we can go into another time or later
to make sure that we can compute all of these streams
in a very cost-effective manner
without losing the ability to gather insights.
And Arjun knows exactly what I'm talking about here
with materialize.
So that's it.
So developer experience, user experience,
use cases, use cases led by SLOs,
which is throughput, correctness,
freshness, and coverage.
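To make the four SLOs Jeff lists concrete, here is a minimal Python sketch of how they might be expressed and checked for a pipeline. The names, fields, and thresholds are illustrative assumptions, not anything from Stripe's actual systems.

```python
from dataclasses import dataclass
import time

@dataclass
class PipelineSLO:
    """Illustrative SLO targets for a streaming pipeline."""
    freshness_seconds: float  # max acceptable event-to-output lag
    correctness_pct: float    # outputs matching an authoritative recomputation
    coverage_pct: float       # share of input events represented downstream
    throughput_eps: float     # minimum sustained events per second

def freshness_ok(slo: PipelineSLO, last_output_event_ts: float) -> bool:
    """True if the newest processed event is within the freshness budget."""
    return (time.time() - last_output_event_ts) <= slo.freshness_seconds

# Hypothetical targets: a payments use case weights correctness highest,
# while an observability use case would push freshness_seconds way down.
payments_slo = PipelineSLO(freshness_seconds=300, correctness_pct=99.99,
                           coverage_pct=100.0, throughput_eps=10_000)
print(freshness_ok(payments_slo, time.time() - 60))  # True: 60s lag < 300s
```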
Yeah, totally.
Well, I love the perspective of the user demand, right?
And sort of streaming being driven by user demand
for data that is more real time,
which is interesting, right? And that's sort
of a self-fulfilling cycle where you get data faster and then you want it faster and then you
want it faster, which is interesting. Ashley, let's jump to you. Do you want to explain what
you mean by streaming ETL? And I think for a lot of people, those are two separate terms,
right? As they just think about their practical day-to-day work, right? Well, ETL handles,
you know, we have ETL jobs that run our batch stuff and then we have streaming stuff,
but you put the two terms together. Can you dig into why you did that?
Yeah. So when I, so my introduction to data engineering was streaming because we had data
volumes that were too big to run as a batch. We were basically selling that data on and it needed to be dealt with
continuously. So to do that at the levels that we had, in order to execute that as some batch job,
we would need a stupid amount of machines to be running it. So it was kind of like a defensive
move, which I think is why some people early on resorted to streaming. And it was about that time that Kafka sort of arrived
and we started seeing that in the wild.
But the main takeaway: imagine an ETL job run as a batch.
It is a process that I would describe as one where you can have humans in the loop.
So it's something that runs at some cadence and has an execution time
that means it's realistic
to have a human watching it happen and maybe even able to intervene.
And it's therefore something that the tooling doesn't necessarily need to be that reliable.
It doesn't need to be as self-fixing, self-correcting.
Whereas when you're in a space where the data is continuously moving and it's moving at a volume that you can't deal with in a batch context anymore, you don't have
the resources for dealing with a backlog that's so large. A day's worth of data now is a massive
cost to the business to process offline. Now you're in a position where you can't always
have a human watching it happen. And if something does occur that is an issue, they're not going to be able to deal with it in a way that doesn't cost the business a huge amount of money.
So the requirement for the tooling to deal with that problem is now, it's much more important that it's able to keep that process running for as long as possible with as little intervention as possible.
And when things do go wrong, it needs to be self-fixing to an extent where it's not going to take you a silly amount of time. And I think what's kind of
happened over time is that there were people like in my position who had to do streaming and they
basically had to reinvent all these ETL tools because that's all we were doing. We were just
taking data, transforming it, enrichments, that kind of thing, filter, maybe a bit of mapping,
that kind of stuff. And then we're just writing it somewhere else. So that is an ETL job, which traditionally,
you wouldn't have so much data that you can just do that once a day and it would go for an hour,
all that kind of thing. But then we were in a position where that wouldn't work. So we had
to reinvent all these tools. And essentially in the process of doing that, which I would imagine
my colleagues here are familiar with, is this idea that you have to make your tool that much
better at dealing with
various edge cases and problems because you can't expect a human to be able to just come in
and fix everything for you. And I think one of the repercussions of that is because the tools
are becoming so much more autonomous and able to deal with all those problems. People who have
batch workloads, they don't necessarily need to do things in a streaming way. They can now look at
these tools and go, oh, that looks a lot easier. That actually looks like I have to do less stuff
in order to have that do my workload. So I kind of call it a streaming ETL because I see a lot
of people using Benthos who, they have a batch workload, right? They don't have a volume of
data that requires continuous real-time flow. They'd be quite happy with just an hour-long job run daily or even
weekly. And they're choosing to use Benthos just because they know that they can just leave
it running. And if there's a problem, it'll fix itself. It's not going to require some sort of
intervention. And they don't need to think, oh, have we checked that it executed today and check
the status of that. They just know that they'll get alerts if something's gone wrong. And if something has gone wrong, they probably just need to restart it or add more
nodes, that kind of thing. So I kind of feel like it's, I call it that because I don't want people
to look at a streaming project and think, oh, well, that's not for me because I don't have
real-time payloads. I don't have requirements that would necessitate something that complicated.
Because really at the end of
the day, it ends up becoming operationally simpler in a lot of ways than some sort of large batch
workflow tool. Super helpful perspective. Okay, Arjun, I want to jump to you and I'm going to
add a modification to the question. So with that context where you don't necessarily have to have
a streaming service running something because the basic parameters or sort of SLAs around batch are fine,
but it simplifies a lot of things.
Would love for you to speak to that,
but also with the added question of like,
do you think there's a future where batch kind of goes away
because of those techniques, right?
The tooling gets way better
and it's actually just easier to have continuous streams.
So I think it's worth thinking about the use case carefully in that there's a big difference between human in the loop versus
no human in the loop. So when there's a human in the loop, there is a much higher latency that we
can tolerate because there's only so much, there's only so frequently that a human's going to like look at a dashboard or react to a new data point, right? And humans
go home, they sleep, you can run the batch job then. There's a big difference in a phase shift
that comes about when you start doing things in an automated fashion, where powering automated
actions off of batch workloads starts to immediately feel too slow because you have
the data, you know what the data point is, you've collected the events, say a customer
is on your website, they've done an action, and then it's going to take you something
like 10 hours for the ETL to crank through, for the batch workload to finish, and then
for you to take some action.
That immediately feels like an unacceptable amount of latency.
And that's oftentimes the big differentiator between streaming and batch.
It's when you're doing these automated actions.
And most companies get introduced to this workflow because they start to do email marketing.
So they have some data, they have some actions, they segment their users, they decide they
want to do some email marketing.
And that's actually fine in batch.
But you quickly realize the difference between taking action when somebody is on your website versus the next day.
I'm sure we've all had that experience where you were searching for a mattress, you go online, you spend about two hours, there's a two-hour window, you're trying to find the perfect mattress.
And then you find the perfect mattress, you hit checkout.
The next day you come back on the internet, it's like mattresses, mattresses, mattresses. It's like, it's too late, right? Like it's because
all those batch jobs just finished overnight. And then they decided you're in the market for
mattresses. And your perspective is like, I was, and I'm not buying another mattress for another
decade. Please go away. And then eventually, you know, they realized that the moment of this has
moved on. And it's these automated actions, this sort of personalization, this segmenting of the customer, these things where streaming really delivers outsized returns.
We founded Deephaven because we thought we saw the future. Our team comes from the capital markets. And unlike most of the people that are in this space,
who frankly came from a batch world and are now moving to streaming,
in the capital markets we know that the opposite is true. We know that ever since the capital markets went electronic,
in just the late 90s in Western Europe and in the US, you actually make all of your money in streaming. There is no such thing as trading an hour ago. There's no such
thing as I want to buy a stock yesterday. I can not do it. I can only react to things right now.
So all of Wall Street has been automated for a couple of decades around real-time data first.
And it's done that with, of course, you need context.
Of course, you need history.
Of course, you need state.
But that real-time streaming technology has been married to historical static batch data
sets as well.
I think the change in the last 10 years, and certainly the change that Deephaven is trying to be a part of, is how do you move that from being bespoke code that is written by very, very sophisticated developers and, frankly, the elite players in the tech groups across Wall Street?
How do you migrate it from being those custom solutions to instead being general purpose streaming technologies.
And I think many of the technologies that you guys are mentioning here are relevant.
There's the transport buses that obviously have become quite popular with Kafka-compatible stuff
and ZeroMQ and RabbitMQ and Solace and Chronicle Queue and all of these types of things.
All you're doing is you're really taking what we've been doing on Wall Street
for a long time with our custom solutions
and we're making it open source, general form, et cetera.
And we think that this is the future
for many other industries,
that it will be what all of you are saying,
which is, you know, what we agree with.
And that is, you know,
your real-time data is the most valuable stuff.
Whether real-time to you means a millisecond, real-time to you means 10 minutes, I don't care. Let's call
that real-time. And, you know, what Jeff said about that data being available to a user, the
same as any historical data, is entirely relevant. And so what we think is important is that the same
methods need to work on streaming and dynamic data as they do on batch
and static data. And that good technologies will make it so that the user doesn't have to really
care. They can just use the same stuff and it will work on both. It will all merge, it can all join,
and you can just get to work on data and stop thinking about streaming versus batch.
And that's the way Deephaven organizes itself.
Pete, if I may jump in,
I love the background that you just gave us, because you're absolutely right
that Wall Street has been doing streaming
before anybody else has been doing streaming.
I mean, if you look back,
before Kafka and RabbitMQ, there was TIBCO.
And then before any of these stream processing frameworks,
there was kdb+.
And these were sort of the pioneering ways in which to express
computation and move data around. But the flip side of it is that it required these extremely
specialized skill sets, right? Like you have all these banks where there were, you know, three
kdb programmers, and they were, you know, writing this almost hieroglyphic programming that nobody
can understand, and everything sort of runs
through them. And what I love about Deephaven is it's really bringing the best of, you know,
the modernized data science, Python, R communities, the communities that have not had access to
streaming. They have been living in a pure batch world and bringing the best of both worlds together. And I obviously bring a very different sort of set of backgrounds to streaming.
I mean, I've never been on Wall Street, never worked there, but I have the greatest respect, because
those guys really were first to implementing these technologies and pushing them to production.
And whenever we do interact with them, they bring a wealth of expertise and experience
and speak with much more technical precision
than I think is typical in Silicon Valley,
including sort of the deep theoretical understanding
of like the bi-temporal, multi-temporal queries
and how one can express those computations.
Yeah, well, it's sweet of you to say.
I think the, you know, I'm not a computer scientist.
I'm not a, you know, I'm not well-versed on the history
of how all this has come to be in the last few decades.
So the battle between streaming and batch is somewhat lost on me.
It seems like semantics to a certain extent.
Because when we think about architecture, you know, we think of
first principles of architecture: fundamentally, we just think that data
changes, and your architecture should expect data to change.
I know that you, Arjun, very much embrace something similar there where, you know, rather than taking
snapshots of data that you then need to recompute on as a whole new batch set comes in, you have your architecture organized in such a way that you know new stuff is going to come in.
You know, that's the architecture you should put in place.
If that means you want to use the word stream for that, fantastic.
But you could use whatever terminology you want. Just fundamentally, we think
you should embrace that the
first principle that your data is going to change and you
should be organized accordingly.
So guys, I think that's a great
opportunity to have
a little bit of a history
lesson, let's say. So it would be
great to hear from you
how you have experienced, let's say, the
evolution of stream
processing platforms. I remember, for example, at the beginning of the previous decade,
we had Spark Streaming at some point. Then we had Twitter coming out with Storm.
There was some kind of explosion, at least in open source projects, around stream processing.
New architectures were coming out there, like the Lambda architecture, for example, where people were trying to put batch together with streaming.
Then we had the Kappa architecture.
I think that was trying to merge everything into a streaming platform and also give some primitives that are more native to batch, and obviously there's a lot
of value delivered through all that stuff.
Obviously some technology survived, some are obsolete today, but we have like
companies like Confluent going public and people making money obviously out
of all that stuff, which is always good. So what's happened this past decade, let's say,
and also what do you feel like it's next?
Because all of you, one way or another,
you're working on building the next iteration of streaming platforms.
So it would be great to have this connection.
Let's start with Arjun.
Thank you. Yeah, I think what Pete articulated is the destination or the place we all want to get to where batch and streaming are really sort of implementation details. I completely agree with
that 100%. I think we're about 10% of the way there, right? It's been a multi-decade journey.
Some of the projects you brought up,
I think Storm was sort of the earliest open source project. And the way I would articulate it is
the things you could do in streaming were a tiny, tiny, tiny, tiny subset of what you could do in
batch back then, right? So precisely on the implementation level, you couldn't do any
stream processing that required that
your stream processors maintain a lot of state, right? So you couldn't effectively do lookups to
historical data. You were sort of building these, and in the algorithms world, there's this notion of
streaming algorithms, precisely those algorithms that don't maintain very much state, in a
big O sense. And that was really what was enabled by Apache Storm, right? So as a user, you had this big trade-off: do I care about low latency
and am I willing to limit what I do computation-wise? Or do I need that complex computation
and I'm willing to give up low latency and do it in batch processing? So that was sort of the
trade-off you had to navigate. In terms of a Venn diagram of, you know,
the use cases were, you know, pretty, pretty separated.
Like there was this giant circle,
which is the things you can do in batch,
and this tiny, tiny circle off to the side
of what you can do in streaming.
And over time, as we've had more capable stream processors
become, you know, available,
that Venn diagram has
started to have a little bit of overlap, right?
And that is where I would say the Kappa architecture comes in, which I think
the Kafka folks articulated. Before that, we even had the Lambda architecture.
So zooming even further back, people didn't want to make this trade
off, right?
They were like, I care about low latency, but I also want the fancy computation.
So maybe what I can do, to sort of get the best of both, is I'll run a batch computation
side-by-side with the streaming computation. I'll do the fancy stuff in batch and I'll do some sort
of approximation of the computation that I want in streaming. And I'll keep periodically running
the batch computation because I'm going to diverge from the true computation that I want
because the streaming is an approximation.
And then I'll sort of revert back to the batch and then reset and then recontinue.
And this was hideously complex, right?
Now you're running like three different systems.
You've got the batch, the streaming, and then the little thing putting it all together.
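As a rough illustration of the complexity being described here, this is a hedged Python sketch of the Lambda pattern: a fast approximate streaming path, a periodic authoritative batch path, and the glue that resets one from the other. All names are hypothetical.

```python
# Lambda architecture sketch: three moving parts to build and operate.

running_total = 0.0  # speed layer state: fast but allowed to drift

def on_event(amount: float) -> None:
    """Speed layer: update an approximation with low latency."""
    global running_total
    running_total += amount  # e.g. may miscount late or duplicate events

def batch_recompute(all_events: list) -> float:
    """Batch layer: recompute the true answer from full history."""
    return sum(all_events)

def reconcile(all_events: list) -> None:
    """The 'little thing putting it all together': reset stream state to truth."""
    global running_total
    running_total = batch_recompute(all_events)
    # ...streaming then resumes from this checkpoint until the next batch run.
```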
And the Kappa architecture was the simplification saying, why don't you just run everything
through the stream processor? The problem with this, as Ashley has brought up,
is the developer experience is terrible, right? Like everything in streaming is manual and writing
lots of code and maintaining tons of infrastructure. Whereas in batch, the lived experience
that people have in batch is like, I just write a little SQL, or, you know, in the data science world, I write this little Python program,
and I get to harness the power of a scale out horizontally scalable, reliable, massively
parallel cloud architecture that works on terabytes and terabytes of data for me,
you don't have any of that in streaming, right? So streaming today, it's like, well, that's great.
Now you're on the hook for your schema changes, you're on the hook for, you know, errors in your data stream.
You're on the hook for expressing the implementation. So the nice thing about SQL or, and it's not just
SQL, there's other sort of languages for sure in which you can express computation declaratively,
but in streaming, you haven't had that luxury. You've had to build out your own implementation languages.
And I think over time, and certainly our goal at Materialize, is to take that subset of batch computation, which we think is a pretty large subset of SQL, and make it available
to people in streaming so that they can be streaming with just SQL.
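For a flavor of what streaming with just SQL can look like, here is a small sketch. Materialize speaks the PostgreSQL wire protocol, so a standard driver works; the connection string and the orders table are made-up placeholders.

```python
import psycopg2  # Materialize is PostgreSQL wire-compatible

conn = psycopg2.connect("postgresql://materialize@localhost:6875/materialize")
conn.autocommit = True
with conn.cursor() as cur:
    # Declare the computation once; it is kept up to date incrementally
    # as new order events stream in.
    cur.execute("""
        CREATE MATERIALIZED VIEW revenue_by_customer AS
        SELECT customer_id, sum(amount) AS revenue
        FROM orders
        GROUP BY customer_id
    """)
    # Reading the always-fresh result is just a query.
    cur.execute("SELECT * FROM revenue_by_customer WHERE revenue > 1000")
    print(cur.fetchall())
```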
This does not cover all of the use cases, right?
So just as you have Snowflake and Databricks, right? There's a whole world out there that's not SQL for data science and
machine learning. And absolutely, we'll need solutions in that side of the space as well.
But I think we're maybe 10, maybe 20% of the way there, such that we're getting to that promised
land. The way I would rephrase what Pete said, it's like, we need to get to the point where streaming is a superset of what you can do in batch, right?
So that eventually we think of batch as a subset of the capabilities that we can do in stream.
Does this mean the batch processes are going to go away?
No, I think there'll absolutely be some times where the same computation you can sort of choose.
Do I want to use a batch runner because I have all that infrastructure already set up. Maybe it's a little
bit cheaper or do I want a streaming runner? You should be able to move computation back and forth
between sort of underlying infrastructures. I don't think, I don't think batch is going away
entirely, but I want the batch versus streaming debate, decades
from now, to be as much of an implementation detail as,
hey, are you running this on a column- or row-oriented execution engine?
It's like actually, you know, like 99% of people do not care what the answer to that
question is, right?
They just experience it as like, oh, this is great.
I'm having a really pleasant developer experience.
Yeah, makes total sense.
So, okay, Ashley, I'd like to ask you.
So you said that you had, I mean,
you were working like in a company, you had like to work with a lot of data.
You had like to deliver the data, like in a real time fashion.
Why didn't you use just, I don't know, Flink and Kafka, right?
Or whatever, Storm, like whatever was like available back then out there.
And you decided to go and like build your own solution
that ended up becoming Benthos.
It was because of the types of work we wanted to do.
So I think one of my early frustrations with stream processing tools
was that there was this focus on the really high fruit in the tree,
which was the actual queries. I was interested in single message
transforms, which was we're taking data, we're doing a small modification to it. You can do that
stuff with those tools, but we had latency requirements that it just didn't work out.
But the other bit that was missing was being able to orchestrate a network of enrichments.
So we had lots of different types of data science-y enrichment that we needed to add
to these payloads as they were passing through our platform.
And it was non-trivial, the mappings between them.
So we needed them both; one was a requirement of the other.
So we ended up with this DAG, a directed acyclic graph, of these enrichments that we needed
to execute in a streaming context, which meant as quickly as possible with as much
parallelism as possible.
And none of those tools, you could do it with those tools as a framework, but we needed
something that was going to execute those things.
And what we were looking for was declarative because we needed to slowly iterate
on those enrichments, change where they were, how they operated, what kind of data they were
giving us back and requiring because they were being actively worked on and slowly change the
graph as new requirements came in. And the idea of having to compile a code base every time that
happened, it just wasn't realistic. So we wanted to offload some of that work to non-engineers, but have it still be owned by engineers.
So we kind of went down this declarative path.
And that's kind of where Benthos essentially started: iterating on those key principles.
But I mean, all of those tasks are something that you could do with a batch system right now. And as we were getting
these streaming tools that were kind of like an alternative to batch for the analytics part,
we just didn't have any options for the actual pipelining stuff. There's like logging tools,
because almost all engineers have logging as a streaming problem. It's basically a stream
processor,
but then you don't get the delivery guarantees
and the observability and those kinds of things baked into it.
So yeah, it felt like for me, in my position,
that was the biggest gap.
That's the bit that I kind of focused in on
was integrations with services, enrichments,
transformations, that kind of stuff.
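To illustrate the shape of the declarative approach Ashley is describing, here is a toy Python sketch, not Benthos's actual configuration format: the enrichment graph lives in data rather than code, so reordering or swapping steps means editing config, not recompiling.

```python
# Stand-in enrichments; in practice these would call HTTP services, etc.
def add_geo(msg): msg["geo"] = "unknown"; return msg
def add_score(msg): msg["score"] = 0.5; return msg
def drop_internal(msg): return None if msg.get("internal") else msg

STEPS = {"geo": add_geo, "score": add_score, "filter_internal": drop_internal}
PIPELINE = ["filter_internal", "geo", "score"]  # the graph, expressed as data

def process(msg):
    for name in PIPELINE:  # execute the declared steps in order
        msg = STEPS[name](msg)
        if msg is None:
            return None  # message was filtered out
    return msg

print(process({"user": "a"}))                    # enriched and passed through
print(process({"user": "b", "internal": True}))  # dropped: None
```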
How did Benthos evolve as a project, especially from the perspective of, let's say, the use cases, right?
From when it started inside the company, up to today, right?
Where you're maintaining a project that is used by so many people outside the company it was initially intended for. What's your experience there? How did you see
things changing in terms of taking a stream processor built for a very specific use case
that you had there, and having it adopted, I would assume, for different use cases too? So can you
So I'd already had a bit of practice with making something like that generalized.
So having like a configuration schema
for expressing enrichments
as integrations to arbitrary things.
So the idea of, you know,
you have language that can communicate
interacting with an HTTP service
or a Lambda function
or a database query
or, you know, this and that and this.
So I'd kind of dabbled with it.
I could see where the problems were in generalizing that stuff.
So you end up just biting off a bigger chunk of how much you're going to abstract over that stuff
while still being user-friendly.
Once I'd gotten that, you reach that nice Goldilocks position of being super easy
for people to pick up quickly and run with.
But also when somebody comes to you and says, hey, I have this thing that it can't do yet.
Can you make it do that?
You can really quickly add that in.
So you're not immobilized: it's not so generic and generalized that you can't make
things easy for people, but it's also not so locked into being easy and intuitive for its existing use
cases that you can't slot something else in. That's mostly just the config. It's basically the way that I've kind of decided to
chop the concept of batching and what a message is, what a batch is inside the process itself,
and what the responsibilities of the various component types are. What is a processor? What is
a transform? What is an output? How do you compose those things? But the core concepts haven't really changed that
much since day one. And that's because I'd already had practice with a few attempts at
generalizing that kind of thing. And then after that point, it literally just becomes
a case of five years of people going, oh, but what if it could just do this extra bit
as well? And I'm like, okay, let's do it, let's put that in. Or sometimes saying no.
That's cool. So, okay, Jeff, your experience at Netflix, again, with all
these tools out there, right?
Why did you end up building Mantis?
What was the reason that made you say, okay, we need something new?
Like whatever tools are out there, they do not fill the gaps that we have here.
So Mantis fundamentally is a stream processor, but the use case that started
it all was basically trying to reduce downtime, or MTTI, mean time to insight: understanding
why something's happening, like something negative
happening to the service. And so you have your traditional telemetry stacks, like metrics,
logs, and traces. And certainly we at Netflix at the time did use that, but after some point it
just got too expensive. If we wanted the most granular
insight possible, we would have to log everything all the time, aggregate it, store it, hopefully TTL
it out the other end. But we needed something that could get us the cost
effectiveness while upholding the granular insights. And so to give an example, petabytes of data
were going through the system
and then most people would filter out
like 99.9% of it.
And then so one of the great use cases
is like, I'll give you all an example.
So Mantis gives you the ability to say,
if I'm looking at,
if there's a playback issue,
like someone's having a problem
playing a title, we can get that insight into which country they're coming from, which title, which episode, which device.
And then like versions of the device, and then you can exactly see like what's going on.
And so that's very, very targeted query to like a specific thing.
And so you don't really pay, you don't pay the cost of that query
if it's not in use basically.
So it's more of like a reactive model,
like the reactive streams model.
And so another example of like
when we were decommissioning a Netflix app
for some devices,
we can see exactly which users are using that over time.
And then maybe send them a message, like, hey, we're going to decommission
this app for this device, please go do something else.
And so it was more of a real-time use case: what's happening here.
Now, another example is when we do regional failovers,
which happens quite regularly, like you can see the number of people hitting
play, like dip in one region and increase in another region.
And then when we switch back, it goes the other way.
And so it's used for a lot of tooling.
But then the other feedback we've gotten was like,
okay, well, what if I don't want real time?
What if I want to look at something like in a batch case
or in like a snapshot of data
or join that like historical context
with the streaming set of data to make some sort of decision later on.
So then our answer to that was, okay, we would just build in sinks.
So sink it out somewhere else.
And then you would have to do basically a stream-table join later on.
But that's another story.
So it all started with observability basically.
Hey, Jeff, by chance, I've heard you speak before,
and one of the things that you say that I think is really interesting
and very important, actually, for the streaming community
is this idea of not sharing all of the data.
I think that that's one of the founding principles of Mantis.
And I think for stream processing players
that are used to just receiving a firehose queue
and then having to do things about it.
You're approaching that a little bit differently. And I think that's an important concept because
you can all of a sudden be moving a lot less data around your system or your mesh,
which is a principle also that I think Arjun holds dear. So can you talk to us about how that came
out as a principle and how you've executed that in your system? Yeah, there's three parts. So first is we encourage developers to publish all of the data that they can.
So like an event might have hundreds of fields
and it might be like very large in size, basically.
But we also have it so that the infrastructure
doesn't consume all of that data by default.
Like you have the ability to do that, you being the user, but you have to be very intentional
in doing so.
So certainly there are some long running jobs which basically perform a select star, 100%
sample, but it has to be, maybe it's a low volume stream or something like that.
But most people would either do like a sample or select some of the fields or even filter
some events out based off of a condition.
So the first is like publish everything, but be very
intentional about what you consume.
And then the second is like reusing like subscriptions basically.
So if multiple people are asking for the same set of data, and if you have that
already in memory, just don't go all the way up to the source of the data,
these applications; just send what you have down to the people that are subscribed with the
same pattern of data that they're asking for.
And so really you're getting like a lot of reuse.
You're getting a lot of like intentionality, like consume only what you need.
Because in reality, you don't want all of the data.
And if you do, then you can do that if you want.
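A toy Python sketch of the publish-everything, consume-intentionally idea: producers emit full events, while each subscription declares a filter, a projection, and optionally a sample rate. The names here are illustrative, not Mantis APIs.

```python
import random

def subscribe(events, fields, predicate=lambda e: True, sample=1.0):
    """Consume intentionally: filter, sample, and project per subscriber."""
    for event in events:
        if predicate(event) and random.random() <= sample:
            yield {f: event.get(f) for f in fields}  # only requested fields

events = [
    {"country": "BR", "title": "T1", "device": "tv", "error": "PLAYBACK"},
    {"country": "US", "title": "T2", "device": "ios", "error": None},
]

# A very targeted query: playback issues only, three fields, no raw firehose.
for row in subscribe(events, fields=("country", "title", "device"),
                     predicate=lambda e: e["error"] == "PLAYBACK"):
    print(row)  # {'country': 'BR', 'title': 'T1', 'device': 'tv'}
```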
Yeah, we hold the same things dear.
We use the phrase, you know, we move data around the system in lazy fashion, right?
That the producer, our APIs allow the producer to have quite a bit of information from the consumer
in terms of what it wants from the table and at what update frequency it wants,
because there's different use cases.
Some of them are throughput, some of them are latency, and some of them are consistency
driven. And then we also sort of hold on to what you just talked about in regards to sharing work
products. You know, we memoize data so that if you have a few consumers that want the same
thing, you're able to share it, because oftentimes, you know, you think it's scale-out; it might be a few
thousand consumers. And so there's a lot less work that needs to be done in these
types of use cases that are really evolving quickly in the streaming
world, we think.
So guys, one of the things that surfaced through the conversations that we have had is
that something that's quite important around stream processing
is the developer experience.
It's something that still needs a lot of work, and probably one of
the things that the previous wave of technology out there didn't pay as
much attention to as it should.
And my question is, and I'll start with you, Pete, because you are offering a product that exposes an interface
to the users that's very, let's say, familiar to a very specific group of people that are
not necessarily extremely technical, in the sense of the engineering of these systems.
So I want to ask how we can take streaming and make it, in the end, more accessible to engineers. With a database, you don't think in these terms,
right? And a database is what most engineers, most developers, have
been exposed to.
So how can we bridge, let's say, all these primitives and the new
language that streaming is bringing to the table with what is most
commonly known out there to developers?
And it's one of the questions that we got from one of the attendees as well.
That is, okay,
you can, let's say for example, exchange the terms extract
for publish, load for consume.
But at the end, is this what is needed?
Do we need to change our language and educate people, or can we do better?
So I have a bit of a radical answer to that.
It's a little bit self-serving, so I hope you'll pardon my indulgence here.
But we actually introduced a new primitive that we think is important.
So we think event streams and message queues are vital.
And we understand that that is the foundational building block of the conversation we're having today.
We could not embrace it more heavily.
However, we don't think it's the whole story. We think that the most intuitive object for people to work with
for data-driven applications, for data science, for data analysis is the table, the data frame.
I'm not making this up; this isn't just Pete Goddard's opinion. Walk around, see how many people
use Excel, go survey SQL users, look at R data frames, look at Python pandas.
It's not an accident.
They're all tables or something darn close to the table.
So what we've done at Deephaven and we put it out in open source is we've really
tried to create a dynamic table.
So tables should not be static.
They should be things that can change, right? That should be
true at the API level. That should be true at the engine level. And that should be true at the user
experience level and the API, the JavaScript APIs that are supporting that. So we think that that is
a very important proposal to the world or to the community in regards to how do I make it easier
to work with streams?
So when you're using Deephaven, you might integrate with whatever, you'll copy paste
something we have from our documentation to onboard a Kafka stream or a Redpanda stream,
or you'll integrate Solace or something like that, right? And now you, or our enterprise customers, will, you know, push binary
logs from Java or C++ applications of theirs, right?
But when it hits Deephaven, it really hits
as a table structure, and then we keep track of changes in the table.
And I think making it so that we can do really high performance compute, because we're only doing calculations on changes instead of on the actual data, which means less data and faster compute, is powerful. But it really works because we let people think in terms of tables, and we let people use tables. And oh, you want to do machine learning? Well, chances are
you're using tensors, which are arrays, which are, you know, chunks or columns of tables. This construct
of pushing a table around, a dynamic table, as a first primitive is really, really helpful for
the user experiences you're talking about. In addition, one of the things that this group is well aware of: as soon
as you get out of batch, think of how much money, meaning market cap, is involved
in the batch world, right?
So just think of BI, right?
BI, you have Tableau and Looker and, you know, Power BI and Qlik and, you know, 12 other
competitors, right?
Give any of them a stream and your experience is pretty much gonzo.
So there needs to be significant investment to support those types of development, exploratory, interactive, and dashboarding experiences with real-time data.
We've been very involved with that for a number of years, but we obviously will bring it.
It will need a community to get it all the way there.
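A minimal Python sketch of the dynamic-table idea, the concept only, not Deephaven's actual API: consumers register a computation once, and the engine reapplies it to changes as they arrive rather than rescanning the whole table.

```python
class DynamicTable:
    """A table that notifies listeners of row-level changes."""
    def __init__(self):
        self.rows, self.listeners = [], []

    def append(self, row):
        self.rows.append(row)
        for listener in self.listeners:
            listener(row)  # compute on the delta, not the whole table

# A running aggregate stays correct by consuming only the changes.
totals = {}
def on_change(row):
    totals[row["sym"]] = totals.get(row["sym"], 0.0) + row["qty"]

trades = DynamicTable()
trades.listeners.append(on_change)
trades.append({"sym": "AAPL", "qty": 100})
trades.append({"sym": "AAPL", "qty": -40})
print(totals)  # {'AAPL': 60.0}
```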
Yeah.
So Jeff, you mentioned for Mantis specifically that you implemented more of a reactive model there, right?
Reactive programming is a paradigm that is, I guess, quite well known to people doing, say, front-end development, right? So what's your opinion on how we can make
streaming technologies more approachable to more engineers out there?
Yeah, that's a good question. In my mind, there's the user and
then there's the developer. User is a fun story. But what I've been thinking about a lot lately is, I actually wrote a
stream processor very early on in my computing career.
It looks like: take a list, call stream, call flat map, call reduce.
Right.
So that's like an API that's very standard in a lot of programming languages.
Or take another example.
You have a list and then you have a for loop and then you go through and then you do something with it.
And so, kind of like reactive streams, you have an API that has, you know, your flat maps and all that other stuff.
Like I'm wondering from a developer perspective,
that's kind of just what I want to do.
I know how to write a for loop.
I know how to write a stream, flat map, reduce, et cetera.
And if I could just give that to somebody
and then they'll parallelize it,
they'll manage all the state,
throw it in RocksDB and all that other stuff.
Like that's good for me
because I know what a list is
and I know what a for loop is.
And that works for me.
Another example is like in some of these stream processors, there are great APIs.
You have windowing and triggers and all sorts of stuff.
But if you want the ultimate flexibility, there's things like just the process function or just basically a block of code that you write, you run, you put all sorts of variables in there.
And that's pretty nice.
Like that's kind of just what I want as a developer.
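Jeff's point, in code: the same computation written as the for loop every developer knows, and as the stream-shaped flat map and reduce chain he mentions. The pitch of a stream processor is that you write the second form and it handles the parallelism and state. Plain Python, purely illustrative.

```python
from functools import reduce
from itertools import chain

orders = [[1, 2], [3], [4, 5]]  # e.g. batches of order amounts

# The familiar for-loop version:
total = 0
for batch in orders:
    for amount in batch:
        total += amount * 2

# The stream-shaped version: flatten, map, reduce.
total_streamy = reduce(
    lambda acc, x: acc + x,
    (amount * 2 for amount in chain.from_iterable(orders)),  # flat map + map
    0,
)
assert total == total_streamy == 30
```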
Yeah.
Makes sense.
Ashley, what about you?
You have built a tool from scratch, and you are interacting
a lot with developers out there.
So what is missing?
What do you think is missing from streaming infrastructure
for it to become more approachable?
No, I think the first thing is,
if you take a developer that's innocent
and hasn't suffered at the hands of stream processing
and you invite them into your world of,
hey, why don't you do that with streams?
I think the immediate repulsion to that is,
oh, that looks like a lot of hassle.
That looks like it's going to wake me up at night
and I'm going to have to do all kinds of stuff to get that back up. And I think answering that anxiety,
for me, that's been one of the, I wouldn't say it's a struggle. It's more that when people get to
me, they've already overcome that somehow. So I'm kind of seeing this relief: oh, actually, this isn't that bad. But I think
the operational side is still a nightmare. And it's still, there's a lot of moving parts still.
There's all the different services. You imagine like an architecture of stream processing,
a team that's decided that they're going to lean into it heavily and everything's going to be on the low latency side. The number of moving parts is vast if you want everything. And every component
wants its own state. It wants its own claim to the disk. It wants its own thing that can fail.
And the idea of being woken up at 3am and one of your disks is corrupt and you've got to sort out,
okay, well, which services in my streaming platform now,
you know, have been suffering? How have they been suffering? How do I recover them from that state?
It's, it's overwhelming, I think for a lot of people and it's very comfortable to just say,
okay, well, we don't need that yet. We don't, we don't need that kind of thing. I think it's kind
of similar to, to, you know, Kubernetes and things like that, where it's exciting to some people, but to
a lot of people, it's too much hassle, feels like it's not worth the investment.
And I think it's something that is improving over time.
There's obviously alternatives to Kafka now.
There's companies like Redpanda.
Lots of tools are coming along that try to simplify.
A lot of the products here are nice and simple to use and just kind of slot into existing
systems, which is great.
So I think that's kind of what I'm hopeful for in the future is that a lot of those moving
parts become less moving and more simple.
So do you feel like, I mean, isn't the data itself and the use case itself driving this?
I'm just unfamiliar, right? I don't come from the same background as you.
So I'm just unfamiliar of, oh, this is a thing that's working in batch.
Should we try it in streaming?
That doesn't have much resonance with me. But hey, there's a new type of
data, it drives a new type of value to me.
I need to keep up with my competition.
I need to be responsive in a
shorter amount of time, or there are new data sets that are coming in where transactionality
might not be as important. Therefore, I can embrace a new way of doing it. To me,
from the people you talk to, is that oftentimes driving the migration to dynamic data, real-time data, streaming technologies?
Or is it really just, oh, you know, what you cited at first, which was they want to do, you wanted to do batch, but you need to do it at such scale that you kind of had to do it all the time?
Or what really drives people into this dynamic data stuff?
It's a bit of both. So there's some people who have to, and they used to have something
that they wrote themselves,
and it's dodgy,
and they wanted something that's not.
And then there's a lot of people
who are just, you know,
they're using Benthos,
which is a stream processor
to basically just read a batch workload
because you can read like SFTP files,
and they just,
they like the fact
that they can just run this thing
and it's always on.
It's always, you know, they only need this data like once a day to be refreshed.
But, you know, they just like the fact that this thing will just always run.
It just sorts itself out and seemingly doesn't require much effort.
It's got a nice little config that, you know, three people can be working on in a source control.
And to them, it's a simpler world. And the job isn't
that complicated. It's just hitting some endpoints and maybe has a fallback with some dead letter
queue or something on an alert. So I mean, there's those two extremes and then there's
everything in between as well. There's people who feel like we probably ought to make this
more efficient or streamy, but we don't have to yet until we find the thing
that, you know, suits our particular requirements. It's kind of hard to say, but yeah, I see all of
them. I see people from all walks of life coming and discovering, actually, it's not that bad.
That's the main take is actually, it's not that bad.
Yeah, that's great. Okay. Well, we are right at time here, but I want to get in this question really quickly
and we'll see how many of you can answer.
Arjun, we'll start with you.
Chris asked a great question.
He said, coming from the traditional batch ETL world,
I found using new vocabulary to describe things
can help with thinking about how to do things in new ways.
Examples: extract versus publish, or load versus consume. Has there been any terminology
that you use or have heard that you think has been helpful in terms of breaking out of the way
that we talk about these things traditionally? I love this question. I'm going to take the
completely contrarian point of view, which is that no, we should stop doing this. There's so
much amazing vocabulary and intuition in batch. I mean, it's the entire
reason we named the company Materialize, right? Which is, what are you doing? What are you trying
to do with the vast majority of these streaming pipelines? You are trying to build a materialized
view that stays up to date over changing data and relating the difficult newness of streaming to the concepts
that people are familiar with from batch, I think has a tremendous amount of value.
I know this is a bit of a contrarian hot take.
It is very much a reaction to how needlessly complicated streaming has been for so long, and to trying to simplify it in the terms that are most relatable to the audience.
And we really see the audience as people, you know,
the 100x the amount of developers who have not tried streaming,
have never, you know, have never poked at it
and are maybe even a little bit rightly so afraid of poking that bear.
So I'm going to take the contrarian take.
Maybe I should have been the last to speak.
I love it.
I love it.
Jeff, what say you?
Yeah, no, I'm actually plus one on that.
Just keep it simple.
So I sit in the change data capture space right now.
Really, you're just capturing changes from databases, aka extracting something from a
database.
Someone wants to transform that and someone wants to load that somewhere else. So just keep it simple, I think. Ultimately, for me,
are we talking about the technology or are we talking about the use case? I think ETL as a use
case, yeah, you're going to extract something, you're going to transform it, you're going to
load it. As a technology, there are many things under the hood which have other definitions.
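For readers new to change data capture, here is a sketch of the kind of change event Jeff is describing, in the before/after envelope style popularized by tools like Debezium; exact field names vary by tool and are illustrative here.

```python
# One row update captured from a database log, as a CDC event:
change_event = {
    "op": "u",  # operation: c=create, u=update, d=delete
    "source": {"db": "payments", "table": "invoices"},
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "paid"},
}

def apply_change(state, event):
    """Loading downstream is just replaying changes onto a copy."""
    if event["op"] == "d":
        state.pop(event["before"]["id"], None)
    else:
        state[event["after"]["id"]] = event["after"]

table = {}
apply_change(table, change_event)
print(table)  # {42: {'id': 42, 'status': 'paid'}}
```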
Yep, that's great. All right, Ashley, you invented a new term
called streaming ETL earlier on the call.
So what do you think?
I did not invent that.
I've seen, I promise you,
I've seen that somewhere else.
And I thought, oh, maybe that fits.
I'm not good to ask about this stuff.
I'm really bad with it.
Because I mean, I didn't think
I was a data engineer for ages. I discovered data engineering way late. I had a data engineering
tool and I didn't know what data engineering was. So all this stuff I've had to learn,
what do they mean when they talk about ETL? Because to me that was just processing.
Every program is ETL. It's reading something, doing something and then putting something out. So, I mean, yeah, I feel like all these terms are super vague and kind of conveniently unspecific anyway. So I just kind of adopted them for whatever I had going on.
Yeah, I don't know.
Just wait till you hear ELT.
The marketers are trying to confuse us.
All right, Pete, with the last word.
You know, I think my answer is similar. Mostly, I just don't feel qualified to dictate language
and nomenclature to people. Certainly, I come from a space where pub-sub systems feel pretty natural:
you're moving data from one server to another, and there are publishers and subscribers, and that
forms a data mesh.
I don't think that that's a sophisticated concept to embrace and might be an easy way
to think of things.
As others have said, I think the best thing we can do as a group is make it so that we
can talk in English that a sixth grader can understand, right?
So I am moving data here.
This thing wants this data.
It is going to do that with it, et cetera.
And whatever words you choose to coalesce your team around, that'll work for us. The only term I would introduce is the one that I mentioned earlier, which is an invention of ours: this streaming table. Look, it's got the same attributes as streams, plus a lot more use cases that you can handle semantically with that primitive.
But it's a table that changes.
And probably when I say, Hey, it's a table that changes to a sixth grader,
they can generally understand probably what I mean. And that is exactly what it is.
Yep. Love it. Love the push for simplicity across the board. All right. Well, thank you so much,
everyone, for joining. We're a little over time. So thank you to the listeners and to the panelists.
We really appreciate your time and have learned so much.
Thank you.
Thanks very much, everybody.
Cheers.
Me and Kostas, we get to talk to a lot of really smart people.
You know, I think my takeaway was that through all of the different complexities that we talked through, right?
So use case complexities, technology complexities, different opinions on that, was that at the end
of the day, when we asked them how to describe these things, you know, with the user question,
everyone said, keep it really simple, right? And Ashley even said, you know,
when I first started, I didn't know what ETL meant, right? I just called it processing. I'm taking data from
here and moving it over here.
I was just processing data, right? And I really appreciated that because I think it's good
with all of the new technology, sort of the new ways of thinking about things.
I think it's really healthy for us to step back and say, you know, at the end of the day,
we're moving data. The technology can do some really cool stuff, but the fundamentals
are actually not that complicated. And I think the other sub-point, I guess I'm
forcing a two-for-one here, is that Jeff made the point that the user, the end user who wants
the data, could care less. So those are just really, really good reminders for me. How about you?
Yeah, absolutely.
I think that's what I'm going to keep from this conversation that we've had
is that the experience matters a lot
when you are working with these tools.
And there are like two levels of the experience.
One is the experience that the developer has
who's going to build whatever on top of these technologies.
And then there's also the user experience,
which is the consumer of the data that comes out of these systems.
And both of them, if we want to move forward
and increase the adoption of these tools,
we need to make sure that they don't have to learn new terminology,
they don't have to learn new ways of thinking and designing systems,
and we need to keep things familiar and simple as much as we can.
Yep.
Well, another amazing live stream episode of the panel.
We have more of these on the schedule, so be sure to keep your eye out.
We'll let you know when they're coming and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.