The Data Stack Show - 163: Simplifying Real-Time Streaming with David Yaffe and Johnny Graettinger of Estuary
Episode Date: November 8, 2023
Highlights from this week's conversation include:
Johnny and David's background in working together (1:56)
The background story of Estuary (4:15)
The challenges of ad tech and the need for low latency (5:44)
Use cases for moving data at scale (10:35)
Real-time data replication methods (11:54)
Challenges with Kafka and the birth of Gazette (13:54)
Comparing Kafka and Gazette (20:22)
The importance of existing streaming tools (22:28)
Challenges of managing Kafka and the need for a different approach (23:40)
The role of compaction in streaming applications (26:54)
The challenge of relaxing state management (34:01)
Replication and the problem of data synchronization (36:48)
Incremental Backfills and Risk-Free Production Database (46:03)
Estuary as a Platform and Connectors (47:45)
The challenges of real-time streaming (57:56)
Orchestration in real-time streaming (1:00:51)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Now, this week's recording is with Johnny
and Dave from estuary.dev, and I think this is going to be a really fun conversation.
It's a topic that we've actually covered quite a bit on the show, which is streaming,
you know, in particular real-time streaming. But this is really in the context of, I think,
what you use streaming for.
And we really dig into sort of the Kafka side of the conversation,
which we haven't covered in depth a ton.
But part of the estuary story is really reacting to real-time streaming needs,
evaluating Kafka, and seeing some pretty
severe shortcomings, which is why they built Estuary.
Now, what's really interesting to me is, in many ways, they don't talk about Estuary as
a streaming service.
You know, they kind of talk about it almost as real-time ETL, which is fascinating.
There's some open-source technology under the hood.
And this is really, I think, going to be an interesting conversation because streaming is obviously a hot topic.
And there are multiple technologies.
So really interested to see what the Estuary team has built.
Yeah, 100%.
It was a very fascinating conversation, actually,
for many different reasons.
First of all, it was pretty technical,
and not only in terms of talking about Estuary itself.
Actually, we had a very deep dive into Kafka,
how Kafka is built, and some of the issues there
that Estuary is addressing
from the perspective of the architecture of the system. For example, we were talking about how compute and storage in Kafka are very
tightly coupled and how this changes with something like Estuary,
and what that means in terms of managing the system and what type of use cases it enables.
So we had a very interesting architectural conversation
around this type of system.
Anyone who is interested in understanding better
how Kafka and these types of streaming systems work
should definitely listen to that.
And then we talked a lot about also some important concepts like CDC, right?
And why CDC is important, how we use it and how they implemented it.
Because the standard out there is pretty much like using something like Debezium,
but the folks at Estuary actually implemented everything from scratch.
And they have some really good reasons why they did that.
And they are talking through these things.
So amazing people, both Johnny and Dave, very deep expertise in this type of technology.
And we had an amazing conversation ranging from the technical side of things up to the
business side of things.
So I think everyone should listen to them.
And hopefully we're going to have them again in the future because I don't think one hour
was enough to go through all the different topics when it comes to streaming.
Totally agree.
Well, let's dig in and talk about streaming with the Estuary team.
Dave, Johnny, welcome to the Data Stack Show.
We are so excited to chat about lots of things.
So thanks for giving us some of your time.
Yeah, thanks for having us.
Thanks for having us.
All right, we'll start where we always do.
You have a great history of working together
on multiple different things, doing multiple startups,
selling different companies. So I'll let you choose. Who wants to start with the background story?
I'll go for it. So I'm Dave Yaffe, co-founder of Estuary. I've been working with Johnny for
about 15 years. We started working together back in probably 2008 or 2009 at an ad tech company. And we built a lot of technology around those use cases,
which helped us meet customer needs.
We ended up building a platform called Arbor,
which was a data platform that facilitated transactions
between publishers and advertisers.
And that required very low latency.
We built a streaming system to facilitate that.
And we ended up selling that to a public company back in 2016.
So the streaming system we built was called Gazette.
Gazette was a pretty novel piece of technology, which Johnny will dive a lot deeper into.
But we cared about low latency because when someone's purchasing an item on a website, they are expressing interest in a
product. And the key is to get back in front of them really quickly in order to use that data as
efficiently as possible. So we wanted to make sure we were able to do that. We built that system.
And now we've been applying it for the past four years at Estuary to the world of data
infrastructure. Yeah.
So hi, I'm Johnny.
Dave mentioned we've been working together for quite a while now.
And yeah, whatever your feelings may be
on the advertising ecosystem,
in some respects, we're like recovering
from the ad tech ecosystem.
But as an engineer in the space,
there are pretty fascinating data challenges
just because of the speed at which
you're having to make decisions, which is typically in like the 100 millisecond type
timeframe.
And the amount of data that you need to bring to bear to make those decisions, because you're
consulting various indices of data to figure out like whether to show an advertisement
and how much to pay for it and that kind of thing.
And just from a data management perspective too, because you're talking about gobs of this stuff
coming in, you need to sort of connect it together, build various graphs that sort of allow
you to understand how these data points fit together, project information through it, get it
back out again as quickly as possible. So there are a lot of sort of very interesting engineering
challenges and like wrangling this constant influx of data into useful data products and doing it really quickly.
So as Dave mentioned, we ended up building some technology to help us do that for the last company that we built called Arbor that we open sourced as a project called Gazette. And we are now continuing to build on that project today with our current company, Estuary, which focuses on essentially sort of making it really easy to connect the
various systems that you have that contain data that you care about. So whether that's an OLTP
database or a PubSub system, or even a Google Sheet, or various SaaS APIs, being able to connect
that data to where you want it to
be, which could be another database or an analytics warehouse or an Elasticsearch cluster.
So there are generally all kinds of places where you have data,
and we make it a lot easier to kind of move it around and transform it along the way.
Very cool. So this is just a personal curiosity about ad tech because I'm
somewhat familiar, not definitely not as intimately as each of you, but it sounds like
the APIs for delivery of ads were pretty robust because you're talking about sort of collecting
this data, determining, you know, so someone's like browsing products on a site, you want to sort of retarget them.
And so you build Gazette to sort of facilitate collecting that data and syncing it to these
other systems.
But then you also have all these mechanisms to deliver those ads.
I mean, is that, you know, because you're to some extent, even if you can move the data
with extremely low latency, you're limited by the API that
you're delivering it to, right?
I'll go quickly.
I'm sure Johnny will add a lot of context, but there's a thing in the advertising industry
called real-time bidding, right?
And real-time bidding is effectively a protocol which enables publishers or exchanges of publishers
to share opportunities to partake in an auction. And what they'll do is
they'll have an auction which goes out to tens or hundreds or even more different potential buyers.
And they'll say, here's the opportunity. Do you want to be part of it? That's a protocol that
existed back in 2008, 2007. And it still exists today. It's how almost all of the transactions that you see online,
the ads that you see online are transacted. And when I left
that space in 2014, it was happening 16 million times a second across the United States. So it's really,
really robust.
Yeah. And some of the data challenges are, if you're a demand-side platform, you're operating a bidder that's trying to make a decision.
There's information that comes in that real-time bidding request, and you're using that information
to consult various indices you have to figure out whether or not you're going to make a purchase decision and how much you're going to pay for that.
So those auxiliary sort of indexes of data that you're using to make those decisions need to be up to date and ready to go in order to make that decision.
So it's not just handling the request coming in.
It's also handling all of the data that you're using to make decisions in terms of how to respond to that request. So I'd love to hear about a couple more of those
types of use cases. So I mean, that makes a ton of sense, right? Where you
need to move a massive amount of data in near real time so that people can sort of get these highly contextualized ads.
What are some other use cases where moving data at that scale, you know,
people are using Gazette or Estuary to facilitate?
Yeah, scaling down a little bit, because of course most companies are not in the ad tech
space. That was mostly just to give a little bit of our journey, in terms of how
we got into this space and why we do what we do. But another example I'll kind of reach for is a
lot of companies have a database, like an OLTP database that they're using to manage sort of
their business data and process transactions. You very often want to wire up caches to that,
or another use case that we see a lot is search.
You want to enable sort of fancier forms of search, whether that's like elastic search
or vector databases like Pinecone, which are really just a variant of like semantic search.
So you have all of these different systems that you might want to use to power search
for your product or whatever it is that's your kind
of business domain.
And you don't necessarily want all of that just going through your one database.
So a fairly common bread and butter use case is taking that data from your primary database
and then pulling it out, transforming it a little bit and making it useful in these various
backends for search like Pinecone or Elastic and others. And there, it's a caching problem. So it's
important that it be up to date and reflect what's in your database right now.
Yep. That makes total sense. And I mean, a number of sort of ways that people solve this currently
come to mind, but what are the patterns that you see? So if
you're not using Gazette or Estuary, what are people doing? Is it just a much slower sort of
database replication process? I mean, a lot of people would even use Kafka to just
ingest the change log and do some sort of transformation so that they can populate
some other downstream system. What are the popular ways that you see people do this today?
Yeah, there's, if you truly care about real time, it's probably Kafka plus Debezium,
plus engineers to manually manage a lot of that stuff, because there is some manual aspect to it.
And if you don't care about real time, which is also common, you might be using a system like Fivetran or Stitch to be able to sync those assets together.
Sometimes you'll see people who just query their database and simply do it all in-house without using something like Debezium.
But lots of different possibilities on how you could potentially sync that data. Our goal is to make it as easy to work with streaming data
that's truly updated in real time
as it is to use any of those other methods.
So really just like configuration,
get it from your source, get it to your destination.
If you want to do transformations,
try to make it as simple as SQL
and kind of make the whole thing much more point and click
and scalable than it would be otherwise.
Yep. Okay. And I have to ask this question because, I mean, I don't know, maybe our
listeners aren't asking this, but it's in my head. So I have to believe that you
looked at using or used Kafka to solve some of these problems early in your journey pre-Gazette,
right? I mean, there weren't a lot
of other systems that I think would be able to handle the type of scale that you were describing
sort of in the ad tech industry. Is that true? Yeah. So the decision to start Gazette was made
back in 2014. I was evaluating Kafka at the time
and we were sort of building this business
that focused on data within the advertising ecosystem.
And we knew that latency was really important
for the business.
It was important that we'd be able to sort of capture
this stuff and transform it
and move it as quickly as possible.
So Kafka is the obvious game in town for doing that.
And that was the case in 2014 as
well. There was one sort of architectural limitation of Kafka that I just couldn't get
past, which is the central reason for building Gazette, which is the coupling of compute and
storage that Kafka does. So when you have Kafka brokers,
you are writing data into Kafka topics. And those Kafka brokers are basically managing that data on
the disk of the Kafka broker. And as you expand the cluster or grow it, it's basically moving
that data around, but you are fundamentally limited to the disks that are attached to your Kafka brokers. But even worse
than that, there's this tension that exists within Kafka and downstream usages of Kafka, where
your brokers are responsible for accepting the real-time writes that are coming into your system,
basically like the data that's coming in that you cannot lose. So the number one most important role of those brokers is getting that data down, getting it replicated, making sure you're not going to lose it so that it's available.
But at the same time, you have these downstream use cases that are wanting to pull from these topics of data and do something with it, process that data, turn it into some kind of new data product.
And if you imagine for a moment
that you're bringing up a new application, you're a startup, perhaps in the ad tech space,
you're bringing up some kind of new consumer of data that's got to basically backfill over all
this historical data that you might've seen so far. Now that application, you want it to basically
slurp up data as quickly as possible from your streaming system because you want to process all that backlog of historical data, weeks of data, months, maybe even years of data.
There's an inherent tension where that application wants to read data as quickly as possible from brokers and the same disks that are also responsible for your real-time writes coming into the system that you cannot lose and you want to be fast.
So that coupling of storage and compute is the primary reason for starting Gazette:
solving that problem by decoupling them.
So what Gazette ends up doing differently is that the brokers are really just focused
on the really near real-time stuff, like the writes that have just happened. They're focused on getting that
replicated, getting it down, making sure it's not lost. Then very quickly after that, they persist it
into cloud storage. Cloud storage is actually the durable representation of that data. That's where it actually lives. After the near moment has passed
and it's now in the historical record, it lives only in cloud storage. And what's really neat
about that is that applications can basically, you can stand up a new streaming application
that is going to process months of historical data and also is going to catch up
and stream in real time once it processes that backlog. And when that new application comes up,
rather than needing to go hammer the brokers and pull as much data through them as they can,
what they're able to do instead is basically just ask the brokers occasionally,
hey, where do I go find the next gigabyte of data? And then just go read it directly
out of cloud storage and then sort of seamlessly transition to real-time streaming as they get to
the present. So that's a capability that we built out in Gazette and leaned on a lot for building
all kinds of sort of downstream applications that are consuming quite a bit of data. Obviously,
that decoupling of storing things
in cloud storage means that there's really no retention limit on how much you can keep in that
historical record. So in a sense, I had one person tell me it let us be fearless when building
architecture and figuring out how to put these pieces together, because we really just didn't have
to worry about whether this new downstream use case was going to bring down production.
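To make that read pattern concrete, here is a minimal Go sketch of the idea; the types and method names (Fragment, Broker, NextFragment, LiveTail) are hypothetical stand-ins, not Gazette's actual API. The consumer backfills by asking the broker only for pointers to historical fragments, reads those directly from cloud storage, and then switches to tailing the broker for the near-real-time writes.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Fragment points at a chunk of historical journal data that lives in
// cloud storage, not on the broker's local disk. Hypothetical type.
type Fragment struct {
	Offset int64  // journal offset where this fragment begins
	URL    string // e.g. an S3/GCS object holding the fragment bytes
}

// Broker is a stand-in for the metadata API: it only answers "where is
// the next fragment?" and serves the live tail; it never replays history.
type Broker interface {
	NextFragment(journal string, fromOffset int64) (Fragment, bool)
	LiveTail(journal string, fromOffset int64) io.Reader
}

// fetch simulates reading fragment bytes directly from cloud storage.
func fetch(f Fragment) io.Reader {
	return strings.NewReader(fmt.Sprintf("historical data of %s\n", f.URL))
}

// consume backfills from cloud storage, then switches to the broker's
// live tail once no more historical fragments remain.
func consume(b Broker, journal string, process func(io.Reader)) {
	var offset int64
	for {
		frag, ok := b.NextFragment(journal, offset)
		if !ok {
			break // caught up with history
		}
		process(fetch(frag)) // the broker's disks are never touched here
		offset = frag.Offset + 1
	}
	process(b.LiveTail(journal, offset)) // now stream the near-real-time writes
}

// --- a toy in-memory broker so the sketch runs end to end ---

type toyBroker struct{ frags []Fragment }

func (t *toyBroker) NextFragment(_ string, from int64) (Fragment, bool) {
	for _, f := range t.frags {
		if f.Offset >= from {
			return f, true
		}
	}
	return Fragment{}, false
}

func (t *toyBroker) LiveTail(_ string, from int64) io.Reader {
	return strings.NewReader(fmt.Sprintf("live data from offset %d\n", from))
}

func main() {
	b := &toyBroker{frags: []Fragment{
		{Offset: 0, URL: "s3://bucket/journal/0000"},
		{Offset: 1, URL: "s3://bucket/journal/0001"},
	}}
	consume(b, "example/journal", func(r io.Reader) {
		data, _ := io.ReadAll(r)
		fmt.Print(string(data))
	})
}
```

The point of the sketch is that the backfill path never competes with the brokers' disks, which is exactly the tension described above.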
Interesting. Okay. One more question for me, but actually, Kostas, I'm sure you have a ton
of questions on Gazette. So I want to hear your questions on Gazette, but did you,
when you started building Gazette, did you know you were going to open source it or was it just, you know, you were
trying to solve your own problem? How did the open source story for Gazette come about?
Yeah, it was always the intention. But when we were going through the acquisition process with
our last company, we kind of made sure of it. It was something, you know, honestly,
it was a negotiated thing with our acquirer where as we were going through that conversation and the diligence process that we wanted it to be
open sourced and they were good enough to let us do that and follow through.
Very cool. Yeah. If you develop technology inside of a company,
getting it actually open source, especially through an acquisition, certainly seems like one of the hardest parts.
So yeah, Costas, Gazette,
I'm interested in your questions.
Yeah, yeah.
Yeah, it's very interesting, actually,
what you were sharing all this time.
So, okay, question.
Why would anyone like use Kafka today
if Gazette is out there?
And the reason I'm asking this,
it sounds like a little bit of a provocative question,
but the reason I'm asking is because
when you're engineering systems at that scale
and precision, let's say,
like the operations that are supposed to be getting done there,
there are always trade-offs, right?
Kafka made some trade-offs. I'm pretty
sure that you've also made
some different trade-offs
to deliver the technology
there.
I think learning about these trade-offs is
very important
for any engineer because
it gives you a very deep insight into why the system
works the way it works, and then we can also make a connection there with the problems
that are getting solved, because different problems usually require different trade-offs.
So why would I today go and use Kafka or Redpanda or any other similar
solution that we hear about out there
instead of Estuary?
Just from a purely
business standpoint, Kafka...
The way I look at Kafka is it's the protocol
more than anything else at this point.
Kafka, of course,
there's an open source project and a whole bunch
of stuff around it, but the protocol is the
thing that really matters, right?
Redpanda is a totally new implementation, which happens to use the Kafka protocol.
And Gazette currently does not use the Kafka protocol, but soon it will, right?
Well, it will have the option to support the Kafka protocol as well.
So you could think of Gazette as Kafka just with some different architectural decisions that make it function differently.
So from my point of view, from a business standpoint, what you really want is the Kafka
protocol.
That's all you really care about, right?
Because then you have all the integrations that they've built.
You have the ability to use Kafka Connect and the whole ecosystem that sits around it.
So that's my point of view.
But Johnny, what would you say?
Yeah, I think a lot of it is the ecosystem. When we were starting
Estuary, we spent a bunch of time kind of surveying, because, you know, we had just
sort of finished our stay at the acquiring company and we were leaving, Gazette
had been open sourced, and we were kind of figuring out what's next. And we spent some time, the obvious thing is like, okay, well, if you squint and look at it from a distance, there are a lot of similarities with Kafka.
Should we make a go at promoting this as an alternative to Kafka?
And we spent a bunch of time surveying Kafka users within various companies.
And there are a couple of answers that kind of came out of that.
One is that, of course, the ecosystem matters a lot.
So the existing Kafka Connect ecosystem matters.
The existing applications that already speak this protocol matter.
And if you're not using Kafka today, maybe you're using something
like Kinesis or you're using Google PubSub or tools like that. So there are all kinds of different
tools in the streaming space. Like why does anybody really need another streaming protocol?
And honestly, I can't necessarily say that they do. Like I'm not here promoting Kafka, or sorry, promoting Gazette,
and saying, you should go pick this up and use it today. One major limitation is we only ever
implemented clients in the Go programming language. So in some respects, our realization was that
what we think the industry really needs is more of an up-leveling. Like focusing on the streaming broker is the wrong place to be focusing.
Like it's, we think of it more as an implementation detail,
honestly, of the real problem,
which is like, I've got these different systems
that I want to connect together
and I want to sort of describe what I want to have happen
and then I want you to go do it for me.
So yes, we happen to use Gazette under the hood,
but it's an implementation detail
and it
honestly doesn't even really matter that much from our customers' perspectives. But there's another
answer here too, which is that I vividly remember conversations we had with teams that were
managing Kafka within large companies. And I remember one engineer telling us about the work that they
had done to basically impose these various quota constraints on other teams so that they weren't
overloading their Kafka brokers and that they were able to keep things nice and stable. And
they were presenting this as a good thing. And all the while, while I'm listening in this
conversation, I'm thinking, okay, so you pushed this problem down onto the rest of your organization and limited your other
team's abilities to actually use this infrastructure. But here you have an entire team who's
like, this is their job. It's their fiefdom. Managing Kafka is what they do. So going in and
telling them you're doing the wrong thing is not a great message to be
delivering to them. So that kind of turned us off from trying to sort of directly take on Kafka as
like, we want to build something different here or introduce something different here that's
roughly the same shape and fits into the same box. So those are some high-level answers.
In terms of engineering trade-offs, honestly, there aren't that many. A central one probably
is that Gazette manages immutable logs. So one key difference between Gazette and Kafka is that
Gazette is writing immutable logs where you can add new content to that log.
And that log can be as big as you want.
And it's backed by cloud storage.
So you really don't have to worry about its size.
And you can read it very efficiently because you're pulling that log directly out of cloud storage.
But it does not do what are called compactions out of the box.
Like the broker is not responsible for compacting that
log over time where Kafka is. So if you have multiple instances of a key, Kafka will compact
prior versions of that key away. Now, we can get into why that actually doesn't really matter that
much as the next follow-up question, but I don't want to steal too much time for this one answer.
Yeah, actually, let's do that, and let's reverse the
question. Let me ask you why it is important to do compaction. I'm thinking, okay,
let's say if I was an engineer working on Kafka, implementing compaction doesn't sound
like the easiest thing to do, right?
Especially with these throughputs that you have
to go and maintain
there. It's definitely going to add
complexity to the management of the
system anyway, right?
So why, let's say,
did they decide to do that?
What's the use case for compaction
out there? And as a follow-up to that,
why, from the Estuary point of view, is it not that important to do the compaction?
Yeah, I would say compaction is very important, but there are different ways to do it. So I'm
going to get a little bit technical in this answer. That's okay. But let's say you're building a streaming application,
and very often streaming applications need to manage state.
So if you're trying to do like a stateful join or something like that,
you've got two different streams of data,
and you need to sort of index one side of that stream
so that you can query that index and look up values
when you see instances of the other side of that
stream. So you need some kind of stateful database around in order to do that. So in the Kafka
streams ecosystem, that's typically RocksDB. It's also RocksDB within Gazette. If you're building
like an application on top of Gazette, you're generally using RocksDB to do it. We use RocksDB to do it as well under the hood.
Now, Kafka brokers are doing compactions of Kafka topics. So when you write multiple instances of a key, it's eventually sort of pruning out older versions of that key. And that's important just
from a disk usage standpoint as well. You don't want that topic to be unbounded in size. But
the kind of funny thing that happens is when you're building this streaming application,
this stateful streaming application that's using RocksDB, and you're consuming from one topic
stream, and you're indexing those values, RocksDB is also doing compaction.
RocksDB is what's called a log-structured merge tree.
Part of the way it works, it's basically an append-only sort of,
it's managing this append-only tree of files
that represent the content of the database.
And as you write new content,
it's sort of taking some of those files,
compacting them together,
writing new files, and then deleting files.
So one of the insights we had is this RocksDB, it's already doing this compaction.
Why are we doing compaction twice?
Why are we doing it in these two different places?
What if we instead sort of leverage the compaction work that RocksDB is already doing, harness
that, and then we don't really need to worry about this problem within the broker?
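A rough sketch of that insight, with a plain Go map standing in for RocksDB (the event and store types here are invented for illustration): because the index only ever retains the latest value per key, the state store is effectively doing the compaction, and the broker's journal can remain an immutable, append-only log.

```go
package main

import "fmt"

// ChangeEvent is one record in an immutable, append-only journal.
type ChangeEvent struct {
	Key   string
	Value string
}

// index applies a keyed stream to an embedded state store. Here a map
// stands in for RocksDB: newer values simply replace older ones, so the
// store retains only the latest version of each key. In RocksDB the same
// pruning falls out of its LSM-tree compactions.
func index(store map[string]string, events []ChangeEvent) {
	for _, ev := range events {
		store[ev.Key] = ev.Value // older versions of ev.Key are superseded
	}
}

func main() {
	// The journal itself is never rewritten or compacted by the broker;
	// it can hold every historical version, backed by cheap cloud storage.
	journal := []ChangeEvent{
		{"user:1", "v1"},
		{"user:2", "v1"},
		{"user:1", "v2"}, // supersedes the first user:1 event in the index
	}

	store := map[string]string{}
	index(store, journal)

	fmt.Println(store) // map[user:1:v2 user:2:v1] — compaction happened in the store
}
```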
That's interesting.
Okay.
Follow-up question to that because you mentioned
the term state there.
I want to ask that because
Confluent had their conference
pretty recently,
right? And there was
also the acquisition that they did a couple
of months ago, I think,
from a company that is working with Flink. If there's state management somehow part of the
system itself, why do we also need a system like Flink that does, again, processing. And the main value that it adds, from my understanding at least,
is that it is a stateful query engine on the end,
like a stateful streaming query engine.
So there's some kind of state management in so many different places.
And I think it's hard for people to understand
why we need all these different, complicated systems to actually go and work with each other in order to achieve something.
So I'd like to hear, learn a little bit more about that.
Yeah, well, I'll attempt to tease this apart a little bit, sort of within the Kafka ecosystem.
Because we're talking about a couple
of different things here. We have Kafka streams, and Kafka streams can have as its backend
sort of a state management, like a database for indexing state, and that can be RocksDB,
for example. But Flink is actually managing entirely separate states. So when you have state within Flink,
what it's doing is it's typically persisting
checkpoints of state to cloud storage
and then recovering those checkpoints of state
from cloud storage.
So they're completely separate state stores
and this is a little confusing.
Like the streaming ecosystem is confusing
because state ends up being a very hard problem.
It's a hard problem to solve well and to scale well.
So we see a bunch of different sort of competing solutions for how to best do it.
So I'm not sure that's directly answering your question, honestly,
but I just wanted to kind of articulate a little bit
these different flavors of tools for managing sort of state
within a streaming context. One of the trade-offs I guess is for really low latency, the way Kafka
streams does it is going to be faster than Flink, because Flink basically needs to accrue
a window of data and then run a transaction where it's
computing all of its effects and the state updates, and then snapshot that and persist it
into cloud storage.
And then it's releasing the results.
So there's a little more latency involved in the process when you're using Flink for
that reason.
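As a rough illustration of where that latency comes from, here is a hedged Go sketch of the sequence described above; it is not Flink's API, just the accrue, compute, snapshot, then release loop with a simulated checkpoint write.

```go
package main

import (
	"fmt"
	"time"
)

// persistSnapshot stands in for writing a checkpoint of operator state to
// cloud storage; in a real system this round trip dominates the latency.
func persistSnapshot(state map[string]int) {
	time.Sleep(50 * time.Millisecond) // simulated object-store write
}

func main() {
	state := map[string]int{}
	events := []string{"a", "b", "a", "c", "a", "b"}

	const window = 3
	for i := 0; i < len(events); i += window {
		// 1. Accrue a window of events and apply their state updates.
		end := i + window
		if end > len(events) {
			end = len(events)
		}
		for _, key := range events[i:end] {
			state[key]++
		}

		// 2. Snapshot the state before releasing any results. Downstream
		//    consumers only see output after this durable checkpoint,
		//    which is where the extra latency comes from.
		persistSnapshot(state)

		// 3. Release the results for this window.
		fmt.Printf("window ending at event %d: %v\n", end, state)
	}
}
```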
Makes sense.
So you mentioned with Estuary that there is also processing that someone can do
with the system, right?
How is that different, if it is different,
from using Kafka Streams, or Kafka Streams with Flink,
compared to Estuary?
Yeah, so a major thing that we are, just from a vision standpoint,
when you work with existing tools within the streaming ecosystem, whether that's
Kafka Streams or also Flink or Spark, and you're trying to do some kind of stateful computation
where you need a join or something where you need state in order to facilitate that computation,
all of these frameworks basically require that you adapt your modeling of state into their mechanism
for persisting that state and restoring that state.
So from a vision standpoint, where we're trying to get to really is that you as a user, as
an author of these transformations, should not have to do that.
You should be able to bring your chosen tool to bear in terms of managing state,
whether that's a Python script, that's keeping something in memory, you know,
the same way that you would write it out where you're just like slurping in
something from a file and then, you're computing, computing something in memory
and then spitting out an output, like why can't you do that in a streaming context
as well?
So from a vision standpoint, what we're really after is
basically Unix processes and files and pipes scaled up to the cloud, where you can basically
bring your own program, which can manage state however it likes, and run that in the context
of a transformation within the platform we're building. So we're not quite there.
We have more to build before we fully realize that vision.
But the major goal we have is to sort of break that coupling of the user-defined state and
the way that the framework needs that state in order to manage its persistence.
Yeah, that's super interesting because I would assume that, okay, imposing a framework on
how to manage states allows the system designer to reason about how to effectively do the
state management, right?
While when you relax that, many things can go wrong, especially when someone brings their
own Python code or something else.
So I will be super excited to see how this works, because it is a great opportunity,
actually, to open this platform to all these different people to go and write code for that
without having to know deeply about all the intricacies of
these distributed systems. I think it is
super important if we would like to
bring more people into the streaming
ecosystem.
It's also a very hard problem. From an engineering
perspective, it's a very hard problem.
I'd love to see
how you do that as
you are materializing your vision.
I'll drop this nugget, which is that basically the VM ecosystem
and the ability to snapshot what processes are doing
has gotten a lot better.
And you can think about incorporating that
into an overall streaming transactional flow.
Yeah, that's interesting.
I mean, I was looking a lot into these
micro-VM
kinds of systems, like Firecracker
and some others that
are really good at
doing this snapshotting and
even moving
these snapshots around.
My question always was,
okay, getting a snapshot
of the VM is one thing, but if this VM also, I don't know, has a couple of gigabytes there in its memory that need to be snapshotted, it will take time, right?
Like it's latency that is added there.
Like you can't like avoid that.
So it's very interesting to see how, in the context of streaming processing, this might be different than, let's say, an OLAP system, where you might have a large part of the state being in a node and how to move that around.
But anyway, that's a very interesting conversation we can have at some point.
You mentioned at the beginning also the term of caching. And you were describing a use case,
where you have all these different systems, you have
Elasticsearch, you have Redis, you can have Pinecone, you can have Redshift, whatever.
And using something like Estuary actually to replicate the state of an input, let's
say, system to all these different systems. Now,
it's commonly said that one of the hardest problems is cache invalidation, right? Figuring
out when the state between the two systems is actually not in sync and doing something about
that. And I would assume that each system has different requirements around that, but
how do you see this problem from your point of view, as the system that needs to be
able to support that? What have you seen in the industry out there in terms of
problems around that, and how do you solve the problem?
Yeah. In some respects, this is sort of an organizational problem that companies have to decide, what are the authoritative systems for data sets that we care about? Where are we making
changes? What system are we issuing writes to, essentially? And I'm not sure there's any tool out there,
Estuary included,
that's going to be able to make that decision for you,
but we can build tools
that will support the decisions that you make.
So the primary answers are basically around discovery,
like being able to understand
that there's kind of an interesting concept
that's grown up somewhat in parallel
to what we've been doing, which is the data mesh concept.
And for those who aren't aware of it, data mesh is fundamentally the idea of trying to engender a more self-service, internal ecosystem within a company of different systems where data lives, and democratized access to discover where that
data is, discover the various data products that exist, and use them for different downstream
purposes to make different downstream data products.
And this concept out of ThoughtWorks is kind of very much in parallel to our own thinking
and what we were building at the time.
But it's something we strongly agree with, that a big part of this is, it's one thing to sort of build the pipe that's
going to move data. But if you can't communicate what those pipes are, and what they're doing to
the, you know, the broader cohort of users within your organization, then that's only sort of a
partial solution. So a lot of, you know, part of the answer is basically just building
good tools for people to understand what the data flows are, where data is coming from,
how it was built, like what went into building a data product, where it came from, et cetera.
Yeah. Yeah. I love that you brought up data mesh, because that triggered an
interesting question, I think. So data mesh has a very interesting concept of,
okay, you have the data living in all these different places,
and you should be able to query it and reason about that
in the place where the data is.
So, two approaches, I would say, technically, to realize that.
One is actually replicate the state that is needed between the different systems,
something that I would assume you can do with something like Estuary, right?
I have my production OLTP database, some part of the state there,
I would replicate it to my OLAP system so I can go and do the analytics that I need. But there's also the concept of,
instead of moving the data around, moving the logic around, to do query federation.
Instead of trying to move the data and solve the hard problem of replicating state and reasoning
about the state in a distributed manner, let's just
ask the questions directly at the source, right?
Yeah.
I have my own reasons here to share what are the pros and cons with each one of them, but
I'd love to hear your take on that.
Two immediate thoughts. One is, of course, like if you're talking about your
production database, you might not want your analytical query load going to your production
database. So that's problem number one. But another problem is you might not care about the
data as it is right now. So a really common use case that we see is, you know, because we're doing change data
capture from databases, which means that we have essentially this immutable log of all
the individual changes that have been made to your tables over time.
And that can certainly be used to maintain a materialized view of the current state of
the table.
And that's what we call like a standard materialization.
But there are actually a couple of other use cases that fall out of this too, because another thing
people commonly want is auditing. It's an audit table of what changes have been made to this
operational table over time. So that's something that you can, it's essentially a byproduct of
having an immutable log of all the change events of your database.
So we see customers who do both.
They actually want both a history, but they also want the real-time view in different systems.
And yet another one is sort of point-in-time recovery, where I might care about what the state of this table was last Thursday.
And asking the current table is not going to tell me that,
but I can ask the immutable log of the change events
and I can basically rebuild that view
as it existed at some previous point in time.
So the current table,
if you can just issue a query to the current table, great.
But there's often all these kinds of related data products
that you want that the operational table can't tell you.
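A small Go sketch of that idea, with invented types: the same immutable change log can be replayed up to different cutoffs, so one data set serves the current materialized view, the audit history, and a point-in-time reconstruction. It assumes the log is ordered by time.

```go
package main

import (
	"fmt"
	"time"
)

// Change is one CDC event from the immutable change log.
type Change struct {
	At    time.Time
	Key   string
	Value string // an empty value marks a delete, purely for this sketch
}

// viewAsOf replays the change log up to a cutoff and returns the table
// state as it existed at that moment. Passing time.Now() yields the
// current materialized view; the log itself also doubles as the audit trail.
func viewAsOf(log []Change, cutoff time.Time) map[string]string {
	view := map[string]string{}
	for _, c := range log {
		if c.At.After(cutoff) {
			break // log is ordered by time
		}
		if c.Value == "" {
			delete(view, c.Key)
		} else {
			view[c.Key] = c.Value
		}
	}
	return view
}

func main() {
	day := 24 * time.Hour
	now := time.Now()
	log := []Change{
		{now.Add(-10 * day), "order:1", "pending"},
		{now.Add(-7 * day), "order:1", "shipped"},
		{now.Add(-2 * day), "order:1", ""}, // archived/deleted
		{now.Add(-1 * day), "order:2", "pending"},
	}

	fmt.Println("as of last week:", viewAsOf(log, now.Add(-5*day)))
	fmt.Println("current view:   ", viewAsOf(log, now))
}
```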
Yeah.
Yeah.
Makes a lot of sense.
And by the way, being able to time travel is especially important
for ML reasons.
It's a very common thing, especially with feature engineering, to be honest,
to be able to calculate a
feature at different snapshots in time for the same user. So it arises as a problem
in many different use cases, which makes it super interesting.
All right, cool. So you mentioned CDC, and it was mentioned that in Kafka,
you have the ecosystem there, Kafka together with Debezium, right? From Red Hat.
I always wondered why we haven't seen more attempts in the open source for something like
Debezium, to be honest, or why these things are not even more standardized in a way.
So I'd like to hear your thoughts there.
First of all, do you use something like Debezium?
How do you do it?
We actually do.
Yeah, we tried to, and we were dragged kicking and screaming
into basically building our own implementation from soup to nuts.
Okay.
And there are a couple of reasons why.
So the biggest probably is that our focus is on being able to handle
very large databases and do it efficiently.
So databases that are terabytes in size.
And one of the big challenges with Debezium
and a lot of sort of naive replication or change data capture from a database is that it's very easy to grab the ongoing feed of changes, but you also need the existing rows in your table, because your database is not keeping around
an infinite write-ahead log with all of its change events; it's constantly compacting it.
So you really need to both process the write-ahead log and then also query the table for all of the
existing state in that table. And it's important to do it correctly as well. It's very easy to
implement this incorrectly where you're getting kind of logically inconsistent
results.
So one of the big problems, like the naive solution for change data capture, which
certainly at the time when we were starting and looking at this problem, this is what
Debezium was doing, is basically starting a write-ahead log reader, but not actually
reading it. So you basically just create this write-ahead log reader
and you're sort of pinning the write-ahead log.
And then you start scanning all the table content
of the table that you want to backfill.
And then once you've, you know,
you're basically issuing like a select star for the table.
And then once you've read all of that,
you begin reading that write-ahead log again.
But a major issue when you start getting into larger databases that this has is that, first of all, you have to think about what happens if that select star statement crashes partway through.
So that's one issue.
You might need to reissue that a few times, a bunch of times.
It might have to restart some of that work.
But another even bigger issue is that when you have created that write-ahead log reader,
but you're not actually consuming it, you're not actually reading from it,
what the database is doing under the hood is it's basically saying,
well, someone wants to read this, so I can't delete these write-ahead log segments.
And what commonly happens is that people's disks fill up on their production database,
which is not good.
So one of the goals we had from very early on was we want to do really incremental backfills where we are always reading the write-ahead log and never putting the production database
at risk while also being capable of doing many multi-terabyte backfills.
So there's a paper out of
Netflix, actually, called DbLog.
And we essentially
kind of looked at that and
had some ideas from it and ran with it.
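Here is a heavily simplified Go sketch of that incremental-backfill shape; it is inspired by the interleaving idea rather than being the DBLog algorithm itself (the paper uses watermarks written into the log, where this toy version just tracks keys it has already seen), and the Source interface is invented for illustration. The write-ahead log is drained continuously, so nothing is ever pinned, while the existing table is scanned in small primary-key chunks in between.

```go
package main

import "fmt"

// WALEvent is one change pulled from the write-ahead log.
type WALEvent struct {
	Key   int
	Value string
}

// Source is a stand-in for the database: it serves WAL batches and small
// primary-key-ordered chunks of the existing table. Hypothetical interface.
type Source interface {
	PollWAL() []WALEvent                      // always drained, so the DB can recycle WAL segments
	ScanChunk(afterKey, limit int) []WALEvent // next slice of existing rows
}

func capture(src Source, emit func(WALEvent)) {
	seen := map[int]bool{} // keys already emitted from the WAL during backfill
	afterKey := 0
	for {
		// 1. Keep the WAL drained before every chunk; never pin it.
		for _, ev := range src.PollWAL() {
			seen[ev.Key] = true
			emit(ev)
		}
		// 2. Backfill the next small chunk of pre-existing rows.
		chunk := src.ScanChunk(afterKey, 2)
		if len(chunk) == 0 {
			return // backfill complete; a real capture would keep tailing the WAL
		}
		for _, row := range chunk {
			if !seen[row.Key] { // a newer WAL event supersedes the scanned row
				emit(row)
			}
			afterKey = row.Key
		}
	}
}

// --- toy source so the sketch runs ---

type toyDB struct {
	rows []WALEvent
	wal  [][]WALEvent
}

func (db *toyDB) PollWAL() []WALEvent {
	if len(db.wal) == 0 {
		return nil
	}
	batch := db.wal[0]
	db.wal = db.wal[1:]
	return batch
}

func (db *toyDB) ScanChunk(afterKey, limit int) []WALEvent {
	var out []WALEvent
	for _, r := range db.rows {
		if r.Key > afterKey && len(out) < limit {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	db := &toyDB{
		rows: []WALEvent{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}},
		wal:  [][]WALEvent{{{2, "b-updated"}}, {{5, "e-inserted"}}},
	}
	capture(db, func(ev WALEvent) { fmt.Printf("emit %d=%s\n", ev.Key, ev.Value) })
}
```

Because each chunk is small and the position is durable, a crash mid-backfill only restarts the current chunk rather than the whole table scan.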
Yeah, that's very interesting
because we
had the author of that
at the podcast at some point.
And I think the Bayesian,
in a more recent version,
they have implemented something similar
with the watermarking that they do there
so you can do it in parallel.
I think it was again inspired by the same publication.
But that would have been my next question, to be honest,
because I've seen that in practice.
And it is a problem with how CDC has been implemented
in systems like with Fivetran, for example.
Because you have these huge databases
that we need to replicate the state from there
before you can start consuming the change log.
And it's always a hard problem to solve.
So, okay.
Is this something that's open source?
Is it part of Estuary,
or is it a separate project?
How is the CDC part being handled?
Yeah.
Yeah. So we model Estuary as essentially two things: it's the platform itself for running these data flows, and then it's the connectors, which are
really applications that sit on top of the platform for talking to different systems for
doing things like CDC and doing materializations which are sort of pushing data into systems and
keeping it up to date there, as well as transformations.
So the fundamental architecture is that we have the platform, which is sort of facilitating
data movement between these pieces.
And then these applications, which are basically Docker images that speak a protocol over standard
in and standard out for doing sort of streaming capture and streaming materialization. So the short answer is yes:
these are source available on GitHub. We package them as Docker images that
speak a capture protocol. We ended up, kind of kicking and screaming again,
designing our own protocol for captures and materializations of data. So that part is a little bit bespoke,
but it's a fairly accessible protocol
and they are source available.
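To give a feel for the shape of that design, here is a minimal Go sketch of a connector-like program; the JSON message format and field names are made up for illustration and are not the actual capture protocol. It reads requests as JSON lines on standard input and writes captured documents as JSON lines on standard output, which is the kind of program that can be packaged as a Docker image.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// Request is a made-up control message the platform might send on stdin.
type Request struct {
	Action string `json:"action"` // e.g. "capture"
	Stream string `json:"stream"` // which collection/table to capture
}

// Document is a made-up captured record the connector writes to stdout.
type Document struct {
	Stream string            `json:"stream"`
	Fields map[string]string `json:"fields"`
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := json.NewEncoder(os.Stdout)

	for in.Scan() {
		var req Request
		if err := json.Unmarshal(in.Bytes(), &req); err != nil {
			fmt.Fprintln(os.Stderr, "bad request:", err)
			continue
		}
		if req.Action != "capture" {
			continue
		}
		// A real connector would talk to the external system here (a
		// database WAL, a SaaS API, ...). We emit one canned document.
		_ = out.Encode(Document{
			Stream: req.Stream,
			Fields: map[string]string{"id": "1", "status": "hello"},
		})
	}
}
```

Fed a line like {"action":"capture","stream":"orders"} on stdin, it emits one JSON document per request on stdout.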
Okay, that's super interesting.
Okay, one last question from me,
and then I'll give the microphone back to Eric.
You mentioned materialization and streaming processing.
In the streaming landscape, there's also this new model of computation that's based on timely or differential data flow.
I think the most known product using this kind of approach is materialized.
How is it different than using something like Estuary, right?
If you can go and do, let's say, this incremental materialization.
And is there space for both solutions there?
Different use cases?
Help us understand. Again,
I'm asking these questions because I feel like the streaming landscape is always
complex. There are some very heavy buzzwords, like differential data flow, and,
yeah, you know, not everyone is a systems
engineer that wants to get that deep
into how things work, right?
But they do have to interact with
these systems, and it is important for them to understand
why they exist, why
they are out there, and what we are trying to solve.
So I'd love to hear from you
if there is
some competitive element there or
they are just like
complementary solutions at the end.
Yeah, we actually work with Materialize a little bit at least.
We're trying to do so more and more from a team standpoint.
Materialize does not focus on connectors.
They focus on streaming SQL
and making that a really first-class thing.
And also trying to up-level it and making it exactly the same protocol as standard SQL.
That's not what we're doing, right?
That's a very different problem than what we're doing.
We're more focused on getting data from your systems, acting as the pipes that can kind of get into other places.
So getting it into Materialize is a very, you know,
core thing that we want to be doing with them.
And so at a very high level,
we don't see ourselves as competitors at all.
But I'm sure Johnny will go into technical details
on the differences between differential data flow
and our specific processing model for SQL, streaming SQL.
Yeah, and I'd say kind of my quip answer is like, Materialize is a really cool system for sort of
streaming SQL-based compute and getting answers out of it. And then Flow you can look at as a
system that makes all of your other databases real time. So in some respects, you can think of Materialize, and another company that comes to mind is
RisingWave,
which occupies a fairly similar
position in the space.
These are streaming compute engines for doing streaming transformation based
around SQL. From a product philosophy perspective, our view is more is better,
you know, we kind of want to meet users where they are, where they want to be, in
terms of using whatever sort of streaming compute they would like for computing data products.
So another example of this is a company called Bytewax, which has taken the differential data flow library and wrapped it with Python bindings to make it something that you can write Python programs around.
Like, well, we'd like to be able to offer that as a way of doing transformation using Flow.
So I think there are a lot of interesting different solutions.
And I think this stuff is going to
play out for quite a while because there is no silver bullet for streaming. It's a legitimately
very hard problem, and they've got some really interesting takes on how to solve it.
It's a great answer, to be honest. I think it would be nice at some point to have a
couple of people from the streaming processing space, who are working on different
ways of solving the problem, have a conversation all together, because
I think it would be a service, let's say, to the industry. I'd like to help people understand why all these products and technologies are built,
and why in the end it's so confusing.
There are good reasons.
No one is trying to confuse anyone here.
Actually, it's the opposite.
It's just a very hard problem.
The term real-time, many times, the semantics of real time are so different depending on who you ask.
Right.
That makes things hard.
So anyway, I'd love to have this kind of panel discussion at some point.
Maybe we should also reach out to the folks at Materialize and try to get you all together and discuss these things.
I think it would be super interesting.
But I have to stop now. So, Eric, the microphone is yours again.
Yeah, just to follow on, though, I really appreciate that perspective. I think it's really easy to look at similar technologies
and, you know, I don't know, just naturally,
I think there can be defensiveness
or disagreements about different ways of doing things.
But I think, at least from my perspective,
I totally agree more is better, right?
I mean, this type of data flow is something
that we want to be more broadly accepted
by a much larger portion of the market.
Which actually brings me to one of my final questions,
which is, Dave, you mentioned something earlier,
and I'm interested in both of your perspectives on this.
You know, if you want to do real-time,
you have sort of this set of options. If you want to do,
you know, sort of batch, you have sort of the Fivetran, the traditional batch ETL flow.
What is keeping companies from just moving to this kind of, let's say, streaming ETL, real-time ETL?
Honestly, nothing anymore.
In a lot of ways, we're working with companies that don't necessarily even have real-time problems.
They're just using us for analytics
because we've made it as simple as working with a batch system
to execute a data flow.
And that's our goal, really, to kind of take the gap away.
Of course, things get more complicated when you start doing streaming
transformations.
But if you're just doing effectively
point-to-point
or even one-point-to-many-points
data transfer
solution, they really
are as easy as each other at this
point. So I don't think there's really
massive differences
from the UX there.
There used to be, right?
Like most of the times when you look at a streaming system,
you're having to do stuff like manage your own schema evolution.
So if you really want to bridge the gap,
you have to do stuff like that.
That was a core problem for us that we wanted to make sure
that we tackled really well.
But yeah, in general, I think that those are essentially closing.
And one followed, actually, Johnny, I said I wanted both of your perspectives.
So before I have a follow up, your thoughts?
Yeah, sorry, editor, I was wool gathering there a little bit.
So, the question...
And the question is, what is holding companies back from... That's a gross oversimplification. But yeah, I mean, I don't know.
I have in my mind that there are multiple advantages
of moving to more of a streaming architecture,
but why aren't more companies doing it?
Maybe they are and we don't see it.
I think they are and we don't acknowledge it, actually.
So if you are using a tool like Fivetran
and you're using sort of incremental syncs from your database to an analytics warehouse, and you're running that every hour or something like that, guess what?
You're streaming.
That is streaming.
You are moving incremental data from one place to another and avoiding moving the whole data set.
So that is already streaming.
It's just a matter of like, what is the latency involved?
So in some respects, what we're talking about is just a dramatically tighter latency, but
functionally doing the same thing.
So in many respects, streaming is already happening, and people have kind of cobbled
it into batch systems, whether that's
Fivetran or another tool where you're doing sort of an incremental updating resolution of new data
and moving it from one place to another. And then, you know, you have tools like dbt where
you're building incremental models where you're using a cursor to figure out
how to go update some downstream data product based on just the new data.
So people are already doing this; it already exists in the ecosystem. And in some respects,
we're just talking about making it faster and simpler and tightening the latency that's available.
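For concreteness, a minimal Go sketch of that cursor pattern; the table, the cursor column, and the destination callback are made up. Each scheduled run asks the source only for rows whose cursor value is newer than the last one persisted, which is batch tooling quietly doing streaming with a wide latency.

```go
package main

import "fmt"

// Row is one record in the hypothetical source table.
type Row struct {
	ID        int
	UpdatedAt int64 // e.g. a unix timestamp column used as the cursor
}

// syncIncrement pulls only rows whose cursor column is newer than the last
// value we persisted, applies them downstream, and returns the new cursor.
func syncIncrement(source []Row, cursor int64, apply func(Row)) int64 {
	next := cursor
	for _, r := range source {
		if r.UpdatedAt > cursor { // newer than the last persisted cursor
			apply(r)
			if r.UpdatedAt > next {
				next = r.UpdatedAt
			}
		}
	}
	return next
}

func main() {
	table := []Row{{1, 100}, {2, 150}, {3, 200}}
	var cursor int64 // would be loaded from durable state between runs

	// First scheduled run: everything is new.
	cursor = syncIncrement(table, cursor, func(r Row) { fmt.Println("upsert", r) })

	// A later run after one row changed: only the delta moves.
	table[1].UpdatedAt = 250
	cursor = syncIncrement(table, cursor, func(r Row) { fmt.Println("upsert", r) })

	fmt.Println("cursor is now", cursor)
}
```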
One more add-on to that, if you don't mind.
Oh yeah, go for it.
Yeah, I think that classically people have been burned by streaming, right? So you'll talk to a whole group of practitioners who hear the
word streaming and they say, oh no, I don't want that. That's something that we found
out when we first started the business, and a lot of the time we won't lead with streaming anymore.
We'll lead with incremental data, low latency, and that actually opens it up to a whole group of people who wouldn't otherwise be interested, right?
If you're doing analysis, a lot of the times you're going to think streaming is very expensive because classically it used to be.
It's not anymore, but that's something that we definitely heard a lot.
Such an interesting point, Dave.
I think there's a lot of Kafka baggage there. And, you know, we talked about some of that earlier where you face this severe trade-off that is actually a pretty
interesting, like, you know, almost like DevOps problem that doesn't even relate to data, you
know, like the difficulty of operating a system. I've also heard a lot of companies try to move to
event-driven architecture in terms of the way that they're building their stuff. And that can take you down a rabbit hole. One question I do have though, and Brooks,
I'm sorry I'm going long, but this is me punishing Costas by asking more interesting
questions that he's not here to enjoy. Because Costas had to drop off for another meeting.
For our listeners, that's the context. But how does this impact orchestration? Because when you think
about streaming, one of the interesting things when you come from a world where, let's say you're
running multiple batch jobs, you have multiple downstream dependencies. And just because of the
nature of data, you require an orchestration level because
these jobs take different times to run. They need to be sequenced differently, and the downstream
dependencies obviously are impacted. But when you move to a real-time streaming setup,
at least on the surface, it seems like it solves some of the chronological
orchestration problems.
Yeah, for sure.
I get asked routinely when I'm demoing our products, how do I orchestrate this?
And the answer is you don't.
Your data, when it's available, it goes there.
And that's how it works.
But yeah, it's something that takes a little while to understand.
And it's also complex when it comes to transformations,
because if you're thinking about transformations in terms of events,
it's a little bit of a different thought process
than it is in terms of full data sets that you're transforming.
Makes total sense.
Johnny, anything to add on the orchestration side?
Yeah, that's fundamentally right.
It's not necessarily like either or.
If you're in our world and products,
you don't orchestrate it, as Dave said.
Users will sometimes use us to do some pre-transformation
to sort of reduce their cost within the analytics warehouse.
Like they still want to work with it in Snowflake,
but they're using us to do some deduplication
or to pre-join some data
that they commonly kind of want to access
in a joined fashion.
And then they might process it
from there downstream with dbt.
So it's, you know, you can marry these together
and there are quite good reasons to do that.
Yeah, makes total sense. Okay, last question. Sorry for going long, Brooks.
Where did the name estuary come from? I mean, I know from a
geographical standpoint, what it means, but is there any relation?
Yeah. Fundamentally, you know, we're in streaming data, even if we've kind of de-emphasized that and no longer lead with streaming
quite so much. But evocatively, sort of where streams meet, that is what an estuary is. It's
where sort of the ocean and rivers are meeting together. So there's some evocative, you know,
evocative imagery that we're going after there. Otherwise, it's a name.
I love it. You have your real-time data, and you have your data lake.
The ocean is where it's all done.
We are effectively a real-time data lake.
That's what Gazette enables.
That's what the idea was.
It's lost on a lot of people.
I like it.
It's a calming, you know, like an estuary seems like a place
you want to be.
I like it.
Cool.
Well, thank you
for letting me
throw a couple
of additional questions
at you to satisfy
my curiosity
around orchestration
and would love
to have you
on a streaming panel
and have you back
on the show.
Awesome.
Thanks for having us.
Yeah, thanks for having us.
We hope you enjoyed
this episode
of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack,
the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.