The Data Stack Show - Data Council Week (Ep 6) - All About Debezium and Change Data Capture With Gunnar Morling of Decodable
Episode Date: April 27, 2023
Highlights from this week's conversation include:
Gunnar's background in data (0:32)
Setting the vision in early days of Red Hat and spearheading Debezium (6:20)
Replication of data in Debezium (9:47...)
The patterns and processes of Debezium (16:21)
Debezium working with Kafka (19:03)
Building a diverse system while incorporating common interfaces (24:09)
The importance of documentation in open source projects (27:59)
Debezium's vision moving forward (31:32)
Why aren't there more open source CDC solutions? (34:35)
Connecting with Gunnar (37:27)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
All right, we are back here at Data Council. If you're following along,
you already know Eric couldn't make it to the conference. So I'm Brooks filling in for Eric
while he is out. We've got Costas here, obviously, as well. And we just sat down with Gunnar Morling.
He's a senior engineer at Decodable. Extremely excited to chat with him today.
We chatted with Eric, the CEO, earlier this week. So we're excited to kind of continue
just digging in. And to kick us off, well, I guess first, Gunnar, welcome to the show.
Awesome. Yeah. Thank you so much.
Yeah, absolutely. We'd love to just hear about kind of your background, where you started,
and kind of your path to where you are today at Decodable.
Oh, yeah, sure. Let's do this.
So, yes, I mean, I joined Decodable just a few months back in November last year.
So it's still pretty new for me.
Before that, I was at Red Hat for exactly 10 years, up to the day.
And there, you could divide my tenure at Red Hat
into two parts.
So the last five years, I was working on Debezium,
which is a tool for change data capture.
I was the project lead for Debezium.
And I guess we will talk about that in more depth.
So that's what I did for the last five years.
And before that, I was working on different parts
of the Hibernate project umbrella.
So I was mostly working on Bean Validation.
I did the Bean Validation 2.0 specification.
We had a fancy exploration of taking Hibernate and object-relational mapping and applying it to NoSQL stores.
So that's an interesting episode.
I'm just saying this project doesn't exist any longer.
But, you know, I learned lots of stuff back then.
So yeah, that's what I did before joining Decodable pretty much.
Cool.
Tell us a little bit about kind of what drove your decision to leave Red Hat after 10 years
to the day.
That's amazing.
Yeah.
But it sounds like when we were talking before we started recording here, there are some
kind of personal reasons,
but also some of your technical interests kind of drive you to the next thing.
Right, right.
Exactly.
It's a combination of the two things.
And, you know, I really enjoyed my time at Red Hat.
It's a great company.
I have still many friends there. And it's a place I would recommend really to everyone to work at.
It's a great company.
Why I left was, well, I was working on Debezium for approximately five years, and I felt, you know, I wanted to do something new again. Not because I didn't like the project any longer; I still think it's an exciting project, and there are tons of applications for it. But for me it was a bit, okay, I've been working on it for quite some time and I want to do something new. So this was one motivation.
And then, well, I mean, Debezium is about ingesting change events
out of databases like Postgres, like MySQL, into Kafka, typically.
And, of course, people don't do this just for its own sake, right?
They don't want data just in Kafka.
They want to take it elsewhere.
They want to take their data into Snowflake,
into Elasticsearch, into other databases to do query use cases, to do full-text search.
And Debezium itself didn't have
or doesn't have a good answer to that.
It's just not the scope of the project.
This is a CDC platform.
It doesn't concern itself with taking data
out of Kafka into other platforms.
But still, you need this
as part of your overall data platform.
And I felt, okay, I want to look at this data journey, let's say,
like really end-to-end.
So I want to look at it, okay, there's the Debezium part,
which takes data into Kafka,
but also I want to help people with taking that data out of Kafka
into other systems.
So this is exactly what Decodable does amongst other things.
So this was interesting to me.
And then, of course, there's this notion of processing data, right? Like filtering data,
changing types,
changing
date formats, this kind of stuff, doing stateful
operations like joins, aggregations, and so on,
which is all done by Flink.
And so, well, I was recommending this to people all the time when they came to me and asked about it: yeah, use something like Flink or something like Kafka Streams, it allows you to do those things.
At Decodable I now have the opportunity to work on this and, well, provide people with a platform which does all that in a managed way. So that's it in terms of, you know, the work motivation.
And then there's the other thing.
Well, I was at Red Hat,
as I mentioned, for 10 years
and the company grew a lot over that time.
So when I left,
it was more than 20,000 employees.
And how big when you started?
Yeah, it was, I believe, 3,000.
So it was like almost 7x.
And this changed, you know. I mean, processes change, and it is what it is. I mean, you know, it's not a startup, obviously. And I wanted to have this experience. I wanted to go to a small place and see, okay, how is this, if everybody really is pulling in one direction, you make quick decisions, you see, okay, this flies or this doesn't fly, and you don't, you know, think about stuff for six months only to realize one thing or another, like you do in a big, massive company. And so that's why I wanted to go for a startup like Decodable.
Yeah, cool. I know you're excited to dig into Debezium, so take it away.
Yeah, sure. So how did you start working on Debezium?
Right, yes, that's a good question. And it kind of even was a coincidence, I have to say, if I want to be totally honest. Because, so, as I mentioned, I was working on Hibernate stuff before then, and again, I feel like I have this five-year attention span. So after five years, I feel like I need to do something new.
And at this point, at this stage, at this time back then, I was looking for something new.
I didn't feel like leaving Red Hat.
I wanted to explore something new within Red Hat.
And then the original founder of Debezium (I'm not the founder) left Red Hat.
So Randall Hauch, who created the project, went to Confluent.
And so this project lead role was open.
And yeah, you know, I was there.
There was an open project lead role.
So things came together and I picked it up.
Okay, that's great.
And when did that, like, let's talk a little bit about the history of Debezium, right?
Like, when was Debezium first published?
Right.
So I took over in 2017.
And by then, I believe it was roughly like one year old.
So it started in 2016.
And it was a very small team back then. So there was Randall, who was the original project lead, and there was another engineer, Horia. So the two worked on the project, and, you know, they pitched it really well within Red Hat, so they made a very good case why we should sponsor this project. And then, well, things happened.
So Randall left.
And independently, Horia, the other guy, left the next month.
So we went from two people who knew about the project who could work on it to, like, having nobody.
Yeah.
So this was, of course, a challenge.
And then I came in.
And thankfully, Randall, I mean, he left, but, you know, I had his email and he was very helpful, so I could reach out to him. So I came in, and then another engineer, Jiri, a good friend of mine, he also came in. So it was the two of us who ran the project, and then it was a little bit like a startup even within Red Hat. Because, you know, back then there wasn't even a product around it.
It was just a plain upstream community project.
So we worked on new features, adding connectors.
We worked on creating brand awareness.
So I went to conferences, doing blog posts, telling people how to use it, why you would use it, all this kind of stuff.
And then, of course, we constantly made the case for getting new engineers into the projects.
I believe after a year or so, we got another engineer,
actually somebody from the Hibernate team,
I could convince to move over.
And then, you know, we grew.
And at some point, it actually became part also
of the Red Hat product portfolio.
So, you know, what they usually do there is they have, like,
this duality of having upstream community projects, like Debezium, and there's a commercially supported product offering where
customers can get a subscription and they get support for that.
So this happened at some point, and of course this took it then to the next level, right? Because then there was, like, a support organization, proper professional documentation, product management, all this kind of stuff. So then it really took off.
Yeah, yeah, 100%. And what was the initial motivation behind starting Debezium? Why did these two folks decide to start building Debezium?
That's a very good question. So Randall was working, and we touched on this a little bit earlier, so there was a Red Hat project back then which was called Teiid, which was a data virtualization product.
You know, so like a federated query engine.
And he was working on that originally, and because they had a notion of materialized views with Teiid, he realized the need: we need to have some sort of triggers for updating those views. And I believe this was his core motivation, and then he started it. And yeah, it was quite well received quite quickly. Actually, a community formed around it, so people started to use it. They had tons of use cases for it.
And it was quite popular from the beginning.
Yeah, absolutely.
And what are the most common use cases around Debezium?
How do people use it?
Right.
So I would say the biggest category of use case
is what you could call replication in the widest sense.
So taking data out of an operational database, as soon as it changes, into other kinds of data stores.
So typically because you have specific requirements.
So you want to take your data into a data warehouse like Snowflake so you can do analytics, or into Apache Pinot so you can do, like, real-time analytics and show dashboards. Or maybe you want to take your data from, I don't know, maybe even a commercial database in production like Oracle, which has, you know, some licensing implications to it. Maybe you want to have a copy of the data in Postgres so you can query it on the side and do some sort of testing.
So you want to take this data
across database
vendor boundaries.
So that's all
replication. And I would say also something like
updating a search index.
I would also consider that a
replication use case, because
if you want to do full-text search on your data, typically you cannot do it as well in a relational database. Rather, you want to use something like Elasticsearch or OpenSearch. And you want, of course, this index to be fresh, to be up-to-date, so when you do your full-text search, it gives you current search results. So that's something which people do a lot: feeding the search index. Then there's updating or invalidating caches: if they have cached versions of their data, to invalidate that after the data changes.
This comes up a lot.
Then I would say there's another big category
in the context of microservices.
Propagating
data, exchanging data between
microservices, maybe moving
from monolith to microservices,
things like this.
Different patterns in that space,
like the Outbox pattern or the Strangler fig pattern
which people use and people
benefit then again from CDC
for those kinds of things.
Makes sense.
So why would someone,
let's take the replication use case,
right? Why would someone
use a CDC pattern to do the replication
instead of going and executing SQL queries, for example,
and get updates and pull them out and then replicate on the other side?
Or there's another option there.
Just create a replica, right?
Yes.
And have, let's say, a replica of Postgres
and use that as a read-only replica, where you go and do your analytics.
So, I mean, you totally can do all those things, and I would recommend doing them when they make sense.
So if you have the replica set up, I mean, I totally can see where you want to have, like, read replicas of your Postgres data,
so you can have, I don't know, replicas closer to a user.
And in particular, if your query requirements are satisfied via Postgres, that sounds very reasonable to do, right?
So I'm all for that.
But sometimes you have other query requirements, right?
You don't want to run this kind of query in Postgres.
You want to run it maybe in a warehouse, or maybe in a search index. Or maybe you have a use case where you benefit from graph queries,
like Neo4j Cypher queries, this kind of stuff. So you essentially bridge the kind of database,
and this is where you would use this rather than the built-in replication mechanisms.
Or also if you want to just cross vendor boundaries, right? So if you want to go from
Oracle to Postgres, probably it would make sense.
So I hope that makes sense in terms of the replication. Now you asked
why wouldn't I just go and query for
changed data, right? And that also can be a valid approach.
I always differentiate between log-based CDC, which is what Debezium does, and we can dive into what it means, versus query-based CDC, which is what you describe.
And there's a few key differences between them.
One of them is, well, if you do this query-based approach, how often do you run this query? And what does it mean for your data freshness?
So, I don't know, if you run this query every hour, well, your data might be stale for one hour, right?
Which again, depending on the use case, may or may not be acceptable, versus the log-based
CDC approach.
It gives you very low latency, milliseconds, two to three digit milliseconds, maybe seconds end to end. As I just mentioned in my talk, there are Debezium users who take data from their operational MySQL clusters into Google BigQuery for analytics purposes, and they have an end-to-end latency of less than two seconds. So, you know, really fresh data in their BigQuery system.
So now you could run this query every two seconds on your MySQL database, but it probably would kill the performance of the database.
It would be too much overhead.
And still, no matter how often you were to run this query, you would not be sure whether you missed any changes between two of those polling runs.
And I mean, in the extreme case, it could happen that something gets inserted and then deleted within two seconds, even if you were to run it every two seconds, and then you would never know about this record.
And depending on what you want to do, this might not be good enough.
Maybe you want to use this for building an audit log of all your data, which is another use case. This must be complete, right? So you really want to be sure
you see all the updates. So that's one implication of the query-based approach. Also, you need to
define how you actually identify your changed records. So you need to have some sort of columns
which tell you, okay, that's the last update timestamp. So it's a bit invasive on how you model your data schema.
Whereas if you do the log-based approach, you know, this can capture changes from any table; you don't have this impact on your schema.
And lastly, yes, deleting data.
That's another interesting thing.
So if you delete something from a table,
you cannot get it with a polling-based
approach, right? Because it's just gone, unless you were to do something like a soft delete, exactly. Whereas with the log-based approach, this goes to the transaction log of the database,
and all the events, they are appended, all the changes are appended to the transaction log.
Also, a delete is appended to this log, and Debezium will be able to get it from there.
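To make the trade-offs of the query-based approach concrete, here is a minimal polling sketch in Java. It assumes a hypothetical orders table with a last_updated column that the application keeps current, plus the Postgres JDBC driver on the classpath; it only illustrates the pattern being described, not anything Debezium itself does.

```java
// A minimal sketch of query-based CDC against a hypothetical "orders" table.
// Assumes a "last_updated" column maintained by the application and the Postgres
// JDBC driver on the classpath; connection details are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class PollingCdcSketch {
    public static void main(String[] args) throws Exception {
        Timestamp lastSeen = Timestamp.from(Instant.EPOCH);
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/appdb", "app", "secret")) {
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, status, last_updated FROM orders "
                                + "WHERE last_updated > ? ORDER BY last_updated")) {
                    ps.setTimestamp(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastSeen = rs.getTimestamp("last_updated");
                            System.out.printf("changed row: id=%d, status=%s%n",
                                    rs.getLong("id"), rs.getString("status"));
                        }
                    }
                }
                // Deleted rows never show up here, and rows inserted and deleted between
                // two polls are missed entirely: exactly the gaps log-based CDC closes.
                Thread.sleep(60 * 60 * 1000L); // hourly poll => data can be an hour stale
            }
        }
    }
}
```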
So, okay.
We turn, let's say,
a database that is like a system that manages state
into a stream
of changes, right?
And that's what we propagate
when we are using something like Debezium.
How do we go back to the state again? Because we have to recreate the state, right? And I know that this is not something Debezium does, but it is part of the workflow, right? On the other side, someone needs to go there and be like, okay, what's the current state that the source database has? So what does this process look like? What kind of patterns have you seen there? How hard is it?
Right, right, right. Yes. So how does that work?
So I would say this depends a little bit on the way you use Debezium and how you deploy it.
I didn't really mention this, but Debezium, at least initially, was very closely associated with Apache Kafka.
So people used it with Kafka. There is a side project of Kafka,
which is called Kafka Connect,
which is a runtime and a development framework
for connectors.
And Debezium still is based on Kafka Connect.
You know, so Kafka Connect will run
the Debezium connectors for taking data into Kafka.
And if you do this, and this is one of several ways you could use Debezium, then you would use a sink connector for Kafka Connect.
So there's a very rich ecosystem of connectors.
So you would use a JDBC sink connector, which subscribes to those topics, maybe applies some sort of transformation, and puts the data into a sink database. So that's what you would do with Kafka and Kafka
Connect. Now there's other ways how you could use Debezium and one which is very
interesting is what we call the embedded engine. In that case you use it as a Java
library within your JVM-based application, and this gives you very much flexibility in terms of how you want to react to those change events. Essentially, it's just a callback method which you register; whenever a change event comes in, this callback method will be invoked, and you can do with those change events whatever you want. And this is what integrators of Debezium into other platforms typically use.
So one example would be Apache Flink.
There's a project, a side project of Flink, which is called Flink CDC.
So they take the Debezium connectors and other CDC connectors to ingest change events directly into Flink.
So you don't need to run Kafka.
That's just, you know, in process in Flink, you will ingest those change events.
Then you could go about processing the data there.
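For reference, here is a minimal sketch of the embedded-engine model described above, using Debezium's DebeziumEngine builder API. The connection details are placeholders and exact property names can vary between Debezium versions, so treat it as an outline rather than a ready-to-run configuration.

```java
// A minimal sketch of Debezium's embedded engine. Assumes debezium-api,
// debezium-embedded, and debezium-connector-postgres on the classpath;
// all connection values below are placeholders.
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedCdcSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-postgres-cdc");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // When running embedded, offsets are stored locally instead of in Kafka.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/debezium-offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "app");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "appdb");
        // Logical name for the source ("database.server.name" in older Debezium versions).
        props.setProperty("topic.prefix", "app");

        // The callback here is the "register a method, get invoked per change event" model.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> System.out.println(record.value()))
                .build();

        // The engine is a Runnable; hand it to an executor and it streams until closed.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
    }
}
```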
Okay.
And, well, that's interesting.
So, like, if you don't have Kafka there,
how do you manage, like, the delivery semantics?
Because, yeah, one of the things that you get, like, with Kafka
is that you have, like, some very specific,
strong guarantees about the event.
And even if downstream, let's say, your consumer fails or whatever, the data is going to remain in the topic. And most importantly, you're not going to add any pressure to the source database, right? Because, okay, I remember at some point I started playing around with Debezium and I set up a Postgres database on RDS.
I enabled logical replication, I had my slots, blah blah blah, all that stuff.
I start consuming.
You ran out of disk space.
Yes.
And then I'm like, what? I'm not even using the database. Right. And the reason is because there's another log there that someone needs to consume so it gets freed as the database writes.
And that's a very important thing operationally. Yes, absolutely. Because you don't want in any
way your production database to run out of space.
So how does it work with like not Kafka in between?
Right.
So yes, that's a specific challenge.
In this case, it's in particular caused by the way RDS productizes Postgres.
So what happens there is you can have a Postgres database on, like, one physical host, and there can be multiple logical databases on the same physical host.
And the transaction log, this is shared across all those logical databases.
So there's one transaction log for the entire Postgres instance. But those replication slots which you mentioned, which are essentially the handle for a connector like Debezium to go there and extract changes from the log, they are specific to one of those logical databases.
And now what's happening on
RDS in particular is they have
another, like an internal
database which you cannot access
even, it's like for administrative purposes or whatever, and there they do, like, heartbeat changes every five minutes or so. So there are a number of changes which happen in a database which you cannot access, which you don't even know about, to be honest.
And so now you come and set up your Debezium connector against another database, your own logical database, and you want to stream changes there. And as long as there are changes coming in, this all is fine. It will
just work. The challenge, and I suppose that's the situation you ran into, is if you don't receive
any changes for your own database, or if you just stop your connector and it's not running,
this replication slot which is set up, this cannot progress.
It cannot go to the database and say
this consumer has made
progress up to this point, so you can
discard any previous
portions of the transaction log. That's a common stumbling block.
And what you can do there is actually, well, A, if you just have natural traffic in your own database, then it will be fine. The connector progresses, and every now and then it will go to the database and say, okay, this is the offset I acknowledge, and the database is free to discard any older log state.
And if you need to account for the situation that your connector is, you know, not receiving changes for quite some time, then the Debezium connector can actually go and induce some artificial changes into the database itself. So you can configure a specific statement you want to run, just like a heartbeat every few minutes, and this will solve this particular problem.
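As a sketch of what that looks like, these are the heartbeat-related connector properties, shown Java-style as in the embedded-engine example above. The heartbeat table is a hypothetical single-row table you would create yourself, and the values are illustrative.

```java
import java.util.Properties;

public class HeartbeatConfigSketch {
    // Heartbeat-related properties for a Debezium Postgres connector. The
    // "debezium_heartbeat" table is a hypothetical single-row table created up front;
    // the interval and the query are illustrative values.
    static Properties heartbeatProperties() {
        Properties props = new Properties();
        // Emit a heartbeat roughly every 5 minutes.
        props.setProperty("heartbeat.interval.ms", "300000");
        // Optionally write into the connector's own database so the replication slot
        // keeps advancing even when no application traffic arrives.
        props.setProperty("heartbeat.action.query",
                "INSERT INTO debezium_heartbeat (id, ts) VALUES (1, now()) "
                        + "ON CONFLICT (id) DO UPDATE SET ts = now()");
        return props;
    }
}
```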
Okay, that makes total sense. But still, how do you work with the delivery semantics when Kafka is not there?
Right, right. Yes, I mean, so then it depends quite a bit on the specific
connectors you were, for instance, to use with Flink. So let's say on the
sink side in Flink, you still might use maybe the Kafka sink connector, right?
So then, the same, you could even do exactly-once semantics, actually, because this would be
like a transactional writer
and Flink would make sure
that if you crash
or whatever, that no events would be
duplicated, would be sent out another time.
But I don't know, if you use maybe a non-transactional sink,
yeah, then you probably would have again,
like at-least-once semantics,
which means it could happen, in particular,
if there's like an unclean shutdown,
that you would see some events another time.
So consumers in that scenario need to be idempotent, to be ready to receive a change event another time and then discard it in this kind of scenario.
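Here is a minimal sketch of what such an idempotent consumer can look like, assuming a hypothetical orders_replica target table in Postgres: writing each change event as an upsert keyed on the primary key means a re-delivered event simply produces the same end state.

```java
// A minimal sketch of an idempotent sink for at-least-once delivery, assuming a
// hypothetical "orders_replica" table in Postgres keyed by the record's primary key.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentSinkSketch {
    private final Connection conn;

    public IdempotentSinkSketch(Connection conn) {
        this.conn = conn;
    }

    // Called once per change event; safe to call again with the same event,
    // because the upsert just overwrites the same row.
    public void apply(long id, String status) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO orders_replica (id, status) VALUES (?, ?) "
                        + "ON CONFLICT (id) DO UPDATE SET status = EXCLUDED.status")) {
            ps.setLong(1, id);
            ps.setString(2, status);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/replica", "app", "secret")) {
            IdempotentSinkSketch sink = new IdempotentSinkSketch(conn);
            sink.apply(42L, "shipped");
            sink.apply(42L, "shipped"); // duplicate delivery: same end state
        }
    }
}
```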
Yeah, makes a lot of sense. So okay, Debezium is a kind of middleware, right? It is a piece of software that needs to interoperate with at least a couple of different database systems as its input. And, okay, replication was always a very esoteric thing.
I remember, for example, that replicating using, like, the binlog of MySQL was available for a long time, but Postgres got logical replication much later. 9.4 or something, yeah.
Yeah, so it's a much more recent thing. Before that, they had, like, the binary replication thing, and it's, okay, how do you interpret this, blah blah blah, and all that stuff. How do you build a system that needs to interoperate in a way
and create a common interface with such a diverse and esoteric
part of databases? Yes, I mean it's a challenge, right? It's exactly
the challenge which the Debezium team has.
All those databases have different interfaces, different formats, different APIs, how you would go and extract changes there.
So, for instance, in case of the Cassandra connector, it actually reads the log files from the database machine.
In case of Postgres, it's a logical replication mechanism.
In case of Oracle, it queries some specific views.
So it's different for all the databases.
And yes, it requires original engineering
for each of those connectors.
It requires at least one engineer on the team
to dive into the specifics of that particular connector
to make sure we understand how this
works, to understand all the subtleties. And stuff like what you describe on RDS, this takes a while to figure out, right, and to realize, okay, this is the situation, this is how we can mitigate it. So it's, I would say, not trivial to do, and this is also why the team is always conscious about adding more connectors, because it means it's a maintenance effort.
You need to have the people around who can understand and work with this code base.
And in particular, it means, and this happens every now and then, people from the community come and they suggest they would provide a new connector.
And they say, hey, there's this database
and we want to build a Debezium connector for it. And would you be interested in making this a part of the Debezium project?
And now on the first thought,
this sounds interesting. You would think,
hey, we can get support for a new database,
but what you also need to think about, and it's a mistake which, you know, you can easily make, is the maintenance implication of this.
Those people, who usually have very good intentions in providing you with this connector, will they be able to maintain it at least for some time? Right, I mean, I realize nobody can tell what's happening five years from now, so I wouldn't expect anybody to promise they will maintain this for five years. But there must be some common understanding, if somebody contributes a connector, that they are around, at least for a reasonable amount of time, and work on this connector and maintain it.
And this is, for instance, what recently just happened with Google Cloud Spanner,
which is something I'm really excited about.
So the team who work at Google on Cloud Spanner, their distributed SQL database, they decided to build a CDC connector based on the Debezium framework, and they published it open source as part of the Debezium umbrella. So for me, that's just very interesting to see how Debezium becomes sort of a de facto standard in this space.
Yeah, I've seen that problem with Trino. Yes, with connectors, it's the same thing.
Yeah, and what I think many people and especially new people in the
open-source community don't realize is that when you have a piece of
software that is literally used in critical parts of very large and many infrastructures out there,
you can't just decommission these things.
Yes, exactly.
When something gets into the code base, taking it out...
Is hurtful, yeah.
It hurts.
Exactly, yeah. It hurts. Exactly. So if there's no plan to maintain a connector,
it's hard for an open source project that always has limited resources, right?
Exactly. Even if you have a company behind you, it's going to be limited resources.
Exactly. Always different priorities. Yeah. And without commitments,
it makes a lot of sense for the
maintainers of this project to say no.
Because at the end, making a wrong choice with that
can really hurt not the project but the community at the end.
Totally. And it also comes to things like testing. So something
like Cloud Spanner, you need to have an instance of that
which you can run your test suite against, right?
I mean, for Postgres or all those open source databases,
I can run them locally via Docker,
so that's not a problem.
But for that, you know,
and we cannot really,
or the Debezium project cannot pay
for having a Cloud Spanner testing infrastructure.
100%.
They need to provide that.
So this also needs to be part of the conversation.
Yeah, 100%.
That's yet another example with Trino.
Yeah, you have a connector for BigQuery
or you have a connector for Redshift.
How do you test it?
Every time you run a test there,
you need a Redshift cluster to run, right?
Okay, open source projects are not exactly, you know, rolling in money in general.
Absolutely.
So it is hard. Like, the operations around what we in a company take for granted, in terms of CI/CD or testing infrastructure, all that stuff, are at least an order of magnitude harder in an open source project.
And people should understand, it's not that people don't appreciate your effort to contribute, but contributing to the project is about more than the code.
Exactly, yeah, totally, totally.
It's maintaining it, not just the engineering part of writing the code, that is important.
Anyway, that's a topic on its own.
I think at some point it would be nice to get together some people from open source to talk about this.
Yeah, totally.
I mean, there's also the question of documentation and helping people.
I don't know, let's say there's an esoteric database, people use this and they run into
issues.
So who helps them with that?
Because the core
team likely cannot be an expert on a gazillion different databases.
100%, yeah. So okay, Debezium has been around for a while. As you said, it's becoming, let's say, like a specification in a way.
Yeah, I mean, people rely on the format. It's, you know, Flink exposes the Debezium change event format. ScyllaDB, as another example, they implemented a CDC connector, which they maintain on their end, but it's, again, based on the Debezium framework, and they also expose the same change event format.
Yeah.
So, but at the same time, I have to say, as a user, it still feels like Debezium is a piece of software that is tightly integrated with the Kafka ecosystem.
Yes.
And obviously, I don't have any problem with Kafka.
But what I want to ask is like, what's the future of Debezium?
Right.
What's the vision there?
Like, how do you see the project like moving forward?
Because it seems that it's starting to become more and more important and more horizontal, let's say. So how do
you handle that? How do you move forward with that?
Right. Yeah, I mean, so just to reiterate, I'm not the current project lead anymore, right? So I don't have the power or I don't make the roadmap, right?
I would say there is an effort to evolve more into an end-to-end platform.
So one thing which the team works on right now is actually a JDBC sink connector.
Because right now you would use it together with other JDBC sinks from other vendors, which, you know, you need to configure in the right way to make sure they're all compatible.
So having a Debezium JDBC sink connector, this definitely will help people to use this more easily and set up end-to-end flows based on Kafka.
Now you mentioned it's tied to Kafka.
That's true.
Well, you know, there's a strong Kafka consideration,
but then also there actually is a component,
and I think this will gain in importance,
which is called a Debezium server.
And Debezium server is,
in terms of your overall architecture,
it has the same role as Kafka Connect,
so it's the runtime for the connectors,
but then this gives you connectivity with Kinesis,
Google Cloud Pub/Sub, Apache Pulsar, Pravega, all kinds of data streaming platforms.
People also can use Debezium with things other than Kafka, right?
Because I mean, as you say, I like Kafka as well, I love it, but maybe people are deeply
into the AWS ecosystem, they use Kinesis, and they still should be able to benefit from CDC.
So that always was the mission.
We want to give you the best CDC platform
no matter which data streaming platform you're on.
So that's something which is happening.
Then, of course, also what I observe is there are actually quite a few services that take Debezium and provide managed service offerings around it. So Red Hat is doing that.
In addition to the classical on-prem product, there is a connector service. But then of course it's also startups like Decodable.
There's a few others who take this and provide you with a very cohesive end-to-end platform, which then also adds the notion of processing to it, which makes this very easy to use, I feel.
So I think, I mean, that's not so much about Debezium, the core project, but I feel like more and more people are going to consume Debezium through such services rather than run it, because then they don't have to operate it themselves.
Yeah, yeah, 100%.
All right, and one last question for me about Debezium.
So CDC has been around as a pattern for quite a while, right?
But based on my experience, at least outside of Debezium,
I haven't seen much of any other open source project
that can be used in some kind of
production environment at least, right?
Why do you think that this is the case? And do you
think that this is going to change? Do you think
we are going to see more implementations?
So I would
say, I mean, there is other
open source CDC solutions,
but usually for particular
databases. So for instance, there is
Maxwell's Daemon from the Zendesk team,
which is a CDC solution for MySQL.
So just for that database.
And I suppose there is others for Postgres
and for particular databases.
Right now, I'm not aware of any open source CDC solution
which really has this intention to be a CDC for all kinds
of databases, all the popular databases.
So I don't see anything coming up
like this. I'm not aware of
anything, let's say. There's a few
interesting developments. For instance,
Netflix, they have
their own internal CDC
solution. At some point,
they kind of indicated they would
open source it, but so far they haven't followed up on this.
And this has been quite a while ago.
So I don't think it's going to change now.
But what they actually did is, and this is why I'm mentioning this, is they published a research paper about their snapshotting algorithms.
So this is about taking an initial snapshot of your data and putting this into your streaming platform.
And they wrote this paper where they described this.
It's very interesting because it allows you to re-bootstrap tables,
to run multiple table snapshots in parallel and this kind of stuff.
And actually, Debezium implemented this solution from the Netflix guys.
So, yeah, that's where I see it going right now.
Okay, that's awesome. I hope to have you back on the show anytime really soon, and I would love to see a small panel, to be honest, with people who are veterans of open source, to communicate some of these things that people maybe don't understand.
And they can feel a little bit obscure, sometimes even a little bit rude.
But I think it's going to be very useful for anyone who is considering contributing to that.
It would be my pleasure, for sure.
So let's do that.
Brooks, we'll work on putting that together.
That'd be great.
Thanks so much.
This has been a fascinating conversation.
You are clearly the authority on Debezium, very active online.
If folks want to follow along, how can they connect with you?
So they could follow me on Twitter.
I'm Gunnar Morling on Twitter.
I'm also on LinkedIn. I don't know, it's just my name there on LinkedIn.
And they also can shoot an email to
gunnar at decodable.co
if they want to talk about Flink and
Decodable maybe. So yeah,
different ways to reach out to me.
Cool. And what about Decodable?
Say I want to learn more about Decodable.
If you want to learn more about Decodable,
you totally should go to the Decodable website.
There's a free trial
which you could use to get your hands on the product.
You also could go to our YouTube channel.
There's a few interesting recordings there.
Kind of a sneak preview,
I'm going to do a new series on our YouTube channel
called the Data Streaming Quick Tips.
Cool.
So you also can watch out for that on the Decodable YouTube channel.
Awesome. Cool.
Well, that's at G-U-N-N-A-R-M-O-R-L-I-N-G on Twitter.
And then G-U-N-N-A-R at Decodable.
Is it Decodable.com?
No, it's .co.
.co.
Right.
So Gunnar at Decodable.co if you want to reach out on email.
And then check out the YouTube channel.
Sounds like some exciting things coming up.
Awesome.
Yeah, totally.
Thank you so much.
Yeah, thanks for coming on the show.
My pleasure.
Thanks.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by
Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.