The Data Stack Show - Data Council Week (Ep 6) - All About Debezium and Change Data Capture With Gunnar Morling of Decodable

Episode Date: April 27, 2023

Highlights from this week's conversation include:

- Gunnar's background in data (0:32)
- Setting the vision in the early days at Red Hat and spearheading Debezium (6:20)
- Replication of data in Debezium (9:47)
- The patterns and processes of Debezium (16:21)
- Debezium working with Kafka (19:03)
- Building a diverse system while incorporating common interfaces (24:09)
- The importance of documentation in open source projects (27:59)
- Debezium's vision moving forward (31:32)
- Why aren't there more CDC open source solutions? (34:35)
- Connecting with Gunnar (37:27)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. All right, we are back here at Data Council. If you're following along, you already know Eric couldn't make it to the conference. So I'm Brooks filling in for Eric while he is out. We've got Costas here, obviously, as well. And we just sat down with Gunnar Morling.
Starting point is 00:00:41 He's a senior engineer at Decodable. Extremely excited to chat with him today. We chat with Eric, the CEO, earlier this week. So we're excited to kind of continue just digging in. And to kick us off, well, I guess first, Gunnar, welcome to the show. Awesome. Yeah. Thank you so much. Yeah, absolutely. We'd love to just hear about kind of your background, where you started, and kind of your path to where you are today at Decodable. Oh, yeah, sure. Let's do this. So, yes, I mean, I joined Decodable just a few months back in November last year.
So it's still pretty new for me. Before that, I had been at Red Hat for exactly 10 years, up to the day. And there, you could divide my tenure at Red Hat into two parts. So the last five years, I was working on Debezium, which is a tool for change data capture. I was the project lead for Debezium. And I guess we will talk about that in more depth.
So that's what I did for the last five years. And before that, I was working on different parts of the Hibernate project umbrella. So I was mostly working on Bean Validation. I did the Bean Validation 2.0 specification. We had a fancy exploration of taking Hibernate and object-relational mapping and applying it to NoSQL stores. So that was an interesting episode. I'm just saying this project doesn't exist any longer.
Starting point is 00:02:03 But, you know, I learned lots of stuff back then. So yeah, that's what I did before joining Decodable pretty much. Cool. Tell us a little bit about kind of what drove your decision to leave Red Hat after 10 years to the day. That's amazing. Yeah. But it sounds like when we were talking before we started recording here, there are some
kind of personal reasons, but also some of your technical interests kind of drove you to the next thing. Right, right. Exactly. It's a combination of the two things. And, you know, I really enjoyed my time at Red Hat. It's a great company. I still have many friends there. And it's a place I would really recommend to everyone to work at.
It's a great company. Why I left was, well, I was working on Debezium for approximately five years, and I felt, you know, I wanted to do something new again. Not because I didn't like the project any longer, I still think it's an exciting project, there's tons of applications for it, but for me it was a bit, okay, I've been working on it for quite some time and I want to do something new. So this was one motivation. And then, well, I mean, Debezium is about ingesting change events out of databases like Postgres, like MySQL, into Kafka, typically. And, of course, people don't do this just for its own sake, right?
Starting point is 00:03:18 They don't want data just in Kafka. They want to take it elsewhere. They want to take their data into Snowflake, into Elasticsearch, into other databases to do query use cases, to do full-text search. And Debezium itself didn't have or doesn't have a good answer to that. It's just not the scope of the project. This is a CDC platform.
Starting point is 00:03:38 It doesn't concern itself with taking data out of Kafka into other platforms. But still, you need this as part of your overall data platform. And I felt, okay, I want to look at this data journey, let's say, like really end-to-end. So I want to look at it, okay, there's the Debezium part, which takes data into Kafka,
Starting point is 00:03:56 but also I want to help people with taking that data out of Kafka into other systems. So this is exactly what Decodable does amongst other things. So this was interesting to me. And then, of course, there's this notion of processing data, right? Like filtering data, changing types, changing date formats, this kind of stuff, doing stateful
operations like joins and aggregations and so on, which all is done by Flink. And so, well, I was recommending this to people all the time when they came to me and asked about it: use something like Flink or something like Kafka Streams, it allows you to do those things. At Decodable, I now have the opportunity to work on this and, well, provide people with a platform which does all that in a managed way. So that's it in terms of, you know, work motivation.
And then there's the other thing. Well, I was at Red Hat, as I mentioned, for 10 years, and the company grew a lot over that time. So when I left, it was more than 20,000 employees. And how big was it when you started? Yeah, it was, I believe, 3,000.
So it was like almost 7x. And this changed, you know. I mean, processes change, and it is what it is. I mean, you know, it's not a startup, obviously, and I wanted to have this experience. I wanted to go to a small place and see, okay, how is this, if everybody really is pulling in one direction, you make quick decisions, you see, okay, this flies or this doesn't fly, and you don't, you know, think about stuff for six months to realize it, one thing or another. And so that's why I wanted to go for a startup like Decodable. Yeah, cool. Because I know you're excited to dig into Debezium, so
take it away. Yeah, sure. So how did you start working on Debezium? Right, yes, that's a good question. And it kind of even was a coincidence, I have to say, if I want to be totally honest. Because, as I mentioned, I was working on Hibernate stuff before then, and again, I feel like I have this five-year attention span, so after five years, I feel like I need to do something new. And I was at this point, at this stage, at this time back then, looking for something new. I didn't feel like leaving Red Hat. I wanted to explore something new within Red Hat.
And then the original founder, I'm not the founder of Debezium, the original founder left Red Hat. So Randall Hauch, who created the project, he went to Confluent. And so this project lead role was open. And yeah, you know, I was there. There was an open project lead role. So things came together and I picked it up. Okay, that's great.
And when did that, like, let's talk a little bit about the history of Debezium, right? Like, when was Debezium first published? Right. So I took over in 2017. And by then, I believe it was roughly like one year old. So it started in 2016. And it was a very small team back then. So there was Randall, who was the original project lead, and there was another engineer, Horia. So the two worked on the project and, you know, they pitched it really well within Red Hat, so they made a very good case
why we should sponsor this project. And then, well, things happened. So Randall left. And independently, Horia, the other guy, left the next month. So we went from two people who knew about the project, who could work on it, to, like, having nobody. Yeah. So this was, of course, a challenge. And then I came in. And thankfully, Randall, I mean, he left, but, you know, I had his email and he was very helpful, so I could reach out to him. So I came in,
and then another engineer, Jiri, a good friend of mine, he also came in. So it was the two of us who ran the project, and then it was a little bit like a startup, even within Red Hat. So, you know, because back then there wasn't even a product around it. It was just a plain upstream community project. So we worked on new features, adding connectors. We worked on creating brand awareness. So I went to conferences, doing blog posts, telling people how to use it, why you would use it, all this kind of stuff. And then, of course, we constantly made the case for getting new engineers into the project. I believe after a year or so, we got another engineer,
actually somebody from the Hibernate team whom I could convince to move over. And then, you know, we grew. And at some point, it actually became part of the Red Hat product portfolio as well. So, you know, what they usually do there is they have, like, this duality of having an upstream community project, like Debezium, and there's a commercially supported product offering where customers can get a subscription and they get support for that.
So this happened at some point, and of course this took it then to the next level, right? Because of, like, a support organization, proper professional documentation, product management, all this kind of stuff. So then it really took off. Yeah, yeah, 100%. And what was the initial motivation behind starting Debezium? Why did these two folks decide to start building Debezium? That's a very good question. So Randall was working, and we touched on this a little bit earlier, on a Red Hat project back then which was called
Teiid, which was a data virtualization product. You know, so like a federated query engine. And he was working on that originally. They had a notion of materialized views with Teiid, and I believe he realized the need to have some sort of triggers for updating those views, and I believe this was his core motivation. And then he started it, and yeah, it was quite well received quite quickly. Actually, a community formed around it, so people started to use it. They had tons of use cases for it. And it was quite popular from the beginning. Yeah, absolutely. And what are the most common use cases around Debezium? How do people use it? Right.
So I would say the biggest category of use case is what you could call replication in the widest sense. So taking data out of an operational database, as soon as it changes, into other kinds of data stores. Typically because you have specific requirements. So you want to take your data into a data warehouse like Snowflake so you can do analytics, or into Apache Pinot so you can do real-time analytics and show dashboards. Or maybe you want to take your data from, I don't know, maybe even a commercial database in production, like Oracle, which has, you know, some licensing implications to it.
Maybe you want to have a copy of the data in Postgres, so you can query it on the side and do some sort of testing. So you want to take this data across database vendor boundaries. That's all replication. And I would say also something like updating a search index.
I would also consider that a replication use case, because if you want to do full-text search on your data, typically you cannot do it as well on a relational database. Rather, you want to use something like Elasticsearch or OpenSearch. And you want, of course, this index to be fresh, to be up to date, so when you do your full-text search, it gives you current search results. So that's something which people do a
lot, feeding the search index. Another one is updating or invalidating caches: if people have cached versions of their data, they want to invalidate those after a data change. This comes up a lot. Then I would say there's another big category in the context of microservices. Propagating data, exchanging data between
Starting point is 00:12:00 microservices, maybe moving from monolith to microservices, things like this. Different patterns in that space, like the Outbox pattern or the Strangler fig pattern which people use and people benefit then again from CDC for those kinds of things.
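To make the outbox pattern concrete, here is a minimal sketch of the writing side in Java, with hypothetical table and column names (purchase_orders, outbox): the service writes its business row and the outgoing event row in one local transaction, and a CDC tool like Debezium then picks the event up from the outbox table's change stream.

```java
// Minimal outbox-pattern sketch (table and column names are hypothetical).
// The business row and the event row are written in one local transaction;
// Debezium later streams the INSERT into the outbox table as a change event.
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OutboxWriteSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders", "app", "secret")) {
            conn.setAutoCommit(false);
            try (PreparedStatement order = conn.prepareStatement(
                        "INSERT INTO purchase_orders (id, customer_id, total) VALUES (?, ?, ?)");
                 PreparedStatement outbox = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
                        + "VALUES (?, ?, ?, ?, ?::jsonb)")) {

                UUID orderId = UUID.randomUUID();
                order.setObject(1, orderId);
                order.setLong(2, 42L);
                order.setBigDecimal(3, new BigDecimal("99.90"));
                order.executeUpdate();

                outbox.setObject(1, UUID.randomUUID());
                outbox.setString(2, "order");
                outbox.setObject(3, orderId);
                outbox.setString(4, "OrderCreated");
                outbox.setString(5, "{\"orderId\":\"" + orderId + "\",\"total\":99.90}");
                outbox.executeUpdate();

                conn.commit(); // both rows, and therefore the event, become visible atomically
            } catch (Exception e) {
                conn.rollback(); // no business change, no event
                throw e;
            }
        }
    }
}
```

Because the event only becomes visible when the transaction commits, downstream consumers never see an event for a change that was rolled back.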
Starting point is 00:12:16 Makes sense. So why would someone, let's take the replication use case, right? Why would someone use a CDC pattern to do the replication instead of going and executing SQL queries, for example, and get updates and pull them out and then replicate on the other side? Or there's another option there.
Starting point is 00:12:38 Just create a replica, right? Yes. And have, let's say, a replica of Postgres and use that as a read-only replica, where you go and do your analytics. So, I mean, you totally can do all those things, and I would recommend doing them when they make sense. So if you have the replica set up, I mean, I totally can see where you want to have, like, read replicas of your Postgres data, so you can have, I don't know, replicas closer to a user. And in particular, if your query requirements are satisfied via Postgres, that sounds very reasonable to do, right?
So I'm all for that. But sometimes you have other query requirements, right? You don't want to run this kind of query in Postgres. You want to run it maybe in a warehouse, or maybe in a search index. Or maybe you have a use case where you benefit from graph queries, like Neo4j Cypher queries, this kind of stuff. So you essentially bridge the kinds of databases, and this is where you would use this rather than the built-in replication mechanisms. Or also if you just want to cross vendor boundaries, right? So if you want to go from Oracle to Postgres, probably it would make sense.
So I hope that makes sense in terms of the replication. Now you asked, why wouldn't I just go and query for changed data, right? And that also can be a valid approach. I always differentiate between log-based CDC, which is what Debezium does, and we can dive into what it means, versus query-based CDC, which is what you describe. And there's a few key differences between them.
One of them is, well, if you do this query-based approach, how often do you run this query? And what does it mean for your data freshness? So, I don't know, if you run this query every hour, well, your data might be stale for one hour, right? Which again, depending on the use case, may or may not be acceptable. Versus the log-based CDC approach, it gives you very low latency of milliseconds, two to three digit milliseconds,
maybe seconds end to end. So I just mentioned in my talk, there are Debezium users who take data from their operational MySQL clusters into Google BigQuery for analytics purposes, and they have an end-to-end latency of less than two seconds. So, you know, really fresh data in their BigQuery system. Now, you could run this query every two seconds on your MySQL database, but it probably would kill the performance of the database. It would be too much overhead. And still, no matter how often you were to run this query, you would not be sure whether you missed any changes between two of those
polling runs. And I mean, in the extreme case, it could happen that something gets inserted and something gets deleted within two seconds, even if you were to do it every two seconds, and then you would never know about this record. And depending on what you want to do, this might not be good enough. Maybe you want to use this for building an audit log of all your data, which is another use case. This must be complete, right? So you really want to be sure you see all the updates. So that's one implication of the query-based approach. Also, you need to define how you actually identify your changed records. So you need to have some sort of column which tells you, okay, that's the last update timestamp. So it's a bit invasive on how you model your data schema.
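As an illustration of the query-based approach being described, here is a minimal polling sketch, assuming a hypothetical customers table with an updated_at column (this is the alternative to Debezium, not how Debezium itself works):

```java
// Minimal query-based CDC sketch (the alternative to log-based CDC).
// Assumes a hypothetical customers table with an updated_at column; hard deletes
// are invisible to this query, and rows changed twice between polls show up once.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;

public class PollingCdcSketch {
    public static void main(String[] args) throws Exception {
        Instant lastSeen = Instant.EPOCH; // would be persisted between runs in a real setup

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                "SELECT id, email, updated_at FROM customers WHERE updated_at > ? ORDER BY updated_at")) {

            stmt.setTimestamp(1, Timestamp.from(lastSeen));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // forward the changed row downstream (warehouse, search index, ...)
                    System.out.println(rs.getLong("id") + " -> " + rs.getString("email"));
                    lastSeen = rs.getTimestamp("updated_at").toInstant();
                }
            }
            // sleep for the polling interval and repeat; data can be stale for up to that interval
        }
    }
}
```

Everything listed above follows from this shape: freshness is bounded by the polling interval, intermediate changes between two polls are collapsed or missed, and hard deletes never show up in the result set.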
Whereas if you do the log-based approach, you know, this can capture changes from any table and doesn't have this impact. And lastly, yes, deleting data. That's another interesting thing. So if you delete something from a table, you cannot get it with a polling-based approach, right? Because it's just gone, unless you were to do something like a soft delete,
Starting point is 00:16:30 exactly. Whereas with the log-based approach, this goes to the transaction log of the database, and all the events, they are appended, all the changes are appended to the transaction log. Also, a delete is appended to this log, and Debezium will be able to get it from there. So, okay. We turn, let's say, a database that is like a system that manages state into a stream of changes, right?
And that's what we propagate when we are using something like Debezium. How do we go back to the state again? Because we have to recreate the state, right? And I know that this is not something Debezium itself does, but it is part of the workflow, right? Like on the other side, someone needs to go there and be like, okay, what's the current state that the source database has? So what does this process look like? What kind of patterns have you seen there? How hard is it? Right, right, right. Yes. So how does that work? So I would say this depends a little bit on the way you use Debezium and how you deploy it.
Starting point is 00:17:35 I really didn't mention this, but Debezium, at least initially, was very closely associated to Apache Kafka. So people used it with Kafka. There is a side project of Kafka, which is called Kafka Connect, which is a runtime and a development framework for connectors. And Debezium still is based on Kafka Connect. You know, so Kafka Connect will run the Debezium connectors for taking data into Kafka.
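For a sense of what that looks like in practice, here is a sketch of registering a Debezium Postgres source connector through Kafka Connect's REST API; the host, credentials, table list, and the exact option names are placeholders and vary by Debezium version, so treat it as illustrative rather than exact:

```java
// Sketch of registering a Debezium Postgres source connector with Kafka Connect's REST API.
// Host names, credentials, table list, and topic prefix are placeholders; the exact set of
// required options depends on the Debezium version you run.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectorSketch {
    public static void main(String[] args) throws Exception {
        String connectorJson = """
            {
              "name": "inventory-source",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "postgres",
                "database.port": "5432",
                "database.user": "debezium",
                "database.password": "secret",
                "database.dbname": "inventory",
                "topic.prefix": "inventory",
                "table.include.list": "public.customers"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // Kafka Connect REST endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once registered, Kafka Connect runs the connector and the captured change events start flowing into Kafka topics derived from the configured prefix and the table names.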
And if you do this, and this is one of several ways how you could use Debezium, then you would use a sink connector for Kafka Connect. So there's a very rich ecosystem of connectors. So you would use a JDBC sink connector, which subscribes to those topics, maybe applies some sort of transformation, and puts the data into a sink database. So that's what you would do with Kafka and Kafka Connect. Now there's other ways how you could use Debezium, and one which is very interesting is what we call the embedded engine. In that case you use it as a Java
library within your JVM-based application, and this means it gives you a lot of flexibility in terms of how you want to react to those change events. Essentially, it's just a callback method which you register; whenever a change event comes in, this callback method will be invoked, and you can do with those change events whatever you want. And this is what integrators of Debezium into other platforms typically use. So one example would be Apache Flink. There's a project, a side project of Flink, which is called Flink CDC.
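A minimal sketch of that embedded-engine, callback-style usage, the mechanism integrators like Flink CDC build on; the property set shown here is abbreviated and version-dependent, and the connection details are placeholders, so check the Debezium documentation for the exact configuration your version needs:

```java
// Minimal sketch of Debezium's embedded engine: change events are delivered to a
// callback inside your own JVM, no Kafka involved. Property names are abbreviated
// and version-dependent; connection details are placeholders.
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedEngineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-engine");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "postgres");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("topic.prefix", "inventory"); // logical name used to label event destinations

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> System.out.println(record.value())) // invoked once per change event
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // DebeziumEngine implements Runnable
        // A real application would keep the engine reference and call close() on shutdown
        // so that offsets are flushed and the connector stops cleanly.
    }
}
```

The notifying callback is the hook described here: every captured change event is handed to it in process, with no Kafka broker involved.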
Starting point is 00:19:14 So they take the Debezium connectors and other CDC connectors to ingest change events directly into Flink. So you don't need to run Kafka. That's just, you know, in process in Flink, you will ingest those change events. Then you could go about processing the data there. Okay. And, well, that's interesting. So, like, if you don't have Kafka there, how do you manage, like, the delivery semantics?
Because, yeah, one of the things that you get with Kafka is that you have some very specific, strong guarantees about the events. And even if downstream, let's say, your consumer fails or whatever, the data is going to remain in the topic. And most importantly, you're not going to add any pressure to the source database, right? Because, okay, I remember at some point I started playing around with Debezium and I set up a Postgres database on RDS. Enabled logical replication, there I have my slots, blah blah blah, all that stuff. I start consuming.
You ran out of disk space. Yes. And then I'm like, what? I'm not using the database. Right. And the reason is that there's another log there that someone needs to consume so it gets freed as the database keeps writing. And that's a very important thing operationally. Yes, absolutely. Because you don't want your production database to run out of space in any way. So how does it work with no Kafka in between? Right.
So yes, that's a specific challenge. In this case, it's particularly caused by the way RDS productizes Postgres. So what happens there is you can have a Postgres database on one physical host, and there can be multiple logical databases on the same physical host. And the transaction log, this is shared across all those logical databases. So there's one transaction log for the entire Postgres instance. But those replication slots which you mentioned, which are essentially the handle for a connector like Debezium to go there and extract changes from the log, they are specific to
one of those logical databases. And now what's happening on RDS in particular is they have another, like an internal database which you cannot even access, it's for administrative purposes or whatever, and there they do heartbeat changes every five minutes or so. So there is a number of changes which
happen in a database which you cannot access, which you don't even know about, to be honest. Yeah. Really. And so now you come and set up your Debezium connector for another database, like your own logical database, and you want to stream changes there. And as long as there are changes coming in, this all is fine. It will just work. The challenge, and I suppose that's the situation you ran into, is if you don't receive any changes for your own database, or if you just stop your connector and it's not running, this replication slot which is set up cannot progress. It cannot go to the database and say this consumer has made
progress up to this point, so you can discard any previous portions of the transaction log. That's a common stumbling block. And what you can do there is, actually, well, A, if you just have natural traffic in your own database, then it will
be fine. Yeah, the connector progresses, and every now and then it will go to the database and say, okay, this is the offset I acknowledge, and the database is free to discard any older log state. And if you need to account for the situation that your connector is, you know, not receiving changes for quite some time, then the Debezium connector can actually go and induce some artificial changes into the database itself. So you can configure a specific statement you want to run, just like a heartbeat every few minutes, and this will solve this particular problem. Okay, that makes total sense. But still, how do you work with the delivery semantics when Kafka is not there? Right. Yes, I mean, so then it depends quite a bit on the specific
connectors you were, for instance, to use with Flink. So let's say on the sink side in Flink, you still might use maybe a Kafka sink connector, right? Then the same applies, you could even get exactly-once semantics, actually, because this would be a transactional writer, and Flink would make sure that if you crash
or whatever, no events would be duplicated, would be sent out another time. But, I don't know, if you use maybe a non-transactional sink, yeah, then you probably would have, again, at-least-once semantics, which means it could happen, in particular if there's an unclean shutdown, that you would see some events another time.
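One way to cope with those occasional duplicates is to apply change events as keyed upserts and deletes; a minimal sketch, assuming Debezium-style operation codes (create, update, read/snapshot, delete) and illustrative names:

```java
// Sketch of an idempotent consumer under at-least-once delivery. Event shape and names
// are illustrative, loosely following Debezium's operation codes (c, u, r, d): applying
// the same keyed upsert or delete twice leaves the target state unchanged.
import java.util.HashMap;
import java.util.Map;

public class IdempotentApplySketch {

    // Target state stood in for by a map keyed on the row's primary key.
    private final Map<String, String> customersByKey = new HashMap<>();

    public void apply(String op, String key, String rowAsJson) {
        switch (op) {
            case "c": // create
            case "u": // update
            case "r": // snapshot read
                customersByKey.put(key, rowAsJson);   // upsert: safe to replay
                break;
            case "d": // delete
                customersByKey.remove(key);           // delete: safe to replay
                break;
            default:
                break;                                // ignore unknown operations
        }
    }

    public static void main(String[] args) {
        IdempotentApplySketch sink = new IdempotentApplySketch();
        // The same events delivered twice, e.g. after an unclean shutdown:
        sink.apply("u", "customer-42", "{\"id\":42,\"email\":\"a@example.com\"}");
        sink.apply("u", "customer-42", "{\"id\":42,\"email\":\"a@example.com\"}");
        sink.apply("d", "customer-42", null);
        sink.apply("d", "customer-42", null);
        System.out.println(sink.customersByKey); // {} -- duplicates did no harm
    }
}
```

Since a replayed upsert or delete leaves the target in the same state, at-least-once delivery after an unclean shutdown is harmless for this kind of consumer.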
So consumers in that scenario need to be idempotent, to be ready to receive a change event another time and then discard it in this kind of scenario. Yeah, makes a lot of sense. So okay, Debezium is a kind of middleware, right? Like it is a piece of software that needs to interoperate with at least a couple of different database systems as its input. And, okay, replication was always a very esoteric thing. I remember, for example, replicating using the binlog of MySQL was available for a long time, but Postgres got logical replication in 9.4 or something, yeah. Yeah, so it's
Starting point is 00:25:13 a much more recent thing. Before that they had like the binary replication thing that's okay like how do you interpret this blah blah blah and that stuff. How do you build a system that needs to interoperate in a way and create a common interface with such a diverse and esoteric part of databases? Yes, I mean it's a challenge, right? It's exactly the challenge which the Debezium team has. All those databases have different interfaces, different formats, different APIs, how you would go and extract changes there. So, for instance, in case of the Cassandra connector, it actually reads the log files from the database machine. In case of Postgres, it's a logical replication mechanism.
Starting point is 00:26:07 In case of Oracle, it queries some specific views. So it's different for all the databases. And yes, it requires original engineering for each of those connectors. It requires at least one engineer on the team to dive into the specifics of that particular connector to make sure we understand how this works to understand all the subtleties and stuff like what you describe on RDS this takes a while
to figure it out, right, and realize, okay, this is the situation, this is how we can mitigate it. So it's, I would say, not trivial to do, and this is also why the team is always conscious about adding more connectors, because it means a maintenance effort. You need to have the people around who can understand and work with this code base. And in particular, it means, and this happens every now and then, people from the community come and they suggest they would provide a new connector. And they say, hey, there's this database and we want to build a Debezium connector for it. And would you be interested in making this a part of the Debezium project?
And now, at first thought, this sounds interesting. You would think, hey, we can get support for a new database. But, and it's a mistake which, you know, you can easily make, you need to think about the maintenance implications of this. Are those people, who usually have very good intentions in providing you with this connector,
will they be able to maintain it, at least for some time, right? I mean, I realize nobody can tell what's happening five years from now, so I wouldn't expect anybody to promise they will maintain this for five years. But there must be some common understanding, if somebody contributes a connector, that they are around, at least for a reasonable amount of time, and work on this connector and maintain it. And this is, for instance, what recently just happened with Google Cloud Spanner, which is something I'm really excited about. So the team who work at Google on Cloud Spanner,
their distributed SQL database, decided to build a CDC connector based on the Debezium framework, and they published it as open source as part of the Debezium umbrella. So for me, that's just very interesting to see, how Debezium becomes sort of a de facto standard in this space. Yeah, I've seen that same thing with Trino: yes, we have connectors, it's the same thing. Yeah, and what I think many people and especially new people in the
Starting point is 00:28:36 open-source community don't realize is that when you have a piece of software that is literally used in critical parts of very large and many infrastructures out there, you can't just decommission these things. Yes, exactly. When something gets into the code base, taking it out... Is hurtful, yeah. It hurts. Exactly, yeah. It hurts. Exactly. So if there's no plan to maintain a connector,
Starting point is 00:29:07 it's hard for an open source project that always has limited resources, right? Exactly. Even if you have a company behind you, it's going to be limited resources. Exactly. Always different priorities. Yeah. And without commitments, it makes a lot of sense for the maintainers of this project to say no. Because at the end, making a wrong choice with that can really hurt not the project but the community at the end. Totally. And it also comes to things like testing. So something
like Cloud Spanner, you need to have an instance of that which you can run your test suite against, right? I mean, Postgres or all those open source databases, I can run them locally via Docker, so that's not a problem. But for that, you know, we cannot really, or the Debezium project cannot pay
for having a Cloud Spanner testing infrastructure. 100%. They need to provide that. So this also needs to be part of the conversation. Yeah, 100%. That's yet another example with Trino: you have a connector for BigQuery or you have a connector for Redshift.
Starting point is 00:30:17 How do you test it? Every time you run a test there, you need a Redshift cluster to run, right? Okay, open source projects are not exactly, you know, like drawing the money
Starting point is 00:30:30 in general. Absolutely. So it is hard. Like the operations around like what we in a company take for granted in terms of like
CI/CD or testing infrastructure, all that stuff, are at least an order of magnitude harder in an open source project. And people should be like,
it's not like people don't appreciate your effort to contribute, but contributing to the project is more than the code. Exactly, yeah, totally, totally. It's maintaining it; it's not just the engineering part of writing the code that is important. Anyway, that's a topic on its own. I think at some point it would be nice to get together some people from the open source world to talk about this.
Yeah, totally. I mean, there's also the question of documentation and helping people. I don't know, let's say there's an esoteric database, people use this and they run into issues. So who helps them with that? Because the core team likely cannot be an expert on a gazillion different databases. 100%, yeah. So okay, Debezium has been around for a while, and as you said, it's
becoming more and more, let's say, like a specification in a way. Yeah, I mean, people rely on the format. You know, Flink exposes the Debezium change event format. ScyllaDB, as another example, they implemented a CDC connector
Starting point is 00:31:50 which they implement by the, so they maintain it on their end but it's, again, based on Debezium framework and also they expose
Starting point is 00:31:57 the same change when formed. Yeah. So, but at the same time, I have to say as a user that it still feels that like Debezium is like a piece of software that is like tightly integrated with the Kafka ecosystem.
Starting point is 00:32:11 Yes. And obviously, I don't have any problem with Kafka. But what I want to ask is like, what's the future of Debezium? Right. What's the vision there? Like, how do you see the project like moving forward? Because it seems that it's starting becoming more and more important and more horizontal, let's say. So how do
you handle that? How do you move forward with that? Right. Yeah, I mean, so just to reiterate, I'm not the current project lead anymore, right? So I don't have the power, I don't make the roadmap, right? But I would say there is an effort to evolve more into an end-to-end platform. So one thing which the team works on right now is actually a JDBC sink connector. Because right now you would use it together
with other JDBC sinks from other vendors, which, you know, you need to configure in the right way to make sure they're all compatible. So having a Debezium JDBC sink connector, this definitely will help people to use this more easily and set up
Starting point is 00:33:19 end-to-end flows based on Kafka. Now you mentioned it's tied to Kafka. That's true. Well, you know, there's a strong Kafka consideration, but then also there actually is a component, and I think this will gain in importance, which is called a Debezium server. And Debezium server is,
in terms of your overall architecture, it has the same role as Kafka Connect, so it's the runtime for the connectors, but then this gives you connectivity with Kinesis, Google Cloud Pub/Sub, Apache Pulsar, Pravega, all kinds of data streaming platforms. So people also can use Debezium with things other than Kafka, right? Because, I mean, as you say, I like Kafka as well, I love it, but maybe people are deeply into the AWS ecosystem, they use Kinesis, and they still should be able to benefit from CDC.
So that always was the mission. We want to give you the best CDC platform no matter which data streaming platform you're on. So that's something which is happening. Then, of course, what I also observe is there are actually quite a few services that take Debezium and provide managed
Starting point is 00:34:27 service offerings around it. So Red Hat is doing that. In addition to the classical on-prem product, there is a connector service. But then of course it's also startups like Decodable. There's a few others who take this and provide you with a very cohesive
end-to-end platform, which then also adds the notion of processing to it, which makes this very easy to use, I feel. So I think, I mean, that's not so much about Debezium, the core project, but I feel like more and more people are going to consume Debezium that way rather than running it, because then they don't have to operate it themselves. Yeah, yeah, 100%. All right, and one last question for me about Debezium. So CDC has been around as a pattern for quite a while, right? But based on my experience, at least outside of Debezium,
I haven't seen much of any other open source project that can be used in some kind of production environment, at least, right? Why do you think that this is the case? And do you think that this is going to change? Do you think we are going to see more implementations? So I would say, I mean, there are other
open source CDC solutions, but usually for particular databases. So, for instance, there is Maxwell's Daemon from the Zendesk team, which is a CDC solution for MySQL. So just for that database. And I suppose there are others for Postgres and for other particular databases.
Starting point is 00:35:56 Right now, I'm not aware of any open source CDC solution which really has this intention to be a CDC for all kinds of databases, all the popular databases. So I don't see anything coming up like this. I'm not aware of anything, let's say. There's a few interesting developments. For instance, Netflix, they have
Starting point is 00:36:17 their own internal CDC solution. At some point, they kind of indicated they would open source it, but so far they haven't followed up on this. And this has been quite a while ago. So I don't think it's going to change now. But what they actually did is, and this is why I'm mentioning this, is they published a research paper about their snapshotting algorithms. So this is about taking an initial snapshot of your data and putting this into your streaming platform.
And they wrote this paper where they described this. It's very interesting because it allows you to re-bootstrap tables, to run multiple table snapshots in parallel, and this kind of stuff. And actually, Debezium implemented this solution from the Netflix guys. So, yeah, that's where I see it going right now. Okay, that's awesome. I hope to have you back on the show anytime, really soon, and I would love to see a small panel, to be honest, with people who are veterans of open source, to really communicate some of these things that people
Starting point is 00:37:23 maybe don't understand. And they're feeling a little bit obscure, sometimes even a little bit rude. But I think it's going to be very useful for anyone who is considering contributing to that. It would be my pleasure, for sure.
Starting point is 00:37:39 So let's do that. Brooks, we'll work on putting that together. That'd be great. Thanks so much. This has been a fascinating conversation. You are clearly the authority on Debezium, very active online. If folks want to follow along, how can they connect with you? So they could follow me on Twitter.
I'm Gunnar Morling on Twitter. I'm on LinkedIn, I don't know, it's my name there on LinkedIn. And they also can shoot an email to gunnar at decodable.co if they want to talk about Flink and Decodable, maybe. So yeah, different ways to reach out to me.
Starting point is 00:38:17 Cool. And what about Decodable? Beth, I want to learn more about Decodable. If you want to learn more about Decodable, you totally should go to the Decodable website. There's a free trial which you could use to get your hands on the product. You also could go to our YouTube channel. There's a few interesting recordings there.
Kind of a sneak preview, I'm going to do a new series on our YouTube channel called the Data Streaming Quick Tips. Cool. So you also can watch out for that on the Decodable YouTube channel. Awesome. Cool. Well, that's at G-U-N-N-A-R-M-O-R-L-I-N-G on Twitter. And then G-U-N-N-A-R at Decodable.
Starting point is 00:38:58 Is it Decodable.com? No, it's.co. .co. Right. So Gunnar at Decodable.co if you want to reach out on email. And then check out the YouTube channel. Sounds like some exciting things coming up. Awesome.
Starting point is 00:39:13 Yeah, totally. Thank you so much. Yeah, thanks for coming on the show. My pleasure. Thanks. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by
Starting point is 00:39:38 Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
