The Data Stack Show - 39: Diving deeper into CDC with Ali Hamidi and Taron Foxworth of Meroxa

Episode Date: June 11, 2021

Highlights from this week’s episode include:

- Meroxa is a real-time data engineering managed platform (4:53)
- Use cases for CDC (6:20)
- Meroxa leverages open source tools to provide initial snapshots and start the CDC stream (12:29)
- Making the platform publicly available (14:14)
- What the Meroxa user experience looks like (16:10)
- Raising Series A funding (17:49)
- Easiest and most difficult data sources for CDC (20:23)
- The current state of open CDC (23:16)
- Expected latency when using CDC (29:56)
- CDC, reverse ETL, and a focus on real-time (36:39)
- Are existing parts of the stack replaced when Meroxa is adopted? (39:45)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. All right, welcome back. We are doing something on this episode that we love. And this is when we talk with companies who we talked with a long time ago. And in podcast world a long time ago is
Starting point is 00:00:37 maybe six months or so, which for us is about a season. So one of our early conversations in season one was at a company called Maroxa, and they have built some really interesting tooling around CDC. We talked with Devarious, one of the founders, and we get to talk with one of the other founders, Ali, and then a dev evangelist named Fox today from Maroxa. And they, I think, recently raised some money and have built lots of really interesting stuff since we talked with the various. So really excited to have them back on the show. One thing that I'm interested in, and I want to get a little bit practical here, especially for our audience. One of the questions I'm going to ask is, where do you start with
Starting point is 00:01:21 implementing CDC in your stack? It's useful in so many ways. And it's such an interesting technology. But it's one of those things where you can kind of, you initially think about and you're like, oh, that's kind of interesting. But then you see one use case, and you start to think of a bunch of other use cases. And so I want to ask them, where do they see their users and customers start in terms of implementing CDC and maybe even what does it replace? Costas, how about you? Yeah.
Starting point is 00:01:49 First of all, I want to see what's happened in this almost one year since we spoke with DeVaris, which one year for a startup is like a huge amount of time. And it seems that they are doing pretty well. I mean, as you said, Eric, very recently, they raised their Series A. So one of the things that I absolutely want to ask them is what happened in this past year. And also, I think that just like a couple of weeks ago, they also released publicly their product. So I want to see like the difference between now and then. That's one thing.
Starting point is 00:02:23 And the other thing is, of course, we are going to discuss in a much, much more technical depth about CDC. And I'm pretty sure that we are going to have many questions about how it can be implemented, why it is important, what is hard around the implementation, and any other technical information that we can get from Ali. Well, let's jump in and start the conversation. All right, Ali and Fox, welcome to the show. We are so excited to talk with Meroxa again. We had Devarious on the show six or eight months ago, I think, and so much has happened at Meroxa since then. And we're just glad to have you here. Yeah, thanks. Thanks for having us. Yeah, I'm so excited to talk with you today. Okay, we have a lot to cover because Costas and I are just sort of fascinated with CDC and the way that it fits into the stack. But before we go there,
Starting point is 00:03:17 so we talked with Tabarius, one of the founders. Could you just talk a little bit about each of your roles at Meroxa and maybe a little bit about what you were doing before you came to Miroxa? Yeah. So I'm Ali, Ali Hamidi. I'm the CTO and the other co-founder at Miroxa. And so before starting Miroxa with Devaris, I was a lead engineer at Heroku, which is part of Salesforce, specifically working on the Heroku data team, handling Heroku Kafka, which is the managed Kafka offering. But before that, I've always been working in and around the data space and did a ton of work around data engineering in the past.
Starting point is 00:03:53 And I'll go next. Hi, everyone. My name is Taran Foxworth. I also go by Fox. At Meroxa, I am the head of developer advocacy. I spend most of my time now building material that help customers understand data engineering and Meroxa itself. I also work a lot with our customers, actually understanding how they're using Meroxa and also trying to learn from them as much as possible. In the past,
Starting point is 00:04:17 I ran evangelism and education for an IoT platform. That's kind of where I really jumped into this data conversation because, you know, IoT generates a bunch of data, a bunch of sources. Then I joined Maroxa back in February to really dive into this data engineering world. And it's been such a blast so far. Very cool. Well, I think starting out, it'd be really good just to remind our audience, we have many, many new listeners since the last time we talked with Barakza. Could you just give us an overview of the platform and a little bit about, you know, why CDC? Yeah, sure. So Barakza is essentially a real-time sort of data engineering managed platform.
Starting point is 00:05:00 And so essentially it makes it very easy for you to integrate with data sources, pull data from one place, transform it, and then place it into another place in the format that you want. And so a big part of that for us is really the focus on CDC, which is change data capture. And it's been around for a while, but only recently really gained a lot of interest and a lot of attention. And so really the value of CDC is rather than taking a snapshot of what your source record or database looks like at the time of making that request, CDC gives you the list of changes that are occurring in your database. And so for example, if you're looking at the CDC stream within Postgres, anytime a record is created, updated, or deleted, you're getting that event, and it basically describes the operation. And so it gives you a really sort of rich view
Starting point is 00:05:54 of what exactly is happening on the upstream source rather than just, okay, this is the end result of what happened. It gives you sort of the story of what happened, and it inserts that sort of temporal aspect. There are so many uses for CDC. I'd love to know, is Meroxa focusing on a particular type of use case or particular type of data, sort of as you've built the platform out? Yeah, so we kind of see CDC as, I guess I kind of have two, or an answer in two parts to that question. So one of the things that, you know, led us to focus on CDC is really we were trying to look at the areas where we can add the most value, really apply our expertise and sort of the experience that the team has, and sort of generate the most value for customers. And so one of the areas that we looked at is setting up CDC pipelines, CDC connectors,
Starting point is 00:06:46 has always been really difficult for customers. And having spoken to lots of customers, a typical CDC project can take upwards of six, nine, 12 months, sometimes longer. And it's just an inherently sort of difficult project to get off the ground. And so really that's one of the areas we thought, OK, we can apply our expertise and our automation and our sort of technical skills to make that easier.
Starting point is 00:07:12 And so the goal of the platform, sort of the IP in the platform, is really doing the heavy lifting when it comes to setting up these connectors. And so CDC seemed like a natural place for us to focus on inherently because it was very difficult for people to focus on inherently because it was very difficult for people to do. And so if we can make it very easy, then there's value in that. We also sort of view CDC as the sort of the superset of data integration in the sense that you can create sort of the snapshot view of your data from the CDC stream, But you can't really go the other way. You can't sort of create data where there isn't any.
Starting point is 00:07:47 But you can sort of compact the CDC stream into what the end result should be. And so if you're starting from this richer, more granular stream of changes, then essentially any use case that is covered by sort of the traditional ETL or ELT use case can be also supported by the CDC. But it also unlocks new things. And so a very contrived example, but I think one that explains where the value and the addition of this temporal data is, if look at sort of an e-commerce use case where you're tracking what's happening in shopping carts, then whenever someone adds something to the cart, you could potentially, it's a very naive approach, but you could represent that shopping cart as a record in the database. And then when someone adds something, they increment the number of items to two.
Starting point is 00:08:41 And so that would actually trigger an event that would land in the stream. Whereas if you're looking at just a snapshot, then whenever you happen to look at that would be the number. And so if someone adds something and removes something and adds two of them and then removes something, that's all data that's potentially valuable. And that would land in the CDC stream of what exactly the user did.
Starting point is 00:09:01 Whereas if you're just looking at the snapshot, it's the end result. And so if I added 10 things and then I dropped it to only one and that's what I purchased. And then later when the snapshot happens, you'd only see the one thing. You wouldn't see the intermediate steps
Starting point is 00:09:13 that I went through. And so it's a highly contrived example, but I think it demonstrates the idea of the additional sort of rich data that you're potentially leaving on the table by not using the CDC stream. Ali, I have a question. So usually when CDC gets into conversation,
Starting point is 00:09:31 it's in two main use cases. One is, I think we also mentioned that already, it's about ETL-ELT. And the other one is as part of like a microservice architecture where you use CDC to feed with data like different microservices. Do you see Meroxa so far like being used more on one or the other? So we, I mean, traditionally the approach that we've been pushing and kind of marketing for is the more traditional ELT use case, mainly because I think that's easier to understand
Starting point is 00:10:04 and it's more sort of common for people to kind of wrap their minds around. But the structure and the architecture of the Moxa platform is that essentially the way it works is when you create a source connector, you're pulling data from some database, say Postgres through CDC, and it's actually landing in an intermediate, so specifically Kafka, that's managed by the Moxa platform. And so this is where the second use case or the, I'm not sure if I want to use the term data mesh, because I feel like that's pretty loaded and has a lot of baggage, but essentially the sort of the application use case or microservices use case would sort of fall into place. Because these events, these change events are actually landing in an intermediate that is being managed by us that a customer also has access to. And so,
Starting point is 00:10:50 you know, what we typically see is, is customers will come for the sort of easier low hanging fruit sort of use case of ETL, or ELT, but then sort of almost immediately realize that, oh, actually, once I have this, this change stream in this intermediate that I can easily tap into, now I can leverage it for other things. And so we have some features that kind of make that easier. An example of that is you can generate a gRPC endpoint that points directly into this change stream. And so you can have a gRPC client that receives those changes in real time. And so that kind of falls into the sort of microservices use case pretty well. But it is the same infrastructure. And that's kind of the key for us. We view Meroxa as being sort of a core part of data infrastructure. And so we want to
Starting point is 00:11:34 make it very easy for you to get your data out of wherever it is, and place it into an intermediate, so specifically Kafka, they can then hang connectors off and kind of peek into and really leverage that data for whatever you have. Yeah, that's super interesting. So a question about ETL and DLT. So CDC has, let's say, kind of like limitation, which it's not like the limitation of CDC per se, but it's more about the interface that you have with the database that when you first establish a connection with their application log, you have access to very specific data sets in the past right you don't
Starting point is 00:12:09 actually have access to all the data that to the whole state that the database has and usually when you are doing etl you want first to replicate this whole state and then keep updating this state using like a CDC kind of API. So how do you deal with that at Meroxa? Yeah, so the tooling that we use, we always say, you know, we like to say we stand on the shoulders of giants and, you know, leverage a lot of open source tools. And so the tooling that we use, you know, depending on which data source you're connecting to. So say if you're using Postgres, we're likely to provision a Debezian connector,
Starting point is 00:12:46 sort of behind the scenes. That actually supports the idea of creating a snapshot first. And so it will basically take the entire current state, push that into the stream, and then once it's called up, it will start pushing the changes by consuming the replication log.
Starting point is 00:13:01 And so you do get both. You get like the full initial state as a snapshot, and then you get the changes once that initial snapshot is done. And so that's kind of how we address that use case. Okay. That sounds interesting. So the initial snapshot, it's something that you capture through a JDBC connection? Yeah. Okay. Okay. That's clear. That's interesting. Yeah. It sounds like it makes total sense because you read the initial state and continue updating the state. Yeah, and that's supported across all of the data sources that we natively support. So whether it's
Starting point is 00:13:35 CDC through Mongo or MySQL or Postgres, they all work in a similar way. Do the initial snapshot, and then once that's pulled up, we actually start the CDC stream. Super nice. So I know that your product became publicly available pretty recently. You were in a closed beta for a while now. Do you want to share with us a little bit about what to expect when I first sign up on the product? Some very cool features that you have included there. And I know I'm adding too many questions now, but also like if there's something coming up like in the next couple of weeks or something that you are very excited about.
Starting point is 00:14:14 Sure. Yeah. So we launched, we made the platform publicly available about a month ago and it's available at rockstar.com. You can sign up. We have a generous free tier. Our pricing model is based on events processed. And so you can go in, you can create an account, you can leverage the dashboard or the CLI.
Starting point is 00:14:33 And as I mentioned, really the RIP is making it very easy to get data out of a data source and making it very easy to put data in. And so like an example of that is, I mentioned CDC streams with, with Postgres, but the platform will sort of gracefully degrade its, its mechanism for pulling data out of Postgres, depending on where it's being hosted, what version it's running, what permissions the user has and that kind of thing.
Starting point is 00:14:59 But the command or the process for the customers is uniform. It's basically the same. And that sort of also extends across data sources. So you type the same command, whether you're talking to Postgres running on RDS with CDC enabled, or you're talking to like Mongo on MongoDB Atlas. It's basically the same command, same UX.
Starting point is 00:15:18 And that's really our edge, I guess. In terms of sort of features, really what we're pitching is the user experience. We're trying to make it very, very easy to set up these pipelines and get data flowing. And that's really where a lot of our attention has been focused. That's super interesting.
Starting point is 00:15:36 And can you tell us a little bit more about the user experience? How do we interact? For example, and I guess the product is intended mainly towards engineers right so yeah it's like the whole interaction through the ui only did you provide like an api that programmatically someone can like create and destroy and update like cdc pipelines what are your thoughts around that what is like let's say in your mind also as an engineer right like the best possible experience that an engineer can have
Starting point is 00:16:08 from a product like this? Yeah, so from my perspective, being on the user side of this work for many years, really I felt most at home working through a CLI or some kind of infrastructure automation. I'd love to use something like Terraform or a similar tool to kind of set up these pipelines. For us right now, we've launched with the CLI.
Starting point is 00:16:33 So we have the Markz CLI, which has full parity with the dashboard. And we have the UI itself, which is the dashboard. And so you can sort of visually go in and create these pipelines. We haven't quite yet made a public API available, but something that we're definitely interested in and working towards. We're just not quite there yet. And similarly, I'm a huge fan of Terraform. And the idea of infrastructure as code, I think, is great.
Starting point is 00:16:58 And it's something that we definitely need to address. And that's something that we're looking forward to addressing in the future. But yeah, CLI right now, dashboard through the UI and there's so full parity between the two. Typically the way you interact with it is you'd, you'd introduce resources to the platform. And so you'd add, you know, add Postgres, give it a name, add Redshift, give it a name, and then create a, create a pipeline
Starting point is 00:17:25 and create a connection to Postgres. The platform reaches out, inspects Postgres, figures out the best way to get data out, and starts pouring it into an intermediate Kafka. And then you kind of peek into that and say, okay, take that stream and write it into Redshift now. And the rest is handled by the platform. That's super interesting. By the way, I think we also have to mention that pretty recently you also raised another financing round. Is this correct? Yeah, we raised a pretty sizable Series A. We closed towards the end of last year, but recently announced it with Drive Capital leading our series A. It's been super amazing working with them and the rest of our investors. And yeah, so that enabled us to accelerate the growth of the team, really build out our engineering team and sort of the other supporting resources. So we went from about eight people last October to 27 as of today. Oh, that's great. That's amazing. That's a really nice growth rate. And I'm pretty
Starting point is 00:18:26 sure that I'm still hiring. So yeah, everyone who listens out there, if they want to work in a pretty amazing technology and be part of an amazing team, I think they should reach out to you. For sure. For sure. We're always hiring, always looking for backend engineers, frontend engineers. Yeah. If you're interested in the data space, then we'd love to hear from you. That's cool. All right, so let's talk a little bit more about CDC and the use cases around CDC. So based on your experience so far,
Starting point is 00:18:55 what are the most common, let's say, sources and destinations for CDC? And why also? Why do you think that people are mainly interested at this point in the maturity that the product, the technology has right now are interested in this? Yeah, so at least from our point of view and what we've seen and what customers have been telling us, the most sort of common data sources would be, you know,
Starting point is 00:19:19 Postgres, MySQL, MongoDB, SQL Server, really the operational databases, the most value comes out. So, you mentioned earlier, the two sort of different paths for CDC use, one being ELT and one being like the microservices sort of application type use case. And I think there's a really nice sort of appealing aspect of saying, well, I don't need to change any of my upstream application. If all of the changes are happening in the database, I can just kind of look into that stream and radiate that information across my infrastructure and start taking advantage of it.
Starting point is 00:20:07 And so I think that's why most of the use cases, most of the requests are really around operational data stores. It's interesting. Can you share a little bit of your experience with the different data sources? Like which one do you think it's like the easiest to work with in terms of CDC and which one is the most awkward one? To provide mainly because, you know, my time at Veroku, Veroku was very famously, very strongly associated with Postgres. I'd argue that Veroku Postgres was probably the first solid sort of production grade Postgres
Starting point is 00:20:37 offering that was available as a managed Postgres. And I think Postgres as a product itself is incredible. I think it's really great to work with and its development has been super fast paced, but always very stable. And I think the way that they have implemented sort of replication has made it very, very useful for building out CDC on top of. And so that's, I think personally, that's, that's where I would kind of lean towards. I think the, to get like the premium CDC experience, Postgres is probably the best right now. I know that like MongoDB has done a ton of work with their streaming API and sort of done stuff there to make that super easy too. But yeah, just for simplicity and getting things up
Starting point is 00:21:21 and running, Postgres is great for CDC, mainly because it leverages the built-in replication mechanism. That being said, one of the things that we sort of continually see, and this is probably a good time to bring up the initiative that we're trying to work on amongst some partners and sort of industry peers. CDC itself has come a long way
Starting point is 00:21:43 in terms of what it does and interest and where it can be applied. But I think there's room for us to kind of agree as a community, as a collection of experts that work in the field, potentially some guidelines to make interoperability better. And so you have different companies building out CDC mechanisms, whether it's someone building CDC natively to their product like CockroachDB or someone like the Debezium team at Red Hat for building these CDC connectors. I think there's definitely an opportunity for us to sort of sit around a table and agree on, all right, if I want to provide a great CDC experience, and I want to enable interoperability, so maybe I want to use Debezium on one end, and I want to pour that CDC stream into CockroachDB, let us agree on at least a style of communication, like some kind of common ground between us, so that we can make this interoperability possible and make it easier
Starting point is 00:22:42 for customers to really make use of that. And so one of the things that we've been talking about, and I'll let Plox kind of talk a little bit more about the initiative in general, but we're basically partnering up with some of our sort of industry partners to push the idea of an OpenCDC initiative, essentially, to kind of agree on what it looks like to implement CDC and support CDC, and what it looks like to support it well. Oh, that's super interesting. Yeah, I'd love to hear more about this. So what's the state of Open CDC right now? Yeah, so I'd love to hop in here. I think this has been so informative. I've just been sitting here clapping my hands, soaking in all this good information about CDC. But Open CDC is really, I think, an initiative that's going to drive a lot of activity
Starting point is 00:23:32 and community just around CDC in general. Because like Ali mentioned, there are multiple ways you can actually start to capture this data. Like Debezium, for example, leverages Postgres logical replication to actually keep track of all the changes that are occurring. And the nice thing there is you get changes for every insert operation, update operation, delete operation. But there's also other mechanisms of CDC as well. Like, for example, one connection type is polling. You can constantly ask the database to look for a primary key increment. So when a new ID has come in, that's a new entry or looking at a field may say updated at. So with all these
Starting point is 00:24:12 different mechanisms of actually tracking the changes, some consistent format around systems around, okay, well, if you have a CDC event, you should be able to track, okay, here's what snapshots look like. Here's what creates look like should be able to track, okay, here's what snapshots look like, here's what creates look like, here's what updates look like, here's what deletes look like. And what we can start to do is offer some consistency amongst these systems so that CDC producers and CDC consumers all agree on what they should be producing and consuming. And then that just leads to a great foundation for kind of all the things that Ali was talking about, just the secret sauces of CDC, whether that be, you know,
Starting point is 00:24:52 replicating data, all the way to building microservices that actually leverage these events in, you know, an event driven architecture type of way. So right now, in terms of OpenCDC, we're putting together these standards and this specification. So be on the lookout for something more official soon. But if you have any ideas or something, we would love to hear from you and love to work with you on this initiative
Starting point is 00:25:14 to make sure that this is something that's really great for the CDC community. Yeah, that's amazing, guys. I hope this is going to work out at the end. And obviously, anyone who is listening to this and is involved one way or another in this kind of CDC projects, I think they should reach out to you. Is there some, I mean, outside of Meroxa right now,
Starting point is 00:25:40 are there any other partners that you have that they are part of this initiative? Yeah, one big one is Debezium itself. We talked with the lead maintainer of the Debezium project, because I think Debezium as just a project in general has been so influential in terms of CDC and their format, that JSON specification. It includes things like the schema that is being tracked from the database and the bench and the operation, the things like the transaction number of the database transaction
Starting point is 00:26:14 and in the case of logical replication, right? Like the actual wall line that it was reading from. So they have been one group that we've been working with and Materialize is another. So Materialize, they're a streaming database. And CDC is really important for them because as you're streaming changes and calculating information, that system is very important for how they consume the data and then produce that back out in a meaningful way. So I think, you know, working with the different types. So when you look at CDC in general, you might have like actual products, so say Postgres producing a CDC stream, but you also have like CDC services, say like Maroxa is actually consuming them and get you to do something useful. So I think there's different types of players and companies that we can begin to work with. But those are a couple of the few that we've been having some really awesome conversations so far about. That's super interesting. Folks, do you see value in having in these conversations also the cloud providers, for example? The reason I'm asking is because so far, the way that I've seen products that they are trying to do ETL from
Starting point is 00:27:20 Postgres, MS SQL, MySQL, depending on the cloud provider, the version of the databases, you might be able to perform a CDC or not, right? So there's no unified experience, at least across like the different providers, the cloud providers out there. Do you think it makes sense like for them also to be part of this initiative?
Starting point is 00:27:42 Yeah, I mean, I think it definitely makes sense. You know, we want to try to get as many people on board as possible. And some of the ideas that we've been talking about is how can we classify the, I don't want to say compliance, because I feel compliance is too strong. The idea of we don't necessarily want to enforce a standard, but some kind of categorization of good, better, best of if you are planning enforce a standard, but some kind of categorization of like good, better, best of like, if you are planning to, to leverage CDC, like this is, you know, a really good experience, or this is like the best possible experience where you get all of the operations you want, it's very clear, you get the before and after of the event, you get everything you need. And so, yeah, I think from from my point of view, you know, the more people that are involved,
Starting point is 00:28:22 the more people that adopt it, the more people that are kind of, you know, following our guidelines, I think, the better, the better it will be, and the more likely we'll have sort of successful interoperability. And so I, you know, I can definitely imagine a world where, you know, these bigger cloud providers are kind of not necessarily, you know, changing their formats to match it, but at least, you at least if they're going to build something, if you're going to build something new or integrate something, then why not build against some sort of commonly accepted guidelines
Starting point is 00:28:53 that benefit everyone? That's great. I think you're after something big here, guys. So I really wish you the best of luck with this. And also from our side, as rather stuck, not as a show, I think it would be great to have a conversation about that and see how we can also help with this initiative.
Starting point is 00:29:13 So we should chat more about it. All right. So some questions about CDC again and the experience around using CDC. You are providing, your product is a SaaS solution, right? So it runs on the cloud, Meroxa is connecting on the database system of your customer, and this data ends up on a Kafka topic.
Starting point is 00:29:39 And from there, from what I understand, it can be consumed as a stream using different APIs. What are the, let's say, the expectations in terms of latency that a user should expect by using CDC in general and Merox alike in more particular? Yeah, so with CDC, it's very much dependent on how it's implemented, right? So, you know, I mentioned previously that one of the things that we do is we sort of degrade gracefully in terms of what is possible. And so if you point more opposite at a Postgres instance that's running on RDS
Starting point is 00:30:12 that, you know, has the right permissions and logical replication and everything, then latency is incredibly low because it's basically building on the same mechanism that is used for replication. And so if you had a standby database, typically that's potentially less than a second behind, milliseconds behind in terms of lag for replication. And so we're seeing that data in real time at the same time as all the other standbys are.
Starting point is 00:30:38 And so the end-to-end latency can be also sub-second. But that's like the best case. You know, I mentioned with OpenCDC, like better, better, but good, better, best. This would be sort of the best year where you're really getting low latency, high throughput, sort of low resource impact. But, you know, the end-to-end is obviously very variable because once it's in Kafka, Kafka is very famously high throughput
Starting point is 00:31:02 and low latency as well. So that tends not to be the limiting factor. But what tends to be the limiting factor is what you do with that data. If you're kind of tapping into the stream directly and using something like the gRPC endpoint, the feature that we have, then you could potentially also get it sub-second, see all of those changes that are happening on the database. If you move down to something different, like maybe you're running Postgres that's very restrictive, you've given us a user that has very limited permissions and we aren't able
Starting point is 00:31:31 to plug into the logical replication slot, then we kind of fall back to JDBC polling. And so then you're looking at the longest, worst case scenario with the polling time plus whatever time it takes for us to write it into Kafka. And potentially if you're writing out something else like S3 or something that is inherently batch-based for writes, then you're kind of incurring that additional time penalty. But typically what we see end-to-end is still pretty low. Singleit seconds is quite common. That's interesting. Do you see practical workloads where you also have to take the initial snapshots? Do you see issues there in terms of catching up with the data as they are generated
Starting point is 00:32:16 using CTC? Yeah, that's kind of an area where I think there's definitely room for improvement, both in the way we handle things and tools in general. The initial snapshots can often be very, very large. So obviously, if you use something like Maroxa right at the beginning, it's great because you don't have that much data. But if you come in and are pretty late in the game and you have terabytes of data, then that's terabytes we have to pull in before we can start doing the CDC stream. And so I think there's room for improvement in terms of the tooling, you know, being able to do it in parallel, being able to do things like that would be great. And I know, you know, we're working on things internally and also sort of the upstream providers
Starting point is 00:32:57 like, you know, Debezium and other teams are also working on things like allowing, you know, incremental snapshots and being able to take snapshots on demand and stuff like that. So I think there's definitely still room for improvement. You know, I'd love for us to be able to like seed a snapshot, maybe be able to like preemptively load some historical data and then build on top of it rather than only take the snapshot ourselves and stuff like that. So yeah, I think there's still definitely
Starting point is 00:33:27 a kind of room for improvement there. Yeah, that's super interesting. One more question. CTC is considered like traditionally something that it's related to database system and like a transactional database system as a short, right? Like something like MongoDB,
Starting point is 00:33:47 something like Postgres, et cetera, et cetera. Do you see CDC becoming something more generic, let's say, as a pattern and including also other type of sources there? Yeah, I think, you know, if you kind of squint your eyes a little bit, CDC event is just an event that describes something that happened to the database, right? And so it's really no different to evented systems
Starting point is 00:34:16 if you were building out an application and you're kind of emitting an event from your application that's describing a state change. So really it's the equivalent in functionality or in semantics. And so here is an event, a state change that your database has experienced versus here is a state change that your application has experienced. And so our goal or our belief is that really if we can provide a uniform experience across the two of them, then there's, you know, it may not be necessarily called CDC, because, you know, evented systems as a term has been, you know, has been around for a while.
Starting point is 00:34:52 There's no reason they couldn't, you know, plug into like, any kind of SaaS application or your own, you know, custom application that's triggering these events, that they should be treated, you know, uniformly with the CDC events if you just consider a state change of some sort. Yeah, absolutely. I think the first example that comes in my mind, to be honest, similar related to that is Salesforce. Salesforce lets you subscribe on changes on the object that it has. And actually they call it CDC, to be honest. Don't know how well it works, but it's like a very good example of CDC as an interface with a SaaS application, right?
Starting point is 00:35:30 Yeah. So yeah, I'd love to see more of this happening out there. I think that as platforms like embrace this kind of ways to subscribe to changes and capture that, things will become much, much better in terms of integrating tools. So yeah, that's interesting. For sure.
Starting point is 00:35:47 At least something else about that. Recently, there's a lot of hype around what is called reverse ETL. So there we have the case of actually pulling data out of the data warehouse and pushing this data into different applications on the cloud. Traditionally, data warehouse is not built in a way that emits changes or even allows for many concurrent queries. It's a completely different type of technology. Regardless of that, though, we see that in examples like Snowflake. Snowflake, from what I've seen recently,
Starting point is 00:36:23 they have a way where you can track changes, right? It's not exactly CDC, but it's close to CDC, right? Yeah. Do you see CDC potentially playing also a role in these kinds of applications? I don't know. I think the jury's still out on the reverse ETL. I feel like my, my initial sort of, my initial reaction to sort of the whole idea of reverse ETL is it's kind of a, a, a fix for potentially the wrong problem. I think the reason, you know, people want reverse ETL is because you're, you're kind of following this ELT idea of dump everything raw into your data warehouse, clean it up, process it, put it in a state that is useful for my other applications.
Starting point is 00:37:08 And then now I want to take the data out and kind of plug it into my other components. But I feel like that's kind of too far downstream for us. My thinking on the subject is really if ETL in real time was good enough, if we provided the right kind of tooling, the right kind of APIs, the right kind of interface to do that kind of transformation in real time on the platform in a way that is manageable and sustainable, then it kind of removes the need
Starting point is 00:37:39 for dumping everything raw into data warehouse, doing the processing and then getting the reverse ETL. So an example of this is, because we're putting everything in Kafka, Kafka has retention. And so we could plug in a connector and say, okay, take the last two weeks worth of data, apply this processing, summarize it in this way, do the stream processing, and then take those results and write it into my application. But it also lets you do things like, well, you know, maybe the transformation was wrong. Let me rewind again and try again with a different transformation. And so I think that the task for us is really to build that tooling to kind of make the idea of reverse ETL almost unnecessary by trying to build better tooling.
Starting point is 00:38:25 I feel like ELT and reverse ETL is really a result of having funky ETL tools or tools that really didn't meet the needs or weren't really usable enough or performant enough to achieve that. So we've kind of gone extremely in the other direction of saying, just get everything raw into your data warehouse, and then we'll figure it out later. And so that's inherently not real time. And our focus is very much on real time. And so if we can provide the right tooling and do it upfront and do it on the platform, I think it should hopefully, if we do it well enough, negate the sort of need for having a reverse ETL.
Starting point is 00:39:05 That's a very interesting perspective. What do you think, Eric, about this? Well, I was actually going to ask a question. I was going to follow up with a question. So I'm so glad you asked, Costas. I think, you know, before we get, the reverse ETL topic is one that we love to sort of talk about and debate about on the show. But I think first it'd be interesting both for me and our audience just to hear what are the most common parts of the stack that are replaced with Meroxa when someone adopts the products, or is it
Starting point is 00:39:38 generally sort of a net new pipeline? I think it'd be interesting to know about that. And then I can give my thoughts on reverse ETL. Yeah. So we, I mean, we don't necessarily try to go in and replace, like you don't need to replace anything. Typically the path for using Moroxo is to deploy us in parallel. And so we'll, tap directly into your operational database and start streaming the data into our intermediate Kafka. And then you can start leveraging Kafka and the streams and the streaming data in Kafka to build out new applications or pour it into your data warehouse or whatever it is that you want. And so we use the term data infrastructure
Starting point is 00:40:16 and try to position it more as, we don't view Meroxa as a point product. It's not a point to point connection really. What we're trying to do is get your data out from the various data sources that it's, you know, it resides in and putting it into a flexible real-time intermediate that you can then sort of tap into and leverage for other things.
Starting point is 00:40:40 Yeah, absolutely makes sense. Absolutely makes sense. And I think, you know, the reverse ETL question is interesting because it sort of crosses pipelines, you know, whether it's your traditional like ETL cloud data pipelines or, you know, event streaming type tooling, et cetera, it tends to, you know, so the demands of marketing and sales to get data that's going to help them, you know, sort of drive leads and revenue, et cetera, tend to create a huge amount of demand. And so the first round
Starting point is 00:41:25 of ETL tools, I think is really focused on those, right? You're trying to get, you know, sort of audiences out of your warehouse into marketing platforms, ad platforms, enriched data from your warehouse into your sales tools, salespeople have better insight. But I think Ali, what you, it's been such an interesting conversation because it's the, the idea around sort of streaming data in and out is much, much larger than just sort of those point solutions. And so I think it'll be fascinating to see how the space evolves, especially as technologies like Maroxa become more and more common and we discover all the different use cases. It strikes me as one of those tools, even throughout this conversation,
Starting point is 00:42:09 where you sort of get an immediate use case and then you think about all of these other interesting ways that it could be useful as well, right? Which is so interesting. Yeah, for sure. That's something that we see pretty frequently. Customers will come with a particular use case in mind,
Starting point is 00:42:26 like the most common one is sort of operational data into your data warehouse. But once they have that data flowing, then they have this sort of real-time stream of events coming from their operational database that kind of talks, that includes every change that their database has seen. Then they kind of almost immediately go,
Starting point is 00:42:41 well, now that I have this, I can do these other things. Like maybe I'll tap into the same stream, transform it and keep my Elasticsearch cluster up to date in real time while I'm at it. And so then like once you do that, then you're like, oh, well, actually I can use this to make a webhook that hits my partner company whenever this particular thing happens, because now I don't need to change my infrastructure to, I don't need to sort of custom instrument anything. I'm just looking at the raw events and I can kind of tap into it and really leverage that. Yeah.
Starting point is 00:43:11 So one of the things that you mentioned, like the reverse detail idea of, you know, enriching data for, for use of marketing, I think, you know, we don't currently have this functionality, but, you know, I just want to kind of feed the thoughts of the audience and you. Imagine you jump forward some amount of time, and if we or someone like us can make it super easy to do cross-stream joins and enrich data in real time, then do you really need to put your data into a data warehouse and then pull data from Salesforce and then pull data from Zendesk and then join them across the thing, join them across all of the tables and then write them out into something else.
Starting point is 00:43:50 Whereas if you were able to do it in real time by doing stream joins and hitting third party APIs to enrich those records and create a fat record that you can then plug straight into Salesforce, I think it would be hard to argue that... I can't imagine anyone saying, you know what, this real time is just way too fast. I wish it would be hard to argue that, you know, I can't imagine anyone saying, you know what, this real time is just way too fast. I wish it was taking me several hours. Like this is just too responsive. So I feel like the task is not, it's not a question of whether anyone would want that. I think that's clear. It's whether or not anyone like us or someone else can make it happen in a way that's easy to use. Sure. I think that's really the task. Yeah, you know, it's interesting on our,
Starting point is 00:44:30 may have been our last episode or the one before, we kind of mentioned these different phases, right? So you have sort of the introduction of the data warehouse, cloud data warehouse, sorry. Introduction of the cloud data warehouse, which sort of spawned this entire crop of pipeline tools, because now all of a sudden you needed to unify your data, right? And now you have sort of the next round of that where you're seeing reverse ETL and, you know, sort of different event streaming type solutions. And it's interesting because a lot of the sort of new technologies spawn new use cases and spawn new technologies.
Starting point is 00:45:07 And so I think it is fascinating to think about a future. And this is actually something we've been discussing a lot at Rudderstack where Costas and I work, where currently we live in a phase where there's heavy implementation of pipelines. And if you imagine a world, which you talked about, Ali, where the use case is the first class citizen and the pipelines are an abstraction that you really don't even sort of deal with in terms of setting up a point to point connection. I think that's where things are going. And I think the type of sort of cross stream joins you're talking about are fascinating because then you sort of get rid of all of this manual work to create point to point connections,
Starting point is 00:45:57 which still, I mean, it's very powerful to sort of do all of that in warehouse. But if you can abstract all of that and just give someone a very easy way to activate on a use case and not have to worry about the pipelines because all that's happening under the hood. I mean, that's that opens up so many possibilities because you get so much time and effort back. Yeah, I mean, for sure. You know, you hit the nail on the head there. That's really that that's that's the use case that we're trying to address. That's the problem that we're trying to solve. You know, and that's the world that we're trying to head towards. Very cool. Well, unfortunately, wow, we are, we're actually over time a little bit. This has been such a good conversation. Well, thank you so much for joining us on the show. Audience,
Starting point is 00:46:36 please feel free to check out maroxa.com. And we'll check in with you maybe in another six or eight months and see where things are at. Thanks again. Yeah, sounds good. Thank you so much for having us. Well, Meroxa is just a cool company. And now having talked to three people there, they just seem like they attract really great people and great talent. So that's always a fun conversation. I'm going to follow up on their answer to my initial question.
Starting point is 00:47:02 And I thought it was really interesting. Some technologies, let's say you change data warehouses or you change some sort of major pipeline infrastructure in your company, that can be a pretty significant lift. And it was really cool to me the way that they talked about how their customers are approaching implementing CDC. And it really was around if you need to make some sort of change or update to some sort of data feed, then you can replace that with Meroxa. And so that's what they see a lot of companies doing. And I think that makes CDC a lot more accessible as sort of a core piece of the stack, as opposed to going through some sort of major migration.
Starting point is 00:47:42 What stuck out to you, Costas? Yeah, two things, actually. One is about this great initiative that they have started, which is the Open CDC. I'm very interested to see what's going to come out of this. Just to remind our listeners about it, it's about an initiative that will help standardize the way that CDC works and mainly about the
Starting point is 00:48:06 messages and how the data is represented. So it will be much easier to use different CDC tools. Anything that is open is always a step forward in the industry. It remains to be seen how the industry and the market is going to perceive that. So that's a very interesting part of our conversation. The second one was about reverse ETL. And the comment that Ali made that actually, if you implement CDC and ETL in general in the right way,
Starting point is 00:48:36 you don't really need reverse ETL, which is a very interesting and a little bit controversial opinion if we're considering how hot reverse ETL is right now. So again, I'm really curious to see in the future who's going to be right. So it was a very exciting conversation and I'm looking forward to chat again with them in a couple of months. Sounds great. Well, thanks again for joining us on the Data Stack Show and we'll catch you next time.
Starting point is 00:49:02 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
