The Data Stack Show - 04: Relational to Real-Time with Change Data Capture with DeVaris Brown of Meroxa

Episode Date: September 2, 2020

In this episode of The Data Stack Show, Kostas Pardalis and Eric Dodds talk change data capture (CDC) with DeVaris Brown, co-founder and CEO of Meroxa. Their conversation digs into the benefits of utilizing CDC and how Meroxa is using it.

Highlights from the conversation include:
Introduction to DeVaris and Meroxa (3:24)
Why CDC has more traction today (6:58)
How CDC is changing the way we build products (12:52)
Where CDC is playing an important role (21:11)
The experience that Meroxa delivers (24:42)
Looking at Meroxa’s sources, technology and data stack (27:28)
DeVaris’ vision for the company (37:10)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome back to the Data Stack Show. Today we're going to talk about change data capture, which as a concept isn't a brand new idea in terms of databases, but is pretty interesting as far as the way that companies are leveraging it. And on today's show, we're actually interviewing the founder of a company who's built a product on top of the concept of change data capture. So like RudderStack, they are building a product that fits into the data stack. So I'm excited to hear about the company.
Starting point is 00:00:42 Kostas, do you want to explain a little bit about what the product is and who we're talking to today? Yeah, of course. Today's episode is all about CDC. As you said, CDC is not a very new concept. I mean, as a programming pattern it has been around for quite a while. But today we are going to be talking about how CDC fits in the space of databases and how this can completely change the way we interact with the data that the database holds.
Starting point is 00:01:12 So with DeVaris today, we will go through the product that they built at Meroxa, which is actually a service that can be attached to your database and turn your transactional database, like a PostgreSQL or MS SQL or MySQL, into a stream of data. And that's very interesting because it's quite a new concept in terms of doing this on the database.
Starting point is 00:01:47 But it gives a lot of flexibility and enables, as we will see, a large number of use cases for working with the data that weren't possible before. And any use case where data needs to be real-time can be enabled with this technology, which is very fascinating, I believe, especially not that much for engineers, to be honest,
Starting point is 00:02:12 because, of course, engineers are aware of CDC and this pattern of turning every possible interaction as part of a software into a continuous stream of changes. But especially for people in marketing and other departments who have been used so far to working with very high latencies in terms of the data, that might be from hours to days, even weeks in some cases. And here, I think that marketing people are going to be extremely happy to hear that they can get access
Starting point is 00:02:41 to customer behavior that is captured in their database within a matter of seconds. So we are going to go through some details about the technology and the product itself. It's a brand new product. And yeah, let's see what DeVaris has to say about it. Hello, DeVaris. It's very nice to have you here today on this episode of the Data Stack Podcast.
Starting point is 00:03:10 Very excited for the conversation we have ahead of us. Can we start by giving us a small background about you and Meroxa, the company that you work for? Yeah, yeah. So thank you again for having me, Kostas. It is an extreme pleasure to talk with you. I mean, I know we met last week under different circumstances, but I'm definitely appreciative of the opportunity to talk with you
Starting point is 00:03:37 about what we're doing at Meroxa. So prior, you know, I'll start at the beginning and kind of work my way up. But prior to Meroxa, I was a product manager at Heroku, building features for developer experience. So if you've used Heroku in the past, like, few years, and you've used, you know, Review Apps or CI/CD or ChatOps or GitHub integrations, any of that stuff,
Starting point is 00:04:03 like, they came out of the work that the team I led did. And so I like to say that the products that I've worked on have powered Silicon Valley because before that, I was a product manager at Zendesk as well. And so I was on their API and apps platform. So I know a lot of folks that kind of use Zendesk or Heroku in their startup stack. And then way back in the day, I was a software
Starting point is 00:04:28 engineer focused on building scalable systems at Microsoft. So I've worked on Windows Azure and what was called Red Dog at the time. Worked on Hotmail. That was like my first
Starting point is 00:04:43 assignment. But yeah, mostly, I know, man, like every time I see somebody with a Hotmail address now, I tell them thank you. It's like you bought my mom a couch or something like that, you know? So I appreciate it. Yeah, man, that's me. So yeah, that's pretty much what I've done in the past.
Starting point is 00:05:07 And then now I'm working at Meroxa. And so Meroxa is a real-time data platform. It's a service company. We just believe that everybody should have access to real-time data infrastructure. And it shouldn't be relegated to people who just have, you know, huge data teams and unlimited amount of money to engage some of these bigger software. So we want to make it as accessible as Heroku was
Starting point is 00:05:33 for web app development. So it's kind of what we're doing. I'm the CEO. I have a co-founder, CTO. His name's Ali Hamidi. We worked at Heroku together and basically saw the twinkle in each other's eye around like, hey, we should probably be doing this type of experience work for real-time data systems. So, yeah, that's me in a nutshell.
Starting point is 00:05:53 That's great. That's great. By the way, I forgot something really important. I forgot to introduce that today we have also Eric with us. So it's going to be three people participating on the podcast. So, yeah, Eric is also with us. And moving forward, I mean, it's very interesting what you just said about what Meroxa is doing. And everything, I mean, I think that what we are talking about here is something that's very commonly known in the engineering space as CDC, right? Like change data capture. So let's discuss a little bit more about that. I mean, it's some kind of like, let's say, buzzword or
Starting point is 00:06:33 like a pattern, a development pattern, that has been around for quite a while. There are products out there, like Debezium, for example, that have been developed for quite a while and they are getting more and more traction. But let's start with the basics. So what is CDC, why is it important, and why now? Like why do we see today that there's more traction around using this kind of pattern? Yeah. So, I mean, if you look at, you know, CDC stands for change data capture. And if you look at like how the world has been structured previously,
Starting point is 00:07:08 it's all been around tracking events. And so what you would need to do is basically if I work for Segment or if I choose Segment as a platform, I literally would have to go and spend a few weeks building out these like event plans and figuring out all the things that I want to capture in an event and then have that get sent to like Segment. And then you have like this buffet list of things that you can kind of go out and then usually takes around, you know, a couple of months for like this stuff to get done end to end. Right.
Starting point is 00:07:40 But the main things to remember is like, it requires a change to your application to get this stuff to work. And so, you know, what's been happening for years behind the scenes is that people are actually tailing the logs of their database, right, and doing this in a way that it basically captures every single transaction that happens in the database. Because, you know, whether or not I make a call to an API in some way, shape, or form, that information is going to end up in the database, right? And so you essentially have a list, a very granular list of events and things that are happening on your system with respect to data. So people just say, yo, change data capture as a pattern makes a lot more sense because now I can get that high-grade fidelity of information from the database rather than having to go instrument my app for events.
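To make the idea of a "very granular list of events" concrete, here is a minimal sketch of what a database change log might look like to a consumer, and how replaying it recovers the current table state while the history stays available. The `ChangeEvent` shape and field names are assumptions made for illustration; they are not the format of Debezium, Meroxa, or any specific database.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change (hypothetical shape, not a real CDC wire format)."""
    op: str                          # "insert", "update", or "delete"
    table: str                       # table the change belongs to
    key: int                         # primary key of the affected row
    after: Optional[dict[str, Any]]  # row state after the change; None for deletes


def replay(events: list[ChangeEvent]) -> dict[int, dict[str, Any]]:
    """Fold a change log into the final table state, keyed by primary key."""
    state: dict[int, dict[str, Any]] = {}
    for event in events:
        if event.op == "delete":
            state.pop(event.key, None)
        else:
            # Inserts and updates both carry the full row state in `after`.
            state[event.key] = dict(event.after or {})
    return state


# Four changes happened, but a snapshot query would only ever see one row.
log = [
    ChangeEvent("insert", "users", 1, {"name": "Ada", "age": 35}),
    ChangeEvent("update", "users", 1, {"name": "Ada", "age": 36}),
    ChangeEvent("insert", "users", 2, {"name": "Bob", "age": 41}),
    ChangeEvent("delete", "users", 2, None),
]
final_state = replay(log)
```

The point of the sketch is that the log holds strictly more information than the final state: the age change and the short-lived second user are visible in `log` but not in `final_state`.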
Starting point is 00:08:29 Now, you're basically kind of kicking the can down the road eventually, right? Because, you know, now I don't necessarily have to go define a schema or any of that stuff. I just pull in these raw events, and then I do some processing and dump it into, like, a data warehouse. And now I can write SQL at that point, right? And so it really like lessens the developer time dealing with, you know, how do I capture this information? And so, you know, kind of what Meroxa saw
Starting point is 00:08:56 was that like, you know, at Heroku, we use Debezium behind the scenes, right? But we saw that change data capture wasn't always the best way to get information out of the database, right? So you think about, or, you know, even information from a data source in general, right? So, you know, a lot of these companies don't, you know, SaaS companies or any of that, they don't give you access to the underlying data source underneath. And so you need to have other schemes as to how you can pull data.
Starting point is 00:09:25 Also, with those data sources, you have variations as to what capabilities that they have, right? So if I'm using anything less than Postgres 10.4, you know, I can't use pgoutput, I can't use a logical replication slot. So CDC doesn't necessarily work, right? And same thing on the MySQL side, right?
Starting point is 00:09:44 And so what Meroxa did, essentially, behind the hood, is that we basically built, like, a decision tree to understand, like, okay, if, you know, the data source meets these requirements and we have these specific sets of permissions, then we can use change data capture and pull from, you know, Debezium and, you know, pull that into whatever else that we need down the road. But if it doesn't, then we can use JDBC client and write a SQL query there.
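The decision tree described in this part of the conversation (CDC when the source supports it and permissions allow, otherwise a SQL query through a JDBC-style client, otherwise a last-resort fallback that the transcript appears to describe as polling) could be sketched roughly as follows. This is only an illustration under assumptions: `SourceInfo`, its fields, and `choose_strategy` are hypothetical names, and Meroxa's actual checks are not described in the episode.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    CDC = "cdc"          # stream changes from the database log
    SQL_QUERY = "sql"    # run a query through a JDBC-style client
    POLLING = "polling"  # last resort: poll the source on an interval


@dataclass(frozen=True)
class SourceInfo:
    """Hypothetical summary of what a data source allows."""
    supports_logical_replication: bool
    has_replication_permission: bool
    accepts_sql_connections: bool


def choose_strategy(source: SourceInfo) -> Strategy:
    """Pick the richest ingestion strategy the source can support."""
    if source.supports_logical_replication and source.has_replication_permission:
        return Strategy.CDC
    if source.accepts_sql_connections:
        return Strategy.SQL_QUERY
    return Strategy.POLLING


# An older database without logical replication still gets a SQL-based path.
legacy_db = SourceInfo(
    supports_logical_replication=False,
    has_replication_permission=False,
    accepts_sql_connections=True,
)
chosen = choose_strategy(legacy_db)
```

Whatever strategy is chosen, the downstream consumer would see the same kind of stream, which is what lets the platform hide source-specific quirks.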
Starting point is 00:10:09 Or if that doesn't work, then we just fall all the way down to polling. And it basically just looks the same either way, any connection scheme that we have. And so that gives us the ability to not focus on the bugs and the quirks that are in the underlying data sources. We just deal with it from a platform perspective. The main thing is
Starting point is 00:10:30 that change data capture gives you so much more fidelity and granularity of the data that's flowing through your system versus I'm just going to lift and shift. If I do a select all from orders, a select star from orders, a select star from users,
Starting point is 00:10:46 I just see the end result in my data warehouse. Versus with change data capture, I can see like, oh, there was an upsert where, you know, Kostas, you know, changed the age of this person, blah, blah, blah, blah, or like, you know, those types of things. And I'm seeing those changes in real time versus just the end result. Yeah, I thought it was very interesting. And it's a very refreshing way of looking into how to interact with data
Starting point is 00:11:10 and how to integrate data together. I mean, I'm coming from like the more traditional ETL way of dealing with data and moving data around. And I remember that with all the data warehouses out there, like one common modeling problem that always existed is how we can mix the transactional data that we get from the database together with the event streams that are coming
Starting point is 00:11:30 from clickstream data or like the events that we are capturing on a website. And the traditional way that we tried to solve this problem is by turning, let's say, the event streams into something that more closely resembles the tables that you had coming from the databases. But it didn't really scale that well or work that well because
Starting point is 00:11:52 you lose something very important, that's the time resolution or the time dimension that you have when you are dealing with CDC-created data or like event streams in general. And for me, it's surprising in a good way to see that probably, at the end, the solution to this modeling and data problem is actually to turn the database into an event stream instead of trying to figure out how to effectively save the event streams
Starting point is 00:12:17 into a relational database. So that's something that really excites me, to be honest, with what you are doing at Meroxa. And I'm really happy and excited to see how people are going to use it. Let me move to the next question, which is about, how do you see change data capture affect the way that we develop and build products, even if these products are like software products, or as we started using this term
Starting point is 00:12:44 more and more recently, like data products. So the things that we do with the data and turn them into value inside the organization. Yeah, I mean, I think, you know, everybody's kind of stuck in this batch world now, right? Which is like, you know, just let me write a SQL query
Starting point is 00:13:02 and I'll populate a data view, right? And so the problem is, is like SaaS tools have figured out that like real-time and event-based architectures can provide better customer experiences, right? So, you know, we were talking to, you know, I had a proof of concept with a hospitality company out of Vegas. And you think about like a weekend in Vegas, right? And for a VIP customer, you know, you go to MGM Grand or the Wynn or something like that, you know,
Starting point is 00:13:30 you see the separate line for like the Pearl member or whatever it is called. And basically like anytime a VIP person wants to check in, I want to send them a list of specialized offers, right? And personalized offers. Well, the company that we were talking to, they're basically doing like a, you know, daily backups and dumping that into their data warehouse.
Starting point is 00:13:51 All right. So you think about a weekend in Vegas, I get there on a Friday and then I leave on a Sunday. All right. Because that backup job takes 18 hours and fails over 70% of the time, less than 10% of the people were actually just seeing that offer in time. All right. They had Braze or Iterable and all these kind of like marketing automation companies installed.
Starting point is 00:14:17 But just because they couldn't get the data in a format in real time, they were losing over eight to nine, you know, figures of revenue at the end of the day, right? And it's just like, you know, some salesperson from, you know, the Hadoop world or the Spark world probably sold them something. And I mean, which is the case, and it's taken them basically like a year to get it all in the same place. But the problem is, is like, you still need your data to get into those systems for it to actually be useful for you. And so with change data capture, you can start getting like finer grain fidelity than what you would with the database backup, right? Which is like, oh, I saw this person check in. But like, let's say, you know, you got IoT devices all around your casino or your hotel property, right? Now you can say, oh, I see this person checked in over by the slots that's close to this restaurant. I received this, you know, this event from change data capture, right?
Starting point is 00:15:13 Or, you know, it hits the database because of my IoT devices. And now I can basically send them an offer that's like, oh, you're close to this restaurant. Here's a deal for you, right? Like, you wouldn't be able to get that with my, you know, daily backup job that takes 18 hours and fails over 70% of the time. And so these like data-driven experiences, people are just more used to this. I mean, your end users and customers are just used to having highly relatable, highly personalized content. And you need to have that real-time capability, not like batch or like micro-batch.
Starting point is 00:15:49 You basically need to know, like, at this moment, what is the most relevant information that I need to make a decision to put in front of a customer so that they can act on, right? And, like, that's the big shift that's happened. So, you know, you look at all of the, you know, the bandwidth from networking like 5G and all that type of stuff, like you literally have a ton of data at your disposal that you can use to provide a better end-user customer experience,
Starting point is 00:16:16 have better analysis, all of that. And it's like this information is already sitting in your database. You're just not utilizing it because you don't know the best way to get it out. And so, you know, for us, it's just like, no, we'll just provide an easy way for you to get that out initially and ongoing so that you can just use it however you want to, whether it's search, recommendations, marketing automation, you know, data warehousing, analytics, all that type of stuff.
Starting point is 00:16:39 It just makes more sense to continuously pull that stream versus having to do these huge incremental backup jobs. Would you say that it seems like Meroxa enables you to get
Starting point is 00:16:58 extremely modern functionality out of your existing system without having to overhaul the entire pipeline, right? Because you can achieve real time by building like an extremely sophisticated pipeline. But I mean, doing that at an enterprise organization is, you know, a year long effort or more, just because it touches so many things. Is that, would you say that's a strong use case for Meroxa? Yeah, yeah, it is.
Starting point is 00:17:26 I mean, like that's literally our use case. Is that like you just point us at a data source and we figure it out and we give you the ability to kind of multiplex that data stream into a bunch of different places. And so you don't have to like rip and replace. You can just have us alongside and just say, yo, I want this to be my GDPR/CCPA data pipeline, right? And it's like data flows from our database, I do some
Starting point is 00:17:52 streaming processing, and I put it out to the world and, you know, into my data warehouse, into a real-time API that we provision. But you still have your, you know, production data sources, right? And like, you know, you don't necessarily need, like the whole point for us was to evolve the conversation from integration to orchestration, right? Because we saw a bunch of data engineers, I interviewed a bunch of data engineers and they're like, look, 70 to 80% of my time is just putting this stuff together, figuring out,
Starting point is 00:18:20 you know, how to, you know, get my Kafka to talk to, you know, my Salesforce and then put this stuff either in S3 or Snowflake or like, just all that stuff takes time and you need to have expertise on each one of those data components. But we're just like,
Starting point is 00:18:38 yeah, just assume all this stuff could connect. Just figure out where do you want your data to be at this point? And like, reducing that kind of operational overhead for maintenance and provisioning, it just allows you to be more expressive and experiment more and just be able to add more customer value with these kind of data-driven experiences. So as the end user, let's just take DeVaris's example, just because I'm just thinking of so many situations I've faced in the past
Starting point is 00:19:08 where having that functionality would have been amazing. But let's say I'm running Braze and I'm responsible for the personalized offers. Like if I'm running the personalized offers program, does anything change for me when you enter the picture? Or is it that I'm just now getting a real-time data feed that I can leverage? Yeah, you're just getting a real – so a couple ways that it can change for you is that because you can do the preprocessing inside of the stream itself,
Starting point is 00:19:42 you don't necessarily have to worry about downstream dependencies and, and all of those types of things, right? Like it just, it just, you, you can do, you know, your concatenation, your augmentation on the stream. And then, so by the time it hits Braze, you have a better way of actually like targeting your customers at that point. Right. And so, you know, it's just one of those things where it's just like, because you have this real time data, you can actually, because Braze splits out campaigns into recurring campaigns and then point-in-time campaigns.
Starting point is 00:20:15 So if a user takes a specific action, now I can do this thing and opt them into these campaign buckets. So if you start thinking about it, now I can basically define at a finer granularity what that action is and have more associated data to say, okay, well, here's the type of experience that I can provide them now because I'm sending up a specific type of data stream to Braze. Very cool. That's great.
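As a rough illustration of the in-stream preprocessing described in this answer, the sketch below augments each event with profile data and assigns a campaign bucket before anything reaches a downstream marketing tool. Every name in it (`enrich`, the `tier` and `campaign` fields, the `vip_welcome_offer` bucket) is hypothetical, and no Braze or Meroxa API is called; it only shows the shape of the idea.

```python
from typing import Any, Iterable, Iterator


def enrich(
    events: Iterable[dict[str, Any]],
    profiles: dict[int, dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Join each event with profile data and tag it with a campaign bucket."""
    for event in events:
        profile = profiles.get(event["user_id"], {})
        tier = profile.get("tier", "standard")  # unknown users default to standard
        is_vip_check_in = tier == "vip" and event["action"] == "check_in"
        yield {
            **event,
            "tier": tier,
            "campaign": "vip_welcome_offer" if is_vip_check_in else None,
        }


profiles = {7: {"tier": "vip"}, 8: {"tier": "standard"}}
stream = [
    {"user_id": 7, "action": "check_in"},
    {"user_id": 8, "action": "check_in"},
    {"user_id": 9, "action": "check_out"},
]
enriched = list(enrich(stream, profiles))
```

Because the function is a generator, it processes events one at a time as they arrive, which matches the streaming framing better than a batch join would.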
Starting point is 00:20:46 So I think it's pretty clear that marketing is like a very big use case of using real-time data. And I think latency is also something that is important for marketing. Have you seen so far any other use cases for CDC in general, but also specifically for Meroxa
Starting point is 00:21:03 and your product? Where else you have seen CDC playing a more important role nowadays? Yeah, I think for us, the use cases that we see are instant data warehouse. If I pick one of these ELT or ETL solutions off the shelf, I essentially have basically, you know, I'm kicking the can down the road to get this information to my data warehouse, right? And so I have to basically just run these like batch jobs. And so it just kind of runs, you know, you kind of run into race conditions depending on how big your data sets are. And so for us, the main thing is people are just like, I just want my data warehouse to be an accurate reflection,
Starting point is 00:21:53 a real-time accurate reflection of our entire data picture. And so that's the number one use case that we have. Other use cases that we have is like you you know, you think about a platform engineer that uses data, right? So, you know, data warehouses are not transactional. They're absolutely terrible for that. So, usually what I do is I write some SQL queries to build data marts, and then the platform engineer basically builds an API on top of those data marts for like, you know,
Starting point is 00:22:21 dashboards or, you know, whatever you may need. Well, what Meroxa does underneath the hood essentially is like we give you the ability to point your data stream to an API endpoint. So now you essentially have gRPC, GraphQL, or RESTful API that is consuming this real-time stream. And so you think about like real-time compliance at that point, dashboards, real-time search indexing, right? Like, I can point that stream over to, you know,
Starting point is 00:22:51 my Elasticsearch cluster, my Algolia cluster, and continuously do these things, right? Like, you think about this from an e-commerce aspect, hey, I might want to have real-time inventory, right? Like, you know, you go to a website, and you're like, dang, I really like this click. And then when you click add to cart, it just says, oh, sorry, it's out of stock right now, right?
Starting point is 00:23:11 Like, you know, that's just a terrible customer experience to have. And so having real-time data kind of facilitates all of these interactions. And I think that's the whole point of change data capture, right? It's like, you know, you start pulling these changes in real time, you do some little stream processing, and it gives you the ability to have an accurate reflection in any way, shape, or form that you want to have your data take place inside of your application.
Starting point is 00:23:37 That's the real power. And unfortunately, like a lot of these other tools, they'll pull stuff on regular intervals, but you'll end up in like, it's 30 minutes, I believe is like the default standard for pretty much everybody else. And if you want to pay extra two grand, you can get it down to five minutes. But then it's like, at that point, it's like, okay, well, a lot can happen in five minutes, right?
Starting point is 00:23:58 Like, especially around Black Friday or like some sort of sale or something like that, right? Like, you just want to make sure that you have an accurate reflection. And that's, that's really the, the, the, you know, the benefit of using change data capture. Yeah, that's great. So I think it's also like a good point right now to discuss a little bit more and get some more information from you about the product itself.
Starting point is 00:24:21 So can you, you know, help us understand a little bit better how's the experience with the product itself? I mean, what someone can do? What is your user? First of all, I mean, I assume it's probably someone with more technical background, but who is the user of Meroxa? And what kind of experience you deliver to them? Yeah, I mean, our users are data engineers for right now. Data engineers or data-aware engineers. Like, you know, an engineer that kind of knows system design that might get assigned, you know,
Starting point is 00:24:52 the duty of making pipelines for a sprint or two, right? Like, we want those people. And we're focusing on SMB and mid-market specifically because at the top end of the market, I mean, you got a ton of folks, right?
Starting point is 00:25:09 Like Nexla, Ascend, StreamSets, Confluent. I mean, like there's a ton of people doing that as well as the open source community, right? Like, so you got, you know, Netflix putting out open source stuff and all these things. Google, all the big substrates have their own vendor-specific way that you can kind of do these things. And I think the real reason we saw this was like, that one or two person team and making that experience super easy for them to build this infrastructure and not really have to think about it. or do I just like buy one of these off-the-shelf solutions and then spend like a million dollars in six to 12 months building it
Starting point is 00:26:08 or do I just pick up something like Meroxa and it's like as fast as I can type a command, I have the same thing that like Netflix and Slack and like kind of all these bigger companies are using behind the scenes, right? And so like that to us is kind of our advantage, right? Like we're decidedly focused on the smaller folks so we can build a, you know, build a community around real-time streaming, right? Because a lot of people don't even know that like this is possible for them to do, right?
Starting point is 00:26:35 It's like the first thing they do is like, oh, I'm just going to pick up Segment and that's going to give me the ability to, you know, kind of get these events and all that type of stuff. And it's like, yeah, kind of, right? Or RudderStack, right? Like, you know, they can do that, but it's like, that also is a huge engineering leap for you to start instrumenting your app and like thinking about all the events
Starting point is 00:26:55 and, you know, making sure that you have your metrics playing in order because you're basically going to have to end up doing ETL anyway, because, oh man, I forgot to put this field in the user login event. Like, you know, like that type of thing, right? Like that happens all the time. So it's just for us, it's just kind of like, it's just an easier way for us to attack that
Starting point is 00:27:14 kind of SMB audience. And that's what we really want to go focus on. Makes sense. So what kind of sources you're supporting right now? What kind of technologies I can turn into a stream of changes that I can capture? You can turn an API endpoint into a
Starting point is 00:27:32 stream. Then we have a list of databases. It's Cassandra, Postgres, MySQL, Mongo, Oracle, and I think MS SQL is on the way. And then you can point it to basically like a URL
Starting point is 00:27:51 or webhook address, right? And so you can do that. Kafka itself, you can point us to a Kafka stream, S3, file stores, GCP, all the AWS stuff, GCP stuff. We got basically kind of in Snowflake, right? So we're kind of creating that world. And how can I access the data that have been, I mean, these kind of streams of changes,
Starting point is 00:28:23 like how I can interact with them and how I can access them, and most importantly, how can I integrate this in my product? Yeah, so basically the way that you interact with the data is you have a few ways. So you can basically, underneath the hood, we go from CDC into a Kafka cluster. So you can actually access the raw topics if you want
Starting point is 00:28:46 to, right? Like it's just Kafka underneath the hood. And then you have the ability to query. So you can basically with any resources you provision, we provide a slash query endpoint. You can write ANSI SQL to query that data. And then we actually
Starting point is 00:29:01 expose that stream eventually, you know, if you want to create an API endpoint, right? Like, so we auto-provision the API endpoint for you, or you just dump it into, like, an S3 or file storage, and you can use, you know, kind of like Presto or Athena to go through your data lake, or you can put it
Starting point is 00:29:17 into a data warehouse, whatever one you pick. You know, we support Redshift, BigQuery, and Snowflake, and so you can just use SQL there as well. And then there's also an interesting way, too, is if you want that output to be code, so we expose functions as well. So you can point the stream to a function,
Starting point is 00:29:35 and then that can infinitely do whatever it is that you need to there as well via code. So you have a few different ways that you can access that information. Okay, that sounds great. So you mentioned a few things about the underlying technologies that you are using to build the product. Can you go a little bit deeper into that? Like what kind of technologies that you are using? What kind of frameworks? You mentioned Kafka. At some point, you mentioned Debezium. I don't know if this is also part of your stack.
Starting point is 00:30:06 Can you give us an idea of what's the stack like behind Meroxa? Yeah, definitely. I mean, all this will be open source in a couple months anyway. So the real magic that we found was not in the data plane. We just basically have a curated set of open source projects where we know it can function on this job. Our IP is in the control plane, like the puppet master. So all the scaling, maintenance, and all that type of stuff,
Starting point is 00:30:37 that's our IP. But we basically used Debezium into Kafka. And then the reason why we picked Kafka or an event streaming format, twofold. One, Ali, my co-founder, is one of the world's foremost experts in Kafka. And so at Heroku, the team was like engineering
Starting point is 00:30:56 team for our department of data, especially on the real-time serving aspect, was about eight people. And they did... I always get this wrong, millions of Postgres databases, thousands of Kafka clusters
Starting point is 00:31:11 for tens of thousands of customers and hundreds of, excuse me, hundreds of millions of requests per minute. And the team was eight people, right? And so, you know, it's like we have this expertise, but also Kafka basically allows you to shrink the footprint of like building connectors, right?
Starting point is 00:31:33 And so now we don't have to have a specific like Salesforce to Braze connector. It's just everything talks to Kafka anyway. And so now it's just our duty to get information into Kafka and out of Kafka. And so, like, that's really what we do underneath the hood. And that's really it, man. I mean, you know, we just know how to run this infrastructure better than anyone else. You know, you talk to anyone who's running Kafka, the number one thing is like,
Starting point is 00:32:02 damn it, I got to, you know, learn how to tune the JVM, got to deal with ZooKeeper, all this stuff. And it's kind of like, yeah, we figured out how to do this at scale. So, you know. Yeah. That makes sense. I mean, managing Kafka clusters is an interesting experience. I mean, I had this experience also, in my previous company, where we had built the product around Kafka, and it's a very powerful technology, but there are many moving parts there that you have to have right
Starting point is 00:32:36 if you want to operationalize it and make sure that it's always there and working as expected. Yeah, it makes a lot of sense. So you mentioned that you're going to open source your software. So that's like an interesting question that I have is, how do you think about open source? How important do you think it is?
Starting point is 00:32:56 And why are you actually thinking of open sourcing your technology? Yeah, I mean, look, we're system engineers before software engineers, right? We want to stand on, I mean, we are standing on the backs of giants, right? So, you know, open source is vital and crucial, is like
Starting point is 00:33:17 the lifeblood of what Meroxa is, and to be a good citizen in this space, we not only have to leverage these things, but we also have to contribute back, right? And like, we fully recognize that, right? Like, we're not like some of the other folks that are just like, ooh, interesting open source project.
Starting point is 00:33:35 Let's just commercialize that and basically throw them the middle finger in the rear view mirror as we collect billions, right? Like, that's not what we want to be. And so for us, it's just more so like, we know that, like I said before, you know, the IP isn't necessarily in the components themselves. IP is like how you stitch those things together and run a platform and operationalize it, right? And so if we make a change, like, you know, we put these things together.
Starting point is 00:34:07 I'll give you a perfect example, right? Like, you know, we built all this stuff, and then, you know, our first connector was Redshift, right? Redshift, the Kafka Connect plugin for Redshift, only does single-row inserts. So if you imagine somebody that has a gigabyte's worth of data doing single-row inserts at a time,
Starting point is 00:34:29 it's probably going to take you a while. And so what we did was essentially like, you know, we forked the Redshift connector, added the ability to do multi-row inserts, and then just contributed that back upstream. And like, that's something that
Starting point is 00:34:46 you know, a lot of people at scale, you see these kind of like limits, but you know, the general public is like, oh, I'm just going to take data from a database and put it into Redshift, and it should be fine, right? Like, these are the types of things that we see that, you know, we'll leverage
Starting point is 00:35:02 the community first, but where it doesn't fit our customer needs, we'll, you know, basically do some software engineering and then contribute the changes back to the ecosystem. And so for us, like, open source is a strategy, like our go-to-market strategy, because honestly, like, we see at the end of the day, like, there really is no expertise.
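The single-row versus multi-row insert point from the Redshift example can be made concrete with a small sketch. This is not the connector's actual code (Kafka Connect plugins run on the JVM, and the real patch is not shown in this conversation); it is a hypothetical Python illustration, with made-up function names and a made-up `events` table, of how rows can be grouped so that one statement carries many rows instead of one.

```python
from typing import Iterator, List, Sequence, Tuple

Row = Sequence[object]


def build_multi_row_insert(
    table: str, columns: Sequence[str], rows: Sequence[Row]
) -> Tuple[str, List[object]]:
    """Build one parameterized INSERT that carries every row in `rows`.

    `table` and `columns` are interpolated directly, so they are assumed
    to be trusted identifiers; only the values are passed as parameters.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([one_row] * len(rows))
    )
    # Flatten the row values in the same order as the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


def batched_inserts(
    table: str, columns: Sequence[str], rows: Sequence[Row], batch_size: int = 500
) -> Iterator[Tuple[str, List[object]]]:
    """Yield one multi-row INSERT per batch of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield build_multi_row_insert(table, columns, rows[start:start + batch_size])


# Hypothetical data: five rows sent as three statements instead of five.
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
statements = list(batched_inserts("events", ["id", "name"], rows, batch_size=2))
```

Each `(sql, params)` pair could then be handed to a database driver that accepts `%s` placeholders. The batch size of 500 is an arbitrary assumption; the right value depends on row width and the warehouse's statement size limits.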
Starting point is 00:35:22 Like, what is the 12 factors for data? Like, where's that expertise at to say, like, oh, in real-time systems, this is how you should be thinking about things. This is how you should connect things. If you're running, like, microservices, this is how you should be architecting, you know, your real-time data. Like, there
Starting point is 00:35:39 isn't that just general consensus. Like, it exists in the big companies, right? Like, at Google and Facebook and Netflix and Uber and all that type of stuff. But that general knowledge, unless you know somebody that knows somebody that has done this before, it just really isn't available.
Starting point is 00:35:54 So we want to basically use open source to democratize not only the access to this kind of power, but just the education as well. You should be doing things better than what you're doing, even if you are one of these big companies. Yeah, yeah. And I think it's a very good point what you said about education. I think education is a very big part of building these kinds of products. But okay, nobody actually has prior experience, right?
Starting point is 00:36:21 It's not like a toothpaste. It's not like a can of beans or whatever. I mean, it's something that, okay, like you build it, you put it out there, there is value, but okay, people also need to understand like why and how to use it. And I think that open source, like it's an amazing tool to actually accelerate the process of educating
Starting point is 00:36:41 the people out there and democratize the knowledge at the end. Let all the engineers out there do the best that they can because that's at the end what they want to do. That's great. Moving to the last part of our conversation, just a few questions around the company itself. How long have you been out there and what is the current status? I mean, are you hiring? Yeah. I mean, the company was incorporated in October of 2019.
Starting point is 00:37:16 Our first commit was January 9th. And so we are hiring. We'd love to find a senior back-end engineer that's focused on Go. So if you're that person, I'm devaris at meroxa.io. We would also love to find a front-end person, mid-level, junior, doesn't matter. But yeah, we are hiring. And one of the things I would like to say too is that we are very intentional about building an inclusive culture, an inclusive team. So at least 25% of our company will be women. And so that's something that I am very steadfast in, making sure that we are creating an environment that is not going to perpetuate kind of
Starting point is 00:38:13 the traditional Silicon Valley tropes. We are remote first, so you can be anywhere and everywhere. And then the other thing, too, is, like, we're adults. So me and Ali have been in startups before. And so the thing that we know people care about is work-life balance. Everybody has this idea that startups are just absolute chaos all the time. But planning and focused execution is something that can help alleviate that. So, like I said, we're adults.
Starting point is 00:38:48 We've been there and done that before. And then we pay people like adults. So, you know, we know that, especially at the senior realm, Netflix is giving you a million dollars, and, like, Facebook and Google are starting to, you know, kind of throw the kitchen sink at you.
Starting point is 00:39:05 But, like, you know, we raised a little bit of money, so you'll have more of an impact, and you'll be paid commensurately for your level of contribution and the impact that you can create. Yeah, yeah, yeah. Also experience is an important aspect of choosing to
Starting point is 00:39:25 join a company at this early stage. I think the experience of being part of such a team, and seeing what it means to grow fast and how you can grow together with this team, I think it's an amazing experience and asset for everyone. And yeah, especially when the people from this company are great people like you. So, appreciate that, man. Yeah, I mean, we got great investors:
Starting point is 00:39:50 Amplify, who's invested in folks like Datadog and Prisma and all that. RootVC, who's traditionally a hardware-focused company, but the partner that we're working with, Lee,
Starting point is 00:40:06 who led our round, was VP of engineering at Teespring. We got the Looker co-founders on our cap table, so Ben and Kenan. Nick Caldwell, who's the product officer at Looker, is on our cap table. Jason Warner, who's the CTO of GitHub, is on our cap table. Adam Gross, who's the old CEO of Heroku, is on our cap table.
Starting point is 00:40:35 Fred, who's the CTO at The RealReal, as well. So, like, we got a community of folks. Dion Nicholas, sorry, Dion. If you hear this, my bad, man. Every time I get a long list of things, I always forget one or two. I mean, it's just like we've got an all-star group of people around us that really are technical and want to see us succeed.
Starting point is 00:40:58 I mean, Chris Riccomini over at WePay just came on the cap table. He did Apache Samza, right? Like, you know, we got Jesse Peterson, who is the head of data at Autodesk, as well. Like, you know, he's on our cap table. Like, we have so many investors and advisors that have just been there and done that, that just know, like, yo, you might want to stay away from this because we've seen this happen before, or, yeah, that's a great idea, right? Like, you know, it's just one of those things where it's so exciting to have. I don't know about you, like, you know, when you started your company, right? Like, you know, all the experiences that I've had, all the relationships I've made have just
Starting point is 00:41:41 basically kind of prepared me for this moment right now. It was just like, the time where I was like, yo, I need help, everybody was just kind of like, yeah, we're going to give you, you know, time, resources, money, whatever it is to make sure that this is successful. And so all of that is like an input to the company. And then now all of the, you know, good or bad experiences that Ali and I have had as far as startup life, we just bring into designing the company that we want. So I think that's the big difference between us and a lot of these more immature startups, is that we just understand. We get basically, we get recruiting now and we send out surveys at the end of our recruiting cycle,
Starting point is 00:42:27 as well as a gift card. Right. So we do a pairing exercise. Right. And it's like, look, we'll just pay you for your time. We know you're taking three, four hours out of your day from your job. Right. Like, why wouldn't we want to just pay you for doing something that's meaningful work? Right. And then on our surveys, you know, everybody's like, yo, even though I didn't get the job, this is one of the best recruiting experiences that we've ever had. Because me and Ali literally took a few days to say, okay, let's think of the recruiting experience that we would have loved to have, right? And it's
Starting point is 00:42:58 reflective, you know, kind of towards the audience that we're going after. So it's just like all these little things, man. And it's just like kind of preparedness, where we're going to be successful one way or another. Yeah, yeah. I think based on my experience, I mean, to have a good opportunity, you need like three main components,
Starting point is 00:43:16 the three main ingredients. One's like find the right timing, have the right team and be in the right place. And it sounds like you have all three of them. So yeah, good luck with it, and I'm pretty sure that one way or another, as you said, you're going to be successful. So that's great. Okay, last question from my side. What about the name? How did you come up with
Starting point is 00:43:39 the name? And also, if you can share some interesting facts about the company, something that's not well known and something that you think is interesting to share with the people out there. Yeah, the name is actually pretty... Most people don't get it. They just think it's some like made-up Silicon Valley name. But it's like a real deep story.
Starting point is 00:44:02 Not real deep, but it's kind of an interesting story. So me and Ali both came from Salesforce, and Marc Benioff used to say data's the new oil all the time, right? Like, data's the new oil, data's the new oil. And so one night I was up watching National Geographic, and they were talking about the Dangote pipeline that's getting built in West Africa, and it's going to be the largest oil refinery pipeline in the world kind of thing. And one of the byproducts is jet fuel. And the way that you remove impurities from jet fuel is the Merox process. All right.
Starting point is 00:44:32 And so I immediately was like, yo, that's the name of our company. And then basically like everything Merox was sold out. So I just basically added an A on the end of it. So it sounds like either like a pharmaceutical company or, you know, Silicon Valley official, but, you know, kind of our unofficial tagline is if data is the new oil, we want to power the refinery. So that's how we got Meroxa. Yeah, that's great. That's great.
Starting point is 00:45:00 Very cool. What a great story. Love that. Yeah, man. Everything has a meaning here, right? Like that's what I said. This is a purpose-built company. Interesting fact about us. I would say for all of the Kafka users out there, we feel your pain around Kafka Connect, and we will be open sourcing
Starting point is 00:45:26 our Meroxa Connect that's going to be written in Go. And so now you don't have to worry about using the JVM. You can just deploy our Meroxa Connect thing in Kubernetes or whatever it is that you have. It's automatically compliant
Starting point is 00:45:42 with the Kafka API spec, so it just basically works. Instead of the gigabyte of memory that you need for your JVM Kafka Connect, our instance uses, I think it's like an eight megabyte footprint. So now you can just deploy it on Kubernetes
Starting point is 00:46:01 and like get automatic scale at that area. And then we are... Yeah, so that's going to be coming out in October, November when we open source some of these data plane components. So I think that's the coolest thing that we have so far. Yeah. I have a feeling that as time passes, you'll have more and more interesting facts to share about Meroxa.
Starting point is 00:46:24 So thank you so much for today. And I'm pretty sure that we'll have the opportunity quite soon to discuss again and share more interesting facts around both the company and experiences around working with data. Thank you so much, DeVaris, for today. Yeah, man, thank you very much. Thanks for having me. So that was it. I really enjoyed the conversation with DeVaris. It's very interesting to see and hear from a person who is very dedicated and excited to build a new technology.
Starting point is 00:46:58 I'm also excited because, as I said, it's a very interesting concept to see CDC being applied at scale on databases. And as we heard from him, there are plenty of new use cases that have been enabled. And even though the company is pretty young, like it's a couple of months, but still, I mean, they have come up
Starting point is 00:47:27 with some very interesting use cases that he shared with us. The technology is great. I mean, there are many different open source technologies that are incorporated there, from Kafka to Debezium. There are some folks there that have huge experience in interacting with all these products.
Starting point is 00:47:50 And I'm sure that in the future we will have the opportunity to discuss even more about the technical aspect of building a product like Meroxa. And yeah, I mean, I think it's just the beginning.
Starting point is 00:48:05 I'm pretty sure that from now on, we will hear more and more about products that are around real-time streams of data. And I'm really looking forward to meeting again with DeVaris and hearing what they are going to be building in the next couple of months. Me too. I think, you know, one thing that we see is a trend in data engineering and really sort
Starting point is 00:48:31 of related to customer data in general is that the warehouse is, it's always been a key part of, you know, the data stacks that companies build, but with tools like Meroxa, who productize CDC in ways that really can be game-changing for other parts of the organization. It's been interesting to see this trend of the data warehouse sort of rising in ascendancy to sort of be the king of the stack again, which is pretty neat. And of course, that's driven by a number of things,
Starting point is 00:49:08 you know, new database technologies. Also interesting to see how Meroxa came out of Heroku. So product managers helping build features and noticing that there's a big need in the market. So neat to see how DeVaris and his co-founder took advantage of that as well. And we probably need to circle up and have sort of a good, bad and ugly related to Kafka
Starting point is 00:49:34 because I don't think we dug too deeply into that, but they obviously have some strong opinions and we know that there's strong opinions out there on both sides with Kafka. So maybe we can ask his co-founder to come back on and have a Kafka showdown discussion. Absolutely.
Starting point is 00:49:50 I think we are both looking forward to talking again with the guys at Meroxa.
