The Data Stack Show - 149: Turning Tables Into APIs for Real-time Data Apps, Featuring Matteo Pelati and Vivek Gudapuri of Dozer

Episode Date: August 2, 2023

Highlights from this week’s conversation include:
- Building Dozer: Simplifying Data Sources into APIs (1:13)
- Bridging Data Engineering with Application Engineering (4:19)
- Turning Data Sources into APIs (7:46)
- The cost of caching (12:59)
- Challenges with legacy systems (14:30)
- Real-time data integration (19:31)
- YAML and SQL experience (25:37)
- Behind the scenes of Dozer (29:18)
- Heavy Workloads and Low Latency (42:00)
- Use Cases of Dozer (45:51)
- Reliability and storing data from different connectors (51:35)
- Importance of observability in serving data to customers (53:24)
- Final thoughts and takeaways (56:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack. They've been helping us put on the show for years and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake. You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show.
Starting point is 00:00:39 Costas, we love talking about real-time stuff. And we have a fascinating company on the show today, Dozer. So we're going to talk with Vivek and Matteo. Both have fascinating backgrounds, but they allow you to take a data source, really many types of data sources, you know, from sort of like real-time, like Kafka-esque data sources to a table, maybe in your Snowflake warehouse, and just turn it into an API to get real-time data, which is fascinating. And I want to know what in their experience sort of led them to build Dozer? What problems did they face where they had, I mean, obviously they're trying to simplify something, right? Turn a table into an API,
Starting point is 00:01:33 you know, sounds very much like marketing, you know, which gives me pause. But if they can actually do it, that's really cool. And so I want to know why they built it. And then I'm going to let you ask them how they built it. Yeah, 100%. I think it's a very interesting space because now I think we are reaching the point where we've accumulated all this data into the data warehouse or the data infrastructure that we have in general, like we are able to create insights from that data.
Starting point is 00:02:14 But like the question is like, what's next, right? Like how do we create even more value from this data? And that's like where we start seeing like stuff like reverse ETL coming into the picture or let's say the approach that Dozer is taking into taking this data from the more analytical infrastructure that a company has and turn it back
Starting point is 00:02:38 into something that an application developer can use to go and build a top. Right? Because I mean, I feel like we always, like the first use case that we think about data is like analytics and BI and reporting. But to be honest, like today, that's just like a small part of what the industry is doing, right?
Starting point is 00:03:01 Or like what the companies need to do. There's much more like that we can do. But there's a gap there, obviously. ReversityL is probably addressing partially this gap, but I don't think that it's a Sol's problem. And I think that's what exactly companies like Dozer are trying to do. So it's going to be super interesting to see these breeds and what kind of technology and tooling is needed to bridge data engineering with application engineering. And that's what we are going to talk about today. And I'm very excited about it. So let's go and do it. All right. Let's dig in. Vivek, Matteo, welcome to the Data Sack Show.
Starting point is 00:03:43 We're so excited to chat about Dozer and all things data. So thanks for joining us. Thank you very much. All right. Well, let's start where we always do. Vivek, do you want to tell us about your background and kind of what led you to starting Dozer? And feel free to talk about how you know Matteo, of course, as part of that story. Yeah, so I've always been in technology roles. I knew Matteo in one of the first companies I worked for in Singapore. So I've been in Singapore for the last 10 to 12 years. It was a second-hand company. And since then, we have been good friends.
Starting point is 00:04:24 We always talked about starting something together. We isolated on several concepts. And this is something we came across in our previous experiences and we've solved it in multiple different ways. And we'll explain what that means in a second. Speaking about my personal experience, I mean maybe the last three companies would be a good place to start. So just before this, I was a CTO of a fintech company solving a payment problem in Southeast Asia. Before that, I was involved with a publicly-assessed company called Yoji as a CTO,
Starting point is 00:04:58 basically solving logistics in Southeast Asia and Australia. And before that, I was involved with PaySense, which had a $200 million exit, basically a buy-no-pay-later company, right? That's a little about me. Awesome. Matteo? Yeah. So I'm, as Vivek mentioned, we knew each other for about 10 years.
Starting point is 00:05:19 We worked at one of the first companies together. I'm coming from a mix. I've been working in software engineering for the last 20 years and data for about the first companies together. I'm coming from a mix. I've been working in software engineering for the last 20 years and in data for about the last 10 years. I've been jumping around between startups and mostly financial institutions. I was part of DataRobot relatively early
Starting point is 00:05:42 when they were scaling out the product and I was helping to scale out their product to enterprises. Right after that, I joined DBS Bank, which is the biggest bank in Southeast Asia and helped to build the entire data platform and the data team actually from the ground up and right after that before starting Dozer I was leading the data group for Asia Pacific and data engineering for Asia Pacific at Goldman Sachs and yeah after that me and Vivek we have been always iterating about ideas and we like very much the concept of Dozer and we just decided to jump fully.
Starting point is 00:06:31 Awesome. Well, before we get into Dozer specifics, it's really interesting hearing your stories. There's both sort of a startup background and then also like large enterprise, both in fintech. Is Dozer, does it have roots in sort of like fintech flavored problems or is that just a coincidence of the experience that both of you have?
Starting point is 00:07:01 It's, I think it's, I mean, I can realize it. It's kind of a coincidence because Dozer solves generic data problems, which I happen to face in financial institutions and Vivek also happened to face in FinTech startup. But it's not something that is specifically for FinTech at all, actually. Okay, well, give us the high-level overview. And then let's talk about the problems that you faced that sort of drove you to actually start building Dozer. But give us an overview of what is Dozer and what does it do? So Dozer basically points us at any data source or multiple data sources.
Starting point is 00:07:50 With a simple configuration in YAML, we can produce APIs in gRPC and REST, and developers can actually put that together in a few minutes and start working on data products right away. So the main problem statement, the way you say it, there is significant investment in the data world. There's a lot of tools basically working on ingestion, transformations, et cetera, et cetera. But there are not many errors going out of the data warehouses and data lakes, et cetera. So typically companies looking to solve a data serving problem end up building a lot of infrastructure from scratch. That is what we have done in the past as well.
Starting point is 00:08:28 Typically, that will involve, if you're working with real-time, you'll bring in a Kafka, you'll bring in, for example, a Spark infrastructure for scheduled batch job, you'll bring in ElastiSearch or Redis to cache the queries, you'll build APIs on
Starting point is 00:08:43 top of that. Typically, it involves stitching together various technologies and it will require a significant amount of time and coordination between several teams and cost as well, right? So this is what we kind of personally faced and this is what we wanted to productize. So with a small team or a single developer can actually kind of go all the way from leveraging data to produce APIs so that data products can be instantly iterated upon, work on the problem that you care most about. So that's what Dozer solves. Makes total sense. Can you give us just an example of something. And really, Matteo or Vivek, I know it sounds like you both faced this and solved it using a complex stream of technology.
Starting point is 00:09:36 Give us the way that you solved it before. And if you can, give us maybe a specific example. You needed to deliver a data product that did X. And what stack did you use to actually deliver that? Maybe I can start with the kind of problem that I have, and maybe I can follow on with his problem. So the biggest challenge that I was facing when I was at DBS, for example, DBS wanted to build a unified API layer
Starting point is 00:10:11 to serve all the banking application across product and across country. So we're talking about core banking, we're talking about insurance, we're talking about wealth, we're talking about wealth, etc. And what we wanted to achieve as well was offloading all the source systems.
Starting point is 00:10:36 So you can imagine, in this case, tens of different source systems serving different kinds of products. And in order to achieve all that, we had to start building a very complex infrastructure,
Starting point is 00:10:56 capturing data from all these source systems, preparing it, caching it, and building API on top of it. Now, saying that seems very simple, but in reality, it's a fairly complex work because we're talking about a number of source systems, we're talking about capturing everything in real time, and we're talking about making sure the systems are extremely reliable because this is not just a dashboard.
Starting point is 00:11:26 It's data that is integrated. It's served directly to the bank staff. And that's where we realized that how much time was spent to build the entire plumbing. And that's how the idea of those are came about. Yeah, interesting. So would that be like, this is probably a really primitive example,
Starting point is 00:11:51 but let's say you have an account balance that needs to be available in multiple apps, you know, that relates to insurance or something. And so you need to actually, you need to serve that like across a variety, you know, you need an API that essentially makes that balance available within all sorts of applications across the ecosystem. Yeah, that's correct.
Starting point is 00:12:12 And especially if you think when you open up your banking account, you see your current account balance, you see your wealth account balance, you see your insurance. So these are managed by different systems. And traditionally, you query each and every system to get the data. But that becomes complex and the load for the source system is very heavy. So that's what we wanted to achieve,
Starting point is 00:12:48 a unified layer that was making it much easier for app developer to integrate and also at the same time reduce the complexity and the load to the source system. Yep, yep. I mean, I would guess that... Did you have to make a lot of decisions around cost? Because you have to decide how much of what to cash depending on how often people check something, right? So mobile, they may check way more often than web. Or were there a lot of trade-offs that you had
Starting point is 00:13:18 to make just in terms of caching and the cost of running the queries? Because people want to know when money hits their bank account. They need to know that basically when it happens, right? That's right. To be honest, the trade-off in terms of cost, talking about data caching, we didn't have to make so many trade-offs because the cost of running the read load on the source system was so high that even caching the entire data pre-packaging the entire data, pre-aggregating the entire data and store in the cache, which was basically a persistent cache, an entire user profile with a part of the transaction history. And this had a much lower cost anyway than hitting the source system itself.
Starting point is 00:14:21 Interesting. Wow. Okay. So the caching just wasn't even a big deal because it costs much at the source system itself. Interesting. Wow. Okay, so the caching just wasn't even a big deal because it costs much of the source to generate it. Wow. Okay. That's right. I mean, here we're talking about
Starting point is 00:14:34 legacy systems like mainframes, like where that are not specifically designed for read load. And each read operation is a cost to the company. How many, just out of curiosity, how many separate,
Starting point is 00:14:58 let's just say like data engineering tools, would you say did you use to build the pipeline that served that? I mean, are we talking like three or like 10? We're talking about probably around 10, maybe a little bit less than 10. Wow. Yeah. I mean, because there were a bunch of tools that were like legacy tools to connect to the systems, plus we had entire infrastructure using leverage Kafka. Obviously, there was a lot of custom code as well.
Starting point is 00:15:39 Sure. I was just going to ask. Yeah, there was a lot of custom code. And, you know, when we started implementing, we were not just teaching the pieces together, but we properly defined a full, I would say now you would use the term data contracts, so that all the data that was published,
Starting point is 00:16:05 the API available, were fully documented, documentation was fully generated, et cetera, et cetera. And then we had a caching layer, and not just one, but multiple, depending on what kind of lookup you would need. Oh, right, sure. Yeah, yeah. Because, I mean, depending on your query pattern, you will choose one or another. Yeah. Okay. Well, to all of our listeners, next time you're trying to refresh your banking app to see your paycheck hit, just know that there's a lot going on.
Starting point is 00:16:43 And so the little spinning bar is running a lot behind the scenes. Vivek, I'd love to just hear, you know, what problem you faced and then let's dig into Dozer. Yeah, so before that, I'll bring us to a slightly higher level for a second. So this problem manifests as we see in multiple different ways. Larger organizations will call them an experience layer where you're bringing data from multiple domains and serving certain domain APIs to end users, which could be direct customers or it could be internal clients that are doing different things. It could be a simple problem if you have several microservices and you simply have an API.
Starting point is 00:17:27 For example, let's talk about a use case of user personalization where you want to have some amount of calculated data about a user which is coming through a machine learning model or some other, for example, in the case of fintech which we were talking about earlier it could be your credit scores
Starting point is 00:17:44 and your risk profiles etc etc which are useful and you would have certain amount of master data which are coming from certain other systems where you would think about data production rules etc etc as well but when it comes to a mobile app where matthew was describing you are putting all of that as one user api for example now you have to stitch together data that is coming from multiple systems. And having real-time data in these scenarios becomes very important. Similarly, another use case I would describe is that you have data sitting in, for example, I mean, today, a lot of this system data will be available in a data warehouse.
Starting point is 00:18:22 And data warehouse is typically not suitable for doing low latency queries. Let's say if you have millions of users hitting your application, you cannot make all the calls back to a data warehouse. You have to bring that into a cache to serve all these APIs. And that suddenly becomes an entire pipeline to manage. And you have to think about real time
Starting point is 00:18:39 and all the caching policies, et cetera, et cetera. So in my experiences, we have had to deal with some of these problems where we had a data warehouse in place, and we had to kind of bring information about users and certain profiles in the form of reports, in the form of embedded personalized experiences for users. Users, for example, in the case of fintech, as I mentioned,
Starting point is 00:19:03 it could be disk profiles, et cetera, spend example, in the case of fintech, as I mentioned, it could be disk profiles, et cetera, spend patterns, whatnot. In the case of logistics, for example, it could be driver locations, it could be customer latest number of inventory items in the inventory, et cetera, et cetera, right? There are many things that are supposed to be kept real time, but this data is often coming from multiple different systems, and we still need to serve these APIs at a low latency for a large throughput. Yeah, that makes total sense. This is probably a dumb question, but a lot of the data sources we're talking about aren't necessarily real-time themselves, right? I mean, of course, like a Kafka or if you're running Databricks and a Spark cluster, you can run some of those things real-time. When we think about a data warehouse, is the problem overcoming the limitation that you are...
Starting point is 00:20:00 Because a lot of the data coming into the warehouse is running on a batch shop, right? And so you're going to get your payments data, what, you know, every hour, every six hours, you know, whatever. And so the idea is that, okay, well, you actually have that data in Snowflake or BigQuery or whatever. And you need to make the updated, like the latest data available in real time without having like a complex set of pipelines. Yeah. So on that note, obviously the warehouses, as you mentioned, sometimes could be a snapshot of information, which is done at a certain schedule. Dozer works best in the context of real time when you connect us to the source systems. So if you're connected to a transactional system, we typically take the data in CDC and move that in real time. So we have inserts, deletes, and upgrades as they're flowing through the main transactional system.
Starting point is 00:20:49 And we keep information fresh in real time. Well, real time, I mean, obviously there's a bit of a data latency as CDC will also have a little bit of lag, but it is as best you can get from a latency standpoint. But if you already have information in data warehouse and you want to connect that with your other data streams, that's something you could do in a very similar, from an experience standpoint. We can do that in a very similar fashion. So you could basically pull in a snowflake, pull in a Postgres or pull in
Starting point is 00:21:23 the future, other transactional systems as well. And you can connect them as if you are writing a simple join query between tables and columns and that will immediately produce an API. Interesting. Okay, so I'm just going to come up with a fake use case here. So let's say that I have a SaaS app and someone's on a free trial and I have, you know,
Starting point is 00:21:45 my, you know, my app database is running in Postgres. And so I have like some basic data in Postgres about like what features have been used, like the status of the, you know, person's trial, whatever it is. And of course, like I want to, you know, send messages to that person or even like maybe modify like the app or even the marketing site, you know, using that data. And so with Dozer, I mean, from what I can tell, I can essentially turn that Postgres data into an API and then just
Starting point is 00:22:22 hit the API to grab the data that I need to make the decisions that I want to make, you know, whatever, in like my React app or whatever I'm delivering my app as. Is that accurate? Exactly correct. You can commit to real-time sources in, let's say, less real-time sources, you can define your data, how you want to combine the data, or even if you want to pre-aggregate the data, create the, I would say the payload of your API and API are automatically exposed. That's roughly how it works. Interesting. Interesting.
Starting point is 00:23:06 Okay, so when you say pre-combined data, so let's just say I'm running my app database in Postgres, but then the marketing team is collecting a bunch of whatever data they collect. So clickstream data, web views, you know, marketing data. And Dozer would allow me to actually
Starting point is 00:23:30 join that data and make the join available as an API, like an endpoint? That's correct. That's correct. So fundamentally every, you can join. Let's say, as you mentioned, you have your Postgres database.
Starting point is 00:23:50 Let's say you have also some analytical data coming out of your Snowflake or Delta Lake. And you want to join this data or even do some additional stuff on top of the join. You want to do some aggregation, you want to do anything. So every time something changes on the source, the change is actually propagated to Dozer. And Dozer really helps place the output and store it in the cache and make it as available as APIs. Fascinating. Okay.
Starting point is 00:24:30 So, man, I have so many more questions, but I know Costas has a ton of questions. But can we just talk through, and this will probably be a good handoff for Costas. If I have a Postgres database and then I have an analytical database with Snowflake, and then just to make it even more complicated, let's say our ML team is working in Databricks or Spark, and so I have some output there. And so I want to figure out how to provide some sort of personalized, my marketing team wants to personalize this page on the site based on something we know about these people that needs to combine these sort of three key, you know, app database, analytics database, ML database. How do I do that with Dozer?
Starting point is 00:25:24 So like, how do I install Dozer? How do I connect the sources? Can you just give us a quick walkthrough of, you know, I have those three sort of data sources and I want to make them an API. Yeah, so Dozer's experience is mainly driven through YAML and SQL, right? So you would put a YAML. One block would be connections. You would specify the three connections you mentioned. One block would be about the SQL transformations that you need to perform. You would write all the SQL transformations
Starting point is 00:25:53 that you want to perform on the source systems and specify where the endpoints, APIs, how they are to be exposed, and the indexes that need to be created. And that's it. That's all you have to do. And you have APIs available in GRPC and REST. All right.
Starting point is 00:26:09 That sounds like super easy, but I'm sure there are a little bit more details there in how it happens, right? In the backend, at least. So can you guys take us a little bit through what does it mean? What happens behind the scenes when I provide these YAML files and when I provide this SQL in my choice of the API protocol that I want to use?
Starting point is 00:26:38 Behind the scenes, we have multiple connectors that basically captures real-time data from databases or data warehouses. So let's say from Postgres, we use CDC. From Snowflake, we use TableStream. So every time there is an event that can be an insert, a delete, an update from any of the sources,
Starting point is 00:27:04 we capture all this data. After this data is captured, it goes through the SQL that you have defined. Now, this SQL is fundamentally transformed into a DAG, Direct Accyclic Graph. And that DAG is executed, that DIG is executed in real time as the data is in transit. So we keep the state of the output data always up to date and in the caching layer.
Starting point is 00:27:40 And because we know what is the output of your SQL query, we can actually produce what is the output schema of the API. And that's how we generate the protobuf definition and the open API definition. So in brief, this is the entire flow of execution from the sources all the way to the consumption. All right, that's super interesting. And okay, you act in a way as an insumer, like the service, right? Of data that is coming from a number of other services that you don't really control the schema there. That's right.
Starting point is 00:28:31 So let's say I'm going and for whatever reason, I drop a column on Snowflake, or even worse, like on my production Postgres database, Which probably means that, and I'm using drop for a reason because adding a new column might be a little bit easier to handle, the other problem is more silent. Right? Data will come and data will be missing or will turn into nulls, right, or something like that. How do you deal with that? Because again, we are talking about a service on the other side that someone is consuming, right? Or something like that. How do you deal with that? Because again, we are talking
Starting point is 00:29:06 about like a service on the other side that someone is consuming, right? Like they are driving like a product or an application or it doesn't matter if it is internal or external, right? How do you deal with that? Yeah, that's actually a very good question because this is what happens typically in companies. And when you have multiple people working on multiple systems, there needs to be an entire coordination that needs to be in place for something to do a schema migration of sorts. So we actually have, this is something we really thought about. And Doza has an API versioning experience where if for you to kind of create a new API version, you'll just have to change the SQL or the source schema has changed or the types have changed. You automatically publish a new API
Starting point is 00:29:50 and with a few commands, you can switch the API to the new version, right? So we actually run two pipelines in parallel and basically populate both of them and a developer can simply switch from one version to the second version. Obviously, as you mentioned, if it's a destructive change that one of the pipeline completely breaks we have an error notification kind of an experience in play where we'll let you know that family is not working anymore but if it's a if it's a straightforward change and nothing has to
Starting point is 00:30:19 change in terms of schemas we simply overwrite the version but let's say if there is a breaking version change we automatically create a parallel version. But let's say if there is a breaking version change, we automatically create a parallel version. So this is from a developer standpoint, you typically work with YAML, you deploy a new pipeline, and it starts to kind of work as a parallel version. And you could simply switch that to the parallel version. By the way, so this experience of API versioning is not kind of part of our open source because it's a lot to do with infrastructure than just code. So that is coming out in our cloud version, which we are kind of launching soon in beta.
Starting point is 00:30:52 But typically, if anyone deploying this in a self-hosted manner, they could also kind of deploy it in a similar fashion and write about that on our blog. Well, it makes a lot of sense. Okay, so from my experience with working with, what is happening with these systems is that you have the database from one side,
Starting point is 00:31:13 which represents some kind of state, right? And from this very concrete state, you go to a series of events that actually represent how these, like the operations that are applied on the state, right? The reason I'm saying that as an introduction is because one of the, let's say, tough situations with like CDCs, like going and recreating the state, right? Because you might need, I mean, not the whole state, but part of it.
Starting point is 00:31:52 But the events themselves, it's just like inducting on an individual event, it's just part of what you usually want to do, right? But this has some complexity, in terms of the SQL that you have to write to do that, and also has a computational complexity. There might be a lot of events happening coming from CDC. And when we are talking about systems that are operating more as as a cache. Okay, you always think that there are some constraints,
Starting point is 00:32:28 like in terms of the resources that you have there. So how is Dozer dealing with these complexities of working with event streams coming from data systems? Yeah. So one technology choice that we decided to use with Dozer is to implement everything in Rust. I mean, that's something that is happening a lot in a lot of tools, the data engineering space. And we fully believe that this is going to change a lot in the data engineering world.
Starting point is 00:33:14 So in my experience, when you have to deal with distributed systems and JVM-based tools, like most of the tools are in the space. So you add a lot of complexity to your system for no party, sometimes no particular reason. Okay. In some situation, it is really justified to have a fully distributed system because the volume is so big, but in some other situation you don't really need it.
Starting point is 00:33:45 So we said, okay, let's take a much linear approach because a language like Dozer is, sorry, a language like Rust allows us to get incredible performance with much more simplicity. And that's how we, that's what, how we follow the implementation in Dozor. I mean, the execution of the pipeline is actually run in a single process. Now, it can be distributed among multiple processes and nodes, and that's what we are doing in our cloud version. But the open source version is fundamentally a single binary,
Starting point is 00:34:35 which is much, much easier to run and manage rather than having a full cluster. That's the approach we followed. And that has to be quite useful. I mean, it's simple to manage and with the performance that you get. And another thing that we noticed, we started
Starting point is 00:34:58 experimenting running Dozer on different machines. And, you know, with the R-based cores getting more and more popular, and especially with the large number of cores, now you have like an ARM-based machine with like 64 cores,
Starting point is 00:35:17 you can really scale out your computation, not on, you don't need really a cluster, but you can scale out on multiple cores on the same machine and achieving incredible performance. So you can achieve what was not possible before with much simpler code and much simpler infrastructure. Yeah, that makes sense. So from the user perspective, right, structure. Yeah, that makes sense. So
Starting point is 00:35:45 from the user perspective, when the user has to deal with these event data and works on recreating, let's say, the state, how is the experience there? I mean, how is the
Starting point is 00:36:02 user going from a stream of events that represent changes that have been applied on a table? Let's say the user table, right? How is the experience of doing that? How is the user of Dozer going to implement the SQL query that, let's say, takes that and from that creates, let's say, keeps only the daily new signups. And only these are exposed through the API. The reason I'm asking that is because, okay, when someone develops, you need to have access to some data, go through some process to do the actual development.
Starting point is 00:36:46 So what's the experience here? Because it's a little bit different than a database system, right? Like, okay, in the database system, you have your interface, you have the data, you start writing queries, see what happens. And at some point, through iterations, you end up with a query at the end. But if you have something like Dozer, how is this experience happening? Yeah. So as Matthew mentioned earlier about the high-level infrastructure, right? So Dozer has fundamentally four components. We have connectors. There is a real-time SQL engine
Starting point is 00:37:20 running. We have a caching layer and on top of that, APS are available. So underlying, once the data crosses the connectors, everything is turned into a CDC, which means it's an insert, delete, and update. And the SQL is working on data as if it's a simple table and there are a bunch of columns, right? So if you are connecting to a CDC of a postless database, for example, you're getting inserts and delete, inserts, deletes, and updates as they are flowing through. So you're making changes to the database and you'll get inserts, deletes, and updates in your toaster. But let's say if you're working with a Kafka, and as you mentioned, you're working with
Starting point is 00:37:58 events in that context, right? So you could actually, so events would be available as a table in your SQL and you could write whatever business logic you have on top of events as a SQL base. So you could actually combine that with data coming out of Postgres and represent that as series of transformations
Starting point is 00:38:17 and the output of the SQL would be produced as an API. So that's the experience. So when I'm connecting through Dozer on Postgres, what I see is not a stream of events, insert, delete, and update that are coming. I see the table. If I connect on the user table,
Starting point is 00:38:39 what I see through Dozer is the table itself. That's right. If you use the end output after you use SQL is you get, when you call the API, you would see records as if you are seeing a table, but you're not working with events directly. And actually, on top
Starting point is 00:38:56 of that, you see the table, but you don't see a static table. You see a live table with all the data that is actually changing in real time. So basically, if you do a select and your select produces, like, let's say, 10 rows. And in the database, a new row is added. The new row, if the row satisfies the actual condition of your SQL, it will suddenly appear in the list, actually.
Starting point is 00:39:29 So what you see is a table, but it's actually, it's more than a table because it's a live table. And one of the, I mean, at least like historical like issues with CDC is how you seed the initial table right because CDC like when you connect like to the CDC feed for Postgres it has like a limited capacity right like you can't really access the whole table through the CDC feed itself So how is this happening with Dozer? How do we get access to the whole table? Or it's a decision that you made that you only get the updates
Starting point is 00:40:12 seems like the date that you have installed the pipeline. Yeah, so typically with connectors, we have a snapshot and then a CDC that continues. So we take initial, so basically in the case of Postgres, for example, we have a snapshot and then, you know, like a CDC that continues. So we take initial.
Starting point is 00:40:26 So basically in the case of Postgres, for example, we start a transaction. You would basically get the snapshot of the table and we kick off CDC. So you get the initial state and all the updates that are coming after basically. Okay. That's interesting. Okay. So enough with the technicalities. I'm asking all these questions because I find this in general very fascinating as a way of dealing with data, but it's also quite challenging, especially if you're trying to scale that. So it's very interesting to see how a real system that is built handles all these interesting parts of C and trade-offs. So let's move towards the use cases.
Starting point is 00:41:20 As we go there, I would like to ask something because Mateo was saying at the beginning of our recording that he used the term read-heavy applications, right? Where usually you use a caster. That's the whole idea of having a c cast like Redis. And I would like to ask you guys when you were thinking about Dozer and how Dozer is implemented today, are we talking about a system that is primarily trying to serve read-heavy
Starting point is 00:41:56 workloads? And this is, let's say, a big part of the definition of what real-time is. Is it more about the latency or is it both? Right. Because you don't necessarily need both, right? Like you can have a low latency system that processes data in, I don't know, like sub millisecond even latencies, but still you're not going to have like too
Starting point is 00:42:23 many reads happening, right? Or like too many writes. These are different concepts there between the throughput and the latency. So what is, what dozer stands between these parameters of the problem that you are solving? Good question, because this is, I mean, linking back to the use case that I was mentioning in the bank, it's actually both. Because, you know, one is the, if you think about the actual banking application, read heavy in terms of cash, because obviously you have a lot of users logging into your app and checking their bank
Starting point is 00:43:06 account. And, you know, it's surprising to see how many users they log into their banking application after they do any operation. They withdraw the money at the ATM and they immediately log into the banking app to check if the transaction is correct. And that's the read heavy on the cash. But at the same time, because there are these kinds of users, you need low latency on the pipeline. So if I withdraw my money from the ATM and the database, the source database is updated, obviously I need a low latency in the pipeline execution so that I can display the data in real time or
Starting point is 00:43:59 near real time to the user. So we try to address both scenarios to be honest. Yeah, just want to add something on top of that, right? So that's why what we want to do from a dozer standpoint is think, so to become the de facto standard in a way you think about data serving, right? In some cases, we are unlocking the web, unlocking data for companies. For example, sometimes in enterprises, you don't have access to a source system that is hidden behind several controls and, you know, like it's sitting in a certain business unit.
Starting point is 00:44:32 And to actually kind of make that part of a user experience, you'd have to think about creating so much infrastructure internally and it's such a challenge in several months of project, right? In some cases, you're dealing with read scalability, your Postgres is not able to answer those queries anymore. In some cases, you're talking about creating an entire domain layer of APIs where you're combining several things and exposing that. So definitely Dozer comes into play at a certain scale of a company. I would not say a company starting today would not need Dozer right away. But when you're thinking about standardizing
Starting point is 00:45:03 all your read traffic, thinking about scaling your read infrastructure especially, and data serving capacity, that's where Dozer comes into play. That makes total sense. And, okay, let's talk a little bit more about the use cases now, okay? Like, we've used
Starting point is 00:45:20 a lot the banking sector as an example of how a system like this is needed what are like the use cases that you have seen so far that's like actually like one of like my opinion like the beauties of like building a company because there's always like people out there that surprise you with how they use like the that you are building. So what have you learned so far with Dozer? What have you seen people are doing with it? Yeah, so I would like to describe
Starting point is 00:45:54 two use cases that we saw that are very interesting. Smaller companies are basically looking to use Dozer because suddenly you get SDK that you can plug in and you get real-time APIs on and you can start building APIs right away. So cost of saving time
Starting point is 00:46:12 and immediately start to kind of build products, that's what is appealing at the lower end of the spectrum. And if you talk about enterprises and we are currently engaged in a few enterprises, this is where unlocking the value of data is coming into play and the terms like experience layer etc etc coming into play where today without naming names some companies some enterprise companies are dealing with large volumes of data sitting in you know disparate systems and they are currently thinking about creating a large infrastructure,
Starting point is 00:46:45 which is potentially a few months or even a few years long, and putting together a large stack and a large team to solve this problem, right? This suddenly becomes a multimillion-dollar project solving this, you know, like, and at the end of it, you still don't know how to exactly build it
Starting point is 00:47:03 because there are many technical complexities involved and several key stakeholders involved. So this is where I think we received some inbound, which was very interesting for us, where basically instead of solving, creating all this infrastructure, code plumbing yourself, Dozer can actually immediately provision a data API backend for you. And you can start to kind of
Starting point is 00:47:26 work on one API, two API, and actually kind of start to build an entire experience layer for a company. So that's what we have seen for larger enterprises. I think there is another kind of interesting use case that came mostly up from our open source usage. That is where
Starting point is 00:47:41 you have, let's say, a product engineer, full stack engineer, that has to build, that has to integrate data with a consumer facing application actually. And you know, this engineer has to, I mean, you know, all this data is coming from, can be coming from the source system as well as the data warehouse. So it has to deal a lot with the data engineering team. And you know, there is always friction between product engineering and data engineering, who does what.
Starting point is 00:48:17 And those are actually started to prove to be very useful in helping these kind of engineers to get the data they want, combine it in the way they want, and expose API and integrate. Without having to go through the entire process of building pipeline on the data lake, getting approvals to run, for the, to run the pipeline there, et cetera. So it kind of started to be the, let's say, last mile delivery of data for product engineer to, to bridge the gap between the data engineering and the product engineering.
Starting point is 00:49:01 That's super interesting. And you used earlier, like the term experience layer. Can you explain what do you mean by experience layer? Like what is it? Yeah, maybe experience layer is something that is typically used in banking and telco and bigger enterprise so fundamentally you have your domain particularly so domain layer where like let's say let example let's take the banking space actually so you have your domains which are fundamentally wealth management core banking, insurance, et cetera, et cetera.
Starting point is 00:49:57 And this product, this system, basically, they expose data relative to that specific domain. Now, when you build a mobile app, we are mostly talking about the user experience. And at that point, you don't care about the domain, but you care about, for example, giving an overview of your balances, whether it's insurance, whether it's wealth, whether it's... That's the kind of definition of the experience layer. What is the layer that you put in front of your user to better serve the user.
Starting point is 00:50:27 I think this is something that is maybe mostly used in these spaces, actually. But we have seen, even if they don't call it experience layer, we have seen companies needed something like that. Maybe they call it in different ways, but that's fundamentally what it is. Yeah. Yep. No, makes a lot of sense. Okay. One last question from me and then I'll give the microphone back to Eric. So I would assume that having a system that in one side might possibly be driving a user-facing application, right? And on the other side, like consuming data from values like data systems that are probably like working already on like their limits and all that stuff. I would assume that reliability is important, right?
Starting point is 00:51:23 So how do you deal with that? And what kind of guarantees dozer can give when it comes to the reliability of the system? Yes. So this is actually, I mean, as you rightly mentioned, it's a difficult problem the way we solve this for companies. So reliability, we do this in multiple ways firstly there are some things that are also coming in some of our future versions as well so the data as we get it from the sources we actually kind of stored depending on the type of the
Starting point is 00:52:00 connector for example if the connector can support a replay we don't necessarily have to store all the information ourselves. But let's say if the connector doesn't have a replay mechanism, we have the ability to persist that in a certain Doze format. So that's one guarantee. So even though let's say if a pipeline breaks, we can kind of restart and kind of replay the messages and recreate the state. On the other end of the spectrum, the API layer, the caching layer is built on LMDB database. It's a memory mapped file. And we can basically kind of scale the number of APIs on the existing state as it stands in a horizontal fashion. So let's say even the pipeline breaks, we can still serve the APIs with the existing data as it stands. It might be, you know, if the pipeline breaks, for example, you would see a little bit of a data latency.
Starting point is 00:52:49 But when the pipeline kicks in, you have a new version deployed, we automatically switch the version and you have API available again, right? So all this, we still guarantee that APIs are not down, whereas the data pipeline, we will try to kind of replay the message and recreate the state. So we can run Dozer in multiple, I mean, in the cloud version or in an enterprise deployment, this typically would be a Kubernetes cluster with different type of pods doing different things. And even though some of the pods go down, we still have a way to maintain the state so that APIs would not go down. Actually, one thing I want to add
Starting point is 00:53:26 is that it's in addition to reliability, one important aspect is also observability here. Because you're serving data to customers. So it's actually 10 times more critical than an internal dashboard. So it's not uncommon to, let's say, get a support call and say, okay, my balance is wrong.
Starting point is 00:53:53 Why is that, actually? And you need to really understand why the balance is wrong and trace it back, actually. So that's another important aspect about it. All right. That's all from my side. Eric, the microphone is all yours. All right. Well, we are really close to the buzzer, but of course I have to ask, where did the name dozer come from? I mean, usually you think about, you know, a bulldozer just pushing mounds of dirt, you know,
Starting point is 00:54:27 into big piles, but give us the backstory. Okay. That's quite interesting. Okay. So when those started, like, it's like, we are like almost a year into the journey and we were iterating on the idea and i like very much the concept i share with you that and i stumbled upon an article from netflix where they were describing a system that was very similar to what I built in DBS actually. And and the system's name is actually Bulldozer. And we kind of got an inspiration from from that. And obviously we didn't want to use the same name Bulldozer. So we abbreviated it and it actually became it became dozer but
Starting point is 00:55:27 actually was very it's very good because now joannes that is the main author of bulldozer at netflix is actually helping us out as a as an advisor in the company. Oh, wow. Cool. It's actually, it's a nice story. Wonderful. Well, what a great story to end on. Vivek, Matteo, thank you so much for your time. Fascinating. And I'm really excited that technologies like Dozer, I think are going to enable a lot of companies
Starting point is 00:56:04 to actually deploy a lot of companies to actually deploy a lot of real-time use cases, even at a small scale, when they're early on, and then scale to do things like the huge companies that y'all have both worked for. So very exciting to see this democratization, especially in the form of a great developer experience. So congrats and best of luck. And thanks for spending some time with us on the DataSec show. Thank you very much for having us.
Starting point is 00:56:33 Cost is fascinating conversation with Vivek and Matteo from Dozer. So interesting, the problem space of trying to turn data into an API, right? Think about all the data sources that a company has, and their goal is to turn all of those sources into APIs and actually even combine different sources into a single API, which is where things get really interesting. Imagine a production database, analytical database, an ML database. Being able to combine those into a single API that you can access in real time is absolutely fascinating. I think my biggest takeaway, which we didn't actually talk about this explicitly, but I think that they are anticipating what we're already seeing becoming a huge movement, which is that data applications and data products are just going to become the norm, right? Whether you're serving those to an end user in an application, you know, so we talked about a banking application where you need an account balance across, you know, the mobile app, the, you know, sort of web app, the insurance portal, et cetera, right?
Starting point is 00:58:01 Of course, you need that data there. Or your personalizing experience, right? Based on sort of demographics or whatever, all of these are data products. And we haven't talked about that a ton on the show, but I really think that's the way that things are going. And this is really tooling for the teams that are building those data products, whether they're internal or, you know, sort of for the end user. And I think APIs make a ton of sense as the way to sort of enable those data products. So that's my big takeaway.
Starting point is 00:58:34 How about you? Yeah, 100%. I don't think like an application engineer is going to change the way that they operate, right? Like they have their tooling and they should continue like working with what they know how to use, right. And do it like I mean, how they do it like already in application.
Starting point is 00:58:57 So that's where I think that the opportunities for tools like Dozer, right. The same way that a data engineer doesn't want to get into all the protobufs and I don't know what else applications are using to exchange data, right? The same way an application developer shouldn't get into what the Delta table is. They should care about that, right? Or what the Delta table is. They should care about that.
Starting point is 00:59:25 Or what Snowflake is. What they care about is getting access to the data that they need in the way that it has to be so they can build their stuff. And that's, I think, what is happening. I think it's primarily a developer tooling problem to be solved. It's not like a marketing. It's not like a sales ops. It's not any of these like ops kind of like...
Starting point is 00:59:55 I mean, there is space obviously also like for these tools, but if we want to enable, let's say, builders, we need to build also tooling for engineers to go and build on top of that data. And I think we will see more and more of that happening, even in the reverse ETL tools that we've seen coming in the past two years. And you see that even with... What's the name of this one?
Starting point is 01:00:26 Say, one of these companies that they're doing start like implementing like a caching layer on top of yeah snowflake right so like an audience cache yeah for sure yeah but but like, forget like audience and put like any kind of like query result. I agree. That's why I'm saying like, they started like from a marketing use case, right? But at the end, like what they are building right now is interfaces for application developers
Starting point is 01:00:58 to go and build a top of data that lives inside their warehouse, right? Yeah. And I'm sure we'll see more and more of that. But it's interesting to see that, like, even, like, Hytats that started as a company, like, very, like, focused on, like, the marketing use case. At least that's my understanding, like,
Starting point is 01:01:17 when I saw them, like, when they started. They're also, like, moving towards that, which is a good sign. It's a sign that, like, more technology is coming, exciting tooling and developer tooling. Yeah, I agree. I think that, you know, we've talked a lot on the show over the last two years about, you know, data engineering, the confluence of data engineering and software engineering, right? And nowhere is this more
Starting point is 01:01:45 apparent than, you know, putting an ML model into production or taking data and delivering it to an application that's providing an experience for, you know, an end user. And so we've actually had a lot of conversations around, you know, software development principles and data engineering or vice versa. And tools like Doze are fascinating because they actually may help create a healthy separation of concerns where there is good specialization, right? Not that, you know, there isn't good, you know, healthy cross pollination of skill sets there. But, you know, if you have an API that can serve you data that you need, as an application developer, that's actually better, you can do your job to the best of your ability without having to sort of co-opt other skill sets or, you know, sort of, you know, deal with a lot of like data engineering concerns, right? And the other way around. So I think it's super exciting
Starting point is 01:02:52 and an interesting shift since we've started the show. So stay tuned if you want more conversation like this, more guests, lots of exciting stuff coming your way, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at Datastackshow dot com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at Rudderstack dot com.
