The Data Stack Show - 70: The Difference Between Data Lakes and Data Warehouses with Vinoth Chandar of Apache Hudi

Episode Date: January 12, 2022

Highlights from this week’s conversation include:
- Vinoth’s career background (3:19)
- Building a data lake at Uber (6:52)
- Defining what a data lake is (14:01)
- How data warehouses differ from data lakes (22:46)
- When you should utilize an open source solution in your data stack (37:36)
- Evolving from a data warehouse to a data lake (45:09)
- Early wins Hudi earned inside of Uber (52:30)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Kostas, we love talking about data, and you were telling me the other day that you are actually looking for someone
Starting point is 00:00:31 to work with you on your team, and part of the job is talking about data all day. So tell me about this job. Yeah, we are looking for someone to work with me as part of the developer experience team at Rudderstack, and more specifically in the dev rel role. So we are looking for a person who is interested in anything around data and talking about data and building relationships with other people that care and love working with data. So that's it. I think it's an amazing opportunity
Starting point is 00:01:07 and we'd love to hear from anyone who might be interested. It doesn't really matter if someone has previous experience in DevRel. It's more about genuinely loving data and wanting to work with data and the communities around it. Very cool.
Starting point is 00:01:26 And where, man — maybe I'm going to apply for this job — but if I did want to apply for this job, where should I go? You can email me at kostas at datastackshow.com. That's one way. The other way is to just visit rudderstack.com slash careers, where you can have a look at all the open positions that we have; you will also find the developer experience positions there. Very cool.
Starting point is 00:01:57 Well, hopefully we get one of our amazing listeners to join you on the team. Yeah, hopefully. Welcome back to the Data Stack Show. I'll probably steal the show with my question, Kostas, but we've talked with some people who have worked on technology inside of large companies like Netflix that was later open sourced and made generally available. But we haven't, at least to my knowledge, talked with someone who
Starting point is 00:02:38 was there at the beginning and really started it from the very beginning. And I just want to get that story: what were the challenges they were facing at Uber? Where did the idea come from? And how did it actually come to life inside the company? I think that origin story is going to be really cool to hear about Hudi. Yeah, absolutely. And I think it's probably the first open source project — an actual Apache project — whose creator we have here as a guest. I think so. So that's going to be interesting because, okay,
Starting point is 00:03:11 this is also important. It's one thing to have like a project, to open source something on GitHub. It's another thing to have something that's governed by the Apache Foundation. So especially from the governance side, like it's a very different situation. So I think that's going to be very interesting
Starting point is 00:03:28 to chat with him. And okay, Vinoth is a person who has been in the right place at the right time on many occasions when something interesting around data was created. He was at LinkedIn, at Uber, and later at Confluent. Yeah. And I think
Starting point is 00:03:51 I think it's one of the best people out there to talk about what the data lake is because that's what Houdi is. And it's going to be very interesting to see how he started playing with the idea of like building something like a data lake inside Uber and why these got open like as open source and why now data lakes are so important and so hyped. So I'm very excited. I think we're going to have a very interesting conversation with him. All right, well, let's dig in. Yeah, let's do it. Vinod, welcome to the show. We're so excited to chat with you. Yeah, great to be here. Vinod, welcome to the show. We're so excited to chat with you.
Starting point is 00:04:25 Yeah, great to be here. Thanks for having me. Well, in the world of data, I think it's safe to say that your resume is probably one of the most impressive that I've ever seen. So do you want to just give us a quick background on your career path and what led you to what you're doing today with Hudi? Yeah. First of all, I don't know if I am deserving of all those kind words, but I tend to think of myself more as a one-trick pony who has been doing databases for over a decade,
Starting point is 00:04:58 because that's the only thing I know. So for me, I started first job out of college at UD Austin was Oracle, work on the Oracle server, data replication, what passed for stream processing back in the day, CQL streams, Oracle, Golden Gate, data integration, CDC. That's where I started. Then moved on to LinkedIn, where I led the Voldemort key value store. I think most people have forgotten that project by now, but it was like LinkedIn's Cassandra. It's actually a pretty popular project and actually led that. And we scaled it through all the hyper growth stages
Starting point is 00:05:36 for LinkedIn, from tens of millions of users to hundreds of millions of users. That's what lasted us through that. Then I moved on to Uber, where I was the third engineer on the data team. We had a Vertica cluster and some Python scripts — and this was Uber at 100-plus, 200 engineers, back in 2014. So I spent almost five years there working on data. Since I joined early, I had — and that's kind of what I was looking for
Starting point is 00:06:11 when I left LinkedIn, is like a blank sheet of paper in which I can actually try, work hard, try to build something new, make mistakes, learn, like that journey I wanted for myself. So Uber gave me that and we ended up creating Hoodie there, which kind of like has become, it's great to actually see how the space has evolved
Starting point is 00:06:35 over the last four years. I also did a lot of other infrastructure things at Uber. Uber was one of the first companies to adopt HTTP/3, if you will, as it was getting standardized — I still don't know whether it's fully standardized — so we ran QUIC and replaced TCP with UDP-based connections. I like to dabble with a lot of infrastructure stuff; I used to work with the database teams at Uber. So then I left Uber and went to Confluent, where I met some of my old colleagues from LinkedIn, and
Starting point is 00:07:10 worked on yet another database, KSQL, and parts of Kafka Storage, Connect, and, you know. So generally been around this stream processing database data pipeline, this kind of space for a while. And yeah, I'm like, I have some time now to actually dedicate full time to Hudi. Hudi was something that I kept growing the community in the open source in the ASF for almost four years now. And then finally have some time to dedicate to Hudi.
Starting point is 00:07:39 And I'm enjoying that this year. Very cool. First of all, I don't know if, after hearing that, my conclusion would be one-trick pony. But okay, I have so many questions. One thing I'm really excited about: we've talked a lot on the show about the trickle-down of technology from organizations like Uber, Netflix, etc. that are solving problems at a scale that just hasn't been dealt with before.
Starting point is 00:08:04 And some really cool technology emerges from that. But we haven't been able to talk with someone who is part of the sort of development of that from the beginning. So could you just, we just love the story of Hudi. What problem were you trying to solve? And I think in the age that we live in, it's sometimes hard to think back to, you know, 2014 and what the infrastructure was like back then, but we'd love to know like the tools that you were working with, the problems you were having, and then just tell
Starting point is 00:08:34 us about the birth of Hudi inside of Uber. Yeah, it's a fascinating story, actually. So 2014, as you can imagine, Uber was hiring a lot, growing a lot, launching new cities every week, if not every day. We were really in that phase. And if you look at what we had, it was a typical on-prem data warehouse. And while Vertica is a really great MPP query engine,
Starting point is 00:09:08 we couldn't really fit all of our data volumes into it. If you look at all the IoT data or the sensor data or like any large volume event stream data or any of these things, they don't fit inside that. So we built out Hadoop data lake. Most people did. I came from LinkedIn before that. So very well out Hadoop Data Lake. Most people did. I came from LinkedIn before that.
Starting point is 00:09:26 So very well, like until that, I knew the runbook to what to do. You do Kafka, you do like event streams, you do like CDC, change capture, get a lake up and running and you do- So that was like familiar territory. That was familiar. The things that we really replaced the certain things i wanted to fix over kind of like what we didn't do at linkedin which was we wanted to ensure all data is columnar never have a mix of like like json or don't so we essentially forced and built a lender company to schematize the data and built a lot of tooling around it end to end you would get a page duty alert if like data was broken all of that so so and build a lot of tooling around it end to end. You would get a page duty alert if data was broken and all of that.
Starting point is 00:10:07 So we did a lot of things to ensure the lake could be operationalized. And within a year, we had a lake with data flowing in, where we could do Presto for interactive querying, some Spark ETLs, and Hive, which was still the mainstay for ETL at that point because Spark was at 1.3, just coming up, right? The main problem we hit was — as you can imagine, Uber is a pretty real-time business — that
Starting point is 00:10:39 we had our trip stores upstream, a lot of different databases. We wanted to take the transactional data, which is kind of changing, and reflect that onto the lake. With something like Vertica, you already have transactions, updates, mergers, like it can do these things. It got indexes. And while the data lake could scale horizontally
Starting point is 00:10:58 on compute and storage, it cannot do these things. So that led to the creation of Hudi, where we said, hey, look, we are between a rock and a hard place. We can't fit all this data there, but we don't have these functionalities here. So we chose to basically bring these database-y functionalities or data warehouse-y transactional functionalities to the lake. And that's kind of how hoodie was really born and the key differentiator i would say from some of the maybe the other projects in the space would be right away we
Starting point is 00:11:32 had to support like three engines like presto like like three i mentioned had to work out of the box right and the other thing is like with every company we had our raw data then we build etls on the lake after that so it's not sufficient that we just replicate the trip data very quickly by building updates and everything we also had to build the downstream tables quickly so we essentially borrowed from stream processing a lot having like worked on stream processing systems before we built cdc capabilities or streaming incremental streams into Hudi, even in the very first version.
Starting point is 00:12:08 And so that we can actually, the effect was upstream data store every few minutes. It's up to date with a downstream table on the lake. And then you can consume incrementally from that lake and build more tables downstream.
Starting point is 00:12:22 So we kind of moved all of the core data flows at Uber into this model. And that gave us 10X or even some cases like for our G1 tables, even 100X kind of like improvements over the way that we used to process data before. So fundamentally, Hudi was created around the concept of, okay, yes, we added transactions of data to leads, but the bigger picture was this enabled you to process everything incrementally as opposed to doing big batch processing. That's kind of how Hudi was born.
Starting point is 00:12:57 Wow. I feel like we could talk for five hours because I have so many questions, but a quick question, less about the technology: what was the size of the team, and how long did it take you to go from the idea — the definition of the problem, or maybe an early spec — to having an early version of Hudi in production? Yeah. Okay, it's kind of a funny thing, because I started writing a first draft, at least for the writer/transactional side, I think in my second month at Uber. But we didn't get to build it for a year, because we put the business first — we were just trying to keep things operational, and there were so many other things to build. Finally, we decided to fund the project with three people in, I think, late 2015. And then by mid-2016,
Starting point is 00:13:57 we actually had — mid or late Q3-ish — all of our core ingestion tables at least running on the project. And I think we were only able to do that because we used existing horizontally scalable storage and all the existing batch engines, right? We didn't write a new server runtime or build a lot of things. We didn't try to build a Kudu, for example, which was something we considered back then before building this. And then I think we opened this up as a project pretty early in 2017. So Hudi was the first trailblazer for transactions on a data lake across multiple engines. We mostly wanted to open source it because we weren't really sure we were doing the right thing back then, so we wanted to get more feedback. I could tell you it was super visionary and all that, but we were like: okay, we're doing something a little bit awkward — at least it felt awkward to the people who grew up in the Hadoop space. To me, it felt very natural, because I had been working on key-value stores and databases
Starting point is 00:15:12 and change capture before that. So it all was like, but there's like a lot of the bridges to cross before I think it became a mainstream thing. Can you give us a definition of what Data Lake is? Wow, okay. So in my mind at least, so most people if you ask them, I'll start with that, like a Data Lake is files on like S3 or GCS, so that's kind of like the perception that people have. In reality, I think we built what I would call a honest data lake architecture at Uber, which is what it is. So data lake is basically an architectural pattern for organizing data. You can even build data lakes on RDBMS if you want. The main idea is you replicate your operational schema raw keep it like simple there so you do a el and then you do etl there and then you try to keep so
Starting point is 00:16:14 But over the years the term has been overloaded with a lot of different constructs: it means Hadoop in some people's minds, S3 in some people's minds, Parquet in some people's minds. The basic idea, I think, remains: you have this raw and derived data. And from the impact we saw at Uber, what I can say is that embracing the architecture has a lot of key benefits. It completely decouples your backend teams, or data producers, from your data consumers. You have this raw data layer now, which consumers can use to actually figure out what the data problems are. Otherwise, a lot of people do transformations in flight, so you have to go back to the source system to redo stuff. We had a lot of basic issues around just how we did the data architecture. That's how I see a data lake: as an architectural pattern.
Starting point is 00:17:24 still like a little bit confused. You can see many different pieces of technology that they fall under the umbrella of a data lake without being very clear what the role in the data lake architecture is for them. And obviously, marketing doesn't help with that stuff, especially now that we have all these lake houses. And we're trying to, let's say, take the data lake
Starting point is 00:17:49 and make it equivalent to a data warehouse and vice versa. And that's my next question. What's the difference between a data warehouse and data lake in terms of, like, architectural patterns, right? Got it. So I'll now actually talk about how the system design of data lakes and data barrels typically have been. That is what I think most, what I think what you're speaking to.
Starting point is 00:18:17 If you take a minute and just go back to how we were doing Vertica or Teradata: you essentially bought some software and installed it on a bunch of servers, right? They had deep coupling between storage and compute, and it was a fully vertically optimized stack. Having closed file formats, one query engine, one SQL on top of that data, on a fixed set of servers, meant they could really squeeze out performance per core. That is how your on-prem, traditional data warehouses have been built. And your data lakes, even from
Starting point is 00:19:02 the Hadoop era, they rely on rely on, okay, HDFS or cloud or some horizontally scalable storage. You decouple the storage and then you can fit even back before, even before, like even if we go to 2014, you can fit like I mentioned, a Presto or a Spark or a Hive on top of the same data. So the fundamental difference here is
Starting point is 00:19:24 the data and the compute are more loosely coupled in a data lake. And they're much, much more strongly coupled on a warehouse in terms of like across the stack optimizations and how it's built. With the modern cloud warehouses, they've changed the game where they've actually decoupled the storage and the compute but the format and everything everything else is still like vertical right so there's like one format for snowflake or bigquery one sql and it just like operates in a different way which gives you a lot of scalability over traditional warehouses so that's why you see a lot of people talking about, okay, you don't need data lakes.
Starting point is 00:20:06 You just need like cloud warehouses, right? So if all you're trying to do in life is just BI, maybe they're right. Cost aside, maybe they're right. If you now go to cloud, so while the cloud warehouses have like leapfrogged on-prem warehouses and evolved if you look at on our data lakes in the cloud they're very similar to how they were on-prem
Starting point is 00:20:32 So that's where we are today, and that's where the lakehouse comes in. We did, you know, pioneer transactions on the lake and all that, but we didn't call it a lakehouse back then — mostly because I still feel, even today, that the transactional capabilities we have in Hudi or all these similar technologies are much slower compared to what you would find in a typical, full-blown warehouse. So we were a little shy about those claims, but I think many people weren't. That's what we refer to as the lakehouse now, right? We brought some of this core data warehouse functionality back to the data lake model, fit it on top of a Parquet or ORC kind of open format, and made it accessible to multiple engines.
Starting point is 00:21:23 That's what a lake house is. And it gives you some of the, some of the, some is the important part, some of the capabilities of the cloud data warehouse, while it still retains the advantages over a warehouse, the lake house over a warehouse. For example, it's much, much more cost efficient. It's way cheaper. You can run like,
Starting point is 00:21:45 eventually if you think you're going to need machine learning or data science, it's a more forward looking way to build where you get your data first into some sort of like lake house thing. And then you query, you do your analytics and data science there. And then you can move a portion of your workload
Starting point is 00:22:02 into a cloud warehouse, right? So that's kind of like, I feel like we will go back to that model in the next few years. Because the cloud data warehousing architecture fundamentally doesn't really suit running large data processing on them. So at least a good segment, chunk of the market, I think will move towards this model, I think. Yeah, that's interesting. I remember like I was talking recently with some friends in a company
Starting point is 00:22:34 where traditionally they had, let's say, when it comes to data management and data processing, they had like two parallel paths. They had a data warehouse that was used for data analytics and BI. And then they also had, let's say, a data lake. And it was based on Spark and on top of S3 that was used from the data science team. And what they want to do now is actually they want to move into this, let's say, Delta Lake, like the lake house architecture, so they
Starting point is 00:23:05 can merge these two together. So the two teams inside the company don't use like two completely different like stacks for their work. So that's very interesting to hear also from you because it resonates a lot like with what like people are trying to do. You talk about transactions and getting transactions and implementing transactions on top of like the data lake. Why transactions are important?
Starting point is 00:23:31 Why we need them? Yeah. So if you look at how, let's look at it through the lens of like a use case, right? GDPR. I look back at GDPR and I could see that that was the one use case that kind of trickled hoodie down to everyone else. Because till then, if you look at the stuff that I talked about at the end, sure, Uber had the needs for a lot of business, like faster data. And we did it certain way. And anybody who does that will get the benefits, the efficiency gains that we got. But the business drivers for that weren't simply there before something like GDPR.
Starting point is 00:24:11 So you need to ingest data and then you needed like a team now to go scrub the data and say like delete people who left your service or something, right? So this kind of like you now introduce two teams who want to operate on the same kind of like you now introduce two teams who want to operate on the same kind of like data set or table and then now that forced updates deletes and transactions and that pretty much is what kind of like have made this into like an inevitable sort of transition
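As a hedged sketch of what such a scrub job can look like with Hudi's Spark DataSource — the table name, path, key field, and user IDs are all hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-scrub").getOrCreate()

# Hypothetical: keys of users who requested deletion.
to_erase = spark.createDataFrame([("user-123",), ("user-456",)], ["user_id"])

# Issue deletes against the lake table. Hudi rewrites the affected
# files transactionally, so the ingestion team and the scrubbing team
# can operate on the same table without stepping on each other.
(to_erase.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save("s3://my-bucket/lake/users"))
```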
Starting point is 00:24:37 if you're doing a lake you're going to probably want to just move into one of those one of these like newer things now right so that's kind of the main thing, I would say. Okay, that's very interesting. And you said at some point that, okay, we take something from the database port and we implement it on top of the file system, which is the transactions. But again, the transactions, the way that we implement them is not like exactly what you see like in a data warehouse right so what's the difference yeah what's what is that we don't need only yeah so here i think there are significant key differences like people people tend to talk about the like delta lake or hoodie in the same kind of like thing because we like to compare things and then it's easier for us to compare
Starting point is 00:25:25 things and understand right but if you look at the concurrency model even they're completely different how they're designed so data warehouses do like multi-table transactions for example like here we've been say on the lake we've been lake house we've been saying yeah we can do it we can probably add multi-table transactions, but the locking that you do, they can do more pessimistic locking. They probably can. Since they have long running servers, they can probably do a lot more unique constraint
Starting point is 00:25:54 foreign key validation. These kinds of things that you would expect in a full-blown database, they're able to do today. So yeah. And the other key difference with the current lakehouse architecture is it's completely it's kind of like serverless right it's like a it's like a serverless whatever warehouse if you will that comes up part by part as needed on demand right
Starting point is 00:26:19 A writer comes up, writes, and then goes away; a reader comes and goes away. So there are no long-running processes you can use for coordination, and that makes for some interesting challenges, right? If you take a look at Delta Lake, they pretty much do optimistic concurrency control, which basically means: if two writers don't contend, you're fine; otherwise, one of them fails. If you look at the approach we take in Hudi, we try to serialize everything. We try to resolve conflicts by supporting log-structured, differential data structures: we take in the writes and do collision resolution later on. Because, at the end of the day, data lakes are about high-throughput writes, and these transactions are, in database terms, very large transactions, so you cannot really afford to have one of them fail. Imagine
Starting point is 00:27:18 like a delete job that ran for eight hours and it fails now and then you lost like some eight hours of compute and all this cloud. So we took a very different approach. I could see because we were focused a lot more on streaming CDC data in and like all of those incremental use cases.
Starting point is 00:27:36 If you look at Databricks and Delta Lake, probably they have a lot of batch Spark workload that they run. So they probably don't get that much concurrency overlap. So maybe OCC works well for them. So just like with databases, like how we have an Oracle, Postgres, or MySQL, I think there's so much technical differences
Starting point is 00:27:57 with these projects that we will end up with a bunch of these things, I feel, over time. Yeah, makes sense. Makes sense. Do you see, that's my last question around transactions, do you see the transactions from the data lake to get closer and closer to what we have in databases? Or do you think that there is a limit out there
Starting point is 00:28:18 that it doesn't make sense or we cannot, let's say, pass? I think we can. We can build the same thing we are actually in hoodie at least we are experimenting with adding a meta server so essentially make so if you look at the problem as data plane and sort of metadata plane the data warehouse has servers for both data and metadata the lake has no servers for both data and metadata today, with the way that things have evolved with Delta Lake or Iceberg, where you stick metadata into a file. That's not going to be performant if you compare to what, let's say, Snowflake does, which is keep metadata in another horizontally scalable OLDB database like FoundationDB, for example.
Starting point is 00:29:07 So we are trying to tinker with a model where we have servers for metadata, and we keep the data plane serverless — Spark jobs should be able to access S3 raw and direct. That's one thing we feel will bring it a little closer; this, I feel, is the gap in the lakehouse architecture today. But to the first part of what you asked — do we need to do that? Unless you're running a lot of truly concurrent workloads, today it isn't a pressing thing; the lakehouse vision is just starting up. But to fulfill that vision, I would imagine you need the full set of capabilities: people should be able to run workloads on a lakehouse that are as highly concurrent and highly scalable
Starting point is 00:29:59 as they would on a warehouse. So I think there are technical gaps and a lot of things to be built in the next couple of years or more going forward there. Super interesting. And outside of transactions, what else do you see as a component from like a more traditional database system that it is required also from a data lake or a lake house. Yeah. So I don't know if this fits into the lake house model, but at least for Hudi, we actually borrowed a lot from OLTP databases as well, like indexes, for example.
Starting point is 00:30:37 We have an interesting problem for CDC, right? So, okay, you have an upstream like Oracle or Cassandra, some OLTP databases taking rights. If you have to replicate that downstream to like a data lake table, then, I mean, why are the updates faster on the upstream OLTP table? Because they have indexes and like whatnot to like update them, right?
Starting point is 00:30:56 So if you have to keep up with an upstream OLTP table, your write to the data lake table has to feel like you're writing to an OLTP table. So we invested a lot there. This problem is similar to running a Flink job that reads from a Kafka CDC topic and then updates a state store — essentially stream processing principles. So we borrowed a lot from stream processing and from databases and brought it to the data lake. And that is, I think, at this point, a pretty unique thing that we've been able to achieve.
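A hedged sketch of the write path being described — keyed upserts so CDC changes land as updates rather than appended duplicates. The field names and paths are hypothetical; the record-key and precombine options are the relevant Hudi Spark DataSource settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-upsert").getOrCreate()

# Hypothetical micro-batch of CDC rows from the upstream OLTP table.
changes = spark.createDataFrame(
    [("trip-42", "completed", "2022-01-12 10:05:00")],
    ["trip_id", "status", "updated_at"])

(changes.write.format("hudi")
    .option("hoodie.table.name", "trips")
    # The record key is what the index is built on: each change is
    # routed to the file group holding the existing row, like an
    # OLTP update rather than a blind append.
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    # If two changes arrive for the same key, the latest one wins.
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://my-bucket/lake/trips"))
```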
Starting point is 00:31:30 If you look at a lot of Hudi users, they are able to stream a lot of data raw into the lake very quickly, and that's all possible because of this. For the core warehousing problem, we already have columnar formats; we need to close the loop on transactions and get the usability there. That's something we haven't talked about at all: we've talked a lot about technology, but if we talk about usability —
Starting point is 00:32:00 how quickly can you build a lakehouse versus like starting on a warehouses, warehouses win all the time, right? So these kinds of things are more important for the lakehouse versus like starting on a warehouses warehouse has been all the time right so these kinds of things are more important for the lakehouse vision i think than but but we are trying to add more capabilities on the lake than even a typical warehousing what did you do today yeah that makes total sense and what about the query layer yeah that's a interesting one so i think today if you if the lay of the land is you pick uh on the white if you are on the lake you pick like a presto trino equivalent for a lot of the interactive queries and you write spark or fling or high vtls i think
Starting point is 00:32:41 i know i'm broadly categorizing but that's the major things that pop up, right? And the key thing to understand here is there is a lot of things that we don't typically even classify as query engines, like all different NLP frameworks or like some of them are not even distributed, right? There's like a, but they still work on these open data formats.
Starting point is 00:33:08 So there's a more fragmented tool set around the ML, NLP, AI, deep learning space, and that is only going to grow. So I don't see a future where there will be fewer query engines on the lake — there are going to be more and more query engines. And I think the smarter strategy here would be to have a lake-first kind of strategy and build towards that:
Starting point is 00:33:42 keep your data sort of like in an open format that you can buy support from many people and kind of like have it be more future-proofed. That's kind of like what I think innovatively this is going to lead organizations into. Yeah, I'll ask something like from the completely opposite side of the stack because we're talking about the query. And correct me if I'm wrong, but what I understand is that
Starting point is 00:34:08 the data lake at the end, your work as the creator of Hudi, for example, is to build, let's say, table formats on top of some file formats that we already have, but usually we are talking about Parquet and ORC here, right? Is this correct, the way that I'm understanding it? Yeah, so the thing that this table format term again is like, doesn't do justice to sort of, at least like what Hudi has to offer, for example, right? There's a lot more than what you need than a table format.
Starting point is 00:34:41 So if you look at what a table format is, it's a metadata of your file formats, right? Around what, right? It's a means to an end. What I think we built in open source today is a lot of the services that also operate on the data. Because without them being open, it doesn't matter with open format, right? You don't own the services that operate on them. So you have to basically, you're saying,
Starting point is 00:35:08 I have to buy some vendor who will operate these services for me. So this is the gap that I think like something like refills here, speaking for Hody, we have compaction clustering. We have the bottom half of the warehouse or a lake house or a database which you want to use kind of like available to you which you can now use to query multiple different file formats with and to your point yes we mostly it's analytical storage right but if you look at hoodie there are some use cases that come up where people really don't want a database, but they want a key-based lookup on top of S3 data. We support HedgeFile as a base format, for example.
Starting point is 00:35:54 HedgeFile is the underlying file format for HedgeFile. It's really optimized for user range reads to get batched. You can do batch point key gets from H file. So there are, I think, going to be like more and more use cases like this. I can totally imagine how this can be used for, let's say hyperparameter tuning or something on a machine learning pipeline, right?
Starting point is 00:36:18 So I think there's a lot more that we probably haven't built. And this space is sort of like still nascent in my opinion yeah for all the reasons that i've been citing it's it's still a lot more work to do here do you do you see like any innovation uh any space left for innovation like when it comes to the file formats themselves because okay we take for granted like parquet out there or orc but like that's pretty much what everyone is using right do you see like anything changing there or we need something to change there yeah so that's the thing right so often you know oftentimes in open source that's the other kind of like my i mean i've been in open source for 12 years so but my my own pet
Starting point is 00:37:03 gripe is sometimes i think what wins is the most popular is what happens, right? It is a popularity contest in some sense, it becomes that. While on a more managed service, you get swapped out with something that happens new. So, I think for a change to happen at like that file format layer, I am pretty sure that there can be a new better file format that can be written even like google has a capacitor is the file format on top of underlying big query right it is a successor to dremel which is what park is based on so i mean you can read a blog they don't open source the format this time. There's already one there. So it's more like if we've done this now,
Starting point is 00:37:51 so it's going to take a while for people to migrate. But I'm pretty sure with new processes coming out all the time and there's not documented things around CPU efficiency, around how you access Parquet. So there's plenty of room for improvement, I think. Like original Parquet was designed in an era where mostly on-prem HDFS, right? So you had to care a lot about storage space. But if you now don't care as much,
Starting point is 00:38:17 would you do certain things differently? I haven't put a lot of thought into it, but I'm pretty sure there's something that is better that can come out in the future. That's super interesting. Cool. You mentioned open source. So let's spend some time on that aspect of the data lake. Because let's say we have three, as we said, major technologies out there. All three of them have some open
Starting point is 00:38:43 source presence. And I will start my question with asking you why data lakes are open source, like we can see open source there. And when we are talking about data warehouses, I don't know, I think instinctively the first response would be we don't have an open source data warehouse, right? Why is that? I honestly feel this all started from the Hadoop data lake, Hadoop era, basically, where
Starting point is 00:39:09 I think Cloudera, if my memory shows me right, they boldly declared that everything open-source is the way to go, and I think I agree, but it's basically been a train from there because Spark was open like the major tools
Starting point is 00:39:27 that have succeeded have been open right and then i think we ended up with like the lakes being open and the warehouse was being more closed i i don't know why that is though i do see that there is advantages in being closed and moving faster and you can build more vertically optimized solutions so historically databases have been that way if you even take like rdb every single we won't even talk about something like this in oltb databases for example, right? We won't say, why don't we have a common table format and let's have Spanner and YugaByte and CockroachDB all query that format or something. So I think I don't find that very weird.
Starting point is 00:40:17 I won't be the person who would say, yeah, it should just be open. Otherwise it's wrong. I don't think that's true. I do think that to that point, what do databases add? They add a lot of runtime over that format. And then at that point, you're not dealing with the format. So it doesn't matter whether it's open or not, right?
Starting point is 00:40:39 So what I really care about, again, going back is whether these services are open, right? Can you cluster a snowflake table outside of snowflake if you don't buy that maybe there is someone who can use ai and super cluster your tables automagically they know this is like like a genius who has this like one a clustering algorithm can you use it you can right? So I think that is the main thing that I would say that the lakes bring. And it's been that way. And I feel on the flip side, warehouses do have better out of box,
Starting point is 00:41:13 easy to get started. And like those things, they've made it work for the cost and the cost of openness. And on the lake, I would say, people still have to build a lake, right? You can use a warehouse, but you have to build a lake. You can either download one or you sign up on something and use it, right?
Starting point is 00:41:31 But you need to go hire a data engineer, hire some people, build a data team, and then they will build a data lake for you. So there's pros and cons to both approaches, I would say. I think, I don't know which one's right. Do you think this is going to change for data leaks? Do you see more effort put towards the user experience, let's say, of these technologies? Yeah, I think that suddenly, at least we are doing it and we've been doing it for, that's kind of like how we even got started. If you go to Hudi, you will find a full-fledged streaming ingestion service, right? There is a single Spark Submit command that you can run. And then the tables gets
Starting point is 00:42:10 clustered and cleaned and all this indexed and all this Z ordering or Hilbert curves or stuff that is logged away to even table data bricks or Snowflake, you can find in open source. And we try to give you a tool set where you can actually run it easily. But here is what I see. I think even as we make usable, make it more and more usable, more and more consumable, it's still the operational aspects of it.
Starting point is 00:42:38 I do see people on the community, like really talented, driven engineers, data engineers who come to the community. They're trying to pick up all these database concepts, trying to understand what data clustering is. Why do I, what do you do linear sorting or like, like they're trying to understand all these fundamental database concepts,
Starting point is 00:43:01 try to become platform engineers, try to like run thousand tables and manage that entire thing for their company right and many of them come out with flying colors some of them don't and in any case it takes like a year or more for people to get through that learning curve and do this so this is where i wonder where there is a like a better model here where companies should be able to get started with as easy as how it is i mean okay don't worry about all of this just get started with all of these like late technologies then yeah maybe you don't want you don't want you want to do it yourself right so then they should be able to fork off this is
Starting point is 00:43:46 what i'm suggesting is a pretty much a reverse of what most open source go-to-market people tell you which is your community and then you make it so that you keep it bare minimum and then people can use it and then you build more advanced on top but for the lake i feel like for like with hoodie we try to make everything easy but the problem is people still need to take it and run it it's not non-trivial thing to operate a database as a service right having done that walmart as a service i do like linkedin and like ksql on the cloud and like i can vouch for that much i can talk with some like authenticity so we should make it easy for people
Starting point is 00:44:25 to get started with the lake house, like more of a lake or whatever. And then at a point, your business will grow where it needs data science. Business will need ML, right? At that point, you can decide, okay, am I going to be able to hire better engineers than that vendor?
Starting point is 00:44:42 Then you shouldn't be bottlenecked on the vendor. You want to move quickly. You should be able to branch out from open source, run your own thing. Right? So that is, I think, the model that we should build. And unfortunately, what happens in the data space today is, it's like, you may remember the famous Parquet ORC format wars of the Hadoop era, right? I mean, where two companies were just like the same twoquet orc format wars of the of the hadoop era right i mean where two companies were
Starting point is 00:45:06 just like they're saying two formats or whatever it kind of like doing the same thing to table formats which defeats the whole point of the thing having being open to begin with right because most data lake companies are a query engine or like a data science stack and they're basically going and upselling users, hey, use this format, use that format, including Hudi, right? But the real problem here is they have to go and hire the engineers
Starting point is 00:45:34 and do the ops and like data engineers have to get every optimization right for that organization to, someone signing the check high up is like, oh yeah, you are like better than the warehouse or you are you're now future approved for the organization to see the benefit so i think if we don't fix this problem this way it's it's not about technology i think we can fix all the all the gaps but i think this is the problem that i see that the managed aspect of it it is no easy
Starting point is 00:46:02 way to get started so otherwise i think I think it will remain in the cloud. Cloud warehouse will be the entry point and you build a lake when you're suffering from cost or openness or you want data science team. That's how it will be if you don't fix it this way. Quick question on that front. And I'm thinking about our listeners who,
Starting point is 00:46:21 we certainly have listeners who are sort of managing complex data lake infrastructure, but I'm thinking about our listeners who maybe started with a warehouse and they know that the data lake is inevitable in some ways for their organization. But to your point, that can probably be a big step. What are the things that they need to be thinking about or even sort of planning for, you know, six months or a year away from sort of the inevitability of like needing a larger data lake infrastructure? Are there decisions or architectures or sort of even ways they think about data now that
Starting point is 00:47:00 will help them make that path smoother, even though the tooling isn't quite there to make it easy for them? Yeah. Yeah. So the first thing I would say is like, no, like do, do more of the, the streaming event based or the, the Kafka hub kind of like architecture, right? Because it really having the ability for you to get all your data streams in a single kind of like firehouse.
Starting point is 00:47:28 And then you can now tee this off to the warehouse or to the lake. You have that flexibility. I would say most people who are in the journey today are using like a opaque kind of like data integration pipe, which takes data from a data lake and let's say FITRAN, for example, or FITRAN. By the way, really great services. But I'm just like the architecturally, you just know, by the way, the really great services, but I'm just like the architecturally, you just don't see the tap into the data streams. It's, it's, so you, you really have to capture. There's like a core data infrastructure pipe that those tools actually need to feed into for you to actually feed it out into your arms. Yeah. Yeah. Switching my hat a
Starting point is 00:48:01 little bit. If you look at my, my, my, like my life at Confluent, like what we are to build was, okay, you do like the source connector, the sync connector kind of decouple. So you get the CDC logs from like an Oracle or any database, and then you can feed it to many other systems. So a lake and a warehouse. So make sure your data can flow into both and you have the optionality to pick which one you want to send where that's one the the other thing is start with probably your more the raw data move that to the lake that's where you have most of the the the data volume and since it's usually in a wild tp schema not optimized for like analytical queries uh that's where probably you're spending most of your cost on a warehouse as well, because like they're not really in that schema. So those are like really good candidates for you to start. And then in most scenarios,
Starting point is 00:48:52 the derivative tables, you can keep them there. They're more performance sensitive. So you can slowly migrate them over here, right? And then what you need in the meantime is you should really push for your cloud data warehouse provider for better external table support. Because they have no incentive to do that.
Starting point is 00:49:13 Unless you force them to do it. Because technically speaking for organizations, what I can see is, okay, I'm using pipeline company X and then using data warehouse Y. And then if you want to now build a lake, right? And offload your raw, like you want to build a lake and going back to our first question, you want a raw data and a derived data layer. You want to move raw data out. I mean, if you do it,
Starting point is 00:49:39 then all the SQL has to still run, correct? Like for you to build the derived data. So that is where I think there is stickiness and like lock-in points for warehouses where unless that SQL can run in a reasonable amount of time on the lake, this project would fail, right? So for example, in Hudi, we just added DBD support
Starting point is 00:49:59 so that you can get raw data tables in Hudi and now you can use DBD to probably like transfer over. We'll be working towards more parity or more standardization. We are today as standard as what Spark SQL is, right? So you can now use that and use DBD to do the transformation on the lake even. There should be a way for you to move the workloads
Starting point is 00:50:23 to the lake seamlessly. Think about those abstractions, whether it's dbt or airflow, like how compatible SQL is. Think through all these things. But if your cloud warehouse provider provided better external table support, then you can keep those queries running, even though if you offload the raw data lake, you can try a Presto or some other lake engine in the meantime,
Starting point is 00:50:47 as you decide how things are going, right? So it's not going to be an easy switch. This is going to take a year or like at least six months for you to switch a reasonable amount of data. So planning ahead around all these touch points is what I would kind of like advise to think through first. Sure. I think that's really helpful because the question I asked was, what do you need to be thinking about if you're sort of going from a warehouse-based infrastructure and then adding the lake infrastructure?
Starting point is 00:51:22 And you would think that the answer is more around the lake, but it's actually more around the orchestration and the pipelines and giving yourself option value as it relates to all the various components of the stack that are going to arise from moving towards the lake architecture. Right. I've seen many companies, right? So I categorize them into two buckets one is if you
Starting point is 00:51:46 don't do this right what happens is there is a lake but no one's using it and then over time the data quality goes like slowly these products start to fizzle out if you don't do this right the ones that succeed have top-down kind of energy to say okay we're going lake first and we're going to like revamp the whole thing. In lots of scenarios, for example, the lake comes in when data science comes in. When data science comes in, usually what comes in is data scientists would show up and say, hey, like, okay, fine, you want me to improve your app, but give me some events. Tell me what's going on in that. Then Kafka comes in, and then you like pump a lot of events, right?
Starting point is 00:52:25 And then that's when the data volume spikes. And then that's when people are like, oh yeah, wait, like, right. This is kind of like how that cycle works typically. People who start like that have lot more drive to get it done that way. And like what we try to, the missing puzzle there is moving data from a warehouse
Starting point is 00:52:43 and replicating the database of SaaS data that you may already have in the cloud warehouse. But those are people who are more leaning on the, I'm going to pay the double cost of warehouse and lake for us sometime. And then over time, I'll figure out how to move things. And I think this will be the most interesting thing to watch because right now, given the performance, like the bad things are, right? Lake is super optimized for running large scale data science machine learning workloads. The VARs are really optimized for running like BI.
Starting point is 00:53:15 Then I think that the BI workload stays there. The data science workload stays here. I think as we build tech, I think maybe they will like more BI goes. That's what the rise of Starburst tells you, rise of Presto tells you, right? I think it's very interesting times, I think, to be building data. It's going to be super fun. It's going to be super fun. I have one more question for you. And this kind of goes back to where we started with the origin of Hudi.
Starting point is 00:53:41 I'm interested to know, so you actually, you got it running with three engineers in production in a pretty short amount of time for developing a new technology that's sort of managing the scale. Was there a sort of feature or optimization in the business that sticks out in your mind as like an early win? I just love for our listeners to, to hear like okay so you developed this amazing technology like i'm sure we have users of hoodie or people who want to use it but i just love to know like what was an early win inside of uber that came directly from the
Starting point is 00:54:16 technology yeah there's like a direct dollar value attached to that project at that point like uh dollar value that is that exceeds like hundreds of millions of dollars because the we were able to run fraud checks a lot faster which meant we report to banks a lot faster and you can imagine the how complex these checks would be they're very hard to write those in a streaming sort of way and like kind of like you know get it right but if you have the like high query basically or something like that right we needed real near real-time data not like real time real time but we needed to be running some checks like every hour for example as opposed to
Starting point is 00:54:56 every 12 hours or every and the you can imagine right at uber scale the amount of like rides and everything like the the banks typically give you money back more money back if you report sooner like kind of like how sure sure again don't quote me on this is how it was there i don't know how banking rules have changed this is not financial advice yeah it's not financial and that was the main driver and then of course there was like intrinsic uber is it starts raining it affects our business right there's like a huge concert the traffic changes so intrinsically the business had real-time business real-time needs and this sometimes hard to put a dollar value around it except for we can count the number of times people wish data was there sooner. The stable was built faster.
Starting point is 00:55:46 But the real tangible dollar value was we can do all the background things, rider safety, for example. We can do all these like background things like tasks and data processing that we do to make Uber experience really better. It can run faster, quicker incremental that is sort of thing and this was actually not a very i i at least came with that mindset because at linkedin the main thing that we would try to incrementalize was people you may know for example it's a very complex graph algorithm but the whole like we've spent a lot of time around hey if you connected you and i connected now and then it'll be cool to like, I go to LinkedIn and then I get the thing right away. Probably they've made it work now. I haven't
Starting point is 00:56:29 kept track, but we were in that mindset. Okay. Yeah. Let's make all the batch jobs incremental. There is no reason for them to be running full batch and eating up our entire clusters, right? So that's sort of how we went about it. Amazing. Well, Vinat, this has been an amazing conversation. I know we could keep going, but we're at the buzzer. Thank you so much for joining us. I learned an immense amount and I know our audience did too. So thank you for sharing some of your time with us. Yeah, yeah. Glad to be here. And these are like really deep questions. So, so thank you. Thank you for these questions. It also helps me think better.
Starting point is 00:57:08 All right. Thanks. Thanks, everyone. I'm going to break the rules. In this recap, I have two takeaways. One, I love that he called himself a one-trick pony. I think he was very authentic and humble, but that was just hilarious to me. The other one, which we talked about right towards the end of the episode, sometimes
Starting point is 00:57:32 you think about the gains from sort of building your own infrastructure. How do you calculate ROI on that? Is it engineering time saved, et cetera. But he was talking about financial transactions to the tune of hundreds of millions of dollars, which is wild. And those sort of stakes are really, really high. And so that was just amazing to me. I wasn't expecting that quantity of a sort of ROI impact, but it's massive. So that's just, man, that's crazy. Yeah, yeah, 100%. I think it was super, super interesting conversation that we had.
Starting point is 00:58:13 I think that we managed to make much more clear what a data lake is and why it is important. And what's the distinction also with a lake house? Where things are going, where they are today. And we had like a pretty technical conversation, but without getting into like too much technical detail. Yeah. But it was very, I don't know.
Starting point is 00:58:37 I really enjoyed this conversation and we definitely need to get him back. I think we have much more to discuss about. We didn't have the time, for example, to talk about open source, open source project governance, like what's his experience there, why it is important.
Starting point is 00:58:53 Yeah, I'd love to hear more about running a project like Hudi within the Apache Foundation. I mean, that would be so interesting to hear about. Yeah, 100%. So yeah, hopefully we will manage. I think he was the first guest that had like an immediate relationship with a data lake technology. There's more out there. Hopefully we will manage to get more on the show to discuss about that.
Starting point is 00:59:18 Both like lake house and data lakes and everything. So yeah, I'm really looking forward to have him back on the show again. We'll do it. All right. Well, thanks again for joining the Data Stack Show. A lot of great episodes coming up. So make sure to subscribe
Starting point is 00:59:32 and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
Starting point is 00:59:44 You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
