The Data Stack Show - 99: State of the Data Lakehouse with Vinoth Chandar of Apache Hudi

Episode Date: August 10, 2022

Highlights from this week's conversation include:
- Vinoth's background and career journey (3:08)
- Defining "data lakehouse" (5:10)
- Databricks versus lakehouses (13:37)
- The services a lakehouse needs (17:37)
- How to communicate technical details (26:55)
- Onehouse's product vision (31:41)
- Lakehouse performance versus BigQuery solutions (36:44)
- How to deliver customer experience equally (40:17)
- How to start building a lakehouse (44:00)
- Big tech's effect on smaller lakehouses (55:33)
- Skipping the data warehouse (1:04:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, we always talk about getting guests back on the show. And we haven't actually done a great job of that. But it's kind of hard with all the scheduling stuff. But we were able to do it.
Starting point is 00:00:39 Vinath, who was one of the creators of Apache Hootie, is coming back on the show. And I am really excited because last time we talked to him, his project was in stealth mode. So I remember before the show, he said, we can't talk about, you know, what I'm working on. But it is now public. It's called One House. And it's super interesting. It's a data lake house built on Hudi, of course, which isn't a huge surprise. So I'm super excited to learn just more about OneHouse and the way they tackle the problem.
Starting point is 00:01:11 But one thing I want to do, we got a really good explanation from Vinath last time about the difference between a data warehouse and a data lake. I mean, maybe one of the best explanations we've heard, but OneHouse is squarely in the data lake house space. And so I want to leverage his ability to articulate these sort of, you know, deep technical concepts really well to ask about what the data lake house is and just get a
Starting point is 00:01:37 definition. So that is what I'm going to do. How about you? Yeah, I don't know. I'll have a hard time, to be honest. Vinod is one of these guys that's always awesome to chat with him on a deeply technical level.
Starting point is 00:01:52 But I'm also very interested to hear more about the product they are building, the business they are building, and his whole experience of going from monolithic like Apache open source projects to trying to build a business on top of that.
Starting point is 00:02:10 And lake houses are also like a very interesting new, let's say, like product category out there. And I'd love to hear more about that and how he sees the future. So we'll see. I'm pretty sure we are going to have like a lot to chat about with him. There's no question. All right, let's dive in. Back to the show. This is your second time joining us on the Data Stack Show, and it's so good to have you back. Yeah, it's fantastic to be back. And, you know, I look forward to another. Last time around, I think it was a very deep, interesting technical conversation.
Starting point is 00:02:42 So I look forward to another round of interesting conversations here. Absolutely. Well, for those of our listeners who missed the first episode, we have to ask you to do your intro again. So can you just give your brief background? And then I'd love for you to finish by, last time you couldn't talk about this publicly, but what you're doing today at OneHouse. Yeah, my name is Vinod, and I've been working on open data infrastructure
Starting point is 00:03:11 in this area, in our own databases, data lakes, for the last 10 years, MJ. And I started my career at Oracle with the Oracle server data application that the key value store in LinkedIn during the time where, you know, key value stores was the cool thing that you built.
Starting point is 00:03:31 Then we moved on to Uber, which is where, you know, Apache Hudi happened. And we kind of like, you know, brought transactions on top of our how to deal like back in the day and what we call transactional data lakes. I think it's a pretty nerdy engineering name, which is very, you know, kind of what is known as the lake house kind of architecture today. I continue to kind of grow the project in ASF Apache Foundation. I still work with the PMC chat, so for Apache Hoodie. And right, you know, after Uber, I actually had a, you know, a good, had a good amount of time at Confluent as well.
Starting point is 00:04:08 I wasn't working on Hoodie, I was working on Kafka. I was working on KSQL DB, if you heard about that streaming database and connect a bunch of other things. And most recently, I think now I'm super excited to talk about OneHouse, which is where my current employment lies. I'm CEO founder at OneHouse. Our goal is to bring managed data lakes or lake houses into existence. We see a world where there are fully managed flow systems and then there is DIY open systems in the world.
Starting point is 00:04:46 And we're trying to actually build sort of like that kind of managed experience on top of open technologies like Apache. Love it. Okay, let's, I'd love to kind of set the stage and focus on a term that you mentioned, which is lake house. And some of our listeners will be familiar with that.
Starting point is 00:05:05 Some of them will have seen it in some sort of marketing materials, I'm sure out there. So I want to ask you for a definition of data lake house. But before we go there, could you remind us what the original use case for Hudi was specifically for transactions on data lake. So what were you facing in that role inside of the company? And then why did you need transactions on data lake? Got it. So yeah, so for this, I think we need to go back to actually 2015, 2016. And Uber was growing very fast.
Starting point is 00:05:43 We were building out our data platform and all we had was an on-prem data warehouse at that time. And while essentially we were hiring fast, we were building a lot of new products, we were collecting high-scale data, right? So we couldn't fit all this data into Uber on from warehouse. It's not built for this amount of storage. A Hadoop cluster is like an HDFS cluster, even before Uber. At LinkedIn or Twitter or in many places, Facebook, it's been scaled to several hundreds of petabytes, at least. So we built out our Hadodle cluster, our data lake. And here is where I think we had a very interesting problem
Starting point is 00:06:28 that, you know, remember like my previous stint was at LinkedIn. This was something that we didn't even face at LinkedIn, which is Uber is a very real-time business. So if it rains, the prices change, you know, and then there is a huge operational aspect to the company. There are 4,000 engineers and let's say 12,000 people who are operating cities. And they all need access to fresh, near real-time data about what's going on out there.
Starting point is 00:06:57 So essentially, what we found was while we can stand up a Hadoop cluster and dump a bunch of files onto that, you know, on there and, you know, bring Spark or something and write some queries, we were not able to, like, you know, some of our core data sets at Uber, like the trips, transactions, and these core database tables, we were not able to actually replicate them very easily onto the data lake. We would suffer multi-hour delays, eight hours, 12-hour delays in first ingesting it and then writing ETLs on top of it. So it got to a pretty serious level where people actually figured out we couldn't run our fraud checks fast enough. So we were actually losing money from fraud. It was like a pretty serious business problem, actually. And we actually started to look at,
Starting point is 00:07:47 hey, how do we solve this? And, you know, and we essentially actually looked at what we had before that. How are we solving it before Hadoop Twister? We were, you know,
Starting point is 00:07:58 like the on-prem warehouse that we had supported transactions updates and it can actually do kind of like, you you know you can write it like merge style etls on top that people currently write using dbt on all of these warehouses right so essentially we were like that's pretty much it like we need to essentially build that sort of functionality bring it to the lake but do it in a way that we we retain the the
Starting point is 00:08:23 scalability all the the cost efficiency, all the different advantages of the lake. And that is how Hudi was born. So we essentially called it a transactional data lake because in our mind, what we were doing was introducing basic transactions, building some indexing schemes, updates, deletes.
Starting point is 00:08:41 You can now, your data lake is now mutable, which means it can absorb, you can get a change know, your data lake is now mutable, which means it can absorb, you can get a change record from upstream, you can update the table instead of rewriting the whole thing, right? And that's kind of like how Hudi was born. And it was pretty like yearly, like it came before most of the other contemporary technologies that you see out there. Love it.
Starting point is 00:09:03 Such a great story. I remember you talking about that in the previous episode and it's just so wonderful to hear the Genesis story again. So you've kind of already answered a lot of those questions, you know, from a historical lens, but define, so you know, with that context, define the data lakehouse especially sort of through the lens of how you view the world at one house? Yeah, that's a great question.
Starting point is 00:09:31 So actually one of the key, you know, technology-wise, like what a lakehouse adds to a data lake is, as I mentioned, transactions updates, right? It gives you more, like, you know, upgradability. So it gives you like an impedance match with how you do things on the warehouse, if you can put it that way. Like from a user standpoint.
Starting point is 00:09:53 There is also from a user standpoint, there are two other important aspects though. These are mostly used to kind of, you know, improve the baseline performance of the data compared to a warehouse. For one is the metadata management. So most warehouses, even cloud warehouses, if you see today, they actually have pretty good fully managed metadata systems where if you want to execute a query, statistics for different files, columns, yadaada yada, all of these things are sort of, you know, like well maintained and they're organized in a way that queries can plan very quickly, right?
Starting point is 00:10:32 So that is another angle that piece of technology that the Lagos adds. Because the Lags were pretty much, you know, files and the individual query engines would, you know, like high metastore is would, you know, high, like high meta store is basically what we had for metadata management. Right. And if you look at what high, but high metas will never track any file level statistics or anything. So really file level granular statistics and all of these things, that's one big area. Like the second, which, which is where I think in hoodie, we spent spent a lot of time around it, and we are much further advanced, there is what we call table services. So if you look at any
Starting point is 00:11:13 warehouse, take Snowflake or BigQuery, you'll find a fully managed testing service. You'll find all these different services that do useful things to your table, And they're all self-managing. You don't write code for all of these things. But if you look at sort of like, even with the, you know, that's why I feel the table format doesn't do justice to sort of like what we need to build overall. The table format alone is not important.
Starting point is 00:11:41 You need like a set of services that rival warehouses that can provide you clustering, you know, data loading, ingestion, all these other things. This is where what we focused a lot on Hudi. And this is, I would say all these three put together, you know, like the, the, the storage format, like the table format itself, you know, accepting updates, deletes, and like transactionality, plus like a well-optimized metadata layer,
Starting point is 00:12:11 plus these kind of like well-managed table services, they give you together, I imagine like, you know, if you take a warehouse and break it sort of like horizontally, you get a bottom half of a warehouse today. And then you can fit like a, you know, query engine like Spark or Kino or Presto or anything really on top, right? So that in my mind is what a lakehouse should be, right? And in that sense, yeah, that, you know, connecting from the one house, this is sort of like what we want to unlock is for people to be able to get this bottom half as a service while they have the choice to pick any query engines we can choose. Love that.
Starting point is 00:12:57 Okay, one more question from me, Kostas, just to help me and our listeners set the stage. So, you know, from a marketing standpoint, Databricks has invested a lot in the lake house term, you know, which is maybe one of the ways that a lot of our listeners, including me, are just, you know, are familiar with the term or have become familiar with the term. How do you think about one house in relation to, you know, to sort of Databricks flavor of Lakehouse? Are they similar in terms of like, I love the illustration of the bottom half of the warehouse, but help us understand the differences and similarities. Yeah.
Starting point is 00:13:34 Okay. So that's a great question. So I think Databricks' articulation of Lakehouse is slightly different, right? I think if you're going from the paper, even, essentially, it's a Spark Lakehouse, essentially, or a Spark Databricks Lakehouse, right? And even if you look at Delta Lake, there is an open source version of Delta Lake,
Starting point is 00:13:59 and then there's a paid version of Delta Lake. So they essentially have two flavors of the bottom layer, if you will, that I just mentioned. While they have a top layer, which is a super optimized Spark layer, and they know Forton and all of the investments that they put into that. Honestly, they can apply to other formats as well, right? It's end of the day, see, all these stable format games, they're all creating Parquet files at the end of the day. see, like all these stable format games, they're all quitting market files at the end of the day. So sure, if you can optimize, I think it's a decoupled problem.
Starting point is 00:14:30 And the way they market it is as a full vertical stack against Snowflake, right? That's kind of like, at least where I've seen most of their marketing energy being spent so far. And that's probably because Snowflake is one vertical stack. Correct? Yeah. But if you look at the pieces
Starting point is 00:14:50 overall, it's still kind of like online. The biggest problem, and we see this with a lot of, you know, Hudi and Delta
Starting point is 00:14:58 have been like around for much longer supporting mutable workloads and, you know, everything, right? For like three, four, three years now, right? And out in production. So we routinely run into this. People want either people like Hudi for how rich of a table service
Starting point is 00:15:15 ecosystem it has, how actually vibrant and grassroots open source the community is, or several technical differentiators like concurrency control or indexing and whatnot, but they still want Databricks, Databricks Spark. So I think as Hudi, we didn't have to care as much about that, but as One House, we deeply care about that because somebody who wants to buy both One House and Databricks should be able to get a really good good end to end experience. So even for us, some of the thinking is now very, you know, customer focus that way, I would say. So there is a slight difference. We don't believe in one vertical stack, you know, I think this can be accomplished by breaking the bottom half
Starting point is 00:16:03 separately and then fitting every query engine. So let me just give you some data, right? You take like Ray, Flink, and then, you know, DASCO, like any other upcoming query engine. For what it's worth, you know, between them, they have some 50,000, 60,000 new dev stars, right? So it's a multi-engine, it's like a new thing that I think like Bodo,
Starting point is 00:16:23 like there's going to be new query engine innovation that's going to happen. So I think decoupling the data layer from the compute layer at the vendor or even at the staff's level is a good thing overall, we feel.
Starting point is 00:16:39 Yeah, super interesting. Yeah, it's almost like bring your own interface to the bottom layer or multiple interfaces, which is super interesting. Yeah. It's almost like bring your own interface to the bottom layer or multiple interfaces, which is super interesting. Okay, Kostas, I could keep going, but please, I'm actually more interested in what you're going to ask than what I've already asked. Yeah. Kostas Svitorka Oh, come on.
Starting point is 00:16:55 Like, that's not true. I think you are like asking all the interesting questions. I'm, I'm boring. I'm just asking like a little bit more technical stuff. That's all. But yeah. Okay. I have like something that I really want more technical stuff, that's all. But yeah, okay. I have like something that I really want to ask you, Vinod,
Starting point is 00:17:08 because you mentioned something. You said that there is a number of services that like a lake house need to have, like in order to rival warehouses. Yeah. And so I really like like the word rival, first of all. Yeah. But can you tell us like, let's, I mean, you mentioned them, but, like, let's enumerate, like, these services again so, like, our audience, like, has, like, much more clear idea
Starting point is 00:17:32 of, like, what we are talking about in terms of, like, technical services there. Got it. So let's start from the initial, right? Like, you need a service that can, you know, ingest data, first of all, right? And we built an ingestion system in Hudi from like three years ago. So this is similar to an autoloader kind of snow pipe or, you know, I don't know exactly what it's called, like what product it's called.
Starting point is 00:18:00 So I think there's an ingestion system that can like load data that we use down to cloud storage or different sources. That's one. And there is reasons for it to be aware of the sync because you can do checkpoint management and any other things very, very easily if the system actually understands that it's who it's writing to. Number two, when you update data on underneath what happens is the version files you you create garbage right that is you're writing new versions of files and you you the old version of somebody needs to clean up this is what we call cleaning in equity and what you know is called like vacuuming i think in and depth like and you need a service that can
Starting point is 00:18:41 actually know you can't tell it, Hey, I want to read in X versions or something, and then it can automatically do this for you. Right. That is one. The third thing, as you know, failures happen when you're writing to a table and you have like some leftover files, uncommitted data lying around. You need, you need systems that can like, you know, services that can clean the data so that, you know,
Starting point is 00:19:08 these, like, dead files don't litter up your tables and things like that. Number four, this is slightly specific to Hudi, but Hudi supports a merge and read storage type where we can actually land data very quickly in a row-based format or, you know, flexibly in a column-based format, and then later sort of, like, compact it, right? And when we say compaction,
Starting point is 00:19:26 when what we mean is what compaction means in databases, like, you know, Cassandra, HBase, or it's like compacting Delta files into a base file. So you need a service that can do that. And Hudi's compaction service can, for example, like keep compacting even while the riders are going on. As you can imagine, at Uber or TikTok, where there's a stream
Starting point is 00:19:49 of high-volume data coming in, it's impossible to stop and do OCC, optimistic concurrency control at all for this thing. So you need service like this. Again, I'm making the case that this has to be deeply aware of this. Services need to be aware of each other. And that is how databases are written, right?
Starting point is 00:20:07 The other one is clustering service. Like we implemented reordering Hilbert curves and just like linear sort order clustering. So fundamentally, what a table format metadata layer can do is remove bottlenecks in planning, right? It can store file things under statistics, which is used to plan. But end of the day,
Starting point is 00:20:27 if you look at, you know, most warehouses, for like, you know, high performance sensitive reports and stuff, people actually, you know, tweak performance by clustering and playing with the,
Starting point is 00:20:39 with Invertica, I think it's called projections and they're very different names and different things. But you tweak the actual storage layout to squeeze performance, right? And then you need a service which can actually understand the right patterns that are happening in the table, schedule these clustering, execute them. If they fail, they retry, right?
Starting point is 00:20:59 So what Hudi actually, the bulk of the value that we add, we believe, is in this layer where you write to a Hudi table, all of these services will be scheduled, executed automatically. They can fail. They'll be retried. Otherwise, if you take a very thin table format as an alternative, then you need to write all these jobs yourself. And what I've seen from my LinkedIn days in the last 10 years living through the how-to parts
Starting point is 00:21:30 and Cloud Era, how-to parts, all of these things, everybody focuses on the format. Like, you know, as if you solve the format and then everything's fine. But open alone doesn't cut it.
Starting point is 00:21:41 That is the painful lesson that we should learn from the rise of cloud warehouses. What we should, we should focus on the standardized services and they take years to get like standardized
Starting point is 00:21:51 and like, you know, hardened in production scale like this. I think this is the main thing that we are not right now, even around lake house marketing by any vendor, I don't see enough emphasis laid on some of this
Starting point is 00:22:05 i've recently started noticing that you know reasons you know had some content on this recently i think starburst had some recently but it's a very recent thing that has happened in the last few months and this is what we've been at for last three years okay so so just to make sure that I also understood correctly, right? We start, like our foundation is a data lake where we store their parquet or ORC files. Let's say parquet, like that's the standard. And on top of that, then we need like a number of services. I counted five.
Starting point is 00:22:46 I hope I didn't miss anything. Yeah, but let's say at least the most fundamental ones, right? So we have an initiation process there. We need some service that's going to prepare the data and make them available. We have vacuuming or cleaning and say cleaning and taking care of like all the version files and like all the stuff that are happening like on a low level to make sure that we increment concurrency. We have some kind of garbage collection, let's say I'm using garbage collection more like a broader
Starting point is 00:23:19 term. Combaction with... Combaction from what I understood is like more of a specific use case for Hudi because you have the columnar and they're all based like representation. So you at some point take these two and you merge them into one or something. Is this correct? Yeah, it's correct. I think it's slightly different. Most of the other two projects are written for us a file statistics tracking system. But this is where Compaction is not new at all to, let's say, RocksDB or LSM stores or anything in the database world. And as you know, I come with that background.
Starting point is 00:23:59 So Compaction is more about controlling, I want to write smaller amount and then queue up a lot of these updates, later merge them, instead of merging them right away. Okay. I think that is the key technical rationale for compaction. Okay. That makes sense. Is this like something similar to like what happens when like tube stones are like, for example, used and then you go and like remove the TubeStone from there so you can actually like delete or not? Peruptually.
Starting point is 00:24:35 Like if you, if you read a block structured merge trees, LSM trees, for example, they will talk about this whole bunch of signs around how to balance write cost and read cost and merge cost. And that it's a very, very, you know, widely adopted database, right? From Google's Bigtable to Cassandra to HBase to RocksDB to LevelDB. That's all they use there. Henry Suryawirawan- Awesome.
Starting point is 00:25:03 And then the fifth one has to do with what you call like clustering, which is more about like how you can optimize like on a lower level the data, how it is stored. So you can actually do improve performance, right? Is this correct? Does this have to do with encoding or like give us like a little bit more of like information? Stas Milosav. So I think clustering changes how you start, how you actually pack records into files. Just that if you know something about the query, let's say, for example, you are a SaaS organization, and you have thousands of customers, and then you're a SaaS organization and you have thousands of customers and then you're collecting logs from them.
Starting point is 00:25:46 And then you know that your query patterns mostly are, you'll query for one customer at a time. Then instead of spreading this data across all the Parquet files in your table or a partition, what you can do is you can cluster them so that the records similar to one customer is in like a fewest number of files, which means when you query them, you read the smallest amount of data, right? This will give you 10, 12x, like, you know, the order of magnitude
Starting point is 00:26:18 of query performance. And while I feel compared to, let's say file listings, file listing is a real problem only for very large tables. Right? So related to all that, this fundamentally affects your compute dollars. And it can dramatically reduce cost for your lake. All right. So that's amazing. My question is, and going back to the initial question, these are like, let's say, the minimum set of additional services
Starting point is 00:26:47 that the data lake needs in order to rival a data warehouse. But there's a big difference that I see here. And the difference is that with a data warehouse, I don't really care about all that stuff, right? I don't have to know about all these very technical and interesting details details right yeah uh while in the lake house like okay we have to talk about that stuff so yeah how do we change that because not everyone wants to become like a database engineer right uh in order like to query and
Starting point is 00:27:18 store their data yeah unfortunately we opened that door when we wanted updates on the data lake, right? Because before that, if you like appending some files to a folder and then collecting statistics on it, I think it's a very simple thing to do. It's like, you know, conceptually, it's very easy for people to understand. And people in the data lake have grown up thinking about everything as formats. If you look at the updates, you turned it into a database problem. And if you look at the database world, you don't actually see, I think I made this statement even last time. You don't see CockroachDB, MySQL, everybody saying,
Starting point is 00:28:04 let's standardize on one format and then build something on top. It's not a thing. When you change into a database problem, the stuff that we talked about, those are the higher order problems. So, to answer your question,
Starting point is 00:28:20 what do we do to change this? That, honestly, is at the core of why we even started OneHouse to begin with. And this is what I say in a lot of places. A lot of people have asked me and they come up to us for enterprise hoodie support or something. That is not what we're trying to build here at all.
Starting point is 00:28:36 We're not trying to build an enterprise hoodie company. What we've seen, and you've spoken to Kyle, our head of product, who was in a different camp before this, technology-wise. The common thing that we see is it takes six to nine months for people, for data engineers to become database engineers and platform engineers, understand all these concepts and actually implement them.
Starting point is 00:28:59 So what if there existed a similar managed service where you can click four buttons and then you have your you know lake house up and running and you know it's open it's meaning like it seems I think open is super overloaded with marketing these days
Starting point is 00:29:19 truly what we care about is interoperable and extensible right so if you have an engineering team, you can go to the project, you can contribute to the project, get a seat on the table, on the PMC. Yeah, that exists. And then it's interoperable. It works with every open standard. There is no vendor bias or anything, the project, right?
Starting point is 00:29:37 So we need a foundational technology like that on top of which we build this management. That's how we are thinking about it. Even I speak to a lot of cloud barrows users. That's like my day job right now. And what I see is ultimately they realize this, right? They start with a fully vertical start because it's fully managed. And like you say, people don't even have to care about it.
Starting point is 00:30:05 But you're signing up for a two-year, like a migration project two years down the line, right then when you're making the choice, I think fundamentally we need to like sort of bring some manageability to it. Open alone won't cut it. That is what I'm trying to say. Like open alone is not a key business thing. Customers are looking
Starting point is 00:30:27 for how soon can I get my lakehouse up and running, technology aside. And we have to focus on that. And I feel while it's open as the only kind of USP against a closed stack or, you know, to take on like warehouses is not good enough in my mind. Cloud era, how to not stray that and fail. Yeah. I would say. Yeah. That makes sense. I mean, so, all right.
Starting point is 00:30:54 Having, let's say, the experience that someone has with like a cloud data warehouse, something that's, okay, it's good. Like we are after that, right? Like we want to offer this over like a data lake. And that's what one house like is, from what I understand. So do you like to spend a little bit more time to explain to us how we can go from these at least five pretty complicated technical concepts and services to an experience with a couple of like clicks on a cloud dashboard we can have let's say a lake house up and running and we can start like interacting with it so how does it work like how what's your vision uh for one house like from a product perspective yeah so uh honestly
Starting point is 00:31:41 like even detaching myself from right if you have to look around now and see what will I pick today to build a product experience around, I'd still go and pick Hudi because Hudi already has most of these services. But it's a library. Hudi is a library. You need to adopt it, tweak it. So what we've learned from some of our initial
Starting point is 00:32:05 users that we're working with and everybody is that just by hiding a lot of configuration, just like we expose a lot of configuration, speaking for Vidi, we expose a lot of configuration, just like any database. You go to Oracle, you go to MySQL,
Starting point is 00:32:19 the point is to expose a lot of configurations. Administrators will pick it up over time and know what to do, right? I think we have to simplify that. And for example, don't even show file sizes. Why you care about what the file size should be, right? Right now we ask people to go hand tune that, right? Hand tune the panels of an office.
Starting point is 00:32:41 So in our experience, it's a whole bunch of something like auto-tuning and intelligent configuration management and sort of like, you know, that I think is the first ingredient to get there. And the second thing where specifically talking about one-hours where we back ourselves more is our team actually has operational experience, not just, you know, build it, right? Like I've been on call for 250 petabyte data lake, and I had to like wake up in the middle of the night and recover a table, like do this kind of thing. So that's the second part, which is usually in data lake so far, we've not, you know, the user managed the tables, right?
Starting point is 00:33:24 And if you look at Snowflake or BigQuery, if a table is corrupted, user has no control whatsoever. Like in you, like some Snowflake engineer
Starting point is 00:33:32 didn't, you know, want to values, redshift values, you just have to figure out what's going on.
Starting point is 00:33:36 So that's the second part, like building enough manageability, operability to this product where,
Starting point is 00:33:44 you know, you're taking control away from the user in the name of simplicity and getting started quickly, but we need to now build all the operational kind of chops to be able to pull this part off. I think this is the hardest, hardest part. I think Jay Krebs, the conference CEO, has a thing where he says, you know, in ranking,
Starting point is 00:34:06 programming a theory, what's much harder is debugging that thing. What's much, much more harder is operating that piece of code. Right? Yeah. And I think this is where
Starting point is 00:34:16 my disappointment with all of the marketing that happens in the data lake land is that we focus very little on these operational aspects. It's all super DIY. And then later we also complain that,
Starting point is 00:34:27 oh, it's not standardized, blah, blah, blah, right? We have to build these statistics. I don't know how to explain it, but we built. That's what the warehouse has done really well. It's really admirable for what they've done in the last 10 years. They've actually accomplished a lot. Absolutely, absolutely.
Starting point is 00:34:45 Cool. So we start with like auto-tuning and management of configuration in general, like simplifying, let's say the whole like setup process for users there. And also introducing, let's say like abstracting the operations, right? Like making, giving, let's say like a cloud experience, right? Like there is a team that will stay awake, like to take care of things when they go wrong, like instead of having to build your own team, like to do that. And especially like for so complicated technologies like these,
Starting point is 00:35:17 where it's not that easy like to know exactly what might go wrong. So I think it makes like total sense. And my next question is, I can understand like, well, I think one of the benefits that let's say the cloud warehouse and all the vertical solutions in general that they have is that when something is vertical and you have like complete knowledge and control over all the components, right? You can control the experience exactly as you want, right? Like, you know exactly, like, how it's going to be experienced by the user.
Starting point is 00:35:49 At the same time, you have, like, much more control over what kind of optimizations to do, right? And we see that, like, with things like BigQuery and, like, Snowflake. So when it comes, and there are, like, actually two questions. One has to do with, like, the experience, but let's keep it, like, for next. start like with performance right so these systems like when you are like vertically integrate all the components you can go there and be like okay i'm going to
Starting point is 00:36:15 build something like photon and have like on top of that like these changes that need to happen like on the different components and make sure that, like, I squeezed out every piece of, like, every little of, like, performance out there. Where do we stand with the lake house architecture when it comes, like, to performance compared to the solutions like Snowflake or even like, yeah, like BigQuery? Yeah. It's a great question. So I, first for once,
Starting point is 00:36:45 I feel like things like Photon could be built on top of, like at the end of the day, going back to my previous statement, on the read side, right? Even with the lake house, these transactional formats, on the read side,
Starting point is 00:36:59 all that happens is you are getting some statistics and planning some query. From there on, your query performance is dictated by things like that. I feel like, I think already we've proven that this can be built independent of the
Starting point is 00:37:15 in a very decoupled way. And then if you now take things like all the table services and all these things that we talked about, they're pretty decoupled from how the query is processed. You mean. You cluster it and then they'll read it. That's it. So in that sense, I don't see a technical limitation to optimizing the stack sort of vertically like how we do it. But I do see that you know there are different companies here there is no single company right like even even for us we routinely work with different query engines there are
Starting point is 00:37:50 different projects you know each you know we take like months to like land certain things and like you know it can be like a lot of different friction points in terms of how quickly we can move forward but i think the performance itself comes from the engine. A lot from the engine, I say. At least for interactive query performance, a lot of it comes from the engine. With
Starting point is 00:38:15 a better integration with things like Hudi, Open-DOS, Hudi, or even one-house services, we can probably match the experience where you go and maybe configure clustering in one house while you go query on like, you know, Presto Trino or something, right? Like that kind of experience, you can product experience, you can build, but I think there are significant cross organizational boundaries and working across companies.
Starting point is 00:38:41 I think it's gonna slow us down there, I feel. Yeah, yeah, absolutely. And just to reiterate on what you said, there's no, let's say, interesting technical reason to have data lakes slower than a data warehouse. But when you build the product and how the user experiences the product, things get a little bit more complicated. Just to give you an example, let's say I have a setup with Hoodie and Trino or Presto, I'm running my queries and I see a performance regression at some point happening somewhere, right? What do I do?
Starting point is 00:39:19 Who do I reach out to debug this thing, figure it out? Should I come to you on one house or should I go like to the Trino community and ask there, or is it my data engineer, like doing something stupid out there, like things while when I do that, like with Snowflake or okay, Google is notoriously known for its support, so forget Google. Let's not keep on Snowflake. But at least at Snowflake, I'll open a ticket and I'll be like, guys, something goes wrong here.
Starting point is 00:39:51 Figure it out, right? And that's the other part of the question, which comes to the user experience. So how we can also, as vendors, that we believe in these unbundled, let's say, DB system of the lake house, how we can deliver at the end the same experience to the user, or at least like a similar experience to the user. Yeah. I think that right now there's a lot of fractures. First of all, there is no standard like you know like apis right even i think we attempted this with even presto which is even the hive connector right we tried to
Starting point is 00:40:30 introduce abstraction so that you know okay you just like change the way you are getting file listing you are listing the thing so we like you know there aren't even like good abstractions points right now and across these different engines for us to test and guard i think as these get more standardized right all these three transaction formats have their own connectors now uh right at least your prs out of like these landed i think starting with even basic stuff investing in some basic things, having between these companies, testing them, I think we have some very basic cash difficult, I would say.
Starting point is 00:41:10 Longer term, it's a pretty interesting point that you bring up. I think end of the day, there will be some level of trade-off for the user where they are consciously choosing I want the freedom and the flexibility. So yeah yeah when you
Starting point is 00:41:26 go for that then you have to pick and choose right it's like buying Android versus iPhone like you're sure like you know the OS you know the experience that you're getting but it's going to vary differently based on the underlying hardware and the manufacturer and like blah blah blah so you kind of have to go through that I feel like even
Starting point is 00:41:41 with that you know once we iron out the basics, I think it'll get to a manageable level. I don't think it'll be, it'll always be one level, it'll always be a problem,
Starting point is 00:41:51 I think. It won't be completely eliminated, but I think that's where I feel the lake, you know, the lake storage
Starting point is 00:42:01 players and the query engine players have to like work much more closely together than what's going on today. Yeah, yeah. No, 100%.
Starting point is 00:42:10 I mean, I agree with what... I mean, obviously, there's a lot of space for improvements out there. Okay, like all the vendors right now, especially when it comes to vendors like OneHouse, because, okay, you just start the business right like it's like it's one thing to have like a open source project and it's a completely different thing like to build like a cloud product on top of that right like it's there's a lot to be discovered there and i would also add and that's like something that like i really admire like to people like you that, okay, you are also starting something that it's like completely new, right?
Starting point is 00:42:50 Like in terms of as a product category, right? So there's a lot of learning from both sides, both from the customer side and also like from the vendor side. And this takes time. It's very risky, but potentially also like super rewarding. But there's always going to be, I think like a trade off at the end. It's not like, okay, we're going to have, let's say the Microsoft Access experience with like a lake house architecture, right? Like there's going to be like some kind of trade off there.
Starting point is 00:43:22 Okay. So let me ask like a question that is like also like a little bit of like a personal like question that I have. So let's say right now I want like to start building a lake house, right? One of the things that each one of the first likes, actually the first service that you mentioned, like the Ingentium service, right, like somehow you have like to push data into there. Uh, how do I do it together today with Houdi?
Starting point is 00:43:49 Like the only way that I can do that is like through this ingestion loader that you have built. There are other ways, like how, how, how, how does this work? Yeah, I mean, it's pretty simple actually. You go to docs and if you go to, you know, how to streaming injection, it's a single command. It has like an umpteen set of parameters. You say what your source is, what your target is,
Starting point is 00:44:14 configure a whole bunch, and that single Spark submit command actually can ingest from Kafka, it can ingest from JDBC sources, it can ingest from S3 kind of like event streams. And then it can also do things like it can configure clustering, cleaning, compaction, all of the stuff that I talked about, right? It's almost like running a database on itself. So if you just run that one command, it will internally, be a spark it spins of a spark job and then within that it will self-manage all the resourcing that we need for ingestion if you're not ingesting it's going to do clustering if it's not clustering it's going to do compaction it even has resource
Starting point is 00:44:57 management so we made it like super super easy and in the front so we actually have built a very similar thing at Uber. And I actually started writing this tool as a, you know, like a replacement for it in open source. But I think it's gotten so popular that it's used in many, many companies in production.
Starting point is 00:45:17 Right. A lot of those companies, this is the main thing. This is like the main ingest service. So yeah, that's what I'm trying to say. Like as a project, we've tried to make it very easy for the users because we, you know,
Starting point is 00:45:33 suffered through all this integration pains when we had to build our own data-making Uber. But in spite of that, I feel still the operational world is too high. I mean, I don't know. Like that's what one of us is trying to solve. But Hudi already makes all this very easy for me. Okay. still the operational order is too high. I mean, I don't know. Like that's what One House is trying to solve. But yeah, Hudi already makes
Starting point is 00:45:46 all this very easy for me. Okay. So how this would work like with Open House? Like what's the difference there? One House. One House.
Starting point is 00:45:56 Yeah. Open is so much different. Yeah. I mean, so the thing is we're not forking Hudi. We don't have a Hudi fork.
Starting point is 00:46:05 We are, so if you look at how, even let's say, click label is US ingestion or something like that, usually we'll have a blog which describes an end-to-end architecture, right? We are platformizing that end-to-end architecture on top of Hudi. It's almost like we're automating the blog that we wrote. You can still run if you want like that that's actually something that people really like which is they can like a lot of our early users that like pilots that you're working with that they are happy that they can
Starting point is 00:46:35 start with something managed so they don't have like a long lead time for the latest and but if for whatever reason they don't like us right and? And they can just turn around and all these services are in open source. They can just buy support from AWS and that's it, right? They can move off of OneNote as well. Most open source GTMs are built, you know, good for it. Strategies are, okay, it is an open source project. We have to place a kernel on top of it. I think we are trying something new where we are, we have an price kind of on top of it. I think we are trying something new where
Starting point is 00:47:05 we are, we have an open project. We're trying to achieve high data as much as possible within our product because we want to up-level
Starting point is 00:47:12 the experience. Then, if you get really familiar for whatever reason, we're not adding enough value to your,
Starting point is 00:47:20 you should be able to move off the data as yours, right? I think this is the fundamental problem. Now you contrast it with the warehouse move. This is a fundamental problem.
Starting point is 00:47:28 Once you're stuck in the warehouse, you have to migrate the data, right? If you're unhappy with it, there's nothing you can do about it. So that is actually what we want to change. And like you're saying, it's as a product and also as an architecture and a category it's something pretty like new and experimental architecture technology wise sure it's pretty proven out right your
Starting point is 00:47:54 earlier question around this unbundled stack see whether we like it or not whether one house exists or not that's how people are using the lake even before me, right? You are using Parquet and using Presto or Spark or Hive. That's literally how we started at Uber as well.
Starting point is 00:48:12 So this multiple engine on an open format kind of thing already existed before. I think all we're trying to do is build a path
Starting point is 00:48:21 for users to get started on sooner and hopefully as a company, as a product, we add enough value that we can retain users. Okay. Yeah, yeah. Okay, I'm going to make it a little bit harder for you, okay?
Starting point is 00:48:33 Okay. I'm sure you like challenging. So let's say I'm a data engineer who is coming from the modern data stack environment where I'm used to use,'s say Snowflake and a tool like Airbyte or Fivetran, right? Where I know that I'm going to connect like a source, the data are going to be loaded on S3, then like a copy command is going to be executed on Snowflake.
Starting point is 00:49:00 The data will get imported into Snowflake table format. And then I'm able, like, to create that. And all these things happen, like, inside transactions. So nothing is going to get corrupted, right? Sure. Sure. Cool. And now my boss says, go build a data lake. And, okay, like, we need, like, to expose it to the rest of the organization.
Starting point is 00:49:20 So it should be, like, feel like the same, let's say as a lake. Yeah. Okay. And I come to one house, right? How like, think of me as like, I have this experience in my mind, right? Like that's like the journey that I think when it comes to loading data and like this whole ELT thing. Is this something that like I can do in a lake house in general, first of all?
Starting point is 00:49:43 And second of all, like something that even if we cannot do today, let's say, I will be able to do that in the future with one house is how you think of things and how the experience should be. Yeah, I think, first of all, you think the experience should be similar to what you're used to in an existing marine service right but how how we accomplish that in in one house can be through you know like us having more upstream partnerships right like for example my my previous employer at consulate i think a lot of scenarios right when people are at the point of thinking about daylinks and everything they're
Starting point is 00:50:22 also thinking about okay i want to open up event streams to my company, right? I want to open up for stream processing. So most of the things they would, like, you know, they would naturally do something to get, extract all this data into, you know, like a big event bus, like Kafka
Starting point is 00:50:39 or Pulsar or one of these things, right? And the minute you get it into that, then it's pretty etc. And the minute you get into that, then it's pretty simple. So you can use ideally, one of us can try the same experience whether we run it
Starting point is 00:50:55 or whether we partner. But I'm saying we want to right now, we would recommend for people to rethink how they're doing data streams right like okay the CDC that you're
Starting point is 00:51:07 capturing from 5K can you t that into elastic search no you can't right you can only send it to one
Starting point is 00:51:13 point which is snowflake right that's not the forget who the data likes everything that's not the you know that's not what people
Starting point is 00:51:21 ultimately build as the data architecture right and I'm sure you're familiar with like data meshes and we live in a world where there are like
Starting point is 00:51:29 there's enough data that there's like so many specialized stores. So, I think that will make this move much easier
Starting point is 00:51:36 for us, I feel, for something like OneHouse. The move towards streaming data and as technology is like
Starting point is 00:51:44 very well positioned to be the absorb all the streaming data. And as technology, Hudi is very well positioned to absorb all the streaming data and integrate it very well. And, you know, one of just has to focus on that problem. Yeah. Yeah, what I keep like
Starting point is 00:51:54 from what you say is that, yeah, like things, when it comes, let's say, to the lake house, or it will get like, let's say, closer to what people are used to use from like cloud warehouses.
Starting point is 00:52:07 But there's also education that needs to happen, people to understand that there are also different ways that we can do things. And there's value in that. It's not like you lose just how easy it is. It's something you also gain, let's say, flexibility and opportunities like to optimize your infrastructure and do more things with your data at the end, right?
Starting point is 00:52:30 So I also feel like when people, when users, like the data that you talked about is usually at the point where they're building a data lake for the company, they actually have a business problem to solve already.
Starting point is 00:52:42 I think they'll mostly look at it from that lens. For example, it can be stream processing. Data democratization is what I just talked about. It could be just that, hey, I'm building a new data engineering team or a data science team. And there is all this event logs and data that I can't even ingest into the warehouse anymore.
Starting point is 00:53:01 It's not like it's replicating the same data that exists in a warehouse outside, right? I believe a lot more data sits out there on some S3 buckets and cloud storage buckets unmanaged completely. So I think there's a vast amount of data that is not even getting
Starting point is 00:53:19 into warehouses. And now if you now think about it, right, from this lens, I don't think the existing managed pipe solutions are operating at that scale, right? They're not operating
Starting point is 00:53:31 at event scale. Like at Uber, we do like, you know, tens of billions of trips a day. If we did that, then we are ingesting
Starting point is 00:53:37 hundreds or, you know, like a billion events per day. Like there's a scale difference in the amount of events and data volume. These are things
Starting point is 00:53:45 that we've done routinely in open source and we ourselves have actual hands-on experience building. So I feel technically scale-wise
Starting point is 00:53:56 it's a very different problem and when people consider it like they have one of those cost scale problems already and that will
Starting point is 00:54:04 motivate the experience that we build. But by and large, I think it'll be fine. Yeah, that's an excellent point. And it's like, it's a very fair point also, because yeah, like I'm giving an example, let's say, but like the example and like the behavior that let's say someone has with a product,
Starting point is 00:54:23 it cannot be taken out of context, right? Like there's also like the problems that someone's trying to has with a product, it cannot be taken out of context, right? Like there's also like the problems that someone's trying to solve. And you're absolutely right. Like when you reach the point where you need a data lake, there are reasons for that. It's not just because like you don't like snowflake, right? Last question. And then I give it like to Eric, because I completely monopolize the conversation. Although he's going to be very kind and be like, it was so enjoyable and like blah, blah, blah, all that stuff. So we have seen lately, both from Google with the big lake initiative that they announced at some point, but also with Snowflake with the support, both as external tapes with Iceberg, but also like as native format, we see the data warehouses are also making, let's say, a move towards more openness and embracing, let's say, the lake house
Starting point is 00:55:10 or data lake paradigm. How do you think that this is going to affect one house as a vendor in this space? And how do you think this is going to evolve as part of the data warehouse experience that we have seen so far in the cloud? Yeah. So I think let's take even the Snowflakes expansion and stuff. The key question I would ask is how do external tables actually perform? It's one thing to have an integration, but it's another thing. do they perform as well as native tables?
Starting point is 00:55:48 Right. Because internally, you might've read the "Big Metadata" paper before, there are a lot of metadata optimizations. The problems that transactional formats solve have been solved in a very different way inside warehouses. So my feeling is this is a nice thing where you can actually access
Starting point is 00:56:09 data, but by and large, if people want something performance critical, they're going to move that into a native table in the warehouse. That's what I think. I think it's very early. Right now, it feels like everybody wants to do something against Databricks.
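To make the external-versus-native distinction he's drawing concrete, here's a minimal sketch using the snowflake-connector-python package. Every name in it (the @lake_stage stage, the tables, the file layout behind the stage) is hypothetical, and the DDL is simplified (real external tables usually declare typed columns), so read it as an illustration of the two access paths, not a recommended setup.

```python
# Hypothetical sketch: the same lake data exposed two ways in Snowflake.
# Assumes a stage named @lake_stage already points at the Parquet files
# on cloud storage; all object names here are made up for illustration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# External table: queries scan Parquet straight off object storage, so
# the warehouse's native metadata and clustering optimizations don't apply.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE trips_ext
      WITH LOCATION = @lake_stage/trips/
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = TRUE
""")

# Native table: copy the data into Snowflake's internal format, which is
# the "move it into a native table" step he expects for hot query paths.
cur.execute("CREATE OR REPLACE TABLE trips_native AS SELECT * FROM trips_ext")
```

The performance question he raises sits exactly between those two statements: the external table is open and cheap to point at, while the native copy gets the warehouse's full optimizer treatment.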
Starting point is 00:56:27 And everybody wants to be able to say, you know, I have a lakehouse. That's how it feels to me. So we'll see. Of course, you know, this can also evolve over time. At the end of the day, warehouses are still used for traditional analytics use cases, right? There's much more beyond that which can be unlocked in the kind of model that we've been discussing so far. So it'll be interesting to see how broad the warehouses want to
Starting point is 00:56:57 make this. I'm not saying that won't happen, but historically, you know, if you project it out, it may or may not happen. Right? Yeah. The second thing here is, overall, let's look at this architecture now. Let's say, okay, we have a common format and all the engines read and write from that, like the same table is written from both Snowflake and BigQuery. I haven't seen a use case like that.
Starting point is 00:57:31 Why would you do external tables? You do external tables only because you want to do some Spark processing on the same data that you want to also query. Then, if Spark's performance is good enough, why not just pick Spark? I just don't see clarity in these individual usage cases to the level of, oh, for BI, always use X.
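To ground that pattern, here's a hedged sketch of the Spark side: reading the same lake table that a warehouse might expose as an external table, and doing the heavy processing in PySpark. The S3 path and column names are made up, and it assumes a Spark session launched with the Apache Hudi Spark bundle on the classpath (for example via --packages), so treat it as a sketch rather than a drop-in job.

```python
# Hypothetical sketch: Spark-side ETL over the same lake table that a
# warehouse queries via external tables. The path and column names are
# made up; the session needs the Hudi Spark bundle loaded to read "hudi".
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("lake-etl")
    # Kryo serialization is the commonly recommended setting for Hudi jobs.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the Hudi table directly off object storage...
trips = spark.read.format("hudi").load("s3://my-lake/trips/")

# ...and run the expensive aggregation in Spark instead of the warehouse.
daily = trips.groupBy("trip_date").agg(F.count("*").alias("num_trips"))
daily.write.mode("overwrite").parquet("s3://my-lake/marts/daily_trips/")
```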
Starting point is 00:57:55 I don't see that kind of thing. I see way more users caring about: I want to actually keep my data more future-proof. Because three or four years ago, nobody talked about Snowflake as the de facto warehouse that you dump everything into. It was a breakthrough. So maybe in the next three years, it's something else. So I just want to keep
Starting point is 00:58:18 my data future-proof. This data will outlive the vendors and query engines. I see far more companies worried about that perspective than about, you know, having this thin layer that I can read from many engines. That makes sense. Yeah, I mean, it's still early.
Starting point is 00:58:36 And I think there are a couple of interesting years ahead of us, at least. All this innovation and product development is hopefully going to be beneficial for the customer, right? And from my point of view, putting on my entrepreneurial hat: being a new vendor in this space and seeing these much bigger, well-established vendors investing towards something that I'm
Starting point is 00:59:13 also doing, it's good. It means that there is a market, there is appetite in the market, for that kind of stuff. Now, who's going to win? I usually say that it's the smaller vendors that win with that kind of innovation. Yeah. But we'll see.
Starting point is 00:59:29 It's going to be interesting. Yeah. To that point, actually, quickly: if you think about it, who writes the code in these systems? Go back to who's pushing the transactional formats forward. I think that matters more, right?
Starting point is 00:59:47 Because those people are the ones that are closer to the problems, closer to the technology. And that's kind of why I think it shows up in the smaller vendors winning: they're much, much closer, and it's the only thing that they focus on, right? And overall, it's great. By the way, don't get me wrong, it's absolutely fantastic that, you know, warehouses are now taking external tables super
Starting point is 01:00:10 seriously. Right. I think Redshift deserves a lot of credit for this. I'm not seeing anybody give them credit, but Redshift Spectrum added Hudi support, like, you know, two years ago, and they deserve a lot of credit for that. Yeah, I agree with you. It's a little bit of a shame, because there is some kind of perception that Redshift is, let's say, dead, in a way. Although Redshift was the first cloud data warehouse out there
Starting point is 01:00:41 and the guys there keep building amazing technology. So people should keep paying attention to them; they are doing a great job. Yeah. Yeah. And I think, yeah, I mean, this is marketing, right? This is where marketing meets reality.
Starting point is 01:00:57 I think, as a founder now, I also have the job of reminding my team that, hey, what's marketing and what's real, that's a pretty blurry line. But yeah, I mean, Redshift, I think, makes maybe a little bit more, I don't know, but I think it's in the same
Starting point is 01:01:18 ballpark as some of the more successful players that we talk about, right? And they have tens of thousands of customers. Yeah, this is where, for us, we feel like we've not... So I would say we're at a years-long
Starting point is 01:01:33 disadvantage, because EMR has been around a long time and a lot of the AWS services are deeply integrated. We didn't start Onehouse back then, so we're starting now, when it's very hard to shine under the marketing spotlight. But I think I've seen enough systems come and go
Starting point is 01:01:59 that I know that, at the end of the day, the technology has to work and somebody has to operate the system through all the customer problems. So, you know, I think we're pretty hopeful for both open data and lakehouses at Onehouse. Awesome. Awesome. So Eric, as you can see, you are the king. Without you deciding that now we have to take the lakehouse seriously, nothing happens out there.
Starting point is 01:02:25 So you, as the marketer of this group of people, we want to hear from you. I was laughing about you saying, you know, the line between marketing and the product reality can be blurry. And that is certainly true. One last question for you, and I'm thinking about someone who's maybe thinking through the lakehouse on a practical level, right? So you talked about the genesis of Hudi: you had real-time needs at an immense scale. And then, you know, you mentioned you have the bottom half of the warehouse and you can run Spark on it, or, like, Presto or Trino, et cetera. A lot of that tooling, I think to a lot of our listeners,
Starting point is 01:03:25 probably at least hints at scale problems, right? Like, a lot of those technologies developed because of scale problems. One interesting thing, when we talked with Kyle from your team, was that he said his opinion has been changing on the lakehouse as practical for companies that aren't at, you know, Uber-esque scale. I'd just love your thoughts on that. And maybe I can frame it in the form of a slightly unfair question: do you think that the lakehouse is at a point where a very forward-thinking data team could say, we're just going to skip the data warehouse and go straight for the lakehouse, and we're just going to use that instead of the traditional split, you know, data lake for object storage
Starting point is 01:04:30 and then data warehouse for, you know, all the transactional, day-to-day practical stuff? Yeah, that's a great question, actually. I think it's a totally fair question, and I think we are probably a year or so from that. And I cite mostly all the DIY stuff that you need to do. Like, for example, somebody has to understand Debezium, Postgres, Kafka,
Starting point is 01:04:53 just to build a simple Postgres-to-lake ingestion thing. So there's a significant investment. And I've spoken to smaller companies who basically know that the warehouse is going to get expensive at scale over time, but today it costs way less than the DIY route. So that is the problem where most people start, and that's where we are starting, right? With that as a product and from that lens.
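To give a feel for the DIY surface area he's describing, here's a minimal sketch of the last leg of that pipeline: a PySpark structured streaming job that picks Debezium change events off Kafka and upserts them into a Hudi table on object storage. The topic, schema, and paths are all hypothetical, and real Debezium envelopes need more unwrapping (before/after images, deletes) than shown, so this is a sketch of the shape of the work, not a complete connector.

```python
# Hypothetical sketch of the Kafka -> Hudi leg of a Debezium CDC pipeline.
# Assumes the Hudi Spark bundle is on the classpath and that the Debezium
# envelope has been flattened upstream; all names are made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cdc-to-lake").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the per-table CDC topic that Debezium publishes to.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "pg.public.orders")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("row"))
    .select("row.*")
)

# Upsert into a Hudi table keyed on order_id, deduplicated by updated_at.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(events.writeStream.format("hudi")
    .options(**hudi_options)
    .option("checkpointLocation", "s3://my-lake/checkpoints/orders/")
    .outputMode("append")
    .start("s3://my-lake/orders/"))
```

Each piece here (Debezium topics, JSON schemas, Hudi write configs, checkpointing) is something somebody on the team has to understand and operate, which is the investment he's pointing at.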
Starting point is 01:05:20 The technology, if you look at it cost-performance-wise, in the grand scheme of things, the lake is much, much cheaper for running any large data processing. The way I look at the world today is, warehouses are, in my opinion, still best in class when it comes to interactive query performance, though the work that's going into things like Presto and Trino is changing all that, right? And then when you look at data processing, ETLs, that is where it gets really expensive. The flip side of scale is cost, right? If you're running at large scale, it means also large cost.
Starting point is 01:05:56 So even moderate-scale stuff, and that's probably what Kyle hinted at, even simple stuff, right? Instead of spending 100,000 bucks, you can probably spend $30k on a lake, as long as a similar kind of experience existed. I think that
Starting point is 01:06:12 is opening up, and this is not possible without the cloud. The cloud is what's behind the proliferation of all these different awesome engines, and that's actually acting as a good catalyst driving that. So I don't see this
Starting point is 01:06:28 as just a scale problem. And, an interesting note: when we started Hudi at Uber, for a year and a half or so, that's why you don't see a launch or anything, it was an interesting,
Starting point is 01:06:41 nerdy project that engineers at Uber built, because not a lot of people had that kind of scale with updates and that kind of thing back then. But right now, with time and the data volume exploding, what we see routinely is, I'm surprised at the scale that much smaller companies have. Like, oh wow, okay, you have a two-terabyte partition? I did not expect that. So there is also that. And my view has been evolving for a while too. I myself, to be honest, thought like that, which is, oh, it's a high-scale problem. But then, when I saw the scale that the community was doing things at, that changed me.
Starting point is 01:07:27 I just literally met an airline data tracking company. They take in some tens of terabytes every day. You wouldn't have even heard of them; they track all the flight data across all the airlines in the U.S. And they're able to get something up and running, and they have that. They can't send this data into a warehouse, so they have a lake-based solution. So there is also that organic data volume growth that is pushing people more towards
Starting point is 01:07:54 this. Yeah, super interesting. I can absolutely see that. Let's say, in a year's time, right, you have people who have been working at a larger-scale company, and maybe they adopt some sort of lakehouse-flavored technology. Then they go to work for a smaller company and they're like, hey, we can actually do this, instead of waiting until the bill gets to a hundred grand and then having to do, you know, sort of a complete
Starting point is 01:08:22 replatforming. Super interesting. Yeah, it's going to change a lot in the next three, four years. And I think we have to get to a point where it's feasible, right? With no cost, no trade-offs to your timelines, you can get started with this thing. Yep. Yeah, it makes total sense.
Starting point is 01:08:42 Awesome. Well, I think we went long, but that's because Brooks let me record this time, so we get to break the rules, which is always great. Vinoth, this has been such a great conversation. We learned a ton as always, and we'll need to have you back for a third-time's-a-charm round on the Data Stack Show. Yeah, absolutely. Love to. And thanks for all the awesome questions. That's one of the things that I really enjoy: the quality of the questions. It gets me pushing on the hard stuff. So yeah, this was fun. That would definitely be good.
Starting point is 01:09:18 Well, that's a very high compliment. So thank you so much, and we'll talk soon. I love talking with that guy, Kostas. He just has this really incredible ability to answer questions with a high level of detail but keep the explanation really concise, which is a really challenging skill that I have a lot to learn from. I think probably one of the takeaways for me was the conversation right at the end, where he talked about how the market is changing and when he thinks data lakehouse technology will come down-market and potentially even be adopted instead of a warehouse, sort of as the
Starting point is 01:10:07 first major operational data store in a company, which is really interesting to think about. But at the same time, his point was, well, four years ago, no one thought about Snowflake as like, okay, you need a warehouse, you just stand up Snowflake, right? And so he said in another three years, who knows what could happen? So that was just really interesting. And I know I'll be thinking about that a lot this week. How about you? Yeah, I agree with you. That was like a very interesting point.
Starting point is 01:10:35 Yeah, I agree with you. That was a very interesting point, and it also remains to be seen how exactly it's going to happen and what will happen. There are many different points I'll keep, but one of the things that I really enjoyed was the conversation we had about how a lakehouse, as an experience, in terms of performance and a couple of other parameters we put out there,
Starting point is 01:11:00 doesn't yet match the experience that we have with data warehouses. And I like how pragmatic he was about that, saying, okay, obviously things can improve a lot, but there are always trade-offs, right? You're not going to have, let's say, the exact same experience, but at the same time you are going to have more flexibility, more scalability, and capabilities that you cannot have right now, and probably never will have, with a vertically integrated solution like a cloud data warehouse. We'll see. I mean, it's too early with all these products,
Starting point is 01:11:42 but it's always great to talk with people like him because he gives a very accurate prediction of the future. I agree. All right. Well, many, many more great guests coming up. Subscribe if you haven't, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app
Starting point is 01:12:03 to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
