The Data Stack Show - 103: Everyone Is Invited to the Data Lakehouse with Kyle Weller of Onehouse.ai

Episode Date: September 7, 2022

Highlights from this week's conversation include:
Kyle's background and career journey (2:38)
Unique challenges in building data engineering products (9:33)
The problem set Databricks resolves (13:46)
About Onehouse (17:15)
From Microsoft to Onehouse (20:59)
Why there's so much distance between data powers (24:45)
Why the data lake is not enough (30:15)
Who should have a lake house (39:03)
Why we have all three data platforms (43:53)
How to step into the data lake house world (49:48)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Kostas, today we are talking with Kyle from OneHouse. Now, we've actually heard about OneHouse before, although we didn't know the name. We had Vinoth, one of the creators of Apache Hudi, on the show, one of our
Starting point is 00:00:41 best performing episodes ever. And he told us, I think before the show, and he may have mentioned on the show, that he was working on something based on Hudi, but they were in stealth mode. And now they are not in stealth mode anymore. The company is called OneHouse. And like I said, it's built on Hudi. And we were talking with Kyle from OneHouse.
Starting point is 00:01:02 He's their head of product. And he has a really interesting background. Spent a ton of time at Microsoft and is Kyle from OneHouse. He's their head of product and he has a really interesting background. Spent a ton of time at Microsoft and is now building OneHouse along with Vinoth. You know, I'm really interested. This isn't going to surprise you at all. I do have a lot of questions about OneHouse, but I kind of want to hear about his experience at Microsoft
Starting point is 00:01:21 because he did a lot of things. He worked on, you know, the Office Suite. He worked on Bing. He worked on sort of the Siri, Siri-esque, I can't remember exactly what, Cortada, I think it is,
Starting point is 00:01:34 for Microsoft. And then his year Databricks. And so I want to hear about his experience at Microsoft. We don't talk a ton about Microsoft on the show, but they're a huge, huge company with tons of data products.
Starting point is 00:01:47 So that's what I'm going to ask about. How about you? Yeah. I mean, I have plenty of questions to ask about lake houses and the work that they're doing at one house. But one of the questions that I definitely want to ask you is like, how it seems to go from such a big company, like Microsoft things like stage startup. Kyle Pratt- I'm here too.
Starting point is 00:02:08 It's seed stage startup. Yeah. So I'm very curious to see like how, how it feels and how it's going. So let's do it. Jason Cosperinou- Kyle, welcome to the Data Sack Show. We're so excited to have you. Kyle Pratt- Thanks Eric. Excited to be here. Jason Tucker- All right. Well, let's start where we always do. Give us your background and then tell us what you do today at OneHouse.
Starting point is 00:02:33 Kyle Pratt- Yeah, great. Thanks. Yeah, I've been in the data space about nine years and I started that journey at Microsoft actually. And in that time I built data platforms, data engineering platforms for large-scale services like Office. I joined, actually, Microsoft in some interesting times when they first released Windows 8. Oh, wow. Yeah, but we were also developing the new O365 apps, mobile apps. And there I was tasked for building a new telemetry system for all of these office applications.
Starting point is 00:03:09 So it was a really fun time. Back in 2013, we had an internal tool built on Hadoop, and we actually had some kind of transactional data-like components in there. A project called Project Baja, if anyone's interested to search these things back then in 2013, it was pretty advanced stuff. But yeah, I faced some interesting data engineering challenges, building data platforms for Office, then went over and did this for Bing, Microsoft Search Engine. And that was, of course, a much larger scale, more mature kind of data platform. And from there, I wanted to consciously drive my career more into like more product building. And so that first step from there
Starting point is 00:03:54 was going over to Cortana. This is Microsoft's digital assistant like Siri. And there I was driving product growth strategy and measurement strategy. So a lot of data science work and defining from a business perspective what does success mean for the product and tracing that back down into how we measure the product and track its growth. And then I switched over to building like true data products themselves. I went over to the Azure machine learning world and worked on some interesting components like Python and R execution inside SQL server. So we had these unique ways that we could do remote compute execution. The goal was for the, think of the persona of data scientists who like to be in Jupyter notebooks and writing all their machine learning development inside Jupyter
Starting point is 00:04:43 notebooks. But if they had data inside SQL server, usually they'd take like a dump, a CSV dump or something and pull it in, sample the data and run with it that way. We made ways where people can stay in the IDE of their choice and then send remote compute execution, the code that they write in Python or R, into SQL Server. So it would process the data there in SQL Server, return results back to their notebooks and things like that. That's super interesting. Yeah, it's pretty fun.
Starting point is 00:05:11 That was near like 2017. Then shortly after that, I remember being, I distinctly remember this experience. I was at the Microsoft Build Conference, which was in 2018. And I was at the booth. It was a shared booth for Azure Machine Learning, Azure Data kind of components. And Microsoft first announced the preview for Azure Databricks. If folks don't know, Microsoft and Databricks have a special relationship where we would take Databricks' native product and build it into deeper integrations.
Starting point is 00:05:48 It's the Azure Backbone Stack. And then we'd actually market and sell this as a Microsoft first-party service. We'd call it Microsoft Azure Databricks. And of course, behind the scenes, it's Microsoft and Databricks building this together. And I was still on the Azure Machine Learning team, but we announced that preview for Azure Databricks and the booth was flooded, completely flooded with people that wanted to talk about it. I get really funny questions like, like, what is, what is a Databrick? And just like really funny questions people would come ask at the booth. I'm like, like, where, where's the team that we had? Like one person that helped like bootstrap this partnership and
Starting point is 00:06:27 integration, these kinds of things. But then shortly we were going, we were preparing to have like the GA release of the Azure Databricks product and needed to staff up a bigger team. And so I jumped onto that team and was there from when we first launched Azure Databricks GA service at Microsoft. And that was a really fun ride. Was a product manager there, product lead. And I got to see firsthand the growth of this service from making zero dollars to become
Starting point is 00:06:58 like the fastest growing analytic service on Azure. We had some really fun things that we went through with scale, scale challenges and outreach challenges and all kinds of things that it was fun to build on a fast paced, fast growing product. Then yes. So I was on the front lines of this emergence of the, the lake house category that, that Databricks calls it here,
Starting point is 00:07:23 the lake house and help you know hundreds of different organizations enterprises modernize their data stacks their data architecture onto the the lake house architecture and i was deep deep in that domain and then i bumped into for north chandar who i know has been on your show in in the past. And he's the creator of Apache Hudi. So if folks on the show know that Databricks has this open source project called Delta Lake, very successful, amazing, amazing product service. And it's pretty comparable
Starting point is 00:07:57 to what Apache Hudi is as well. And so, of course, I'd heard about Hudi. And so I was interested to talk to Vinod. I learned about what his vision was for what he wanted to do and build a company around Apache Hudi as well. And I decided to jump ship and that's where I'm at here at OneHouse Space.
Starting point is 00:08:15 I'm head of product at OneHouse. We just emerged out of stealth about three months ago and we're building Fast and Furious. Awesome. Okay, Kyle, so many questions, especially around Azure Databricks, but I actually want to go back because you mentioned something
Starting point is 00:08:31 about some of the work you did in the context of Office at Microsoft. And what I think is interesting about that is the time when you were there, you know, Office is a super interesting, you know, one of the most arguably influential, you know, software space in the entire world. Excel is the most used business application in the world.
Starting point is 00:08:52 But correct me if I'm wrong, but during the time when you were there, there was this big push to get a lot of the sort of, you know, sort of locally run software connected to the larger Microsoft ecosystem online, including sort of the Microsoft accounts. And you mentioned the word telemetry. And so I can only imagine that there were some really unique challenges sort of crossing the chasm of locally run software connecting these online accounts. What were some of the unique things that you experienced trying to build data engineering products and workflows around that? Yeah, awesome. That's a really good question. I'm glad you picked up on that as well. It was an interesting time. Looking back on it, I was new in my career and didn't have a lot of perspective on history and data and the evolution, these kind of things. But I was in the thick of it right then. Like you you said the family of office products before that were all like local installed like there was no telemetry
Starting point is 00:09:50 basically at all i think there was if i remember right i think there was like a pop-up that would say when it crashed do you want to send this diagnostic log to microsoft or something like that yeah but there's basically like no no telemetry on these things. And so with the move to more, like O365 was that move to subscription-based services. So by default, you have an online subscription. And so there's things that authenticate and connect it to the internet and things like that. And we also developed the web apps at this time. We developed the mobile apps at that time. This is like 2013, 2014,
Starting point is 00:10:27 all these cool products were coming out. And so there we were designing a new telemetry system from the ground up. Everything from like even the instrumentation SDKs. So identifying, you know, of course, office products, if you look at them, they have a common like shared ribbon on top. And so we developed these SDKs that would have have a common like shared ribbon on top and so we we developed
Starting point is 00:10:46 these sdks that would have a lot of like shared instrumentation and telemetry inside there and then we had to devise all of the engineering platforms for like ingesting that data bringing it into the system dealing with a lot of late arriving data like you say devices aren't always online. Laptops are frequently offline, mobile apps, these things. So there were some interesting data challenges with late arriving data, Lambda architectures. We're building on Hadoop systems back then. We had an internal tool at Microsoft called Cosmos.
Starting point is 00:11:22 And being in the most Microsoft way, we had this special language that you could write. It's called Scope. You can look these things up online. And the Scope language is a mix between SQL and C Sharp. You could embed C Sharp inside your SQL. Oh, wow. And yes, then we developed these pipelines to process the data and then expose it back out through APIs to all the internal engineering teams that were like working on the products too, so that we could like measure the
Starting point is 00:11:53 health of the products, like availability, reliability, these kinds of things. But then also like business kind of metrics. And we had like click stream usage analysis and error trace built some systems that we struggled to get some of the neural time at least when i was there in 2013 ish we struggled out some of the neural time but it was certain certainly a a fun time of of change and and and revolution i felt like i was also on a team where it we didn't have a lot of captains that were like doing this for a second time or third time it felt likeains that were doing this for a second time or third time. It felt like we were all doing this for the first time.
Starting point is 00:12:29 And so it was really exciting ways that we had to think out of the box. And of course, we leaned on some other teams. Like I mentioned, I went over to the Beam team, the Microsoft Search Engine Beam after that. So of course, I would meet with them frequently and interview, hey, how did you design these things, these tables with thousands of columns and exabytes and petabytes of data inside these processing systems for Bing? So we learned a lot of things from that that we carried over to this new telemetry platform for Office. Super interesting. Thank you for indulging me just love hearing some of that the historical
Starting point is 00:13:07 sort of insider stories around some of the things that we're all familiar with okay let's fast forward to azure databricks so in a very short time it went from you know sort of a product announcement to being the fastest growing analytics product on Azure. So help us understand why did that happen, right? It sounds like there were a lot of users who just sort of said, yes, this solves such a painful issue for me. What was that sort of problem set? That's awesome.
Starting point is 00:13:48 Yeah, I think it breaks down into a few different categories. I think first off, we should give all the credit and kudos to our Databricks partners who built the service and they had it already available on AWS. And so they were bringing existing product that's proven product market fit, everything like that. And of course it was the like best spark experience out on the market there that people enjoyed on, on AWS already. And we brought it to Azure for the first. So that that's one factor with that. Like, Hey, there's, it's a product that had legs that people kind of knew about as well, that, that we came and launched on the Azure platform and, you know, then
Starting point is 00:14:24 we married up our Microsoft Azure, like, global sales fleet and field thing that, you know, we have a foot in the door of every enterprise customer out there. And so combining and marrying, like, an amazing product that's there and a global sales fleet trained to get that in the door with customers, we had this really great partnership of development. But then when you think of your question on why was it so successful, what made it take off, these kind of things, some of this I would... I actually did a conference in Florida, somewhere halfway between this journey, and I asked this question to the audience and that was there.
Starting point is 00:15:07 I said like, why, like, why do you like Azure Databricks? Shout out from the crowd. What, what, what do you like? And people, I think there were like three simple answers that came out. It's fast, it's easy to use and it's secure. And I think that ease of use was, was really important. Looking at other comparable Microsoft products that we had, we had some other services that did offer Spark,
Starting point is 00:15:32 open source Apache Spark, but they were not that easy to use and a little bit cumbersome. So Databricks brought the easiest Spark experience to the market. And it was just a pleasure, pleasure to use the product, the data, the collaborative notebooks and everything else you can do for data science and, and whatnot, and that, that was a big factor. I think another dimension that made this successful is people saw this as, you know, every other product that we offer from Microsoft is like in the
Starting point is 00:16:06 Azure cloud only. And this is actually a unique product that is available on multiple clouds. And so people would feel that they are future-proofing their data environments and data stacks by, hey, if I ever needed to leave Azure and go to AWS, hey, I'm using Databricks here, I'll use Databricks over there. And so they felt comfortable picking a product that was available on multiple clouds. I think that was an advantage that we had as well. So yeah, does that answer the question? Yeah, super helpful. Okay, I have one more, and then I know that Kostas
Starting point is 00:16:40 has a bunch of interesting questions, especially around the Lakehouse, but tell us about, tell us about one house. You know, so I know that of course, Vinath, who was such a wonderful guest to have on the show and Hootie, but tell us, tell us about one house. You came out of stealth three months ago. So this is, I feel privileged that we get to talk to you after coming out of stealth such a short time ago. Yeah. Yeah. This is really exciting. And of course, a topic that's dear to the heart as I'm a head of product here and in the weeds building this thing right now actively from the ground up. So OneHouse, I would summarize, if you wanted a one-line answer, I would say we are
Starting point is 00:17:18 a pre-built lakehouse foundation for analytics and AI. And If I break that down a little bit further, we'll get into, it sounds like the topics of what a lake house is and why it's important, why it matters, those things. But what I observed in my time with Azure Databricks and otherwise working with a variety, large diversity of different customers out in the market is, that it's hard to build a data platform on data lakes.
Starting point is 00:17:46 Even with amazing products like Databricks, even with amazing technologies like Delta Lake, Apache Hudi, these different things, it still is time intensive to build these lakes. And I would frequently observe customers take six months or longer with large engineering teams to operationalize and truly have like a production grade data platform right i built this with office and and other things even in first-hand experience and so what we plan to do with one house is we offer this fully managed experience to have a pre-built foundation of your data lake and lake house and and so we are you mentioned benoth vanos the founder
Starting point is 00:18:26 ceo of one house and he is the creator the original creator of apache hoodie that he created in 2016 at uber and so apache hoodie kind of pioneered this new transactional data lake category that now we call the the lake house and here now we're at one house, we're offering automation on top of the open source components that Apache, Apache hoodie has, has to offer. And so if you look, if you read up on Apache hoodie, you go to the docs and see what those, what those different services are, we have things like ingestion services and so with one house will offer manage ingestion points where your data's at.
Starting point is 00:19:05 We'll bring it in, stream it in real time and efficient ways. Then there's a lot of like table services. Like when you, when you think of what a lake house is, and I think we'll dive in that category soon, it's like, there's a lot of things you want to, to a lot of services you want to operate on this data, like clustering, compaction, indexing, cleaning up historical metadata, these kinds of things. And we'll automate the use of all these services.
Starting point is 00:19:32 Then, yeah, from one house there, we're not building a query engine. So one of the other goals that we have is to try to decouple data infrastructure from query engines. The worldview that I see out there today is most people build a query engine. There's a lot of dollars that you can chase after for ETL and query dollars. But then they build a vertical optimized stack down to the lake or the warehouse, wherever the data resides. They build a vertical optimized stack that like, hey, it's going to be the best for this
Starting point is 00:20:03 query engine. Let's crank the wheel and make more revenue through our query engine, et cetera. And so what we want to do is make it so that people can very easily stand up a data lake or lake house platform and have interoperability across the, to be able to use the query engine of their choice. I've seen people do want to have mixed mode compute and use things like Trino, use things like Presto, use things like Spark, Hive, you name it. And we want to be able to provide that flexibility to future-proof your data as well. Love it. All right, Costas, I've been monopolizing, and I heard a couple of words in there that I know are very near and dear to you and what you work on every day. So please jump in. Yeah, yeah.
Starting point is 00:20:49 I mean, I'll start with a bit of a more personal question, though, before we go to the Lakehouse story. So, Kyle, from working in one of the biggest enterprises in the world, which is Microsoft. Yeah. You moved to a city. Oh yeah. That was a good question. Yeah. Right. Uh-huh.
Starting point is 00:21:09 How does it feel? Oh, wow. It's a big change. It's a big change. So yeah, in my, in my career journey, there's like a side note where like I was volunteering, helping some people do some like startup things, startup their own businesses with this program CalcDef, we can talk about later. But I was inspired by the startup ecosystem and culture and then also being exposed to Azure Databricks.
Starting point is 00:21:33 Like we started partnering up with Databricks. They were Series D, about a few hundred employees. And I was there for that whole like rapid growth journey. And so then when I started to look at what like next opportunity i wanted to do in my career i looked around of course first inside microsoft i'm happy you know things are good here and but i felt like i'd be bored with any other choice that was there like after experiencing such fast pace and so i started to look at more more startup scale i didn't think i would go this small honestly i thought i'd land somewhere in the CV, that kind of range, somewhere in the middle. But when I met Vinoth and this was,
Starting point is 00:22:11 I was deep in this lake house domain already and Delta Lake and everything else. And then I met him and he's a creator of Hootie. And so the parallels there. And when we started to talk, it was just light bulb moments that were going, light bulb experiences that were going off in my head of my, this is, these are incredible market opportunities that I already know. And I feel like this, this strategy to build this company in ways like once in a lifetime, I was like, let's, let's do it. Let's go build. So I haven't been a part of some startup this small. I'm learning, learning as we go. And it's, it's a lot of fun.
Starting point is 00:22:44 That's nice. Nice to hear. And so can you share an experience that really surprised you? Like something that's obviously different, but you also didn't expect. Yeah, good question. I think most of these I expected when I got here, right? The complete lack of structure, the complete lack of, like, guidance and direction, like, it's all on you, it's all on your shoulders. these together, one thing that excites me and energizes me about this experience is like feeling that ownership, feeling that accountability and feeling that like, hey, there's no one else here that will get it done unless I get it done, right?
Starting point is 00:23:37 If I fail, it's on me, right? And so feeling that accountability really amps up the energy and gets me excited to come to work gets me excited to lean in and and try to build for for the future yeah yeah that's awesome i mean to be honest like i i admire people that like they are able to go through such a radical change and still get excited because it is a right it's like it is radical change. Yeah, a completely different way of how things are operating and what kind of mindset you need to have. Yes. Yeah, that's amazing.
Starting point is 00:24:14 Cool, so, okay. Let's, one question about Microsoft also. So, it's kind of interesting, like the story around like Microsoft and the data infrastructure in general, because Microsoft has like traditionally a lot of innovation in this space, right? It's like just if we concentrate only on MS SQL on its own, like it's like a lot and lot of like innovation in like database systems and working with data. But from someone who, let's say, spent most of their time
Starting point is 00:24:49 like in the modern data stack, right? We don't hear about Azure that much. We don't hear about like Microsoft that much. We know that it exists, probably like something really big, but it's like also distant. Why is this happening? Let's say distance between... In one side, we have Snowflake, we have Google, we have AWS,
Starting point is 00:25:18 and then we also have Microsoft. Why is this happening? Yeah, I would maybe... I probably don't have the perfect answer, but I'll take, I'll give you my take on it. I think there might be two components. One is Microsoft is hyper-focused on like their environment, their lane. Like let's get it done for Azure and Azure customers.
Starting point is 00:25:42 And we don't have many like cross cloud plays like Google. Google does great like cross cloud plays with BigQuery and a bunch of other services this way. AWS of course is the market leader and has the most market share in terms of like cloud compute and things like this. And so I think some of it is because of that stay in your lane marketing that Microsoft focuses on. The other half may be because the modern data stack is also pretty new and it's evolving and building a startup myself now,
Starting point is 00:26:21 I see that the first place I'm going to build is AWS, where we're the largest market shares. And of course, Azure is still, I would that the first place I'm going to build is AWS, where we're the largest market shares. And of course, Azure is still, I would say, my favorite cloud. And so I want to take my product to Azure as well. I just need the right customer demand mix to take it there. And so maybe that's also where we see some bias in the modern data stack is like, hey, it's starting out and some of these are new products. Some of them are mature products, but you'll see the new ones will probably gravitate towards where they think they will find the most customers. And then they can expand and grow from
Starting point is 00:26:56 there. Because I think if you look at the modern data stack, most of these companies are outside of the cloud native vendors. And so they'll want to build multi-cloud products. It makes sense for anyone outside of a cloud vendor to be available in all clouds. But AWS is like an easy choice to start from first. Henry Suryawirawanacik, Absolutely. Do you think that also has to do with like, let's say what's market like each one of the, like the cloud vendors probably focusing a little bit more, has more success? Because I don't know, like in my mind, at least Azure and Microsoft is always like an enterprise.
Starting point is 00:27:38 So, right. Like that's like what they know really well, how do, how to build there and like all these things. While on the complete opposite side of the spectrum, you have Google, which is more of like the medium size, like small size, like customer. Then somewhere in between you have AWS, right? Do you think that's also like... I think it does influence because if you look at this from a perspective of owning these services or owning these products, you want to hyper-focus your efforts on where you have the most success, where you can drive the most revenue. And if you have these largest enterprises come to Microsoft for these big contracts, sometimes people combine and they want a single vendor for like Office. We were talking about Office, right? And have O365 and Azure and combined spend, these kind of things, combined relationship. Then, yeah, then if I own these products, I would focus the success on where I know I can turn the crank and drive more revenue and dollars that way. So I think it might have a component too.
Starting point is 00:28:42 Yeah, that makes a lot of sense, I think. And it's probably like also one of the reasons that, I mean, data breaks together with Microsoft and Azure meetings, thanks to those like so successful. You take like a platform that it's, let's say, built for the enterprise and you have a product that is also, let's say, addresses the needs of the enterprise has today, so you put them together and you have success. So it's like, of course, the go-to market is going to follow it. Like it's going to work very well, but I think it's like the perfect context to go and build integration product there with the two together.
Starting point is 00:29:17 So my assumption would be that like that was also like an important factor to the success of this collaboration. Okay, cool. I think it would be that like, that was also like an important factor to the success. Yeah. Yeah, I agree. Of this collaboration. Okay, cool. So let's talk about lake houses now. As you said, it's a new category. The common, I think, understanding is that the lake house is a paradigm that emerges through the need to work with data lakes and the issues that we have with building and operating data lakes.
Starting point is 00:29:55 Can you tell us a little bit about what are these challenges? Why the data lake is not enough? Why do we need something on top of that? Or, I don't know, like a completely different part in there. That's awesome. Yeah, good question. And I would start it with helping point people to some resources too, so they know it's not just my opinion.
Starting point is 00:30:15 If you go search like Gartner's latest data management hype cycle, what you'll see on there is data lakes are in the pit of what G Gardner calls the trough of disillusionment. And same with, I think I have the chart handy here, even data engineering side by side with data lakes, lake house, metadata management, a bunch of different components that if you look at these trends together, it starts to call out and make obvious what the problems are. If you're not in the space, if you're not living and breathing and felt the experiences already, you can learn from this perspective. Because one, I mentioned that data lakes are hard to build. But also, when compared to alternatives, they lack many qualities and features, right? Like a data lake is just a collection of files out on cloud storage, whether that's S3 or ADLS, GCS. And these files represent data. And then you have to build metadata systems around how to understand what data is in what files and track these. And then you have to manage the size of the files and how the files are organized. And not to mention access control around the files.
Starting point is 00:31:42 All these different components that make lakes so painful and hard to use. Whereas if you look at an alternative, like a data warehouse, you can pick up a data warehouse off the shelf, purchase like a data warehouse and use it. It's ready to go, right? You put data in and it's all like their schema is managed. These are tables. You can write your SQL. It's even if you're using a service like Snowflaker or otherwise, it's like no knobs, performance tuning, these are tables, you can write your SQL, it's even if you're using a service like Snowflaker or otherwise it's like no knobs, performance tuning, these kind of things and
Starting point is 00:32:10 it just works, right? And the data lake you have to go build and but the lakes also miss like if you study in data warehouses, the lakes are missing a thing called ACID transactions where you can't process updates or merges or these kind of things on lakes because the file systems are immutable file storage so you don't have acid transactions you don't have like managed schemas metadata these different components that are different now if you look if you flip the tables and look at what are the advantages of the lake versus a warehouse, because what I just described, maybe you'd be inclined to pick a warehouse over a lake. But on the flip side, there's a lot of advantages the other direction
Starting point is 00:32:54 where the lake is a lot cheaper. The economics, especially when you start to scale on those economics, it's a lot cheaper to use a lake. You also can use a variety of like structured and unstructured data. When it comes to machine learning data science, a lot of this is unstructured data, the warehouse also kind of locks you up to a single vendor, right? Like, like you're, you put your data in that warehouse and then it's all run on the compute of that vendor, et cetera. Whereas on a lake, you are open to play in more of the open source ecosystem and have a variety of tools.
Starting point is 00:33:30 You're kind of more agile and future-proofing in that way. So backing this back up to lake house, right? What a lake house is, is you can take a there's these open source projects, Apache Hudi, which is of course close to our heart at One House and Delta Lake, which was close to my heart at Databricks, where they take the best capabilities of a warehouse and bring them to the lake. And that's why we call these lake house environments. So does that help answer some of those? Like why the lake house? Yeah, absolutely. It kind of feels a little bit like trying to build, let's say, a database
Starting point is 00:34:11 system from starting from the file system and going out, right? Yeah. Which kind of is actually. Just like in, I mean, these are databases from like more of a monolith. So there aren't very complex architectures behind, but everything is hidden behind the configuration that the SQL dialect that you're using. Yeah.
Starting point is 00:34:34 But right now we're pretty much when you build a lake house, you have to know about query engines, about parquet files, and.org files, and then table formats on top of that. And it sounds like a lot of work, right? And I understand that operating a system like this is going to be pretty hard. And that's why I'm going to, in a way, ask the question again but like from a different angle what's so great and so important about the data lake that puts people into like all these efforts to do that like okay why people let's say don't just get snowflake and call it a day at the end right you know what is that we cannot do with like the data warehouse?
Starting point is 00:35:25 Sure, sure. I think this I can answer with a perspective on what I've seen from customer journeys. And then maybe some specific examples too. So what I've seen in the market today, there's a common pattern where because it's easier to start with, and if you think of like a company's life cycle or their data engineering team's life cycle, when you only have a few engineers or you're just getting started, or maybe your data's not huge in size, hey, I'll just pick up a warehouse. Maybe I pick up a combo, like a five-train snowflake kind of thing, and I start building on this warehouse. But where I've seen countless challenges is once people hit growth phases in their data
Starting point is 00:36:09 engineering platforms, or they hire more like data scientists and the data scientists are like, Hey, I need to train these models. And now I need you to go instrument these, like we're talking to office apps with more events and like machine generated data and, and things like that, and the, the size of data increases like by incredible scale, but also the complexity of the workloads that you want to run on your warehouse. And this is where I see a lot of tumultuous kind of migration start to happen where, where, when people scale in their warehouses, they get this cost fatigue.
Starting point is 00:36:47 And I see even examples, I've been talking to a lot of big customers and seeing the amount of dollars they spend just on ingesting data into these warehouses. And then subsequently, even like ETLs and these other things. It's huge dollars. And what I think people realize is that on a lake, you are able to break apart different parts of your workloads. Like if you just try to dissect what are the workload types on the lake, you have query up on the top, you have ETL. You have data management. This might include reprocessing or backfilling, GDPR deletions. You have performance tuning, things like cleanup of data.
Starting point is 00:37:35 And you have ingestion, right? And you break apart these different types of workloads. Not all of them have the best ROI characteristics on one compute platform. And so, like I mentioned, these characteristics, like query dollars, depending on the type and the concurrency and the requirements of latency, perhaps, warehouses are still the best pick. But for like exploratory analytics, you might be better off with like a Trino. For like ETL kind of processes is better off with a spark. And so not just better off in terms of like,
Starting point is 00:38:12 it's going to be more successful, but like actual big cost savings that you're able to drive across your platform. So the warehouse you're locked on one, one compute framework, the others. Now you can, can segment these across the board.
Starting point is 00:38:28 And this is why I see people have big problems once they get to those scale curves with their data platforms. Yeah, that makes a lot of sense. That's very interesting. This whole conversation around the workloads. So, okay. You mentioned something about scale. You use like this word like a couple of times and especially like when you're talking about the moment that like customers start realizing that they got
Starting point is 00:38:57 like, it becomes like really hard to scale using just like a Google warehouse. So is the lake house something that's, it's only relevant to enterprises or like to big companies, like who cares about the end for lake house or who should care about it? Sure. Yeah. I think that is kind of where one house comes into play as well. Cause right now, if you look at it, it's still kind of hard to build a lake house
Starting point is 00:39:24 and you need the right ROI characteristics to build a lake house and you need the right roi characteristics to enter a stage where you decide hey i'm going to pour a couple engineers on this project it's going to take us x amount of time so the economics work out that yeah let's build a lake house right and what we're trying to do with one house is flip that model completely and make it just as easy to build a lake house as like a five train snowflake combo click click click the button you're in your data's moved in it's all formatted up it's synced to the catalogs of your choosing now you could just have a data analyst come into the picture and start querying this data using this data you could bring a data
Starting point is 00:40:01 scientist to the table we let them use the compute context that they like. And so let me see if I can regroup back to your core question. I see the actual architectural pattern of the lake house is important and viable, I think, to companies of all sizes. This ability to have a single pane of glass or a single centralized place that you can manage your data and you can govern access controls around this data and you can share this data within your organization rather than, you know, the alternative, I see people kind of build out these silos or like, okay, we've got a data warehouse for these types of things. And then we've got this kind of database for other things. And we've got, you know, it's all kind of mixed. So I see that,
Starting point is 00:40:57 to answer your question, I think the lake house is of value to companies of any size. It's just on, like, it's a tough sell to try to build one sometimes if you're looking at it from like, you're getting started from scratch and, and you look at what the alternatives are. Yeah. Makes sense. Do you see like a future where like a late cows will be weaving like side by side, meet at their house and how do you feel like these two paralel things are going to coexist like in the company? Yeah.
Starting point is 00:41:29 Yeah. I think they already do coexist today and I've seen a lot of successful cases of coexistence as well. I think what might be interesting is to hear this from an angle of like the emergence of like house two. So, because when i was working on azure databricks i saw these patterns start to emerge with customers where before we had the term lake house and but we had delta lake we had the databricks service covers were looking for
Starting point is 00:41:59 ways to eliminate warehouses out of the picture they had had a mixed mode of lakes and warehouses and that they would use warehouses for BI analytics and this or for machine learning and ETLs and everything else from the lakes. But I saw time and time again, customers trying to eliminate the warehouse and bring these in. But sometimes they were struggling. When you look at BI workloads and the type of concurrency that happens from users that are in like Tableau or Power BI dashboards, you click one button that may end up triggering like 20 queries that go and execute to your query engine. And you look at Spark, right?
Starting point is 00:42:37 Azure Databricks with Spark. That's come into one single driver node. And if you study Apache Spark, like the fair scheduling within Spark, these things are not good for managing concurrency scale. And so I saw a lot of customers actually try to build lake houses before we had this lake house thing. And so the demand signals were very obvious to us. And we knew that customers wanted to do this. They were feeling the cost fatigue. They wanted to approach these scenarios.
Starting point is 00:43:05 And so that's where, you know, with Databricks, we solved these challenges by offering SQL endpoints that now can load balance between different Spark clusters and also the move to serverless, right? You see Databricks making that move right now as well. And so these type of things were at the same time when we came out and said, hey, this is like officially the lake house, right? Now BI queries can be distributed, be scalable, and actually work on the lake in a comparable way to warehouses. And that's where it was big to double down on that. Cool.
Starting point is 00:43:45 So one last question from me, then I'll hand the microphone back to Eric. So you mentioned Delta, you mentioned Houdi, obviously, and we also know that there's like Iceberg out there. So there are like three table formats. There is something common in all three of them. And that's like the production of adding like asset guarantees there for transactions, which is let's say like the minimum requirement for creating the table format, but what's different, like what's each one like brings on the table
Starting point is 00:44:24 right now that the other one does not? Sure. Can you help us like understand a little bit better, like why we have three of them out there? Yeah, sure. I think they were both, they were all born in, in different ecosystems at different times. Hudi was invented in 2016, came out of Uber from, from the CEO of our company. Delta Lake came to market and I think 2019 and Iceberg, I can't remember, is it 2019 as well, the same year as Delta Lake?
Starting point is 00:44:55 Maybe. And Iceberg, Iceberg came out of Netflix and these kinds of things. And I think at the get go, if you study Hudi and Iceberg, they were built to solve slightly different challenges, but ended up building really overlapping, solving general solutions the same. Same with Delta Lake. Now, if you want to look at ways that they're different or differentiators or things like that, I can talk from Apache Hudi's perspective first. And because this was something that I grilled Vinoth on as well when I first met him, where I was sitting in a really great spot and Azure Databricks growing so fast. And I met Vinoth and I kind of grilled him like, what are you going to do with this Hudi project? Like, how is it going
Starting point is 00:45:41 to make it with this gorilla in the room with Delta Lake? And then I started to actually learn about these technical edges that Hudi had. And you can go out and study some of the use cases or people that are talking publicly about how they're using Hudi and why they chose Hudi and these kind of things. A lot of them, there's a common pattern that emerges around like CDC kind of
Starting point is 00:46:08 workloads where you need to ingest CDC data into your, into your data. Like some of those are because of there there's two write formats that we have with, with Apache three there's copy and write and there's merge on read. And so we have these, this merge on reuses Avro based ways to write the data that we can then asynchronously compact into columnar formats. We have record level indexing with Hudi 0.11, where we just released also this multimodal index, which is really exciting. Go read about our latest release with 0.11, the new ways that we've
Starting point is 00:46:48 extracted another 10 to 30 X gains on query performance, and even switched to using H file metadata files formats that we can get like 10 to a hundred X performance gains on how you access the metadata, enable data skipping, these kinds of things, Hudi also takes a pretty different stance when it comes to concurrency control. Both the other products or projects, I should say, are working with optimistic concurrency control, which is a, you know, hope that things don't collide when they do retry this kind of mode. Whereas with Hudi, we have OCC and MVCC and you're able to get multiple
Starting point is 00:47:29 writers and also have the table services around your data, like when you have to compact the data, cluster the data, index the data, we can manage all these through a timeline server and make sure that there's no collisions at all for managing the data. So I've seen when customers do deep evaluations, they do deep technical studies, benchmark comparisons, these kinds of things, they usually tend to find like, Hey, Hudi and Delta come out kind of close when it comes to performance parity, but then when you look at like feature sets, Hudi's got a really exciting
Starting point is 00:48:04 bunch of feature sets and also the roadmap that's there. Does that help on the question? Yeah, absolutely. Absolutely. All right. I mean, I think we need to have like at least one more episode to chat more, to be honest. And maybe we should arrange to have like both you and Vinod at the same time. Oh, that'd be fun. That'd be a fun combo. Yeah, absolutely. So we should arrange to have both you and Vinod at the same time. Oh, that'd be fun.
Starting point is 00:48:25 That'd be a fun combo. Yeah, absolutely. So we should do that. So I give the microphone back to Eric. Eric, I keep from the conversation that there are times in life that being an optimist is not always good. At least when it comes to concurrency. That's true. Yeah.
Starting point is 00:48:50 I'm not. I love it. Okay, just one last question because we're close to the buzzer year, Kyle. So, you said that the Lakehouse format is really for companies of all sizes, right?
Starting point is 00:49:07 Let's speak to a listener who you know sort of maybe is living in warehouse only world or maybe they are living in sort of like we have sort of a whole separate you know data lake infrastructure than a whole separate you know warehouse infrastructure you sort of a whole separate, you know, data lake infrastructure, then a whole separate, you know, warehouse infrastructure. You sort of talked about those, you know, performing different functions. How do you begin to think about a world where you are working with a data lake house, right? Like maybe that's in your future. And so what are the different sort of modes of thought, workflows, et cetera, that you need to be thinking about? Yeah, that's a good question.
Starting point is 00:49:48 Let's start from the angle of, of someone that's at a lake or they have a mixed ball with lake and warehouse. The lake house fits in really nicely to compliment where warehouses are, where you can make a central place for ETL, BI, machine learning, these kind of things. And then when you need true enterprise-scale BI and lots of internal users and dashboards and a lot of concurrency and these kind of things, then you can push aggregate data and more cleaned-up data into data warehouses and be able to use it as more of like a serving kind of layer.
Starting point is 00:50:27 That's when you're starting from that angle. When you're starting from the angle that you described of, hey, you're a big data warehouse user. You don't have a lake anywhere in the picture. How or where would a lake house come in? I would say, you know, look at where you're spending the most money in your warehouse and identify what are the workload patterns that are costing the most money, whether that's maybe some ETL process, some kind of cleans or business aggregates or things like that. And find those and then run a POST on the lake and see see how much it costs you and and just see the the amount of things that you're able to get there for if people are looking on like how can they take everything that they've built in the warehouse and come to a lake as well
Starting point is 00:51:19 i think dbt is actually a really interesting choice and view there where if your logic is all written inside dbt, you can, your, your, like all your logic is pretty portable across these systems. So that's actually, if, if people are on the cusp and getting into their warehouse and they're like, Hey, I'm worried, you know, I might be thinking about a lake later, but I still need to build a warehouse. Perhaps put, put a layer on top on top that makes you more agile and portable for the future. That's just a suggestion. Love it. Awesome. All right. Well, Brooks is telling us that we are at time. Kyle, this has been such a wonderful conversation. I learned so much and it was a real pleasure to have you on the show. Thanks, Eric. Thanks, Costas. Appreciate it. As always, a super fascinating conversation. I think one of my takeaways, Costas, is that I don't know if I've heard the opinion stated so clearly that the lake house architecture is for companies of all sizes.
Starting point is 00:52:33 The underlying context in many of those conversations is enterprise use cases, high-scale use cases. And I mean, cost was certainly a subject that came up on the show, but it was really interesting. I think that really stuck out to me, you know? And so maybe we are seeing sort of the beginnings of a migration to a new architecture or at least like the, you know, sort of the very early stages of that and I don't know, maybe one house is the company that will, that'll make that happen. Yeah. Yeah. Yeah.
Starting point is 00:53:00 I mean, I don't know. I think I, I I've said that like many, I mean, at least a few times, like on our show or at least like in private conversations between the two of us, that like one of the, like the way that we should be looking into what the future will look like in the data space, let's say, ecosystem, whatever you want to call it, the sectors like check what the enterprises are doing. Like things start from there and then they go down. Like it's pretty much like the opposite of what was happening with SaaS, where
Starting point is 00:53:33 you would go and like innovate in the medium size or small size companies and then go upmarket, now like things are actually happening like the opposite direction here. I think one of the reasons like people do not use lake houses that much is because exactly like there's like, you need to have a lot of expertise and infrastructure in terms of like human resources, like to go and do that, like you need to call like the data engineers, like specialized, like systems engineers who can go and date with data lake and turn it into a lake house.
Starting point is 00:54:03 And I think that's exactly where the opportunity lies for companies like One House, like even like Iceberg with Tabular, the company behind that, and Delta Lake with DataBricks, right? How we can, I mean, the market can come up with products that will make it like much, much easier to build the systems. Because I think that at the end, what the late house delivers is a platform that it's flexible enough to accommodate in the most optimal way, all the different workloads that the company might have.
Starting point is 00:54:43 And like on-banks, we do not have just one workload anymore. Even small companies. So, I think that's like what is happening there. And it's still early, but I think we will have more and more
Starting point is 00:54:54 about this new category in the next couple of months at least. I agree. All right. Well, thank you for joining us on the Data Stack Show. Tell a friend about the show if you haven't already, and we will catch you on the next one. datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by
Starting point is 00:55:27 Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
