The Data Stack Show - 127: The Anatomy of a Data Lakehouse with Alex Merced of Dremio

Episode Date: February 22, 2023

Highlights from this week's conversation include:
Alex's background in the data space (2:41)
Comics and pop culture blending with finance training (5:20)
What is a data lake house? (7:36)
What is Dremio solving for users? (11:21)
Essential components of a data lake house (16:35)
Difference between on-prem and cloud experiences (33:53)
What does it mean to be a developer advocate? (41:31)
Final thoughts and takeaways (49:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Costas, this episode is going to be exciting. I'm actually excited to hear about the questions you asked
Starting point is 00:00:30 just because you have a lot of experience in the space. We're going to have Alex from Dremio on the show. And actually, we've been working on getting Dremio on the show for a while. They've been around for quite some time. And they do some really interesting things on the Data Lake and actually have recently made a huge push on Data Lakehouse architecture. So that's really interesting. And then Alex is an interesting guy. He has a lot of education in his background. So of course, I'm going to ask about that. But I want to ask what his definition of a lake house is. We've had some really good
Starting point is 00:01:07 definitions from other people on the show who invented lake house technology. And Dremia is kind of in an interesting place in that they kind of enable lake house functionality on top of other tooling. And so that's what I'm going to ask. But yeah, I'm so curious to know what you're going to ask. Yeah, I think having someone like Alex from Dremio, like it's a very good opportunity to go through like all the different, let's say components and technologies that are needed to build and maintain a lake house. Because Dremio is a technology that enables all the different components to work together.
Starting point is 00:01:45 And at the end, it gets, let's say, like an experience, like a warehouse, but on top of like a much more, let's say, open architecture. So that's something that like definitely I would like to do with him to go to that and see how these things like work together, how Dremio works with them, and also talk a little bit about like the future and what is missing from the lake house to become, let's say, something equal in terms of like the experience and the capabilities of a data warehouse, right? So yeah, like we'll start with that and we'll see, I'm sure that like more things will come up. David Pérez- Indeed. Well, let's dig in.
Starting point is 00:02:30 Alex Bialik- Let's do it. David Pérez- My one job. I did it. Alex Bialik- Yay. David Pérez- Yay. All right. Welcome back to the Data Stack Show. Alex, so excited to have you on. We've been trying to make this happen for quite some time and Alex, so excited to have you on.
Starting point is 00:02:46 We've been trying to make this happen for quite some time, and it's a privilege to have you here. Thank you. It's a privilege to be on. I mean, I'm very excited to be on the show. Very excited to talk about data. Very excited to just talk and just be part of the data fabric that the show provides. I love it. I love the energy. Well, let's start where we always do. Give us your background because we have some finance, we have some education. I mean, very interesting. So tell us where you came from. Yeah, no, Lifehouse kind of taken me to a lot of different places and given me a lot of experiences, which I feel has given me like a lot of
Starting point is 00:03:20 perspective that's made it fun to talk about things. But basically the story starts back when I was younger. Like many kids, I was really into video games. So I wanted to be a video game developer. So I went to college for computer science, but I eventually changed my majors because a variety of different events in life. I just made a shift into like marketing and popular culture, which led me to start a chain of comic book stores.
Starting point is 00:03:45 But then shortly after that, after I graduated, I went to New York City where I actually worked in training and finance. So I learned a lot about just being one speaking in public and producing clear communication by that training job, but also got to experience the finance side, which is a very data heavy industry and learn the importance of, you know, like the real time data when you're're talking about stock prices and stuff like that and how much that really matters to how everything works. So it gave me an appreciation for a lot of that stuff. But basically around 2018, 2019,
Starting point is 00:04:14 I was kind of ready to move outside of New York City, which meant time for a career change. I still always dabbled a lot with code and technology. So it felt like, this is what I do for fun. Why not make that what I do all the time? So I made that shift first off into like full stack web development and just was completely enthralled. Like basically, I ended up not just coding, but creating a lot of content around coding
Starting point is 00:04:38 and have like thousands of videos on YouTube about coding in pretty much any language you can think of. But eventually, basically, I wanted to combine the skills that I have from all my walks of life, training, marketing, coding, technology, public speaking, and developer advocacy seemed like
Starting point is 00:04:55 the right path. And then on top of that, I found myself constantly playing with different data technologies, just in my free time, learning how to go deeper with Mongo, Neo4j, different databases. So I targeted the data space, and I discovered go deeper with Mongo, Neo4j, different databases. So I targeted sort of the kind of in the data space, and I discovered like the Data Lakehouse and Dremio. And I got the privilege of becoming like the first developer advocate at Dremio and got to combine all my absolute, all my interests, all one day to day thing. So I just like live, eat this stuff nowadays, because I find that exciting. And that's how I got to where I am oh very cool okay tons to talk about both for developer advocacy because I you know I think
Starting point is 00:05:32 Kostas and I both are very curious about that and Dreamio one thing I'd love to hear about though and this may be getting too close to the specific questions around developer advocacy. Studying pop culture and then running a chain of comic book stores and then going into finance training is such an interesting series of steps. And I'd love to know is, was there anything from studying pop culture and working in the world of comics that you really carried with you into finance training? Because most people would think about those two worlds as sort of completely separate. And maybe they were, but, you know, just hearing about your background and the way that you like to combine learnings from different spaces, it seems like there may be a connection.
Starting point is 00:06:23 Yeah, no, I definitely think it's sort of like the way I've always, one of my skills in life has always been to sort of notice patterns, which is a great skill to have when you're writing code. But base bottom line is just like doing all these different things, you notice a lot of things are the same.
Starting point is 00:06:37 A lot of the things you need to do to be successful are the same. You start picking up on these patterns. So basically patterns that I learned when studying things like cultural studies, where you're learning about like what different cultural works mean to people and the meanings they can take on and how you can use that kind of structured communication carried into me starting that Kauai bookstore, where I also learned a bunch of entrepreneurial skills and learned about a lot of marketing techniques.
Starting point is 00:06:59 And at that time, it was like really early on online marketing. So it wasn't like what it was today. It was like starting a message board and trying to build a community on a on an old school like message board yeah but taking then taking that and then when i get and when i get into like finance you know i when i end up learning about all these like financial things and learning about that industry but at the same time i'm taking a lot of that uh the ability to communicate and organize myself that i picked up from those other things and again be able to take what's typically a really complicated thing to teach in finance and be able to teach it in a more entertaining sort of palpable way which has also been something i then repeat now in technology where basically i i don't necessarily always deliver my explanations of
Starting point is 00:07:37 things in the most technical highbrow way i try to really speak in a way that's accessible to anyone that anyone could talk to me for five minutes and be like, yeah, kind of get what a data lake house is. That's pretty cool. And I think that's what this wide journey that I've had and experience that I've had have really kind of helped me bring to the table. Yeah. Why don't we put that to the test?
Starting point is 00:07:59 Can you explain data lake house? And we've had some good explanations you know from vanath who uh you know who created hootie and several other people in the data lakehouse space but this is a huge you know area of importance for jimmy oh so can you just level set us on you know how does jimmy view the data lakehouse and what's your working definition of it? Got it. Okay. I mean, I would say at the core, the whole idea of a data lakehouse is just saying, hey, I have data warehouses, which have certain pros and cons, and I got a data lake, which has certain pros and cons. Can we create something that's in the middle that has the pros of both of them? So when I think of a data warehouse, hey, I got this nice enclosed place
Starting point is 00:08:40 where I can enclose some data. It's going to give me really nice performance, really nice user experience, really easy to use to work with my data. But if I have my data lake, I have this place where I can store a bunch of my data at a much lower cost point. And basically, it's much more open to use with different tools.
Starting point is 00:09:00 I would like all those things in one place. So how can we make it where, hey, I can have all my data in the data lake, but still get that same performance and ease of use of the data warehouse. So that's essentially the premise. But now how you architect that, how you make that happen, everyone's kind of got their story to tell. And us at Dremio, we definitely have ours. key component is going to be like speaking since you mentioned binoth and hoodie it's going to be that it's going to be that table format because that's going to be whatever tool you're using it's those table formats apache iceberg apache hoodie and delta lake that are really going to enable those tools to actually give you the performance and that access and then each tool
Starting point is 00:09:37 can then provide you that ease of use and that's where drumeo will really specialize and try and say anything that made it difficult to use a data lake as a center of your data world before, Dremio tries to address that. So you think of ECUs, Dremio has like a UI that makes it really easy for anyone to go query the data. But also when it comes to like governance and controls, Dremio has a nice semantic layer that makes it really easy to organize your data across a bunch of different sources and act and control the actual access to them so that way you can meet your regulatory needs and whatnot and when it comes to things like migration especially now we now have the cloud product but even with the software product if you
Starting point is 00:10:15 were a on you had an on-prem data stack and you wanted to start moving towards a cloud dremio works with dremio software works with your cloud and it works with your on-prem data. So in that case, you can create one unified where basically people who are working with the data, they don't have to notice the migration. They're just accessing the data from Dremio and they don't even realize that data is being moved from on-prem to cloud, making migration to the cloud much easier for companies. So there's also different benefits that Dremio provides. And again, trying to make the data lake easier and also more performant. Because Dremio has all these, one, it really leverages things like Apache Arrow and Apache Iceberg, really from top to bottom. But also has features like the columnar cloud cache, which makes using cloud storage faster.
Starting point is 00:10:57 And also data reflections, which is the real secret sauce with Dremio. Think of it as like, I mean, it's a little bit more complicated than this, but the way I like to think about it is like automated materialization. So normally in a database, you could create a materialized table. So like this sort of mini copy of your table to make certain things faster. The problem is like, if I'm querying and I want to take advantage of that materialized view, I actually have to know it exists and say, okay, query that, not this. Now with Dremio, you have reflections. And if you turn on reflections, if the reflection can speed up a particular query on many different data sets, it will, you don't have to think about it. You don't have to be aware that it exists.
Starting point is 00:11:33 And that basically really, one, makes it easier to make things faster, but also makes it easier for people to take advantage of that. Yep. Super interesting. Can you describe for us the state of a company before they adopt Dremio and sort of what does their architecture look like and maybe how are they trying to solve some of the problems that Dremio solves? I think that would help me
Starting point is 00:12:01 and our listeners just understand like, okay, what state are companies in before they adopt tremio got it okay i mean there's a variety of different possibilities that's the thing about remio is like it's hard to say this is the why because there's so many whys but i think one of the most compelling stories is definitely that migration story you're a company that you know wants to use the cloud more you want to move your data to s3 you want to move your data to azure you want to move your data to the cloud. But the problem is like you have tools
Starting point is 00:12:26 that work with your on-prem data and then there's tools that you want to use with your cloud data. And now you have all your consumers having to learn different sets of tools. There's all this migration friction. Well, Dremio creates like that unified interface. So it makes it easy.
Starting point is 00:12:38 First, you set up Dremio with your on-prem data, get everyone used to using it. And then you start migrating the data over to like S3 or Azure, and they don't even like notice it. So it makes that kind of migration easier. But I also see use cases where basically people just maybe had a really big data warehouse bill that wasn't really working for them. And basically by moving using Dremio,
Starting point is 00:13:01 they're able to access that data on their data lake and using all those performance features, they're able to get that performance and have, with that UI, get that easy use. It makes it easier to move less and less of that work on the data warehouse and really cut down their costs a significant portion. So, bottom line is, if you have a big data warehouse footprint that you would like to be smaller, Dremio is worth looking into. If you have an on-prem data lake that you would like to move to the cloud, Dremio is worth looking into. If you have an on-prem data lake that you would like to move to the cloud, Dremio is worth looking into. Or if you just have an on-prem data lake that you like, but you
Starting point is 00:13:31 just want to get more juice out of it, Dremio is going to provide that to you because it is going to provide you that better performance on the data lake and it's probably one of the best on-prem tools there is right now. So, generally, if you're using a data lake and you want to use that data lake more, Dremio is going to have some sort of solution. Yeah. And what I'm just so curious to know, you know, we were chatting before, cloud is a fairly recent, you know, in the history of the company is fairly recent for Dremio, you know, in the last year or so. And so having a company that, you know, is largely built on and
Starting point is 00:14:07 has been extremely successful with on-prem, can you just describe being inside of Dremio? Like, what is the mindset shift been? And what's that been like, you know, sort of focusing on cloud, having spent so much time and effort on on-prem. And I know that migration story is a big part of that, but just interested to know, that's probably something that, you know, some of our listeners may know migrating from on-prem to cloud, you know, from a basic infrastructure standpoint, but you doing that as a product is really interesting. Yeah. So essentially, like you have like two product, two overarching products over there, like
Starting point is 00:14:42 Dremio in the sense that you have Dremio software, which is you would create your own cluster that runs Dremio software, but that can access data on the cloud and data on-prem. So that was already being used for those kind of migrations or to access data. But over the last year, what we released is Dremio Cloud, which instead of you having to kind of set up your own Dremio cluster and all this stuff, you can literally in a few minutes just sign up for Dremio Cloud, which instead of you having to kind of set up your own Dremio cluster and all this stuff, you can literally, in a few minutes, just sign up for Dremio Cloud, have a free account. Essentially, it's free of licensing
Starting point is 00:15:10 costs. The only cost there would be would be any cost of any instances you run up to run a query. Outside of that, the account's free. If you want to use our catalog, Dremio Arctic, that's free. And basically, sometimes I'll just open up, you know, run some queries with a spark
Starting point is 00:15:26 running in in the docker container my computer against my arctic catalog and again that's a zero cost operation so basically it makes it just easier to get that dremio experience so dremio made it easier to use the data lakehouse and dremio cloud made it easier to use dremio so it's always about that journey of trying to make things easy and open. Those are sort of the two key things we want to do. So Dremio Cloud makes it easier, but either way, if you're using software or cloud, it's open.
Starting point is 00:15:53 You can connect all your data sources. You can connect, work with your data, and also just work the way you've been working. You're not necessarily locked into doing anything particularly the Dremio way. So you have ways to take that data and use it elsewhere. So that way you don't have, and that's another thing a lot of people really like about Dremio
Starting point is 00:16:09 is just that they don't have to learn a new way of doing things. They can generally make whatever their existing workflow work. Yep. Super interesting. Yeah. I mean, it is, I mean, I know that building for, you know, on-prem versus, you know, sort of a, you know, a pure cloud SaaS product, very different, but thinking about it through the lens of making things easier and taking patterns that existed, but making those easier and delivering those as SaaS without the infrastructure burden makes a ton of sense.
Starting point is 00:16:39 Well, I have a ton more questions. Costas, please jump in because I'm going to try to end this one on time with Brooks out. Brooks Horowitz, Oh, sure. And feel free to direct me, Eric, if you have any hold questions that you have to ask. So Alex, let's start with like the basics and let's talk about the data lake and, in your opinion, what it takes to build a data lake and how also Dremio fits in this architecture. Got it.
Starting point is 00:17:15 I mean, bottom line, I mean, to build a data lake, it's just a matter of just having somewhere you store your data, whether that's an on-prem like Hadoop cluster or, you know, object storage like an S3, Azure, Google Cloud, having some of the stored data and a way to get it there. So your ETL pipelines that are going to take your data from your OLTP sources or whatever other sources you may have and move them to that storage area. But then always the next step comes to like, what do you do with it once it's there? And then that's where like things start to get more interesting because before really you could read data, you had tools that allow you to do like ad hoc queries and that was all fine and good problem is like what if you want to do big updates deletes things like that and then that's where we kind of get into sort of like we start crossing the line from data lake to data lake house with
Starting point is 00:17:57 things like apache iceberg could eat delta lake but where dremio comes in there is that there's all these pieces that you're going to need to kind of put all that together. Like, you know, you might want Apache Iceberg as your table format. So that way you can treat all your data like a table and be able to do deletes, updates. You may want to use, you want to leverage things like Apache Arrow so that way you can communicate with your data faster because there's less serialization between different sources. You know, things like Project Nessie, which will allow you to take those Apache expert tables and do Git-like semantics, like be able to branch a table, then merge changes in the table so you can isolate changes in the same way we would do with code with Git. All those things are really nice, but by themselves can be a lot of work to put the setup and put together. But with Dremio Cloud, so you have two, in Dremio Cloud, you have two tools.
Starting point is 00:18:42 You have Dremio Sonar with the query engine. Okay, it's going to make it easy for me to connect my different sources, whether it's my cloud storage, actually, whether it's databases like Postgres or MySQL. Connect them, join the data together, do whatever I need to do, be able to accelerate that data using reflections. And just do it, and again, have governance and set up permissions and do it in a very easy-to-use way on my data lake. So it makes that aspect a lot easier. But then you have Dremio Arctic, which is sort of the newer product, which is in preview still, which basically gives you that Project Nessie catalog
Starting point is 00:19:12 as a service. And that allows you to have this one catalog that you can connect with any tool. You can connect the Project Nessie with Presto, with Trino, I think there's a pull request for Project Nessie in Trino that you can connect to it with Flink, with Spark. The data-based catalog allows you to connect
Starting point is 00:19:27 whatever tool you want and work with your data. But Dremio provides you this UI that's going to allow you to manage that, be able to observe who's making what changes to your data, when did they make them, what branch did they make them to, and have all the benefits of that isolation from a nice place with an easy setup. Because again, you don't have
Starting point is 00:19:45 to do any setup literally setting up a dremio arctic catalog is you just sign up and you say make one and it's going to exist and you just connect uh so it dremio's role really is just to make the patterns that make a data lakehouse more practical to use easier bottom line it just becomes that gateway to make a say yeah I don't need the data warehouse. I would just do it here, but I can still bring in all those other tools because it doesn't really try. It always tries to adopt as many formats as possible, as many sources as possible, and be open
Starting point is 00:20:14 to connecting to as many tools as possible so that way you're never, you're not locked into anything. Yep. That's awesome. Okay, so you've touched a couple of different things. That's, of course, like for people who are working with data, like they're probably like, okay, known terminology, but that's not necessarily true, like for who listens to the podcast. So let's dive in a little bit like more into like some of in my opinion, like
Starting point is 00:20:45 fundamental pieces that you mentioned. And if I forget anything, please like feel free like to add it. So let's start, first of all, you talked about ETA, right? Like you have, let's say we have like a mechanism that is going to our transactional database, the ones that we use for our product, pulls all the data out and goes to a file system. It doesn't matter if it's like S3 or your local laptop, whatever, okay, still like a file system and you store the data there. Okay. Now from that to being able to query the data and query the data at scale.
Starting point is 00:21:25 And when I say at scale, I don't mean like at scale in terms of like petabytes, but at the scale of the organization, like to make it available to everyone. There's a lot of work that needs to happen. And let's start like with the first, which is how this data is stored on the file system, right? It's not like you just throw up, you know, random stuff out there and a query engine will figure it out and like make it available. So there are like formats out, right?
Starting point is 00:21:55 And before even we go to the table formats, we have the file formats, we have ORC, we have Parquet. So what are, let's say, are there like some specific requirements that Dremio has in terms of like how files, like how the data has to be like stored on the file system or it can be anything? I mean, Dremio is going to have like the best experience when you're using Parquet files. But I'm sure it does support ORC. Not sure about Avro, but bottom line is like, basically, when you're using Dremio,
Starting point is 00:22:25 if you use, for example, like that reflections feature, it's going to materialize your data into Parquet. So for example, let's say I have a Postgres database, and I'm joining it with some other table that I have somewhere else. You know, what's going to happen is that if I just join them, and this is always like sort of like the issue when you start like, you know, doing like data virtualization, is that hey, every time I want to look at this join, it's running this query in Postgres, and it's running this query for this other table, and they may have differing
Starting point is 00:22:52 performance. But with reflections, I can turn on reflections, it'll run that query, and then take that result and materialize it in Parquet. So that way, next time I look at those joins, it's performance. So Parquet really is really at the bottom layer.
Starting point is 00:23:07 So if we were to kind of go back to that foundational level and build up that data lake house, that first step is to basically land your data in a format like Parquet that's really built for analytics. Because Parquet is going to offer you lots of benefits. Like one is that instead of just having all the data just laid out there, it's organizing them into different row groups. The row groups have metadata. So a query engine like Dremio can actually scan that file and be like, okay, do I need to scan this row group?
Starting point is 00:23:30 If not, let me skip to the next one and really have those more efficient query patterns. But once you have all the files, well, my table might be 100 Parquet files or 1,000 Parquet files. So how does an engine know that these 1,000 Parquet files are a table? And that's where the data table format comes in. Basically, first you store the data, you get Parquet, so that way you get that nice, easy-to-scan files, and you get the table format so we can recognize those files
Starting point is 00:23:56 at the table. And then above that, you need engines that can actually read the metadata from Iceberg, and then also know how to read Parquet files to drill into those two layers to get the most best performance possible. Henry Suryawirawan, Cool. So you did something great here. You moved to the next fundamental piece of a data lake or lakehouse,
Starting point is 00:24:16 which is the table format. So we have Parquet, which is the serialization where we write the data. We store it like on the disk. And then we'd have the table format, like what's organized like tables, what's these table formats bring to the user, right? Outside of like, okay, going out there and creating some metadata that says like, all right, this table consists of 1000 files that you can find over there. There are also like other things that these formats provide, right? consists of 1,000.k files that you can find over there. There are also other things that these formats provide, right?
Starting point is 00:24:49 Can you help us with that? What else Iceberg, Delta, and Hudi are bringing to the end user? So all table formats, the main goal is to not only be able to... Because before you could recognize what a table was with Hive, but Hive did it based on a directory. So you said, hey, this folder was a table and whatever files were in that folder was a table, which was great at the time,
Starting point is 00:25:10 but also had a lot of different things that it can't do, particularly when it comes to like safe updates, delete, things like that. So modern data formats, table formats, you know, Iceberg, Hudi, Delta, basically their goal is to solve that. They say, hey, we need to find some other way to sit down and say, okay, these files
Starting point is 00:25:25 make up X table and then also provide supplemental information for engines to be able to query that table efficiently. So basically if you look at Apache Iceberg, it does it through sort of a metadata tree. And basically by going through that tree, the engine can
Starting point is 00:25:41 whittle away and say, okay, hey, there's a thousand files here, but once it works its way through the metadata, there's only really 30 files under the scan. And it allows you to kind of, it completely, so basically, all the actual scan planning is done through the metadata. You take a look at like Delta Lake, what it does, it basically works
Starting point is 00:25:57 through several different log files. And essentially you have like log file zero, which is like the initial state of the table. And then each kind of like, like get diffs, each log file says, okay, here are the changes to which files are the initial state of the table. And then each kind of like, like get diffs, each log file says, okay, here are the changes to which files are the table from the last log. So essentially you'd say, okay, Hey, I want to scan a table. And there is some metadata in there and some indexes that you'll use to
Starting point is 00:26:14 help do what's data skipping. So all three of them are doing trying, you're trying to skip data. You don't want to scan because you scan less data. You speed up the query without having to spend more on compute. Okay. So that's always the name up the query without having to spend more on a compute. Okay, so that's always the name of the games. Go faster without spending more. So what happens is
Starting point is 00:26:32 then you have a hoodie and hoodie works more and more in this like timeline system where basically every change is done on a timeline. That was more built initially to facilitate like streaming. Now in more recent versions,
Starting point is 00:26:44 they've made it now the default. You have this metadata table that facilitates that data skipping. It'll read the stats that are stored in this metadata table that's kept alongside your table and then plan the query around that. Iceberg, I think, has those stats
Starting point is 00:27:00 more really built into it intrinsically to how it works. The pattern would be, if I'm a query engine well what happened in apache iceberg is i you will have something called a catalog which could be like that germio arctic catalog i talked about earlier or something else and it's going to say hey there's this table that you said you have where can i find that data and it's going to point it to some where that metadata is and it's going to go through each layer and like that first layer is going to say okay this is what the table looks like and the second layer is going to be like this is
Starting point is 00:27:28 what a snapshot the snapshot you're trying to query looks like and then in the third layer saying okay these are the groups of files that you may need to scan there's some more additional metadata just on those individual files and then the query engine can be like okay that file I don't need this file I do need this file I don't need and then at the
Starting point is 00:27:44 end really only have to scan the files it absolutely needs. And then that's literally what the table format's doing. It's saying, hey, not only are these thousand files the table, but it's going to give you the information to say, hey, even though that's a thousand files at the table, I only need to scan three. That's how you get that performance.
Starting point is 00:28:00 Awesome. And then you mentioned something else, which is a catalog, right? So that's also quite important. So what is a catalog? was of the store and say, this is what I want to order for Christmas. Well, same thing when it comes to a table format. It catalogs and tells me, hey, what tables are available? And give me the information so I can access those tables. So it's basically the layer between the engine and the table format that allows... So the engine needs to know a few things. First, it needs to know where does the table exist?
Starting point is 00:28:40 That's what the catalog does. Then it needs to know which files are part of the table. That's what the table exist that's what the catalog does then it needs to know which files are part of the table that's what the table format provides and then it needs some metadata on the data in each individual file to fine-tune its scan and that's what the parquet file format does so at each layer it's just giving engine a little bit more information to to get to that an eventual scan without having to scan every row and every file every time but but basically like i like I, with Iceberg, the catalog is like, you have to have a catalog. I mean, it's built in sort of how it works and that's why it's able to
Starting point is 00:29:10 decouple from the directory approach. So again, the Hive had that directory approach. In Delta Lake and a Hoodie, you still very much kind of have that where basically this particular folder is the table, it just, again, has some additional metadata that kind of help wade through that, but Apache Iceberg, your files can be all over the place. Okay. And they'll still be part of your Apache Iceberg table, long as the metadata has them listed. And that creates some really interesting possibilities, particularly with migration. Because if I want to migrate
Starting point is 00:29:36 my Parquet files, let's say from a Delta Lake table to an Apache Iceberg table, I don't necessarily have to rewrite every data file into a particular folder. I can just run an operation that says, okay, these are the Par the files that make up the current state of my table. Write some Apache Iceberg metadata. You've literally rewritten nothing. All you did was write some new metadata, and your table has migrated. So that's, to me, one of the really cool differences when it comes to Apache Iceberg versus some of the other formats. But the facilitate, that's what you need to catalog. Cause there are other ways.
Starting point is 00:30:09 How is it going to know where all these files are if it can't figure out what the initial metadata file is. So that necessity for catalog is what allows that decoupling to really be a thing. Henry Suryawirawan- And okay. Let's let's talk a little bit more about Dremio now. Let's say we want to build like a data lake or lake house, and we need all these components that you mentioned, right? Do I have like to break my own here, or is it something that like I just sign
Starting point is 00:30:38 up on the cloud version of Dremio and like Dremio can take care of like all the different components that I need to build my lake house. Daniel P Laird... It can go both ways. So basically like if you don't already have a data lake house, you could just open up a Dremio account, connect wherever your data is currently. So again, if you have a Postgres database, MySQL database on your transactional side that has all your data and you want to start moving it over to a data lake house, you can just connect them and just start
Starting point is 00:31:02 moving the data incrementally. You won't even think about it and you won't realize it that's already being stored in iceberg tables, being stored in Parquet files, and You can just connect them and just start moving the data incrementally. You won't even think about it and you won't realize it. That's already being stored in iceberg tables, being stored in Parquet files. And if you're using Dremio Arcade Catalog, it's kind of got some really built-in functionality. So all those pieces are going to be there without you having to really think about the configuration of any of this or the deployment of any of it. But if you already have stuff in a way, like if you have Parquet tables that are not Iceberg tables and you want to use them, you can use them. If you have a Delta Lake table that you want to scan, you can do that. Like basically, Dremio allows you to keep the choices you've already made, but will make very sensible, easy to use choices if you go along with sort of if you're building with Dremio from the get go.
Starting point is 00:31:42 So it just depends on where you're coming from, but always tries to meet you where you are. That makes sense. And from your experience, from what you have seen out there as part of Dremio, what are the architectures that most commonly people have implemented for a data lake or a lake house.
Starting point is 00:32:06 And I mean, like, okay, I don't talk that much about like companies that, you know, they might have started a year ago, like a data lake initiative, because, okay, people need to understand that we might invent new worlds for things, but like things exist for quite some time. Like pretty much since the Hadoop, since like Hadoop came out there, like Hadoop is like a data lake at the end. Like it is like a file system where you go and like, you can store all your files there, then you can use MapReduce to go and like create the data you want. Yeah.
Starting point is 00:32:40 It's like super primitive. It's not, doesn't have like the stuff that we have today, but there are companies who started from back then and they are still like evolving their infrastructure. Right. So what are like the, let's say the paradigms that you have seen out there that like, they are like common. David Pérez- Hard to say, cause I would have to say like, the problem is like up until recently, there wasn't like over the last several years, you have seen some standards rise up again. A lot of stuff we're just talking about, like Parquet and whatnot. But before then, there really wasn't that much
Starting point is 00:33:11 of a standard way. Maybe Hive was a pretty ubiquitous standard. So that's probably one of the few things I do see consistently. But really, when I take a look at many different customer stories or potential customer stories, they vary quite widely. And I think that's why this space is so interesting right now because right now you are starting to
Starting point is 00:33:27 see sort of a movement towards more standardization and more you know what those patterns are going to be but you know you see everything from people who are literally treating a database as a data lake or you know or moving all their data into a data warehouse or something doing some weird hybrids between, you know, cloud and high far as like file storage between like a Hadoop and cloud for different use cases or different departments. But so every I would have to say almost every customer story I've heard up to this point has been different than the last some So it's hard to kind of say what's... But I can't think of any particular...
Starting point is 00:34:09 Hive is, I would say, the one thing I think you see over and over again. Do you think there is something that is very different if you consider, let's say, on-prem setups with cloud setups? I would say the big difference nowadays is that if you're you're on the cloud you're going to have a lot more the newer tools available to you just like everyone
Starting point is 00:34:30 sort of kind of gearing towards cloud that's one of the nice things about dremio it is kind of like a newer more modern tool that still very much makes sure that it's you can cater and take care of people who are on-prem so you have that benefit but it's consistent as far as like the experience so you'll have the same sort of at least from the end consumer you're gonna have that same experience whether you're cloud or not prem and that's sort of what it brings to the table but i guess the big consideration is just again what tools are going to be available to you so that's continuing to shrink on prem while continuing to grow in the cloud yeah makes Suryawirawanaclapur, Yeah, makes a lot of sense. And you mentioned at the beginning, like when you were like chatting with, with Eric about how LakeHow became a thing, like by getting like the data lake, you
Starting point is 00:35:14 have the data warehouse and try like to create like a hybrid there, right? So what's, what do you think that is missing currently from the lake how to make it, let's say, to realize the dream behind like this hybrid? I think the standardization of the catalog. I mean, I think you're starting to see more that we're not, you saw a few years off before you really see like, what does the industry standardize on for his table format? But I think you're seeing certain movements over the last year that a lot of coalescing around certain formats. But the next
Starting point is 00:35:48 thing will be the catalog. Because basically, every tool, the way it generally interacts with your data, regardless of what table format it is, regardless of what file format it is, it's through the catalog. So basically, if you need different catalogs for every tool, you're still kind of running into interop issues.
Starting point is 00:36:04 And this is where Project Nessie, I think is really going to be really important because it does, when it offers a catalog that's built to be a catalog in modern era, like that's its purpose versus like a lot of things that we use for catalogs nowadays,
Starting point is 00:36:17 we might be using like an iceberg. You have a choice between like using like a database as a catalog, you can use Hive as your catalog, you can use Glue as your catalog, but none of these tools were really built to be sort of like that kind of catalog in the same way Project Ness is built. It gives you these extra features
Starting point is 00:36:30 that grant to do a lot of new operations and also be able to control governance across tools. That'll be also part of it, being able to set rules on different branches and whatnot. So that way, hey, if I connect to that same table from Dremio and Trino,
Starting point is 00:36:47 I'm going to get the same access rules. And that's going to be sort of really key because that gives really one place where people can control access to their data across all their tools. So that's where it's nice about Dremio Arctic service because it's going to make it easier to adopt Project NetEase. And most tools can already connect to it more and that's expanding. so once you start seeing people sort of standardize on a catalog then it makes it easier for tools to really just focus on supporting the table format and supporting the file format
Starting point is 00:37:13 because they're not supporting 50 different catalogs again the more variety the more the harder it is to kind of give full support to anything so as we standardize on each of those levels that's when you're going to really see the data lakehouse continue to reach its next and next levels. It's already at a pretty insane level of what you can do now. When you think about just where we were a few years ago and what you can do now
Starting point is 00:37:33 with this technology, it's amazing. But when you think a few years from now, when basically more people are using the same catalog, more people are using the same table format, more people are using the same format, the level of support that can be provided by old tools to that is going to be kind of amazing
Starting point is 00:37:47 because then, again, you'll have that promise of openness where I can switch between tools I want and there's no vendor lock-in. But, like, so to me, like,
Starting point is 00:37:55 that's sort of, like, that next step. And, like, Dremio Architect is going to really help provide that step to give you that sort of open catalog that lets you use
Starting point is 00:38:02 whatever tool you want and have access to the data you want and control how your data is accessed from one place. Okay, and this is Project Nessie that you mentioned? Yeah, so Project Nessie is the open source project. Dremio Arctic is sort of like the Nessie as a service
Starting point is 00:38:15 product from Dremio. But it's not just Nessie as a service. It provides you the catalog, but also provides you a really nice UI. It's going to provide you automated optimization features, so that way you can just optimize your table as you'd like so these there's other features that are coming down the road but at the core you're getting this catalog and you can connect to that catalog again using like presto flink spark dremio sonar pretty sure there's a pull request on trino to have that that as well so you'll be able to use all all your
Starting point is 00:38:43 major data lakehouse tools with it. And then that'll just continue to grow from there. But again, and the benefit is again, I keep mentioning the Git like semantics, but the real use cases there are threefold isolation. So for example, if I'm doing ETL work, you know, I might want to do some auditing first. So I can ETL that data into a branch and not merge it. So I've done my like verification and validation, multi-table transactions. Let's say I want to update three tables, I get joined regularly.
Starting point is 00:39:07 Instead of updating them one at a time and running the risk of having sort of broken joins, I can update them all on a branch then merge them when I'm done.
Starting point is 00:39:16 Or, you know, if I want to like basically create a branch that isolates a data at a point in time for like an ML model so that we can continue to test against
Starting point is 00:39:23 very consistent data, it makes all these much more possible, much more easier. Henry Suryawirawan, That's super cool. All right. And what, like, you mentioned that the catalog is like what is missing right now. And Project Nessie is trying like to fill this gap, but like how far away we are from like filling this gap, right? And is it like a technological like issue that makes it like, let's say, slower as a process or is it also a matter of like the current state of the industry and having like
Starting point is 00:39:55 all these like different stakeholders where each one is building their own catalog? And of course they want to promote their own catalog. Like I think Databricks, it's pretty recent that they introduced their own, which is closed source also, like it's not even like possible, like to consume it outside like Databricks itself. Right. So what's your take on that? I mean, that's inevitable.
Starting point is 00:40:20 I mean, like that's one nice thing about like, again, Apache Iceberg that it does support multiple catalogs. So, I mean like Noteflake just thing about like again apache iceberg that does support multiple catalogs so i mean like noflake just recently had like iceberg support and they created their own catalog and now they have a pull request to kind of add support to iceberg out of the box for that and that's just that's going to happen you're gonna have people who try to continue creating and that's one of the nice things about like an apache iceberg they do have this new thing called the apache arrest catalog or the rest which basically creates like a standard api so basically if anyone wants to build a catalog you can just follow this rest api open spec and then basically you iceberg would
Starting point is 00:40:56 automatically work with that catalog basically if ever theoretically everyone follow that spec then it doesn't matter even the cat you wouldn't even have to standardize on the catalog and you'd still be able to use it everywhere. So you have technologies like that. So I do think right now, again, you're starting to see first is going to be the standardization of the table format because that's going to determine which catalog people will choose from. And then once you start seeing much more standardization on the table format, then you'll see that battle for which catalog to use for that table.
Starting point is 00:41:23 I do think this year is going to be an interesting year, mainly just because there's a lot of interesting things that will be coming down the pipeline this year regarding catalogs on different levels. So as much as that,
Starting point is 00:41:34 as much as I can say, but the bottom line is like, I do think that the catalog conversation will be a big conversation this year. All right. Super interesting. Okay.
Starting point is 00:41:43 One last question from me because I want to give like some time to Eric also to follow up with any additional questions that he has. So last question is about developer advocacy, right? And I'd love to hear from you, like what it means to be a developer advocate for something that it's, okay, it's technical, but it's also, let's say, very, there are many moving parts. It's like when we're talking about the data, like we spend like all this time talking about table formats, file formats, catalogs, query engines, materialization. Like it's so many different things and you have like so many different technologies that
Starting point is 00:42:28 you need like orchestrate all together, right? Which is very different compared to being like, okay, I'm advocating for something like, I don't know, a JavaScript library, right? For the front end, which I don't say that it's not complicated, but it's much more like the scope of the technology itself. It's much more narrow compared like to something like Macau. So what does it mean? Like what's like unique about what you're doing and the value that Advocacy brings
Starting point is 00:42:57 it to the industry? Got it. Okay. So first we'll just start off with like developer advocacy as a thing. It's been really interesting. Like, you know, when I first discovered that this role existed, I realized this role is like tailor-made for me because there's certain skill sets you need, like basically the idea, I mean, at the end of the day, like the hope with a developer advocate is that you're sort of like the cross between basically, you know, like if you took a PM and
Starting point is 00:43:21 someone from the marketing team and like mushed them together, that's ideally what you want. Someone who can basically understand the product enough to be able to communicate its value with conviction and authority, but someone who can also understand the marketing and basically the idea that, hey, you want people to make a
Starting point is 00:43:40 choice and think about that. But being a good developer advocate, you'll be good at both of those things. you need one, you need technical knowledge, you need to, you know, know the space, know the technology and know technology in general. But then you also have to be one a good communicator, which is why I think, you know, like, you know, having a sort of a history in sort of educating really was really helpful one also have conviction ideas like you can't advocate for something you don't believe in so you got to believe in whatever you're the developer advocate for so i was excited to be at dremio because it's
Starting point is 00:44:14 such an exciting product at a very exciting time which is i think the most exciting part is just like the state that the industry in such a kind of this is such a moment of flux between so many different competing technologies it makes it that much more interesting and makes it that much more exciting to be on the front lines of that. But bottom line is, and also to be a content creator, because, I mean, you know,
Starting point is 00:44:34 to get that word out there, to be in front of people requires you to go speak at meetups, requires you to go do podcasting, requires you to go make videos, find any clever way to kind of get in front of people to speak that more technical level and
Starting point is 00:44:45 also creating like example code or useful tools like it goes beyond just saying okay hey this is what we do and this is why you should use it it's really being able to like empathize with people and like take a look at you hear people's experiences and their stories and be able to think of like get it because you understand them on the technical level but you also understand the pain on a different level and it's a difficult and that's one thing i noticed like it's a i can imagine it must be a difficult position to hire for because it's not usually it's you know you can find people who are good communicators but and then you can find people who are really technical knowledge but finding both of them and you know sometimes can be really tricky
Starting point is 00:45:22 so you know what's another reason why i'm like i'm very grateful that i've had such a weird backstory that took me through so many different experiences and why i just love doing what i do because it really is like a position that's tailored to the life story that i've had yeah well i think it speaks a lot to you because finding joy and understanding the deep technical stuff and in the process of trying to condense that down. And I think even throughout the show, it's been wonderful to hear you use examples. You know, you say, admittedly, this is oversimplified, but I like to think of it
Starting point is 00:45:54 as XYZ. It's very clear that you have, you know, sort of a deep love of both the technology, but also, you know, the way to communicate that best. One question I'm interested in, especially relative to your excitement around this technology that we've talked a little bit about before on the show when the subject of Data Lakehouse has come up, is when you think sort of wide market adoption will happen. And to put a little bit more detail on that question,
Starting point is 00:46:30 you know, there are certain characteristics that make the data lake house make a lot of sense at a really large scale, say, you know, sort of enterprise scale, right? So a couple examples you get, you know, moving from on-prem, making the on-prem experience better, you know And I certainly foresee some companies at even a small, an early stage, adopting a lake house architecture from the outset, right? Just so that they can essentially have a glide path towards scale that doesn't require any retooling. Now, that's not to say there's still a huge market for it and it sense you know it's like adopt a warehouse or query your postgres directly or whatever but i'm interested to know what are
Starting point is 00:47:29 you seeing out there from the germia perspective about companies adopting this way earlier than maybe 10 years ago companies trying to move towards a lake house architecture because of the enterprise specific issues got it yeah no actually, no, actually, this is something I did because I do a podcast called Data Nation. And it was actually an episode where I did episodes specifically on this, where I think people are saying that companies should adopt data lakehouse earlier.
Starting point is 00:47:53 Because really, usually the things that would impede you is just that like the cost of having a lot of this big data infrastructure earlier on was just really expensive and complicated. But especially with something like Dremio Cloud, it's easy and cheap. Like literally signing up like Dremio Cloud, it's easy and cheap. Like literally signing up for Dremio Cloud is just signing up.
Starting point is 00:48:08 And, you know, you use it when you need to and you don't use it if you don't. You have your account. So if you're a small company and you're thinking, hey, you know, wait, you know, I'm going to get to a point where I might want to start hiring a couple of data analysts, you know, and maybe right now you have everything
Starting point is 00:48:21 saved in maybe spreadsheets or you might have everything in a Postgres table. You can still connect them. Hire your data analysts. Have them start working directly from there. But then as you scale, as you were saying, your workflows aren't going to have to change when you get to that point where you're scaling because people are already using the tool that you're going to be using. And then you just shift how you store your data, the way your data is managed on the back end, but your consumers never notice the difference as you're going to be using. And then you just shift sort of how you store your data, the way your data is sort of managed on the backend, but your consumers never notice the difference as you grow. Yeah. Super interesting. All right. Well, we are at the buzzer. Alex,
Starting point is 00:48:54 this has been an absolutely fascinating conversation. I've learned a ton and we're really thankful that you gave us some time to join us on the show. Thank you for having me. And then I just recommend everyone out there to go follow me on Twitter at amdatalakehouse. You can also add me on LinkedIn. Check out my podcast, Data Nation, and also Dremio. We're starting a new weekly webcast called Gnarly Data Waves, where I'll be hosting, and we're going to have a lot of interesting people come talk. So come check us out. Awesome. Thank you so much. Kostas, one thing that struck me was the emphasis on openness, which I guess makes sense for a tool like Dremio, you know, where they need to enable multiple technologies. But a lot of times
Starting point is 00:49:34 you'll hear technology companies be a lot more opinionated, you know, like, we are doubling down on this file format because of these really strong convictions. And it was just really interesting to hear Alex say, you know, it probably works best with Parquet, but you should try to query a bunch of other stuff with it, and it'll work. It may not be the most ideal experience, but I appreciated that openness, right? And it seems like that's sort of a core value of the platform, at least as we heard from Alex. And so I thought that was really neat. And honestly, I think it's probably pretty wise of them, even though they're, you know,
Starting point is 00:50:14 obviously I think a lot of their customers are well-served by the Parquet format. But the fact that they seem to be building towards openness, I think is probably pretty wise for them as a company as well. Yeah, a hundred percent. I mean, I don't think that you can be in, let's say, the space of the lakehouse or the data lake without being open.
Starting point is 00:50:38 I think that's like the whole point. That's how like a data lake started as a concept, like compared to a data warehouse where you have like the opposite, like you have like an architecture that is like closed, you have like a central authority that like optimizes like every decision and have like total control over that. And okay, the data leak is the opposite of that. It's like, okay, here are like all the tools, figure out how to put them together and optimize them for your like use case, right?
Starting point is 00:51:09 So obviously there are like pros and cons there. Yeah. I have to say though that openness is a little, I think like easier in this industry, primarily because the things that you have to support are not that many. That's a great point. Right. Like, okay, if you compare the number of front-end frameworks that we have compared to how many file formats we have for like calling our data, you cannot compare them.
Starting point is 00:51:42 Right. And there is a reason behind that. It's because it's a different type of problem and it has like a more limited, let's say, probable set of solutions. So that's something that's easier also to achieve and maintain. Yeah. But this doesn't mean that it's not hard, right? If you are going to productize it, it's one thing. Like, what's another thing like the product?
Starting point is 00:52:10 So, yeah, it's very interesting. I really want to keep what Alex said about the catalogs and the importance of cataloging. That this year is going to be an important year and hear a lot about that. And yeah, like hopefully to have him again, like in a couple of months and see like how things are progressing and not just for Dremio, but for the whole industry in general. We will have him back on. Thank you again for joining the Data Stack Show. Subscribe if you haven't, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Subscribe if you haven't, tell a friend, and we will catch you on the next one.
Starting point is 00:52:46 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. That's E-R-I-C at datastackshow.com.
