The Data Stack Show - 127: The Anatomy of a Data Lakehouse with Alex Merced of Dremio
Episode Date: February 22, 2023
Highlights from this week's conversation include:
Alex's background in the data space (2:41)
Comics and pop culture blending with finance training (5:20)
What is a data lake house? (7:36)
What is Dremio solving for users? (11:21)
Essential components of a data lake house (16:35)
Difference between on-prem and cloud experiences (33:53)
What does it mean to be a developer advocate? (41:31)
Final thoughts and takeaways (49:02)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Costas, this episode is going to be exciting.
I'm actually excited to hear about the questions you asked
just because you have a lot of experience in the space.
We're going to have Alex from Dremio on the show.
And actually, we've been working on getting Dremio on the show for a while.
They've been around for quite some time.
And they do some really interesting things on the Data Lake
and actually have recently made a huge push on Data Lakehouse architecture. So that's really
interesting. And then Alex is an interesting guy. He has a lot of education in his background.
So of course, I'm going to ask about that. But I want to ask what his definition of a lake house is. We've had some really good
definitions from other people on the show who invented lake house technology. And Dremio is
kind of in an interesting place in that they kind of enable lake house functionality on top of other
tooling. And so that's what I'm going to ask. But yeah, I'm so curious to know what you're going to ask.
Yeah, I think having someone like Alex from Dremio, like it's a very good
opportunity to go through like all the different, let's say components and
technologies that are needed to build and maintain a lake house.
Because Dremio is a technology that enables all the different components to
work together.
And at the end, you get, let's say, an experience like a warehouse, but on top of a much more, let's say, open architecture.
So that's something that like definitely I would like to do with him to go to that and see how these things like work together,
how Dremio works with them, and also talk a little bit about like the future and what is missing
from the lake house to become, let's say, something equal in terms of like the experience
and the capabilities of a data warehouse, right? So yeah, like we'll start with that and we'll see, I'm sure
that like more things will come up.
Indeed.
Well, let's dig in.
Let's do it.
My one job.
I did it.
Yay.
Yay.
All right.
Welcome back to the Data Stack Show.
Alex, so excited to have you on. We've been trying to make this happen for quite some time, and it's a privilege to have you here.
Thank you. It's a privilege to be on. I mean, I'm very excited to be on the show.
Very excited to talk about data. Very excited to just talk and just be part of the data fabric that the show provides.
I love it. I love the energy.
Well, let's start where we
always do. Give us your background because we have some finance, we have some education. I mean,
very interesting. So tell us where you came from. Yeah, no, life has kind of taken me to a lot of
different places and given me a lot of experiences, which I feel has given me like a lot of
perspective that's made it fun to talk about things.
But basically the story starts back when I was younger.
Like many kids, I was really into video games.
So I wanted to be a video game developer.
So I went to college for computer science,
but I eventually changed my major because of a variety of different events in life.
I just made a shift into like marketing and popular culture,
which led me to start a chain of comic book stores.
But then shortly after that, after I graduated, I went to New York City, where I actually worked in training in finance. So I learned a lot about, one, speaking in public and producing clear communication from that training job, but also got to experience the finance side,
which is a very data heavy industry and learn the importance of, you know,
like the real-time data when you're talking about stock prices and stuff like that
and how much that really matters to how everything works.
So it gave me an appreciation for a lot of that stuff.
But basically around 2018, 2019,
I was kind of ready to move outside of New York City,
which meant time for a career change.
I still always dabbled a lot with code and technology.
So it felt like, this is what I do for fun.
Why not make that what I do all the time?
So I made that shift first off into like full stack web development and just was completely
enthralled.
Like basically, I ended up not just coding, but creating a lot of content around coding
and have like thousands of videos on YouTube about coding in pretty much any language you
can think of.
But eventually, basically, I wanted to
combine the skills that I
have from all my walks of life,
training, marketing,
coding, technology, public
speaking, and developer advocacy seemed like
the right path. And then on top of that, I found myself
constantly playing with different data technologies,
just in my free time, learning how to go deeper
with Mongo, Neo4j, different databases.
So I targeted sort of the data space, and I discovered the Data Lakehouse and Dremio. And I got the privilege of becoming the first developer advocate at Dremio and got to combine all my interests into one day-to-day thing. So I just live and eat this stuff nowadays, because I find it exciting. And that's how I got to where I am.
Oh, very cool. Okay, tons to talk about, both developer advocacy, because, you know, I think
Kostas and I both are very curious about that, and Dremio. One thing I'd love to hear about, though, and this may be getting too close to the specific questions around developer advocacy: studying pop culture
and then running a chain of comic book stores and then going into finance training is such an
interesting series of steps. And I'd love to know is, was there anything from studying pop culture
and working in the world of comics that you really carried with you into
finance training? Because most people would think about those two worlds as sort of completely
separate. And maybe they were, but, you know, just hearing about your background and the way that you
like to combine learnings from different spaces, it seems like there may be a connection.
Yeah, no, I definitely think it's sort of like the way I've always,
one of my skills in life
has always been to sort of notice patterns,
which is a great skill to have
when you're writing code.
But the bottom line is just like
doing all these different things,
you notice a lot of things are the same.
A lot of the things you need to do
to be successful are the same.
You start picking up on these patterns.
So basically patterns that I learned
when studying things like cultural studies, where you're learning about like what different
cultural works mean to people and the meanings they can take on and how you can use that kind
of structured communication carried into me starting that comic book store, where I also
learned a bunch of entrepreneurial skills and learned about a lot of marketing techniques.
And at that time, it was like really early on in online marketing. So it wasn't like what it is today. It was like starting a message board and trying to build a community on an old-school message board. But then taking that, when I get into finance, I end up learning about all these financial things and learning about that industry. But at the same time, I'm taking a lot of that ability to communicate and organize myself that I picked up from those other things, and again being able to take what's typically a really complicated thing to teach in finance and teach it in a more entertaining, sort of palatable way. Which has also been something I then repeat now in technology, where basically I don't necessarily always deliver my explanations of things in the most technical, highbrow way. I try to really speak in a way that's accessible to anyone, that anyone could talk to me for five minutes and be like,
yeah, kind of get what a data lake house is.
That's pretty cool.
And I think that's what this wide journey that I've had
and experience that I've had have really kind of helped me bring to the table.
Yeah.
Why don't we put that to the test?
Can you explain data lake house?
And we've had some good explanations, you know, from Vinoth, who created Hudi, and several other people in the data lakehouse space. But this is a huge, you know, area of importance for Dremio. So can you just level-set us on, you know, how does Dremio view the data lakehouse, and what's your working definition of it?
Got it. Okay. I mean, I would say at the core, the whole idea of a data lakehouse is just saying,
hey, I have data warehouses, which have certain pros and cons, and I got a data lake,
which has certain pros and cons. Can we create something that's in the middle that has the
pros of both of them? So when I think of a data warehouse, hey, I got this nice enclosed place
where I can enclose some data. It's going to give me really nice performance, really nice
user experience, really easy
to use to work with my data.
But if I have my data lake, I have this place
where I can store a bunch of my data
at a much lower cost point.
And basically, it's much
more open to use with different tools.
I would like all those things in one place.
So how can we make it where, hey, I can have all my
data in the data lake, but still get that same performance and ease of use of the data warehouse.
So that's essentially the premise.
But now how you architect that, how you make that happen, everyone's kind of got their story to tell.
And us at Dremio, we definitely have ours. The key component, since you mentioned Vinoth and Hudi, is going to be that table format, because whatever tool you're using, it's those table formats, Apache Iceberg, Apache Hudi, and Delta Lake, that are really going to enable those tools to actually give you the performance and that access. And then each tool can provide you that ease of use. And that's where Dremio will really specialize and try and say: anything that made it difficult to use a data lake as a center of your data world before, Dremio tries to address that. So you think of ease of use: Dremio has
like a UI that makes it really easy for anyone to go query the data. But also when it comes to
like governance and controls, Dremio has a nice semantic layer that makes it really easy to
organize your data across a bunch of different sources and control the actual access to them, so that way you can meet your regulatory needs and whatnot. And when it comes to things like migration, especially now that we have the cloud product, but even with the software product: if you had an on-prem data stack and you wanted to start moving towards the cloud, Dremio software works with your cloud and it works with your on-prem data. So in that case, you can create one unified interface where basically people who are working with the data, they don't have to notice the migration.
They're just accessing the data from Dremio and they don't even realize that data is being moved from on-prem to cloud, making migration to the cloud much easier for companies.
So there's also different benefits that Dremio provides.
And again, trying to make the data lake easier and also more performant.
Because Dremio has all these, one, it really leverages things like Apache Arrow and Apache Iceberg, really from top to bottom.
But also has features like the columnar cloud cache, which makes using cloud storage faster.
And also data reflections, which is the real secret sauce with Dremio.
Think of it as like, I mean, it's a little bit more complicated than this, but the way I like to think about it is like automated materialization. So normally in a database, you could create a
materialized table. So like this sort of mini copy of your table to make certain things faster.
The problem is like, if I'm querying and I want to take advantage of that materialized view,
I actually have to know it exists and say, okay, query that, not this. Now with Dremio,
you have reflections. And if you
turn on reflections, if the reflection can speed up a particular query on many different data sets,
it will, you don't have to think about it. You don't have to be aware that it exists.
And that basically really, one, makes it easier to make things faster, but also makes it easier
for people to take advantage of that.
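To make that contrast concrete, here is a toy sketch in plain Python of the difference between a materialization you must query explicitly and one a planner substitutes for you transparently. This is not Dremio's actual mechanism; the table name and covered columns are made up for illustration.

```python
# Toy illustration of "automated materialization": the planner checks whether a
# saved materialization can answer a query and swaps it in, so the user never
# has to know the materialization exists. Names and logic are illustrative only.

# Pretend these are precomputed results keyed by the (table, columns) they cover.
materializations = {
    ("orders", frozenset({"region", "total"})): "orders_by_region_parquet",
}

def plan_query(table: str, columns: set[str]) -> str:
    """Return the physical source to scan for a logical query."""
    for (mat_table, mat_cols), location in materializations.items():
        if mat_table == table and columns <= mat_cols:
            # The query can be satisfied by the materialization: use it transparently.
            return location
    # Otherwise fall back to scanning the base table.
    return table

print(plan_query("orders", {"region", "total"}))  # -> orders_by_region_parquet
print(plan_query("orders", {"customer_id"}))      # -> orders (base table)
```

The point is just the lookup: the caller asks for a table and some columns, and the planner decides whether a cached result can stand in without the caller ever naming it.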
Yep. Super interesting. Can you describe for us the state of a company
before they adopt Dremio
and sort of what does their architecture look like
and maybe how are they trying to solve
some of the problems that Dremio solves?
I think that would help me
and our listeners just understand like,
okay, what state are companies
in before they adopt Dremio?
Got it. Okay. I mean, there's a variety of different possibilities. That's the thing about Dremio: it's hard to say this is the why, because there's so many whys. But I think one of the most compelling stories is definitely that migration story. You're a company that, you know, wants to use the cloud more. You want to move your data to S3, you want to move your data to Azure, you want to move your data to the cloud.
But the problem is like you have tools
that work with your on-prem data
and then there's tools that you want to use
with your cloud data.
And now you have all your consumers
having to learn different sets of tools.
There's all this migration friction.
Well, Dremio creates like that unified interface.
So it makes it easy.
First, you set up Dremio with your on-prem data,
get everyone used to using it.
And then you start migrating the data over to like S3 or Azure,
and they don't even like notice it.
So it makes that kind of migration easier.
But I also see use cases where basically people just maybe had
a really big data warehouse bill that wasn't really working for them.
And basically by moving using Dremio,
they're able to access that data on their data lake
and using all those performance features,
they're able to get that performance and, with that UI, get that ease of use. It makes it easier to put less and less of that work on the data warehouse and really cut down their costs by a significant portion.
So, bottom line is, if you have a big data warehouse footprint that you would like to be smaller, Dremio is worth looking into.
If you have an on-prem data lake that you would like to move to the cloud,
Dremio is worth looking into.
Or if you just have an
on-prem data lake that you like, but you
just want to get more juice out of it, Dremio
is going to provide that to you because it is going to provide you that
better performance on the data lake and it's probably
one of the best on-prem tools there is right now.
So,
generally, if you're using a data lake and you
want to use that data lake more, Dremio is going to have some sort of solution.
Yeah. And I'm just so curious to know, you know, we were chatting before, cloud is fairly recent in the history of the company for Dremio, you know, in the last year or so. And so having a company that, you know, is largely built on and
has been extremely successful with on-prem, can you just describe being inside of Dremio? Like,
what has the mindset shift been? And what's that been like, you know, sort of focusing on cloud,
having spent so much time and effort on on-prem. And I know that migration story is a big part of that, but just interested to know, that's
probably something that, you know, some of our listeners may know migrating from on-prem
to cloud, you know, from a basic infrastructure standpoint, but you doing that as a product
is really interesting.
Yeah.
So essentially, like, you have two overarching products there at Dremio, in the sense that you have Dremio software,
which is you would create your own cluster that runs Dremio software, but that can access data on the cloud and data on-prem.
So that was already being used for those kind of migrations or to access data.
But over the last year, what we released is Dremio Cloud, which, instead of you having to kind of set up your own Dremio cluster and all this stuff, you can literally, in a few minutes, just sign up for Dremio Cloud and have a free account.
Essentially, it's free of licensing costs. The only cost there would be any cost of any instances you spin up to run a query.
Outside of that, the account's free.
If you want to use our catalog, Dremio Arctic,
that's free. And basically,
sometimes I'll just open up, you know, run some queries with Spark running in a Docker container on my computer against my Arctic catalog, and again, that's a zero-cost operation. So basically, it makes it just easier to get that Dremio experience. So Dremio made it easier to use the data lakehouse, and Dremio Cloud made it easier to use Dremio. So it's always about that journey of trying to make things easy and open.
Those are sort of the two key things we want to do.
So Dremio Cloud makes it easier,
but either way, if you're using software or cloud,
it's open.
You can connect all your data sources.
You can connect, work with your data,
and also just work the way you've been working.
You're not necessarily locked into doing anything
particularly the Dremio way.
So you have ways to take that data
and use it elsewhere.
So that way you don't have, and that's another thing a lot of people really like about Dremio
is just that they don't have to learn a new way of doing things. They can generally make whatever
their existing workflow work. Yep. Super interesting. Yeah. I mean, it is, I mean,
I know that building for, you know, on-prem versus, you know, sort of a, you know, a pure
cloud SaaS product, very different, but thinking about it
through the lens of making things easier
and taking patterns that existed,
but making those easier and delivering those as SaaS
without the infrastructure burden makes a ton of sense.
Well, I have a ton more questions.
Costas, please jump in because I'm going to try
to end this one on time with Brooks out.
Oh, sure. And feel free to direct me, Eric, if you have any questions that you have to ask.
So Alex, let's start with like the basics and let's talk about the data lake and, in your opinion, what it takes to build a data lake and how also Dremio fits in this architecture.
Got it.
I mean, bottom line, to build a data lake, it's just a matter of having somewhere you store your data, whether that's an on-prem Hadoop cluster or, you know, object storage like S3, Azure, Google Cloud: having somewhere to store the data and a way to get it there.
So your ETL pipelines that are going to take your data from your OLTP sources or whatever
other sources you may have and move them to that storage area.
But then always the next step comes to like, what do you do with it once it's there?
And then that's where things start to get more interesting. Because before, really, you could read data, you had tools that allow you to do ad hoc queries, and that was all fine and good. The problem is, what if you want to do big updates, deletes, things like that? And then that's where we start crossing the line from data lake to data lakehouse, with things like Apache Iceberg, Hudi, Delta Lake. But where Dremio comes in there is that there's all these pieces that you're going to need to kind of put all that together.
Like, you know, you might want Apache Iceberg as your table format.
So that way you can treat all your data like a table and be able to do deletes, updates.
You may want to leverage things like Apache Arrow, so that way you can communicate with your data faster, because there's less serialization between different sources. You know, things like Project Nessie, which will allow you to take those Apache Iceberg tables and do Git-like semantics, like be able to branch a table, then merge changes in the table, so you can isolate changes in the same way we would do with code with Git.
All those things are really nice, but by themselves can be a lot of work to set up and put together. But in Dremio Cloud, you have two tools. You have Dremio Sonar, which is the query engine. Okay, it's going to make it easy for me to connect my different sources, whether it's my cloud storage, whether it's databases like Postgres or MySQL.
Connect them, join the data together, do whatever I need to do, be able to accelerate that data using reflections.
And just do it, and again, have governance and set up permissions and do it in a very easy-to-use way on my data lake.
So it makes that aspect a lot easier.
But then you have Dremio Arctic,
which is sort of the newer product,
which is in preview still,
which basically gives you that Project Nessie catalog
as a service.
And that allows you to have this one catalog
that you can connect with any tool.
You can connect Project Nessie with Presto, with Trino (I think there's a pull request for Project Nessie support in Trino), with Flink, with Spark. The Nessie-based catalog allows you to connect
whatever tool you want and work with your data.
But Dremio provides you this UI
that's going to allow you to manage that,
be able to observe who's making what changes to your data,
when did they make them, what branch did they make them to,
and have all the benefits of that isolation
from a nice place with an easy setup.
Because again, you don't have to do any setup. Literally, setting up a Dremio Arctic catalog is: you just sign up and you say make one, and it's going to exist, and you just connect. So Dremio's role really is just to make the patterns that make a data lakehouse practical easier to use. Bottom line, it just becomes that gateway to say, yeah, I don't need the data warehouse. I would just do it here, but I can still bring in all those other tools, because it doesn't really try to lock you in. It always tries
to adopt as many formats as possible,
as many sources as possible, and be open
to connecting to as many tools as possible
so that way you're not locked into anything.
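As a rough sketch of what connecting a tool to a Nessie-backed catalog can look like, here is a PySpark configuration. The Maven coordinates, versions, endpoint URI, warehouse path, and table names below are placeholders and will differ in a real setup.

```python
# Sketch: pointing a PySpark session at a Nessie catalog so Iceberg tables in the
# lakehouse can be queried like any other Spark table. URIs, paths, and versions
# are placeholders, not real endpoints.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-demo")
    # Iceberg + Nessie Spark extensions (adjust coordinates to your Spark/Iceberg build).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,"
            "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register a catalog named "nessie" backed by the Nessie server.
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "https://nessie.example.com/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Once configured, the catalog's tables are just tables to the engine.
spark.sql("SELECT * FROM nessie.sales.orders LIMIT 10").show()
```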
Yep. That's awesome.
Okay, so you've touched
a couple of different things.
That's, of course, like for people who are working with data, like they're probably like, okay, known terminology, but that's not necessarily true for everyone who listens to the podcast.
So let's dive in a little bit like more into like some of in my opinion, like
fundamental pieces that you mentioned.
And if I forget anything, please like feel free like to add it.
So let's start, first of all, you talked about ETL, right?
Like you have, let's say we have like a mechanism that is going to our
transactional database, the ones that we use for our product, pulls all the data out
and goes to a file system. It doesn't matter if it's like S3 or your local laptop, whatever,
okay, still like a file system and you store the data there. Okay. Now from that to being able to
query the data and query the data at scale.
And when I say at scale, I don't mean like at scale in terms of like petabytes,
but at the scale of the organization, like to make it available to everyone.
There's a lot of work that needs to happen.
And let's start like with the first, which is how this data
is stored on the file system, right?
It's not like you just throw up, you know, random stuff out there and a query
engine will figure it out and like make it available.
So there are like formats out there, right?
And before even we go to the table formats, we have the file formats, we have ORC, we have Parquet.
So what are, let's say, are there like some specific requirements that Dremio
has in terms of like how files, like how the data has to be like stored on the
file system or it can be anything?
I mean, Dremio is going to have like the best experience when
you're using Parquet files.
But I'm sure it does support ORC.
Not sure about Avro, but bottom line is like, basically, when you're using Dremio,
if you use, for example, like that reflections feature, it's going to materialize your data
into Parquet. So for example, let's say I have a Postgres database, and I'm joining it with
some other table that I have somewhere else. You know, what's going to happen is that if I just
join them, and this is always like sort of like the issue when you start like, you know, doing
like data virtualization, is that
hey, every time I want to look at this join, it's running
this query in Postgres, and it's running this query
for this other table, and they may have differing
performance. But
with reflections, I can turn on reflections,
it'll run that query, and then
take that result and materialize it in Parquet.
So that way, next time I look at those joins, it's performant. So Parquet really is at the bottom layer.
So if we were to kind of go back to that foundational level and build up that
data lake house, that first step is to basically land your data in a format like
Parquet that's really built for analytics.
Because Parquet is going to offer you lots of benefits.
Like one is that instead of just having all the data just laid out there, it's
organizing them into different row groups.
The row groups have metadata.
So a query engine like Dremio can actually scan that file and be like, okay, do I need to scan this row group?
If not, let me skip to the next one and really have those more efficient query patterns.
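For a small illustration of that row-group metadata, here is a sketch using pyarrow to look at the per-row-group statistics an engine would use to decide what to skip; the file name and column are placeholders.

```python
# Sketch: inspecting the per-row-group metadata inside a Parquet file, which is
# what lets an engine skip row groups it doesn't need. File and column names are
# illustrative placeholders.
import pyarrow.parquet as pq

pf = pq.ParquetFile("orders.parquet")          # any Parquet file you have handy
meta = pf.metadata
print(f"{meta.num_rows} rows in {meta.num_row_groups} row groups")

for rg_index in range(meta.num_row_groups):
    rg = meta.row_group(rg_index)
    col = rg.column(0)                          # stats for the first column
    stats = col.statistics
    if stats is not None:
        # An engine can compare a query predicate against these min/max values
        # and skip the whole row group if it can't possibly match.
        print(rg_index, col.path_in_schema, stats.min, stats.max)
```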
But once you have all the files, well, my table might be 100 Parquet files or 1,000 Parquet files.
So how does an engine know that these 1,000 Parquet files are a table? And that's
where the table format comes in. Basically, first you store the data, you get Parquet, so that way you get those nice, easy-to-scan files, and you get the table format so we can recognize those files as a table. And then above that, you need engines that can actually read the metadata from Iceberg, and then also know how to read Parquet files, to drill into those two layers to get the best performance possible.
Cool.
So you did something great here.
You moved to the next fundamental piece of a data lake or lakehouse,
which is the table format.
So we have Parquet, which is the serialization where we write the data.
We store it, like, on the disk. And then we have the table format, which organizes it into tables. What do these table formats bring to the user, right?
Outside of like, okay, going out there and creating some metadata that says
like, all right, this table consists of 1000 files that you can find over there.
There are also other things that these formats provide, right?
Can you help us with that?
What else Iceberg, Delta, and Hudi are bringing to the end user?
So all table formats, the main goal is to not only be able to...
Because before you could recognize what a table was with Hive,
but Hive did it based on a directory.
So you said, hey, this folder was a table
and whatever files were in that folder was a table,
which was great at the time,
but also had a lot of different things that it can't do,
particularly when it comes to like safe updates,
delete, things like that.
So modern data formats, table formats,
you know, Iceberg, Hudi, Delta,
basically their goal is to solve that.
They say, hey, we need to find some other way
to sit down and say, okay, these files
make up X table
and then also provide supplemental information
for engines to be able to query that table
efficiently. So basically
if you look at Apache Iceberg, it does
it through sort of a metadata tree.
And basically by going through that
tree, the engine can
whittle away and say, okay, hey, there's a thousand
files here, but once it works its way through the metadata,
there's only really 30 files under the scan.
And it allows you to kind of,
it completely, so basically,
all the actual scan planning is done through the metadata.
You take a look at like Delta Lake,
what it does, it basically works
through several different log files.
And essentially you have like log file zero,
which is like the initial state of the table.
And then, kind of like Git diffs, each log file says, okay, here are the changes to which files make up the table since the last log.
So essentially you'd say, okay, hey, I want to scan a table. And there is some metadata in there and some indexes that you'll use to help do what's called data skipping. So all three of them are trying to skip data you don't want to scan, because if you scan less data, you speed up the query without having to spend more on compute. So that's always the name of the game: go faster without spending more.
So what happens is
then you have Hudi, and Hudi works more and more
in this like timeline system
where basically every change
is done on a timeline.
That was more built initially
to facilitate like streaming.
Now in more recent versions,
they've made it now the default. You have this
metadata table that facilitates
that data skipping. It'll read
the stats that are stored in this metadata table that's
kept alongside your table
and then plan the query around that.
Iceberg, I think, has
those stats
more really built into it intrinsically
to how it works.
The pattern would be, if I'm a query engine, what happens in Apache Iceberg is you will have something called a catalog, which could be like that Dremio Arctic catalog I talked about earlier, or something else. And it's going to say, hey, there's this table that you said you have, where can I find that data? And it's going to point it to where that metadata is, and it's going to go through each layer. That first layer is going to say, okay, this is what the table looks like. And the second layer is going to be like, this is what the snapshot you're trying to query looks like. And then the third layer is saying, okay, these are the groups of files that you may need to scan. There's some more additional metadata just on those individual files, and then the query engine can be like, okay, that file I don't need, this file I do need, this file I don't need. And then at the end, it really only has to scan the files it absolutely needs.
And then that's literally what the table format's
doing. It's saying, hey, not only are these thousand
files the table, but it's going to give
you the information to say, hey, even though that's a thousand
files in the table, I only need to scan three.
That's how you get that performance.
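Here is a hedged sketch of that pruning as seen from pyiceberg, assuming a recent pyiceberg release and a REST-style catalog; the catalog URI, namespace, table, and column names are invented for illustration.

```python
# Sketch: watching Iceberg's metadata pruning from the outside with pyiceberg.
# Catalog URI, namespace, table, and column names are placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog(
    "demo",
    **{"type": "rest", "uri": "https://catalog.example.com"},  # e.g. a REST-style catalog
)
table = catalog.load_table("sales.orders")

# Plan a scan with a predicate; the library walks the metadata tree (snapshot ->
# manifest list -> manifests -> data files) and returns only the files whose
# stats say they might contain matching rows.
tasks = table.scan(row_filter=EqualTo("region", "EU")).plan_files()
print(f"{len(tasks)} data files actually need to be read")
for task in tasks:
    print(task.file.file_path)
```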
Awesome. And then you mentioned something
else, which is a catalog, right?
So that's also quite important. So what is a catalog? Think of it like a store catalog: I could flip through the catalog of the store and say, this is what I want to order for Christmas. Well, same thing when it comes to a table format. It catalogs and tells me, hey, what tables are available, and gives me the information so I can access those tables.
So it's basically the layer between the engine and the table format that allows...
So the engine needs to know a few things.
First, it needs to know where does the table exist?
That's what the catalog does.
Then it needs to know which files are part of the table.
That's what the table format provides. And then it needs some metadata on the data in each individual file to fine-tune its scan, and that's what the Parquet file format does. So at each layer, it's just giving the engine a little bit more information to get to that eventual scan without having to scan every row and every file every time. But basically, with Iceberg,
you have to have a catalog.
I mean, it's built into how it works, and that's why it's able to
decouple from the directory approach.
So again, the Hive had that directory approach.
In Delta Lake and Hudi, you still very much kind of have that, where basically this particular folder is the table. It just, again, has some additional metadata that kind of helps wade through that. But with Apache Iceberg, your files can be all over the place. Okay. And they'll still be part of your Apache Iceberg table, as long as the metadata has them listed. And that creates
some really interesting possibilities, particularly with migration. Because if I want to migrate
my Parquet files, let's say from a Delta Lake table to an Apache Iceberg table, I don't
necessarily have to rewrite every data file into a particular folder. I can just run an
operation that says, okay, these are the Parquet files that make up the current state of my table.
Write some Apache Iceberg metadata.
You've literally rewritten nothing.
All you did was write some new metadata, and your table has migrated.
So that's, to me, one of the really cool differences when it comes to Apache Iceberg versus some of the other formats.
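A hedged sketch of that metadata-only migration, using Iceberg's add_files Spark procedure through the Spark session configured in the earlier Nessie sketch; the catalog, table, and path names are placeholders.

```python
# Sketch: "migrating" existing Parquet files into an Iceberg table by writing
# metadata only, via Iceberg's add_files Spark procedure. The data files
# themselves are not rewritten; names below are illustrative.
spark.sql("""
    CALL nessie.system.add_files(
        table => 'nessie.sales.orders_iceberg',
        source_table => '`parquet`.`s3://example-bucket/warehouse/orders/`'
    )
""")

# After this call, the Iceberg metadata lists the existing Parquet files as part
# of the table; a follow-up query reads them through the new table format.
spark.sql("SELECT COUNT(*) FROM nessie.sales.orders_iceberg").show()
```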
But to facilitate that, that's why you need a catalog. Because otherwise, how is it going to know where all these files are if it can't figure out what the initial metadata file is? So that necessity for a catalog is what allows that decoupling to really be a thing.
And okay. Let's talk a little bit more about Dremio now.
Let's say we want to build like a data lake or lake house, and we need all
these components that you mentioned, right?
Do I have to, like, bring my own here, or is it something where, like, I just sign up on the cloud version of Dremio and, like, Dremio can take care of all
the different components that I need to build my lake house.
It can go both ways.
So basically like if you don't already have a data lake house, you could just
open up a Dremio account, connect wherever your data is currently.
So again, if you have a Postgres database, MySQL database on your
transactional side that has all your data and you want to start moving it
over to a data lake house, you can just connect them and just start
moving the data incrementally.
You won't even think about it, and you won't realize it's already being stored in Iceberg tables, being stored in Parquet files.
And if you're using the Dremio Arctic catalog, it's kind of got some really nice built-in functionality.
So all those pieces are going to be there without you having to really think about the configuration of any of this or the deployment of any of it.
But if you already have stuff in a way, like if you have Parquet tables that are not Iceberg tables and you want to use them, you can use them.
If you have a Delta Lake table that you want to scan, you can do that. Like basically, Dremio allows you to keep the choices you've already made, but will make very sensible, easy-to-use choices for you if you're building with Dremio from the get-go.
So it just depends on where you're coming from,
but always tries to meet you where you are.
That makes sense.
And from your experience,
from what you have seen out there as part of Dremio,
what are the architectures
that people have most commonly implemented for a data lake or a lakehouse?
And I mean, like, okay, I don't talk that much about like companies that, you know,
they might have started a year ago, like a data lake initiative, because, okay,
people need to understand that we might invent new worlds for things, but like
things exist for quite some time.
Like pretty much since the Hadoop, since like Hadoop came out there, like Hadoop is like a data lake at the end.
Like it is like a file system where you go and like, you can store all your files
there, then you can use MapReduce to go and like create the data you want.
Yeah.
It's like super primitive.
It's not, doesn't have like the stuff that we have today, but there are companies who started from back then and they are still like evolving their infrastructure.
Right.
So what are like the, let's say the paradigms that you have seen out there that like, they are like common.
Hard to say, because the problem is, up until recently, there wasn't a standard. Over the last several years, you have seen some standards rise up.
A lot of stuff we're just talking about,
like Parquet and whatnot.
But before then, there really wasn't that much
of a standard way.
Maybe Hive was a pretty ubiquitous standard.
So that's probably one of the few things
I do see consistently.
But really, when I take a look at many different
customer stories or potential customer stories,
they vary quite widely.
And I think that's why this space is so interesting right now, because right now you are starting to see sort of a movement towards more standardization and more of, you know, what those patterns are going to be. But, you know, you see everything from people who are literally treating a database as a data lake, or, you know, moving all their data into a data warehouse, or doing some weird hybrids between, you know, cloud and Hadoop as far as file storage, for different use cases or different departments. So I would have to say almost every customer story I've heard up to this point has been different than the last one. So it's hard to kind of say what's...
But I can't think of any particular...
Hive is, I would say, the one thing
I think you see over and over again.
Do you think there is something
that is very different
if you consider, let's say,
on-prem setups with cloud setups?
I would say the big difference nowadays is that if you're on the cloud, you're going to have a lot more of the newer tools available to you, with everyone sort of gearing towards cloud. That's one of the nice things about Dremio: it is kind of a newer, more modern tool that still very much makes sure it can cater to and take care of people who are on-prem. So you have that benefit, and it's consistent as far as the experience, so at least from the end consumer, you're going to have that same experience whether you're cloud or on-prem. And that's sort of what it brings to the table. But I guess the big consideration is just, again, what tools are going to be available to you, and that's continuing to shrink on-prem while continuing to grow in the cloud.
Yeah, makes a lot of sense. And you mentioned at the beginning, when you were chatting with Eric about how the lakehouse became a thing, like by taking the data lake and the data warehouse and trying to create a hybrid there, right?
So what do you think is missing currently from the lakehouse to make it, let's say, realize the dream behind this hybrid?
I think the standardization of the catalog.
I mean, I think you're starting to see more of that. We're still a few years off before you really see what the industry standardizes on for its table format. But I think you're seeing certain movements over the last year, a lot of coalescing around certain formats. But the next
thing will be the catalog.
Because basically, every tool, the way it generally
interacts with your
data, regardless of what table format it is,
regardless of what file format it is,
it's through the catalog. So basically,
if you need different catalogs for every tool,
you're still kind of running into interop issues.
And this is where Project Nessie,
I think is really going to be important, because it offers a catalog that's built to be a catalog in the modern era. Like, that's its purpose, versus a lot of the things that we use for catalogs nowadays. With Iceberg, you have a choice between using a database as a catalog,
you can use Hive as your catalog,
you can use Glue as your catalog,
but none of these tools were really built to be sort of
like that kind of catalog in the same way
Project Nessie is built. It gives you these extra features that let you do a lot of new operations and also be able to
control governance across tools.
That'll be also part of it,
being able to set rules on different branches
and whatnot. So that way, hey,
if I connect to that same table from Dremio and Trino,
I'm going to get the same access rules.
And that's going to be sort of really key
because that gives really one place where people can control access
to their data across all their tools.
So that's what's nice about the Dremio Arctic service: it's going to make it easier to adopt Project Nessie. And most tools can already connect to it, and that's expanding. So once you start seeing people sort of standardize on a catalog, then it makes it easier for tools to really just focus on supporting the table format and supporting the file format, because they're not supporting 50 different catalogs. Again, the more variety, the harder it is to kind of give full support to anything. So as we standardize on each of those levels, that's when you're going to really see the data lakehouse continue to reach its next and next levels.
It's already at a pretty insane level
of what you can do now.
When you think about just where we were
a few years ago and what you can do now
with this technology, it's amazing.
But when you think a few years from now,
when basically more people are using
the same catalog, more people are using
the same table format, more people are using
the same file format, the level of support
that can be provided by all tools to that
is going to be kind of amazing
because then, again,
you'll have that promise
of openness
where I can switch
between tools I want
and there's no vendor lock-in.
But, like,
so to me, like,
that's sort of, like,
that next step.
And, like, Dremio Arctic
is going to really help
provide that step
to give you that sort of
open catalog
that lets you use
whatever tool you want
and have access
to the data you want
and control how your data is accessed from one place.
Okay, and this is Project Nessie
that you mentioned?
Yeah, so Project Nessie is the open source project.
Dremio Arctic is sort of like the Nessie as a service
product from Dremio.
But it's not just Nessie as a service.
It provides you the catalog, but also provides you
a really nice UI. It's going to provide you
automated optimization features, so that way you can just optimize your table as you'd like. There's other features that are coming down the road, but at the core, you're getting this catalog, and you can connect to that catalog again using Presto, Flink, Spark, Dremio Sonar; pretty sure there's a pull request on Trino to have that as well. So you'll be able to use all your major data lakehouse tools with it. And then
that'll just continue to grow from there. But again, the benefit is, I keep mentioning the Git-like semantics, but the real use cases there are threefold. Isolation: so for example, if I'm doing ETL work, you know, I might want to do some auditing first, so I can ETL that data into a branch and not merge it until I've done my verification and validation. Multi-table transactions: let's say I want to update three tables that get joined regularly.
Instead of updating them
one at a time
and running the risk
of having sort of broken joins,
I can update them
all on a branch
then merge them
when I'm done.
Or, you know,
if I want to like
basically create a branch
that isolates a data
at a point in time
for like an ML model
so that we can continue
to test against
very consistent data,
it makes all of these much more possible, much easier.
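As a sketch of the ETL-audit case, these are the kinds of statements the Nessie Spark SQL extensions expose, again using the session from the earlier sketch; branch, catalog, and table names are placeholders, and exact syntax can vary by Nessie version.

```python
# Sketch of the ETL-audit use case with Nessie's Git-like semantics. Branch,
# catalog, and table names are illustrative placeholders.

# Work on an isolated branch so consumers on main never see half-loaded data.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_audit IN nessie FROM main")
spark.sql("USE REFERENCE etl_audit IN nessie")

# Load or update one or more tables on the branch (multi-table change).
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")
spark.sql("INSERT INTO nessie.sales.customers SELECT * FROM staging_customers")

# Run validation queries here; only merge once the audit passes, so the joined
# tables change together on main.
spark.sql("MERGE BRANCH etl_audit INTO main IN nessie")
```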
That's super cool.
All right.
And what, like, you mentioned that the catalog is like what is missing right now.
And Project Nessie is trying, like, to fill this gap, but, like, how far away are we from filling this gap, right? And is it, like, a technological issue that makes it, let's say, slower as a
process or is it also a matter of like the current state of the industry and having like
all these like different stakeholders where each one is building their own catalog?
And of course they want to promote their own catalog.
Like I think Databricks, it's pretty recent that they introduced their own,
which is closed source also, like it's not even like possible, like to consume
it outside like Databricks itself.
Right.
So what's your take on that?
I mean, that's inevitable.
I mean, that's one nice thing about, again, Apache Iceberg: it does support multiple catalogs. So, I mean, Snowflake just recently added Iceberg support and created their own catalog, and now they have a pull request to kind of add support for it to Iceberg out of the box. And that's just going to happen; you're going to have people who keep trying to create new catalogs. And that's one of the nice things about Apache Iceberg: they have this new thing called the Iceberg REST catalog, which basically creates a standard API. So basically, if anyone wants to build a catalog, you can just follow this REST API open spec, and then Iceberg would automatically work with that catalog. Theoretically, if everyone followed that spec, then it wouldn't matter; you wouldn't even have to standardize on the catalog, and you'd still be able to use it everywhere.
So you have technologies like that.
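For instance, pointing Spark at any catalog that implements the Iceberg REST spec is mostly a matter of configuration; here is a hedged sketch, with the URI and warehouse as placeholders.

```python
# Sketch: because the Iceberg REST catalog spec is an open HTTP API, an engine
# only needs a generic REST client to talk to any catalog that implements it.
# The URI and warehouse below are placeholders.
from pyspark.sql import SparkSession

spark_rest = (
    SparkSession.builder.appName("rest-catalog-demo")
    .config("spark.sql.catalog.anycat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.anycat.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config("spark.sql.catalog.anycat.uri", "https://catalog.example.com/api/catalog")
    .config("spark.sql.catalog.anycat.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

spark_rest.sql("SHOW TABLES IN anycat.sales").show()
```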
So I do think right now, again, the first thing you're going to see is the standardization of the table format
because that's going to determine which catalog people will choose from.
And then once you start seeing much more standardization
on the table format, then you'll see that battle
for which catalog to use for that table.
I do think this year is going to be an interesting year,
mainly just because
there's a lot of interesting things
that will be coming
down the pipeline this year
regarding catalogs
on different levels.
That's as much as I can say, but the bottom line is,
I do think that
the catalog conversation
will be a big conversation this year.
All right.
Super interesting.
Okay.
One last question from me
because I want to give like some time to Eric
also to follow up with any additional questions that he has.
So last question is about developer advocacy, right?
And I'd love to hear from you,
like what it means to be a developer advocate
for something that it's, okay, it's technical, but it's also, let's say, very, there are many moving parts.
It's like when we're talking about the data, like we spend like all this time talking about table formats, file formats, catalogs, query engines, materialization. Like it's so many different things and you have like so many different technologies that
you need to, like, orchestrate all together, right?
Which is very different compared to being like, okay, I'm advocating for something like,
I don't know, a JavaScript library, right?
For the front end, which I don't say that it's not complicated, but it's much more like
the scope of the technology itself.
It's much more narrow compared to something like a data lakehouse.
So what does it mean?
Like, what's unique about what you're doing and the value that advocacy brings to the industry?
Got it.
Okay.
So first we'll just start off with like developer advocacy as a thing.
It's been really interesting. Like, you know, when I first discovered that this role existed,
I realized this role is like tailor-made for me because there's certain skill sets you need,
like basically the idea, I mean, at the end of the day, like the hope with a developer advocate
is that you're sort of like the cross between basically, you know, like if you took a PM and
someone from the marketing team and like mushed them together, that's ideally what you want. Someone who can
basically understand the product enough to be able to
communicate its value with conviction
and authority, but someone who can
also understand
the marketing
and basically the idea
that, hey, you want people to make a
choice and think about that.
But to be a good developer advocate, you need to be good at both of those things. You need, one, technical knowledge: you need to, you know, know the space, know the technology, and know technology in general. But then you also have to be, two, a good communicator, which is why I think, you know, having a history in educating really was helpful. And you also have to have, three, conviction in ideas: you can't advocate for something you don't believe in, so you've got to believe in whatever you're the developer advocate for. So I was excited to be at Dremio, because it's such an exciting product at a very exciting time. I think the most exciting part is just the state of the industry; this is such a moment of flux between so many different competing technologies, and it makes it that much more interesting and that much more exciting to be on the front lines of that.
But bottom line is,
and also to be a content creator,
because, I mean, you know,
to get that word out there,
to be in front of people
requires you to go speak at meetups,
requires you to go do podcasting,
requires you to go make videos,
find any clever way
to kind of get in front of people
to speak at that more technical level, and also creating example code or useful tools. It goes beyond just saying, okay, hey, this is what we do and this is why you should use it. It's really being able to empathize with people: you hear people's experiences and their stories and you get it, because you understand them on the technical level, but you also understand the pain on a different level. And that's one thing I noticed: I can imagine it must be a difficult position to hire for, because usually, you know, you can find people who are good communicators, and then you can find people who have really technical knowledge, but finding both of them can sometimes be really tricky. So, you know, that's another reason why I'm very grateful that I've had such a weird backstory that took me through so many different experiences, and why I just love doing what I do, because it really is a position that's tailored to the life story that I've had.
Yeah. Well, I think it speaks a lot to you, because finding joy in understanding the deep technical
stuff and in the process of
trying to condense that down. And I think even throughout the show, it's been wonderful to hear
you use examples. You know, you say, admittedly, this is oversimplified, but I like to think of it
as XYZ. It's very clear that you have, you know, sort of a deep love of both the technology, but
also, you know, the way to communicate that best. One question I'm interested in,
especially relative to your excitement
around this technology
that we've talked a little bit about before
on the show when the subject of Data Lakehouse has come up,
is when you think sort of wide market adoption will happen.
And to put a little bit more detail on that question,
you know, there are certain characteristics
that make the data lake house make a lot of sense
at a really large scale, say, you know,
sort of enterprise scale, right?
So a couple examples you get, you know,
moving from on-prem, making the on-prem experience better, you know. And I certainly foresee some companies, even at a small, early stage, adopting a lakehouse architecture from the outset, right? Just so that they can essentially have a glide path towards scale that doesn't require any retooling. Now, that's not to say there isn't still a huge market for, you know, just adopting a warehouse or querying your Postgres directly or whatever. But I'm interested to know: what are you seeing out there, from the Dremio perspective, about companies adopting this way earlier than maybe 10 years ago, companies trying to move towards a lakehouse architecture because of the enterprise-specific issues?
Got it. Yeah, no, actually, this is something I did
because I do a podcast called Data Nation.
And I actually did an episode specifically on this,
where I think people are saying
that companies should adopt data lakehouse earlier.
Because really, usually the things that would impede you
is just that like the cost
of having a lot of this big data infrastructure earlier on
was just really expensive and complicated.
But especially with something like Dremio Cloud, it's easy and cheap. Like, literally, signing up for Dremio Cloud is just signing up.
And, you know, you use it when you need to
and you don't use it if you don't.
You have your account.
So if you're a small company and you're thinking,
hey, you know, wait, you know,
I'm going to get to a point where I might want
to start hiring a couple of data analysts,
you know, and maybe right now you have everything
saved in maybe spreadsheets
or you might have everything in a Postgres table.
You can still connect them.
Hire your data analysts.
Have them start working directly from there.
But then as you scale, as you were saying, your workflows aren't going to have to change when you get to that point where you're scaling because people are already using the tool that you're going to be using.
And then you just shift sort of how you store your data, the way your data is managed on the backend, but your consumers never notice the
difference as you grow. Yeah. Super interesting. All right. Well, we are at the buzzer. Alex,
this has been an absolutely fascinating conversation. I've learned a ton and we're
really thankful that you gave us some time to join us on the show. Thank you for having me.
And then I just recommend everyone out there to go follow me on Twitter at amdatalakehouse. You can also add me on LinkedIn.
Check out my podcast, Data Nation, and also Dremio. We're starting a new weekly webcast
called Gnarly Data Waves, where I'll be hosting, and we're going to have a lot of interesting
people come talk. So come check us out. Awesome. Thank you so much. Kostas, one thing that
struck me was the emphasis on openness, which I guess makes sense
for a tool like Dremio, you know, where they need to enable multiple technologies. But a lot of times
you'll hear technology companies be a lot more opinionated, you know, like, we are doubling down
on this file format because of these really strong convictions.
And it was just really interesting to hear Alex say, you know, it probably works best with Parquet, but you should try to query a bunch of other stuff with it, and it'll work.
It may not be the most ideal experience, but I appreciated that openness, right?
And it seems like that's sort of a core value of the platform, at least as we heard from Alex.
And so I thought that was really neat.
And honestly, I think it's probably pretty wise of them,
even though they're, you know,
obviously I think a lot of their customers
are well-served by the Parquet format.
But the fact that they seem to be building
towards openness, I think is probably pretty wise
for them as a company as well.
Yeah, a hundred percent.
I mean, I don't think that you can be in, let's say, the space of the lakehouse
or the data lake without being open.
I think that's like the whole point.
That's how like a data lake started as a concept, like compared to a data warehouse
where you have like the opposite, like you have like an architecture that is like closed,
you have like a central authority that like optimizes like every decision and have like
total control over that.
And okay, the data lake is the opposite of that.
It's like, okay, here are like all the tools, figure out how to put them
together and optimize them for your like use case, right?
So obviously there are like pros and cons there.
Yeah.
I have to say though that openness is a little, I think like easier in this
industry, primarily because the things that you have to support are not that many.
That's a great point.
Right.
Like, okay, if you compare the number of front-end frameworks that we have compared to how many
file formats we have for, like, storing our data, you cannot compare them.
Right.
And there is a reason behind that. It's because it's a different type of problem and it has like a more limited,
let's say, probable set of solutions.
So that's something that's easier also to achieve and maintain.
Yeah.
But this doesn't mean that it's not hard, right?
Being open is one thing; productizing it is another thing.
So, yeah, it's very interesting.
I really want to keep in mind what Alex said about the catalogs and the importance of cataloging: that this year is going to be an important year and we'll hear a lot about that.
And yeah, like hopefully to have him again, like in a couple of months and see like how things are progressing and not just for Dremio,
but for the whole industry in general.
We will have him back on.
Thank you again for joining the Data Stack Show.
Subscribe if you haven't, tell a friend, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.