The Data Stack Show - 70: The Difference Between Data Lakes and Data Warehouses with Vinoth Chandar of Apache Hudi
Episode Date: January 12, 2022
Highlights from this week's conversation include:
Vinoth's career background (3:19)
Building a data lake at Uber (6:52)
Defining what a data lake is (14:01)
How data warehouses differ from data lakes (22:46)
When you should utilize an open source solution in your data stack (37:36)
Evolving from a data warehouse to a data lake (45:09)
Early wins Hudi earned inside of Uber (52:30)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Kostas, we love talking about data,
and you were telling me the other day that you are actually looking for someone
to work with you on your team,
and part of the job is talking about data all day.
So tell me about this job.
Yeah, we are looking for someone to work with me
as part of the developer experience team at Rudderstack,
and more specifically
in the dev rel role. So we are looking for a person who is interested in anything around data
and talking about data and building relationships with other people who care about and love working with data. So that's it. I think it's an amazing opportunity
and we'd love to hear from anyone
who might be interested.
It doesn't really matter if someone has previous experience in DevRel. It's more about genuinely loving and wanting to work with data and the communities around data.
Very cool.
And, man, maybe I'm going to apply for this job myself. But if I did want to apply for this job, where should I go?
You can email me at kostas@datastackshow.com.
That's one way.
The other way is to just visit rudderstack.com/careers.
And anyone can have a look at all the open positions that we have, and they will also find the developer experience related positions there.
Very cool.
Well, hopefully we get one of our amazing listeners to join you on the team.
Yeah.
Yeah.
Hopefully.
Yeah.
Welcome back to the Data Stack Show. I'll probably steal the show with my question, Kostas, but we've talked with some people who have worked on technology inside of large companies like Netflix that was later open sourced and sort of made available generally. But we haven't, at least to my knowledge, talked with someone who was there at the beginning and really started it from the very beginning. And I just want to get
that story. Like what were the
challenges that they were facing at Uber? Where did the idea come from? And then like,
how did it actually come to life inside the company? So I think that sort of origin story
is going to be really cool to hear about Hudi. Yeah, absolutely. And I think it's probably the first open source project that's actually an Apache project that we have represented here by a guest.
I think so.
So that's going to be interesting because, okay,
this is also important.
It's one thing to have like a project,
to open source something on GitHub.
It's another thing to have something that's governed
by the Apache Foundation.
So especially from the governance side,
like it's a very different situation.
So I think that's going to be very interesting
to chat with him.
And okay, Vinoth is a person who has been in the right place at the right time on many occasions when something interesting around data was created. He was at LinkedIn, Uber, and Confluent later on.
Yeah.
And I think he's one of the best people out there to talk about what a data lake is, because that's what Hudi is. And it's going to be very interesting to see how he started playing with the idea of building something like a data lake inside Uber, why it got open sourced, and why data lakes are now so important and so hyped. So I'm very excited. I think we're going to have a very interesting conversation with him. All right, well, let's dig in. Yeah, let's do it.
Vinoth, welcome to the show. We're so excited to chat with you.
Yeah, great to be here.
Thanks for having me.
Well, in the world of data, I think it's safe to say that your resume is probably one of
the most impressive that I've ever seen.
So do you want to just give us a quick background on your career path and what led you to what
you're doing today with Hudi?
Yeah. And first of all, I don't know if I am deserving of all those kind words, but I tend to think of myself more as a one-trick pony who has kept doing databases for over a decade, because that's the only thing I know. My first job out of college from UT Austin was at Oracle, working on the Oracle server: data replication, what passed for stream processing back in the day, CQL streams, Oracle GoldenGate, data integration, CDC. That's where I started. Then I moved on to LinkedIn,
where I led the Voldemort key value store. I think most people have forgotten that project by now,
but it was like LinkedIn's Cassandra.
It was actually a pretty popular project.
And we scaled it through all the hyper growth stages
for LinkedIn from like tens of millions of users
to hundreds of millions of users.
That's what lasted us through that.
Then I moved on to Uber, where I was the third engineer on the data team. We had a Vertica cluster and some Python scripts. This is Uber at 100-plus, 200 engineers, back in 2014.
So I spent almost five years there working on data.
Since I joined early, I had what I was looking for when I left LinkedIn: a blank sheet of paper on which I could actually work hard, try to build something new, make mistakes, and learn. That's the journey I wanted for myself.
So Uber gave me that, and we ended up creating Hudi there. It's great to actually see how the space has evolved over the last four years.
I also did a lot of other infrastructure things at Uber. Uber was one of the first companies to adopt HTTP/3, if you will, as it was getting standardized. I still don't know whether it's fully standardized. So we ran QUIC and replaced TCP with UDP-based connections. I like to dabble with a lot of infrastructure stuff, and I used to work with the database teams at Uber. Then I left Uber and went to Confluent, where I met some of my old colleagues from LinkedIn, and worked on yet another database, KSQL, and parts of Kafka storage, Connect, and so on.
So I've generally been around stream processing, databases, data pipelines, this kind of space for a while.
And now I have some time to actually dedicate full time to Hudi. Hudi was something where I kept growing the community in open source, in the ASF, for almost four years now, and I finally have some time to dedicate to it. I'm enjoying that this year.
Very cool.
First of all, I don't know if, after hearing that, my conclusion would be one-trick pony.
But okay, I have so many questions.
But one thing I'm really excited about is we've talked a lot on the show
about trickle down of technology from organizations like Uber, Netflix, etc.
that are sort of solving problems at a scale that just haven't been solved before.
And some really cool technology emerges from that.
But we haven't been able to talk with someone who is part of the sort of development of
that from the beginning.
So we'd just love to hear the story of Hudi.
What problem were you trying to solve?
And I think, in the age that we live in, it's sometimes hard to think back to 2014 and what the infrastructure was like back then. But we'd love to know the tools that you were working with, the problems you were having, and then just tell us about the birth of Hudi inside of Uber.
Yeah, I think it's a good, fascinating story actually. So 2014, as you can imagine, Uber was hiring a lot.
We're growing a lot.
It's like launching new cities every week, if not every day.
And so we were really in that phase.
And if you look at what we had, we had a typical on-prem data warehouse.
And while Vertica is a really great MPP query engine, we couldn't really fit all of our data volumes into it.
If you look at all the IoT data or the sensor data
or like any large volume event stream data
or any of these things,
they don't fit inside that.
So we built out a Hadoop data lake, like most people did. I came from LinkedIn before that, so I knew the runbook of what to do: you do Kafka, you do event streams, you do CDC, change capture, and get a lake up and running.
So that was like familiar territory.
That was familiar. There were certain things I wanted to fix over what we did at LinkedIn. We wanted to ensure all data is columnar and never have a mix of things like JSON. So we essentially forced the entire company to schematize the data and built a lot of tooling around it end to end. You would get a PagerDuty alert if data was broken, all of that. So we did a lot of things to actually ensure the lake could be operationalized.
And within a year, we had a lake where data was flowing in, where we could do Presto for interactive querying, some Spark ETLs, and Hive, which was still the mainstay for ETL at this point, because Spark was at 1.3, just coming up, right?
So the main problem we hit was, as you can imagine,
Uber is a pretty real-time business, right?
So what we weren't able to do was
we had our trip stores upstream,
a lot of different databases.
We wanted to take the transactional data,
which is kind of changing, and reflect that onto the lake.
With something like Vertica, you already have transactions, updates, merges; it can do these things. It's got indexes. And while the data lake could scale horizontally on compute and storage, it cannot do these things.
So that led to the creation of Hudi, where we said, hey, look, we are between a rock
and a hard place.
We can't fit all this data there, but we don't have these functionalities here.
So we chose to basically bring these database-y functionalities or data warehouse-y transactional
functionalities to the lake.
And that's kind of how Hudi was really born. The key differentiator, I would say, from some of the other projects in the space is that right away we had to support three engines, the three I mentioned: Presto, Spark, and Hive, and it had to work out of the box, right? And the other thing is, like at every company, we had our raw data and then we built ETLs on the lake after that. So it's not sufficient that we just replicate the trip data very quickly by handling updates and everything; we also had to build the downstream tables quickly. So we essentially borrowed a lot from stream processing, having worked on stream processing systems before. We built CDC capabilities, or incremental streams, into Hudi, even in the very first version.
The effect was that a downstream table on the lake is up to date with the upstream data store within every few minutes, and then you can consume incrementally from that lake and build more tables downstream.
So we kind of moved
all of the core data flows at
Uber into this model. And that gave us 10x, or in some cases, like for our G1 tables, even 100x improvements over the way that we used to process data before.
So fundamentally, Hudi was created around the concept of: okay, yes, we added transactions and updates to data lakes, but the bigger picture was that this enabled you to process everything incrementally as opposed to doing big batch processing.
That's kind of how Hudi was born.
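To make that concrete: a minimal PySpark sketch of the incremental pull he is describing, where a downstream job reads only the records that changed after a given commit instead of rescanning the whole table. The paths, table, and commit timestamp are placeholders; the option keys are Hudi's Spark datasource configs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read only the rows that changed after a given commit instant,
# rather than scanning the full table.
changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20220101000000")
    .load("s3://bucket/lake/trips"))

# Build a downstream table from just the changed rows, the way
# the derived tables described here were kept fresh within minutes.
changes.createOrReplaceTempView("trip_changes")
spark.sql("SELECT city, count(*) AS trips FROM trip_changes GROUP BY city").show()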
Wow.
I feel like we could talk for five hours because I have so many questions, but a quick question, less about the technology: what was the size of the team, and how long did it take you to go from the idea, or the definition of the problem, or maybe an early spec, to having an early version of Hudi in production?
Yeah.
Okay, it's kind of a funny thing, because I started writing a first kind of draft, at least for the writer transactional side, I think in my second month at Uber. But we didn't get to build it for a year, because we put the business first. We were just trying to run the operation; there were so many other things to build. But finally, we decided to fund the project with three people in, I think, late 2015. And then by mid-2016, or late Q3-ish, we had all of our core ingest tables at least running on the project.
And I think we were only able to do that because we used existing horizontally scalable storage and all the existing batch engines, right? So we didn't write a new server runtime or build a lot of things. We didn't try to build a Kudu, for example, which was something that we considered back then before building this.
And then I think we open sourced the project pretty early in 2017. So Hudi was the first sort of trailblazer for transactions on a data lake across multiple engines. And we mostly wanted to open source it because we weren't really sure if we were doing the right thing back then, so we wanted to get more feedback. I could tell you it was super visionary and all that, but we were like, okay, we're doing something a little bit awkward. At least it felt awkward to more of the people who grew up in the Hadoop space. To me, it felt very natural, because I was working on key value stores and databases and change capture before that. But there were a lot of bridges to cross before I think it became a mainstream thing.
Can you give us a definition of what a data lake is?
Wow, okay. So in my mind at least, and I'll start with that: most people, if you ask them, will say a data lake is files on S3 or GCS. That's the perception that people have. In reality, I think we built what I would call an honest data lake architecture at Uber, which is what it is.
So data lake is basically an architectural pattern for organizing data.
You can even build data lakes on RDBMS if you want.
The main idea is you replicate your operational schema raw and keep it simple there. So you do EL, and then you do ETL on top. But over the years it's been overloaded with a lot of different constructs, so it means Hadoop in some people's minds, it means S3 in some people's minds, or Parquet in some people's minds.
So the basic idea, I think, remains: you have this raw and derived data. And from the impact that we saw at Uber, what I can say is that embracing the architecture has a lot of key benefits. It completely decouples your backend teams, or data producers, from your data consumers, who have this raw data layer now, which they can use to actually figure out what the data problems are. Otherwise, a lot of people do transformations in flight, so you have to go back to the source system to redo stuff. We had a lot of basic issues around just how we did the data architecture. That's how I see a data lake: as an architectural pattern.
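As a toy sketch of that pattern, with illustrative paths and names: an EL step lands the operational data raw, and a separate ETL step derives a curated table from the raw layer rather than from the source system.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# EL: replicate operational data as-is into a raw zone.
raw = spark.read.json("s3://bucket/ingest/orders/")
raw.write.mode("append").parquet("s3://bucket/lake/raw/orders")

# ETL: build derived tables from the raw layer, so data producers and
# consumers stay decoupled and problems can be debugged from the raw data.
orders = spark.read.parquet("s3://bucket/lake/raw/orders")
daily = orders.groupBy("order_date").count()
daily.write.mode("overwrite").parquet("s3://bucket/lake/derived/orders_daily")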
Yeah.
So that's a very interesting definition, actually, and probably the most accurate I've heard so far, because, to be honest, personally I'm still a little bit confused. You see many different pieces of technology that fall under the umbrella of a data lake without it being very clear what their role in the data lake architecture is.
And obviously, marketing doesn't help with that stuff,
especially now that we have all these lake houses.
And we're trying to, let's say, take the data lake
and make it equivalent to a data warehouse and vice versa.
And that's my next question.
What's the difference between a data warehouse and a data lake in terms of architectural patterns, right?
Got it. So I'll now actually talk about how the system design of data lakes and data warehouses has typically been, which I think is what you're speaking to. If you take a minute and just go back to how we were doing Vertica or Teradata: you essentially bought some software and installed it on a bunch of servers, right? They had deep coupling between storage and compute, and it's a fully vertically optimized stack. So having closed file formats, one query engine, one SQL on top of that data on a fixed set of servers, they were able to really squeeze out performance per core. That is how your on-prem or traditional data warehouses have been built. And your data lakes, even from the Hadoop era, typically rely on HDFS or cloud or some horizontally scalable storage. You decouple the storage, and then, even if we go back to 2014, you could fit, like I mentioned, a Presto or a Spark or a Hive on top of the same data.
So the fundamental difference here is that the data and the compute are more loosely coupled in a data lake, and they're much, much more strongly coupled in a warehouse, in terms of across-the-stack optimizations and how it's built. The modern cloud warehouses have changed the game in that they've actually decoupled the storage and the compute, but the format and everything else is still vertical, right? There's one format for Snowflake or BigQuery, one SQL, and it just operates in a different way, which gives you a lot of scalability over traditional warehouses. So that's why you see a lot of people talking about,
okay, you don't need data lakes.
You just need like cloud warehouses, right?
So if all you're trying to do in life is just BI,
maybe they're right.
Cost aside, maybe they're right.
So while the cloud warehouses have leapfrogged on-prem warehouses and evolved, if you look at our data lakes in the cloud, they're very similar to how they were on-prem. So that's where we are today, and that's where the sort of lakehouse comes in. We did, you know, pioneer transactions on the lake and all that, but we didn't call it the lakehouse back then. Mostly because I still feel, even today, that the transactional capabilities that we have in Hudi or all these similar technologies are much slower compared to what you would find in a typical full-blown warehouse. So we were a little bit shy about those things, but I think many people weren't.
But that's kind of what we refer to as the lakehouse now, right? We brought some of this core data warehouse functionality back to the data lake model, fit it on top of an open format like Parquet or ORC, and then made it accessible to multiple engines. That's what a lakehouse is. And it gives you some of the capabilities of the cloud data warehouse, and "some" is the important part, while the lakehouse still retains its advantages over a warehouse.
For example, it's much, much more cost efficient. It's way cheaper. And eventually, if you think you're going to need machine learning or data science, it's a more forward-looking way to build, where you get your data first into some sort of lakehouse, you do your analytics and data science there, and then you can move a portion of your workload into a cloud warehouse, right? I feel like we will go back to that model in the next few years, because the cloud data warehousing architecture fundamentally doesn't really suit running large data processing on it. So at least a good segment, a chunk of the market, I think, will move towards this model.
Yeah, that's interesting.
I remember I was talking recently with some friends at a company where, traditionally, when it comes to data management and data processing, they had two parallel paths. They had a data warehouse that was used for data analytics and BI, and then they also had, let's say, a data lake, based on Spark on top of S3, that was used by the data science team. And what they want to do now is move into this, let's say, Delta Lake, the lakehouse architecture, so they can merge these two together and the two teams inside the company don't use two completely different stacks for their work. So it's very interesting to hear that from you, because it resonates a lot with what people are trying to do.
You talked about getting transactions and implementing transactions on top of the data lake. Why are transactions important? Why do we need them?
Yeah.
So let's look at it through the lens of a use case, right? GDPR. I look back at GDPR, and I can see that that was the one use case that kind of trickled Hudi down to everyone else. Because until then, if you look at the stuff that I talked about, sure, Uber had the business needs for a lot of faster data, and we did it a certain way, and anybody who does that will get the benefits, the efficiency gains that we got. But the business drivers simply weren't there before something like GDPR. So you needed to ingest data, and then you needed a team to go scrub the data and, say, delete people who left your service, right? So you now introduce two teams who want to operate on the same kind of dataset or table, and that forced updates, deletes, and transactions. That is pretty much what has made this an inevitable sort of transition: if you're doing a lake, you're probably going to want to move into one of these newer things now, right? So that's kind of the main thing, I would say.
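Concretely, the kind of GDPR scrub described here can be issued as a record-level, key-based delete against the same table the ingestion team writes to. A hedged PySpark sketch, with hypothetical table, paths, and field names; the option keys are Hudi's Spark datasource configs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keys of users who left the service and must be scrubbed.
to_delete = spark.read.parquet("s3://bucket/compliance/erasure_requests")

# Issue a transactional, key-based delete against the lake table.
(to_delete.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.operation", "delete")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .mode("append")
    .save("s3://bucket/lake/users"))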
Okay, that's very interesting. And you said at some point that, okay, we take something from the database world and implement it on top of the file system, which is the transactions. But again, the way that we implement them is not exactly what you see in a data warehouse, right? So what's the difference?
Yeah, so here I think there are
significant key differences. People tend to talk about Delta Lake or Hudi in the same kind of bucket, because we like to compare things; it's easier for us to compare and understand, right? But if you look at even the concurrency model, they're completely different in how they're designed. Data warehouses do multi-table transactions, for example, while here on the lake, on the lakehouse, we've been saying: yeah, we can probably add multi-table transactions. But warehouses can do more pessimistic locking. Since they have long-running servers, they can probably do a lot more unique constraint and foreign key validation. These kinds of things that you would expect in a full-blown database, they're able to do today.
So yeah.
And the other key difference with the current lakehouse architecture is that it's kind of serverless, right? It's like a serverless warehouse, if you will, that comes up part by part, as needed, on demand. The writer comes up, writes, and then goes away, and then a reader comes and goes away. So there are no long-running things that you can use to do coordination. That makes for some interesting challenges, right?
So if you take a look at Delta Lake, they pretty much do optimistic concurrency control, which basically means that if two writers don't contend, you're fine, but otherwise one of them fails. And if you look at the approach we take in Hudi, we try to serialize everything. We try to resolve conflicts by supporting log-structured, differential data structures. We try to take in the writes and then sort of do collision resolution later on, because at the end of the day, data lakes are about high-throughput writes. These transactions are, in database terms, very large transactions, so you cannot really afford to have one of them fail. Imagine a delete job that ran for eight hours and fails now; you've lost some eight hours of compute, and all of this is in the cloud.
I could see because we were focused a lot more
on streaming CDC data in
and like all of those incremental use cases.
If you look at Databricks and Delta Lake,
probably they have a lot of batch Spark workload
that they run.
So they probably don't get that much concurrency overlap.
So maybe OCC works well for them.
So just like with databases,
like how we have an Oracle, Postgres, or MySQL,
I think there's so much technical differences
with these projects that we will end up
with a bunch of these things, I feel, over time.
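As a toy illustration of the trade-off being described, here is optimistic concurrency control in miniature, in plain Python. This is illustrative pseudologic, not code from Delta Lake or Hudi.

class Table:
    def __init__(self):
        self.version = 0
        self.files = []

    def try_commit(self, base_version, new_files):
        # The commit succeeds only if no one else committed since we read;
        # otherwise the whole (possibly hours-long) job must be redone.
        if self.version != base_version:
            raise RuntimeError("Conflict: another writer committed first")
        self.files.extend(new_files)
        self.version += 1

table = Table()
base = table.version                       # both writers start from version 0
table.try_commit(base, ["a.parquet"])      # writer A commits and wins

try:
    table.try_commit(base, ["b.parquet"])  # writer B's commit now fails
except RuntimeError as e:
    print(e)  # this lost work is the cost a log-structured approach avoids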
Yeah, makes sense.
Makes sense.
That's my last question around transactions. Do you see transactions on the data lake getting closer and closer to what we have in databases? Or do you think that there is a limit out there that we cannot, or it doesn't make sense to, let's say, pass?
I think we can build the same thing. In Hudi, at least, we are actually experimenting with adding a metaserver. Essentially, if you look at the problem as a data plane and a metadata plane: the data warehouse has servers for both data and metadata, while the lake today has servers for neither, with the way things have evolved with Delta Lake or Iceberg, where you stick metadata into a file. That's not going to be performant compared to what, let's say, Snowflake does, which is to keep metadata in a horizontally scalable OLTP database like FoundationDB, for example. So we are trying to tinker with a model where we have servers for metadata, and we keep the data plane serverless, where Spark jobs should be able to access S3 raw and direct. That's one thing we feel will bring it a little bit closer. This, I feel,
is the gap in the lakehouse architecture today. But to the first aspect you mentioned: do we need to do that? That's the other part. Unless you're really running a lot of concurrent workloads today, there isn't a pressing need; the lakehouse vision is just starting up. But if you have to fulfill it, I would imagine that you need a full set of capabilities. People should be able to run workloads on a lakehouse which are highly concurrent and highly scalable, as they would on a warehouse. So I think there are technical gaps and a lot of things to be built in the next couple of years or more going forward there.
Super interesting. And outside of transactions, what else do you see as a component from a more traditional database system that is also required in a data lake or a lakehouse?
Yeah.
So I don't know if this fits into the lakehouse model, but at least for Hudi, we actually borrowed a lot from OLTP databases as well, like indexes, for example. We have an interesting problem for CDC, right? So, okay, you have an upstream Oracle or Cassandra, some OLTP database taking writes. If you have to replicate that downstream to a data lake table, then, I mean, why are the updates faster on the upstream OLTP table? Because it has indexes and whatnot to update records with, right? So if you have to keep up with an upstream OLTP table, your writes on the data lake table have to feel like you're writing to a kind of OLTP table. So we invested a lot there. This problem is similar to running a Flink job reading from a Kafka CDC stream and then updating a state store: essentially stream processing principles. So we borrowed a lot from stream processing and databases and brought that also to the data lake.
And that is, I think, at this point, a pretty unique thing that we've been able to achieve.
If you look at a lot of Hudi users, they are able to stream a lot of data raw into the lake very quickly, and that's all possible because of this.
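To ground that: a hedged PySpark sketch of the index-backed upsert path, merging a CDC batch into a lake table by record key. The table name, paths, and fields are hypothetical; the option keys are Hudi's Spark datasource configs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A batch of change records captured from the upstream OLTP store.
updates = spark.read.json("s3://bucket/cdc/trips/")

# Upsert by key: Hudi's index locates the existing records, so an update
# does not require rewriting or re-reading the whole table.
(updates.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    # When keys collide, the record with the latest event time wins.
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .mode("append")
    .save("s3://bucket/lake/trips"))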
But for the core warehousing problem, I think we already have columnar formats. We need to close the loop on transactions and get the usability there. That's something that we haven't talked about at all. We've talked a lot about technology, but if we talk about usability, how quickly can you build a lakehouse versus starting on a warehouse? Warehouses win all the time, right? So these kinds of things are more important for the lakehouse vision, I think. But we are trying to add more capabilities on the lake than even what a typical warehouse gives you today.
Yeah, that makes total sense. And what about the query layer?
Yeah, that's an interesting one. So I think today, the
lay of the land is: if you are on the lake, you pick a Presto or Trino equivalent for a lot of the interactive queries, and you write Spark or Flink or Hive ETLs. I know I'm broadly categorizing, but those are the major things that pop up, right?
And the key thing to understand here is that there are a lot of things that we don't typically even classify as query engines, like all the different NLP frameworks; some of them are not even distributed, right? But they still work on these open data formats. So there's a more fragmented sort of tool set around the ML, NLP, AI, deep learning space that is also only going to grow. So I don't see a future where there will be fewer query engines on the lake. There are going to be more and more query engines.
And I think the smarter strategy here would be to adopt a lake kind of strategy and build towards that: keep your data in an open format that you can buy support for from many people, and have it be more future-proofed. That's where I think this is inevitably going to lead organizations.
Yeah, I'll ask something from the completely opposite side of the stack, since we're talking about the query layer. And correct me if I'm wrong, but what I understand is that, in the end, your work as the creator of Hudi, for example, is to build, let's say, table formats on top of some file formats that we already have, and usually we are talking about Parquet and ORC here, right? Is this correct, the way that I'm understanding it?
Yeah, so this table format term again doesn't do justice to at least what Hudi has to offer, for example, right? You need a lot more than a table format. If you look at what a table format is, it's metadata about your file format, right? It's a means to an end. What I think we built in open source today is a lot of the services that also operate on the data. Because without them being open, an open format doesn't matter, right? If you don't own the services that operate on it, you basically have to buy from some vendor who will operate these services for you.
So this is the gap that I think something like Hudi fills here. Speaking for Hudi, we have compaction, clustering. We have the bottom half of a warehouse or a lakehouse or a database available to you, which you can now use to query multiple different file formats. And to your point, yes, it's mostly analytical storage, right? But if you look at Hudi, there are some use cases that come up where people really don't want a database, but they want a key-based lookup on top of S3 data. We support HFile as a base format, for example. HFile is the underlying file format for HBase. It's really optimized for range reads and batched gets. You can do batched point-key gets from HFile.
So there are, I think, going to be more and more use cases like this. I can totally imagine how this could be used for, let's say, hyperparameter tuning or something in a machine learning pipeline, right? So I think there's a lot more that we probably haven't built, and this space is still sort of nascent, in my opinion, for all the reasons that I've been citing. There's still a lot more work to do here.
Do you see any space left for innovation when it comes to the file formats themselves? Because, okay, we take Parquet or ORC for granted out there, but that's pretty much what everyone is using, right? Do you see anything changing there, or do we need something to change there?
Yeah, so that's the thing, right?
I've been in open source for 12 years, and my own pet gripe is that oftentimes in open source, what wins is whatever is most popular, right? It becomes a popularity contest in some sense. Whereas on a managed service, things get swapped out underneath you when something new comes along. So, for a change to happen at that file format layer: I am pretty sure that a new, better file format can be written. Google has Capacitor, the file format underlying BigQuery, right? It is a successor to Dremel, which is what Parquet is based on. I mean, you can read a blog about it; they didn't open source the format this time. So there's already one out there.
So it's more that, now that we've done this, it's going to take a while for people to migrate. But I'm pretty sure, with new processors coming out all the time, and with things that aren't well documented around CPU efficiency and how you access Parquet, there's plenty of room for improvement, I think. The original Parquet was designed in an era of mostly on-prem HDFS, right? So you had to care a lot about storage space. But if you now don't care as much, would you do certain things differently?
I haven't put a lot of thought into it,
but I'm pretty sure there's something that is better
that can come out in the future. That's super interesting.
Cool. You mentioned open source. So let's spend
some time on that aspect of the data lake.
Because let's say we have three, as we said, major
technologies out there. All three of them have some open
source presence.
And I will start by asking you why data lakes are open source; we can see open source there. Whereas when we are talking about data warehouses, I think instinctively the first response would be that we don't have an open source data warehouse, right? Why is that?
I honestly feel this all started from the Hadoop data lake era, basically, where I think Cloudera, if my memory serves me right, boldly declared that everything open source is the way to go, and I think I agree. It's basically been a train from there, because Spark was open; the major tools that have succeeded have been open, right? And then I think we ended up with the lakes being open and the warehouses being more closed. I don't know why that is, though I do see that there are advantages in being closed and moving faster, and you can build more vertically optimized solutions. So historically, databases have been that way. If you even take RDBMSes: we wouldn't even talk about something like this for OLTP databases, for example, right? We wouldn't say, why don't we have a common table format, and let Spanner and YugabyteDB and CockroachDB all query that format, or something. So I don't find that very weird, and I won't be the person who would say, yeah, it should just be open, otherwise it's wrong. I don't think that's true.
I do think, to that point: what do databases add? They add a lot of runtime over that format, and at that point you're not dealing with the format, so it doesn't matter whether it's open or not, right? So what I really care about, again, going back, is whether these services are open. Can you cluster a Snowflake table outside of Snowflake? Maybe there is someone who can use AI and super-cluster your tables automagically, some genius who has this one clustering algorithm. Can you use it? You can't, right? So I think that is the main thing that I would say the lakes bring, and it's been that way. And on the flip side, I feel warehouses do have a better out-of-box experience; they're easy to get started with. They've made those things work, but at the cost of openness.
And on the lake side, I would say, people still have to build a lake, right? You can just use a warehouse; you either download one or you sign up for something and use it. But for a lake, you need to go hire a data engineer, hire some people, build a data team, and then they will build a data lake for you. So there are pros and cons to both approaches, I would say.
I think, I don't know which one's right.
Do you think this is going to change for data lakes? Do you see more effort put towards the user experience, let's say, of these technologies?
Yeah, I certainly think so. At least we are doing it, and we've been doing it; that's kind of how we even got started. If you go to Hudi, you will find a full-fledged streaming ingestion service, right? There is a single spark-submit command that you can run, and then the tables get clustered and cleaned and indexed, with Z-ordering or Hilbert curves, stuff that is locked away inside even Databricks or Snowflake, you can find in open source. And we try to give you a tool set where you can actually run it easily.
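For reference, the ingestion service he is referring to is Hudi's DeltaStreamer utility. A hedged sketch of what that single spark-submit invocation can look like; the bundle jar, source class, and paths here are illustrative, not a prescribed setup.

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --target-base-path s3://bucket/lake/trips \
  --target-table trips \
  --props kafka-source.properties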
But here is what I see. Even as we make it more and more usable, more and more consumable, there are still the operational aspects of it. I do see people in the community, really talented, driven engineers, data engineers, who come to the community. They're trying to pick up all these database concepts, trying to understand what data clustering is, why you do linear sorting. They're trying to understand all these fundamental database concepts, trying to become platform engineers, trying to run a thousand tables and manage that entire thing for their company, right? Many of them come out with flying colors; some of them don't. And in any case, it takes a year or more for people to get through that learning curve
and do this. So this is where I wonder whether there is a better model here, where companies should be able to get started as easily as: okay, don't worry about all of this, just get started with all of these lake technologies. Then maybe at some point you want to do it yourself, right? Then you should be able to fork off. What I'm suggesting is pretty much the reverse of what most open source go-to-market people tell you, which is: grow your community, keep the open piece bare minimum so people can use it, and then build the more advanced stuff on top. With Hudi, we try to make everything easy, but the problem is people still need to take it and run it, and it's a non-trivial thing to operate a database as a service, right? Having done that, Voldemort as a service at LinkedIn and KSQL on the cloud, I can vouch for that much. I can talk about it with some authenticity.
So we should make it easy for people to get started with the lakehouse, or more of a lake or whatever. And then at some point your business will grow to where it needs data science; the business will need ML, right? At that point, you can decide: okay, am I going to be able to hire better engineers than that vendor? Then you shouldn't be bottlenecked on the vendor. You want to move quickly. You should be able to branch out from open source and run your own thing, right? So that is, I think, the model that we should build.
And unfortunately, what happens in the data space today: you may remember the famous Parquet-ORC format wars of the Hadoop era, right? Where two companies were basically backing two formats that did the same thing. We're kind of doing the same thing to table formats, which defeats the whole point of the thing being open to begin with, right? Because most data lake companies are a query engine or a data science stack, and they're basically going and upselling users: hey, use this format, use that format, including Hudi, right?
But the real problem here is that users have to go and hire the engineers and do the ops, and the data engineers have to get every optimization right, for the organization to see the benefit, for someone signing the check high up to say: oh yeah, you are better than the warehouse, you are now future-proofed. So it's not about technology; I think we can fix all the technical gaps. The problem that I see is the managed aspect of it: there is no easy way to get started. If we don't fix it this way, I think the cloud warehouse will remain the entry point, and you build a lake when you're suffering from cost or openness, or when you want a data science team. That's how it will be if we don't fix it this way.
Quick question on that front. We certainly have listeners who are managing complex data lake infrastructure, but I'm thinking about our listeners who maybe started with a warehouse, and they know that the data lake is inevitable in some ways for their organization. But to your point, that can probably be a big step. What are the things that they need to be thinking about, or even planning for, you know, six months or a year away from the inevitability of needing a larger data lake infrastructure?
Are there decisions or architectures or sort of even ways they think about data now that
will help them make that path smoother, even though the tooling isn't quite
there to make it easy for them?
Yeah.
Yeah.
So the first thing I would say is: do more of the streaming, event-based, Kafka-hub kind of architecture, right? Because it really gives you the ability to get all your data streams into a single kind of firehose, and then you can tee that off to the warehouse or to the lake. You have that flexibility. I would say most people who are on this journey today are using an opaque kind of data integration pipe that moves the data for you, let's say Fivetran, for example.
But I'm just like the architecturally, you just know, by the way, the really great services, but I'm just like
the architecturally, you just don't see the tap into the data streams. It's, it's, so you,
you really have to capture. There's like a core data infrastructure pipe that those tools actually
need to feed into for you to actually feed it out into your arms. Yeah. Yeah. Switching my hat a
little bit. If you look at my, my, my, like my life at Confluent, like what we are to build was, okay, you do like the source connector, the sync connector kind of decouple. So you get the CDC logs from like an Oracle or any database, and then you can feed it to many other systems. So a lake and a warehouse. So make sure your data can flow into both and you have the optionality to pick which one you want to send where that's one the the
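To picture that hub pattern: a hedged PySpark structured streaming sketch that reads one CDC/event stream from Kafka once and tees it to two sinks. The broker, topic, and paths are placeholders, and a real warehouse leg would use that warehouse's own connector.

from pyspark.sql import SparkSession

# Assumes the spark-sql-kafka connector package is on the classpath.
spark = SparkSession.builder.getOrCreate()

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders.cdc")
    .load())

# Sink 1: land the raw stream on the lake.
(events.writeStream.format("parquet")
    .option("path", "s3://bucket/lake/raw/orders")
    .option("checkpointLocation", "s3://bucket/chk/orders-lake")
    .start())

# Sink 2: tee the same stream toward the warehouse; a staging path
# stands in for a proper warehouse connector here.
(events.writeStream.format("parquet")
    .option("path", "s3://bucket/staging/warehouse/orders")
    .option("checkpointLocation", "s3://bucket/chk/orders-wh")
    .start())

spark.streams.awaitAnyTermination()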
The other thing is: start with probably the raw data; move that to the lake. That's where you have most of the data volume, and since it's usually in an OLTP schema, not optimized for analytical queries, that's probably also where you're spending most of your cost on a warehouse, because those tables aren't really in the right schema. So those are really good candidates to start with. And then, in most scenarios, the derived tables you can keep there; they're more performance sensitive, so you can slowly migrate them over later, right?
And then, in the meantime, you should really push your cloud data warehouse provider for better external table support, because they have no incentive to do that unless you force them to.
Because, technically speaking, for organizations, what I can see is: okay, I'm using pipeline company X and data warehouse Y. And now, if you want to build a lake, and going back to our first question, you want a raw data layer and a derived data layer, you want to move the raw data out. If you do that, all the SQL still has to run, correct? For you to build the derived data. So that is where I think there is stickiness, and lock-in points for warehouses: unless that SQL can run in a reasonable amount of time on the lake, this project will fail, right?
So, for example, in Hudi we just added dbt support, so that you can get raw data tables in Hudi and then use dbt to transform them. We'll be working towards more parity, or more standardization; we are today as standard as Spark SQL is, right? So you can now use that, and use dbt to do the transformation even on the lake.
There should be a way for you to move the workloads to the lake seamlessly. Think about those abstractions, whether it's dbt or Airflow, and how compatible the SQL is. Think through all these things. And if your cloud warehouse provider provided better external table support, then you could keep those queries running even after you offload the raw data to the lake, and you could try a Presto or some other lake engine in the meantime, as you decide how things are going, right?
So it's not going to be an easy switch. This is going to take a year, or at least six months, for you to switch a reasonable amount of data. So planning ahead around all these touch points is what I would advise thinking through first.
Sure. I think that's really helpful because the question I asked was, what do you need to be thinking about if you're sort of going from a warehouse-based infrastructure and then adding the lake infrastructure?
And you would think that the answer is more around the lake,
but it's actually more around the orchestration
and the pipelines and giving yourself option value
as it relates to all the various components of the stack
that are going to arise
from moving towards the lake architecture.
Right. I've seen many companies, and I categorize them into two buckets. If you don't do this right, what happens is there is a lake, but no one's using it, and then over time the data quality slowly degrades and these projects start to fizzle out. The ones that succeed have top-down kind of energy to say: okay, we're going lake first and we're going to revamp the whole thing.
In lots of scenarios, for example, the lake comes in when data science comes in. When data science comes in, usually data scientists show up and say: hey, okay, fine, you want me to improve your app, but give me some events, tell me what's going on in there. Then Kafka comes in, and then you pump in a lot of events, right? And then that's when the data volume spikes, and that's when people go: oh yeah, wait, right. This is kind of how that cycle typically works. People who start like that have a lot more drive to get it done that way.
The missing puzzle piece there is moving data from the warehouse, and replicating the database or SaaS data that you may already have in the cloud warehouse. But those are people who are leaning more towards: I'm going to pay the double cost of warehouse and lake for some time, and then over time I'll figure out how to move things.
And I think this will be the most interesting thing to watch, because of where the performance of both things is right now: the lake is super optimized for running large-scale data science and machine learning workloads, and the warehouses are really optimized for running BI. So the BI workload stays there, and the data science workload stays here. As we build tech, maybe more BI moves over; that's what the rise of Starburst tells you, what the rise of Presto tells you, right? It's very interesting times, I think, to be building in data. It's going to be super fun.
I have one more question for you, and this kind of goes back to where we started with the origin of Hudi. I'm interested to know: you got it running in production with three engineers in a pretty short amount of time, for a new technology that's managing that kind of scale. Was there a feature or optimization in the business that sticks out in your mind as an early win? I'd just love for our listeners to hear: okay, so you developed this amazing technology, and I'm sure we have listeners who use Hudi or want to use it, but I'd just love to know what an early win inside of Uber was that came directly from the technology.
Yeah, there was a direct dollar value attached to that project at that point, a
dollar value that exceeds hundreds of millions of dollars, because we were able to run fraud checks a lot faster, which meant we reported to banks a lot faster. And you can imagine how complex these checks would be. They're very hard to write in a streaming sort of way and get right, but you can express them as a Hive query, basically, or something like that, right? We needed near-real-time data, not real-time real-time, but we needed to be running some checks every hour, for example, as opposed to every 12 hours. And you can imagine, at Uber scale, the amount of rides and everything; the banks typically give you more money back if you report sooner. Again, don't quote me on this, that's how it was there; I don't know how banking rules have changed. This is not financial advice.
Yeah, it's not financial advice.
And that was the main driver.
Then, of course, there was the intrinsic side. At Uber, if it starts raining, it affects our business, right? There's a huge concert, the traffic changes. So intrinsically, the business had real-time needs, and it's sometimes hard to put a dollar value around that, except that we can count the number of times people wished data was there sooner or a table was built faster. But the real tangible dollar value was that all the background things, rider safety, for example, all these background tasks and data processing that we do to make the Uber experience really better, could run faster and more incrementally, that sort of thing.
And I at least came in with that mindset, because at LinkedIn the main thing that we would try to incrementalize was People You May Know, for example. It's a very complex graph algorithm, but we spent a lot of time around: hey, if you and I connected just now, then it would be cool if I go to LinkedIn and get the recommendations right away. Probably they've made it work by now; I haven't kept track. But we were in that mindset: okay, let's make all the batch jobs incremental. There is no reason for them to be running full batch and eating up our entire clusters, right? So that's sort of how we went about it.
Amazing. Well, Vinoth, this has been
an amazing conversation. I know we could keep going, but we're at the buzzer. Thank you so much
for joining us. I learned an immense amount and I know our audience did too. So thank you for
sharing some of your time with us.
Yeah, yeah. Glad to be here. And these were really deep questions, so thank you. Thank you for these questions.
It also helps me think better.
All right.
Thanks.
Thanks, everyone.
I'm going to break the rules.
In this recap, I have two takeaways.
One, I love that he called himself a one-trick pony.
I think he was very authentic and humble, but that was just hilarious to me.
The other one, which we talked about right towards the end of the episode, sometimes
you think about the gains from sort of building your own infrastructure.
How do you calculate ROI on that?
Is it engineering time saved, et cetera? But he was talking about a financial impact to the tune of hundreds of millions of dollars, which is wild. Those sorts of stakes are really, really high, and so that was just amazing to me. I wasn't expecting that magnitude of ROI impact, but it's massive. So that's just, man, that's crazy.
Yeah, yeah, 100%.
I think it was super, super interesting conversation that we had.
I think that we managed to make much clearer what a data lake is and why it is important, and what the distinction is with a lakehouse.
Where things are going, where they are today.
And we had like a pretty technical conversation,
but without getting into like too much technical detail.
Yeah.
But it was very, I don't know.
I really enjoyed this conversation
and we definitely need to get him back.
I think we have much more to discuss.
We didn't have the time, for example,
to talk about open source,
open source project governance,
like what's his experience there,
why it is important.
Yeah, I'd love to hear more about running a project
like Hudi within the Apache Foundation.
I mean, that would be so interesting to hear about.
Yeah, 100%.
So yeah, hopefully we will manage that. I think he was the first guest who had such an immediate relationship with a data lake technology. There are more out there, so hopefully we will manage to get more of them on the show to discuss that, both lakehouses and data lakes and everything.
So yeah, I'm really looking forward to have him back on the show again.
We'll do it.
All right.
Well, thanks again for joining
the Data Stack Show.
A lot of great episodes coming up.
So make sure to subscribe
and we'll catch you on the next one.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes
every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.