The Data Stack Show - 99: State of the Data Lakehouse with Vinoth Chandar of Apache Hudi
Episode Date: August 10, 2022

Highlights from this week's conversation include: Vinoth's background and career journey (3:08); defining "data lakehouse" (5:10); Databricks versus lakehouses (13:37); the services a lakehouse needs (17:37); how to communicate technical details (26:55); Onehouse's product vision (31:41); lakehouse performance versus BigQuery solutions (36:44); how to deliver customer experience equally (40:17); how to start building a lakehouse (44:00); big tech's effect on smaller lakehouses (55:33); skipping the data warehouse (1:04:39).

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Kostas,
we always talk about getting guests back on the show. And we haven't actually done a great job
of that. But it's kind of hard with all the scheduling stuff. But we were able to do it.
Vinoth, who is one of the creators of Apache Hudi, is coming back on the show.
And I am really excited because last time we talked to him, his project was in stealth mode.
So I remember before the show, he said, we can't talk about, you know, what I'm working on.
But it is now public. It's called OneHouse.
And it's super interesting.
It's a data lake house built on Hudi, of course,
which isn't a huge surprise.
So I'm super excited to learn just more about OneHouse and the way they tackle the problem.
But one thing I want to do,
we got a really good explanation from Vinoth last time
about the difference between a data warehouse
and a data lake.
I mean, maybe one of the best explanations we've heard,
but OneHouse is squarely in the data lake house space.
And so I want to leverage his ability to articulate these sort of, you know, deep
technical concepts really well to ask about what the data lake house is and just get a
definition.
So that is what I'm going to do.
How about you?
Yeah, I don't know.
I'll have a hard time, to be honest.
Vinoth is one of those guys it's
always awesome to chat with
on a deeply technical level.
But I'm also very
interested to hear more about
the product they are building,
the business they are building,
and his whole
experience of going
from Apache open source projects
to trying to build a business on top of that.
And lake houses are also like a very interesting new, let's say, like product category out there.
And I'd love to hear more about that and how he sees the future.
So we'll see. I'm pretty sure we are going to have like a lot to chat about with him.
There's no question. All right, let's dive in.
Back to the show. This is your second time joining us on the Data Stack Show, and it's so good to have you back.
Yeah, it's fantastic to be back.
And, you know, I look forward to another.
Last time around, I think it was a very deep, interesting technical conversation.
So I look forward to another round of interesting conversations here.
Absolutely.
Well, for those of our listeners who missed the first episode, we have to ask you to do your intro again.
So can you just give your brief background?
And then I'd love for you to finish with what you're doing today at OneHouse, since last time you couldn't talk about it publicly.
Yeah, my name is Vinoth, and I've been working on open data infrastructure in this area, around databases and data lakes, for the last 10 years or so.
I started my career at Oracle, working on the Oracle database server and data replication, then built Voldemort, the key value store at LinkedIn, during the time when key value stores were the cool thing to build.
Then I moved on to Uber, which is where, you know, Apache Hudi happened.
We kind of, you know, brought transactions on top of Hadoop data lakes back in the day, what we called transactional data lakes.
It's a pretty nerdy engineering name for what is known as the lakehouse kind of architecture today.
I continue to grow the project in the ASF, the Apache Software Foundation; I still serve as the PMC chair for Apache Hudi.
And right after Uber, I actually had a good amount of time at Confluent as well.
I wasn't working on Hudi there; I was working on Kafka, on ksqlDB, if you've heard of that streaming database, and Kafka Connect, and a bunch of other things.
And most recently, I'm super excited to talk about OneHouse, which is where my current employment lies.
I'm the founder and CEO at OneHouse.
Our goal is to bring managed data lakes, or lakehouses, into existence.
We see a world where there are fully managed closed systems on one side and DIY open systems on the other.
And we're trying to build that kind of managed experience on top of open technologies like Apache Hudi.
Love it.
Okay, let's, I'd love to kind of set the stage
and focus on a term that you mentioned,
which is lake house.
And some of our listeners will be familiar with that.
Some of them will have seen it in some sort of marketing materials, I'm sure out there.
So I want to ask you for a definition of the data lakehouse. But before we go there,
could you remind us what the original use case for Hudi was, specifically for transactions on the data lake?
What were you facing in that role inside the company?
And why did you need transactions on the data lake?
Got it.
So yeah, so for this, I think we need to go back to actually 2015, 2016.
And Uber was growing very fast.
We were building out our data platform and all we had
was an on-prem data warehouse at that time. And while essentially we were hiring fast,
we were building a lot of new products, we were collecting high-scale data, right? So we couldn't
fit all this data into Uber's on-prem warehouse.
It's not built for that amount of storage.
A Hadoop cluster, an HDFS cluster, had, even before Uber,
at LinkedIn or Twitter or Facebook and many other places, been scaled to several hundreds of petabytes, at least.
So we built out our Hadoop cluster, our data lake. And here is where I think we had a very interesting problem
that, you know, remember like my previous stint
was at LinkedIn.
This was something that we didn't even face at LinkedIn,
which is Uber is a very real-time business.
So if it rains, the prices change, you know,
and then there is a huge operational aspect to the company.
There are 4,000 engineers and let's say 12,000 people who are operating cities.
And they all need access to fresh, near real-time data about what's going on out there.
So essentially, what we found was that while we could stand up a Hadoop cluster, dump a bunch of files onto it,
and, you know, bring Spark or something and write some queries,
we were not able to easily replicate some of our core data sets at Uber, like the trips,
transactions, and these core database tables, onto the data lake.
We would suffer multi-hour delays, eight-hour,
12-hour delays, in first ingesting the data and then writing ETLs on top of it. So it got to a pretty
serious level, where people actually figured out we couldn't run our fraud checks fast enough.
So we were actually losing money to fraud. It was a pretty serious business problem, actually. And we started to look at,
hey, how do we solve this?
And we essentially looked at what we had before that.
How were we solving this before the Hadoop cluster?
The on-prem warehouse that we had supported transactions and updates, and you could write merge-style ETLs on top of it, the kind people currently write using dbt on all of these warehouses, right?
So essentially we were like, that's pretty much it: we need to build that sort of functionality and bring it to the lake, but do it in a way that we retain the scalability, the cost efficiency, all the different advantages of the lake.
And that is how Hudi was born.
So we essentially called it a transactional data lake
because in our mind, what we were doing was
introducing basic transactions,
building some indexing schemes,
updates, deletes.
Your data lake is now mutable, which means it can absorb changes: you can get a change record from upstream and update the table instead of rewriting the whole thing, right?
And that's kind of how Hudi was born.
And it was pretty early; it came before most of the other contemporary technologies that you see out there.
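(For readers following along, here is a minimal PySpark sketch of the kind of upsert Hudi enables. The table path, record key, and field names are hypothetical; the write options shown are Hudi's standard datasource options, not anything specific to Uber's setup.)

```python
# Minimal sketch: upserting change records into a Hudi table with PySpark.
# Assumes a Spark build with the Hudi bundle on the classpath; all names are
# made up for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-upsert-sketch")
    # Hudi requires Spark's Kryo serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend these rows are change records arriving from an upstream database.
updates = spark.createDataFrame(
    [("trip-001", "completed", 1690000000), ("trip-002", "cancelled", 1690000050)],
    ["trip_id", "status", "ts"],
)

(
    updates.write.format("hudi")
    .option("hoodie.table.name", "trips")
    # Record key identifies each row; the precombine field picks the latest version.
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    # "upsert" updates matching records in place instead of rewriting the table.
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://example-bucket/lake/trips")
)
```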
Love it.
Such a great story.
I remember you talking about that in the previous episode
and it's just so wonderful to hear the Genesis story again.
So you've kind of already answered a lot of those questions, you know, from
a historical lens. But with that context, define the data lakehouse,
especially through the lens of how you view the world at OneHouse.
Yeah, that's a great question.
So actually, one of the key things, technology-wise, that a lakehouse adds to a data lake is, as I mentioned, transactions and updates, right?
It gives you mutability. So it gives you an impedance match with how you do things on the warehouse,
if you can put it that way, from a user standpoint.
From a user standpoint, there are two other important aspects, though. These are
mostly there to improve the baseline performance of the data compared to a warehouse.
One is metadata management.
Most warehouses, even cloud warehouses today, actually have pretty
good, fully managed metadata systems where, if you want to execute a query, statistics
for different files, columns, yada yada, all of these things are
well maintained and organized in a way that lets queries plan very quickly, right?
So that is another piece of technology that the lakehouse adds.
Because lakes were pretty much just files, and for the individual query engines,
the Hive metastore is basically what we had for metadata management, right? And the Hive metastore never tracked any file-level statistics or anything. So file-level granular
statistics and all of these things, that's one big area. The second, which is where
I think in Hudi we spent a lot of time,
and where we are much further advanced, is what we call table services. If you look at any
warehouse, take Snowflake or BigQuery, you'll find
all these different fully managed services that do useful things to your table. And they're all self-managing.
You don't write code for any of these things.
That's why I feel the term "table format"
doesn't do justice to what we need to build overall.
The table format alone is not enough.
You need a set of services
that rival warehouses, that can provide you
clustering, you know, data loading, ingestion, all these other things.
This is what we focused a lot on in Hudi.
And I would say all three of these put together,
the storage format, the table format itself, accepting updates, deletes,
and transactionality,
plus a well-optimized metadata layer,
plus these well-managed table services,
together give you,
if you imagine taking a warehouse and breaking it horizontally, the bottom half of a warehouse today.
And then you can fit a query engine like Spark or Trino or Presto or anything really on top, right?
So that, in my mind, is what a lakehouse should be.
And in that sense, connecting this to OneHouse, what we want to unlock is for people to be able to get this bottom half as a service, while they have the choice to pick any query engine they choose.
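(Continuing the sketch above: the Hudi table on cloud storage plays the role of the "bottom half," and any engine can sit on top. Spark SQL stands in here for the interchangeable query engine; Presto or Trino would reach the same table through their own Hudi connectors.)

```python
# Sketch, reusing the SparkSession from the upsert example above: the table on
# cloud storage is the "bottom half"; the engine querying it is swappable.
trips = spark.read.format("hudi").load("s3://example-bucket/lake/trips")
trips.createOrReplaceTempView("trips")
spark.sql("SELECT status, COUNT(*) AS n FROM trips GROUP BY status").show()
```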
Love that.
Okay, one more question from me, Kostas, just to help me and our listeners set the stage. So, you know, from a marketing standpoint,
Databricks has invested a lot in the lake house term, you know, which is maybe one of the ways
that a lot of our listeners, including me, are just, you know, are familiar with the term or
have become familiar with the term. How do you think about OneHouse in relation to,
you know, the Databricks flavor of lakehouse?
Are they similar in terms of like, I love the illustration of the bottom half of the
warehouse, but help us understand the differences and similarities.
Yeah.
Okay.
So that's a great question.
So I think Databricks' articulation of lakehouse is slightly different, right?
If you're going from the paper, even, it's essentially a Spark lakehouse, a Spark Databricks lakehouse, right?
And if you look at Delta Lake,
there is an open source version of Delta Lake,
and then there's a paid version of Delta Lake.
So they essentially have two flavors of the bottom layer,
if you will,
that I just mentioned, while they have a top layer, which is a super optimized Spark layer,
with Photon and all of the investments that they put into that. Honestly, those could apply
to other formats as well, right? At the end of the day, all these table formats
are creating Parquet files.
So sure, if you can optimize, I think it's a decoupled problem.
And the way they market it is as a full vertical stack against Snowflake, right?
That's kind of like, at least where I've seen most of their marketing energy being spent
so far.
And that's probably because Snowflake is one vertical stack.
Correct?
Yeah.
But if you look at the pieces overall, it's still kind of aligned.
The biggest thing, and we see this a lot: Hudi and Delta have been around for much longer, supporting mutable workloads and everything, right?
For three, four years now, and out in production. So we routinely run into this: people like Hudi for how rich a table services
ecosystem it has, how vibrant and grassroots its open source community is,
or for several technical differentiators like concurrency control or indexing and whatnot,
but they still want Databricks, Databricks Spark.
So I think as Hudi, we didn't have to care as much about that, but as OneHouse, we deeply
care about it, because somebody who wants to buy both OneHouse and Databricks should
be able to get a really good end-to-end experience. So even for us, some of the thinking is now very
customer-focused that way, I would say. So there is a slight difference: we don't believe
in one vertical stack. I think this can be accomplished by breaking out the bottom half
separately and then
fitting every query engine on top.
So let me just give you some data, right?
You take Ray, Flink, and then, you know, Dask, any of the
other upcoming query engines.
For what it's worth, between them they have some 50,000,
60,000 GitHub stars, right?
So multi-engine is a new thing; with projects like Bodo,
there's going to be new query engine innovation
that's going to happen. So I think
decoupling the data layer from the compute layer,
at the vendor level or even at the stack level,
is a good thing overall, we feel.
Yeah, super interesting. It's almost like bring your own interface to the bottom layer, or multiple interfaces, which is super interesting.
Okay, Kostas, I could keep going, but please, I'm actually more interested in
what you're going to ask than what I've already asked.
Yeah.
Oh, come on.
Like, that's not true.
I think you are like asking all the interesting questions.
I'm boring.
I'm just asking a little bit more technical stuff, that's all.
But yeah.
Okay, I have something that I really want to ask you, Vinoth,
because you mentioned something.
You said that there are a number of services that a lakehouse needs to
have, in order to rival warehouses.
Yeah.
And so I really like the word rival, first of all.
Yeah.
But can you tell us, I mean, you mentioned them, but let's enumerate these services again,
so our audience has a much clearer idea
of what we are talking about in terms of
technical services there.
Got it.
So let's start from the initial one, right?
You need a service that can, you know, ingest data,
first of all. And we built an ingestion system in Hudi three years ago.
This is similar to Auto Loader or kind of Snowpipe, you know,
I don't know exactly what each product is called.
So there's an ingestion system that can load data
onto cloud storage from different sources.
That's one.
And there are reasons for it to be aware of the sink: you can do checkpoint management and other things very, very easily if the system actually understands what it's writing to.
Number two, when you update data, what happens underneath is that you version files; you create garbage, right? That is, you're writing new versions of files, and somebody needs to clean up the old versions. This is what we call cleaning in Hudi, and what is called vacuuming, I think, in Delta. And you need a service that can do this on its own: you tell it, hey, I want to retain X versions or something, and then it can automatically do this for you.
Right.
That is one.
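(As a rough illustration, Hudi exposes this cleaning service as writer-side configuration. The retention value below is a made-up example; the option keys are Hudi's standard cleaner configs.)

```python
# Sketch: let Hudi's cleaner garbage-collect old file versions automatically.
cleaner_opts = {
    "hoodie.clean.automatic": "true",          # run cleaning with each write
    "hoodie.cleaner.commits.retained": "10",   # keep files needed by last 10 commits
}

(
    updates.write.format("hudi")               # `updates` from the earlier sketch
    .options(**cleaner_opts)
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("s3://example-bucket/lake/trips")
)
```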
The third thing: as you know, failures happen when you're writing to a table,
and you have some leftover files, uncommitted data lying around.
You need services that can clean that up
so that these dead files don't litter your tables and things like that.
Number four, this is slightly specific to Hudi,
but Hudi supports a merge-on-read storage type
where we can land data very quickly
in a row-based format, or flexibly in a column-based format,
and then later compact it, right?
And when we say compaction,
what we mean is what compaction means in databases
like, you know, Cassandra or HBase:
compacting delta files into a base file.
So you need a service that can do that.
And Hudi's compaction service can, for example,
keep compacting even while the writers are going. As you can
imagine, at Uber or TikTok, where there's a stream
of high-volume data coming in, it's impossible
to stop and do OCC, optimistic concurrency control, for this at all.
So you need a service like this. Again, I'm making the case that
these services need to be deeply aware of
each other. And that is how databases are written, right?
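(A hedged sketch of what that looks like as configuration: pick the merge-on-read table type and let compaction fold the fast row-based delta logs into columnar base files every few commits. The threshold is arbitrary; the async compaction described above, which runs alongside live writers, is configured separately.)

```python
# Sketch: merge-on-read lands writes quickly as row-based logs; compaction
# later merges them into columnar base files. Values are illustrative.
mor_opts = {
    # MERGE_ON_READ = fast row-based writes + periodic compaction, versus
    # COPY_ON_WRITE, which rewrites columnar files on every update.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                 # compact as part of writes
    "hoodie.compact.inline.max.delta.commits": "5",  # after every 5 delta commits
}

(
    updates.write.format("hudi")                     # `updates` from the earlier sketch
    .options(**mor_opts)
    .option("hoodie.table.name", "trips_mor")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("s3://example-bucket/lake/trips_mor")
)
```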
The other one is the clustering service.
We implemented record reordering with Hilbert curves,
and also just linear sort-order clustering.
Fundamentally, what a table format's metadata layer can do
is remove bottlenecks in planning, right?
It can store file-level statistics,
which are used to plan.
But at the end of the day,
if you look at most warehouses,
for high-performance-sensitive reports and such,
people actually tweak performance by clustering,
or in Vertica I think it's called projections;
there are different names for different things.
But you tweak the actual storage layout to squeeze out performance, right?
And then you need a service which can actually understand the write patterns happening on the table,
schedule these clustering operations, and execute them.
If they fail, they're retried, right?
So the bulk of the value that Hudi adds, we believe, is in this layer: you write to a Hudi table, and all of these services are scheduled and executed automatically.
They can fail.
They'll be retried.
Otherwise, if you take a very thin table format as the alternative, then you need to write all these jobs yourself. And what I've seen
from my LinkedIn days,
in the last 10 years,
living through the Hadoop era, the Cloudera and Hortonworks days,
all of these things:
everybody focuses on the format,
as if you solve the format
and then everything's fine.
But open alone doesn't cut it.
That is the painful lesson
that we should learn
from the rise of cloud warehouses.
What we should focus on is the standardized services, and they take years to get standardized and, you know, hardened at production scale like this.
I think this is the main thing: right now, even in the lakehouse marketing by any vendor, I don't see enough emphasis laid on some of this.
I've recently started noticing some vendors have had content on this; I think Starburst had some recently. But it's a very recent thing that has happened in the last few months, and this is what we've been at for the last three years.
Okay, so just to make sure that I also understood correctly, right?
We start with, like, our foundation being a data lake where we store
Parquet or ORC files.
Let's say Parquet, that's the standard.
And on top of that, we need a number of services.
I counted five.
I hope I didn't miss anything.
Yeah, but let's say at least the most fundamental ones, right?
So we have an ingestion process there:
we need some service that's going to prepare the data
and make it available.
We have vacuuming, or cleaning, taking care of all the versioned files and
all the stuff that happens at a low level to make sure we can support concurrency.
We have some kind of garbage collection, and let's say I'm using garbage collection as a broader
term. Then compaction. Compaction, from what I understood, is more of a specific case for Hudi, because you have both the columnar and the row-based representations,
so at some point you take these two and merge them into one or something.
Is this correct?
Yeah, it's correct.
I think it's slightly different, though.
Most of the other two projects were first written as file statistics tracking systems.
But compaction is not new at all to, let's say, RocksDB or LSM stores or anything in the database world.
And as you know, I come from that background.
So compaction is more about control: I want to write a smaller amount, queue up a lot of these updates, and merge them later, instead of merging them right away.
Okay.
I think that is the key technical rationale for compaction.
Okay.
That makes sense.
Is this similar to what happens when, for example, tombstones are used, and then you go and remove
the tombstone so you can actually delete or not?
Exactly.
If you read up on log-structured merge trees, LSM trees,
for example, they talk about a whole bunch of science around how to
balance write cost,
read cost, and merge cost.
And it's a very, very widely adopted database technique, right?
From Google's Bigtable to Cassandra to HBase to RocksDB to LevelDB,
that's what they all use.
Awesome.
And then the fifth one has to do with what you call clustering, which is more
about how you can optimize, at a lower level, how the data is stored,
so you can actually improve performance, right?
Is this correct?
Does this have to do with encoding? Give us a little
bit more information.
So I think clustering changes how you actually pack records into files. If you know something about the queries, let's say,
for example, you are a SaaS organization, you have thousands of customers, and you're collecting logs from them.
And you know that your query patterns mostly are:
you query for one customer at a time.
Then instead of spreading this data across all the Parquet files
in your table or a partition,
what you can do is cluster them
so that the records for one customer
are in the fewest number of files, which means when you query them, you read the smallest
amount of data, right? This can give you 10, 12x, you know, order-of-magnitude
query performance gains. Whereas, compared to that, let's say file listings: file listing is a real problem only for very large tables.
Right?
So related to all that, this fundamentally affects your compute dollars.
And it can dramatically reduce cost for your lake.
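(To make the customer example concrete, here is a hedged sketch using Hudi's inline clustering options. The column names, table, and trigger threshold are hypothetical; the option keys are Hudi's standard clustering configs.)

```python
# Sketch: cluster records so each customer's logs land in as few files as
# possible. Column names and thresholds are illustrative.
clustering_opts = {
    "hoodie.clustering.inline": "true",            # schedule/execute inline
    "hoodie.clustering.inline.max.commits": "4",   # every 4 commits (example)
    # Sort by customer so single-customer queries touch the fewest files.
    "hoodie.clustering.plan.strategy.sort.columns": "customer_id",
}

(
    logs_df.write.format("hudi")  # logs_df: hypothetical DataFrame of customer log rows
    .options(**clustering_opts)
    .option("hoodie.table.name", "customer_logs")
    .option("hoodie.datasource.write.recordkey.field", "log_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save("s3://example-bucket/lake/customer_logs")
)
```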
All right.
So that's amazing.
My question is, and going back to the initial question,
these are like, let's say, the minimum set of additional services
that the data lake needs in order to rival a data warehouse.
But there's a big difference that I see here.
And the difference is that with a data warehouse,
I don't really care about all that stuff, right?
I don't have to know about all these very technical
and interesting details, right?
While in the lakehouse, okay, we have to talk about that stuff. So how do we change that? Because
not everyone wants to become a database engineer in order to query and store their data.
Yeah, unfortunately, we opened that door when we wanted updates on the data lake, right?
Because before that, if you're just appending some files to a folder and then collecting statistics on them, it's a very simple thing to do.
Conceptually, it's very easy for people to understand.
And people in the data lake world have grown up thinking about everything as formats.
But the minute you add updates, you've turned it into a database problem.
And if you look at the database world,
I think I made this statement even last time:
you don't see CockroachDB, MySQL, everybody saying,
let's standardize on one format
and then build something on top. It's not a thing.
When you turn it into a database problem,
the stuff that we talked about, those are
the higher-order problems.
So,
to answer your question,
what do we do to change this?
That, honestly, is at the
core of why we even started OneHouse to begin with.
And this is what I say in a lot of places.
A lot of people have asked me;
they come up to us
for enterprise Hudi support or something.
That is not what we're trying to build here at all.
We're not trying to build an enterprise Hudi company.
What we've seen, and you've spoken to Kyle,
our head of product,
who was in a different camp before this,
technology-wise.
The common thing that we see is that it takes six to nine months for data engineers
to become database engineers and platform engineers, to understand all these concepts and
actually implement them.
So what if there existed a similar managed service where you
can click four buttons and then you have
your lakehouse up and running,
and it's open?
Although, I think "open" is super overloaded
with marketing these days.
Truly, what we care about is
interoperable and extensible, right?
So if you have an engineering team, you can go to the project,
you can contribute to the project, get a seat at the table, on the PMC.
Yeah, that exists.
And then it's interoperable;
it works with every open standard.
There is no vendor bias or anything in the project, right?
So we need a foundational technology like that,
on top of which we build this managed experience.
That's how we are thinking about it.
I speak to a lot of cloud warehouse users;
that's like my day job right now.
And what I see is that ultimately they realize this, right?
They start with a fully vertical stack because it's fully managed,
and, like you say, people don't even have to care about it.
But you're signing up for a migration project two years down the line,
right when you're making that choice.
I think fundamentally we need to
bring some manageability to this.
Open alone won't cut it.
That is what I'm trying to say.
Open alone is not a key business thing. Customers are looking
for: how soon can I get my lakehouse up and running, technology aside? And we have to focus on that.
And I feel that open as the only USP against a closed stack, you know, to take on warehouses, is not good enough in my mind.
Cloudera and Hortonworks tried that and failed,
I would say.
Yeah.
That makes sense.
I mean, all right.
The experience that someone has with a cloud data warehouse, that's good; that's what we are after, right?
We want to offer this over a data lake, and that's what OneHouse is, from what I understand.
So would you like to spend a little bit more time explaining to us how we can go from these at least five pretty complicated technical concepts and services to an experience where, with a couple of clicks on a cloud dashboard, we can have a lakehouse up and running and start interacting with it?
How does it work? What's your vision for OneHouse from a product perspective?
Yeah. So honestly, even detaching myself from Hudi, if I had to look around now and see what I would pick today
to build a product experience around,
I'd still go and pick Hudi,
because Hudi already has most of these services.
But it's a library.
Hudi is a library.
You need to adopt it, tweak it.
So what we've learned from some of our initial
users that we're working with, and everybody,
is that a lot of value comes just from hiding configuration.
Speaking for Hudi, we expose a lot of configuration,
just like any database.
You go to Oracle, you go to MySQL,
the point is to expose a lot of configurations;
administrators will pick it up over time
and know what to do, right?
I think we have to simplify that.
For example, don't even show file sizes.
Why should you care about what the file size should be, right?
Right now we ask people to go hand-tune that,
hand-tune those knobs.
So in our experience, a whole bunch of auto-tuning and
intelligent configuration management, that, I think,
is the first ingredient to get there.
And the second thing, specifically talking about OneHouse, where we back
ourselves more, is that our team actually has operational experience, not just
experience building it, right? I've been on call for a 250-petabyte data lake, and I had to wake
up in the middle of the night and recover a table, do that kind of thing. So that's the second
part: in data lakes so far, the user has managed the tables, right?
And if you look at Snowflake or BigQuery,
if a table is corrupted,
the user has no control whatsoever.
Some Snowflake engineer or, you know, Redshift engineer has to figure out what's going on for you.
So that's the second part:
building enough manageability and operability into this product.
You're taking control away from the user
in the name of simplicity and getting started quickly,
but we now need to build all the operational chops
to be able to pull that part off.
I think this is the hardest, hardest part.
I think Jay Kreps, the Confluent CEO,
has a thing where he says, you know, ranking them:
programming a thing is one level;
what's much harder
is debugging that thing;
and what's much, much harder
is operating that piece of code.
Right?
Yeah.
And I think this is where
my disappointment lies
with all of the marketing
that happens in data lake land:
we focus
very little on these
operational aspects.
It's all super DIY.
And then later we also complain that,
oh, it's not standardized, blah, blah, blah, right?
We have to build these things ourselves;
I don't know how better to explain it.
That's what the warehouses have done really well.
It's really admirable what they've done
in the last 10 years.
They've actually accomplished a lot.
Absolutely, absolutely.
Cool.
So we start with auto-tuning and management of configuration in general,
simplifying, let's say, the whole setup process for users.
And also abstracting the operations, right?
Giving, let's say, a cloud experience:
there is a team that will stay awake to take care of things when they go wrong,
instead of you having to build your own team to do that.
Especially for technologies as complicated as these,
where it's not that easy to know exactly what might go wrong.
So I think it makes total sense.
And my next question: I think one of the benefits that the cloud warehouses and all the vertical solutions
in general have is that when something is vertical and you have complete knowledge
and control over all the components, you can control the experience exactly as you want, right?
You know exactly
how it's going to be experienced by the user.
At the same time, you have much more control
over what kind of optimizations to do, right?
And we see that with things like BigQuery
and Snowflake.
So there are actually two questions.
One has to do with the experience,
but let's keep that for later and start with performance, right? With these systems, when you
vertically integrate all the components, you can go and say, okay, I'm going to
build something like Photon, and have on top of that the changes that need to happen
across the different components, and make sure that I squeeze out every little bit of performance.
Where do we stand with the lakehouse architecture when it comes to
performance, compared to solutions like Snowflake or even, yeah,
BigQuery?
Yeah.
It's a great question.
So I, first for once,
I feel like things like Photon
could be built on top of,
like at the end of the day,
going back to my previous statement,
on the read side, right?
Even with the lake house,
these transactional formats,
on the read side,
all that happens is
you are getting some statistics
and planning some query. From there on,
your query performance is
dictated by things like that.
I feel like, I think already
we've proven that this can be built
independent of the
in a very decoupled
way. And then if you now
take things like all the table services
and all these things that we talked about,
they're pretty decoupled from how the query is processed. You mean. You cluster it and then they'll read it. That's it.
So in that sense, I don't see a technical limitation to optimizing the stack sort of
vertically like how we do it. But I do see that you know there are different companies here there is no single
company right like even even for us we routinely work with different query engines there are
different projects you know each you know we take like months to like land certain things and like
you know it can be like a lot of different friction points in terms of how quickly we can
move forward but i think the performance itself
comes from the engine.
A lot from the engine, I say.
At least for interactive query performance,
a lot of it comes from the engine.
With better integration with things
like open source Hudi, or even
OneHouse services, we can
probably match the experience
where you go and
configure clustering in OneHouse while you query with, you know, Presto or Trino or something,
right? That kind of product experience you can build, but I think
there are significant cross-organizational boundaries,
and working across companies is going to slow us down there, I feel.
Yeah, yeah, absolutely.
And just to reiterate what you said:
there's no, let's say, interesting technical reason for data lakes to be slower than a data warehouse.
But when you build the product, and in how the user experiences the product,
things get a little bit more complicated. Just to give you an example: let's say I have a setup with Hudi and
Trino or Presto, I'm running my queries, and at some
point I see a performance regression happening somewhere, right? What do I do?
Who do I reach out to, to debug this thing and figure it out?
Should I come to you at OneHouse, or should I go to the Trino
community and ask there, or is it my data engineer doing something
stupid out there? Whereas when I do that with Snowflake, or,
okay, Google is notoriously known for its support, so forget Google,
let's keep it to Snowflake:
at least at Snowflake, I'll open a ticket and be like,
guys, something is going wrong here,
figure it out, right?
And that's the other part of the question, which comes down to the user experience.
How can we, as vendors who believe in this unbundled, let's say,
database system, the lakehouse, deliver in the end the same experience to the user,
or at least a similar experience?
Yeah.
I think that right now there are a lot of fractures.
First of all, there are no standard APIs, right? I think we attempted this even with Presto, with the Hive connector: we tried to
introduce an abstraction so that, okay, you just change the way you are getting
the file listing. So there aren't even good
abstraction points right now across these different engines for us to test and guard.
I think as these get more standardized, right, all three transactional formats have their
own connectors now, or at least there are PRs out or landed. Starting with
even basic stuff, investing in some basic things between these companies, testing them,
I think we have some very basic
gaps there, I would say.
Longer term, it's
a pretty interesting point
that you bring up. I think,
at the end of the day, there will be some
level of trade-off for the user,
where they are consciously choosing:
I want the freedom and the flexibility.
So yeah, when you
go for that, then you have to pick and choose, right? It's like
buying Android versus iPhone:
sure, you know the OS,
you know the experience that you're getting, but it's going to vary
based on the underlying hardware
and the manufacturer and blah, blah,
blah. So you kind of have to go
through that, I feel. Even
with that, you know, once we
iron out the basics,
I think it'll get to a manageable level.
I don't think it'll ever fully go away;
at some level, it'll always be
a problem, I think.
It won't be completely eliminated.
But that's where I feel
the lake storage players
and the query engine players
have to work
much more closely together
than what's going on today.
Yeah, yeah.
No, 100%.
I mean, I agree.
Obviously, there's a lot of space for improvement out there,
for all the vendors right now,
especially vendors like OneHouse,
because, okay, you just started the business, right? It's one thing to have an
open source project, and it's a completely different thing to build a cloud
product on top of that. There's a lot to be discovered there. And I would also add,
and that's something I really admire about people like you, that you are also starting something that's completely new, right,
in terms of a product category.
So there's a lot of learning on both sides, both from the customer side and also from the vendor side.
And this takes time.
It's very risky, but potentially also super rewarding.
But there's always going to be, I think, a trade-off at the end.
It's not like, okay, we're going to have, let's say, the Microsoft
Access experience with a lakehouse architecture, right?
There's going to be some kind of trade-off there.
Okay.
So let me ask a question that is also a little bit of
a personal question that I have.
Let's say right now I want to start building a lakehouse, right?
One of the first things, actually the first service
that you mentioned, is the ingestion service, right? Somehow you
have to push data in there.
How do I do that today with Hudi?
Is the only way to do it through this ingestion
loader that you have built?
Yeah, I mean, it's pretty simple actually.
You go to docs and if you go to, you know, how to streaming injection,
it's a single command.
It has like an umpteen set of parameters.
You say what your source is, what your target is,
configure a whole bunch, and that single Spark submit command
actually can ingest from Kafka, it can ingest from JDBC sources,
it can ingest from S3 kind of like event streams.
And then it can also do things like it can configure clustering, cleaning, compaction,
all of the stuff that I talked about, right? It's almost like running a database on itself.
So if you just run that one command, it will internally, be a spark it spins of a spark job and then within
that it will self-manage all the resourcing that we need for ingestion if you're not ingesting it's
going to do clustering if it's not clustering it's going to do compaction it even has resource
management so we made it like super super easy and in the front so we actually have built a very
similar thing at Uber.
And I actually started writing this tool
as a, you know,
like a replacement for it in open source.
But I think it's gotten so popular
that it's used in many,
many companies in production.
Right.
For a lot of those companies,
this is the main thing;
this is the main ingest service.
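(The utility being described is Hudi's streaming ingest tool, launched as a single spark-submit command whose exact flags vary by version. As a rough PySpark equivalent of the same idea, Kafka in, self-managing Hudi table out, here is a hedged structured-streaming sketch; the broker, topic, schema, and paths are all hypothetical.)

```python
# Sketch: continuous ingestion from Kafka into a Hudi table.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

schema = StructType([
    StructField("trip_id", StringType()),
    StructField("status", StringType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kafka")                   # `spark` from earlier sketches
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "trip-events")                # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream.format("hudi")
    .option("hoodie.table.name", "trips_stream")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    # Checkpoints make the pipeline restartable from where it left off.
    .option("checkpointLocation", "s3://example-bucket/checkpoints/trips_stream")
    .outputMode("append")
    .start("s3://example-bucket/lake/trips_stream")
)
```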
So yeah, that's what I'm trying to say.
As a project,
we've tried to make it very easy for the users,
because we, you know,
suffered through all these integration pains
when we had to build our own data lake at Uber.
But in spite of that,
I feel the operational toil is still too high.
I mean, I don't know,
that's what OneHouse is trying to solve. But yeah,
Hudi already makes
all this very easy for you.
Okay.
So how would this work
with Open House?
What's the
difference there?
One House.
One House.
Yeah.
Open is so much
different.
Yeah.
I mean,
so the thing is
we're not forking Hudi.
We don't have a Hudi fork.
So if you look at,
even let's say, a typical ingestion use case or something like that,
usually we'll have a blog which describes an end-to-end architecture, right?
We are platformizing that end-to-end architecture on top of Hudi.
It's almost like we're automating the blog that we wrote.
You can still run it yourself if you want.
That's actually something that people really like: a lot
of our early users, the pilots that we're working with, are happy that they can
start with something managed, so they don't have a long lead time to get the latest stuff. But if,
for whatever reason, they don't like us, they can just turn around, and all these services are in open source.
They can just buy support from AWS, and that's it, right?
They can move off of OneHouse as well.
Most open source GTM strategies are built as, you know,
okay, it is an open source project,
and we place a proprietary layer on top of it.
I think we are trying something new,
where we have an open project,
and we try to add as much value as
possible within our product, because we
want to up-level the experience.
Then, if for whatever reason you feel
we're not adding enough value,
you should be able to move off,
and the data is yours, right?
I think this is the fundamental problem.
Now contrast that with the warehouse model.
There it is a fundamental problem:
Once you're stuck in the warehouse,
you have to migrate the data, right?
If you're unhappy with it,
there's nothing you can do about it.
So that is actually what we want to change.
And like you're saying, as a product,
and also as an architecture and a category, it's something pretty
new and experimental. Architecture- and technology-wise, sure, it's pretty proven out, right? Your
earlier question around this unbundled stack: see, whether we like it or not, whether OneHouse
exists or not, that's how people were using the lake even before me, right?
You are using Parquet,
and using Presto
or Spark or Hive.
That's literally
how we started
at Uber as well.
So this multiple-engines-on-an-open-format
kind of thing
already existed
before.
I think all we're
trying to do
is build a path
for users
to get started sooner,
and hopefully,
as a company,
as a product,
we add enough value that we can retain users.
Okay. Yeah, yeah.
Okay, I'm going to make it a little bit harder for you, okay?
Okay.
I'm sure you like a challenge.
So let's say I'm a data engineer
who is coming from the modern data stack environment
where I'm used to using, let's say, Snowflake and a
tool like Airbyte or Fivetran, right?
Where I know that I'm going to connect a source, the data is going to be loaded
onto S3, then a COPY command is going to be executed on Snowflake,
and the data will get imported into Snowflake's table format. And then I'm able to query it.
And all these things happen inside transactions,
so nothing is going to get corrupted, right?
Sure.
Sure.
Cool.
And now my boss says, go build a data lake.
And, okay, we need to expose it to the rest of the organization,
so it should feel the same, let's say, as a lake.
Yeah.
Okay.
And I come to OneHouse, right?
Think of me as someone who has this experience in mind, right?
That's the journey I know when it comes to loading data and this
whole ELT thing.
Is this something that I can do in a lakehouse in general, first of all?
And second of all, even if I cannot do it today,
let's say, will I be able to do it in the future with OneHouse?
Is that how you think of things and how the experience should be?
Yeah. I think, first of all,
the experience should be similar
to what you're used to in an existing managed service, right? But how we accomplish that in OneHouse can be through, you know, us having more
upstream partnerships, right? For example, my previous employer, Confluent. I think in a lot of
scenarios, when people are at the point of thinking about data lakes and everything, they're
also thinking, okay, I want to open up event streams
to my company, right?
I want to open things up for stream processing.
So they would
naturally do something to extract
all this data into, you know,
a big event bus, like Kafka
or Pulsar or one of these things,
right? And the minute
you get it into that,
then it's pretty simple.
So ideally, OneHouse can provide the same experience, whether we run it or whether we partner.
But I'm saying, right now, we would recommend that people rethink how they're doing data streams, right?
Like, okay, the CDC that you're capturing from Fivetran: can you tee that into Elasticsearch?
No, you can't, right? You can only send it to one endpoint, which is Snowflake, right?
Forget Hudi, the data lakes, all of that; that's not what people ultimately build as their data architecture, right?
And I'm sure you're familiar with data meshes, and we live in a world where there's enough data that there are so many specialized stores.
So I think that move,
the move towards streaming data,
will make this much easier
for us, I feel,
for something like OneHouse.
And as a technology, Hudi is very well positioned
to absorb all the streaming data
and integrate it very well.
And, you know,
OneHouse just has to focus
on that problem.
Yeah.
Yeah, what I keep
from what you say is that
things, when it comes, let's say,
to the lakehouse,
will get closer
to what people are used to
from cloud warehouses.
But there's also education that needs to happen, for people to understand that there are also different
ways that we can do things,
and there's value in that.
It's not just that you lose how easy it was;
you also gain, let's say, flexibility and opportunities
to optimize your infrastructure
and do more things
with your data in the end, right?
So I also feel that
the users you talked about,
by the point
where they're building a data lake
for the company,
usually already have a business problem
to solve.
I think they'll mostly look at it
from that lens.
For example, it can be stream processing.
Data democratization is what I just talked about.
It could be just that, hey, I'm building a new data engineering team
or a data science team,
and there are all these event logs and data that I can't even ingest
into the warehouse anymore.
It's not about replicating, outside, the same data that exists in
a warehouse, right?
I believe a lot more data
sits out there in S3 buckets
and cloud storage buckets,
completely unmanaged.
So I think there's
a vast amount of data that is not even getting
into warehouses. And
if you now think about it from this
lens, I don't think the existing
managed pipeline
solutions
are operating
at that scale, right?
They're not operating
at event scale.
Like at Uber,
we did, you know,
tens of millions
of trips a day,
and for that
we were ingesting,
you know,
on the order of a billion events
per day.
There's a scale
difference in the amount
of events and the data volume.
These are things
that we've done
routinely in open source,
and we ourselves have
actual hands-on
experience building them.
So I feel,
technically, scale-wise,
it's a very
different problem,
and when people
consider it,
they have one of
those cost or scale
problems already,
and that will
motivate the
experience that we build.
But by and large, I think it'll be fine.
Yeah, that's an excellent point.
And it's a very fair point, too,
because I'm giving an example, let's say,
but the example and the behavior
that someone has with a product
cannot be taken out of context, right?
There are also the problems that someone's trying to solve. And you're absolutely
right: when you reach the point where you need a data lake, there are reasons for that.
It's not just because you don't like Snowflake, right? Last question, and then I'll
hand it over to Eric, because I've completely monopolized the conversation. Although he's
going to be very kind and be like, it was so enjoyable, and blah, blah, blah, all that stuff.
So we have seen lately, both from Google, with the BigLake initiative that they announced at some point, and also from Snowflake, with their Iceberg support, both as external tables and as a native format, that the data warehouses are also making, let's say,
a move towards more openness, embracing, let's say, the lakehouse
or data lake paradigm.
How do you think this is going to affect OneHouse as a vendor
in this space?
And how do you think this is going to evolve as part of the data warehouse
experience that we have seen so far in the cloud?
Yeah. So let's take even the Snowflake expansion and such. The key question I would ask is: how do external tables actually
perform? It's one thing to have an integration, but it's another thing
whether they perform as well as native tables.
Right?
Because internally, you might have read the Big Metadata paper before,
there are a lot of metadata optimizations.
Problems that transactional formats solve have been solved in a
very different way in warehouses.
So my feeling is
that this is a nice
thing, where you can actually access
data. But by and large,
if people want
something performance-critical, they're going to
move that into a native table
inside the warehouse. That's what
I think. And I think it's very
early. Right now, it feels like
everybody wants to do something against Databricks.
Everybody wants to say, you know, I have a lakehouse too,
whatever they want that to mean.
That's how it feels to me.
So we'll see.
Of course, you know, this can also evolve over time.
At the end of the day, warehouses are still used for traditional analytics use cases, right?
There's much more beyond that that can be unlocked in the kind of model that we've been discussing
so far. So it'll be interesting to see how broad the warehouses want to
make this, right? So I'm not saying that won't happen, but historically, you know, if you project it out, it may or may not happen.
Right.
Yeah.
The second thing here: overall, let's look at this architecture now,
right?
Let's say, okay, we have a common format, and then all the engines read and write from that.
Like, the same table is written from Snowflake and BigQuery.
I haven't seen a use case like that.
Why would you do external tables?
You do external tables only because you
want to do some Spark processing on the same data
that you want to also query.
Then, if Spark's performance is good enough,
why not just pick Spark?
I just don't see clarity in these individual cases, down to a level of, oh, for BI, only
and always use X.
I don't see that kind of thing.
I see way more users caring about: I want to actually keep my data more future-proof. Because four years ago,
or three years ago, nobody talked about
Snowflake as the de facto
warehouse that you dump everything into;
it's been a breakthrough.
So maybe in the next three years, it's something
else. So: I just want to keep
my data future-proof. This data will
outlive the vendors and the
query engines. I see far more
companies worried about, and thinking from, that perspective
than from this, you know, "I want to have a thin layer that I can read
from many engines" one.
That makes sense.
Yeah, I mean, it's still early.
And I think there are going to be a couple of
interesting years ahead of us, at least.
All this innovation and product development is hopefully going
to be beneficial for the customer,
right?
And from my point of view, also putting on, let's say,
my entrepreneurial hat: being a new vendor in this space and seeing these much bigger,
well-established vendors investing towards something that I'm
also doing,
it's good.
It means there is a market, there is appetite in the market for that kind of
stuff.
Now, who's going to win?
I usually say that it's the smaller vendors that win in that kind of innovation.
Yeah.
But we'll see.
It's going to be interesting.
Yeah.
To that point, actually, quickly:
if you think about it,
who writes the code in these systems?
Go back and look at who's pushing
the transactional formats forward.
I think that matters more, right?
Because those people are the ones that are closest to the problems,
closest to the technology.
And that's kind of why I think it shows up as the smaller vendors winning:
because they're much, much closer.
That's the only thing that they focus on, right?
And overall, it's great. By the way, don't get me wrong, it's absolutely fantastic that the warehouses are now taking external tables super seriously. I think Redshift deserves a lot of credit for this. I'm not seeing anybody give them credit, but Redshift Spectrum added Hudi support, like, two years ago, and they deserve a lot of credit for that.
Yeah, I agree with you.
It's a little bit of a shame, because there is some kind of perception that Redshift is, let's say, dead in a way. Although Redshift was the first cloud data warehouse out there, and the team there keeps building amazing technology. So people should keep paying attention to them. They are doing a great job.
Yeah.
Yeah.
And I think, yeah, I mean, this is marketing, right? This is where marketing is reality. As a founder now, I also have the job of reminding my team that what's marketing and what's real is a pretty blurry line. But yeah, Redshift, I think, makes maybe a little bit more, I don't know, but I think it's in the same ballpark as some of the more successful vendors we talk about, right?
And they have tens of thousands of customers. Yeah, this is where, for us, I would say we're at an early disadvantage, because EMR and a lot of the AWS services are deeply integrated already. We didn't start Onehouse back then, so we're starting now, when it's very hard to shine under the marketing spotlight. But I think I've seen enough systems come and go to know that, at the end of the day, the technology has to work, and somebody has to operate this system through all these customer problems. So, you know, at Onehouse we're pretty hopeful for both open data and the lakehouse.
Awesome.
Awesome.
So Eric, as you can see, you are the king. Without you deciding that we now have to take the lakehouse seriously, nothing happens out there. So, as the marketer of this group of people, we want to hear from you.
I was laughing about you saying that the line between marketing and the product reality can be blurry. That is certainly true. One last question for you.
And I'm thinking about the practical side, someone who's thinking through the lakehouse on a practical level, right? So, you talked about the genesis of Hudi: you had real-time needs at an immense scale. And you mentioned, you know, having the bottom half of the warehouse and being able to run Spark on it, or Presto or Trino, et cetera. A lot of that tooling, I think to a lot of our listeners, at least hints at scale problems, right? A lot of those technologies developed because of scale problems. One interesting thing when we talked with Kyle from your team was that he said his opinion has been changing on the lakehouse as practical for companies that aren't at, you know, Uber-esque scale. I'd just love your thoughts on that. And maybe I can frame it in the form of a slightly unfair question: do you think the lakehouse is at a point where a very forward-thinking data team could say, we're just going to skip the data warehouse and go straight to the lakehouse, instead of the traditional split, your data lake on object storage, and then the data warehouse for all the transactional, day-to-day practical stuff?
Yeah, that's a great question, actually. I think it's a totally fair question. I think we are probably a year or so from that, and I cite mostly all the DIY stuff that you need to do. For example, somebody has to understand Debezium, Postgres, and Kafka just to build a simple Postgres-to-lake ingestion pipeline. So there's a significant investment.
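As a rough sketch of the moving parts being listed here, assuming Debezium is already publishing Postgres row changes to a Kafka topic, and again using Hudi's Spark datasource, with hypothetical topic names, schema, and paths, the Spark side of that DIY pipeline might look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("postgres-to-lake-cdc-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Simplified shape of a Debezium change event; we keep only the "after"
# row image. A real pipeline also handles deletes and schema changes.
row_schema = StructType([
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", LongType()),
])
envelope = StructType([StructField("after", row_schema)])

# Read the Debezium topic that mirrors the Postgres table.
changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "pg.public.orders")
    .load()
    .select(from_json(col("value").cast("string"), envelope).alias("e"))
    .select("e.after.*")
)

# Upsert each micro-batch into a Hudi table on object storage.
def upsert_batch(batch_df, _batch_id):
    (
        batch_df.write.format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.recordkey.field", "order_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("s3://my-bucket/lake/orders/")
    )

(
    changes.writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders/")
    .start()
)

Even in sketch form, standing this up and operating it is real engineering work, which is exactly the DIY investment being described.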
And I've spoken to smaller companies
who basically know that the warehouse is going to get expensive at scale over time. But today it costs way less than, say, three data engineers. So that is where most people start, and that's where we are starting, right? With that as a product, from that lens.
The technology, if you look at it cost-performance-wise, in the grand scheme of things, evens out; the lake is much, much cheaper for running any large data processing. The way I look at the world today: warehouses are, in my opinion, still best in class when it comes to interactive query performance. And the work that's going into things like Presto and Trino is changing all that, right?
And then when you look at data processing, ETLs, that is where it gets really expensive. The flip side of scale is cost, right? If you're running at large scale, that also means large cost. So even moderate-scale stuff, and that's probably what Kyle hinted at, even simple stuff, right? Instead of spending 100,000 bucks on the warehouse, you can probably spend $30k on a lake.
As long as a similar kind of experience exists. I think that is opening up, and this is not possible without the cloud. The cloud is what enabled the proliferation of all these different awesome engines, and that's actually what's acting as a good catalyst for driving this.
So I don't see this as just a scale problem. Although, an interesting note: when we started Hudi at Uber, for a year and a half or so, and that's why you don't see a launch or anything, it was just an interesting, nerdy project that engineers at Uber built, because not a lot of people had that kind of scale, with updates and that kind of thing, back then.
But right now, just with time and data volumes exploding, what we see routinely surprises me: the scale that much smaller companies have. Like, oh wow, okay, you have a two-terabyte partition? I did not see that coming. So there is also that. And my view has been evolving for a while too. To be honest, I myself thought it was a high-scale problem. But then, when I saw the scale the community was doing things at, that changed me.
I just literally met a company with an airline data tracking system. They take in some tens of terabytes every day. You wouldn't even have heard of them. They track all the flight data across all the airlines in the U.S., and they were able to get something up and running like that.
They can't send this data into a warehouse.
So they have a lake-based solution.
So there is also that organic data volume growth that is pushing people more towards
this.
Yeah, super interesting. I can absolutely see that. Say, in a year's time, you have people who have been working at a larger-scale company and adopted some sort of lakehouse-flavored technology. Then they go to work for a smaller company and they're like, hey, we can actually do this, instead of waiting until the bill gets to a hundred grand and then having to do a complete replatforming. Super interesting.
Yeah, it's going to change a lot in the next three or four years. And I think we have to get to a point where it's feasible, right? Where, with no cost and no trade-offs to your timelines, you can get started with this thing.
Yeah, it makes total sense.
Awesome.
Well, I think we went long, but that's because Brooks let me record this time, so we get to break the rules, which is always great. Vinoth, this has been such a great conversation. We learned a ton as always, and we'll need to have you back for a third-time's-a-charm round on the Data Stack Show.
Yeah, absolutely. Love to. And thanks for all the awesome questions.
The quality of the questions is one of the things that I really enjoy; it gets me pushing on the hard stuff. So yeah, this is fun. That would definitely be good.
Well, that's a very high compliment.
So thank you so much.
And we'll talk soon.
I love talking with that guy, Kostas. He just has this really incredible ability to answer questions with a high level of detail while keeping the explanation really concise, which is a challenging skill I have a lot to learn from. I think probably one of the takeaways for me was the conversation right at the end, where he talked about how the market is changing and when he thinks data lakehouse technology will come down-market and potentially even be adopted instead of a warehouse, as the first major operational data store in a company, which is really interesting to think about.
But at the same time, his point was, well, four years ago, no one thought about Snowflake as like,
okay, you need a warehouse, you just stand up Snowflake, right? And so he said in another three years, who knows what could happen?
So that was just really interesting.
And I know I'll be thinking about that a lot this week.
How about you?
Yeah, I agree with you.
That was a very interesting point, and it also remains to be seen how exactly it's going to happen and what will happen. What I keep, I mean, okay, there are many different points that I will keep, but one of the things that I really enjoyed was the conversation we had about how the lakehouse, as an experience, with performance and the couple of other parameters we put out there, compares with the experience we have with data warehouses. And I liked how pragmatic he was about that, saying that, okay, obviously things can improve a lot, but there are always trade-offs, right? You're not going to have exactly the same experience, but at the same time you're going to have more flexibility, more scalability, and capabilities that you cannot have right now, that you will probably never have with a vertically integrated solution like a cloud data warehouse. We'll see. It's still early with all these products, but it's always great to talk with people like him, because he gives a very clear read on the future.
I agree.
All right.
Well, many, many more great guests coming up.
Subscribe if you haven't, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.