The Data Stack Show - 249: Quacking Through Data: DuckDB's Emerging Ecosystem
Episode Date: June 18, 2025

This week on The Data Stack Show, John Wessel and Matt Kelliher-Gibson dive into the recent Duck Lake announcement, exploring the evolving landscape of data analytics technologies. They discuss DuckDB's role as a lightweight, local analytics database and its potential as a caching layer for open table formats like Iceberg. The conversation also highlights the current state of data storage standards, focusing on agreements around Parquet and Iceberg, while noting the ongoing complexity in catalog management. Key takeaways include the importance of local compute solutions, the early stage of open table formats, and the potential for simplified data infrastructure that can provide faster, more cost-effective analytics workflows. The episode underscores the ongoing innovation in data technologies and the need for more streamlined, flexible data management solutions. Don't miss it!

Highlights from this week's conversation include:

Discussion on Duck Lake Announcement (1:41)
Compatibility with Apache Iceberg (4:05)
Use Cases for DuckDB (6:23)
Concerns About Data Management (10:01)
Introduction to Data Formats (11:40)
Catalog Space Challenges (13:13)
Metadata Orchestration (14:54)
Simplicity in Data Management (15:25)
SQL Demo Discussion (17:26)
Wrap-Up and Final Thoughts (18:44)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Before we dig into today's episode,
we want to give a huge thanks
to our presenting sponsor, RudderStack.
They give us the equipment and time
to do this show week in, week out,
and provide you the valuable content.
RudderStack provides customer data infrastructure
and is used by the world's most innovative companies
to collect, transform, and deliver their event data wherever it's needed, all
in real time.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
We've got another special episode here with Matt, the cynical data guy.
Welcome back to the show, Matt.
Ah, so I see everyone's canceled.
Big trivia on the show.
Make sure you're on the show.
We have a fun topic today.
We normally at least have a segment here reacting to posts,
talking about current events.
We're going to kind of bypass that today
and we're gonna talk about one of my favorite topics,
DuckDB and a recent announcement.
And then we're gonna zoom out a little bit
and talk about the ecosystem.
Basically, this is a chance for John
to teach me what Duck Lake is.
Yes, live, I'm gonna do it live.
All right, this will be fun.
Matt, tell me about the...
This is not fair.
But Matt, what have you heard about the announcement
and then we'll nerd out on that.
So what I heard, what I read, was
that DuckDB is coming out with this thing called
Duck Lake, which is supposed to be a lakehouse format, an open table format, and,
unlike Iceberg, you don't need a separate catalog.
So I'm assuming it's going to stick it in SQL tables or something like that.
Otherwise, I'm still not exactly sure where DuckDB fits into everything with this,
so you're gonna have to explain that to me.
I haven't had a chance to work with it.
Now, Matt, you haven't had a chance to work with it.
And we're recording this on 5.27,
and this post is on 5.27,
so it's been out for at least 30 minutes.
Yes.
And you're not an expert yet.
What?
I know, I'm disappointed.
All right. So nobody's actually an expert on this, right? At this point.
Anyways, I think we'll zoom out a little bit.
So I think one of the interesting things here is you have active development out there on technologies like DuckDB that are kind of
the SQLite of analytics databases.
Lightweight, open source, all in one package.
And by the way, if you pay attention,
it's crazy to see how many apps have DuckDB
embedded in them at this point,
just like SQLite did.
So that's the one component here.
The other component here is Iceberg,
which we have to talk about where it's like, okay, we've got this new, cool, standard format that essentially sits on top of Parquet for
analytics workloads, and you can bring your own catalog, you can say, hey, Databricks,
go look at the Iceberg table. Hey, Snowflake, go look at the Iceberg table, you know, or whatever,
whatever tool, and, you know, BI tools are getting direct connections
to Iceberg or sometimes you're going through,
I guess usually you have to go through
some kind of compute layer.
But point being Iceberg is this modern data stack evolution.
We've talked about this before,
but it seems to be Iceberg and then like question mark.
People are like, I don't know what goes alongside this.
I don't know if we just like copy and paste
most of the modern data stack over here.
And then Iceberg has something to do with it.
Or there's some AI thing that like is a major component.
Once again, we see the underpants gnome problem pop up.
Step one: Iceberg. Step two: question marks. Step three: big profits.
It's like, how are we going to use this?
I don't know, but we think it has to do with Iceberg.
All right, so back to the Duck Lake announcement.
And I'm actually going to start from the end,
which I think is always fun.
So essentially, I think in a lot of people's minds,
it's like, wait a second, I thought Iceberg
was the next thing, like, what is Duck Lake?
I want to read it so I don't mess it up here.
But essentially, okay, you ready?
Yep. So the question is like compatibility.
All right. So the data and
the positional delete files that Duck Lake writes to
storage are fully compatible with Apache Iceberg,
allowing for metadata only migrations.
So that's one component here.
And the other component,
which I think is interesting,
the availability of the Duck Lake extension
augments and does not replace
DuckDB's existing
and continuing support for Iceberg and Delta and the associated catalogs.
Duck Lake is well positioned to serve as a local cache or acceleration feature for these
formats.
Okay.
I only have one question for you.
Okay.
What is Duck Lake?
Great question.
What does that mean?
Help me here.
I know.
I think what that means is, hey, we've got this new cool thing.
We don't want to compete with you, Iceberg.
We want to still support that.
Competing with you.
We're totally competing with you.
I don't know.
I don't think that's actually true.
I think the distinction here is, look, the well-known
DuckDB problem they need to solve long term is: what if I run out of
memory? Yeah. Essentially, because just like SQLite, you can
only cram so much RAM into a machine. Still, even today, you can cram a lot, but only so much.
So my take on this is they're trying to solve that problem.
And specifically, they're well positioned
to serve as a local cache or acceleration feature.
So they're gonna be here, I don't know, like a Redis,
like, think about Redis conceptually in a stack.
And then you're still gonna have your underlying formats,
like Parquet and Iceberg.
I think that's the thought. It literally released today,
so I will not claim to know
fully what they're thinking.
Okay.
Let me step back a second here.
'Cause like I said, I've not had a lot of time
with DuckDB in general.
I have not had the opportunity to spend much time with it.
So DuckDB, what is the use case for that?
I feel like I'm missing where this is going.
Yeah, great question.
So I think, let me come in from kind of an analytics angle
and it's essentially like, hey,
I've got this really powerful MacBook Pro
with a ton of RAM in it
and I'm like doing this analytics project
and I am tired of making server calls
for every single thing.
Like, I can run DuckDB locally,
I can have, like,
mostly fully functional SQL, there are limitations.
But I can run all these SQL commands
and it can be local and it can be super fast.
That's one.
So it's basically utilizing your computer as like compute.
Yeah, right, right.
So that's one, if you're running local.
Two, there's companies like MotherDuck
taking this, making it a SaaS product
and handling, just like most SaaS products,
the complexity of managing the compute,
all the things for you.
I think a third interesting one,
that actually at Data Council
there was a neat presentation about, is like,
hey, okay, what if,
say we're actually writing a query,
and we can auto-sample the data for you.
While you're writing the query,
you don't actually need hundreds of thousands
of results every time.
Maybe we can speed up your workflow that way.
So I think that's an interesting use case.
And then a fourth, that's very interesting,
is all these companies making BI tools
or tools that have a component of BI.
Like, hey, we have
this really neat, fast store
we can bake into it.
And I don't know all the technical details of it,
but there's some neat stuff you can do in the browser,
having DuckDB fuel your in-browser experience
with, like, a BI tool, essentially.
Okay, so it feels a little bit like we're talking about,
it's like another form of compute, almost,
and on one hand,
it's kind of a local development-ish type idea.
I'm not having to hit like Snowflake every time
and work with them as long as I can get to where the data is.
And then possibly some type of cache layer
or some type of web app or something like that.
Yeah, I think it's very simplistic.
And we've had people from DuckDB
and I think MotherDuck both on the show.
We really need to have them back on the show.
But those are the two practical applications
that I'm seeing: the cache layer problem
and the local dev problem, or like a CI/CD pipeline
problem. There's some neat workflows where, like,
hey, I need to run this pipeline.
I can actually use DuckDB to test the pipeline instead of having to hit my production Snowflake, which saves me some money.
Yeah.
And you know, it's fast as well.
I'm seeing this as partially a step away from the "I have to pay every time I want to do anything with my data" model.
Right.
Exactly.
But yeah, back to the duck. I think it'll be interesting to see, one, are there going to be
more people that come out with this similar type, because everybody's like Iceberg, Iceberg. Is there going to be more in this space where it's like, all right, we're a local cache or
accelerator on top of Iceberg, or we're, like, other alternative local things? Like, you know,
how much of it gets built around Iceberg as the given, of like, hey, Iceberg's
gonna be here, we're gonna build around it, right? And how much of it gets built
kind of alongside, with a little bit of a hedge,
and how much of it's like, hey,
we're gonna be direct competition? I've seen zero direct competition
essentially so far. Well, I could see a spot where, because one of the things
with Iceberg, having had to dive deep into this for some stuff professionally, is most of the biggest gains come at very large
scales, because that's what it was designed for. It was designed for, like, terabytes of data. So I
could see there being something where, if you can come in there with a more local version or something like that, where for smaller sizes, you can get better
efficiencies out of that. Because that is something that, you know, you think
Iceberg, it's going to be great,
and then you get into it and you're like, oh, look, the overhead involved with it
causes it to actually be three times slower than if I just used Snowflake.
Right.
Like, if you don't know, you're not optimizing it correctly,
or your data isn't at the right scale,
you're not configuring it right, it will be slower.
I mean, it will be slower on smaller data sets a lot of the time anyways.
So something that could help with that problem too,
I could see as a complement rather than a competitor.
Right.
So I do wonder what that looks like as a,
well, we've got our catalog in SQL and you've
got it in a file, and I can migrate it, but now am I having to keep those two synced up?
I can see some issues with it.
Well, and the thing that's most interesting to me in this whole world is the standards
adoption is what I would call it.
So, okay.
We've pretty much agreed, for all this data lake stuff,
we've pretty much agreed on Parquet.
And I don't-
For better or worse.
Yeah, for better or worse.
Sure.
Pretty much agreed on Parquet.
And then on top of that, I'd say that's the core.
That's your, like, I mean, CSV, Parquet.
People are like, all right, well, Parquet is better than CSV for this.
There's other options.
It's not the only option, but for whatever reason,
it seems like it's got the vast majority of adoption.
Okay, so step up from that.
Like, all right, what are we gonna do next?
How are we gonna, like, store metadata around it?
How are we gonna do, like, the table concept,
the database table concept? Iceberg,
at least for open standards, like, all right,
I think people are like, we're doing Iceberg.
Yeah.
Okay, so we've got two open standards, Parquet, Iceberg. Great.
Then from here, it's a mess of, like, people
with catalogs. It's like everybody has their own, like, additional catalog, and people are scrambling in the catalog space.
Yeah, okay, cool. And hey, that's where the permissions are set. Like, there's money there.
I get it. And access control and all this stuff that you have to have as a company and you have
to pay money for, or you pay money for.
The other given for all of this, which is easy to skip over, is the storage layer:
we've also kind of agreed on, like, S3 or an S3 equivalent.
Yes.
It's some type of blob storage.
We already agreed on that in the modern data stack. We already decided, like, Snowflake, it's all
S3-backed or Azure Blob or whatever, but it's that blob object storage, like, whatever
brand or variety you want, object storage.
So we've agreed on a lot.
We've agreed on, like, this storage layer, the underlying file format, the, call it, table
format, and with metadata, with, let's say, Iceberg.
And then we've not agreed at all on like catalogs
or we just, we've kind of agreed
there's gonna be a billion different catalogs.
What we need is the Kubernetes of catalogs.
I mean, yeah.
But then this is what I'm interested in, though,
in the lakehouse thing: like, okay, what else?
We know there's no agreement on catalogs.
Well, there's high agreement here,
but then there's that middle space. And these guys with Duck Lake
seem to be in that middle space of, like, hey, we're, like, compatible, but we're, like, kind of off to
the side here. And, like, we can do some, like, caching and stuff. We're totally not a threat at all.
We just want to be the little fish that, like, swims by the shark and, you know, just picks up
the leftovers there. Yeah. I mean, I don't know. For positioning, I don't know.
I think it's a real space.
And I think there's a use case, for sure
a use case here.
It's a neat use case, thinking
about having that kind of, like, caching layer.
And again, oh, I'm
sorry, we missed one of the most important layers: the compute
layer.
Yeah. That is also in that, like, catalog and compute
are in the, like, fight over, you know,
a bunch of different solutions
are gonna fight over that forever.
But I think to a certain extent, the compute,
by design, is supposed to be more diverse,
because it can serve certain use cases and stuff like that.
The catalog is the one that's still kind of
this weird amorphous thing.
Yeah.
That's the one that I feel like if you're going to consolidate on any of these,
it's going to be kind of that catalog space.
Right.
Or it's going to be, like I said, like a Kubernetes-style thing where it's like,
hey, yeah, there's 12 of these and we're just going to abstract over it.
You don't have to think about it.
Yeah. And I honestly think that is more likely to happen.
There's the abstraction layer,
because just with the governance and security
and all of this stuff that's going to be built into,
you know, these various things,
I think that is most likely to happen.
And that would then just allow you
to migrate your catalog if you need to.
If you need to be multi-cloud, it's not going to matter.
It has all the same advantages.
It's kind of that, like, metadata orchestration type idea.
Yep.
Yeah, and we're still super early in all of this. I think that's the thing, it's easy to forget
with a lot of this stuff, with open table formats
and everything: like, we're very early in all this.
There's still a lot of runway
to smooth out these problems.
It took a while for a lot of cloud stuff
to kind of smooth out a lot of rough edges.
Right. Well, and the other thing I do like about this, though, is the simplicity piece here,
because it is still painful to work through, like, all right, what catalog am I going to use? How do
I set up the catalog? Especially if it's not, like, quite compatible with your, like, standard
stack. Like, that's still painful. So I definitely see the use case here of, like, okay, I got
everything in Parquet or Parquet Iceberg.
Like, all right, I just have this thing
and I don't have to think about the catalog
separate from the compute, separate from, you know.
So that I see the value.
I wonder, I could see others following suit,
having this same concept.
We're gonna take a quick break from the episode
to talk about our sponsor, RudderStack.
Now I could say a bunch of nice things as if I found a fancy new tool,
but John has been implementing RudderStack for over half a decade.
John, you work with customer event data every day, and you know how hard it can be
to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric. As you know, customer data can get messy.
And if you've ever seen a tag manager, you know how messy it can get.
You've implemented RudderStack over all the years and with so many RudderStack customers.
To deliver clean customer event data everywhere it needs to go, including your data infrastructure tools, head over to rudderstack.com to learn more.
Now the most important DuckDB-related question.
Yeah.
Have you seen...
It was on LinkedIn the other day.
I didn't flag it for you,
but it's a video of this guy doing a demo.
It's with DuckDB, I think, or something.
And it's like you talk,
and then it quacks and writes SQL.
Have you seen this?
No, okay.
I don't even really understand what was going on with it.
It was that a guy was showing it to Zach Wilson, the data engineer.
Oh, okay.
Yeah, yeah.
And I just saw the clip of it and the guy's talking and then the thing goes quacking,
right?
SQL as it's quacking.
It's just like a CLI tool.
I'm imagining like Calce, if you know that from like...
Well, no, there was a UI there because it had basically, you know, where you could see the voice kind of going...
Okay, yeah.
And then off to the side, it was writing the SQL there.
I don't know what it was writing for SQL or how it determined it.
All I know is it was quacking and writing at the same time.
This is all just duck moving going on right here.
Wow. All right, that sounds like something we're gonna have to link in the show notes if we can find it again.
In the show notes? We can't put anything in the show notes. Yeah.
I'll have to go back and look at this. It feels like something you just shared, like,
don't ask me a question. No, you're always welcome to ask questions.
Cool. Well, I think this wraps our little segment
Well, we will have to get some experts on here to actually do a deep dive
But I wanted to just call this out
because we saw it today and it looks pretty neat.
You'll have to have me on so I can sit there and just go,
yeah, but what is that, baby?
What is the data lake?
I'd be like, forky ask a question right there.
Yeah, perfect.
All right, that is it for our segment here.
Matt, thanks for being here.
Stay cynical.
All right, see ya.