Disseminate: The Computer Science Research Podcast - Adaptive Factorization in DuckDB with Paul Groß
Episode Date: November 6, 2025

In this episode of the DuckDB in Research series, host Jack Waudby sits down with Paul Groß, PhD student at CWI Amsterdam, to explore his work on adaptive factorization and worst-case optimal joins, techniques that push the boundaries of analytical query performance. Paul shares insights from his CIDR'25 paper "Adaptive Factorization Using Linear-Chained Hash Tables", revealing how decades of database theory meet modern, practical system design in DuckDB. From hash table internals to adaptive query planning, this episode uncovers how research innovations are becoming part of real-world systems. Whether you're a database researcher, engineer, or curious student, you'll come away with a deeper understanding of query optimization and the realities of systems engineering.

Links: Adaptive Factorization Using Linear-Chained Hash Tables
Transcript
Disseminate: The Computer Science Research Podcast, the DuckDB in Research series.
Hey folks and welcome to the DuckDB in Research series.
This is the show that explores all the wonderful ways DuckDB has been used in the world of research and academia.
That might be as a vehicle to teach databases or as a platform to experiment and to develop cool new things.
So yeah, if you're a researcher, an engineer or just a general database enthusiast, this is definitely the show for you.
In today's episode, I'm really delighted to say that I'm joined by Paul Groß, a PhD student at CWI in Amsterdam.
I'm not going to attempt the Dutch pronunciation of what the acronym actually means, but yeah, maybe you can explain it for us, Paul.
I've tried before and I always get it wrong.
But yeah, Paul is here to talk about some of the research that he published earlier this year at the Conference on Innovative Data Systems Research, also known as CIDR.
No, it's not a beer festival, unfortunately.
But yeah, cool. Paul's research, or this piece of research I should say, looked at factorized aggregations and worst-case optimal joins. So first of all, welcome to the show, Paul. Yeah, thank you very much for having me.
Cool. Well, let's dive straight in. Give us the sales pitch. Tell the listener what they've got themselves in for over the next sort of 30, 40 minutes or so. Actually, I didn't say what your paper's called, so maybe I should do that now. That would be a great idea.
Yeah, right? So your paper's called Adaptive Factorization Using Linear-Chained Hash Tables.
So yeah, what's the TLDR on this? How would you explain it to your grandma? Take it away, Paul.
Yeah, so the paper on adaptive factorization is basically about allowing you to reduce the size of your query's intermediate results. In most analytical queries, you don't really have a million rows in your output, because at the end the output is designed for a human, or maybe nowadays for an LLM, and they can't really make sense of that many rows. But the intermediates often contain millions and millions of rows.
And with factorization, or with adaptive factorization, you can basically avoid those big intermediate results by only processing a reduced set of data.
And the cool thing about our paper is really that you can use 90% of the things that are already in a normal analytical database system to do that
and you don't really need a new big operator for that,
but you can basically integrate it quite easily.
And also to add, like, this was joint work with my supervisor at the time,
Daniel ten Wolde.
And I think you already had him on the show as well, on DuckPGQ.
Yeah, he was on in the first season talking about DuckPGQ.
So if you're interested in this, listener,
you should definitely go check out Daniel's show as well.
It's really good.
So that's cool there.
So basically the aim of the game here is that we can often get these really big intermediate sets of things while we're doing our queries.
We want to make those as small as possible, right?
So we don't end up with this sort of state space explosion during our query processing.
Cool.
So yeah, my grandma definitely gets it now.
She knows.
She understands.
Cool.
So, yeah, there was also another thing that I mentioned at the top of the show.
I mentioned the phrase worst-case optimal join.
Maybe we can talk a little bit more about what that actually means, because whenever I see that term it always scares me a little bit. So maybe we can unpack what that is a little
bit as well for the listener. Yeah, it still scares me a little bit as well. Basically,
I think the most common example is a triangle query. For example, you have a social
network, and there you have friendship relationships. So friends meaning user A is a friend of user B, and then
user A is also a friend of user C. And then everybody has, like, 10 or 15 or 100 friends,
I don't know.
And the idea is basically, okay, now I want to find a triangle.
So, like, whose friendship relationships form a triangle?
And in a normal query plan, usually what you do is you have binary join plans.
So in a binary join plan, you would join the relations one relation at a time.
So first you would, for all the users, get all of their friends.
So you have this explosion of friends, because if everybody has 10 friends,
then you get 10 friends for everybody, and then their friends' friends,
and you end up with a hundred tuples per user,
so you have this explosion and it's quadratic.
And then with the second join,
you create this very filtering join,
because now you have those two conditions:
if you have friend A and then friend B, then you want friend C,
and C has to be a friend of both B and A to close the triangle.
So this is a very filtering join.
So then you have exactly this explosion of the intermediate result:
your first join will return a huge intermediate,
and then the second join will filter that down again.
And what you do with worst-case optimal joins, basically,
is you go tuple at a time, or rather attribute at a time.
So you basically check all three relations simultaneously
and then intersect those
tuples, instead of doing one relation at a time and having this huge intermediate.
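For anyone who wants to see the shape of the query Paul is describing, here's a minimal sketch using DuckDB's Python API. The friends table, its columns and the data are made up for illustration; it's just the plain SQL triangle query, which a binary join plan would evaluate one join at a time.

```python
import duckdb

con = duckdb.connect()

# A tiny, made-up friendship edge table (each friendship stored in both directions).
con.execute("CREATE TABLE friends (a INTEGER, b INTEGER)")
con.execute("""
    INSERT INTO friends VALUES
        (0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
        (2, 3), (3, 2), (3, 4), (4, 3)
""")

# The triangle query: three self-joins over the same edge table.
# A binary join plan first materializes all two-hop paths (f1 joined with f2),
# which can explode quadratically, and only then filters them with f3.
triangles = con.execute("""
    SELECT f1.a, f1.b, f2.b AS c
    FROM friends f1
    JOIN friends f2 ON f2.a = f1.b
    JOIN friends f3 ON f3.a = f2.b AND f3.b = f1.a
    WHERE f1.a < f1.b AND f1.b < f2.b   -- report each triangle only once
""").fetchall()

print(triangles)   # [(0, 1, 2)]
```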
Nice, cool. Yeah, that's a very good explanation. I was just thinking, as you
were going through the social network example, about how
if you've got that amount of friends it can be problematic, because it's quadratic
and it quickly explodes. I was thinking, well, my social media
accounts wouldn't be in any danger of troubling that, because with, like, one follower, two followers,
it's maybe not as challenging to the query engine as others. Maybe.
Yeah, but yeah, that's cool.
Yeah, so that's a really nice description there
of what worst-case optimal joins are.
So, yeah, we've got worst-case optimal joins and factorization.
So why has it taken so long for these sorts of techniques
to become adopted in mainstream systems,
or what makes them difficult to implement in database systems?
Yeah, I mean, maybe quickly: factorization is basically
a method of avoiding those intermediate results.
So what you have for a binary join is basically a join hash table,
like a hash join.
And this is actually quite simple.
You basically have your build side:
you just build a hash table on your right-hand relation,
and then you probe this hash table with your left-hand relation.
And let's say you have 10 friends again.
So you would have this hash table that has maybe just a collision chain.
And then for each friend you're probing into this hash table,
you would get 10 results, because you have 10 hits.
So the idea now is, instead of flattening that and having user 0 with user 1, user 0 with user 2, user 0 with user 3,
you basically have a list of users for this very first user.
So you have user 0, and then you have, times, this list: user 1, user 2, user 3,
user 4 and so on. And this really helps, because you need less memory, because you don't
have to repeat user 0 again, and also, for example, if you want to compute a count aggregate on
top of that, then you can use the factorized representation to easily compute the count, because you
don't have to expand the whole intermediate: you can just take the sum of the lengths
of the lists, because now it's 10. So this is basically the idea of factorization, and it is
a very, very useful trick.
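To make the count trick concrete, here's a tiny, made-up sketch in plain Python, nothing DuckDB-specific: instead of expanding user 0 against each of its friends, you keep one entry per build key with a list attached, and a count over the join result becomes a sum of list lengths.

```python
from collections import defaultdict

# Made-up edge list: (user, friend). User 0 has three friends, user 1 has two.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]

# Flat (expanded) representation: one output tuple per join match.
flat = list(edges)

# Factorized representation: each user appears once, paired with a list of friends.
factorized = defaultdict(list)
for user, friend in edges:
    factorized[user].append(friend)

# A COUNT grouped by user never needs to expand the lists:
counts = {user: len(friends) for user, friends in factorized.items()}

assert len(flat) == sum(counts.values())
print(counts)   # {0: 3, 1: 2}
```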
But the problem is really
how to handle those lists,
because you have a vectorized execution engine,
at least for us, for what we did
in DuckDB,
and then you don't have
one user and then its list,
but you have
2048 users, because that's the vector size.
So you're always processing those
2048 elements at a time,
and then for each of them you have those lists,
and suddenly your lists are variable-length,
so it's also not easy to provide the memory for that,
because one user, maybe like me, has two followers,
and then you have Elon Musk in the next slot,
and he has, like, millions, I don't know how many,
probably millions, yeah, if not billions, yeah.
This is really a problem, because having this
variable-sized data is just hard to deal with.
So this is mainly: how do you create those lists,
how do you make your operators emit those lists?
Because first you have to have them,
and then you have to compute on them.
This is really the challenge.
And there have been several approaches to that.
For example, GraphflowDB and Kuzu have a very cool way of, basically,
having a vectorized system where you
have this first vector, which is the left side of the factorization,
and then you have the right side
in a big flat buffer, and then you can basically select certain indexes.
It's very hard to explain without pictures,
but there's a really great thesis that I wanted to highlight from Amine on GraphflowDB,
like the predecessor of Kuzu.
So that is, I think, a good read.
Yeah.
Yeah, and it's basically, yeah, this is the main reason.
Like, how do you make your operators integrate those
representations, and then how do you
execute on them?
And this is true for factorization and for worst-case
optimal joins, and there are also two problems:
these operators are super complex
to integrate, because they have this
completely new way of
working. I mean, in some sense they are also simpler,
but they are just different
in their style, let's say it in a very
researchy way.
And then you also
have to decide whether you want to use this
worst-case optimal join or whether you want to continue
with a binary join plan, because this
triangle query, it's of course the perfect example, but how often do you run this query on,
like, an OLAP system? Like, no TPC-DS query, I hope this is true, has such a pattern.
Yeah, it is. Now you've spoken it into truth, basically, Paul. Yeah, it is, yeah. It's very
easy, we'll check it, but for TPC-H it's for sure the case.
Cool. So, yeah, you actually mentioned Kuzu and, did you mention Umbra? You mentioned Graphflow,
not Umbra, sorry. Yeah, I mean, actually, on a random aside, I just saw that Kuzu,
the project, has actually been shut down sadly. So yeah, no more Kuzu, which is a shame.
It's a pity, yeah. I'm sure it'll come out on some Hacker News thread at some point
what's happened, right? There'll be something cool there, yeah, there always is. But yes, you mentioned
that those systems have sort of tried to tackle it and have come up with some neat
approaches to taking the idea and actually making it sympathetic to, like, a vectorized engine.
And what are their limitations at the moment? Like, where are they falling short
in putting those ideas into reality? I think most of the systems that
really want to have it, have it, in the sense that Kuzu has support for worst-case
optimal joins, and Umbra has it because Umbra is very feature-complete. So it's possible, definitely.
But for us the challenge was also how to make something that Mark would merge,
so how to not touch too much.
And that's really the problem.
And also, I mean, this is different for different systems:
if you are, like, primarily a graph database
and you're expecting a lot of cyclic joins
or these kinds of things,
then you have the resources to
integrate that. So it's not unsolved,
it's just complicated in that sense.
And that's the implementation part.
And then it's still hard
to really plan those elements, because you have to have those very particular conditions, like you
want to have this, or you want to avoid this intermediate result, and sometimes maybe your
intermediate result is not actually large. Like maybe your first join, which would be the binary join, is
actually also selective, or the third join is not filtering. What do you do then?
Then your binary join plan was maybe better, and for that you really have to rely, at the
moment, on cardinality estimations and sketches and all of these statistics. But
planning is also very hard, and then having this other dimension, where now you can
choose between the worst-case optimal join and the binary join plan, which are fundamentally
different, you can't just swap them at runtime, that is adding a challenge that is, I think,
really an academic challenge.
Nice, yeah. For the listener, Mark is one of the co-founders of DuckDB,
right? So yeah, that's the bar to get it past. But that's cool. So you went
about tackling some of these issues you just outlined, and you did that with a thing called
linear-chained hash tables, there we go, a second time. So yeah, walk us through what they are,
what their design is, and how they go about mitigating some of those challenges you outlined a second
ago with these techniques. Yeah, I think the linear-chained
hash table, the idea is actually from my supervisor, Peter Boncz, who is my PhD supervisor,
and I think it's a very neat idea and also a very neat strategy, because we basically implemented
them before implementing the factorization, because they enable factorization but they also provide
certain other benefits, and so we could also sneak infrastructure for our later
plans in nicely. Laying the groundwork for the future, oh look at this, now we've managed to implement this.
Yeah, yeah, oh, what a coincidence. So it's a little bit
in that direction, but basically...
Strategic, yeah.
Strategic PRs.
Yeah.
Basically, the idea is to extend the DuckDB hash join a little bit
so that we can use the chains of the hash table later for factorization.
So a hash join, as I already said, is a very simple design.
You basically have this hash table.
You build it on your build side.
You probe it with the other, left-hand side of the join.
And then you're done.
And then the big question is, of course,
how do you design this hash table?
And, I mean, there's this very fundamental data structures
question of how to resolve hash collisions, because,
I mean, hash collisions happen because two keys
basically collide on the same slot of your hash table.
So how do you deal with that?
Either you can do linear probing,
so you say, okay, this bucket is full,
so I will just take the next one,
and if this one is full, I will take the next one
until I find an empty one, and the same during probing.
Or you can do chaining, basically
having a linked list of elements and then populating that linked list.
And both have advantages and disadvantages. Linear probing is very cool because you have all the
things inline, but what do you do if you have a string key or a struct as a key? Like,
you can't inline everything into the hash table, and you can also get very long probe sequences.
Chaining is easier if you have those big keys, but
you have a linked list, and linked lists are slow to follow,
and they can be scattered across memory,
and the hash table can be gigabytes in size,
so it's very, very slow to follow those chains.
So basically what we did, it's a little bit involved,
but it's also not unusual, like, other systems do something similar.
So basically, if you want to insert a tuple into the hash table,
you first need to hash it,
and then you get the offset from a part of the hash,
and then with this offset you go to this slot,
and then DuckDB has this chained hash table
where for each slot you have this pointer to a chain,
and this chain can then have all the different types of keys
and has some memory that it traverses.
So the problem now is that you have two types of collisions, basically.
The first is that you have the same key twice
in your hash table.
For example, if one user has multiple friends,
then this key is just in the hash table multiple times,
because it's the ID of this first user.
So then what you have is the same key colliding to the same hash,
to the same slot, which is normal.
Like, you can't avoid that.
That's fine.
And then what you can also have is that you actually have a hash collision.
So you have two distinct IDs you want to insert into your hash table,
but the hash just happens to be the same.
And if you have a small hash table, for example, like 2048 slots,
and you want to insert, let's say, 900 elements,
then it's quite likely that those will collide.
And now, because DuckDB has those chains,
different keys might go into the same chain.
During probing, you always have to walk this chain and say,
is this tuple an actual match?
Yes or no.
Is the next tuple an actual match,
yes or no? And this is kind of slow, because you have to do this comparison for every
element in the chain. So what we did is basically do some trick to make the linear probing
happen with only parts of the hash. So we add some salt, as we call it, yeah, it's like the meme. So basically pointers
are only 48 bits, and our hash table slots
are uint64, so we have 64 bits,
so there are 16 bits of
emptiness, of void, if you're not on
Android. And so
basically we fill those upper 16 bits
with some part of the hash, and we use this
as our tiny key, and
on this key we can then do the inline
linear probing. And then
we can do linear probing
for all the collisions
where the keys are actually distinct,
and only add an element to a chain
if the keys are the same.
So therefore we can guarantee that
within a chain all the keys are the same.
So therefore, if we want to probe
this chain with a key,
if we find that the first element
in the chain matches, we know that all elements
of the chain match, and that
we can use to make probing faster.
So this is
the TL;DR, kind of.
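A rough, purely illustrative Python sketch of the salt idea Paul describes, not DuckDB's actual implementation: each 64-bit slot carries 16 bits of the key's hash (the salt) next to the entry it points to, so linear probing can reject most distinct-key collisions with a cheap salt comparison, and chains stay key-unique so one key check covers the whole chain. Here the "pointer" is simply an index into a Python list of chains rather than a real 48-bit pointer packed into the low bits.

```python
# Toy model of a "salted" hash-table slot.
CAPACITY = 8
slots = [None] * CAPACITY        # each slot: (salt, chain_index) or None
chains = []                      # chain storage: (key, list_of_payloads)

def _salt(h: int) -> int:
    return (h >> 48) & 0xFFFF    # upper 16 bits of a 64-bit hash

def insert(key, payload):
    h = hash(key) & (2**64 - 1)
    salt, pos = _salt(h), h % CAPACITY
    while slots[pos] is not None:
        s, ci = slots[pos]
        if s == salt and chains[ci][0] == key:   # same key: grow its chain
            chains[ci][1].append(payload)
            return
        pos = (pos + 1) % CAPACITY               # distinct key: linear-probe onward
    chains.append((key, [payload]))
    slots[pos] = (salt, len(chains) - 1)

def probe(key):
    h = hash(key) & (2**64 - 1)
    salt, pos = _salt(h), h % CAPACITY
    while slots[pos] is not None:
        s, ci = slots[pos]
        # A salt mismatch is rejected without touching the chain's memory, and
        # because chains are key-unique, one key comparison covers them all.
        if s == salt and chains[ci][0] == key:
            return chains[ci][1]
        pos = (pos + 1) % CAPACITY
    return []

for user, friend in [(0, 1), (0, 2), (0, 3), (1, 2)]:
    insert(user, friend)
print(probe(0))   # [1, 2, 3]  -- the whole chain in one hit
```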
Nice, yeah, that's really cool. So how does
this then, well, the first question I want to ask is, you were saying earlier on
in the show that you wanted to be as
least invasive as possible
so you could get this PR merged.
Did there have to be any significant
code changes to actually make
this happen? Like,
was it easy to implement,
did much need to change
on the core
product side?
Basically the salt
thingy was already implemented,
luckily, for the aggregate hash table,
but not for the join hash table.
So it was
a known technique,
and that's also good,
because if
you want somebody to maintain it, it helps if they've seen something like it already.
And I think with the salt, also, Umbra does something similar, for example.
So this is quite a common thing to do.
And then having those key-unique chains was possible in the sense that you already had infrastructure for checking whether the keys match, because you have to do this during probing anyway.
So basically it was moving some code
to another part of the build pipeline,
and I think it was maybe 200 lines of code,
like, not too much. And it's very cool,
because usually your build side is much smaller than your probe side,
so for each tuple you insert, you maybe
probe 20 times. So now, because you do this comparison
10 times during insertion,
you will save it 200 times during probing.
So especially if you
have keys that are not that easy to compare, like strings, it's very beneficial.
So that was also the motivation to just add this as a plain data structure.
Nice, cool.
So we've got this primitive in there then.
How do we go from that to enabling our end goal of worst-case optimal joins and
factorized query processing?
Yeah.
Now it's actually quite close already, because now you have those chains and they are key-
unique. So now, if you probe your join, normally you would
flatten those lists: for each element you would take the first element of the list,
emit it, take the second element of the list, emit it. Now, instead of doing that, we just
emit a pointer to the chain. So then we basically have factorization, and now, if we want to
compute a count aggregate on that, under certain conditions, and if the
group key is not within the list, because then we can't really use it,
if the group key is in the flat part of the vector, let's say it like that,
then we can just, again, take the sum of the lengths of the lists to compute the count.
And this is the idea for that.
The problem with that, though, which I realized far into my master
thesis, is that there's actually a much easier way to compute that.
Because if you already know that you want to have the count anyway,
what you should do is make a group join,
which is a technique introduced by the Munich people.
So basically, instead of building the list in the first place,
you just compute this aggregate while creating the join hash table,
and then you can emit the aggregate directly.
So it's a little bit more fancy now with these chains,
and then emitting the pointer to the chain
and then only counting the length of the chain,
but it's already there.
Yeah.
That was not very useful.
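A tiny sketch of the two routes Paul contrasts, in plain Python rather than DuckDB internals: counting via the key-unique chains versus folding the aggregate into the build side, which is roughly the group join idea he credits to the Munich group. The data and names are made up.

```python
from collections import defaultdict

edges = [(0, 1), (0, 2), (0, 3), (1, 2)]   # build side: (user, friend), made up
probe_keys = [0, 1]                        # probe side: user ids, made up

# Route 1: build key-unique chains, the probe emits a pointer to the chain,
# and the count is simply the length of the chain.
chains = defaultdict(list)
for user, friend in edges:
    chains[user].append(friend)
count_via_chains = {u: len(chains[u]) for u in probe_keys}

# Route 2 (group-join style): fold the aggregate into the build phase,
# so the hash table stores a running count instead of the matching tuples.
counts = defaultdict(int)
for user, _ in edges:
    counts[user] += 1
count_via_group_join = {u: counts[u] for u in probe_keys}

assert count_via_chains == count_via_group_join
print(count_via_group_join)   # {0: 3, 1: 1}
```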
The question I want to ask you, Paul, is where were you when you realized you could do it this other, simpler way?
Not to rub it in or anything, but, like, literally.
In the shower one morning, like, oh my God, I could have done that.
It would have saved me so much work, but yeah.
I mean, I was talking to people from Munich,
and, I mean, actually, there were pointers to that.
And there's a paper that came out at the same time which does something very similar,
which is, I think, Diamond Hardened Joins,
from
Altan Birler
at Munich,
and he also explains that,
but only in a tiny sentence.
It's a,
yeah,
so there would have been pointers,
but I mean, it's fine,
it's fine, it's fine.
We learned something,
and we showed another way to do it.
Very true, very true.
Maybe not the best way.
Yeah, cool.
So let's talk about,
so you mentioned this earlier
in the show, I think, where you were saying
that it's often hard to know when to deploy these techniques.
It's not obvious whether you go one way or the other.
It adds another dimension to the complexity of picking the plan, basically, right,
of what you actually want to do.
So I guess this is where the adaptive decision making comes in a little bit, I think, right?
So, yeah, this concept of adaptive factorization.
So I guess start with what's the motivation behind adaptive factorization
and what we're trying to achieve here, and then tell us how you went and did it, I guess.
The idea is basically to have those chains of elements.
And having these chains of elements adds some overhead,
because suddenly you have these variable-length elements
you have to keep track of in a vectorized execution engine,
which is not trivial.
But on the other hand, you can compute things faster
if you don't have to expand them.
And to know whether to pay this extra effort of maintaining those chains,
it's very important to know whether you actually have those chains.
And this basically means,
okay,
I need to know
that there are duplicates
on the keys of my join side,
because that's what
will result in those chains.
And now,
if I don't have a special operator
for doing the worst-case optimal join,
but I'm just using the normal
DuckDB hash join,
then
those chains are there for free,
basically,
because in a binary
join tree I would have this first join, and the worst-case optimal join plan, which we can
now do by not expanding those chains and then using them later in the join plan,
also has these hash table chains.
So the first step of the plan is very similar.
And by the time we have to decide whether to now use a worst-case optimal plan or this binary
join plan, we have already seen these chains, because we can decide this after we build the first
hash table. And therefore you touch the data, you basically don't have to rely on statistics.
You can very precisely say, okay, for each value I've got seven duplicates,
which means my chain length on average is seven, which means, okay, I can update my cost model
with true values and I don't have to rely on estimates. And estimating how many duplicates
there are, like, you can't really. There are sketches for that, but they always require something,
and you usually only have them on your base tables. But what if you are
querying a CSV file or a Parquet file, or another join result? Like,
there are cases where you don't have statistics, and then if you have statistics, they are
only statistics, they are not the true value. So that's the motivation behind doing it that way.
Nice, cool. So I guess next I want to dig into the
sketches a little bit more. So, yeah, from what I'm given there is that you don't actually need them, right?
Or do we need them? How do they fit into your overall approach to collecting these
statistics, like, why do you still need them? Yeah, so basically, if we want to do our worst-case-
optimal-join-like plan now
with those
factorized elements,
then we have to do two things.
Like, first, it's for the triangle
query, and it's a little bit hard
to explain without a picture,
but I'll try.
Basically you get
these chains, like, you join,
let's say we want to make a triangle between
the three
relations.
So first you have to join
the first two relations,
and then, because
you're emitting those chains, you get,
for each value of relation
one, you get
chains with elements of
relation two. And then you basically
emit this up
into your tuple, in your
pipeline. And then
the next join comes, which is also
kind of a normal hash join, but then
you have one simple
flat condition, which joins
tuples from relation
one to relation three.
And then basically you have those two
lists, and now you can do list intersection. It's not very clear, I think, from this
explanation, but basically the idea is, okay, you only have to decide after this first join
whether you want to do this or not, and then you can either do the worst-case optimal join plan or
the binary join plan. So now, because you have the exact data from the first join,
you already know whether you have duplicates or not, so you can know
whether your intermediate
is exploding or not.
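It's easier with a picture, as Paul says, but here's a rough Python sketch of the step he's describing: once you have, for a pair of connected users, the candidate lists coming out of the two hash joins, closing the triangle is a list intersection rather than another expanding join. The adjacency data is made up.

```python
# Adjacency lists for a tiny, made-up friendship graph.
neighbours = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}

triangles = []
for a, a_friends in neighbours.items():
    for b in a_friends:
        if b <= a:
            continue
        # Close the triangle by intersecting the two candidate lists for c,
        # instead of expanding every (a, b, c) combination and filtering later.
        for c in sorted(set(a_friends) & set(neighbours[b])):
            if c > b:
                triangles.append((a, b, c))

print(triangles)   # [(0, 1, 2), (0, 2, 3)]
```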
But in order for
the join to be
really, really performant,
then you also want to
shrink this intermediate again,
let's say,
to put it
very simply.
So how do you know
that your second join
is actually selective?
Like,
you don't really have,
you only have data
on one relation
of the three relations
of the triangle.
So now,
what DuckDB does,
which is also
what
every modern analytical system does,
is, before actually building the hash table
for the second join,
it first materializes all the data
and stores it basically in memory,
or spills it to disk, in rows.
And it does that to then know,
okay, how much space do I need for my hash table,
to allocate this lookup buffer?
Because if I have 10 elements in my hash table,
then I will make a hash table of size 32, for example.
But if I have a million, I have to have a big hash table,
and it's very expensive to resize a hash table during building.
That's why you first want to gather all the data, hash it,
and then you know how much data you have to build a hash table on,
and then you build this hash table.
So this means that in this case we also have a scan over all the data of the second join,
and then we can use this scan, and we even have the hashes, to populate other sketches,
to then estimate more statistics, basically, on how selective the join will be, and so on and so on,
and whether there are duplicates on the second join.
And this is basically the motivation for doing it that way.
So you have very exact data on your first relation,
you have somewhat exact data on the second,
and then the third is still an Easter egg, a surprise.
Yeah, it is what it is, you don't touch it.
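A hedged sketch of the kind of bookkeeping Paul is describing, not DuckDB's actual code: while the build side of the second join is being materialized and hashed anyway, you can count rows and distinct keys in the same pass, which gives you an average chain length, i.e. a duplicate factor, before the hash table is even built. The distinct count here uses an exact set for simplicity; a real system would use a sketch such as HyperLogLog.

```python
def build_side_statistics(rows, key_of=lambda row: row[0]):
    """One pass over the materialized build side: row count plus distinct keys.

    The ratio is the average chain length the join hash table would have,
    i.e. the duplicate factor that feeds the cost model.
    """
    total = 0
    distinct_hashes = set()            # stand-in for a HyperLogLog-style sketch
    for row in rows:
        total += 1
        distinct_hashes.add(hash(key_of(row)))   # we hash every key anyway
    avg_chain_length = total / max(len(distinct_hashes), 1)
    return total, len(distinct_hashes), avg_chain_length

# Made-up build side: key 0 appears four times, keys 1..3 once each.
build_rows = [(0, "a"), (0, "b"), (0, "c"), (0, "d"), (1, "x"), (2, "y"), (3, "z")]
print(build_side_statistics(build_rows))   # (7, 4, 1.75)
```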
Nice, cool. Now, it wouldn't be 2025 if there wasn't some machine learning in
there somewhere, right? So I know you have used some ML as well, and, I guess, compared it
to normal heuristics, and looked at how machine learning models can also help you decide
the various join types you want to use, right? So maybe you can dig into how you applied that, what
the tradeoffs were, and what you found between the two different approaches. The classical versus
the modern, I guess. Yeah. I think, like,
machine learning, it sounds like a lot, but it's very, very basic machine learning, like a
decision tree. So, at the end, I wouldn't say it was complicated, but it's important, I think,
that it's not like a deep learning model where you don't really know what's happening.
I think the machine learning was really more to see whether there are insights beyond what
we would expect, whether there are other patterns in the data. Because for us it's
very clear: if the first join has many duplicates, that means we have an explosion,
and therefore worst-case optimal joins are probably good.
And so the idea is, okay, you have this metric, this feature, which is basically
the average chain length of the first join, which is correlated with the duplicates in the first
join, and then you have this heuristic:
if it's, I think, maybe bigger than 10 or something, then you take this
worst-case optimal join approach.
And then there are other features where it's not that clear, and so we also built those
decision trees to analyze whether there's more detail in the data. But I think the results, or
the predictivity, the true positive rate, it was very similar. It was not like
the decision tree was outperforming this heuristic of just average chain length
bigger than 10 by 2x; it was very similar.
And therefore I would also say, okay, if I were to implement it,
I would use this heuristic because, A, it's simple,
I can understand it,
and B, the model is not that much better.
So I think there are places where you can use machine learning in planning, definitely,
and there's a lot of work on that.
But in this case, I would stay with the heuristic because it's so close.
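The heuristic itself is tiny; here's a sketch of the decision Paul describes, with the threshold of roughly 10 taken from his description and everything else illustrative.

```python
def choose_plan(avg_chain_length: float, threshold: float = 10.0) -> str:
    """Pick the plan shape once the first build side has actually been observed.

    Many duplicates per join key means long chains, which means the first
    binary join would explode; in that case prefer the plan that keeps the
    chains factorized and intersects them later.
    """
    if avg_chain_length > threshold:
        return "worst-case optimal join plan"
    return "binary join plan"

print(choose_plan(1.2))    # binary join plan
print(choose_plan(37.0))   # worst-case optimal join plan
```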
Yeah, nice, cool.
So, yeah, that's interesting.
Actually, it's funny you should say that,
because over the years of doing the podcast,
quite a few people have had the same sort of opinion.
They've been like, yeah, ML models are good in some very narrow subset of corner cases for certain types of workloads,
but most times the juice isn't worth the squeeze, right?
Just stick with something simpler.
You can reason about it more easily,
and it's easier to maintain as well, often a lot easier.
So, yeah, it's interesting that you've also come to the same conclusion.
Sweet.
So, yeah, the next section of the podcast is where I ask about some results and evaluation,
and how well your techniques performed.
So, yeah, give us some clickbaity results
that I can chop up and stick in the promotions
for the podcast.
How did they do?
Basically, I think benchmarking was not trivial,
because, how do you, I mean, you have to have,
for us, we mainly focused on those triangle queries,
because they're the prime example,
and you could generalize this to cycles
with odd numbers of edges,
because there you really have this explosion,
or, like,
you can do this
worst-case optimal join-like plan very easily.
And what we used there is
a social network benchmark,
and basically we ran a triangle query
on the schema to get all the possible
triangles you can reasonably get,
and then used all of these queries
to benchmark whether we want to use
the worst-case optimal join or not,
and basically ran it once with the normal plan,
once with the worst-case optimal join plan,
while also recording those statistics,
and then we could also say,
okay, would our cost model have predicted the correct one,
therefore, like, simulating it.
And, yeah, it showed that always or never using worst-case optimal joins
was, of course, inferior to using some ML or heuristic
to decide whether to use
it or not. But to be honest, I don't have the exact numbers anymore, like how many percent
or something. But it was worth the investment. A million percent. Big number. Yeah. It was better.
The moment you shape your own benchmark, you can always get arbitrary speedups, because you can
design your query in a way that it exactly hits your optimization. And then
it's always easy to get arbitrary speedups. Yeah. Yeah.
Cool. I guess
this is available now,
your techniques are
obviously implemented in DuckDB,
so people can
benefit from your
techniques right there in the code?
Partly.
There's a branch, of course,
which is on my private GitHub,
which implements the worst-case optimal join
and also the factorized
aggregates on top of factorization,
which I talked about a little bit earlier,
where there's actually the better way of
doing it with the group join, but you can
also do it that way. And yeah, this linear probing hash table, it's in DuckDB. I think it's
been in DuckDB since 1.1 or something. It's been quite a while, actually. It has been working quite well, I would say.
And then I also had a discussion, of course, with the DuckDB maintainers about how we could
implement the worst-case optimal join, because the factorization and the aggregates are
solved more easily in other ways, and I hope
that DuckDB at some point will also integrate that
eager aggregation
and group joins, but so far not.
And unfortunately it's already
been done, like, research-wise it's covered, so
I can't do it, it's not for me,
sadly. I would like to implement it, but I
have to focus on newer
stuff. Newer stuff, yeah, more of the cutting-edge side of things, right?
Yeah. Because I was going to ask
kind of about,
I guess it's still kind of an open
question, how often, in the wild, across workloads at large,
these sorts of techniques would be beneficial. So I was kind of going down the line of,
have you had any sort of feedback from people out there in the wild,
or I don't know if any sort of statistics are collected from the
kinds of workloads people run and fed back, like, yeah, we're seeing
everyone's using worst-case optimal joins, everyone's got a social network as their
application and they just want to do triangle counting constantly. I don't know, yeah.
Yeah, I think for the default DuckDB workload, which is on-device OLAP, it's not super important.
But, I mean, there is DuckPGQ, and the goal is still to implement some of the techniques in DuckPGQ.
And as a side note, DuckPGQ adds graph query languages to DuckDB and also some other operators for things like path finding,
so it's very cool and enabling, and it's also very performant. So then, because people are using
this graph language, you would expect graph workloads, I mean, this is not a crazy assumption,
and therefore maybe also more cycles. So the idea of putting it
into an extension is very, very cool. The problem is that you probably still have to fiddle a little bit
with the hash join. So what do you do? Like, do you copy the whole operator and then replace it?
It's still not that easy to implement,
but I mean, I really hope that at some point
it will end up in DuckPGQ.
Nice, cool, yeah, so that's one future direction,
sorry, Paul, one direction you're going to go in
with this work next. But are there any other directions?
I mean, you teased there that there's some new, cool,
cutting-edge research you're doing as well,
so is any of it related to this in any way,
or is it different?
Yeah, I mean, this was my master thesis.
Now, as a PhD student, I unfortunately,
or maybe also not unfortunately,
have a completely different topic,
which is strings and string processing
in database systems.
So unfortunately I don't have that much time anymore for that.
I mean, I'm still doing some join stuff.
I'm trying to implement Bloom filter pushdown at the moment
in DuckDB. It's also not as easy as I thought, but...
It never is.
It never is, right?
So fingers crossed that this is done soon.
But now my main research
goal is to
basically do better string
processing in DuckDB,
because I think
strings are a very underestimated data type.
There has been
a Snowflake survey showing,
I don't have the number
directly, but I think it's like 20% of
all join keys
that are strings. So it's true,
people use strings as join keys,
people use strings to store numbers,
to store dates,
to store Wikipedia pages. Like, there's a lot of stuff in database systems that is
text, and now the idea is to basically use compression to first get smaller on the storage side,
but then also push it through the execution tree as compressed as possible, for as long as possible,
so that you don't have this huge memory footprint and those huge chunks of bytes
in your memory.
And these are techniques you're going to be
implementing in DuckDB as well, right?
That's the goal, yeah.
That's cool.
So, yeah, we can do this podcast
again when you finish that and publish that,
and we can come and talk about some strings as well.
Cool.
Yeah, so let's talk about DuckDB some more then,
and sort of your experience of working with it
and developing and writing code in its codebase.
So, yeah, what was your overall sort of experience
of using DuckDB and, I guess,
its extension model, or even just working in the codebase,
like, what was that like?
And what would your advice be to other people
who were thinking of
dipping their toes in and getting their hands dirty?
Yeah, I mean, before doing my master thesis,
I was more of a front-end kind of person,
so I did a lot of TypeScript.
And then...
You've come to the dark side now, yeah.
Literally, like.
So, diving into the DuckDB codebase
was quite hard for me in the beginning,
because if you don't really have an
understanding on an abstract level of what's happening, trying to
understand it from the code is impossible, I would say, or maybe not impossible, but impossible
for me. So in the beginning I really had to watch a lot of Andy Pavlo lectures on
YouTube to really understand the basics. I think this is the journey for everybody. Yeah. But then I think
DuckDB is really readable, in a sense, and there's also one resource which is called,
I think, DeepWiki.
Yes, DeepWiki.
It's basically LLM-generated documentation for open source projects, and DuckDB is in there.
And then you can basically ask questions like, how is the DuckDB string storage built up?
And it's so convenient, because it shows you what's happening and also shows you
which lines of code you have to look at.
So that's really a nice way to get started with DuckDB if you want to understand something;
it's highly recommended. And then at some point you start understanding, like,
you have to find your little spot where your breakpoint hits, and then
you're like, yay, I know what's happening, and then you can start changing things and try to
appease CI/CD. Yeah, so it's possible. And then, of course, it's very nice because
running it is super easy, because it's on your device, and this means developing
it is super easy as well, because it's on your device.
You don't have to have some Docker in between or whatever.
You can just build it and debug on your device,
and then you can debug it like any other piece of software.
A nice tight feedback loop there.
Yeah.
Yeah.
It definitely makes development so much nicer when you've got that.
No, that's really cool.
Yeah, I guess now I'd like to talk a little bit more about
kind of surprises and interesting lessons learned whilst working with DuckDB
and, yeah, on factorization
and worst-case optimal joins.
So has there been anything that sort of jumped out,
where you've kind of gone,
I really didn't expect that,
or that was a bit of a shock?
I think the systems part is always a little bit
concerning.
It's like, the moment you touch something,
it will touch something else,
and then you have a different state
of the whole thing, and then something else changes.
It's always cascading down
to places in the code where you wouldn't expect it to be.
And then DuckDB being a production system means that if you want to merge something, people might want to join on a struct of lists that contains lists that contain structs again, and suddenly you are in very weird corners of the code. So I think for research it's very important to keep it simple. For example, those join things, they worked on uint64 join keys, which is probably like 95%
of all edge tables
in graphs.
I hope nobody has
string edge tables.
I'm not...
Somebody will,
Paul.
Honestly, man.
If it's possible,
someone will have done it.
So yeah.
And then,
for doing research,
I think you don't have
to make something production-ready.
Like,
those are two different types of things.
You have to show that it works,
and then you just have to say,
okay, the CI will keep
being red and yelling at you,
and there will be a hundred tests failing, but it's fine.
So, yeah, that's the main thing, I think.
Yeah, it was fun.
I guess it'd be nice to talk now a little bit more about yourself,
as you're kind of the man behind the research and all, so to speak.
So, yeah, tell us, how did you end up working on this?
You said you were sort of a front-end guy before,
and now you've sort of worked your way into being a systems person.
So, yeah, tell us about that journey and how did you end up where you are?
Yeah, I went through the full stack, basically.
So what's going to happen next?
You're going to keep going lower down.
You'll be working on, like, semiconductors next, right?
So, yeah.
No, I mean, I did my bachelor in Germany at a university of applied sciences.
So that was very practice-focused.
I liked it, but I always missed being at a big university,
being at a research university,
so I went to Amsterdam for my master's,
and there the first class I had
was Peter Boncz's class about,
what was it, large-scale
data engineering
or something like that.
And there I met my
two best Amsterdam friends,
also very conveniently, they both work at MotherDuck now,
and so we all
got hooked by Peter through the lecture,
and we then decided to do our master theses
with him at CWI.
I think it stands for Centrum voor Wiskunde en Informatica, but my Dutch is also not great.
I'm glad I didn't attempt it, because that was a lot better than I would have managed.
It's a very nice place to do your master thesis.
It's highly recommended, because I had my own office, or, well, not my own, I shared it with other master's students,
but it was great to have a place to go for the thesis and not be alone in the library
where nobody cares about you.
You had meetings with Peter, meetings with Daniel, which were really helpful.
I really, really liked being there.
So I asked Peter whether I could stay a little bit longer, for four years, and he accepted that.
And so that's why I'm doing a PhD now, basically.
I just wanted to stay and keep studying a little bit.
It's a very nice way of working.
Yeah, definitely cool.
Very nice.
So we're nearing the end of the podcast now, Paul.
But yeah, before we do finish off, DuckDB
recommendation time. What's the one
plug-in, extension, or
feature that is your favorite thing about
DuckDB?
I think DuckDB in general is
such a nice system, being
on your device. I think it's such a simple
design choice, but it makes so much possible,
because suddenly, instead of having
a million CSV or Parquet files,
suddenly you have them all in one
place, and if you want to copy
all of your data, because you might
do something you regret later, you only
have to copy one file. You have joins across different files, you have all of these
very neat features. Since working on DuckDB, like, in the beginning I really
struggled writing SQL, and now all my Python scripts are full of 10, 15 lines of
SQL statements doing left anti joins and cool stuff. And I think one
recommendation is of course
DuckPGQ, as a very cool extension
and a very nice way
of showing what you can do
with DuckDB extensions, because the extension
ecosystem is something that's
very cool to have.
And there's a ton of others, like there's also
FlockMTL, it's
AI-focused, where
you can have AI scalar functions,
and that's from Amine.
And there's, yeah, of course,
others as well which are also kind of cool,
adding this whole other
dimension to
DuckDB.
So I like the
ecosystem and the people working on that.
So yeah.
Definitely. I keep trying to get
Amine to come on
to talk about FlockMTL,
but he keeps turning me down,
so maybe next time,
season three, I'll get him
to come on and tell everyone
about it.
But it's interesting what you're saying
there about
how your
Python scripts are now full of
SQL everywhere.
So do you think that sort of
time basically spent
inside the database
kind of makes you more
sympathetic to the way
to actually write SQL queries?
Is that kind of
what your
assessment of it would be?
I think maybe, like,
because I was just forced
to write a lot of SQL for my thesis,
that eased my
fear of doing it,
because it looks kind of odd to have those
SQL statements in your Python.
But it's so convenient, like,
you can output the
DuckDB result to a Pandas
data frame and then visualize it.
It's so cool how they integrated
that, and, like,
referencing remote files,
it just pulls the data in for you, yeah.
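For anyone who hasn't tried this workflow, here's roughly what Paul is describing with DuckDB's Python API; the file name is made up, and you need pandas installed for the .df() conversion.

```python
import duckdb

con = duckdb.connect()

# Query a Parquet file directly and pull the result into a pandas DataFrame.
# Remote http(s) or S3 URLs work the same way via the httpfs extension.
df = con.execute("""
    SELECT l_returnflag, COUNT(*) AS cnt, SUM(l_extendedprice) AS revenue
    FROM read_parquet('lineitem.parquet')   -- made-up file name
    GROUP BY l_returnflag
    ORDER BY cnt DESC
""").df()

print(df.head())
```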
Of course, you do have those SQL statements sitting in there; maybe some people say they're beautiful, but I'm on the other side.
Beauty's in the eye of the beholder, I guess, and we'll leave that one there, yeah.
They really take a lot of work away, so that's why I started accepting
them. Yeah, cool, that's very nice, very nice. Yeah, well, now it is time for the last
word. So what's the one thing you want the listener to get from our chat today?
I think, like, assuming researchers also listen to this,
and not only the DuckDB users,
I really recommend DuckDB for research,
because adding an extension can be very, very powerful
and also a way to add visibility,
or just to have something that's complete, that's usable,
because most research outputs are artifacts that, yeah,
end up on the shelf,
or are some random Python script
nobody can use because you have to install
Python 3.10.4 for
it, and
I don't have that on my computer. And also,
usually
not everything ends up in DuckDB main,
so maybe some things can end up in an extension,
like hopefully the worst-case optimal joins
at some point, and
then that's also fine. And you can
use this to demo things, you can use
it to really integrate your algorithms
in a system, because it's very different
having something like a new hash table layout in a real system,
because then suddenly you can run TPC-H,
like a proper benchmark, on it,
and you can't do that
if you're building your own C
program that struggles reading CSVs.
So yeah, there's a lot of infrastructure already there,
and it's nice to use it.
Nice, yeah, that's a good line to end on.
Thanks very much, Paul. It's been a great chat.
Thanks for coming on the show.
Thank you.
