Disseminate: The Computer Science Research Podcast - Recursive CTEs, Trampolines, and Teaching Databases with DuckDB - with Prof. Torsten Grust
Episode Date: October 16, 2025

In this episode of the DuckDB in Research series, host Dr Jack Waudby talks with Professor Torsten Grust from the University of Tübingen. Torsten is one of the pioneers behind DuckDB's implementation of recursive CTEs.

In the episode they unpack:
- The power of recursive CTEs and how they turn SQL into a full-fledged programming language.
- The story behind adding recursion to DuckDB, including the USING KEY feature and the trampoline and TTL extensions emerging from Torsten's lab.
- How these ideas are transforming research, teaching, and even DuckDB's internal architecture.
- Why DuckDB makes databases exciting again, from classroom to cutting-edge systems research.

If you're into data systems, query processing, or bridging research and practice, this episode is for you.

Links:
- USING KEY in Recursive CTEs
- How DuckDB is USING KEY to Unlock Recursive Query Performance
- Trampoline-Style Queries for SQL
- U Tübingen Advent of Code
- A Fix for the Fixation on Fixpoints
- One WITH RECURSIVE is Worth Many GOTOs
- Torsten's homepage
- Torsten's X

Hosted on Acast. See acast.com/privacy for more information.
Transcript
Disseminate the Computer Science Research Podcast, the DuckDB in Research series.
Hello and welcome to the DuckDB in Research podcast.
In today's episode, we are going to be talking about recursive common table expressions,
that is, recursive CTEs, and their implementation in DuckDB.
And I'm really happy to say that I've got the expert on recursive CTEs with me today,
and that is Torsten Grust.
Torsten is a professor of computer science at the University of Tübingen.
So yeah, welcome to the show, Torsten.
Hi, Jack. I'm really feeling flattered about the invitation to be part of the series.
I'm sitting here in my DuckDB t-shirt, ready to talk DuckDB for the coming hour.
So I'm happy to be here. Thank you.
Let's do it. Fantastic.
So the first question, though, the customary question on the show is for you to tell the listener a little bit more about yourself and your story and how you got into databases and how you got working with DuckDB.
So, yeah, let's take it away.
Okay.
So I guess I'm one of the older guys that have appeared in your series here.
I'm actually a professor in database systems
and I started as a PhD student in database systems in 1994
at the University of Konstanz, on Lake Constance,
very beautiful scenic area in Germany.
And since then I've been working in databases all the time.
So from 1994 until this day, I'm researching databases,
in particular query optimization, query processing.
I became a professor in database systems in 2004, and then I spent time at several other universities, also at TU Munich, where Alfons Kemper and Thomas Neumann are located.
Actually, Thomas is my successor. He's sitting in my office, in a sense. Everybody listening to this podcast probably knows Thomas. He's just a genius.
And I joined Tübingen in 2008, and I've been here ever since, feeling great here, doing database research since then.
And, well, recently we have switched lots of our efforts to targeting DuckDB,
and I'm happy to talk about this here.
Fantastic. Cool.
So, yeah, like I said at the top of the show, we're going to be talking about recursive
CTEs today.
But I feel iterative SQL and recursive CTEs are among the more advanced topics in SQL.
So for the listener who's maybe not familiar with what they are,
give us a primer, I guess, of what CTEs are and then kind of what recursive CTEs are on top of that.
Okay. So CTEs, common table expressions, actually come in two flavors.
There's the simple, the straightforward flavor, which is the non-recursive CTE, which allows you to name, to assign names to intermediate results that you've computed in your query.
And then later on, in subsequent part of your query, you can refer to that particular name, to that intermediate result and use that one time, many times, potentially to build on that.
and then build up complex queries, compute complex results from these intermediate results.
It's a, I would say it's a software engineering tool that lets you master and lets you tame complex queries by naming intermediate steps
and then referring to them again and again.
It's just, I think using CTEs to structure very complex queries is just good practice, good daily practice.
And I think these are well accepted everywhere out there.
And then there is the sister concept, the recursive common table expression, which really is a different beast.
It really adds expressivity to what you can actually do with the SQL query language.
It really, it's a major step forward.
It's a really, it's a quantum leap for the query language because with these recursive CTEs, SQL becomes a full-fledged programming language in a sense.
It has full expressive power, just like your general purpose programming language.
which everything can be done in SQL.
Whether everything should be done in SQL,
that's probably a different question.
Because recursive CTEs look awkward to many of us.
They tend to be somewhat verbose.
They have the bad rep of being inefficient for evaluation.
Maybe you can talk about that.
But it gives you the ability to iterate inside your query to iterate
until you reach a fixed point or a stable result.
And that's something that's really unique to recursive common table expressions,
something that you cannot do with regular SQL queries.
They were added to SQL in 1999,
have been around for 25 years,
and are still among the more peculiar, exotic features
in the query language, for sure.
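(As an illustration, not from the episode: a minimal sketch of the shape of such a query, a recursive CTE that counts from 1 to 10 and stops once the iterated part produces no new rows.)

-- Minimal recursive CTE: count from 1 to 10.
WITH RECURSIVE numbers(n) AS (
    SELECT 1            -- base case: seeds the iteration
    UNION ALL
    SELECT n + 1        -- iterated part: sees only the previous iteration's rows
    FROM numbers
    WHERE n < 10        -- without this condition, the iteration would never stop
)
SELECT n FROM numbers;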
Yeah, so what would be kind of a common use case
if I was a user of a database?
When would I be reaching for such a recursive CTE?
What would be sort of the context, or an example of when I'd think,
okay, that's really what I need to be doing here?
Yeah, the vanilla, the textbook
example of a recursive query would be to traverse a tree-shaped data structure.
So if you have tables in your relational database, and if you look at the columns and
how they are connected via keys and foreign keys, if you're in fact modeling a hierarchical
or tree-shaped structure, then a recursive CTE allows you to identify a root or a starting
node in that particular tree structure. And then in each iteration of your query,
advance from node to node: from the root
reach all the child nodes, from these child nodes
reach all the grandchildren. In each iteration
you will explore this hierarchical structure
further and collect more nodes
probably matching defined criteria
to really explore a tree-shaped
or hierarchical data structure of arbitrary depth.
If you cannot iterate, then you are
confined to exploring your data structure to a defined
depth. You can join, of course, your table with itself to explore the tree structure,
but such a join will be a two-fold join or three-fold join or four-fold join, which limits
the depth to which you can explore the structure. Recursive common table expressions remove that fixed
limit, and allow you to traverse as deeply as you must, as you need to.
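(A sketch of such a traversal, assuming a hypothetical table employees(id, name, manager_id), where manager_id is a foreign key referencing employees(id).)

-- Traverse an org hierarchy of arbitrary depth, one level per iteration.
WITH RECURSIVE subordinates(id, name, depth) AS (
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL              -- the root of the hierarchy
    UNION ALL
    SELECT e.id, e.name, s.depth + 1      -- reach the next level of children
    FROM employees AS e
    JOIN subordinates AS s ON e.manager_id = s.id
)
SELECT * FROM subordinates;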
Now, kind of on this, and also what you said a second ago about the real expressive power of SQL and how you can
really do anything in it: I know that at the university you solved all of the Advent
of Code, the 2022 edition, just using pure DuckDB SQL, right? So maybe you can speak to
that for a second as another example of what things you can do with recursive CTEs.
Yeah, so Advent of Code, as probably most listeners have come across, is a collection of 25 programming
puzzles. Each of these puzzles is posed on a day of December, so it's an advent
of programming puzzles, and from day to day, these typically increase in complexity
and the challenge increases, but these are programming puzzles that often call you to either
traverse tree structures or traverse graph structures or build up graph structures or to perform
some optimization task. And this iterative nature of the recursive CTE is just perfect
to whatever, identify a starting and an end node and a graph structure.
And then in each iteration, start from these nodes, explore the graph structure, discard particular
nodes that will lead you to cul-de-sacs, explore other nodes, keep iterating over these
nodes in the coming iterations, and then explore complex data structures, collect interesting data
on your way to solve these programming puzzles.
Many of these, I think, I'm looking at the 25 days of Advent of Code from 2022, I bet 80%
of these required recursion or iteration
to explore these.
Yeah, yeah, that's fantastic.
Yeah, and I guess that's a really good example
for any listener who wants to go and see
what a recursive CTE looks like in DuckDB,
that would be a really good place to go
and have a look and have a play around.
Cool.
So, yeah, you also mentioned a second ago
about there are certain
shortcomings, possibly,
or like limitations or restrictions
of SQL's recursive CTEs.
So before we get stuck into kind of DuckDB's
recursive CTEs, let's talk a little bit more
about the shortcomings of these in
general. Yeah. So when you perform query evaluation in an iterative fashion, parts of your query
are evaluated again and again. Of course, you have the problem of infinite iteration. So when
will the iteration stop? And to make sure that these, or to at least have some guarantees about
when such an iteration will stop, the inventors of SQL, in the SQL 99 definition,
imposed restrictions on the iterated part of your query.
The iterated part has to come to a defined end,
the so-called fixed point of the computation.
I won't define what that means here.
It's a proper abstract algebraic concept here.
But there are syntactic restrictions imposed on CTEs
that really make it hard to use iterated CTEs or recursive CTEs
in, let's say, creative ways. You have cool ideas, you want to express these using your CTEs,
boom, you hit one of these syntactic restrictions, a wall that has been set up probably for good
reason, but it limits what you can actually do in CTEs. And this often leads to rather
awkward syntactic workarounds that make these queries look awkward, complex, hard. It probably is
one of the reasons why recursive CTEs have a bad reputation,
in a sense. That's a usability deficiency, I would say. And there are efficiency
concerns that led the inventors, the definers of SQL 99, of the original recursive CTE standard, to
say, hey, when you iterate, your iterated query can only see the results that
you have produced in the immediately preceding iteration. You might be in the 100th iteration,
but you can only see the results
that you've produced in the 99th iteration.
Everything that has been computed before is lost.
Well, it's not actually lost.
We collect that for you.
And later, when the computation is done,
we give it back to you.
But during the iteration, you cannot see it.
It's sort of short-term memory for this.
Yeah.
And it's for a good reason.
Looking only at the results
from the last iteration means looking at very little data,
as opposed to the potentially huge log
of intermediate results that you have collected
in all the prior iterations.
So these were efficiency concerns
that led to this short-term memory semantics of recursive CTEs:
you can only see what has been produced in the preceding iteration.
And this leads query authors to play dirty tricks like,
in my query, I will collect a huge array of stuff
that I've produced in the earlier iterations
that is an ever-growing array
that will collect and aggregate
all the stuff that I've seen
during all my whatever tree traversals,
graph traversals, and I will pass that on
from iteration to iteration as a replacement
for the short-term memory in a sense.
And that really leads to space leaks,
really super huge arrays
that you pass from iteration to iteration
slowing down the whole iteration process.
And it's really, that's an idiom.
It's an anti-pattern, I would say,
to carry all this baggage with you
to remember what you've done in the early iterations.
It's really unfortunate, yeah.
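(A sketch of the anti-pattern described here, assuming a hypothetical table edges(src, dst): the iterated part drags an ever-growing visited list along purely to remember what earlier iterations have already done.)

-- Anti-pattern: carrying the whole history in a list column.
WITH RECURSIVE walk(node, visited) AS (
    SELECT 1, [1]                          -- start node and its history so far
    UNION ALL
    SELECT e.dst, w.visited || [e.dst]     -- the list grows with every step
    FROM walk AS w
    JOIN edges AS e ON e.src = w.node
    WHERE NOT list_contains(w.visited, e.dst)
)
SELECT * FROM walk;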
Yeah, you're going to really want to bring with you
the things you actually need, right?
But it's, okay, cool.
But given that then,
so I think we've set the scene up there,
perfectly there, Torsten.
So let's talk about the DuckDB implementation
of recursive CTEs.
And obviously, you guys,
you guys at the University of Tübingen were behind that.
Maybe we can set the scene of kind of like how that came about, how you decided, okay, we're going
to be the people to add this into DuckDB, and just set the story there to start off with. Okay, so
I saw the very first glimpse of DuckDB at SIGMOD 2019 in Amsterdam, when Hannes and Mark
were showing the first prototypes of the system, and it looked super interesting, very promising to
us. I was in Amsterdam at that point, and my now postdoc Dennis
was with me; we were looking at that system
in real awe.
This is something, this is something
new, an in-process database management system,
could be really easy to set up,
would be an ideal teaching tool, great.
And then after the conference,
I think on Twitter,
I communicated, I connected with Hannes Mühleisen,
the CEO of DuckDB Labs.
And I said, oh, Dr.B is great.
It's great.
When will you add recursive CTEs?
because recursive CTEs were missing at that early point.
And then he responded on Twitter saying, oh, probably never.
And I was super disappointed.
I was super disappointed because in teaching advanced concepts of SQL,
recursive CTEs play a role, because they really define that quantum leap of expressiveness in the query language.
Teaching SQL without having access to iterative CTEs would have been hard for me
to imagine.
So after that disappointing response on Twitter,
here in Tübingen, Dennis set out to write the very first cut at an implementation of
WITH RECURSIVE.
And already back then, we really were delighted to see how clean and how well-structured the
DuckDB code base was.
So adding that to DuckDB was actually a manageable effort.
And it has been in DuckDB since, let me check, since version 0.1.5.
Yeah.
That's when we added recursive CTEs.
In the SQL 99 style, with all these short-term memory deficiencies and all the syntactic restrictions that I was talking about.
But since then, recursive CTEs have been part of DuckDB.
So we added a new plan operator that internally has the ability to iterate a subplan
of your query plan,
which is actually needed,
obviously needed,
to implement the CTE semantics,
and since then
we have provided bug fixes
and improved the CTE implementation
since version 0.1.5.
Yes. Nice, yeah.
Were there any other modifications that needed
to be done there, other than adding a new plan operator?
You said that it was the way that
the code is structured that made
adding things in really nice, but were there any sort
of breaking changes that needed to be made
to support WITH RECURSIVE?
Well, what you need is the ability to reset a subplan, to let it run again, in a sense,
because upon the next CTE iteration there's a need to restart a subplan.
That's something that we had to work on. You have to have this basic functionality when you have
a nested loop join in your database system, but we had to work on that particular plan
feature, plan execution feature, to reset
a plan and evaluate it again. Actually,
until this day, that's a bit of a pain point, because
resetting all the plan data structures associated with a
subplan, restarting a plan
anew, in a sense, is a somewhat costly
operation that we,
until this
day, try to optimize, make
cheaper, and try to work on here
in Tübingen. Yeah.
Very nice. That's a nice segue into my next question then.
So obviously, given that the first
implementation was going by the standard, by the SQL 99 standard, with all the limitations in
there, since then it's been very fertile ground for research for your group. So I mean, there's a few
places we could start here, I don't know if you have any favorites, but maybe we should start off
with the USING KEY feature and how that has built on the existing
implementation. Yeah, yeah, okay. So the USING KEY feature: USING KEY is a modifier
that you can add to the WITH RECURSIVE CTE construct in SQL now.
And we have very recently added that to DuckDB 1.3.
So since 1.3, that's not a research idea or whatever, some whiteboard pipe dream anymore.
But it's actually implemented and tested and bugfixed and so on in version 1.3.
We have blogged about that.
Maybe we can provide a link to the blog post associated with it.
That would be wonderful.
And we now see on the Discord that some DuckDB users have
picked that up. I've seen LinkedIn postings
of people saying, oh, have you seen
that using key feature, it's something
that's cool, and it's nice to see
that this is being picked up.
So what does using key mean?
So it's an optional modifier
that you can attach to your with recursive
CTE, and when you do that,
the behavior of the
iteration process changes
slightly. So
in the original CTE,
as I said, in the
background, the database system is
collecting a record, a log of all the intermediate results that all the iterations have produced.
But this log, this record, it's kept in the so-called union table, and over time it might grow super large.
That's why the original CTE definers said, no, we cannot let you access this potentially super large table.
This would inhibit efficient implementation and execution of the iterated CTEs.
You cannot have that. You have short-term memory.
So in USING KEY, we are not just appending or adding to this ever-growing
log of information. Instead, we define key columns, and when you see in some iteration
a key combination, a key value combination, that you have seen before in any of the iterations,
then the new information is used to overwrite, under this particular key combination,
the information that you have recorded before.
So if in any iteration you have the key value of 100
and you've remembered something under the key value 100
and 200 iterations later,
the key 100 comes up again in your computation,
then you have the ability to update,
to update that particular value saved under the key 100
to store whatever an augmented and up-to-date version
of the information that you want to save under key 100
at this particular point.
This means that this record of results that the DBMS keeps for you
is not an ever-growing heap of stuff,
but you have the ability to control, to selectively overwrite and replace old information
with new information.
That keeps the recorded information in check in general; the size of the recorded information
is kept in check, and it's kept so much in check,
and it's so much smaller in size,
that we now can allow you
in the iterated query
to see what you've done
in early iterations.
You can see what you've done.
There's no need anymore
to, whatever, collect large arrays of stuff
to remember what you have done.
The system has the ability to give you,
hey, this is what you have seen so far.
These are the most recent values
that you've saved under these key values,
Key 100, for example.
And you can, based on that,
you can whatever steer your computation
decide that the iteration is good enough
and then stop the iteration or go on again and so on.
So now you have the ability
to see what you've done before
and this works because we are not
only ever adding to a pile of stuff
that would probably be too large by the 1000th iteration.
That's using key.
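(Roughly, a USING KEY query has this shape. This is a hedged sketch assuming a hypothetical edges(src, dst) table; see the DuckDB blog posts linked above for the authoritative syntax and semantics.)

-- Which nodes are reachable from node 1?
-- With USING KEY (node), a node that is reached again does not append another
-- row: it overwrites the single row stored under its key, so the result keeps
-- one row per node instead of an ever-growing log.
WITH RECURSIVE reachable(node) USING KEY (node) AS (
    SELECT 1
    UNION
    SELECT e.dst
    FROM reachable AS r
    JOIN edges AS e ON e.src = r.node
)
SELECT * FROM reachable;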
Nice.
The question there, so obviously the way you were describing that,
it was very much an overwrite of the value under the key.
Is there any way of expressing more complex semantics of what you want to do?
I don't know, merge the keys or sum the values or whatever you want.
Is that possible as well, or is it very much at the moment just an overwrite?
Maybe you don't want to do that.
I don't know.
But yeah.
No, no, no.
That's very cool.
Very cool question.
Actually, so because you have access to the current state saved under, say, key 100,
because you have access to that, you can use that currently stored value
and, whatever, add to that, augment that, maintain that particular value, and add on top of that.
So that's actually already possible.
So it's not just a plain replacement.
You can refer to the currently stored value and whatever.
Add to that aggregate on top of that.
Decide that the old value is better than, or good enough compared to, the current value that you've computed,
anything like that.
If you find that your query is, whatever, taking the value that you have already
and then, whatever, performing an addition,
summing onto the old value,
then we will probably, in one of the coming DuckDB versions
or variations of using key,
we will have a modifier to the using key construct
that allows you to say,
hey, if there's an old value under the key
and here's a new value, you add them up for me,
sum it up for me, so you don't have to manually maintain
the particular entries under these key values,
under the key 100; the system will,
whatever, automatically
do the summing for you.
But that's an extension
that we are considering. Currently, you would
have to use the old value,
add the new value on top of that,
and then store that as
the new value under key 100.
Maybe something to the listener to look
forward to in future versions of DuckDB then.
Cool. Do you have any idea,
obviously, with adding this
feature, it sort of broadens
the possibilities and maybe makes
types of queries possible that weren't possible before.
But do you have any idea on, like, benchmarks and numbers of how much speed-up
this gives, or anything like that?
Yeah, yeah, yeah.
So this is about speed-up, but it's also about space savings.
Because we were very interested in benchmarking: how large will this key store, this key value
store, this associative dictionary, whatever you like as a term for that,
how large will that grow?
And of course, we wanted to be significantly smaller
than the pile of stuff that the database would be maintaining in the back
with the vanilla recursive CTEs.
So we've implemented a number of classical algorithms,
some of these graph algorithms,
to see how the using key variant would fare regarding space usage
and also regarding time.
So we have, for example, implemented a connected components algorithm
where each node, depending on its neighborhood
and the reachability of other nodes,
is assigned to one particular component of the graph.
And if you know the connected components algorithm,
in the beginning, every node is located in its own single component,
but if you iterate this particular connected components algorithm,
then the component that the node belongs to is going to be updated.
That would be a wonderful example for this updating behavior of the using key,
The key in this particular approach would be the node ID itself
and the value that you store under that particular key
is the component that this particular node belongs to.
And every node is only belonging to one particular component.
There's no need to pile up stuff of old, irrelevant information.
What you need is the current, the most current component
that your particular node is belonging to.
So we have written a paper;
we have shown the connected components
formulation using USING KEY
in this particular paper, and it's just,
it's almost a one to one
match to the imperative
textbook style connected components algorithm
that you can see in your data structure's book.
It's very nice to see
how very elegant
and very readable this using key
formulation of the duct debate
query becomes
and regarding
space savings
and time savings.
So we have run this on
some larger graph with
thousands of nodes, tens of thousands of
edges, where a vanilla WITH
RECURSIVE formulation of the connected
components algorithm would have
amassed intermediate
result information, most of it
obsolete, in the gigabyte
range, multiple gigabytes of
intermediate information, while
the WITH RECURSIVE, now the USING
KEY variant, excuse me,
the USING KEY variant, uses
a whopping 230 kilobytes.
And in the end
we are by a factor of
270 faster than the vanilla
WITH RECURSIVE implementation.
That's a pretty good speed-up;
that was very nice to see.
We have other algorithms,
like all-nodes shortest paths:
from each node to each other
node, what would be the shortest path to take?
And again,
in some of these algorithms each
node saves routing information.
So if I'm node X, I want to go to node Y.
What would be the next hop, the best hop to go to, to reach node Y from X?
And that's information that is going to be updated.
It's iteratively refined.
It's going to be updated as the algorithm runs.
And again, that would be a perfect example of such a using key use case.
And also there, the vanilla WITH RECURSIVE
really very quickly runs into out-of-memory problems,
while, on large graphs,
we can run these all-nodes shortest path queries
in below one second,
where the vanilla recursive query uses 400 seconds or suffers from OOM.
So there's actually some efficiency,
some space and runtime efficiency to be had,
and some readability improvements.
These queries look nicer, and that's as a researcher and also a teacher who tries to convey
these algorithms in SQL style to students, that's also a big win.
Yeah, ticks all the boxes there, then.
And I think that last point there that you mentioned has historically been undervalued,
and how important that is, I think, as well, having that really nice expressiveness.
Like you're saying, this is probably one of the impediments as to why these types
of SQL queries are not as prevalent as they might be otherwise: the readability aspect
of them as well. Cool. So the other thing that I want to touch on, and it's got a brilliant name,
this one, one of your proposals for new CTE variants, and it's trampolines. So can you tell us
what trampolines are? Okay. So actually, in the meantime, we have, I think, like three or four
different flavors or variants of with recursive or iterative queries on the table. There's one,
I will come to trampolines in a second. Sure, yeah, yeah. Another variant I'm very fond of, and that's
the time to live or the TTL variant where the query can decide,
hey, this is information that I've just computed.
I guess it will be relevant for, say, the next four iterations.
After that, it can be discarded.
So you set the time to live for that particular information
that you've just computed before.
And then after four iterations,
that particular row will be forgotten
and will never, whatever, clock your memory again and so on.
There are also some very nice algorithms that have some
moving window that they move over some large data structure. You know that the window has a
particular, certain size; everything that has left the window will not be relevant anymore. You set
the TTL of that particular information, the row that you've computed, accordingly, and the TTL variant
of WITH RECURSIVE will forget everything in due time. It's also very nice. TTL has not been
implemented in DuckDB yet, but, well, this is on the whiteboard
here in one of our offices.
So it's probably being worked
on as we speak here.
And then there's a trampoline version.
And I'm very fond of that because
it's a collaboration with Thomas Neumann.
The Thomas Neumann I was mentioning, at the
Technical University of Munich, who is
a great guy and a genius in my
eyes. And
these trampolines, actually
they are a construct
that has its origin
and roots in the
programming languages world. Where you
want to efficiently implement recursive functions without whatever filling up your stack.
Okay.
So that's actually a recurring theme in the work that we do here in Tübingen: we try to look
for, and then steal and adapt, the best ideas that the programming language and compiler
folks have, in particular functional programming folks, and try to massage them and then reuse them
in the query language's context.
And the trampolines are one particular aspect of that.
As I said, originally they were invented to have efficient calling schemes for highly recursive functions.
But in the query context, we are using them to define iterative queries where the iterated part has multiple branches of computation that can be active in every iteration.
So in one iteration, the first branch might be relevant.
In the second iteration, the second branch might be relevant.
In the third iteration, the first may be relevant again.
And in every iteration you can invoke different parts of the logic,
perform different parts of computation.
That's actually different from a normal WITH RECURSIVE CTE,
where you have one monolithic query part that is iterated again and again.
In a trampoline query, you can identify individual branches, and only some of these,
maybe all of them, but normally only some of these, are activated in each iteration.
So when you produce an iteration result, every row is tagged with the branch
it is subjected to in the next iteration.
So hey, row, you are meant to be processed by branch three in the next iteration.
And in the next iteration, that particular row will be routed to branch three,
which will do some cool stuff with that.
And other rows might have different fates
and may be routed to different branches,
and that leads to very cool
means to implement all kinds of stuff.
It's also the basis for our own research
on implementing iterative,
PL/SQL-style UDFs.
You can take such imperative programs
that have loops and branches
and whatever, all this good stuff,
and you can map these very naturally to trampolines,
to these iterated trampoline queries,
and branching inside the imperative
code would then
correspond to selecting the proper
branch of your iterated query.
The correspondence is really one to one
from iterative
UDF style code to these
trampoline style iterated SQL queries.
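(Trampoline-style CTEs are a research proposal, implemented in Umbra but not yet in DuckDB, so the following is only a rough sketch of the idea written as a plain recursive CTE: every row carries a branch tag recording which piece of logic produced it, with CASE expressions standing in for the dedicated branches of a real trampoline query. The example iterates the Collatz rule starting from 27.)

-- Each row records which branch (halve vs. triple) produced it.
WITH RECURSIVE collatz(step, n, branch) AS (
    SELECT 0, 27, 'start'
    UNION ALL
    SELECT step + 1,
           CASE WHEN n % 2 = 0 THEN n // 2 ELSE 3 * n + 1 END,
           CASE WHEN n % 2 = 0 THEN 'halve' ELSE 'triple' END
    FROM collatz
    WHERE n > 1                            -- stop once the sequence reaches 1
)
SELECT step, n, branch FROM collatz ORDER BY step;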
And we are currently developing
compilers for
imperative, PL/SQL-
style UDFs to these
trampoline recursive queries,
and that might be
cool; maybe this opens
the avenue to have real PL/SQL-
style UDFs inside database systems
that have a facility
for CTEs but never had a PL/SQL-
style interpreter. And DuckDB
is among those systems, as you probably know.
So in secret,
nobody is supposed to know this,
but in secret we are working on an implementation
of PL/SQL-style
UDFs here in Tübingen for DuckDB, based on these trampoline-style UDFs,
trampoline-style recursive queries, I'm sorry.
Fantastic, yeah, we're breaking the news here first.
The cat's out of the bag now, Torsten, that's it.
That's fantastic.
Cool.
So, yeah, I want to talk a little bit about impacts in the next section of the podcast.
Do you have a handle on how much impact your work on recursive CTEs has had in DuckDB
in terms of usage and feedback?
And, yeah, what has that looked like for you over the last, of course,
I guess three or four years, right?
I mean, how long ago was SIGMOD 2019, right?
So, like, in the last six years or so, I guess, five years, yeah.
I think as of today,
DuckDB has one of the more efficient implementations
of recursive CTEs out there today.
I think it has gone through so many iterations,
in a sense, implementation-wise,
that it really is a production-ready feature.
And if you're out there reading whatever blog posts
and seeing other research on recursive CTEs
and iterative computation in SQL,
then you can see that DuckDB is one of the platforms now
where people try things out.
Whether DuckDB and recursive CTEs have found their way
into production industry use,
I have no data on that, because people tend not to
post or blog about that.
But I've seen,
out there,
that DuckDB has become
one of the go-to platforms for complex SQL
and iterative and recursive SQL by now,
and that's very nice to see.
Regarding the USING KEY feature,
it's super brand new,
it's like three months in the wild now,
and I've seen on Bluesky
and on LinkedIn several blog posts where people said,
oh, this is the game changer.
I had written my clunky WITH RECURSIVE CTEs that were inefficient or hard to read,
I've rewritten them and, well, yeah, it's a real step forward.
But USING KEY, really, that's still in its infancy.
So no hard facts about the uptake there, in a sense.
Right, yeah.
I just want to pull on that for a second.
You mentioned other systems, and we should probably talk about those briefly as well.
Like, how DuckDB compares to them, like how much more advanced
or more efficient DuckDB's implementation is versus what's out there and what's available
in other systems.
Yeah.
What's interesting is that there are other relational query languages that have been specifically
geared to evaluate recursive queries, Datalog, for example.
So there are dedicated implementations of Datalog, and Datalog, that is, has been built
to evaluate recursive queries, and there are implementations
out there, very recent implementations of Datalog, one is called Soufflé, and a recent research
paper suggests that DuckDB can keep up with this dedicated system that has been built
to iterate and to evaluate recursive queries efficiently. DuckDB can at least keep up with that
and even beat it for certain types of queries. So we are quite proud of that. And further research,
further work on the efficient resetting of subplans and so on, will
hopefully get some
performance advantage for
DuckDB, further advantages, if it all
goes well. So I
think we are on par
with many of the systems out there.
Of course
there's the elephant in the room. There's PostgreSQL.
I would say that
the DuckDB implementation of
recursive CTEs
has some major
advantages over the PostgreSQL variant
in the meantime. We have fewer
syntactic restrictions there,
and very efficient evaluation.
I think that using recursive features or queries in DuckDB
is just more fun now, because we got rid of many of these syntactic restrictions
that hold you back, hold you back from exploring, from being creative with these iterative CTEs.
Yeah, you've liberated us all.
Yeah, free to run, yeah, go wild now.
Cool, yeah, so I guess summarizing then on the future of where you're going now,
there's loads of different angles.
So what can we sort of expect over the next year or so,
I guess, if you were going to summarize?
Yeah.
So regarding USING KEY,
there's some performance improvement that we are currently working on.
So if you look at the typical body of such an iterated,
of such an iterated query,
then what you see is that these iterated queries
look at what they have computed last
and then join that with information or whatever
about the tree or graph structure that you would like to traverse.
You look at what you've computed last:
oh, that's where my traversal has brought me so far.
Then you join that with the graph or tree table
to see where you can go next.
And because this is such an idiom
to join the intermediate result of your CTE
with some existing table,
we are specifically looking at improving
and optimizing these joins.
So in USING KEY, we hold the key values, the information about the key values that you have seen and saved so far, in a hash table.
And that hash table can immediately serve as one of the sides of a hash join implementation.
So we can save half the work of implementing this join with the tree or graph structure.
We can save half the work
by just reusing the native
USING KEY data structure
and using that as one half
of this hash join implementation.
This should bring another really measurable speed-up.
We are very keen on measuring that
and benchmarking that.
It's something that we do.
And we will generalize using key,
as I said, by giving the system
the ability to maintain
the values under known keys
for you, to do, whatever,
max, min, sum, average computations automatically for you. We will of course try to add trampolines to
DuckDB; they have already been added to Umbra, which is Thomas Neumann's research-style system,
and it's fun to play with them there. We want to have trampolines in DuckDB as well. And then maybe,
I don't know whether we can add the TTL variant. That's quite a laundry list of stuff. Yeah,
it's a never-ending, it's a never-ending, interesting pool of cool
ideas, this working with
DuckDB,
something
we'll try in the next few months,
of course.
Yeah.
And then this Duck, this Duck,
this DuckDB-based
PL/SQL-style UDF thing.
This is something we definitely want to try,
but this is something
for a larger
horizon, maybe the next two years or something.
Yeah, very nice.
What a good thing. A lot of cool things to look forward to
that's for sure.
Nice. Let's change.
I want to talk a little bit about
your teaching with DuckDB.
You mentioned it earlier on,
about how it's become
basically the way you teach
databases at the University of Tübingen.
So tell us a little bit more about that
and what that experience has been like,
what prompted the switch,
and how you have found it
from kind of before teaching students
without DuckDB to now using that
and how that has changed
outcomes and student experience
and those things as well.
Okay, cool.
I'm glad that you find the time
to talk about this here.
That's great.
It's of course close to my heart.
Before DuckDB, all our database-related teaching efforts were based on Postgres.
And that worked well for many years.
PostgreSQL is an open-source system, and that's needed for the particular type of course
where you want to inspect database internals.
How does a buffer manager work?
How does query planning work, and so on?
And many of the textbook-style algorithms out there you find implemented in PostgreSQL.
That's nice.
But it's equally nice to see that DuckDB is actually incorporating all the cool, recent, hot
research results that you find
in papers, like adaptive radix
trees and vectorized
query execution, pipelining,
morsel-driven parallelism,
which are really the new,
cool stuff that you find in research papers.
You find all of these almost
by the book, or by the paper,
implemented in DuckDB, and that's very
motivating for students. If you look at Postgres,
you find B+-tree techniques
implemented from the 1980s.
If you look at DuckDB, you find techniques
implemented that have been proposed
in, like, the last, whatever, five to eight years or something. Very motivating for students. That's
one thing. So we need an open-source database system to do particular aspects of teaching: we have to look
inside stuff, we have to, whatever, insert or throw in debug statements, switch stuff on and off,
to allow students to play with the stuff. And installing Postgres can be a real
nuisance, can be really hard, especially on the Windows
platform, for students who are not very inclined to, whatever, fiddle at the terminal and so on.
Installing DuckDB, as we all know, is pip install duckdb, and that's it.
And that's an advantage.
That's not to be underrated in the early years of computer science curricula,
to have a very pleasant install experience that's consistent across macOS and Linux and windows and so on.
It's not to be underrated.
The Python API, Python would be a language that's
known to many of our younger students,
and this tight integration
enables them to do
cool stuff with DuckDB in, whatever, the second
week of the semester or something.
It's really refreshing. That's cool.
DuckDB's SQL
dialect is super
extensive by now. It's friendly,
but it's also very extensive.
That means it's a, it's,
DuckDB is the ideal vehicle for a
course that I call Advanced
SQL, which is something that you
would probably hear after a
course on introduction to relational database systems
or something. Advanced SQL really focuses on the,
on the corners, on the dark corners
and the interesting corners of SQL: window functions
and of course recursive CTEs and so on.
And DuckDB is so rich in implementing all kinds,
and very efficiently, all kinds of window functions,
all the nitty-gritty details, its recursive CTEs,
that this has become the ideal vehicle
for advanced SQL teaching here at the University of Tübingen.
That's a course where you don't learn about SELECT-FROM-GROUP BY,
but all the other cool stuff that you only find
if you flip to page 300 of the SQL textbook.
And it's very cool.
Students start to implement, whatever, with these window functions and recursive CTEs:
finite state machines, k-means clustering, cellular automata,
optimization problems, Sudoku solvers,
all kinds of stuff that we do in Advent of Code style, in a sense, in a lecture.
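(As a small taste of that page-300 material, not from the episode: a window function computing a running total over a hypothetical orders(day, amount) table.)

-- Running total of order amounts, ordered by day.
SELECT day,
       amount,
       SUM(amount) OVER (ORDER BY day) AS running_total
FROM orders;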
Have you felt that students have been more satisfied
and more interested in databases after the course,
given that they're, like, using DuckDB versus using, say, Postgres?
Like, have you found, like, I don't know,
by the end of the course previously it was,
like, oh, God, I never want to see a database again,
but now they're like, they're enthused
and they're interested in databases
and kind of want to pursue it at Masters, PhD, or whatever?
Yeah, definitely.
There is this saying, actually, in the database research community
that goes: would you give a database system
to your best friend as a birthday present or something?
And of course, the answer is no,
because it's a maintenance nightmare
and, like, whatever.
Installing many of these
off-the-shelf database systems out there
is a nightmare.
Fiddling with DuckDB is just a pleasure,
and students have a sense
that this is a system that is moving fast,
that development is going rapidly,
that it's a young system,
it's super efficient.
Seeing good benchmarks for the DuckDB system
really motivates students to take up on that
and keep working with it.
even after these courses.
We have a very active influx of students
who want to do their,
whatever, bachelor's or master's theses with us,
and I'm very happy to see that.
DuckDB makes,
well, yeah,
these database systems, in a sense,
sexy again. It's really, it's really nice.
And then there's all this cool gear.
You see all kinds, lots of students
carry around their laptops now
with DuckDB stickers here on the campus and so on.
It's really cool
to see. Yeah, fantastic.
We've got a few more sort of high-level
questions now, Torsten. And the first
is, your overall,
let me frame it this way:
what advice would you give to
other sort of would-be developers
of DuckDB thinking, people who are thinking
of going in there and extending it themselves
in some cool way? What would be your advice to them?
I would,
the first advice would probably be to join
the DuckDB Discord, because
the community there is
so responsive, so friendly,
and so, so positive that I can only recommend going there and hanging out there.
I'm there.
I'm on there every day.
You can find me under the pseudonym of Tagi, which is my nickname, probably mostly in the SQL
channel.
Happy to help there, but there are dedicated channels on the internals, the C++ API,
the extensions, and so on, and the people hanging out there are super friendly.
So my advice is: go there, and then start reading a particular piece of the C++ code, the API, the extensions,
whatever code you are
most interested in.
Start reading about, whatever,
adaptive radix trees,
compare that with the particular research paper
that you've probably seen out there,
and then find the concepts of the paper
almost one-to-one in parts,
it's like that in the code base,
and get a hold, get a foot inside the code base
in that way.
So it's a major effort
to get to grips
with
the C++ code base in DuckDB.
But I think it's
really worth it. Don't be turned
off by
slews of code that
have little documentation
and little commentary and so on.
It's all structured well.
It's named well. It's consistent.
Lots of tests there.
I think it's manageable. Don't be turned
off by the first impression that you might get.
Yeah, that's solid advice.
My next one is more of a reflective question:
looking back over your experience
of working with it,
what's been the one thing
that surprised you
and kind of caught you off guard
while working with the system?
It's probably the flexibility
that's already baked into the engine
from the start
and how receptive the engine
proved for tweaks
and all the additions
that we had to do.
It's not just regarding
the extension system of DuckDB,
which is really cool,
but really the flexibility
baked into the kernel
itself, it's really
cool. So for using key, we
had to adapt the so-called aggregate
hash table data structure in DuckDB.
And
the API that this structure has
was just wonderful to work with,
easy to adapt, and so on.
So that was very
nice to see, and really
surprising to us how
forward-thinking
the engineering of the DuckDB C++
code base is in parts.
And then, yeah,
probably how directly you can find recent research results reflected inside the engine itself.
That's also a very nice surprise.
Yeah, that's very nice for sure.
My next question is the penultimate one, and it's more of a recommendation.
So it's kind of to the listener: what is your favorite plug-in, extension, or feature?
You can only choose one.
I know there's probably loads of those, but yeah, you can only pick one, Torsten.
Oh, okay, you can pick one of your own as well.
I would probably pick the extensive support for array and list processing in DuckDB,
using all these built-in functions and the lambdas and the comprehensions that have been built into the system.
It makes it almost feel like an array or list programming language, in a sense.
That's something I can only recommend really getting to know well.
Because on the back end it's also accompanied by very efficient storage of large lists,
as opposed to systems like PostgreSQL, where larger arrays overflow onto, whatever they're called, TOAST pages or something.
Not so in DuckDB; it's super efficient list storage.
And I can recommend taking the time and diving into the list and array functionality of DuckDB.
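(A small taste of that list functionality, as a sketch; the DuckDB documentation covers the full set of list functions and the currently preferred lambda syntax.)

-- Lists, lambdas, and a list comprehension in a single query.
SELECT
    list_transform([1, 2, 3, 4], x -> x * x)     AS squares,        -- [1, 4, 9, 16]
    list_filter([1, 2, 3, 4], x -> x % 2 = 0)    AS evens,          -- [2, 4]
    [x * x FOR x IN [1, 2, 3, 4] IF x % 2 = 0]   AS squared_evens;  -- [4, 16]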
There you go, listener.
You've got some homework.
Go and play with that.
Cool.
Yeah, the last word now,
and that is: what's the one thing
you want the listener to take from this chat today?
Yeah, if you have a complex
problem, a computation problem
that needs lots of data,
don't give in to the knee-jerk reaction
to export the data from the database
into some 'real', making air quotes here,
some 'real' programming language
and do all the heavy computation
and heavy lifting outside the database kernel.
Try to express your complex computation
using SQL,
maybe using features from the darker corners,
or maybe using iterative or recursive SQL CTEs.
Compute your results close to the data,
using a scalable query engine such as DuckDB provides to you,
and don't give in to the urge to, whatever,
use the database system as a dumb data silo only
and perform all the cool, interesting stuff
outside the database kernel.
That would be, I would say, unfortunate, to frame it nicely.
Yeah, cool.
Well, let's end it there; that's a good message to end on.
Thank you so much for chatting with us today.
It's been a really, really insightful chat.
I'm sure the listener will have loved it as well.
We'll drop links to everything that we've mentioned in the show notes.
So people can go and find all the stuff we've been chatting about.
But yeah, thank you very much, Torsten.
Thank you very much, Jack.
Every minute was a pleasure.
Thank you.
