Disseminate: The Computer Science Research Podcast - Recursive CTEs, Trampolines, and Teaching Databases with DuckDB - with Prof. Torsten Grust

Episode Date: October 16, 2025

In this episode of the DuckDB in Research series, host Dr Jack Waudby talks with Professor Torsten Grust from the University of Tübingen. Torsten is one of the pioneers behind DuckDB's implementation of recursive CTEs. In the episode they unpack:

- The power of recursive CTEs and how they turn SQL into a full-fledged programming language.
- The story behind adding recursion to DuckDB, including the USING KEY feature and the trampoline and TTL extensions emerging from Torsten's lab.
- How these ideas are transforming research, teaching, and even DuckDB's internal architecture.
- Why DuckDB makes databases exciting again, from classroom to cutting-edge systems research.

If you're into data systems, query processing, or bridging research and practice, this episode is for you.

Links:

- USING KEY in Recursive CTEs
- How DuckDB is USING KEY to Unlock Recursive Query Performance
- Trampoline-Style Queries for SQL
- U Tübingen Advent of Code
- A Fix for the Fixation on Fixpoints
- One WITH RECURSIVE is Worth Many GOTOs
- Torsten's homepage
- Torsten's X

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Disseminate: the Computer Science Research Podcast, the DuckDB in Research series. Hello and welcome to the DuckDB in Research podcast. In today's episode, we are going to be talking about recursive common table expressions, that is, recursive CTEs, and their implementation in DuckDB. And I'm really happy to say that I've got the expert on recursive CTEs with me today, and that is Torsten Grust. Torsten is a professor of computer science at the University of Tübingen. So yeah, welcome to the show, Torsten.

Hi, Jack. I'm really feeling flattered about the invitation to be part of the series. I'm sitting here in my DuckDB t-shirt, ready to talk DuckDB for the coming hour. So I'm happy to be here. Thank you. Let's do it. Fantastic. So the first question, the customary question on the show, is for you to tell the listener a little bit more about yourself and your story, how you got into databases, and how you got working with DuckDB. So, yeah, take it away. Okay. So I guess I'm one of the older guys that have appeared in your series here.

I'm actually a professor of database systems, and I started as a PhD student in database systems in 1994 at the University of Konstanz, on Lake Constance, a very beautiful, scenic area in Germany. Since then I've been working in databases all the time; from 1994 until this day, I've been researching databases, in particular query optimization and query processing. I became a professor of database systems in 2004, and then I spent time at several other universities, also at TU Munich, where Alfons Kemper and Thomas Neumann are located.

Actually, Thomas is my successor; he's sitting in my office, in a sense. Everybody on this podcast probably knows Thomas. He's just a genius. And I joined Tübingen in 2008 and have been here ever since, feeling great here, doing database research. And, well, recently we have switched lots of our efforts towards DuckDB, and I'm happy to talk about this here. Fantastic. Cool. So, yeah, like I said at the top of the show, we're going to be talking about recursive CTEs today. I feel iterative SQL and recursive CTEs are among the more advanced topics in SQL.
So for the listener who's maybe not familiar with what they are, give us a primer, I guess, of what CTEs are, and then what recursive CTEs are on top of that. Okay. So CTEs, common table expressions, actually come in two flavors. There's the simple, straightforward flavor, the non-recursive CTE, which allows you to assign names to intermediate results that you've computed in your query. Later on, in subsequent parts of your query, you can refer to that particular name, to that intermediate result, and use it one time or many times, potentially to build on it, and then build up complex queries, compute complex results from these intermediate results. It's, I would say, a software engineering tool that lets you master and tame complex queries by naming intermediate steps and then referring to them again and again.
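As a minimal illustration of that naming tool (the orders table and its columns are invented for this example):

```sql
-- Name an intermediate result once...
WITH big_spenders AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
)
-- ...then refer to it by name, potentially many times.
SELECT COUNT(*) AS n_customers, AVG(total) AS avg_total
FROM big_spenders;
```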
I think using CTEs to structure very complex queries is just good practice, good daily practice, and I think these are well accepted everywhere out there. And then there is the sister concept, the recursive common table expression, which really is a different beast. It really adds to the expressivity, to what you can actually do with the SQL query language. It's a major step forward, a quantum leap for the query language, because with these recursive CTEs, SQL becomes a full-fledged programming language in a sense. It has full expressive power, just like your general-purpose programming language, with which everything can be done in SQL.

Whether everything should be done in SQL, that's probably a different question. Because recursive CTEs look awkward to many of us. They tend to be somewhat verbose. They have the bad rep of being inefficient to evaluate; maybe we can talk about that. But they give you the ability to iterate inside your query, to iterate until you reach a fixed point or a stable result.

And that's something that's really unique to recursive common table expressions, something you cannot do with regular SQL queries. They were added to SQL in 1999, have been around for 25 years, and are still one of the more peculiar, exotic features in the query language, for sure. Yeah, so what would be a common use case? If I was a user of a database,
when would I be reaching for such a recursive CTE? What would be the context, or an example of when we'd go, okay, that's really what I need to be using here? Yeah, the vanilla, textbook example of a recursive CTE would be to traverse a tree-shaped data structure. You have tables in your relational database, but if you look at the columns and how they are connected via keys and foreign keys, you're in fact modeling a hierarchical or tree-shaped structure. A recursive CTE then allows you to identify a root or a starting node in that particular tree structure.

Then, in each iteration of your query, you advance from node to node: from the root, reach all the child nodes; from these child nodes, reach all the grandchildren. In each iteration you explore this hierarchical structure further and collect more nodes, probably matching defined criteria, to really explore a tree-shaped or hierarchical data structure of arbitrary depth.

If you cannot iterate, then you are confined to exploring your data structure to a fixed depth. You can join your table with itself, of course, to explore the tree structure, but such a join will be a two-fold, three-fold, or four-fold join, which limits the depth to which you can explore the structure. Recursive common table expressions remove that fixed limit and allow you to traverse as deeply as you need to.
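A sketch of that textbook traversal, assuming a table employees(id, name, manager_id) that encodes the hierarchy through a foreign key:

```sql
WITH RECURSIVE subordinates(id, name, depth) AS (
    -- identify the root (or any starting node) of the tree
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL
  UNION ALL
    -- each iteration advances one level: children, then grandchildren, ...
    SELECT e.id, e.name, s.depth + 1
    FROM employees AS e
    JOIN subordinates AS s ON e.manager_id = s.id
)
SELECT * FROM subordinates;   -- arbitrary depth, no fixed number of joins
```

A plain self-join would fix the depth at two or three levels; the recursive CTE keeps going until no new rows are produced.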
Now, kind of on this, and also what you said a second ago about the real expressive power of SQL, and how you can really do anything in it: I know that at the university you solved all of Advent of Code, the 2022 edition, just using pure DuckDB SQL, right? So maybe you can speak to that for a second as another example of the things you can do with recursive CTEs. Yeah, so Advent of Code, which probably most listeners have come across, is a collection of 25 programming puzzles, each posed on a day of December, so it's an advent of programming puzzles. From day to day, these typically increase in complexity and the challenge grows, but they are programming puzzles that often call on you to traverse tree structures, traverse or build up graph structures, or perform some optimization task. And the iterative nature of the recursive CTE is just perfect for that: identify a starting and an end node in a graph structure, and then, in each iteration, start from these nodes, explore the graph structure, discard particular nodes that will lead you into cul-de-sacs, explore other nodes, keep iterating over these nodes in the coming iterations, and so explore complex data structures and collect interesting data on your way to solving these programming puzzles. Looking at the 25 days of Advent of Code from 2022, I bet 80% of these required recursion or iteration to solve.

Yeah, that's fantastic. And I guess that's a really good example for any listener who wants to go and see what a recursive CTE looks like in DuckDB; that would be a really good place to go, have a look, and have a play around. Cool.

So, yeah, you also mentioned a second ago that there are certain shortcomings, limitations, or restrictions of SQL's recursive CTEs. So before we get stuck into DuckDB's recursive CTEs, let's talk a little bit more about the shortcomings of these in general.
Yeah. So when you perform query evaluation in an iterative fashion, parts of your query are evaluated again and again, and of course you have the problem of infinite iteration: when will the iteration stop? To make sure, or at least to have some guarantees, that such an iteration will stop, the definers of SQL:1999 imposed restrictions on the iterated part of your query. The iterated part has to come to a defined end, the so-called fixed point of the computation. I won't define what that means here; it's a proper abstract algebraic concept.

But there are syntactic restrictions imposed on CTEs that really make it hard to use iterated or recursive CTEs in, let's say, creative ways. You have cool ideas, you want to express them using your CTEs, and boom, you hit one of these syntactic restrictions, a wall that has been set up probably for good reason, but that limits what you can actually do in CTEs. This often leads to rather awkward syntactic workarounds that make these queries look awkward, complex, hard. It probably is one of the reasons why recursive CTEs have a bad reputation, in a sense. That's a usability deficiency, I would say.

And there are efficiency concerns that led the definers of SQL:1999, of the original recursive CTE standard, to say: when you iterate, in your iterated query you can only see the results that you produced in the immediately preceding iteration. You might be in the 100th iteration, but you can only see the results that you produced in the 99th iteration. Everything that was computed before is lost. Well, it's not actually lost: we collect it for you, and later, when the computation is done, we give it back to you. But during the iteration, you cannot see it. It's a sort of short-term memory. Yeah. And it's for a good reason.
Looking only at the iteration results from the last iteration means looking at probably very few rows, as opposed to the potentially huge log of intermediate results collected in all the prior iterations. So these were efficiency concerns that led to this short-term-memory semantics of recursive CTEs: you can only see what was produced in the preceding iteration. And this leads query authors to play dirty tricks like: in my query, I will collect a huge array of stuff that I've produced in the earlier iterations, an ever-growing array that collects and aggregates everything I've seen during all my tree traversals, graph traversals, and I will pass that on from iteration to iteration as a replacement for the short-term memory, in a sense. And that really leads to space leaks: super huge arrays that you pass from iteration to iteration, slowing down the whole iteration process. It's an idiom, an anti-pattern, I would say, to carry all this baggage with you to remember what you've done in the early iterations. It's really unfortunate, yeah.
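The idiom Torsten describes, sketched against an assumed edges(src, dst) table: because only the previous iteration is visible, every row drags the set of visited nodes along as a growing array.

```sql
WITH RECURSIVE walks(node, path) AS (
    SELECT 1, [1]                           -- start at node 1; path so far
  UNION ALL
    SELECT e.dst, w.path || [e.dst]         -- the array only ever grows
    FROM walks AS w
    JOIN edges AS e ON e.src = w.node
    WHERE NOT list_contains(w.path, e.dst)  -- consult the carried baggage
)
SELECT * FROM walks;
```

Every row carries its entire history, which is exactly the space leak he is pointing at.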
Yeah, you only really want to bring with you the things you actually need, right? Okay, cool. Given that, then, I think we've set the scene up perfectly there, Torsten. So let's talk about the DuckDB implementation of recursive CTEs. Obviously, you guys at the University of Tübingen were behind that. Maybe we can set the scene of how that came about, how you decided, okay, we're going to be the people to add this to DuckDB, and just start the story there. Okay. So I saw the very first glimpse of DuckDB at SIGMOD 2019 in Amsterdam, when Hannes and Mark were showing the first prototypes of the system, and it looked super interesting, very promising to us. I was in Amsterdam at that point, and my now-postdoc Denis was with me. We were looking at that system in real awe: this is something new, an in-process database management system, could be really easy to set up, would be an ideal teaching tool, great.
And then after the conference, I think on Twitter, I connected with Hannes Mühleisen, the CEO of DuckDB Labs. And I said, oh, DuckDB is great. It's great. When will you add recursive CTEs? Because recursive CTEs were missing at that early point. And then he responded on Twitter saying, oh, probably never.

And I was super disappointed, because in teaching advanced concepts of SQL, recursive CTEs play a role: they really define that quantum leap of expressiveness in the query language. Teaching SQL without having access to iterative CTEs would have been hard for me to imagine. So after that disappointing response on Twitter, here in Tübingen, Denis set out to write the very first cut at an implementation of WITH RECURSIVE. And already back then, we were really delighted to see how clean and well-structured the DuckDB code base was, so adding that to DuckDB was actually a manageable effort. It has been in DuckDB since, let me check, version 0.1.5. Yeah, that's when we added recursive CTEs, in the SQL:1999 style, with all the short-term-memory deficiencies and all the syntactic restrictions that I was talking about. But since then, recursive CTEs are part of DuckDB.
So we added a new plan operator that internally has the ability to iterate a subplan of your query plan; that's obviously needed to implement the CTE semantics. And since version 0.1.5 we have provided bug fixes and improved the CTE implementation. Yes. Nice, yeah. Were there any other modifications that needed to be done, other than adding a new plan operator? You said that the way the code is structured made adding things in really nice, but were there any sort of breaking changes that needed to be made to support WITH RECURSIVE?

Well, what you need is the ability to reset a subplan, to let it run again, in a sense, because upon the next CTE iteration there's a need to restart a subplan. That's something that we had to work on. You have to have this basic functionality when you have a nested-loop join in your database system, but we had to work on that particular plan execution feature, to reset a plan and evaluate it again. Actually, until this day that's a bit of a pain point, because resetting all the plan data structures associated with a subplan, restarting a plan anew, in a sense, is a somewhat costly operation that we try, until this day, to optimize and make cheaper, and keep working on here in Tübingen. Yeah.
Very nice. That's a nice segue into my next question, then. Obviously, the first implementation was going by the SQL:1999 standard, with all the limitations in there, and since then it's been very fertile ground for research for your group. There are a few places we could start here; I don't know if you have any favorites, but maybe we should start off with USING KEY and how that has built on the existing implementation. Yeah, okay. So the USING KEY feature: USING KEY is a modifier that you can add to the WITH RECURSIVE CTE construct in SQL now. And we have very recently added that to DuckDB 1.3. So since 1.3, that's not a research idea or some whiteboard pipe dream anymore; it's actually implemented and tested and bug-fixed and so on in version 1.3. We have blogged about that; maybe we can provide a link to the blog post. It would be wonderful. And we now see on the Discord that some DuckDB users have picked that up. I've seen LinkedIn postings of people saying, oh, have you seen that USING KEY feature, it's something that's cool, and it's nice to see that this is being picked up. So what does USING KEY mean? It's an optional modifier that you can attach to your WITH RECURSIVE CTE, and when you do that,
the behavior of the iteration process changes slightly. In the original CTE, as I said, in the background the database system is collecting a record, a log, of all the intermediate results that all the iterations have produced. This log, this record, is kept in the so-called union table, and over time it might grow super large. That's why the original CTE definers said: no, we cannot let you access this potentially super large table; this would inhibit efficient implementation and execution of the iterated CTEs. You cannot have that. You have short-term memory.

With USING KEY, we are not just appending to this ever-growing log of information. Instead, we define key columns, and when you see, in some iteration, a key value combination that you have seen before in any of the iterations, then the new information is used to overwrite, under this particular key combination, the information that you recorded before. So if in some iteration you have the key value 100 and you've remembered something under key value 100, and 200 iterations later the key 100 comes up again in your computation, then you have the ability to update that particular value saved under key 100, to store an augmented, up-to-date version of the information that you want to save under key 100 at this particular point.

This means that the record of results the DBMS keeps for you is not an ever-growing heap of stuff; you have the ability to selectively overwrite and replace old information with new information. That keeps the size of the recorded information in check,
and it's so much smaller that we can now allow you, in the iterated query, to see what you've done in earlier iterations. You can see what you've done. There's no need anymore to collect large arrays of stuff to remember what you have done. The system has the ability to tell you: hey, this is what you have seen so far; these are the most recent values that you've saved under these keys, key 100, for example. And based on that, you can steer your computation, decide that the iteration result is good enough and stop the iteration, or go on again, and so on. So now you have the ability to see what you've done before, and this works because we are not forever adding to a pile of stuff that's grown too large by, probably, the 1000th iteration. That's USING KEY. Nice.
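A toy sketch of that overwrite behavior, using the USING KEY modifier that shipped in DuckDB 1.3; the computation itself is invented:

```sql
-- Iterate v -> (v * v) % 10 eight times. With USING KEY (k), each new row
-- overwrites the stored row carrying the same key, so the result holds one
-- row per k (its final state) instead of a log of every intermediate step.
WITH RECURSIVE iter(k, v, step) USING KEY (k) AS (
    SELECT x AS k, x AS v, 0 AS step FROM range(1, 5) AS t(x)
  UNION
    SELECT k, (v * v) % 10, step + 1
    FROM iter
    WHERE step < 8
)
SELECT * FROM iter;
```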
A question there: obviously, as you described it, it's very much an overwrite under the key. Is there any way of expressing more complex semantics of what you want to do, I don't know, merging or summing the values under a key? Is that possible as well, or is it at the moment very much just an overwrite? Maybe you don't want to do that, I don't know. But yeah.

No, no, no, that's a very cool question. Actually, because you have access to the current state saved under, say, key 100, you can use that currently stored value and add to it, augment it, maintain that particular value and build on top of it. So that's actually already possible. It's not just a plain replacement: you can refer to the currently stored value and add to it, aggregate on top of it, decide that the old value is better or good enough compared to the current value you've computed, anything like that. And if you find that your query is always taking the value that you already have and then performing an addition, summing onto the old value, then we will probably, in one of the coming DuckDB versions or variations of USING KEY, have a modifier to the USING KEY construct that allows you to say: hey, if there's an old value and here's a new value, add them up for me, sum them up for me, so you don't have to manually maintain the particular entries under key 100; the system will automatically do the summing for you. But that's an extension we are considering. Currently, you would have to use the old value, add the new value on top of that, and then store the result as the new value under key 100.
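A hedged sketch of that manual read-modify-write. The recurring.totals name for reading the stored values back inside the iteration is an assumption based on the USING KEY blog post, so treat the exact spelling as illustrative:

```sql
-- Accumulate a per-key running total by hand: read the stored total back,
-- add the new contribution, and write the result under the same key.
WITH RECURSIVE totals(k, total, i) USING KEY (k) AS (
    SELECT x AS k, x AS total, 0 AS i FROM range(1, 4) AS t(x)
  UNION
    SELECT t.k, r.total + t.k, t.i + 1      -- old stored value + new value
    FROM totals AS t
    JOIN recurring.totals AS r ON r.k = t.k -- assumed store reference
    WHERE t.i < 5
)
SELECT k, total FROM totals;                -- each k ends at 6 * k
```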
Maybe something for the listener to look forward to in future versions of DuckDB, then. Cool. Obviously, adding this feature broadens the possibilities and maybe makes types of queries possible that weren't possible before. But do you have any idea, in terms of benchmarks and numbers, of how much speed-up this gives, or anything like that?

Yeah, yeah. So this is about speed-up, but it's also about space savings. We were very interested in benchmarking how large this key-value store, this associative dictionary, whatever you like as a term for it, will grow, and of course we wanted it to be significantly smaller than the pile of stuff that the database would be maintaining in the back with the vanilla recursive CTEs. So we've implemented a number of classical algorithms, some of these graph algorithms, to see how the USING KEY variant would fare regarding space usage and also regarding time. For example, we have implemented a connected components algorithm, where each node, depending on its neighborhood and the reachability of other nodes, is assigned to one particular component of the graph. If you know these connected components algorithms: in the beginning, every node sits in its own single component, but as you iterate the algorithm, the component that a node belongs to gets updated. That's a wonderful example of this updating behavior of USING KEY. The key in this particular approach would be the node ID itself, and the value that you store under that particular key is the component that this node belongs to. And every node only ever belongs to one particular component; there's no need to pile up old, irrelevant information. What you need is the most current component that your particular node belongs to. We have written a paper where we show the connected components formulation using USING KEY, and it's almost a one-to-one match to the imperative, textbook-style connected components algorithm that you can see in your data structures book. It's very nice to see how elegant and readable this USING KEY formulation of the DuckDB query becomes.
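A rough sketch of that formulation (the paper's version may differ in detail), assuming tables nodes(id) and edges(src, dst) with both directions of every undirected edge stored:

```sql
-- Key = node id; the stored value is the smallest component id seen so far,
-- so each iteration simply overwrites a node's component with a smaller one.
WITH RECURSIVE comp(node, component) USING KEY (node) AS (
    SELECT id, id FROM nodes          -- every node starts in its own component
  UNION
    SELECT x.node, MIN(x.component)   -- shrink towards the neighborhood minimum
    FROM (
        SELECT c.node, c.component FROM comp AS c
      UNION ALL
        SELECT e.dst, c.component
        FROM comp AS c JOIN edges AS e ON e.src = c.node
    ) AS x
    GROUP BY x.node
)
SELECT component, COUNT(*) AS size
FROM comp
GROUP BY component;
```

Component ids only ever decrease, so the iteration reaches its fixpoint once no stored value changes.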
Regarding space savings and time savings: we have run this on some larger graphs with thousands of nodes and tens of thousands of edges, where a vanilla WITH RECURSIVE formulation of the connected components algorithm would have amassed intermediate result information, most of it obsolete, in the gigabyte range, multiple gigabytes of intermediate information, while the USING KEY variant uses a whopping 230 kilobytes. And in the end we are faster than the vanilla WITH RECURSIVE implementation by a factor of 270. That's a pretty good speed-up; that was very nice to see.

We have other algorithms, like all-nodes shortest paths: from each node to each other node, what would be the shortest path to take? In some of these algorithms, each node saves routing information: if I'm node X and I want to go to node Y, what would be the next hop, the best hop, to reach node Y from X? That's information that is iteratively refined; it gets updated as the algorithm runs. Again, that's a perfect example of a USING KEY use case. And there too, the vanilla WITH RECURSIVE very quickly runs into out-of-memory problems, while on large graphs we can run these all-nodes shortest path queries in below one second, where the vanilla recursive query takes 400 seconds or suffers from an OOM. So there's real space and runtime efficiency to be had, and readability improvements as well.
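A single-source sketch of that shortest-path pattern (the episode's benchmark is all-pairs), assuming edges(src, dst, cost) with non-negative costs; again, the recurring.dist reference to the stored values is an assumption carried over from the USING KEY material:

```sql
-- Distances from source node 1; the stored value per node is its best
-- distance so far, overwritten only when a strictly better one is found.
WITH RECURSIVE dist(node, d) USING KEY (node) AS (
    SELECT 1 AS node, 0.0 AS d
  UNION
    SELECT e.dst, MIN(f.d + e.cost)
    FROM dist AS f
    JOIN edges AS e ON e.src = f.node
    LEFT JOIN recurring.dist AS r ON r.node = e.dst -- assumed store reference
    WHERE r.d IS NULL OR f.d + e.cost < r.d         -- keep strict improvements
    GROUP BY e.dst
)
SELECT * FROM dist ORDER BY d;
```

Only improving candidates are produced, each overwriting the stored best distance for its node, so the iteration dries up once no distance can shrink; a vanilla WITH RECURSIVE would instead log every candidate path it ever saw.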
These queries look nicer, and for a researcher, and also a teacher who tries to convey these algorithms in SQL style to students, that's also a big win. Yeah, that ticks all the boxes, then. And I think that last point you mentioned has historically been undervalued. Like you're saying, the readability aspect is probably one of the impediments to why these types of SQL queries are not as prevalent as they might otherwise be. Cool. So the other thing I want to touch on, and it's got a brilliant name, this one, is one of your proposals for new CTE variants: trampolines. Can you tell us what trampolines are?

Okay. So actually, in the meantime, we have, I think, three or four different flavors or variants of WITH RECURSIVE or iterative queries on the table. I will come to trampolines in a second. Sure, yeah. Another variant I'm very fond of is the time-to-live, or TTL, variant, where the query can decide: hey, this is information that I've just computed, and I guess it will be relevant for the next four iterations; after that, it can be discarded. So you set the time to live for that particular information that you've just computed, and then after four iterations that particular row will be forgotten and will never clog your memory again, and so on. There are also some very nice algorithms that move a window over some large data structure: you know that the window has a particular, certain size, and everything that has left the window will not be relevant anymore. You set the TTL of that particular information, the row that you've computed, accordingly, and the TTL variant of WITH RECURSIVE will forget everything in due time. That's also very nice. TTL has not been implemented in DuckDB yet, but, well, it is on the board here in one of our offices.
So it's probably being worked on as we speak. And then there's the trampoline variant, and I'm very fond of that one because it's a collaboration with Thomas Neumann, the Thomas Neumann I was mentioning, at the Technical University of Munich, who is a great guy and a genius in my eyes. These trampolines are actually a construct that has its origins and roots in the programming languages world, where you want to efficiently implement recursive functions without filling up your stack. That's actually a recurring theme in the work we do here in Tübingen: we try to look for, and then steal and adapt, the best ideas that the programming languages and compiler folks have, in particular the functional programming folks, and try to massage them and reuse them in the query language context. The trampolines are one particular aspect of that.

As I said, originally they were invented as efficient calling schemes for highly recursive functions. But in the query context, we use them to define iterative queries where the iterated part has multiple branches of computation that can be active in every iteration. So in one iteration, the first branch might be relevant; in the second iteration, the second branch; in the third iteration, the first may be relevant again. In every iteration you can invoke different parts of the logic, perform different parts of the computation. That's different from a normal WITH RECURSIVE CTE, where you have one monolithic query part that is iterated again and again. In a trampoline query, you can identify individual branches, and only some of these, maybe all of them, but normally only some, are activated in each iteration. When you produce an iteration result, every row is tagged with the branch it is subjected to in the next iteration: hey, row, you are meant to be processed by branch three in the next iteration. And in the next iteration, that particular row will be routed to branch three, which will do some cool stuff with it, and other rows might have different fates and may be routed to different branches. That leads to very cool means of implementing all kinds of stuff.
It's also the basis for our own research on implementing iterative PL/SQL-style UDFs. You can map such imperative programs, with loops and branches and all this good stuff, very naturally to these iterated trampoline queries, and branching inside the imperative code then corresponds to selecting the proper branch of your iterated query. The correspondence is really one-to-one, from imperative UDF-style code to these trampoline-style iterated SQL queries. We are currently developing compilers from imperative PL/SQL-style UDFs to these trampoline recursive queries, and that might be cool; maybe this opens the avenue to have real PL/SQL-style UDFs inside database systems that have a facility for CTEs but never had a PL/SQL interpreter. And DuckDB is among those systems, as you probably know. So in secret, nobody is supposed to know this, but in secret, we are working on an implementation of PL/SQL-style UDFs for DuckDB here in Tübingen, based on these trampoline-style recursive queries.
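There is no dedicated trampoline construct in DuckDB yet, so as an illustration only, here is the branch-tagging idiom hand-coded in today's plain WITH RECURSIVE, emulating the imperative Collatz loop "while n <> 1: if n is even then n = n / 2 else n = 3 * n + 1"; the example and its branch names are invented:

```sql
-- Each row carries a 'branch' tag naming the logic applied to it next round.
WITH RECURSIVE steps(n, branch, i) AS (
    SELECT 27, CASE WHEN 27 % 2 = 0 THEN 'halve' ELSE 'triple' END, 0
  UNION ALL
    SELECT next_n,
           CASE WHEN next_n % 2 = 0 THEN 'halve' ELSE 'triple' END, -- re-tag
           i + 1
    FROM (
        SELECT CASE branch
                 WHEN 'halve'  THEN n // 2       -- branch 1: n was even
                 WHEN 'triple' THEN 3 * n + 1    -- branch 2: n was odd
               END AS next_n,
               i
        FROM steps
        WHERE n <> 1                             -- loop exit condition
    ) AS dispatch
)
SELECT i, n FROM steps ORDER BY i;
```

A compiler from PL/SQL-style UDFs, as sketched in the trampoline paper, would generate this kind of tag-and-dispatch query from the UDF's control flow.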
Fantastic, yeah, we're breaking the news here first. The cat's out of the bag now, Torsten, that's it. That's fantastic. Cool. So, yeah, I want to talk a little bit about impact in the next section of the podcast. Do you have a handle on how much impact your work on recursive CTEs has had in DuckDB, in terms of usage and feedback? And what's that looked like for you over the last, I guess, five or six years? I mean, how long ago was SIGMOD 2019, right?

I think, as of today, DuckDB has one of the more efficient implementations of recursive CTEs out there. It has gone through so many iterations, in a sense, implementation-wise, that it really is a production-ready feature. And if you're out there reading blog posts and other research on recursive CTEs and iterative computation in SQL, you can see that DuckDB is one of the platforms now where people try things out. Whether DuckDB and recursive CTEs have found their way into production industry use, I have no data on, because people tend not to post or blog about that. But in the blogosphere you see that DuckDB has become one of the go-to platforms for complex SQL and iterative and recursive SQL by now, and that's very nice to see. Regarding the USING KEY feature, it's super brand new, like three months in the wild now, and I've seen on Bluesky and on LinkedIn several blog posts where people said, oh, this is a game changer; I've rewritten my clunky WITH RECURSIVE CTEs that were inefficient or hard to read, and, well, yeah, it's a real step forward. But USING KEY, really, is still in its infancy, so no hard facts about the uptake there, in a sense.

Right, yeah. I just want to pull on that a second: you mentioned other systems, and we should probably talk about those briefly as well, like how DuckDB compares to them, how much more advanced,
or more efficient, DuckDB's implementation is versus what's out there and what's available in other systems. Yeah. What's interesting is that there are other relational query languages that have been specifically geared towards evaluating recursive queries, Datalog, for example. There are dedicated implementations of Datalog that have been built to evaluate recursive queries efficiently; one very recent implementation is called Soufflé, and a recent research paper suggests that DuckDB can at least keep up with this dedicated system, and even beat it for some types of queries. So we are quite proud of that. And further work on the efficient resetting of subplans and so on will hopefully bring some further performance advantages for DuckDB, if it all goes well. So I think we are on par with many of the systems out there.

Of course, there's the elephant in the room: there's PostgreSQL. I would say that the DuckDB implementation of recursive CTEs has some major advantages over the PostgreSQL variant in the meantime: fewer syntactic restrictions and very efficient evaluation. I think using recursive features or queries in DuckDB is just more fun now, because we got rid of many of these syntactic restrictions that hold you back from exploring, from being creative with these iterative CTEs. Yeah, you've liberated us all. Yeah, free to run, go wild now.
Cool, yeah. So I guess, summarizing then on the future of where you go now, there are loads of different angles. What can we expect over the next year or so, if you were going to summarize? Yeah. So regarding USING KEY, there's some performance improvement that we are currently working on. If you look at the typical body of such an iterated query, what you see is that these iterated queries look at what they have computed last and then join that with information about the tree or graph structure that you would like to traverse. You look at what you've computed: oh, that's where my traversal has brought me so far. Then you join that with the graph or tree table to see where you can go next. And because this is such an idiom, joining the intermediate result of your CTE with some existing table, we are specifically looking at improving and optimizing these joins. In USING KEY, we hold the information about the key values that you have seen and saved so far in a hash table, and that hash table can immediately serve as one of the sides of a hash join implementation. So we can save half the work of implementing this join with the tree or graph structure, by just reusing the native USING KEY data structure as one half of this hash join implementation. This should bring another really measurable speed-up. We are very keen on measuring and
benchmarking that; it's something that we do. And we will generalize USING KEY, as I said, by giving the system the ability to maintain the values under known keys for you, to do whatever max, min, sum, average computation automatically for you. We will of course try to add trampolines to DuckDB; they have already been added to Umbra, which is Thomas Neumann's research-style system, and it's fun to play with them there. We want to have trampolines in DuckDB as well. And then, maybe, I don't know whether we can add the TTL variant. That's quite a laundry list of stuff. Yeah, it's a never-ending, interesting pool of cool ideas, this working with DuckDB. That's something we'll try over the next few months, of course. Yeah. And then this DuckDB-based PL/SQL-style UDF thing, that's something we definitely want to try, but that's something for a larger horizon, maybe the next two years or something. Yeah, very nice. A lot of cool things to look forward to, that's for sure.

Nice. Let's change tack. I want to talk a little bit about your teaching with DuckDB. You mentioned earlier on
how it's basically become the way you teach databases at the University of Tübingen. So tell us a little bit more about that, what that experience has been, what prompted the switch, and how you have found it, from teaching students without DuckDB before to now using it, and how that has changed outcomes and the student experience and those things as well.

Okay, cool. I'm glad that you found the time to talk about this here; that's great. It's of course close to my heart. Before DuckDB, all our database-related teaching efforts were based on Postgres, and that worked well for many years. PostgreSQL is an open-source system, and that's needed for the particular type of course where you want to inspect database internals: how does a buffer manager work, how does query planning work, and so on. And many of the textbook-style algorithms out there you find implemented in PostgreSQL. That's nice.
But it's equally nice to see that DuckDB actually incorporates all the cool, recent, hot research results that you find in papers, like adaptive radix trees and vectorized query execution, pipelining, morsel-driven parallelism, which are really the new cool stuff that you find in research papers. You find all of these almost by the book, or by the paper, implemented in DuckDB, and that's very motivating for students. If you look at Postgres, you find B+-tree techniques from the 1980s implemented. If you look at DuckDB, you find techniques implemented that have been proposed in, whatever, the last five to eight years or something. Very motivating for students; that's one thing. So we need an open-source database system for particular aspects of teaching: we have to look inside stuff, throw in debug statements, switch stuff on and off, to allow students to play with it. And installing PostgreSQL can be a real nuisance, can be really hard, especially on the Windows platform, for students who are not very inclined to fiddle at the terminal and so on. Installing DuckDB, as we all know, is pip install duckdb, and that's it. And that's an advantage that's not to be underrated in the early years of computer science curricula: a very pleasant install experience that's consistent across macOS and Linux and Windows and so on. It's not to be underrated. And the Python API: Python is a language that's known to many of our younger students, and this tight integration enables them to do cool stuff with DuckDB in, whatever, the second week of the semester or something. It's really refreshing. That's cool. The DuckDB SQL dialect is super extensive by now. It's friendly, but it's also very extensive. That means
DuckDB is the ideal vehicle for a course that I call Advanced SQL, which is something that you would probably hear after a course on the introduction to relational database systems or something. Advanced SQL really focuses on the dark corners and the interesting corners of SQL: window functions, and of course recursive CTEs, and so on. And DuckDB is so rich in implementing, very efficiently, all kinds of window functions, all the nitty-gritty details, recursive CTEs, that it has become the ideal vehicle for Advanced SQL teaching here at the University of Tübingen. That's a course where you don't learn about SELECT ... FROM ... GROUP BY, but all the other cool stuff that you only find if you flip to page 300 of the SQL textbook. And it's very cool. Students start to implement, with these window functions and recursive CTEs, finite state machines, k-means clustering, cellular automata, optimization problems, Sudoku solvers, all kinds of stuff that we do, Advent-of-Code style, in a sense, in a lecture.

Have you felt that students have been more satisfied and more interested in databases after the course, given that they're using DuckDB versus, say, Postgres? Have you found, I don't know, that where by the end of the course it used to be, oh God, I never want to see a database again, now they're enthused and interested in databases and want to pursue them at Masters or PhD level or whatever? Yeah, definitely.
There's a saying, actually, in the database research community: would you give a database system to your best friend as a birthday present? And of course, the answer is no, because it's a maintenance nightmare and whatever; installing many of these off-the-shelf database systems out there is a nightmare. Fiddling with DuckDB is just a pleasure, and students have a sense that this is a system that is moving fast, that development is going on rapidly, that it's a young system, that it's super efficient. Seeing good benchmarks for the DuckDB system really motivates students to take up on that and keep working with it, even after these courses. We have a very active influx of students who want to do their bachelor's or master's theses with us, and I'm very happy to see that. DuckDB makes, well, yeah, database systems sexy again, in a sense. It's really nice. And then there's all this cool gear: lots of students carry around their laptops with DuckDB stickers here on the campus and so on. It's really cool
to see. Yeah, fantastic. We've got a few more sort of high-level questions now, Torsten. And the first, let me frame it this way: what advice would you give to would-be developers of DuckDB, people who are thinking of going in there and extending it themselves in some cool way?

The first piece of advice would probably be to join the DuckDB Discord, because the community there is so responsive, so friendly, and so positive that I can only recommend going there and hanging out. I'm on there every day; you can find me under the pseudonym Teggy, which is my nickname, probably mostly in the SQL channel, happy to help there. And there are dedicated channels on the internals, on the C++ API, on extensions, and so on, and the people hanging out there are super friendly. So my advice is: go there, and then start reading the particular piece of C++ code that you are most interested in. Start reading about, whatever, adaptive radix trees, compare that with the particular research paper that you've probably seen out there, and find the concepts of the paper almost one-to-one in parts of the code base, and get a foothold in the code base that way.
It's a major effort to get to grips with the C++ code base in DuckDB, but I think it's really worth it. Don't be turned off by slews of code that have little documentation and little commentary and so on. It's all structured well, it's named well, it's consistent, and there are lots of tests. I think it's manageable. Don't be turned off by the first impression that you might get.

Yeah, that's solid advice. My next one is more of a reflective question: looking back over your experience working with it, what's been the one thing that surprised you, that kind of caught you off guard, while working with the system?
It's probably the flexibility that's been baked into the engine from the start, and how receptive the engine proved to the tweaks and all the additions that we had to make. It's not just the extension system of DuckDB, which is really cool, but really the flexibility baked into the kernel itself. For USING KEY, we had to adapt the so-called aggregate hash table data structure in DuckDB, and the API that this structure has was just wonderful to work with, easy to adapt, and so on. That was very nice to see, and it was really surprising to us how forward-thinking the engineering of the DuckDB C++ code base is in parts. And then, yeah, probably how directly you can find recent research results reflected inside the engine itself. That's also a very nice surprise.

Yeah, that's very nice for sure. My next question is the penultimate one, and it's more of a recommendation.
It's kind of for the listener: what is your favorite plugin, extension, or feature? You can only choose one. I know there's probably a load on the list, but yeah, you can only pick one, Torsten. Oh, okay; can I pick one of my own as well? I would probably pick the extensive support for array and list processing in DuckDB, using all these built-in functions and the lambdas and the comprehensions that have been built into the system. It makes it almost feel like an array or list programming language, in a sense. That's something I can only recommend getting to know well. Because on the back end it's accompanied by very efficient storage of large lists, as opposed to systems like PostgreSQL, where larger arrays overflow onto, whatever they're called, TOAST pages or something. Not so in DuckDB: it's super efficient list storage. And I can recommend taking the time to dive into the list and array functionality of DuckDB. There you go, listener, you've got some homework; go and play with that. Cool.
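A small taste of that list-and-lambda functionality for the listener; the functions shown exist in stock DuckDB, though the examples themselves are invented:

```sql
SELECT
    list_transform([1, 2, 3], x -> x * x)       AS squares,        -- [1, 4, 9]
    list_filter([4, 5, 6, 7], x -> x % 2 = 0)   AS evens,          -- [4, 6]
    [x + 1 FOR x IN [1, 2, 3] IF x <> 2]        AS comprehension,  -- [2, 4]
    list_reduce([1, 2, 3, 4], (a, b) -> a + b)  AS total;          -- 10
```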
Yeah, the last word now, and that is: what's the one thing you want the listener to take away from this chat today? Yeah: if you have a complex computation problem that involves lots of data, don't give in to the knee-jerk reaction of exporting the data from the database into some "real", I'm making air quotes here, some "real" programming language, and doing all the heavy computation and heavy lifting outside the database kernel. Try to express your complex computation using SQL, maybe using features from the darker corners, or maybe using iterative or recursive SQL CTEs. Compute your results close to the data, using a scalable query engine like the one DuckDB provides to you, and don't give in to the urge to use the database system as a dumb data silo only and perform all the cool, interesting stuff outside the database kernel. That's, I would say, unfortunate, to frame it nicely. Yeah, cool.

Well, let's end it there; that's a good message to end on. Thank you so much for chatting with us today. It's been a really insightful chat, and I'm sure the listener will have loved it as well. We'll drop links to everything that we've mentioned in the show notes, so people can go and find all the stuff we've been chatting about. But yeah, thank you very much, Torsten. Thank you very much, Jack. Every minute was a pleasure. Thank you.
