Disseminate: The Computer Science Research Podcast - Adaptive Factorization in DuckDB with Paul Groß
Episode Date: November 6, 2025

In this episode of the DuckDB in Research series, host Jack Waudby sits down with Paul Groß, PhD student at CWI Amsterdam, to explore his work on adaptive factorization and worst-case optimal joins, techniques that push the boundaries of analytical query performance. Paul shares insights from his CIDR'25 paper "Adaptive Factorization Using Linear-Chained Hash Tables", revealing how decades of database theory meet modern, practical system design in DuckDB. From hash table internals to adaptive query planning, this episode uncovers how research innovations are becoming part of real-world systems. Whether you're a database researcher, engineer, or curious student, you'll come away with a deeper understanding of query optimization and the realities of systems engineering.

Links: Adaptive Factorization Using Linear-Chained Hash Tables
Transcript
Disseminate: The Computer Science Research Podcast, the DuckDB in Research series.
Hey folks and welcome to the DuckDB in Research series.
This is the show that explores all the wonderful ways DuckDB has been used in the world of research and academia.
That might be as a vehicle to teach databases or as a platform to experiment and to develop cool new things.
So yeah, if you're a researcher, an engineer or just a general database enthusiast, this is definitely the show for you.
In today's episode, I'm really delighted to say that I'm joined by Paul Groß, a PhD student at CWI in Amsterdam.
I'm not going to attempt the Dutch pronunciation of what the acronym actually means, but yeah, maybe you can explain it for us, Paul.
I've tried before and I always get it wrong.
But yeah, Paul is here to talk about some of the research that he published earlier this year at the Conference on Innovative Data Systems Research, also known as CIDR.
No, it's not a beer festival, unfortunately.
But yeah, cool. Paul's research, or this piece of research I should say, looked at factorized aggregations and worst-case optimal joins. So first of all, welcome to the show, Paul. Yeah, thank you very much for having me.
Cool. Well, let's dive straight in. Give us the sales pitch. Tell the listener what they've got themselves in for over the next sort of 30, 40 minutes or so. Actually, I didn't say what your paper's called, so maybe I should do that now. That would be a great idea.
Yeah, right? So your paper's called Adaptive Factorization Using Linear-Chained Hash Tables.
So yeah, what's the TLDR on this? How would you explain it to your grandma? Take it away, Paul.
Yeah, so the paper on adaptive factorization is basically about allowing you to reduce the size of your query's intermediate results. In most analytical queries, you don't really have a million rows in your output, because at the end the output is designed for a human, or maybe nowadays for an LLM, and they can't really make sense of that many rows. But the intermediates often contain millions and millions of rows.
And with factorization, or with adaptive factorization, you can basically avoid those big intermediate results by only processing a reduced set of data.
And the cool thing about our paper is really that you can use 90% of the things that are already in a normal analytical database system to do that
and you don't really need a new big operator for that,
but you can basically integrate it quite easily.
And also to add, like, this was joint work with my supervisor at the time,
Daniel ten Wolde.
And I think you already had him on the show as well, on DuckPGQ.
Yeah, he was on in the first season talking about DuckPGQ.
So if you're interested in this, listener,
you should definitely go check out Daniel's show as well.
It's really good.
So that's cool there.
So basically the aim of the game here is that we can often get these really big intermediate sets of things while we're doing our queries.
We want to make those as small as possible, right?
So we don't end up with this sort of state space explosion during our query processing.
Cool.
So yeah, my grandma definitely gets it now.
She knows.
She understands.
Cool.
So, yeah, there was also another thing that I mentioned at the top of the show.
I mentioned the phrase worst-case optimal join.
Maybe we can talk a little bit more about what that actually means, because whenever I see that term it always scares me a little bit. So maybe we can unpack what that is a little
bit as well for the listener. Yeah, it still scares me a little bit as well. Basically,
I think the most common example is a triangle query. For example, you have a social
network, and there you have friendship relationships. So friends meaning user A is a friend of user B, and then
user A is also a friend of user C. And then everybody has, like, 10 or 15 or 100 friends,
I don't know.
And the idea is basically, okay, now I want to find a triangle.
So, like, whose friendship relationships form a triangle?
And in a normal query plan, usually what you do is you have binary join plans.
So in a binary join plan, you would join the relations one relation at a time.
So first you would, for all the users, get all of their friends.
So you have this explosion of friends, because if everybody has 10 friends,
then you get 10 friends for everybody, and then their friends' friends,
and you end up with a hundred tuples per user,
so you have this explosion and it's quadratic.
And then with the second join,
you create this very filtering join,
because now you have those two conditions:
if you have friend A and then friend B, then you want friend C,
and C has to be a friend of both B and A to close the triangle.
So this is a very filtering join.
So then you have exactly this explosion of the intermediate result:
your first join will return a huge intermediate,
and then the second join will filter that down again.
And what you do with worst-case optimal joins, basically,
is you go tuple at a time, or rather attribute at a time.
So you basically check all three relations simultaneously
and then intersect those
tuples, instead of doing one relation at a time and having this huge intermediate.
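For anyone who wants to see the shape of the query Paul is describing, here's a minimal sketch using DuckDB's Python API. The friends table, its columns and the data are made up for illustration; it's just the plain SQL triangle query, which a binary join plan would evaluate one join at a time.

```python
import duckdb

con = duckdb.connect()

# A tiny, made-up friendship edge table (each friendship stored in both directions).
con.execute("CREATE TABLE friends (a INTEGER, b INTEGER)")
con.execute("""
    INSERT INTO friends VALUES
        (0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1),
        (2, 3), (3, 2), (3, 4), (4, 3)
""")

# The triangle query: three self-joins over the same edge table.
# A binary join plan first materializes all two-hop paths (f1 joined with f2),
# which can explode quadratically, and only then filters them with f3.
triangles = con.execute("""
    SELECT f1.a, f1.b, f2.b AS c
    FROM friends f1
    JOIN friends f2 ON f2.a = f1.b
    JOIN friends f3 ON f3.a = f2.b AND f3.b = f1.a
    WHERE f1.a < f1.b AND f1.b < f2.b   -- report each triangle only once
""").fetchall()

print(triangles)   # [(0, 1, 2)]
```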
Nice, cool. Yeah, that's a very good explanation. I was just thinking, as you
were going through the social network example, about how
if you've got that amount of friends it can be problematic, because it's quadratic
and it quickly explodes. I was thinking, well, my social media
accounts wouldn't be in any danger of troubling that, because with, like, one follower, two followers,
it's maybe not as challenging to the query engine as others. Maybe.
Yeah, but yeah, that's cool.
Yeah, so that's a really nice description there
of what worst-case optimal joins are.
So, yeah, we've got worst-case optimal joins and factorization.
So why has it taken so long for these sorts of techniques
to become adopted in mainstream systems,
or what makes them difficult to implement in database systems?
Yeah, I mean, maybe quickly: factorization is basically
a method of avoiding those intermediate results.
So what you have for a binary join is basically a join hash table,
like a hash join.
And this is actually quite simple.
You basically have your build side:
you just build a hash table on your right-hand relation,
and then you probe this hash table with your left-hand relation.
And let's say you have 10 friends again.
So you would have this hash table that has maybe just a collision chain.
And then for each friend you're probing into this hash table,
you would get 10 results, because you have 10 hits.
So the idea now is, instead of flattening that and having user 0 with user 1, user 0 with user 2, user 0 with user 3,
you basically have a list of users for this very first user.
So you have user 0, and then you have, times, this list: user 1, user 2, user 3,
user 4 and so on. And this really helps, because you need less memory, because you don't
have to repeat user 0 again, and also, for example, if you want to compute a count aggregate on
top of that, then you can use the factorized representation to easily compute the count, because you
don't have to expand the whole intermediate: you can just take the sum of the lengths
of the lists, because now it's 10. So this is basically the idea of factorization, and it is
a very, very useful trick.
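To make the count trick concrete, here's a tiny, made-up sketch in plain Python, nothing DuckDB-specific: instead of expanding user 0 against each of its friends, you keep one entry per build key with a list attached, and a count over the join result becomes a sum of list lengths.

```python
from collections import defaultdict

# Made-up edge list: (user, friend). User 0 has three friends, user 1 has two.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]

# Flat (expanded) representation: one output tuple per join match.
flat = list(edges)

# Factorized representation: each user appears once, paired with a list of friends.
factorized = defaultdict(list)
for user, friend in edges:
    factorized[user].append(friend)

# A COUNT grouped by user never needs to expand the lists:
counts = {user: len(friends) for user, friends in factorized.items()}

assert len(flat) == sum(counts.values())
print(counts)   # {0: 3, 1: 2}
```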
But the problem is really
how to handle those lists,
because you have a vectorized execution engine,
at least for us, for what we did
in DuckDB,
and then you don't have
one user and then its list,
but you have
2048 users, because that's the vector size.
So you're always processing those
2048 elements at a time,
and then for each of them you have those lists,
and suddenly your lists are variable-length,
so it's also not easy to provide the memory for that,
because one user, maybe like me, has two followers,
and then you have Elon Musk in the next slot,
and he has, like, millions, I don't know how many,
probably millions, yeah, if not billions, yeah.
This is really a problem, because having this
variable-sized data is just hard to deal with.
So this is mainly: how do you create those lists,
how do you make your operators emit those lists?
Because first you have to have them,
and then you have to compute on them.
This is really the challenge.
And there have been several approaches to that.
For example, GraphflowDB and Kuzu have a very cool way of, basically,
having a vectorized system where you
have this first vector, which is the left side of the factorization,
and then you have the right side
in a big flat buffer, and then you can basically select certain indexes.
It's very hard to explain without pictures,
but there's a really great thesis that I wanted to highlight from Amine on GraphflowDB,
like the predecessor of Kuzu.
So that is, I think, a good read.
Yeah.
Yeah, and it's basically, yeah, this is the main reason.
Like, how do you make your operators integrate those
representations, and then how do you
execute on them?
And this is true for factorization and for worst-case
optimal joins, and there are also two problems:
these operators are super complex
to integrate, because they have this
completely new way of
working. I mean, in some sense they are also simpler,
but they are just different
in their style, let's say it in a very
researchy way.
And then you also
have to decide whether you want to use this
worst-case optimal join or whether you want to continue
with a binary join plan, because this
triangle query, it's of course the perfect example, but how often do you run this query on,
like, an OLAP system? Like, no TPC-DS query, I hope this is true, has such a pattern.
Yeah, it is. Now you've spoken it into truth, basically, Paul. Yeah, it is, yeah. It's very
easy, we'll check it, but for TPC-H it's for sure the case.
Cool. So, yeah, you actually mentioned Kuzu and, did you mention Umbra? You mentioned Graphflow,
not Umbra, sorry. Yeah, I mean, actually, on a random aside, I just saw that Kuzu,
the project, has actually been shut down sadly. So yeah, no more Kuzu, which is a shame.
It's a pity, yeah. I'm sure it'll come out on some Hacker News thread at some point
what's happened, right? There'll be something cool there, yeah, there always is. But yes, you mentioned
that those systems have sort of tried to tackle it and have come up with some neat
approaches to taking the idea and actually making it sympathetic to, like, a vectorized engine.
And what are their limitations at the moment? Like, where are they falling short
in putting those ideas into reality? I think most of the systems that
really want to have it, have it, in the sense that Kuzu has support for worst-case
optimal joins, and Umbra has it because Umbra is very feature-complete. So it's possible, definitely.
But for us the challenge was also how to make something that Mark would merge,
so how to not touch too much.
And that's really the problem.
And also, I mean, this is different for different systems:
if you are, like, primarily a graph database
and you're expecting a lot of cyclic joins
or these kinds of things,
then you have the resources to
integrate that. So it's not unsolved,
it's just complicated in that sense.
And that's the implementation part.
And then it's still hard
to really plan those elements, because you have to have those very particular conditions, like you
want to have this, or you want to avoid this intermediate result, and sometimes maybe your
intermediate result is not actually large. Like maybe your first join, which would be the binary join, is
actually also selective, or the third join is not filtering. What do you do then?
Then your binary join plan was maybe better, and for that you really have to rely, at the
moment, on cardinality estimations and sketches and all of these statistics. But
planning is also very hard, and then having this other dimension, where now you can
choose between the worst-case optimal join and the binary join plan, which are fundamentally
different, you can't just swap them at runtime, that is adding a challenge that is, I think,
really an academic challenge.
Nice, yeah. For the listener, Mark is one of the co-founders of DuckDB,
right? So yeah, that's the bar to get it past. But that's cool. So you went
about tackling some of these issues you just outlined, and you did that with a thing called
linear-chained hash tables, there we go, a second time. So yeah, walk us through what they are,
what their design is, and how they go about mitigating some of those challenges you outlined a second
ago with these techniques. Yeah, I think the linear-chained
hash table, the idea is actually from my supervisor, Peter Boncz, who is my PhD supervisor,
and I think it's a very neat idea and also a very neat strategy, because we basically implemented
them before implementing the factorization, because they enable factorization but they also provide
certain other benefits, and so we could also sneak infrastructure for our later
plans in nicely. Laying the groundwork for the future, oh look at this, now we've managed to implement this.
Yeah, yeah, oh, what a coincidence. So it's a little bit
in that direction, but basically...
Strategic, yeah.
Strategic PRs.
Yeah.
Basically, the idea is to extend the DuckDB hash join a little bit
so that we can use the chains of the hash table later for factorization.
So a hash join, as I already said, is a very simple design.
You basically have this hash table.
You build it on your build side.
You probe it with the other, left-hand side of the join.
And then you're done.
And then the big question is, of course,
how do you design this hash table?
And, I mean, there's this very fundamental data structures
question of how to resolve hash collisions, because,
I mean, hash collisions happen because two keys
basically collide on the same slot of your hash table.
So how do you deal with that?
Either you can do linear probing,
so you say, okay, this bucket is full,
so I will just take the next one,
and if this one is full, I will take the next one
until I find an empty one, and the same during probing.
Or you can do chaining, basically
having a linked list of elements and then populating that linked list.
And both have advantages and disadvantages. Linear probing is very cool because you have all the
things inline, but what do you do if you have a string key or a struct as a key? Like,
you can't inline everything into the hash table, and you can also get very long probe sequences.
Chaining is easier if you have those big keys, but
you have a linked list, and linked lists are slow to follow,
and they can be scattered across memory,
and the hash table can be gigabytes in size,
so it's very, very slow to follow those chains.
So basically what we did, it's a little bit involved,
but it's also not unusual, like, other systems do something similar.
So basically, if you want to insert a tuple into the hash table,
you first need to hash it,
and then you get the offset from a part of the hash,
and then with this offset you go to this slot,
and then DuckDB has this chained hash table
where for each slot you have this pointer to a chain,
and this chain can then have all the different types of keys
and has some memory that it traverses.
So the problem now is that you have two types of collisions, basically.
The first is that you have the same key twice
in your hash table.
For example, if one user has multiple friends,
then this key is just in the hash table multiple times,
because it's the ID of this first user.
So then what you have is the same key colliding to the same hash,
to the same slot, which is normal.
Like, you can't avoid that.
That's fine.
And then what you can also have is that you actually have a hash collision.
So you have two distinct IDs you want to insert into your hash table,
but the hash just happens to be the same.
And if you have a small hash table, for example, like 2048 slots,
and you want to insert, let's say, 900 elements,
then it's quite likely that those will collide.
And now, because DuckDB has those chains,
different keys might go into the same chain.
During probing, you always have to walk this chain and say,
is this tuple an actual match?
Yes or no.
Is the next tuple an actual match,
yes or no? And this is kind of slow, because you have to do this comparison for every
element in the chain. So what we did is basically do some trick to make the linear probing
happen with only parts of the hash. So we add some salt, as we call it, yeah, it's like the meme. So basically pointers
are only 48 bits, and our hash table slots
are uint64, so we have 64 bits,
so there are 16 bits of
emptiness, of void, if you're not on
Android. And so
basically we fill those upper 16 bits
with some part of the hash, and we use this
as our tiny key, and
on this key we can then do the inline
linear probing. And then
we can do linear probing
for all the collisions
where the keys are actually distinct,
and only add an element to a chain
if the keys are the same.
So therefore we can guarantee that
within a chain all the keys are the same.
So therefore, if we want to probe
this chain with a key,
if we find that the first element
in the chain matches, we know that all elements
of the chain match, and that
we can use to make probing faster.
So this is
the TL;DR, kind of.
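A rough, purely illustrative Python sketch of the salt idea Paul describes, not DuckDB's actual implementation: each 64-bit slot carries 16 bits of the key's hash (the salt) next to the entry it points to, so linear probing can reject most distinct-key collisions with a cheap salt comparison, and chains stay key-unique so one key check covers the whole chain. Here the "pointer" is simply an index into a Python list of chains rather than a real 48-bit pointer packed into the low bits.

```python
# Toy model of a "salted" hash-table slot.
CAPACITY = 8
slots = [None] * CAPACITY        # each slot: (salt, chain_index) or None
chains = []                      # chain storage: (key, list_of_payloads)

def _salt(h: int) -> int:
    return (h >> 48) & 0xFFFF    # upper 16 bits of a 64-bit hash

def insert(key, payload):
    h = hash(key) & (2**64 - 1)
    salt, pos = _salt(h), h % CAPACITY
    while slots[pos] is not None:
        s, ci = slots[pos]
        if s == salt and chains[ci][0] == key:   # same key: grow its chain
            chains[ci][1].append(payload)
            return
        pos = (pos + 1) % CAPACITY               # distinct key: linear-probe onward
    chains.append((key, [payload]))
    slots[pos] = (salt, len(chains) - 1)

def probe(key):
    h = hash(key) & (2**64 - 1)
    salt, pos = _salt(h), h % CAPACITY
    while slots[pos] is not None:
        s, ci = slots[pos]
        # A salt mismatch is rejected without touching the chain's memory, and
        # because chains are key-unique, one key comparison covers them all.
        if s == salt and chains[ci][0] == key:
            return chains[ci][1]
        pos = (pos + 1) % CAPACITY
    return []

for user, friend in [(0, 1), (0, 2), (0, 3), (1, 2)]:
    insert(user, friend)
print(probe(0))   # [1, 2, 3]  -- the whole chain in one hit
```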
Nice, yeah, that's really cool. So how does
this then, well, the first question I want to ask is, you were saying earlier on
in the show that you wanted to be as
least invasive as possible
so you could get this PR merged.
Did there have to be any significant
code changes to actually make
this happen? Like,
was it easy to implement,
did much need to change
on the core
product side?
Basically the salt
thingy was already implemented,
luckily, for the aggregate hash table,
but not for the join hash table.
So it was
a known technique,
and that's also good,
because if
you want somebody to maintain it, it helps if they've seen something like it already.
And I think with the salt, also, Umbra does something similar, for example.
So this is quite a common thing to do.
And then having those key-unique chains was possible in the sense that you already had infrastructure for checking whether the keys match, because you have to do this during probing anyway.
So basically it was moving some code
to another part of the build pipeline,
and I think it was maybe 200 lines of code,
like, not too much. And it's very cool,
because usually your build side is much smaller than your probe side,
so for each tuple you insert, you maybe
probe 20 times. So now, because you do this comparison
10 times during insertion,
you will save it 200 times during probing.
So especially if you
have keys that are not that easy to compare, like strings, it's very beneficial.
So that was also the motivation to just add this as a plain data structure.
Nice, cool.
So we've got this primitive in there then.
How do we go from that to enabling our end goal of worst-case optimal joins and
factorized query processing?
Yeah.
Now it's actually quite close already, because now you have those chains and they are key-
unique. So now, if you probe your join, normally you would
flatten those lists: for each element you would take the first element of the list,
emit it, take the second element of the list, emit it. Now, instead of doing that, we just
emit a pointer to the chain. So then we basically have factorization, and now, if we want to
compute a count aggregate on that, under certain conditions, and if the
group key is not within the list, because then we can't really use it,
if the group key is in the flat part of the vector, let's say it like that,
then we can just, again, take the sum of the lengths of the lists to compute the count.
And this is the idea for that.
The problem with that, though, which I realized far into my master
thesis, is that there's actually a much easier way to compute that.
Because if you already know that you want to have the count anyway,
what you should do is make a group join,
which is a technique introduced by the Munich people.
So basically, instead of building the list in the first place,
you just compute this aggregate while creating the join hash table,
and then you can emit the aggregate directly.
So it's a little bit more fancy now with these chains,
and then emitting the pointer to the chain
and then only counting the length of the chain,
but it's already there.
Yeah.
That was not very useful.
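A tiny sketch of the two routes Paul contrasts, in plain Python rather than DuckDB internals: counting via the key-unique chains versus folding the aggregate into the build side, which is roughly the group join idea he credits to the Munich group. The data and names are made up.

```python
from collections import defaultdict

edges = [(0, 1), (0, 2), (0, 3), (1, 2)]   # build side: (user, friend), made up
probe_keys = [0, 1]                        # probe side: user ids, made up

# Route 1: build key-unique chains, the probe emits a pointer to the chain,
# and the count is simply the length of the chain.
chains = defaultdict(list)
for user, friend in edges:
    chains[user].append(friend)
count_via_chains = {u: len(chains[u]) for u in probe_keys}

# Route 2 (group-join style): fold the aggregate into the build phase,
# so the hash table stores a running count instead of the matching tuples.
counts = defaultdict(int)
for user, _ in edges:
    counts[user] += 1
count_via_group_join = {u: counts[u] for u in probe_keys}

assert count_via_chains == count_via_group_join
print(count_via_group_join)   # {0: 3, 1: 1}
```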
The question I want to ask you, Paul, is where were you when you realized you could do it this other, simpler way?
Not to rub it in or anything, but, like, literally.
In the shower one morning, like, oh my God, I could have done that.
It would have saved me so much work, but yeah.
I mean, I was talking to people from Munich,
and, I mean, actually, there were pointers to that.
And there's a paper that came out at the same time which does something very similar,
which is, I think, Diamond Hardened Joins,
from
Altan Birler
at Munich,
and he also explains that,
but only in a tiny sentence.
It's a,
yeah,
so there would have been pointers,
but I mean, it's fine,
it's fine, it's fine.
We learned something,
and we showed another way to do it.
Very true, very true.
Maybe not the best way.
Yeah, cool.
So let's talk about,
so you mentioned this earlier
in the show, I think, where you were saying
that it's often hard to know when to deploy these techniques.
It's not obvious whether you go one way or the other.
It adds another dimension to the complexity of picking the plan, basically, right,
of what you actually want to do.
So I guess this is where the adaptive decision making comes in a little bit, I think, right?
So, yeah, this concept of adaptive factorization.
So I guess start with what's the motivation behind adaptive factorization
and what we're trying to achieve here, and then tell us how you went and did it, I guess.
The idea is basically to have those chains of elements.
And having these chains of elements adds some overhead,
because suddenly you have these variable-length elements
you have to keep track of in a vectorized execution engine,
which is not trivial.
But on the other hand, you can compute things faster
if you don't have to expand them.
And to know whether to pay this extra effort of maintaining those chains,
it's very important to know whether you actually have those chains.
And this basically means,
okay,
I need to know
that there are duplicates
on the keys of my join side,
because that's what
will result in those chains.
And now,
if I don't have a special operator
for doing the worst-case optimal join,
but I'm just using the normal
DuckDB hash join,
then
those chains are there for free,
basically,
because in a binary
join tree I would have this first join, and the worst-case optimal join plan, which we can
now do by not expanding those chains and then using them later in the join plan,
also has these hash table chains.
So the first step of the plan is very similar.
And by the time we have to decide whether to now use a worst-case optimal plan or this binary
join plan, we have already seen these chains, because we can decide this after we build the first
hash table. And therefore you touch the data, you basically don't have to rely on statistics.
You can very precisely say, okay, for each value I've got seven duplicates,
which means my chain length on average is seven, which means, okay, I can update my cost model
with true values and I don't have to rely on estimates. And estimating how many duplicates
there are, like, you can't really. There are sketches for that, but they always require something,
and you usually only have them on your base tables. But what if you are
querying a CSV file or a Parquet file, or another join result? Like,
there are cases where you don't have statistics, and then if you have statistics, they are
only statistics, they are not the true value. So that's the motivation behind doing it that way.
Nice, cool. So I guess next I want to dig into the
sketches a little bit more. So, yeah, from what I'm given there is that you don't actually need them, right?
Or do we need them? How do they fit into your overall approach to collecting these
statistics, like, why do you still need them? Yeah, so basically, if we want to do our worst-case-
optimal-join-like plan now
with those
factorized elements,
then we have to do two things.
Like, first, it's for the triangle
query, and it's a little bit hard
to explain without a picture,
but I'll try.
Basically you get
these chains, like, you join,
let's say we want to make a triangle between
the three
relations.
So first you have to join
the first two relations,
and then, because
you're emitting those chains, you get,
for each value of relation
one, you get
chains with elements of
relation two. And then you basically
emit this up
into your tuple, in your
pipeline. And then
the next join comes, which is also
kind of a normal hash join, but then
you have one simple
flat condition, which joins
tuples from relation
one to relation three.
And then basically you have those two
lists, and now you can do list intersection. It's not very clear, I think, from this
explanation, but basically the idea is, okay, you only have to decide after this first join
whether you want to do this or not, and then you can either do the worst-case optimal join plan or
the binary join plan. So now, because you have the exact data from the first join,
you already know whether you have duplicates or not, so you can know
whether your intermediate
is exploding or not.
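It's easier with a picture, as Paul says, but here's a rough Python sketch of the step he's describing: once you have, for a pair of connected users, the candidate lists coming out of the two hash joins, closing the triangle is a list intersection rather than another expanding join. The adjacency data is made up.

```python
# Adjacency lists for a tiny, made-up friendship graph.
neighbours = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}

triangles = []
for a, a_friends in neighbours.items():
    for b in a_friends:
        if b <= a:
            continue
        # Close the triangle by intersecting the two candidate lists for c,
        # instead of expanding every (a, b, c) combination and filtering later.
        for c in sorted(set(a_friends) & set(neighbours[b])):
            if c > b:
                triangles.append((a, b, c))

print(triangles)   # [(0, 1, 2), (0, 2, 3)]
```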
But in order for
the join to be
really, really performant,
then you also want to
shrink this intermediate again,
let's say,
to put it
very simply.
So how do you know
that your second join
is actually selective?
Like,
you don't really have,
you only have data
on one relation
of the three relations
of the triangle.
So now,
what DuckDB does,
which is also
what
every modern analytical system does,
is, before actually building the hash table
for the second join,
it first materializes all the data
and stores it basically in memory,
or spills it to disk, in rows.
And it does that to then know,
okay, how much space do I need for my hash table,
to allocate this lookup buffer?
Because if I have 10 elements in my hash table,
then I will make a hash table of size 32, for example.
But if I have a million, I have to have a big hash table,
and it's very expensive to resize a hash table during building.
That's why you first want to gather all the data, hash it,
and then you know how much data you have to build a hash table on,
and then you build this hash table.
So this means that in this case we also have a scan over all the data of the second join,
and then we can use this scan, and we even have the hashes, to populate other sketches,
to then estimate more statistics, basically, on how selective the join will be, and so on and so on,
and whether there are duplicates on the second join.
And this is basically the motivation for doing it that way.
So you have very exact data on your first relation,
you have somewhat exact data on the second,
and then the third is still an Easter egg, a surprise.
Yeah, it is what it is, you don't touch it.
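A hedged sketch of the kind of bookkeeping Paul is describing, not DuckDB's actual code: while the build side of the second join is being materialized and hashed anyway, you can count rows and distinct keys in the same pass, which gives you an average chain length, i.e. a duplicate factor, before the hash table is even built. The distinct count here uses an exact set for simplicity; a real system would use a sketch such as HyperLogLog.

```python
def build_side_statistics(rows, key_of=lambda row: row[0]):
    """One pass over the materialized build side: row count plus distinct keys.

    The ratio is the average chain length the join hash table would have,
    i.e. the duplicate factor that feeds the cost model.
    """
    total = 0
    distinct_hashes = set()            # stand-in for a HyperLogLog-style sketch
    for row in rows:
        total += 1
        distinct_hashes.add(hash(key_of(row)))   # we hash every key anyway
    avg_chain_length = total / max(len(distinct_hashes), 1)
    return total, len(distinct_hashes), avg_chain_length

# Made-up build side: key 0 appears four times, keys 1..3 once each.
build_rows = [(0, "a"), (0, "b"), (0, "c"), (0, "d"), (1, "x"), (2, "y"), (3, "z")]
print(build_side_statistics(build_rows))   # (7, 4, 1.75)
```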
Nice, cool. Now, it wouldn't be 2025 if there wasn't some machine learning in
there somewhere, right? So I know you have used some ML as well, and, I guess, compared it
to normal heuristics, and looked at how machine learning models can also help you decide
the various join types you want to use, right? So maybe you can dig into how you applied that, what
the tradeoffs were, and what you found between the two different approaches. The classical versus
the modern, I guess. Yeah. I think, like,
machine learning, it sounds like a lot, but it's very, very basic machine learning, like a
decision tree. So, at the end, I wouldn't say it was complicated, but it's important, I think,
that it's not like a deep learning model where you don't really know what's happening.
I think the machine learning was really more to see whether there are insights beyond what
we would expect, whether there are other patterns in the data. Because for us it's
very clear: if the first join has many duplicates, that means we have an explosion,
and therefore worst-case optimal joins are probably good.
And so the idea is, okay, you have this metric, this feature, which is basically
the average chain length of the first join, which is correlated with the duplicates in the first
join, and then you have this heuristic:
if it's, I think, maybe bigger than 10 or something, then you take this
worst-case optimal join approach.
And then there are other features where it's not that clear, and so we also built those
decision trees to analyze whether there's more detail in the data. But I think the results, or
the predictivity, the true positive rate, it was very similar. It was not like
the decision tree was outperforming this heuristic of just average chain length
bigger than 10 by 2x; it was very similar.
And therefore I would also say, okay, if I were to implement it,
I would use this heuristic because, A, it's simple,
I can understand it,
and B, the model is not that much better.
So I think there are places where you can use machine learning in planning, definitely,
and there's a lot of work on that.
But in this case, I would stay with the heuristic because it's so close.
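The heuristic itself is tiny; here's a sketch of the decision Paul describes, with the threshold of roughly 10 taken from his description and everything else illustrative.

```python
def choose_plan(avg_chain_length: float, threshold: float = 10.0) -> str:
    """Pick the plan shape once the first build side has actually been observed.

    Many duplicates per join key means long chains, which means the first
    binary join would explode; in that case prefer the plan that keeps the
    chains factorized and intersects them later.
    """
    if avg_chain_length > threshold:
        return "worst-case optimal join plan"
    return "binary join plan"

print(choose_plan(1.2))    # binary join plan
print(choose_plan(37.0))   # worst-case optimal join plan
```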
Yeah, nice, cool.
So, yeah, that's interesting.
Actually, it's funny you should say that,
because over the years of doing the podcast,
quite a few people have had the same sort of opinion.
They've been like, yeah, ML models are good in some very narrow subset of corner cases for certain types of workloads,
but most times the juice isn't worth the squeeze, right?
Just stick with something simpler.
You can reason about it more easily,
and it's easier to maintain as well, often a lot easier.
So, yeah, it's interesting that you've also come to the same conclusion.
Sweet.
So, yeah, the next section of the podcast is where I ask about some results and evaluation,
and how well your techniques performed.
So, yeah, give us some clickbaity results
that I can chop up and stick in the promotions
for the podcast.
How did they do?
Basically, I think benchmarking was not trivial,
because, how do you, I mean, you have to have,
for us, we mainly focused on those triangle queries,
because they're the prime example,
and you could generalize this to cycles
with odd numbers of edges,
because there you really have this explosion,
or, like,
you can do this
worst-case optimal join-like plan very easily.
And what we used there is
a social network benchmark,
and basically we ran a triangle query
on the schema to get all the possible
triangles you can reasonably get,
and then used all of these queries
to benchmark whether we want to use
the worst-case optimal join or not,
and basically ran it once with the normal plan,
once with the worst-case optimal join plan,
while also recording those statistics,
and then we could also say,
okay, would our cost model have predicted the correct one,
therefore, like, simulating it.
And, yeah, it showed that always or never using worst-case optimal joins
was, of course, inferior to using some ML or heuristic
to decide whether to use
it or not. But to be honest, I don't have the exact numbers anymore, like how many percent
or something. But it was worth the investment. A million percent. Big number. Yeah. It was better.
The moment you shape your own benchmark, you can always get arbitrary speedups, because you can
design your query in a way that it exactly hits your optimization. And then
it's always easy to get arbitrary speedups. Yeah. Yeah.
Cool. I guess
this is available now,
your techniques are
obviously implemented in DuckDB,
so people can
benefit from your
techniques right there in the code?
Partly.
There's a branch, of course,
which is on my private GitHub,
which implements the worst-case optimal join
and also the factorized
aggregates on top of factorization,
which I talked about a little bit earlier,
where there's actually the better way of
doing it with the group join, but you can
also do it that way. And yeah, this linear probing hash table, it's in DuckDB. I think it's
been in DuckDB since 1.1 or something. It's been quite a while, actually. It has been working quite well, I would say.
And then I also had a discussion, of course, with the DuckDB maintainers about how we could
implement the worst-case optimal join, because the factorization and the aggregates are
solved more easily in other ways, and I hope
that DuckDB at some point will also integrate that
eager aggregation
and group joins, but so far not.
And unfortunately it's already
been done, like, research-wise it's covered, so
I can't do it, it's not for me,
sadly. I would like to implement it, but I
have to focus on newer
stuff. Newer stuff, yeah, more of the cutting-edge side of things, right?
Yeah. Because I was going to ask
kind of about,
I guess it's still kind of an open
question, how often, in the wild, across workloads at large,
these sorts of techniques would be beneficial. So I was kind of going down the line of,
have you had any sort of feedback from people out there in the wild,
or I don't know if any sort of statistics are collected from the
kinds of workloads people run and fed back, like, yeah, we're seeing
everyone's using worst-case optimal joins, everyone's got a social network as their
application and they just want to do triangle counting constantly. I don't know, yeah.
Yeah, I think for the default DuckDB workload, which is on-device OLAP, it's not super important.
But, I mean, there is DuckPGQ, and the goal is still to implement some of the techniques in DuckPGQ.
And as a side note, DuckPGQ adds graph query languages to DuckDB and also some other operators for things like path finding,
so it's very cool and enabling, and it's also very performant. So then, because people are using
this graph language, you would expect graph workloads, I mean, this is not a crazy assumption,
and therefore maybe also more cycles. So the idea of putting it
into an extension is very, very cool. The problem is that you probably still have to fiddle a little bit
with the hash join. So what do you do? Like, do you copy the whole operator and then replace it?
It's still not that easy to implement,
but I mean, I really hope that at some point
it will end up in DuckPGQ.
Nice, cool, yeah, so that's one future direction,
sorry, Paul, one direction you're going to go in
with this work next. But are there any other directions?
I mean, you teased there that there's some new, cool,
cutting-edge research you're doing as well,
so is any of it related to this in any way,
or is it different?
Yeah, I mean, this was my master thesis.
Now, as a PhD student, I unfortunately,
or maybe also not unfortunately,
have a completely different topic,
which is strings and string processing
in database systems.
So unfortunately I don't have that much time anymore for that.
I mean, I'm still doing some join stuff.
I'm trying to implement Bloom filter pushdown at the moment
in DuckDB. It's also not as easy as I thought, but...
It never is.
It never is, right?
So fingers crossed that this is done soon.
But now my main research
goal is to
basically do better string
processing in DuckDB,
because I think
strings are a very underestimated data type.
There has been
a Snowflake survey showing,
I don't have the number
directly, but I think it's like 20% of
all join keys
that are strings. So it's true,
people use strings as join keys,
people use strings to store numbers,
to store dates,
to store Wikipedia pages. Like, there's a lot of stuff in database systems that is
text, and now the idea is to basically use compression to first get smaller on the storage side,
but then also push it through the execution tree as compressed as possible, for as long as possible,
so that you don't have this huge memory footprint and those huge chunks of bytes
in your memory.
And these are techniques you're going to be
implementing in DuckDB as well, right?
That's the goal, yeah.
That's cool.
So, yeah, we can do this podcast
again when you finish that and publish that,
and we can come and talk about some strings as well.
Cool.
Yeah, so let's talk about DuckDB some more then,
and sort of your experience of working with it
and developing and writing code in its codebase.
So, yeah, what was your overall sort of experience
of using DuckDB and, I guess,
its extension model, or even just working in the codebase,
like, what was that like?
And what would your advice be to other people
who were thinking of
dipping their toes in and getting their hands dirty?
Yeah, I mean, before doing my master thesis,
I was more of a front-end kind of person,
so I did a lot of TypeScript.
And then...
You've come to the dark side now, yeah.
Literally, like.
So, diving into the DuckDB codebase
was quite hard for me in the beginning,
because if you don't really have an
understanding on an abstract level of what's happening, trying to
understand it from the code is impossible, I would say, or maybe not impossible, but impossible
for me. So in the beginning I really had to watch a lot of Andy Pavlo lectures on
YouTube to really understand the basics. I think this is the journey for everybody. Yeah. But then I think
DuckDB is really readable, in a sense, and there's also one resource which is called,
I think, DeepWiki.
Yes, DeepWiki.
It's basically LLM-generated documentation for open source projects, and DuckDB is in there.
And then you can basically ask questions like, how is the DuckDB string storage built up?
And it's so convenient, because it shows you what's happening and also shows you
which lines of code you have to look at.
So that's really a nice way to get started with DuckDB if you want to understand something;
it's highly recommended. And then at some point you start understanding, like,
you have to find your little spot where your breakpoint hits, and then
you're like, yay, I know what's happening, and then you can start changing things and try to
appease CI/CD. Yeah, so it's possible. And then, of course, it's very nice because
running it is super easy, because it's on your device, and this means developing
it is super easy as well, because it's on your device.
You don't have to have some Docker in between or whatever.
You can just build it and debug on your device,
and then you can debug it like any other piece of software.
A nice tight feedback loop there.
Yeah.
Yeah.
It definitely makes development so much nicer when you've got that.
No, that's really cool.
Yeah, I guess now I'd like to talk a little bit more about
kind of surprises and interesting lessons learned whilst working with DuckDB
and, yeah, on factorization
and worst-case optimal joins.
So has there been anything that sort of jumped out,
where you've kind of gone,
I really didn't expect that,
or that was a bit of a shock?
I think the systems part is always a little bit
concerning.
It's like, the moment you touch something,
it will touch something else,
and then you have a different state
of the whole thing, and then something else changes.
It's always cascading down
to places in the code where you wouldn't expect it to be.
And then DuckDB being a production system means that if you want to merge something, people might want to join on a struct of lists that contains lists that contain structs again, and suddenly you are in very weird corners of the code. So I think for research it's very important to keep it simple. For example, those join things, they worked on uint64 join keys, which is probably like 95%
of all edge tables
in graphs.
I hope nobody has
string edge tables.
I'm not...
Somebody will,
Paul.
Honestly, man.
If it's possible,
someone will have done it.
So yeah.
And then,
for doing research,
I think you don't have
to make something production-ready.
Like,
those are two different types of things.
You have to show that it works,
and then you just have to say,
okay, the CI will keep
being red and yelling at you,
and there will be a hundred tests failing, but it's fine.
So, yeah, that's the main thing, I think.
Yeah, it was fun.
I guess it'd be nice to talk now a little bit more about yourself,
as you're kind of the man behind the research and all, so to speak.
So, yeah, tell us, how did you end up working on this?
You said you were sort of a front-end guy before,
and now you've sort of worked your way into being a systems person.
So, yeah, tell us about that journey and how did you end up where you are?
Yeah, I went through the full stack, basically.
So what's going to happen next?
You're going to keep going lower down.
You'll be working on, like, semiconductors next, right?
So, yeah.
No, I mean, I did my bachelor in Germany at a university of applied sciences.
So that was very practice-focused.
I liked it, but I always missed being at a big university,
being at a research university,
so I went to Amsterdam for my master's,
and there the first class I had
was Peter Boncz's class about,
what was it, large-scale
data engineering
or something like that.
And there I met my
two best Amsterdam friends,
also very conveniently, they both work at MotherDuck now,
and so we all
got hooked by Peter through the lecture,
and we then decided to do our master theses
with him at CWI.
I think it stands for Centrum voor Wiskunde en Informatica, but my Dutch is also not great.
I'm glad I didn't attempt it, because that was a lot better than I would have managed.
It's a very nice place to do your master thesis.
It's highly recommended, because I had my own office, or, well, not my own, I shared it with other master's students,
but it was great to have a place to go for the thesis and not be alone in the library
where nobody cares about you.
You had meetings with Peter, meetings with Daniel, which were really helpful.
I really, really liked being there.
So I asked Peter whether I could stay a little bit longer, for four years, and he accepted that.
And so that's why I'm doing a PhD now, basically.
I just wanted to stay and keep studying a little bit.
It's a very nice way of working.
Yeah, definitely cool.
Very nice.
So we're nearing the end of the podcast now, Paul.
But yeah, before we do finish off, DuckDB
recommendation time. What's the one
plug-in, extension, or
feature that is your favorite thing about
DuckDB?
I think DuckDB in general is
such a nice system, being
on your device. I think it's such a simple
design choice, but it makes so much possible,
because suddenly, instead of having
a million CSV or Parquet files,
suddenly you have them all in one
place, and if you want to copy
all of your data, because you might
do something you regret later, you only
have to copy one file. You have joins across different files, you have all of these
very neat features. Since working on DuckDB, like, in the beginning I really
struggled writing SQL, and now all my Python scripts are full of 10, 15 lines of
SQL statements doing left anti joins and cool stuff. And I think one
recommendation is of course
DuckPGQ, as a very cool extension
and a very nice way
of showing what you can do
with DuckDB extensions, because the extension
ecosystem is something that's
very cool to have.
And there's a ton of others, like there's also
FlockMTL, it's
AI-focused, where
you can have AI scalar functions,
and that's from Amine.
And there's, yeah, of course,
others as well which are also kind of cool,
adding this whole other
dimension to
DuckDB.
So I like the
ecosystem and the people working on that.
So yeah.
Definitely. I keep trying to get
Amine to come on
to talk about FlockMTL,
but he keeps turning me down,
so maybe next time,
season three, I'll get him
to come on and tell everyone
about it.
But it's interesting what you're saying
there about
how your
Python scripts are now full of
SQL everywhere.
So do you think that sort of
time basically spent
inside the database
kind of makes you more
sympathetic to the way
to actually write SQL queries?
Is that kind of
what your
assessment of it would be?
I think maybe, like,
because I was just forced
to write a lot of SQL for my thesis,
that eased my
fear of doing it,
because it looks kind of odd to have those
SQL statements in your Python.
But it's so convenient, like,
you can output the
DuckDB result to a Pandas
data frame and then visualize it.
It's so cool how they integrated
that, and, like,
referencing remote files,
it just pulls the data in for you, yeah.
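For anyone who hasn't tried this workflow, here's roughly what Paul is describing with DuckDB's Python API; the file name is made up, and you need pandas installed for the .df() conversion.

```python
import duckdb

con = duckdb.connect()

# Query a Parquet file directly and pull the result into a pandas DataFrame.
# Remote http(s) or S3 URLs work the same way via the httpfs extension.
df = con.execute("""
    SELECT l_returnflag, COUNT(*) AS cnt, SUM(l_extendedprice) AS revenue
    FROM read_parquet('lineitem.parquet')   -- made-up file name
    GROUP BY l_returnflag
    ORDER BY cnt DESC
""").df()

print(df.head())
```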
Of course, you do have those SQL statements sitting in there; maybe some people say they're beautiful, but I'm on the other side.
Beauty's in the eye of the beholder, I guess, and we'll leave that one there, yeah.
They really take a lot of work away, so that's why I started accepting
them. Yeah, cool, that's very nice, very nice. Yeah, well, now it is time for the last
word. So what's the one thing you want the listener to get from our chat today?
I think, like, assuming researchers also listen to this,
and not only the DuckDB users,
I really recommend DuckDB for research,
because adding an extension can be very, very powerful
and also a way to add visibility,
or just to have something that's complete, that's usable,
because most research outputs are artifacts that, yeah,
end up on the shelf,
or are some random Python script
nobody can use because you have to install
Python 3.10.4 for
it, and
I don't have that on my computer. And also,
usually
not everything ends up in DuckDB main,
so maybe some things can end up in an extension,
like hopefully the worst-case optimal joins
at some point, and
then that's also fine. And you can
use this to demo things, you can use
it to really integrate your algorithms
in a system, because it's very different
having something like a new hash table layout in a real system,
because then suddenly you can run TPC-H,
like a proper benchmark, on it,
and you can't do that
if you're building your own C
program that struggles reading CSVs.
So yeah, there's a lot of infrastructure already there,
and it's nice to use it.
Nice, yeah, that's a good line to end on.
Thanks very much, Paul. It's been a great chat.
Thanks for coming on the show.
Thank you.
