Disseminate: The Computer Science Research Podcast - Laurens Kuiper | These Rows Are Made For Sorting | #30

Episode Date: April 12, 2023

Summary: Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this episode, Laurens Kuiper tells us about his work filling this gap! Tune in to hear about a micro-benchmark that explores how to sort relational data efficiently for analytical database systems, taking into account different query execution engines as well as row and columnar data formats. Laurens also tells us about his implementation of a highly optimized row-based sorting approach in DuckDB, the open-source in-process analytical database management system. Check out the episode to learn more!

Links: Paper | DuckDB | Laurens's LinkedIn

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate the Computer Science Research Podcast. I'm your host, Jack Wardby. Today, I'm delighted to say I'm joined by Laurens Kuiper, who'll be talking about his ICDE paper, "These Rows Are Made for Sorting, and That's Just What We'll Do". Laurens is a PhD student at CWI in the Netherlands, and he also works at DuckDB. Laurens, welcome to the show. Hi, thanks. Thanks for the intro, Jack. Happy to be here.
Starting point is 00:00:49 Fantastic. The pleasure is all ours. So let's jump straight in then. So maybe you can tell us a little bit more about yourself and how you became interested in data management research. Right. So I'm 27 years old, living in Amsterdam, and as I said, working at CWI and DuckDB. And I did my bachelor's and master's in computer science in Nijmegen,
Starting point is 00:01:12 where they don't really do much data management. So we do some modeling, some SQL, but we don't really know how the database actually works. And luckily, I had a professor there, Arjen de Vries, who worked at CWI for 15 years. So he knows Hannes, he knows Peter from CWI. And he really likes data management. And I got to know him through a course on Spark and started talking with him and eventually did my master thesis with him and got interested in data management that way, which is a miracle because there's no data management in the masters there.
Starting point is 00:01:54 I did a bunch of machine learning in my master and got really bored of that. So I'm really happy I got to meet him and I'm really happy I ended up in the data management area. Fantastic. And now you work for the ended up in the data management area. Fantastic and now you work for the coolest startup, data startup in the world as well so that's uh that must be fun right? Yeah I'm really happy to work for DuckDB Labs as well so we've been a startup for a bit over a year and a half now and yeah we're're a small team uh about 15 here in amsterdam we have some remote people and it is really fun to work uh for yeah for such a young young company but we're
Starting point is 00:02:32 getting a lot of attraction on github and on twitter and uh in the data science uh space so it is very fun to work here awesome stuff well uh let's talk about sorting then today what the focus of your paper icd icde was um so maybe you can maybe give us some background in sorting i mean it's kind of a classical topic in computer science right but maybe you can give us some background about the context of sorting in relational database management systems and just kind of generally give us the sale pitch sales pitch sales pitch for your recent paper. Yeah, of course. So sorting kind of seems like a solved problem, right? At least if you search for sorting algorithm on Google Scholar, you'll get about 2 million
Starting point is 00:03:16 hits. So it's a very well-researched topic and everyone, of course, learns about it, the NLGN sorting algorithms in their computer science degree. But this sorting algorithm research tends to focus on a few use cases that don't really translate one-to-one to database sort or sorting relational data. So they tend to focus on large materialized arrays of, let's say, integers that are just sitting there in memory. And then they just try to sort them as fast as they can. Or there's another field within sorting where they have these large distributed sorts where they sort petabytes of data. This is a bit closer to the data management or like the relational data sorting.
Starting point is 00:04:03 But for relational data, you sort on multiple keys often. So you can sort, you can order by all of the columns, ascending, descending, and you also select other columns. So it's very often the case that you select 10 columns and order by just one. So there's a big difference between that and sorting algorithm research. And that's why I chose to research this. Also, we needed a good sort implementation for DuckDB, right? and that's why um i chose to research this uh also we needed a good sort implementation for duck tv right so that's a good motivator as well awesome cool so yeah you kind of touched on it a little bit there about the challenges of implementing sorting in in relational um dbmss but like how hard is it to take the existing sort of like the, the huge amount of like existing literature on sorting and then applying that
Starting point is 00:04:48 to an implement in a relational database? What are the kinds of challenges there? So if you just want to get it to sort, then there's no, there's no lot of challenge. So you can materialize your data and call sort. And we, we implement that to be C++. So the sort API is just you give it the start of the sequence to be sorted and the end and the comparison function. The thing is, you can apply any sorting algorithm, but it probably won't be fast.
Starting point is 00:05:18 So that was the real research in my papers. You can get PDQ sort or whatever, like state-of-the-art sorting algorithm. But the API tends not to be good enough because you pass it this comparison function, and if you ship a compiler with your database, like Hyper and Umbra,
Starting point is 00:05:37 they use a compiled execution engine, then this is great because you can compile in the data types and you can compile in the comparison function. But for interpreted execution engines like DuckDB, we don't ship a compiler and we cannot compile in the comparison function. And then you get dynamic function calls in your comparison function. And for vectorized engines, you usually amortize the overhead of these dynamic function calls by calling them once per vector. And then it's really not so much of a big deal. But this comparison function, you do it n log n times. And then the dynamic function call overhead tends to
Starting point is 00:06:16 become a huge cost. And that's why you can't just call std sort or pdq sort in your interpreted database database system in interpreted execution engine cool so before you before we continue could you maybe explain the difference between a compiled query engine and an interpreted query engine right so um this started around the let's say the 2000s when um of course, database systems are an old topic, right? And we have the classical OLTP-style database systems, which have generally an interpreted execution engine. So that means these systems really pull one row through the query plan at a time. So let's say you have a table scan and then a filter
Starting point is 00:07:09 and then that's your query. Then they pull one row from the base table through the filter and then to the query result. And this tends to, then you get a dynamic function call for the filter evaluation for every row that you pull through the query plan. And around the 2000s at CWI, Peter and Marcin invented vectorized execution where you pull one vector through at a time. So let's say the vector could be 1000 or 2000 rows at a time. And then the data would be in a columnar format instead of a row format. And this really, really speeds up. The dynamic function call overhead is one of the reasons that this really sped up query processing because you're doing less dynamic function calls.
Starting point is 00:07:55 There's also SIMD and cache locality. I won't get into that too much. And they have compiled execution engines. And the compiled execution engines also deal with the same problem by just in time compiling the query plan and then by compiling in stuff like comparison functions and data types that perfectly fit the query types that the types that are expected in the query you also get no interpretation or dynamic function call overhead for these for these tuples flowing through the plan. So you said that kind of things are not fast if we just, if we're in an interpreted execution, we just kind of, we use the kind of off the shelf PDQ and stuff like that APIs. So you did in a
Starting point is 00:08:37 micro benchmark where you wanted to kind of compare the differences between, I don't know, having a row versus a column data format, and then sorting in these two different types of query engines to see kind of guess what came out best, what was the best way of doing this. So can you maybe walk us through the design of this experiment and how you set it up? Right, so I tried to make it as simple as I could and make it like isolate the raw sorting performance there
Starting point is 00:09:05 just to see the characteristics of columns and rows. And yeah, what I did is I generated a few different data distributions, so a random distribution and some correlated distributions. And I generated fully materialized columns of data or rows of data. And the rows would have all the data tightly packed together. And I would interpret it as a C++ struct. So this is actually a compiled approach because in the micro benchmark,
Starting point is 00:09:32 I don't have to interpret types because it's just my micro benchmark. There's no actual queries. So I could just compile in the types and then see the performance of sorting just rows or just columns. And then we basically isolate the memory access pattern of sorting rows and sorting columnar data and also the branch predictions of the comparison function of these approaches. So that was the setup. And then I sorted it. So one of the points I tried to make in the paper
Starting point is 00:10:04 is that it doesn't matter what sorting algorithm you use because we're trying to see what the rows and columns do, right? The sorting algorithm doesn't matter as much. So I just went with std sort, but I did cross-validate with std stable sort because these are quite different. So std sort is a introspective sorting algorithm is mostly quick sort and stable sort is a merge sort algorithm
Starting point is 00:10:30 so these are quite different cache or like memory access patterns okay cool so let's talk some results then so there's two different ways you use of actually comparing and comparing kind of tuples right and then you kind of in your, you talk about them for how it was for sorting column of data and for row data. So let's start off with the two approaches, then we can talk about column of data, right? I looked at two approaches to sort. Generally, you would think there's only one approach to sort
Starting point is 00:10:59 because you're comparing rows when you're sorting. And there's only one way to compare a row, which is just to compare the data in the first column. And if they're equal, then you have to compare them in the second column. You keep comparing the values in the rows until you have a value that's not equal or until you've compared all values in the row. And I call this the tuple at a time approach. But there's also for which some database systems implement this columnar data format while sorting. And these tend to go for the subsorting approach, which I call the subsorting approach. And there they sort the data by the first column. So let's say you have
Starting point is 00:11:41 multiple columns in your order by clause, they sort all the data by the first column. Then they identify which rows are tied, so which have the same value in this first column, and then have this small sort on the second column for just these rows. So what were the results then for sorting columnar data? How did it fare? So the subsorting approach is quite clearly better for sorting columnar data. There's some maybe, so for the merge sort,
Starting point is 00:12:13 the tuple at a time and the subsort approach were quite similar. But in general, the subsorting approach is better because you simplify the comparison function because you're only sorting by one column at a time. So your column is just, this value is less than that value instead of looping over the columns in the order by clause. And secondly, it's also more, the memory access pattern is better, because you're only accessing this one memory region at a time, which is this one column, and then you move on to the next column. So if you would do this tuple at a time approach for columnar data, you compare by the first
Starting point is 00:12:50 value, by the value in the first column, and if they're equal, you have to compare by the value in the second column, and that's random access into a different memory region, which slows down the sort. So the subsorting approach was better here. Okay, I'm guessing this flipped around for the row data? So for the row data, interestingly, it didn't flip
Starting point is 00:13:14 around. So the subsorting approach, you end up doing less comparisons in total. Okay. Because often you don't need to loop through the whole row to get the full comparison. And somehow you end up doing slightly less comparisons.
Starting point is 00:13:31 And the memory access pattern, yeah, there were slightly more cache misses for this sub-sort approach because for the row data, the data is co-located in memory already and the memory access pattern is excellent already. But somehow it was slightly faster still, this sub-sorting approach. Oh, that's fascinating. in memory already and that the cache x the memory access pattern is is excellent already and but somehow it was slightly faster still this sub sorting approach oh that's fascinating so obviously
Starting point is 00:13:51 this is this micro benchmark that focused primarily on whether how he was laying how he was laying out data so how did the how did the type of query execution then sort of impact sorting so this wasn't included in the micro benchmark right this was like kind of more of a discussion section in your paper. So maybe can you elaborate on the impact of the execution engine? Yeah, so the execution engine mostly impacts how you should compare tuples, so how you should compare your rows. So it's quite clear from the micro benchmark
Starting point is 00:14:21 that you should use rows regardless of your execution engine. But how you should compare tuples. So for the compiled query engines, it's simple. You compile in the comparison function. And that's about it. Then you have an efficient comparison function. But for the interpret execution engines, you get dynamic function calls. And this greatly slows down the comparison.
Starting point is 00:14:44 So what I ended up doing is something that was invented for System R already. So it's quite an old approach, but somehow, yeah, of course, there hasn't been much research on sorting relational data, but somehow this approach got lost. And it's to normalize the values, normalize them so they are sortable as if they are strings. So for integers that,
Starting point is 00:15:09 so generally computers use big end in architecture. So the most significant digits are on the end. But if you sort the string, the most significant digits are in the front. So what you end up doing is swapping all the bytes. And then if it's assigned integer, you end up doing is swapping all the bytes. And then if it's a signed integer, you end up flipping the bit. Yeah, you end up flipping the first, the signed bit so that the negative integers come last. For strings, of course, they're already good. And as it turns out, you can do this
Starting point is 00:15:40 key normalization, this string comparison for floating point numbers as well. You can also encode them so they are comparable as a string. Fascinating. I mean, it's mad that System.am, they did so many things in System.am now. It's crazy. It's like, I'm sure pretty much they did everything you could possibly ever want to do with a database in System.am
Starting point is 00:16:00 way back then. And yeah, it's crazy. Sorry, where was I with my question? So yeah, so maybe this isn't a valid question but he said that rose is always the best but is this sort of like kind of is having rose with a compiled engine better than having rose with a interpreted engine for example or is it does it kind of like what the interaction effects of these two dimensions is because guess what i'm asking no i think it's i think it's a good question so for um interpreted so the the so i can give you one answer straight away and that is as as the input size grows if you have a large input if
Starting point is 00:16:36 you're sorting more data then um then it's a no-brainer you always use rows and there's no real discussion there. For the interpreted execution engine, sorting by one column at a time, I think would yield more benefits. So using this sub-sorting approach would use more benefit because you don't get this interpretation or function call overhead. And for the compiled execution engines, they get a lot more freedom because they don't have to deal with a lot of this interpretation overhead while sorting
Starting point is 00:17:12 because they can just compile in the comparison function of the data types. So for the compiled execution engine, just go with rows, always go with rows and compile your comparison function and enjoy your fast sort. For the interpreter execution engine, I'd say also go with rows because, you know, you don't
Starting point is 00:17:32 know how large the input is if you have a streaming engine, right? So there's data coming in, you have to materialize it to sort. So just go with rows anyway. And then it's up to you. Do you want to compare rows at a time? Well, if you do you want to compare rows at a time well if you want to compare it like whole rows at a time you might want to choose for this key normalization approach that i just discussed where you compare the the rows as if they're strings and um or you would use the sub sorts approach where you compare by one column at a time and then don't have this interpretation overhead uh for every for every comparison yeah what's the sort of the the overhead in terms of like it's laid out on disk in column the format
Starting point is 00:18:11 and then materialized into rows for when you want to do your sorting what's the sort of overhead there how costly is that doing that transformation to then convert it back to columns at the end yeah so it's a good question so that's uh, there's a cost, but this cost is actually quite low. So if you have a streaming engine, you have this, what we call data chunks coming in. So there's vectors of data coming in. And these are like the in-flight representation of data. And you can't just say, oh, here's my data chunk. I'll store it somewhere.
Starting point is 00:18:46 You have to copy the data over to buffers. You have to actually materialize the in-flight data coming in. So if you have to materialize anyway, then you have an option. You could go to columns or you could go to rows. So you're copying the data either way. I'd say go to rows because it pays off. And the cost here is not that high because instead of copying one row at a time, you copy one vector at a time. So let's say you have 2000 rows, you copy the first column.
Starting point is 00:19:18 So that's the first vector, copy it into the row and then the second and then the third. And what you end up doing is very much a, so the data is in L1 cache all the time. Yeah. And there's just purely sequential access. So it ends up not being restrictive at all. Nice.
Starting point is 00:19:37 I remember when I first sort of was reading through the paper, in the introduction, you have that sort of example of kind of good. It's even better if you go from columns to rows and back again. was like really but no when you kind of like dig into it kind of it makes sense yeah but it kind of initially was sort of counterintuitive to me but now that's that's really interesting so i mean the kind of the next step in your paper was sort of tying together all of these existing techniques already existed in the literature and then in the context of an interpreted query engine i guess because this is going to be implemented in, or was implemented in DuckDB. Yeah, you have a nice section where you kind of, you tie these all together.
Starting point is 00:20:12 So can you maybe tell us about some more about these techniques and how you managed to tie them all together and walk us through what you found here? Yeah, so I ended up implementing this in DuckDB. So the first step is we use... So the sort is fully parallel. This is important for any database system now. If any part of your system is not parallel, it doesn't keep up with the rest of the system.
Starting point is 00:20:37 So you need to parallelize every step. So we use morsel-driven parallelism for this, which is a paper from 2015, where the parallelism comes from chunking the inputs. So let's say you have a table scan and it's 10 million rows. You partition it into inputs of around 100k rows. And then each thread scans 100k rows, and when they're done, and there's still 100k rows left that haven't been scanned, then they get that. So that's kind of a data-driven parallelism. A lot of systems before used to use some different
Starting point is 00:21:15 kind of parallelism based on some hashes or whatever, and with the exchange operator. But this doesn't scale well. We have now parallelism and we're operators and we have the parcel-driven parallelism. So let's say every thread is scanning data. Every thread gets data into the sort operator. And then each thread materializes that data into the row format. And then when there's no more data left,
Starting point is 00:21:41 we're done converting all the data to row format, each thread sorts the data that they collected. So given the morsel driven parallelism, each thread should collect roughly the same amount of rows, give or take 100,000, but it's not so big of a deal. Then let's say we have four threads, then we have four sorted runs. And so this initial sorting step happens inductively with either radix sort if there's integers or PDQ sort if there's strings. And as I said, you can't just use the PDQ sort API as is. So luckily it's open source, so I was easily able to modify it so that it was efficient for our data layout. And then let's say we have four threads and we have four sorted runs. Then we start a merge sort. Merge sort can be tricky to parallelize. There's two kinds of merge sorts. You can have a two-way
Starting point is 00:22:36 cascaded merge sort where you keep merging two runs and two runs and two runs until you have only one run left. Or you have this k-way merge sort where you use something called the tree of losers where you merge all four runs into one run at the same time. I ended up going with the two-way cascaded merge sort. I have some regrets. I want to go with the k-way merge sort, but I'll explain later. So we go with a parallel two-way merge sort,
Starting point is 00:23:04 and this can be tricky to parallelize, right? Because one thread can merge two runs. But what if you're down to your last two runs, and you have, let's say, 48 threads? You would think that only one thread can merge those two runs, but you can actually pre-compute partitions that can be merged independently and therefore in parallel. So with binary search, you do a binary search through the two runs that you're going to merge. And you find a point where you're sure that the next comparison would copy the right run and then the left run. So that's kind of like an intersection. Okay. So this is a paper called Merge Path that explains this.
Starting point is 00:23:50 And you kind of compute some intersections between the sorted runs efficiently with binary search, and then you can merge parts of the run in parallel. And that, of course, greatly speeds up the process. There's some work you have to do, but in return, you get parallelism. So that's worth the effort. Yeah, so you keep merging until you have only one run left,
Starting point is 00:24:13 and then you are done. You scan the data from rows to columns again. So let's dig into the K-Way. And when you wish you'd done that one instead, why was that? I mean, just because it's more fun to implement or more interesting, or is it actually like a performance-based reason as to why? Yeah, so I've learned now that the sort inductive is very fast.
Starting point is 00:24:38 But if you select a lot of columns, so let's say you select a very wide table, you have 10 columns, you're only sorting by one column. Then with this two-way merge sort, so the way I implemented it is what you can do is you can sort the columns in the order by clause and then collect all the data in the other selected columns after you're done sorting, right? And that would be the most efficient way because then the payload, like these other selected columns, just sit there. They don't move through memory all the time. However, if you want to sort more data than fits in memory, then you cannot do this anymore.
Starting point is 00:25:13 Because let's say your data is two times the size of memory and all these columns that you selected but are not sorting by are just sitting there, but they have to go to disk. And then you sort by the key columns, so the columns in the order by clause. And then you have to collect all the rows in the right order, but they're on disk and in memory. And then at some, so if your data size is big enough,
Starting point is 00:25:38 your random access to memory becomes random access to disk. And at that point, you're screwed because the performance just plummets. So what I ended up doing is that you don't collect it at the end. So you don't collect the payload data at the end. You just keep merging it through. So then you have sequential access, and you don't have random access to disk. But I chose to do the two-way merge because I knew how to parallelize it. And we needed a parallel sort because, as I said, you cannot have some single-threaded thing in your parallel system. So I regret this now because I know that copying the
Starting point is 00:26:19 payload throughout the merge sort is very expensive. So if you have a lot of other selected columns, then this becomes actually a performance bottleneck. And for this K-Way merge, I think I figured out how to parallelize it. I didn't know how to parallelize it because there was no related work on this. Okay. So there's a follow-up paper there, I'm sure, when you...
Starting point is 00:26:44 Yeah, possibly. So I haven't touched sorting in a while, but I think I figured out how to generalize this idea of merge path to a K-way merge. I haven't found any work on this, but I think I know how to do it. Yeah, I think if you look at the two, anyone can connect the dots. It'll take you a while, but I've thought about this a lot. Waking up in the middle of the night thinking, I've got it, I've got it. Yeah, yeah, yeah. This happens to me every now and then. So I think I know how to do that. But I'm now on to joins and aggregations. So it'll be a while before I touch sorting again.
Starting point is 00:27:22 So, yeah, we'll see so you see the drop i mean this is a question more in like how or what's the tipping point between me selecting too many columns and it becoming a performance bottleneck and versus how often do you see people doing that because i mean i guess a lot of the work in dook db is driven by kind of the what people users want right so i guess it's like how high up the priority list would it be to sort of cater for these select and large number of columns? So it's actually for the use case of DuckDB,
Starting point is 00:27:52 it's not a big deal. It's just something I want to do because some users run into problems here. So we can sort more data that fits in memory, but usually DuckDB users tend to more data than fits in memory, but usually the DuckDB users tend to have data that fits in memory. But one of our goals is to have DuckDB be like some kind of mini Spark cluster, right? Like your MacBook with incredibly fast SSD can utilize disk really well during the query execution.
Starting point is 00:28:29 And you'll be able to skip Spark because we don't like Spark. And you'll be able to use DuckTV instead to do larger than memory processing. And for sorting, so sorting shows up in window functions mostly, right? There may be some inequality joins, but most people see them in window functions. And we don't really get many requests or issues for people who have window functions that don't fit memory. And the rows tend not to be that wide. But every now and then there's someone who says,
Starting point is 00:29:04 hey, so there's someone who says, hey, I said, there's people just benchmarking DuckDB and whatever, using it for whatever it wasn't intended for, but people generating TPCH skill factor 100 and then trying to sort line item. Right, okay. And then they say,
Starting point is 00:29:19 hey, this is very slow or this doesn't work. And I think I can make that very fast, but I need some time okay cool that that's gonna be i'm sure i'll be interested when you get around to it for sure um so i just kind of briefly want to touch on implementation a little bit and obviously you're very sort of like well-versed with how like duck db is is sort of architected and built so i guess the implementation if if you i guess what i want to ask is how much of an undertaking was it to add this this new sort operator in so so i did this when uh i wasn't well versed okay this is your first sort of like yeah yeah this was the first physical operator that I touched in DuckTv.
Starting point is 00:30:06 So I'd done some scalar functions and I implemented macros in DuckTv. So then you can create a SQL expression as a function. And this is, so I touched a little bit of physical execution, but not an actual operator. And the sort operator was one of the last operators that had a very basic implementation. So the,
Starting point is 00:30:28 the join and aggregates were really optimized and well refined, but the sort was something Mark wrote in two hours. It's just something. So, so, so it worked, but it wasn't. So it was fast.
Starting point is 00:30:41 You need more time to write something fast. So this was for me, a huge undertaking because I had to learn the parallelism, the model of the, so the morsel-driven parallelism. A lot of that is abstracted away behind a nice API, but I had to learn a lot of things about parallelism and locks and atomics that I didn't really know. And I also had to learn how to use the buffer manager well,
Starting point is 00:31:05 because I also had not had to touch this with my previous endeavors in inductive. So for me, this was a huge undertaking. It took many, many months to implement this. I think if I would do it again, then I could make it faster in less time right but yeah i'm now
Starting point is 00:31:26 i'm now like years like a year and a half uh it's a year and a half after the the source went into duct to be so i've i've learned a lot and i i could do it better now yeah yeah for sure yeah i mean you need to have that initial sort of you need to go pay that initial sort of pain right to get and get to the point where you are now where you could do it do it easier right so yeah that's cool yeah i'm actually really grateful that mark and honus let me as like a starting pc student just do this right i could just uh influence sorting in their system and they just trusted me that it would be good so i'm actually very grateful that i got that opportunity. Yeah, for sure. Cool.
Starting point is 00:32:07 So I guess let's talk about, so you've obviously implemented this in DuckDB, as we've been saying, and you have benchmarked it. So can you maybe tell us a little bit about the benchmark and how you went about that? And then let's hear some numbers. How fast does it go now versus the initial quick implementation that Mark did? Right.
Starting point is 00:32:26 So I did some benchmarking on TPCDS tables and also like just random integers because I just wanted to see the raw sorting performance and then see their relational sorting performance with like multi-key sorting. And I didn't benchmark against the old DuckDB implementation, but this old implementation, I don't think you want to benchmark against it because it was fine for sorting 10,000 rows or like maybe a hundred thousand rows. But as soon as you get to some millions, then yeah, you don't know how far the progress is, but it looks like it's stuck. So I think it would take like 10 minutes to sort or like five minutes to sort 100 million integers, which we can now do in, I think it's 1.7 seconds.
Starting point is 00:33:21 So it was terrible, but it worked. Cool. Great stuff. but it worked cool um great stuff so i yeah i i kind of um i often start trying to ask ask kind of about when i'm interviewing people about their work like kind of what are the sort of like the limitations to their approach or like under what scenarios is the performance of whatever it is suboptimal and so i guess kind of putting that question to you are there any sort of limitations other than the fact that maybe you would, in hindsight, go away and do K-Way, what was the name of it again?
Starting point is 00:33:50 It's a K-Way merge, sorry. K-Way merge, sorry, yeah. I just wrote down in my notes, like, Tree Losers, and that's the word I've got there. It's definitely not called that, right? But it's an interesting like, anyway, it's got to be. No, no, it is. It is. Let me see.
Starting point is 00:34:04 So Tree of Losers is something, it's got me no no it is it is um let me see so tree of losers is something uh it's a it's a tournament tree so okay um it's it's actually the name so so there's a something called a tournament tree i think um like many things in computer science donald knuth invented this right and um it is um a way to minimize comparisons when doing this K-weight merging. So it's a tournament tree where each of the sorted runs goes into one of the entry points of the tournament tree. And you can have a tree of winners or a tree of losers. And with sorting, we generally think about sorting as doing the lowest value first. So that's why it's called the tree of losers because it loses the comparison throughout the tournament tree.
Starting point is 00:34:55 And so this minimizes the comparisons. And I would say, yeah, that also brings me to the limitation. And I think I already talked about it. But the limitation is really the two-way merge sort requires you to copy over the data if you want to have an external, like larger-than-memory kind of sort. It requires you to copy over the data many, many times. For the K-way, you don't because you're merging all runs into one. You end up only copying it once great so i guess there's obviously in an
Starting point is 00:35:29 ideal world you will you would implement this at some point but it's not it's not on your on your kind of immediate rendering so you're working on aggregations and joins so i guess kind of what is that's probably what's next on your research agenda but like how is that where going and where do you go kind of next uh right so as i said one of the goals for dr b is to be able to handle larger than memory data uh gracefully so it's actually for our old tp systems uh because they have been researched for over 50 years now this is a solved problem they can do out of core or like larger than memory joins and aggregations, but they tend to have like these small 4 or 8K buffer size and they tend to be very conservative in their memory usage.
Starting point is 00:36:10 While OLAP systems don't handle larger, generally don't handle larger than memory joins and aggregations or sorting. And they tend to use up all the memory you have to make the query as fast as they can. And we think that there's a middle ground where we can have fast in-memory performance. And let's say your data size exceeds the memory size by 50% or 20% or 80%,
Starting point is 00:36:36 then your performance shouldn't drop off a cliff. So a lot of systems tend to either, yeah, the performance drops off a cliff because they go into this OLTP style, slow kind of larger than memory algorithm, or they just say, I'm sorry, mate, I can't finish this query. It doesn't fit. So that's actually my PhD topic is to write good join and aggregation algorithms for OLAP, for larger-than-memory OLAP processing.
Starting point is 00:37:10 So I've been working on the aggregate and join. So we have a very good aggregate and join implementation, but the data was required to fit in memory. Otherwise, we threw an error. And I've been working on this materialization format. So for joins and aggregation, we also materialized data as rows instead of columnar format. This is something that is a paper by Marchin and Peter.
Starting point is 00:37:37 They found out that this gave also better cache locality for joins and aggregations. And I've been working hard on a materialization format for rows, where the rows have a fixed size, so even for strings. So that means that there's a pointer to the actual string in the rows. But this format can go to disk. So if data goes to disk, your pointers will be invalidated. So we have ways to recompute them on the fly so that this data can go from disk to memory
Starting point is 00:38:09 and the data stays valid. And this kind of unlocks a lot of out-of-core potential. So let's say for joins, generally you need the entire build site. So the hash table and the build site, you need it in memory because you need to make sure that the other side, when you're matching,
Starting point is 00:38:28 that it's matched against everything in the build side. But you can partition the build side, let's say on the hash. So we use Radix partitioning and then that relieves some memory pressure. So you pin a few partitions in memory and then you probe against that during the probe of the hash join. And we use the exact same row format for the
Starting point is 00:38:51 aggregate as well. I haven't touched the aggregate too much. I've just implemented the row format in there. But for the join, we fully support larger than memory. And so next steps for me are tweaking that until the performance is satisfactory let's say and same for the aggregate and i i spoke to thomas neumann at icde and he described an excellent optimization for hash tables that i'm also going to try to implement fantastic um this this the next question i i often ask is is kind of how can a software developer and engineer sort of leverage of the findings in your research but i mean and kind of like what impact you think it can have but i mean with it being in duck db i guess it's having real world impact now so i mean i guess i'm gonna refit rephrase it a little bit
Starting point is 00:39:45 and say like, what impact have you had so far? And have you had any feedback on kind of the performance of the new sorting operators? And what's that look like? Are you just saying this is amazing, fantastic, changed my life? So we've had some positive feedback from that. So this has been inducted before a while now,
Starting point is 00:40:02 but we've had some positive feedback from people who were happy that they could now sort large datasets because this was really prohibitively expensive before. But of course, it's also led to people, as I said, generating TPCH skill factor 100 and then sorting line item and then wondering why it's so slow well uh so these issues are then for me because i implement uh i've implemented sorting so there's been some positive feedback but of course also some negative feedback but the the negative feedback has been quite minimal because in general people don't sort uh as many, many gigabytes of data. So it has been good. But in general, with operators in database systems, people expect them to be good. So you generally don't hear much.
Starting point is 00:40:56 So of course, we get the most feedback from GitHub. You don't hear much. You only hear about, hey, my join is slow. Hey, my integration is slow, right? So I've received much more positive feedback from something like I implemented the JSON extension for DuckDB. And we can now directly read JSON files as if it's a table, right? So you can do select star from and then your JSON file.
Starting point is 00:41:22 Right, okay, cool. And we have risk support for nested types, so list and struct, and these map well to the JSON arrays and object. When I implemented that, the positive feedback was much more than I ever had for the sorting because people were like,
Starting point is 00:41:39 wow, this is some shiny new functionality. Whereas for sorting or joins, they're like, yeah, this should work. Oh, that's wild, isn't it really? Yeah.
Starting point is 00:41:49 I mean, it must be really nice to sort of, to get some feedback kind of at all, I guess on, on, on your issues. I mean, kind of,
Starting point is 00:41:55 I guess the average sort of PhD student, I don't know, I'm probably speaking from my, my personal experience here more maybe, but like you write a paper and no one ever really looks at it ever again so to have your work go straight into a system that like thousands of millions of people are using then it must be it must be a really nice rewarding experience um for sure absolutely yeah that that's very motivating actually yeah just just to think
Starting point is 00:42:22 about hey i want to get this stuff in before the next release uh that's a huge motivator for me yeah cool i mean what whilst you've been whilst you've been working on on sorting um from that sort of journey initially starting on the project were there any sort of things that were kind of interesting or unexpected that you learned while while working on sorting? Yeah, so I initially, so this lesson I learned while sorting is I initially thought I would be able to take the best sorting algorithm that I could find and just apply it to my data. Because I thought sorting also, I thought of sorting kind of as a solved problem, right?
Starting point is 00:43:03 And the more I tried to find papers on relational sorting, I was very surprised that it wasn't. So I tried this a few times with different sorting algorithms. I found that we were, so just using the STD sort API, but then with something like block quick sort or PDQ sort, I found we were a lot slower than 1ITB. And of course, we want to lot slower than monodb and of course we want to be faster than monodb so i was quite surprised with this and then i i really learned that i
Starting point is 00:43:32 you have to implement the sort yourself for your data structure for like it's not a compiled thing we're an interpreted engine so you have to really... And that was very surprising to me. And then all the way in the end, like I said, this cost of memcopying the rows around through the two-way cascaded merge sort, that was also something that I found very interesting because that's purely sequential access, right?
Starting point is 00:43:59 During the merge sort, you're purely doing sequential access copying the rows around. And I thought that would be, would not be very expensive, but it ended up for if you have a very wide table and being much more expensive than i thought and at some point i will influence this cable like vendetta now like like a mission i've got to do this at some point yeah absolutely yeah i have to yeah fantastic and i mean the next question i often ask people i guess we've maybe touched on a little bit and kind of things that you try
Starting point is 00:44:30 along the way like you mentioned second though that you can't just take something off the shelf and and apply it and it worked but are there any any other sort of like i guess um like war stories from working on this that things that were um like tried and failed so stuff that tried and failed so um i actually so i first implemented just sorting and memory sorting and then i made it go um larger than memory um and i had my own row format just for the sort and the rows weren't fixed size so there were variable size so the strings were just in there. And kind of at the same time, Richard, one of our remote contributors, he started unifying the other code where we had rows that are fixed size.
Starting point is 00:45:24 So he wanted to unify every place in the activity that we were using rows because we were using these solutions that are just in every operator instead of having like this unified framework of going to rows and i ended up like scrapping my row stuff and then using his row stuff and then there were pointers in there so i had to figure out how to deal with that and my approach was pointer swizzling. So instead of the pointer, you can overwrite it with an offset before it goes to disk. And then you can recompute the pointer when it loads back into memory. And that's what we have in most operators now.
Starting point is 00:45:55 But that is very tricky to get right because you have to keep track of, is this a pointer or is this an offset? You need to have some serious bookkeeping everywhere. And then at some point, Mark implemented this column format or like column materialization format. And this also had the problem that there were pointers in there. And then I figured, hey, you don't actually have to go from pointer to offset. You can just load the data back and then compare the pointers. Is the
Starting point is 00:46:27 new pointer still the same as the old pointer that was stored there? If it's not the same, you can recompute it on the fly without ever going to this offset. And having learned that from the column materializing data in a columnar format, I immediately applied this to this new row materialization format that I've just implemented. And this is really nice because if the data doesn't ever go to disk, then you've never touched this pointer and made it an offset. You just compare one pointer and see, oh, the pointer is still valid. We can scan this.
Starting point is 00:47:03 And that was a huge lesson. So that took actually, let's say, two years to realize that this is the way to go. Two years. Yeah, I had many iterations of this and we've now realized this is the way to go for us. Worth it in the end then. It's been a long journey, but it's paid off.
Starting point is 00:47:23 Absolutely, yeah. Fantastic. Well, I've just got a couple couple more couple more questions now but obviously all of you you've worked kind of into into db db other than sort of the the the joins and the aggregations are there any other sort of things you're working on or like your other research that the listener might be interested in hearing about uh yeah so the the joins and aggregates, the out-of-course stuff is my main research, but I don't really have a time for
Starting point is 00:47:51 side projects in research, but I do have side projects in DuckDB. So the JSON extension was a huge side project for me, and I had a lot of fun doing that. So if anyone has like a lot of new line
Starting point is 00:48:11 delimited JSON lying around and they want to shred through it quickly, or they just want to try it out, please check the JSON extension. We have a blog on it. And I want to give a shout out to the author of the open source JSON library that we use.
Starting point is 00:48:24 So we use the YYJSON library. And it's written in C89. So it's very portable, but it's also incredibly fast. And we, of course, want to have portable software as DuckDB. So this author is really awesome because I requested a change because I really needed this for the DuckDB implementation. I had to do something with allocations. And he said, well, I could, but that's a breaking change. And then one day later, he said, okay, let's make the breaking change.
Starting point is 00:49:00 He changed the API for us. It was really useful and made our JSON allocations much more elegant. Amazing. That's awesome. Yeah, so you said that you don't really have much time for side quests other than things within DuckDB. But how do you go about deciding those things? Working on the JSON extension, for example, how do you go about like deciding those things like working on the json um extension for example how do you actually approach saying okay this is a cool idea to work on and then just saying okay
Starting point is 00:49:31 yeah i'm actually going to dedicate some time to this yeah so i think i started on the json extension about a year and a half ago and i I did it just because I thought it was fun. So I saw this YYJSON library and I was like, yeah, this is perfect. Let me try to write some JSON functions in DuckDB. And I basically copied what other systems have, right? Because people expect a system to have those functions. And I had some basic functionality, and then I didn't touch the extension for let's say a year and so that really was something that I was interested in but then of course since then the company was founded and of course then money starts dictating what
Starting point is 00:50:19 you work on so one of our clients actually requested this and uh i went into json again and made us be able to read um it's but but our clients so it's not like they they pay us to to implement this functionality it's also for our clients we have clients that have a common goal right we want to make ductv better but they want to use ductv for a specific use case so we collaborate and then we try to make the best json processing that we can and then we're really happy with it because it ends up in the open source project and everyone can use it and then they're happy with it because they wanted this feature specifically okay cool so it's like kind of like a they almost like sponsor the work in a way i guess then i
Starting point is 00:51:05 don't know if that's the right way to thinking about it but like they yeah it's okay no that's it's kind of kind of uh sponsorship yeah because it's open source right we don't make uh tailor made stuff for them it's uh it all goes into the open source project fascinating fantastic stuff yeah and long may that continue um cool so yeah just two more questions now so the penultimate one is um it's kind of a big picture sort of question so it's what do you think is the biggest challenge in data management research today yeah i think this is a very very tricky like very tricky question so i'm very absorbed in my own research um But what I saw at ICDE is, I think one of the challenges is this scale up versus scale out question.
Starting point is 00:51:56 And the CEO, not CEO, the lead engineer at Google BigQuery, who's now the CEO of Motherduck, is Jordan Tigani. And he saw that about 90% of the queries in BigQuery accessed about 100 megabytes of data. And of course, we need these large distributed systems because sometimes we have analytics that are much larger than a single system can handle. But I feel like in data management research, there has been so much focus on this distributed system that there has not been enough attention for the single node. And I feel like some of the researchers don't
Starting point is 00:52:45 realize that we really need this single node because 90% of the queries don't need the distributed system. And we shouldn't be using the distributed system because it costs more money. It uses a lot of energy. It's often slower because they assume that the query will be a large distributed query, and then it ends up being a small one. Right. So of course I also say this because we have a nice single node system, but for me, that seems like something that's quite important right now.
Starting point is 00:53:15 Awesome. Yeah. So I guess last, last question now, last word, what's the one thing you want the listener to take away from this podcast episode today? The one thing. Okay. You can have two things. You have as many things as you want. No, take away from this podcast episode today uh the one thing okay
Starting point is 00:53:25 you can have two things you have as many things as you want no i think the one thing so for me this this journey of uh researching sorting is don't ever make the mistake of thinking that something is a solved problem okay and and uh in the, the solution for me wasn't very difficult. It was use rows, use this technique from system R, right? But this combination of techniques that were already there, but that had not been researched well, like, for example, join algorithms, which have been researched a lot. This combination of techniques that may not seem very novel ended up being very effective and something that was not researched before. So if there's any researchers thinking that, hey, I can't research sorting, it's been beaten to death, let's say, probably not. There's still
Starting point is 00:54:19 some stuff to do for you. You can still research that. Fantastic. That's a great message to end on. So let's finish things there. Thank you so much, Lawrence, for coming on. If the listeners are interested in knowing more about Lawrence's work, we'll put links in the show notes. And if you enjoy listening to the podcast, please consider supporting the show through Buy Me A Coffee. And we will see you all next time for some more awesome computer science research.
