Disseminate: The Computer Science Research Podcast - David Justen | POLAR: Adaptive and non-invasive join order selection via plans of least resistance

Episode Date: April 3, 2025

In this episode, we sit down with David Justen to discuss his work on POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance, which was implemented in DuckDB. David shares his journey in the database space, insights into performance optimization, and the challenges of working with modern analytical workloads. We dive into the intricacies of query compilation, vectorized execution, and how DuckDB is shaping the future of in-process analytical databases. Tune in for a deep dive into database internals, industry trends, and what's next for high-performance data processing!

Links: VLDB 2024 Paper | David's Homepage

Transcript
Starting point is 00:00:00 Disseminate, the Computer Science Research Podcast: the DuckDB in Research series. Hello and welcome to our ongoing DuckDB in Research podcast series. So let's start off by saying, for the listener who doesn't know, what DuckDB actually is. So DuckDB is an open-source, in-process SQL database that is designed for fast and efficient analytical query processing. It's really simple to use, deploys with zero external dependencies, it's extensible, it integrates really easily with all your data science tools, having APIs for things like Python and R,
Starting point is 00:00:34 and you can directly read from file formats like Parquet and CSV. So why this DuckDB in research series, you may ask. Well, Disseminate as a podcast is all about impact and helping bridge the gap between research and industry, which I'm sure many of you have heard me say before. And DuckDB is the perfect example of this,
Starting point is 00:00:54 of these two communities working together, in that the ideas from decades of research are at its core, and it's now influencing the research community as a platform for others to build on. And that leads us nicely onto today's show and today's guest. So I'd like to welcome David Justen to the show. Yeah, very happy to be here.
Starting point is 00:01:13 Thanks for inviting me. Cool. So David, your research focuses on query optimization for analytical databases, and you implement a lot of your ideas in DuckDB, right? So this is why we're speaking to you specifically. We're going to be chatting about some research you published at VLDB this year. And the title of your paper was Polar: Adaptive and Non-invasive Join Order
Starting point is 00:01:35 Selection via Plans of Least Resistance, which I'm sure will make a lot more sense by the end of the podcast. But before we do get into that and talk about your experiences with DuckDB and everything as well, tell us more about yourself and your story and how you became interested in databases and research. Yeah, of course. So I actually started out in my bachelor's in media informatics. So that's basically a degree program which is sort of applied computer science with a bit of
Starting point is 00:02:04 audio and video and quite a lot of web development. And there we had this very superficial database lecture where we had a bit of SQL and normalization, denormalization. And I thought, oh my God, that's extremely boring. I don't want to do that ever again. But then I switched to the Hasso Plattner Institute for my master's. And the Hasso Plattner Institute is actually a privately funded university faculty founded and financed by Hasso Plattner, who is one of the co-founders of SAP. And they had this seminar in the first semester called Develop Your Own Database at Hasso Plattner's chair. And even though I thought,
Starting point is 00:02:44 okay, databases are not really my thing, I thought, okay, building stuff is kind of fun. So let's try that. So we found ourselves in groups and built features for Hyrise, which is the research in-memory database at the Hasso Plattner Institute. And it turned out to be very fun, actually. And I think we also did not too badly, because after that our whole group actually got hired as student research assistants. And that was really sort of the beginning, where I started taking more and more database seminars at the Plattner chair and also wrote my thesis in robust query processing. And yeah, that made me organically grow into that chair, I guess,
Starting point is 00:03:27 because I wanted to continue working on these systems that they have at that chair. But I also wanted to learn all these cool skills that come with being a PhD student, like speaking in front of crowds, supervising other people, organising and motivating yourself, and also having all this responsibility for your own work. And that's really what got me excited. And then after one year, I actually switched to Berlin, because Hasso Plattner retired and the chair got closed down, and that's where I am now. And yeah, I'm very happy there.
Starting point is 00:03:56 Awesome stuff. Yeah, it's always interesting that a lot of the intro-to-databases lectures do focus on those sorts of things: a little bit of normalization and whatnot and a brief little sprinkle of SQL. It's never presented in the most interesting way, I feel like. If they started with the cool stuff, maybe a lot more people would be into databases. My first encounter with databases was one of those lectures too, and I thought it was just boring and wasn't much interested. Yeah, absolutely.
Starting point is 00:04:25 Yeah, look where we are now, right? Cool. And so we're going to be talking about things like query optimization, join ordering, and this concept of adaptive query processing as well. So let's kick things off with a little primer for the listener. Before we dig into your research, give us an idea of what these things are and what they mean. Yeah, sure. So traditionally database systems use what is called static query optimization. So after a database system translates a user's SQL query, for example, into an executable query plan,
Starting point is 00:04:59 that plan must be optimized to get the best possible performance, and then it's executed. An integral part of this query optimization step is to reorder join operations. So the goal there is to find the tree of joins that has the lowest cost, or the lowest number of intermediate results. And a common approach to do that is divide and conquer. So you start with small subtrees of your query plan and estimate the cardinalities. Then you use some sort of dynamic programming algorithm to combine the subplans with the lowest cost into your final join order. And the quality of this result then heavily impacts the execution performance. And the problem with that is, while this cardinality estimation is sort of trivial and works well
Starting point is 00:05:48 for well-behaved synthetic data, or a lot of benchmarks as well, it is extremely complex for realistic data where you have skew and correlations and dependencies. Even state-of-the-art systems now show cardinality estimation errors in the range of multiple orders of magnitude on real-world workloads. And the result of that, of course, is suboptimal join orders and bad performance. So the query takes much longer to execute than it should. And yeah, there have been a lot of improvements in static query optimization: new sketches, sampling methods, and statistics.
Starting point is 00:06:30 This has been a very, very heavily researched area, but it's still an open problem today. On top of that, there is another problem: these optimizers tend to be sort of flaky. You can run into a so-called performance cliff. The query optimizer's behavior changes because of one little change in your workload, and then your performance might deteriorate drastically. So let's say, for example, you have some join query, and then somebody adds one row to a table that is part of that query. The statistics change, and then you hit that join query again and the query optimizer picks a different join order that might perform much worse than before. And as an end result, a user would do the same operation twice and suddenly it takes much longer, and that can be super frustrating, right?
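To make that concrete, here is a toy sketch in Python of how estimated selectivities drive the choice between join orders, and how a single misestimate can flip which order is cheapest. All cardinalities and selectivities below are invented for illustration; this is not the cost model of DuckDB or any other particular system.

```python
# Toy cost comparison for R JOIN S JOIN T: "cost" here is simply the total
# number of rows flowing out of each join in a left-deep order.

def join_output_rows(base_rows, selectivities):
    """base_rows: table cardinalities in join order, e.g. [|R|, |S|, |T|].
    selectivities: one (assumed) selectivity per join step."""
    rows, total = base_rows[0], 0
    for extra, sel in zip(base_rows[1:], selectivities):
        rows = rows * extra * sel   # rows produced by this join
        total += rows               # work the next operator must consume
    return total

# Order 1: (R JOIN S) JOIN T; Order 2: (R JOIN T) JOIN S.
estimated = {"(R,S),T": join_output_rows([1000, 1000, 100], [0.0005, 0.01]),
             "(R,T),S": join_output_rows([1000, 100, 1000], [0.01, 0.001])}
actual    = {"(R,S),T": join_output_rows([1000, 1000, 100], [0.05, 0.01]),
             "(R,T),S": join_output_rows([1000, 100, 1000], [0.01, 0.001])}

print(min(estimated, key=estimated.get))  # optimizer picks (R,S),T ...
print(min(actual, key=actual.get))        # ... but (R,T),S is actually cheaper
```

With the assumed numbers the optimizer's favourite barely moves under its estimates, but once the first join turns out far less selective than estimated, the chosen plan produces two orders of magnitude more intermediates; that is exactly the flaky behavior described above.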
Starting point is 00:07:16 So therefore, research has come up with this notion of robust performance or robust query processing and the idea is that instead of focusing on the peak performance for most of the queries and just accepting the horrible performance that you might get in the worst case, you go for a good or acceptable performance in the average case and from then on, degrade gracefully performance-wise when conditions get more difficult. One measure to improve this robustness for join ordering is called adaptive query processing. And what you do there is instead of trusting the estimates that you get during your static
Starting point is 00:07:55 query optimization step, you start executing some plan and then measure the actual cardinalities. And you can use those cardinalities to actually change the plan and then execute that. That sort of creates this feedback loop where the plans get better and better over time, and you end up with a plan that might be much better than the one you started with. And yeah, that's where we arrive at Polar, which is a new approach within that adaptive query processing field of improving join orders on the go to make up for bad cardinality estimates. Yeah, it's really interesting what you're saying there, that as a user, you would much
Starting point is 00:08:35 prefer to have maybe slightly less overall peak performance, but that more robust experience; the reliability there is more important than being a little bit faster. So I think it's a really interesting point. I know as an end user, you'd rather have that. I was also thinking, when you were speaking at the very start, about how everyone's got a plan with regards to the synthetic data, and as soon as it meets the real world, it just goes out the window.
Starting point is 00:09:02 Everyone's got a plan until, sorry, most plans don't survive contact with reality, right? I think some war general said that. You can make all the best plans you want in the world, but as soon as you put it out there, yeah. And what was it Mike Tyson said? Everyone's got a plan until they get punched in the face. But I mean, that's a different quote for a different context. But yeah, anyway, cool.
Starting point is 00:09:20 So yeah, you teased Polar there. So what's the TL;DR on Polar? How does it fit into this space of adaptive query processing and ensuring that we have robust query processing as well? Yeah. So as I mentioned, we had these two decades of AQP research before. So the question is, why would we even come up with a new approach? And the thing is that even though there were some promising results in previous papers, some of them quite old already, this whole topic of adaptive join reordering is still nowhere to be found in the database systems that people actually use in practice. And what we did then is we looked
Starting point is 00:10:02 at the past approaches and we found two major flaws that really hinder adoption. One of those is an invasive design. So one inherent flaw is that most of these approaches intertwine the optimization phase and the execution phase. They, for example, run the query optimizer during execution again and again, and that breaks one of the fundamental paradigms of database system design. And that makes it, of course, a bit difficult to integrate such an approach into your codebase. On top of that, custom join operators are often required, so you can't use your original join code. And yeah, in general, just many code changes are necessary to adopt the approach. And the second dimension that we
Starting point is 00:10:54 saw is the performance overhead. I mean, we said that it's acceptable that some queries get a little bit worse, but what we see is that we often have these great improvements for worst-case queries, but still very unacceptable overhead for the simple queries where just using the normal database system optimizer would have done fine. And that's where we come in with Polar, because we designed this system specifically for non-invasive integration. So we want to keep the phases of optimization and execution completely separate. We don't want any optimizer calls during the execution, don't want to require any custom join implementations, and we also make use of the host system's optimizer, because it can be right, actually. And secondly, we focused
Starting point is 00:11:43 on low overhead, so we never want to waste any work or throw away intermediate results, and we always want to allow falling back on the original plan that the query optimizer gave us. And we want to be able to do that without additional cost. And that way we think that Polar shrinks the gap between adaptive query processing research and actual adoption in real database systems. Nice. Yeah. So these two things here: the non-invasive aspect of it, and the performance, not making the performance terrible for the simple things. And maybe I'm getting ahead of it, because we're going to talk about how things got integrated with DuckDB probably a little bit later on in the show. But I wanted to ask now: did the way that DuckDB is set up sort of align
Starting point is 00:12:38 nicely with this non-invasive design goal of yours? Is that why you chose it as the candidate system? Well, on one hand it does, because it has these nice extensibility features where you can just add optimizer rules and custom operators. And now they also have, I think, fairly good documentation on that. But originally I chose DuckDB because I could not choose Hyrise, actually. That would of course have been my favorite choice, because that's what I had worked in before, but Hyrise has this operator-at-a-time execution model, and these adaptive query
Starting point is 00:13:18 processing approaches usually have to do some sort of tuple routing. So you route some tuples through one join order and some through the others. But you can't do that, of course, if you have this operator-at-a-time model where an operator just processes all the tuples and then, well, that one is done, right? So I heard about DuckDB and people said it's great, and it has this batch-at-a-time execution engine. So I thought, okay, sure, I'm going to try it. Cool. Well, we're going to get into that in some more detail now. I kind of jumped the gun a little bit there, but yeah.
Starting point is 00:13:50 So tell us about this concept of a Polar pipeline. Can you tell us what a Polar pipeline is, how it works, how you go through that pipeline, what it looks like, and the different stages within it? Yeah, sure. I'll try to explain it from the very beginning to the end. So with this non-invasive integration and this low overhead, we have pretty tight constraints, actually, especially because we decided that we want to be able to jump back to the original query plan without additional cost. Because that meant for us that we cannot focus
Starting point is 00:14:31 on an entire query plan, but instead focus on join pipelines. So we expect the query optimizer to give us a pipelined plan, and each join pipeline is basically just a sequence of join operators that do not completely materialize the intermediate results in between. Each pipeline then basically works like a left-deep join tree, where the tuples are pulled from some source table or source operator and then probed into multiple build tables.
Starting point is 00:14:59 And the benefit of that approach is that after each tuple completes that pipeline, we can very trivially change the join order without doing any additional work, because the result is always the same. So how do we do that? As I said, we assume a pipelined plan that we get as output from the optimizer, and then we implement an additional optimizer rule. So we want to look at every pipeline that contains more than one join, because those are the pipelines where we can
Starting point is 00:15:29 potentially reorder joins. So what we do then is to generate additional join orders and insert them into the pipeline. How we do this generation, I will say in a minute. And to connect these join orders, then we also insert a multiplexer operator before those join orders. So that one is then connected with each of the join orders and can send tuples to any of them. And for each batch of input tuples, it basically just decides which join order will be fed with it.
Starting point is 00:16:02 And after the join orders, we insert an adaptive union, and that one just reorders the columns to match the column order of the original join order, to unify the pipeline output and not create unexpected behavior. And with this design, we have all the plan generation components in the optimization phase, and in execution we can focus solely on the tuple routing through these previously defined join orders, without rerunning the optimizer and generating new ones. So I mentioned the join order selection already, which is the way we generate the additional join orders for a Polar pipeline. And the goal here for us is to cover a diverse set of join orders.
Starting point is 00:16:53 And diverse here means that it's supposed to work well for the different join selectivities that we might see during the query. So for that, we consider the selectivities that we might see in our join pipeline as a d-dimensional space, where each join's selectivity is one dimension. And if we now discretize that space with an exponential decay, we have a grid that covers the extreme values. We do that because very low selectivities are much more common in real-world workloads than, let's say, a uniform distribution would suggest. And yeah, so we have that grid, and that enables us to just point into it and take a number of samples from there.
Starting point is 00:17:46 And each sample is just a set of assumed selectivities. We take that and invoke some dynamic programming algorithm to find the optimal join order, in terms of minimal intermediate results, for that sample. We just do that a couple of times to enhance the Polar pipeline with additional join orders. And the beautiful thing about this approach is that if there is a join order that is optimal for a large space of selectivities, then it's much more likely for this join order to be selected than another one that only works for a very narrow set of selectivities.
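As a rough illustration of that selection step, here is a small Python sketch of sampling join orders from an exponentially decaying selectivity grid. The grid size, decay factor, and sample count are made-up knobs, and the dynamic programming optimizer is passed in as a callable, standing in for the host system's own algorithm:

```python
import random

def selectivity_grid(points_per_dim=6, decay=10.0):
    # e.g. [1.0, 0.1, 0.01, ...]: exponential decay emphasizes the very low
    # selectivities that dominate real-world workloads.
    return [decay ** -i for i in range(points_per_dim)]

def sample_selectivities(num_joins, num_samples, seed=42):
    grid, rng = selectivity_grid(), random.Random(seed)
    # each sample is one assumed selectivity per join in the pipeline
    return [tuple(rng.choice(grid) for _ in range(num_joins))
            for _ in range(num_samples)]

def choose_join_orders(num_joins, optimal_order_for, num_samples=20):
    """optimal_order_for: stand-in for the host optimizer's DP algorithm,
    mapping a tuple of assumed selectivities to the optimal join order."""
    orders = []
    for sels in sample_selectivities(num_joins, num_samples):
        order = optimal_order_for(sels)
        if order not in orders:   # an order optimal over a large region of
            orders.append(order)  # the space wins many samples; duplicates
    return orders                 # are dropped, leaving a diverse set
```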
Starting point is 00:18:33 So we then have this Polar pipeline, and now it only needs to get executed. What happens there is that during execution, the pipeline just pulls a set of tuples from its source, and the multiplexer passes that set to one of the join orders. During this join processing, we need to do some bookkeeping. What we do is measure the so-called path resistance, and that is the number of intermediate results that occur per input tuple of the first join. So that's basically just a relative measure of the work that was done per tuple flowing through one join order. This path resistance is then reported back to the multiplexer, and the multiplexer then just uses this new information to make the next routing decision.
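A minimal sketch of that bookkeeping might look like this in Python (the real measurement happens inside DuckDB's operators; this just shows the arithmetic):

```python
class ResistanceTracker:
    """Tracks the path resistance of each join order: intermediate results
    produced per input tuple fed into the order's first join."""

    def __init__(self, num_orders):
        self.inputs = [0] * num_orders          # tuples routed to each order
        self.intermediates = [0] * num_orders   # intermediates they produced

    def record(self, order, batch_size, intermediates_produced):
        self.inputs[order] += batch_size
        self.intermediates[order] += intermediates_produced

    def resistance(self, order):
        if self.inputs[order] == 0:
            return None   # unmeasured: the multiplexer must probe it first
        return self.intermediates[order] / self.inputs[order]
```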
Starting point is 00:19:09 So how does the multiplexer make this routing decision? Yeah, so to get an initial overview of these resistances, the multiplexer needs to feed each join order with a few tuples, and then we get a reported resistance for each of the join orders in the pipeline. From then on, we can use this notion of a probabilistic regret bound. This bound relates to the additional number of intermediate results compared to the best join order that is present in the pipeline.
Starting point is 00:19:47 And it is no hard guarantee but a probabilistic one, because we look at the maximum number of additional intermediate results assuming the path resistances that we have measured so far. We use that notion because we only know the resistances that we have seen before and simply don't know yet how the data characteristics might change for the remaining tuples. So to make a routing decision now, we have to give the multiplexer a regret budget: how much do we want to overstep this probabilistic optimum? Let's say we give it 10%. It also needs to know the measured resistances for each join order. Then we can find a set of weights for the join orders so that the summation of each resistance multiplied with
Starting point is 00:20:41 its weight is lower than or equal to the minimal join order resistance that we have in our set, plus our regret budget. And these weights that we get from that are sort of the relative amounts of tuples that we would need to send to each join order to satisfy this probabilistic regret bound. Then, when we have these weights, we still have to decide how to translate them into an actual routing decision. So how many tuples need to go to which path next? That's what the different routing strategies are for, and we have two of those that use this probabilistic regret budget.
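Here is one way that weight computation could look in Python. The closed form below, which gives every non-best order the same share, is an illustrative simplification rather than the paper's exact scheme; the budget is interpreted as relative, so 0.1 means at most 10% more intermediates than the best known order would produce alone:

```python
def regret_weights(resistances, budget=0.1):
    """Tuple-share weights with sum(w_i * r_i) <= r_min * (1 + budget)."""
    k = len(resistances)
    best = min(range(k), key=lambda i: resistances[i])
    r_min = resistances[best]
    slack = sum(r - r_min for i, r in enumerate(resistances) if i != best)
    if slack == 0:
        return [1.0 / k] * k           # all orders currently look equally good
    # largest equal share x for the non-best orders that stays within budget
    x = min(r_min * budget / slack, 1.0 / k)
    weights = [x] * k
    weights[best] = 1.0 - x * (k - 1)  # everything else goes to the best order
    return weights

print(regret_weights([1.0, 5.0, 2.0]))
# roughly [0.96, 0.02, 0.02] (up to float rounding):
# 0.96*1.0 + 0.02*5.0 + 0.02*2.0 = 1.1 = best resistance plus the 10% budget
```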
Starting point is 00:21:15 So the first one is called adapt tuple count. And that's, I guess, the straightforward, very obvious method where you would just set a window of n tuples, assign the number of tuples that go to each join order according to the weights, route one tuple set after the other, and then recalculate the weights with the new resistances and repeat. With that, we have this fixed window of n tuples and adapt the input sizes for the join orders within each of these windows.
Starting point is 00:21:54 Now, there is a second strategy, which is called adapt window size, and it does exactly what it sounds like. So it's the other way around: we set a low number of tuples n, a number that should be just enough to measure the new resistances for the bad join orders, basically. We assign n to each of those join orders, except the one with the least resistance, and then calculate the number of tuples that we need to send to the best join order to stay within our budget. And why is that useful? If our system, like DuckDB, uses vectorized execution,
Starting point is 00:22:34 so we perform operations on multiple values with a single instruction, then this system strongly prefers processing values in batches, right? The performance in turn suffers if join orders change all the time, because then you're only processing a few tuples at a time. The previous strategy may struggle with that if, for example, two join orders perform equally well, because that limits the number of tuples that go into the best join order. The adapt window size strategy maximizes that number, so that we can route most tuples to the single join order with the least intermediates. And that's then more suited to vectorized execution.
Starting point is 00:23:14 And in our prototype in DuckDB, we actually saw that this is true, but adapt tuple count might potentially work better in a tuple-at-a-time execution model, for example in Postgres. Yeah. And that's it. Then you should have your result. Nice. A journey through a Polar pipeline for the listener.
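To summarize the two strategies, here is a hedged Python sketch; the window and probe sizes are illustrative defaults, not DuckDB's, and the weights for the first strategy would come from a regret computation like the one sketched earlier:

```python
import math

def adapt_tuple_count(weights, window=10_000):
    # fixed window of tuples, split across the join orders by weight
    return [round(w * window) for w in weights]

def adapt_window_size(resistances, budget=0.1, probe=100):
    # fixed small probe for every non-best order; the window grows so that
    # most tuples flow, in one large batch, through the best order
    k = len(resistances)
    best = min(range(k), key=lambda i: resistances[i])
    r_min = resistances[best]
    counts = [probe] * k
    # intermediates we "overpay" by probing the non-best orders
    extra = sum(probe * (r - r_min)
                for i, r in enumerate(resistances) if i != best)
    if r_min > 0 and extra > 0:
        # enough tuples on the best path to keep relative overhead <= budget
        counts[best] = max(probe, math.ceil(extra / (r_min * budget)))
    return counts

print(adapt_window_size([1.0, 5.0, 2.0]))  # -> [5000, 100, 100]
```

The second strategy keeps the batch going to the best order as large as possible, which is what a vectorized engine wants; the first spreads each window across the orders, which suits tuple-at-a-time engines better.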
Starting point is 00:23:31 Cool. Just to recap then. So we get our query plan out, it comes out of the optimizer and it is as it is. At this point we haven't done anything, so we've been totally non-invasive. What we do is we look in that plan for the pipelines with more than one join, because that's where we can actually change the join order. And then we replace those with Polar pipelines. That means introducing two new operators. One of them does the routing, which would send tuples to one of
Starting point is 00:24:06 these different join orders we've generated. And at the end, we kind of have a, it's not a merge, right? But we kind of bring things back together. Oh, what's the exact term for it? The adaptive union. There we go. Okay, cool. So we have a few different ways of selecting the join orders. You mentioned the selectivity sampling, that's kind of the main contribution, the new contribution in your paper, right? The selectivity-space sampling algorithm. But there are a few others you use in your paper as well, but I don't
Starting point is 00:24:35 think you mentioned them; they're more for comparison, right? These heuristic-based approaches. Yeah, yeah, exactly. We'll mention those a little bit when we talk about results. And when we're sending tuples through these different join orders, we're calculating some score, this path resistance metric, which we're then using to influence our routing decisions. And there are two approaches you mentioned there for how we actually go about routing the tuples.
Starting point is 00:25:02 So hopefully I've given a very coarse overview of how this works. But given that you mentioned vectorized execution there, I was going to ask: how would this work if you didn't have a vectorized database engine, if you didn't have DuckDB? You said that you think the other approach would work reasonably well for, I don't know, say, Postgres, for example. Yes. So I think that you might have to configure Polar a little differently.
Starting point is 00:25:34 For example, you might have to choose a different routing strategy, and on top of that, probably the regret budget might be changed as well. But apart from that, if you switched to a tuple-at-a-time database system, I think it would actually even get easier, because then you don't have this whole problem of not being able to compare two different join orders by their processing time. You can't really do that in a vectorized execution engine, because it just takes much longer, in a relative sense, to process a small number of tuples than a large number of tuples. So yeah,
Starting point is 00:26:17 I think it would be perfectly suitable for that. But yeah, I actively chose to do the prototype in DuckDB because I thought, okay, that's what people want to use for analytics, and that's why we should also look at these more complicated things that are necessary to actually integrate it. Cool. So yeah, let's talk about the DuckDB implementation a little bit more. What were the primary features that you relied on from DuckDB to actually make Polar a reality?
Starting point is 00:26:49 Yeah. So, I mean, as I just said, the vectorized execution model heavily influenced the design of this prototype, and the configuration as well, because it ultimately introduces these new requirements of: do not switch join orders all the time, try to stick with one join order as long as possible, and so on. And also, as I said, it has this limitation that leads us to only look at intermediate results.
Starting point is 00:27:18 And another feature that we make use of is DuckDB's morsel-driven parallelism, in that each pipeline just has multiple instances, and each instance pulls chunks of data independently from some source, processes them, and pushes them into some sink, and there's no need to coordinate between the threads or wait for them or shuffle data. We had to adopt that, of course, in order to make Polar happen in DuckDB. And for us, that meant that each pipeline instance has a thread-local multiplexer state. So each multiplexer actually makes independent routing decisions; it does not look at what the other threads are doing and what their path resistances are, for example. Alternatively, we could have created a global multiplexer state that takes all the local resistances into account,
Starting point is 00:28:16 but we chose to keep it local to avoid all this coordination overhead and just to keep it as simple as possible. And regarding DuckDB's extension framework, Polar can actually be implemented very simply as a combination of an optimizer extension and an extension operator. The optimizer can be extended with new rules that basically have a query plan as input and a query plan as output, and this new rule would be executed after the internal optimizer rules. That's exactly where this Polar pipeline compilation would happen. There we do the join order selection and replace the joins with a dedicated Polar operator. So everything that happens in a Polar pipeline happens in this operator, and internally it would then execute the multiplexer and the join operators. So it has a bit of pipeline logic in it and does the path resistance measuring.
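In pseudocode terms, the rule looks something like the following Python sketch. The actual prototype is C++ inside DuckDB, and the class and field names here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Join:
    build_table: str        # hash table this join probes into

@dataclass
class Pipeline:
    source: str             # table/operator the pipeline pulls tuples from
    joins: list             # left-deep chain of Join operators

@dataclass
class PolarPipeline:
    source: str             # the multiplexer reads batches from here and
    join_orders: list       # routes them to one of these alternative orders;
                            # an adaptive union restores the column order

def polar_optimizer_rule(pipelines, generate_orders):
    """Plan in, plan out: runs after the host optimizer's own rules."""
    rewritten = []
    for p in pipelines:
        if len(p.joins) < 2:     # nothing to reorder, keep the pipeline as-is
            rewritten.append(p)
        else:                    # selectivity sampling + DP happen in here
            rewritten.append(PolarPipeline(p.source, generate_orders(p.joins)))
    return rewritten
```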
Starting point is 00:29:00 Yeah. How long did it take you to actually implement this? Like, how big was the task? How easy did you find it as well? Well, it actually took me quite a long time, but I think that is also because I needed to learn DuckDB first. From the first time that this idea of Polar came up until we submitted the paper was, I think, one and a half years or so, or a little more.
Starting point is 00:29:49 And I think one hurdle for me as well was that back then there was no extension marketplace and no real documentation for the extensibility features. So I sort of improvised a lot. Actually, the prototype that is used in the paper is a very invasive one, because I hard-coded all these elements into the pipeline executor, and only now are we working on transforming that into a DuckDB extension, which we want to publish at a later point. Awesome. Yeah. And I guess it'd be cool to talk about how much performance gain we actually get from using Polar, so when this extension module does arrive, people can start getting excited, right? So yeah, tell us how you went about evaluating Polar and
Starting point is 00:30:41 yeah, how much better is it over, I don't know, some of the current state-of-the-art AQP approaches and, yeah, the current way things work in DuckDB? Yeah, sure. So before we did any performance testing, we actually did an applicability study, because this pipeline focus gives us the low overhead risk that we wanted, but comes with a trade-off in applicability. Because if your optimizer just gives you query plans that are right-deep trees or very bushy, that means that you will have no pipelines, or very few pipelines, that have multiple joins. So
Starting point is 00:31:22 Polar cannot do anything. And indeed, what we saw is that for TPC-H, for example, and therefore also JCC-H, which is a skewed version of TPC-H, DuckDB does not produce any pipelines amenable to Polar, so we needed to exclude that benchmark. But we also evaluated the Join Order Benchmark, which also gives us fairly bushy trees, but a couple of short pipelines, and we spend at least a third or so of the processing time in pipelines with more than one join.
Starting point is 00:31:57 So these are the ones where we can potentially change something. And then we looked at the Star Schema Benchmark, which is an adapted version of TPC-H that changes it into a star schema and transforms the lineitem and orders tables into one large lineorder fact table. And DuckDB actually produces left-deep plans here. So that means that we have a lot of opportunity for impact, but we also have this very well-behaved synthetic benchmark data that we talked about before. And that means that the cardinalities are
Starting point is 00:32:30 very easy to estimate. So there's very likely no need to ever reorder those joins. And that's finally why we introduced this benchmark SSB-Skew, which adds skew and correlation to the Star Schema Benchmark data and tries to make it more realistic. With that, we have a benchmark with high applicability, because we still have these left-deep trees, and the potential for join order improvement, because the cardinality estimates actually tend to be a bit more challenging. And yeah, what we did then is we looked at the total execution times for each of those benchmarks and compared that to vanilla DuckDB and a couple of other systems. So total execution time is all the query times summed. And we saw that, for example,
Starting point is 00:33:18 for the Join Order Benchmark, we have a mild improvement: with one thread, instead of 134 seconds, only 115 seconds. Then the Star Schema Benchmark is about the same, as we would have expected, because DuckDB's cardinality estimation does a pretty good job here already, but at least we don't add any overhead to that. And then on SSB-Skew, we saw around a 2x improvement compared to vanilla DuckDB in total execution time. But we also compared that with a state-of-the-art adaptive query processing approach, which is called SkinnerMT.
Starting point is 00:33:57 And SkinnerMT is a really cool work that builds hash tables for each input table and then uses reinforcement learning for each individual query to evolve from some random left-deep join tree towards an optimal left-deep join tree. And with eight threads on the Join Order Benchmark, SkinnerMT actually has 25% less execution time than Polar and also performed best in our comparison. But when we look at the Star Schema Benchmark and SSB-Skew at scale factor 10 or so, then SkinnerMT suddenly is four times slower. And what we see there is that this large applicability and high impact really can get you
Starting point is 00:34:38 to unacceptable overheads for some types of workloads. And what was even more interesting then was to drill down into the execution times of each individual query in those benchmarks. So for the Join Order Benchmark, for example, we saw that the execution time does not change at all for the vast majority of queries. So no improvement, but at least also no overhead. But there are a few individual queries that show very notable improvements, where one query is around 9x faster than without Polar. Then we looked at the Star Schema Benchmark, where most queries get a bit slower because the join orders are
Starting point is 00:35:16 good already, and any additional work that we do just deteriorates performance. So that's a tough place to be in. But the worst query that we measured took around 7% longer than without Polar, and the majority of queries around 1-2% longer. So the overhead in total is sort of mild. And then finally we got to SSB-Skew, of course, and there we have this applicability-meets-challenging-cardinality-estimates situation. So we see notable improvements for most queries, with some of the queries up to 4x faster. And the conclusion of that, for me personally, would be that Polar really works best for these typical data warehouse workloads where you have tables arranged in a star schema and, yeah, sort of realistic data with skew and correlations.
Starting point is 00:36:09 Yeah, nice. I guess, what you were saying there at the start, about when you did this applicability study: some of these really popular benchmarks didn't meet the criteria for Polar to be able to exploit them. Do you have a general intuition, if you imagine all of the workloads in the world, for how many of them Polar would be able to demonstrate performance improvements?
Starting point is 00:36:34 Like, how common is having a left-deep tree in a query plan, do you reckon? I don't know, what's your general intuition? Although maybe it's more a function of how many of these types of star schema workloads are out there; that's probably the actual question. Well, it's very hard for me to put a number on that; for that, it would actually be very interesting to work in the industry for a bit. But I do assume that quite a lot of these data warehouse workloads basically use the star schema, because it's such a well-known optimization. And that's really
Starting point is 00:37:13 where I would expect Polar to shine. But yeah, that's why I can only encourage everyone in the industry to prototype Polar for their systems, so that they can check whether it makes any difference at all. Because I was going to say, obviously you mentioned there that there is a plan to make this into an extension module. And I don't know if there is a way for you to anonymously collect statistics from people who have installed it and see how frequently that actually is happening or isn't happening.
Starting point is 00:37:43 I mean, there probably is some bias there, because if someone is installing it, they maybe think they have those types of query plans anyway. So that might be a bit of a confirmation bias sort of thing. But yeah, it would be interesting to see if you could collect statistics on how frequently this is being used. That's actually a cool idea. But I might have to check if I'm able to get on the DuckDB extension marketplace if I trick people into reporting data to me, but, I don't know. Yeah.
Starting point is 00:38:09 Um, make it opt-out rather than opt-in. I don't know. Yeah. That's a good idea. Cool. So yeah, I guess, where do you go next with Polar? Obviously there is this move to make it into an extension module so people can start using it in a production
Starting point is 00:38:25 environment and seeing if it helps them. Are there any other plans or directions you wish to take it in? Well, I mean, there's a couple of future work items that might happen and that are definitely interesting directions. I think for me, the main thing would be to investigate a more holistic performance measuring approach, because we have this path resistance right now, which only looks at the number of intermediate results, but really there are many more important factors that influence the performance, such as the size of the build sides.
Starting point is 00:39:03 Do they fit in the cache or do they not? How long does it take to probe a tuple? Or the join implementation: is it a hash join or an index nested loop join, and so on and so on. So you could either build sort of a cost model for that, or try again to just measure the processing time, which we did in the very beginning but stopped doing because of the high variance; you don't have full control over the background processes on your system, so you can have your path resistance lashing out for no reason. And it's also, as I said, very difficult to compare different join orders when they have different numbers of input tuples, because with vectorized execution, processing a few tuples is much slower than
Starting point is 00:39:48 processing many. And then you have this buffering if one join does not produce enough output tuples. There's just a lot of stuff that plays into that. But it would be very interesting if someone, if not myself, would look into that. Cool. So let's talk about the DuckDB experience then. You mentioned that originally you would probably have gone for Hyrise, because it was a system you were more familiar
Starting point is 00:40:11 with, but that was not an option. So reflecting on your experience of using DuckDB and the DuckDB extension module, what was your general overall take on that experience? Well, I think one cool thing is that while developing this prototype I actually learned a lot about databases, because I just debugged DuckDB all the time to really understand how it works, and it's really very state of the art. So if you want to see how an implementation of a modern dynamic programming approach for join ordering works, you can just go and check, because it's right there, or yeah, all the other cool stuff. So that I think is very
Starting point is 00:40:58 rewarding also. And working in a system that so many people are excited about is also, I think, really great because it opens opportunities. Of course, you get invited to a podcast, for example, and people tend to care more about your research. So I can only encourage other people to do what I did. Nice. Yeah. I was gonna say, what would be the advice you'd give to the would-
Starting point is 00:41:26 be developers? I guess that'd be just to go out there and do it and get involved. Absolutely. And also the DuckDB people are super great. Like, they're all very nice. I met a couple of them and always had a great time. And yeah, they were always very happy to give me some feedback or tell me, here, this is the extension operator, did you try this, did you try that? So that was great. Nice. Yeah. Let's talk about some impact then. So has your work on Polar had any impact back on the core product? Also, you've had conversations with people who are actually engineers at DuckDB; has there been any influence, have they thought about incorporating this into the core product in any way?
Starting point is 00:42:08 Well, I mean, we're still at this point where we don't have the actual extension, but instead only have this prototypical implementation. And I had this talk once with Hannes Mühleisen, the DuckDB Labs CEO, and I think a community extension really would be the way to go here. But of course, Polar in itself is still very researchy, and there are probably a lot of corner cases that you should figure out before using it in your million-dollar business. But at VLDB, I actually discussed it with a lot of people, and many showed very great interest, and I partly attribute that to the fact that this idea and the paper are fairly simple and
Starting point is 00:42:51 easy to understand. So really, many people could easily grasp how this works and what this paper is about. And I also talked with people from industry and found that most rewarding, because actually one engineer told me that this should be fairly simple to implement in their database system, and they might want to try to create a rough prototype and evaluate its potential. And yeah, since this non-invasiveness for adaptive query processing was the key goal of my paper, that was obviously very great feedback for me to hear. Yeah, for sure.
Starting point is 00:43:25 I mean, I really like the non-invasive design goal as well, because I think that's just going to help increase adoption, right? If you have to go in there and wire these things together and create this big sort of interconnected blob, it just gets complex, right? If you can keep it simple and non-invasive, then it becomes easy to prototype as well, right? And to really demonstrate the benefits of such an approach. So yeah, I think that is a good design goal to have, sure, if you want to make your work impactful. Awesome. So yeah,
Starting point is 00:43:54 I guess the next question I want to ask is about surprises. Whilst working on Polar and with DuckDB, what's been the most interesting lesson that you've learned during that process? Well, I think when it comes to DuckDB, I actually saw that many approaches in this whole problem field of join ordering and adaptive query processing for analytics are actually implemented in Postgres. And that has good sides to it, because these approaches can be easily compared; they have the same baseline performance of Postgres.
Starting point is 00:44:32 But on the other side, in practice, if you start a new endeavor, you don't necessarily want to use Postgres for analytics if you can get much better performance with actual analytics systems like DuckDB. And the problem this brings is that some approaches that work well in Postgres might actually not work that well in systems like that, because you have to look at additional things like vectorized execution. I know it's a meme by now, but it's really crucial to think about things like that for good performance, because in the end, this can be make or break. And therefore, I think really one of my key learnings
Starting point is 00:45:09 was: if you want to implement something in your analytics research, then you might want to prototype in a system that people actually want to use for the use case, because it makes you consider these additional aspects of modern systems. And that, in my opinion, can substantially decrease the gap between your research and the real world. Yeah, for sure. Completely agree with you on that. So we're getting towards the end of the show now. And before we go on to the last word, I'm not sure if I sent you this, so apologies if you've not got an answer primed for this, but I wanted to have a section where you recommend, other
Starting point is 00:45:49 than obviously your own research, and going to play with Polar, other plugins, extensions and features of DuckDB that you would recommend for the listener to go and try out. From your time working with DuckDB, what's your favourite plugin or extension, if you've got one? Well, I'm not sure if I've got a favorite plugin, but if I had to choose one, it would probably be the Parquet one, which is very basic. But I definitely have a favorite feature, and that might be the out-of-core execution for aggregations and joins. So, Laurens Kuiper has this ICDE paper on this, I think from this year.
Starting point is 00:46:38 And the essential thing is that when hash tables become too large for your available memory, then instead of switching to a disk-based algorithm, their operators have this clever design that keeps as much of the data as possible in memory and only spills the remaining stuff to disk. The effect of that is that you have very, very graceful performance degradation with growing hash tables, instead of a performance cliff where you would suffer from the switch of: okay, now we don't do this in-memory algorithm anymore, but switch to the disk-based one and maybe even start from the beginning. And I even benchmarked this feature a bit for a different project, and I'm actually amazed how well that works.
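The rough idea can be sketched like this in Python; this is a simplification for illustration, not DuckDB's actual data structures: partition the hash table, and when memory runs out, spill only the largest partitions while the rest stay in memory.

```python
import pickle, tempfile

class SpillableHashTable:
    """Sketch of graceful degradation: spill one partition at a time
    instead of switching wholesale to a disk-based algorithm."""

    def __init__(self, num_partitions=16, max_rows_in_memory=1_000_000):
        self.parts = [dict() for _ in range(num_partitions)]
        self.spill_files = [None] * num_partitions
        self.max_rows_in_memory = max_rows_in_memory
        self.rows_in_memory = 0

    def insert(self, key, value):
        p = hash(key) % len(self.parts)
        if self.spill_files[p] is not None:
            pickle.dump((key, value), self.spill_files[p])  # straight to disk
            return
        self.parts[p][key] = value
        self.rows_in_memory += 1
        if self.rows_in_memory > self.max_rows_in_memory:
            self._spill_largest_partition()

    def _spill_largest_partition(self):
        p = max(range(len(self.parts)), key=lambda i: len(self.parts[i]))
        f = tempfile.TemporaryFile()
        for item in self.parts[p].items():
            pickle.dump(item, f)
        self.rows_in_memory -= len(self.parts[p])
        self.parts[p].clear()        # lookups on this partition now go to disk
        self.spill_files[p] = f
```

The more partitions get spilled, the slower things get, but only gradually, which is the graceful degradation being described.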
Starting point is 00:47:11 So I find that it's a very cool contribution to making DuckDB's query execution more robust. Nice. Yeah, Laurens has actually been on the podcast. I don't think it was this paper he mentioned; he had one called These Rows Are Made for Sorting, I think, which was the year before, maybe. So yeah, I can't remember, what was the title of this paper? Oh, I would have to look that up, but it's probably something with robust and aggregations.
Starting point is 00:47:33 Yeah, probably. That'd be a good place to start. Some good search terms there to stick in. Cool. So yeah, it's time for the last word now. What's the one thing, David, that you want the listener to get from this podcast episode today?
Starting point is 00:47:47 Well, as a takeaway, I think I would like to share my favorite learning from this paper, which is not really technical. So Polar was actually a huge collaboration endeavor. You can see that on the paper there are many, many names from academia, but also from industry: from Snowflake, Google, InfluxDB, SAP. And the reason for that is that this paper originates from a seminar on robust query processing, which is organized, I think, every two years or so by Goetz Graefe. And that happens at Schloss Dagstuhl, which is like a very remote place in the German outback.
Starting point is 00:48:25 But they have these amazing international guests from academia and industry. And I saw that this was happening during my master's, while I was working on my master's thesis in the robust query processing field. I saw the event page and I thought, wow, okay, I've got to go there. But the guests are usually invited by the organizers, so I thought, okay, I'll just try and send them an application. And surprisingly, they accepted me. So in the first week of my PhD, I attended this seminar with these database legends. And this just developed into this amazing paper collaboration, which was a huge kickstart for my PhD. And it did not stop there.
Starting point is 00:49:06 And like during the process, I got the chance to discuss my work with many people and their feedback really helped me a lot to bring this paper to life. So if I really learned one thing from this project, then it's to reach out to people, discuss your ideas with others. And the people in the database community are usually super nice and it's very beneficial to exchange your ideas. Oh, what a fantastic story that is there. What a really nice message to end on.
Starting point is 00:49:34 I mean, that's lovely. I've got to get to a Dagstuhl as well. I've never been to one, but what was the experience like? Kind of, I guess, life-changing almost? I mean, for me personally, it seemed to be, right? It's really interesting, because I didn't know any of those guys and hadn't been before. And yeah, it's sort of a very intimate setting, I would say. So I played pool against Thomas Neumann and Peter Boncz and actually won. I told my family
Starting point is 00:50:03 about it. I'm so proud of it. So you just get together and discuss ideas for like five days or so, and yeah, it's a great experience. So if you get invited, definitely do that. If you don't, maybe it's worth applying. Yeah, just send an application, right? What's the worst that can happen? Cool. Yeah. So let's wrap things up there then, David. Before we go, where can we find you on social media? Anything on X, LinkedIn? How is it best for people to contact you?
Starting point is 00:50:31 Well, I think the best idea is probably to go to my website, which is d-justen.github.io, because there you have it all. You can find me on LinkedIn from there, and you can find additional resources for the paper, like the slides and the poster and stuff like that. So that, yeah, I think is probably the best starting point. But I'm also on all the usual socials, of course. So if you look for my name, you'll find me. Awesome stuff. And we'll wrap things up there then. Thank you so much, David. It's been
Starting point is 00:51:02 a great chat. I'm sure the listener will have really enjoyed it and have learned a lot from it as well. So yeah, thank you for coming on the show. Well, thank you.
