Disseminate: The Computer Science Research Podcast - David Justen | POLAR: Adaptive and non-invasive join order selection via plans of least resistance
Episode Date: April 3, 2025

In this episode, we sit down with David Justen to discuss his work on POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance, which was implemented in DuckDB. David shares his journey in the database space, insights into performance optimization, and the challenges of working with modern analytical workloads. We dive into the intricacies of query compilation, vectorized execution, and how DuckDB is shaping the future of in-memory databases. Tune in for a deep dive into database internals, industry trends, and what's next for high-performance data processing!

Links: VLDB 2024 Paper, David's Homepage
Transcript
Disseminate the computer science research podcast, the DuckDB in Research series.
Hello and welcome to our ongoing DuckDB in Research podcast series.
So let's start off by saying, kind of for the listener who doesn't know, what DuckDB actually is.
So DuckDB is an open source in-process SQL database that is designed for fast and efficient analytical query processing.
Really simple to use, deploys with zero external dependencies,
it's extensible,
integrates really easily with all your data science tools,
having APIs for things like Python and R,
and you can directly read from file formats
like Parquet and CSV.
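To make that concrete, here is a minimal Python sketch of the kind of thing being described; the file name data.parquet and the column names are just placeholders for illustration.

```python
import duckdb  # in-process: pip install duckdb, no server to set up

# Query a Parquet file directly; DuckDB scans it without an import step.
# 'data.parquet' and the column names are hypothetical placeholders.
result = duckdb.sql("""
    SELECT category, COUNT(*) AS n
    FROM 'data.parquet'
    GROUP BY category
    ORDER BY n DESC
""").df()  # fetch the result as a pandas DataFrame

print(result)
```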
So why this DuckDB in research series, you may ask.
Well, Disseminate as a podcast is all about impact
and helping bridge the gap between research
and industry, which I'm sure many of you
have heard me say before.
And DuckDB is the perfect example of this,
of these two things, of these two communities
working together in that the ideas from decades of research
are at its core and it's now influencing the research
community as a platform for others to build on.
And that leads us nicely onto today's show
and today's guest.
So I'd like to welcome David Justen to the show.
Yeah, very happy to be here.
Thanks for inviting me.
Cool.
So David, your research focuses on query optimization
for analytical databases,
and you implemented a lot of your ideas in DuckDB, right?
So this is why we're speaking to you specifically.
We're going to be chatting about some research you published, VLDB this year.
And the title of your paper was POLAR: Adaptive and Non-invasive Join Order
Selection via Plans of Least Resistance, which I'm sure a lot of that will make a
lot more sense by the end of the podcast.
But yeah, before we do get into that and talk about your experiences with DuckDB
and everything as well, tell us more about yourself and your story and how you became
interested in databases and research.
Yeah, of course.
So I actually started out in my bachelor's in media informatics.
So that's basically a study which is sort of applied computer science with a bit of
audio and video
and quite a lot of web development. And there we had this very superficial database lecture
where we had a bit of SQL and normalization, denormalization. And I thought, oh my God,
that's extremely boring. I don't want to do that ever again. But then I switched to the
Hasso Plattner Institute for my master's. And the Hasso Plattner Institute
is actually a privately funded university faculty founded and financed by Hasso Plattner,
who is one of the co-founders of SAP. And they had this seminar in the first semester that is
called Develop Your Own Database at Hasso Plattner's chair. And even though I thought,
okay, databases are not really my thing, I thought, okay, building stuff is kind of
fun. So let's try that. So we found ourselves in groups and built features
for Hyrise, which is the research in-memory database at the Hasso Plattner
Institute. And it turned out to be very fun, actually. And I think we also did
not too bad because then after that our whole group
actually got hired as student research assistants. And that was really sort of the beginning where I
started taking more and more database seminars at the Plattner chair and also wrote my thesis
in robust query processing. And yeah, that made me organically grow into that chair, I guess,
because I wanted to continue working on these systems that they have on that chair. But
I also wanted to learn all these cool skills that come with being a PhD student, like speaking
in front of crowds, supervising other people, organising and motivating yourself
and also having all this responsibility for your own work.
And that's really what, what got me excited.
And then after one year, I actually switched to Berlin because Hasso Plattner retired and
the chair got closed down, but that's where I am now.
And yeah, I'm very happy there.
Awesome stuff.
Yeah.
It's always interesting that a lot of the intro to database lectures do focus on those
sort of things, a little bit of normalization and whatnot and a brief little sprinkle of SQL.
It's never presented in the most interesting way.
I feel like if they started with the cool stuff, maybe a lot more people would be
into databases. But yeah, my first encounter with databases was also one of those lectures
I just found boring and wasn't much interested in. Yeah, absolutely.
But yeah, look where we are now, right? Cool.
And so we're going to be talking about things like query optimization, join ordering, and
this concept of adaptive query processing as well.
So let's kick things off with a little primer for the listener.
And before we dig into your research, give us an idea of what these
things are and what they mean.
Yeah, sure. So traditionally database systems use what is called static query optimization.
So after a database system translates a user's SQL query, for example, into an executable query plan,
that plan must be optimized to get the best possible performance.
And then it's being executed.
An integral part of this query optimization step is to reorder join operations. So the goal there is to find a tree of joins that has the lowest cost or the lowest number of intermediate
results. And a common approach to do that is to just do divide and conquer. So you start with small subtrees of your query plan and estimate the cardinality.
Then you use some sort of dynamic programming algorithm to combine these subplans with the
lowest cost into your final join order.
And the quality of this result then heavily impacts the execution performance.
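As a rough illustration of that divide-and-conquer idea, here is a toy dynamic-programming enumeration in Python. It is only a sketch of the textbook algorithm, not DuckDB's optimizer; the cost here is simply the number of intermediate results, and join_card stands in for the cardinality estimator whose errors are discussed next.

```python
from itertools import combinations

def dp_join_order(relations, base_card, join_card):
    """Toy dynamic-programming join enumeration (a sketch, not DuckDB's code).

    relations: list of relation names
    base_card: dict relation name -> estimated base-table cardinality
    join_card: function(left_set, right_set, left_card, right_card)
               -> estimated output cardinality (the error-prone part)
    Returns (estimated_cost, estimated_cardinality, join_tree).
    """
    best = {frozenset([r]): (0, base_card[r], r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for k in range(1, size // 2 + 1):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    lc, lcard, ltree = best[left]
                    rc, rcard, rtree = best[right]
                    out = join_card(left, right, lcard, rcard)
                    cost = lc + rc + out  # cost = intermediates produced so far
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, out, (ltree, rtree))
    return best[frozenset(relations)]
```

If the estimates fed into join_card are wrong by orders of magnitude, this enumeration confidently returns the wrong tree, which is exactly the failure mode described below.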
And the problem with that is while this cardinality estimation is sort of trivial and works well
for well-behaved synthetic data or a lot of benchmarks as well, it is extremely complex
for realistic data where you have skew and correlations and dependencies.
And even state-of-the-art systems now show cardinality estimation errors in the range
of multiple
orders of magnitude in real-world workloads. And the result of that, of course, are suboptimal
join orders and bad performance. So the query takes much longer to execute than it should.
And yeah, there have been a lot of improvements in the static query optimization, in new sketches,
and sampling methods, statistics.
This has been a very, very heavily researched area, but it's still an open problem today.
On top of that, there is another problem: these optimizers tend to be sort of flaky. You can run into a so-called performance cliff. The query optimizer's behavior changes because of one little change
in your workload, and then your performance might deteriorate drastically.
So let's say for example you have some join query and then somebody adds one row to a table
that is part of that query, the statistics change and then you hit that join query again and the
query optimizer would pick a different join order that might perform much worse than before.
And as an end result, a user would do the same operation twice and suddenly it takes
much longer and that can be super frustrating, right?
So therefore, research has come up with this notion of robust performance or robust query
processing and the idea is that instead of focusing on the peak performance for most of the queries and
just accepting the horrible performance that you might get in the worst case, you go for
a good or acceptable performance in the average case and from then on, degrade gracefully
performance-wise when conditions get more difficult.
One measure to improve this robustness for join ordering is called adaptive query
processing.
And what you do there is instead of trusting the estimates that you get during your static
query optimization step, you start executing some plan and then measure the actual cardinality.
And you can use those cardinalities to actually change the plan
and then execute that. And that sort of creates this feedback loop where the plans get better
and better over time and you end up with a plan that might be much better than the one you started
with. And yeah, that's where we arrive at Polar, which is a new approach that is within that
adaptive query processing field of improving
join orders on the go and to make up for bad cardinality estimates.
Yeah, it's really interesting what you're saying there about as a user, you would much
prefer to have this sort of maybe slightly less overall peak performance, but that more
robust experience, the reliability there is more important than that being a
little bit faster.
So I think it's a really interesting point.
I know as an end user, you'd rather have that.
And also as well, I was thinking when you were speaking at the very start of that about
everyone's got a plan with regards to the synthetic data.
And as soon as it meets the real world, it just goes out the window.
Everyone's got a plan until, I'm sorry, most plans don't survive the contact with reality, right?
I think it's some sort of war general said that, right?
You can make all the best plans you want in the world.
Soon as you put it out there, like, yeah.
And what was it, Mike Tyson? It's like, everyone's got a plan
until they get punched in the face.
But I mean, that's a different quote for a different context.
But yeah, anyway, cool.
So yeah, you teased Polar there.
So yeah, what's the TLDR on Polar there?
How does it fit into this space of adaptive query processing and ensuring that we have robust query processing as well?
Yeah. So as I mentioned before, we have had these two decades of AQP research.
So the question is, why would we even come up with a new approach? And the thing
is that even though there were some promising results in previous papers that are quite
old already, this whole topic of adaptive join reordering is still nowhere to be found
in the database systems in practice that people actually use. And what we did then is we looked
at the past approaches and we found two major
flaws that really hinder adoption. And one of those is an invasive design. So one inherent
flaw is that most of these approaches intertwine the optimization phase and the execution phase.
So for example, run the query optimizer during execution again and again, and that
breaks with one of the fundamental paradigms of database system design. And that makes
it, of course, a bit difficult to integrate such an approach into your system's code. And on top of that,
you often have custom join operations that are required so you can't use your original join code. And yeah, in general,
just many code changes that are necessary to adopt your approach. And the second dimension that we
saw is the performance overhead. I mean, we said that it's acceptable that some queries get a
little bit worse, but what we see is that we have often these great improvements for worst case queries, but still very unacceptable overhead for the simple queries
where just using normal database system optimizer would have done fine. And that's where we come in
with Polar because we designed this system specifically for non-invasive integration.
So we want to keep the phases
of optimization and execution completely separate. We don't want any optimizer calls during the
execution, don't want to require any custom join implementations, and also make use of
the host system's optimizer because it can be right, actually. And secondly, we focused
on low overhead, so we never want to waste any work or throw away intermediate results, and we always want to allow falling back on the original plan that the query
optimizer gave us. And we want to be able to do that without additional cost.
And that way we think that Polar shrinks the gap between adaptive query processing research and actual adoption in real database systems.
Nice. Yeah. So these two things here, the non-invasive aspects of it and
the performance, not making the performance terrible for the simple things.
And maybe I'm getting ahead of it because we're going to talk about how things got integrated
with DuckDB when we get probably a little bit later on in the show.
But I kind of wanted to ask now kind of, did the way that DuckDB is set up sort of align
nicely with this non-invasive design goal of yours as to why you chose that as the candidate
system?
Well, on one hand it does because it has these nice extensibility features where
you can just add optimizer rules and custom operators.
And now they also have, I think, fairly good documentation on that.
But originally I chose DuckDB because I could not choose Hyrise actually.
So that would have been of course my favorite choice because that's what I have worked in
before, but Hyrise has this operator-at-a-time execution model and these adaptive query
processing approaches usually have to do some sort of tuple routing.
So you route some tuples through one join order and some through the others. But you can't of course do that if you have this operator-at-a-time model
where an operator just processes all the tuples and then, well, that one is done, right? So
I heard about DuckDB and people said it's great. And it has this, yeah,
this batch-at-a-time execution engine. Okay, sure, I'm going to try it.
Cool.
Well, we're going to get into that in some more detail now. And so I kind of jumped that, jumped the gun a little bit there, but yeah.
So tell us about this concept of a Polar pipeline, right?
So can you tell us what a Polar pipeline is and yeah, how that kind of works, how
a tuple will go through that pipeline, what that looks like, and the different
stages within that.
Yeah, sure.
I'll try to explain it from the very beginning to the end. So with this non-invasive
integration and this low overhead, we have pretty tight constraints, actually, especially
because we decided that we want to be able to jump back to the original query plan without additional cost. That meant for us that we cannot focus on an entire query plan, but instead focus
on join pipelines.
So we expect the query optimizer to give us a pipelined plan.
And each join pipeline is basically just a sequence of join operators that do not materialize
completely the intermediate
results in between.
And each pipeline then basically works like a left deep join tree where the tuples are
pulled from some source table or from a source operator and then probed into multiple build
tables.
And the benefit of that approach is that after each tuple completes that pipeline, we can
very trivially change the join order without doing any additional work because the result
is always the same.
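To picture why that reordering is cheap, here is a small Python sketch of such a left-deep probe pipeline; the class, its fields, and the row layout are made up for illustration. Every surviving row has to match in all build tables, so any permutation of the probes yields the same rows, only with columns in a different order, which the adaptive union later undoes.

```python
class BuildTable:
    """A hypothetical pre-built hash table for one join in the pipeline."""
    def __init__(self, build_rows, build_key_idx, probe_key_idx):
        # probe_key_idx references a column of the source (e.g. fact) table,
        # which stays at the front of the growing row, so the probes commute.
        self.probe_key_idx = probe_key_idx
        self.table = {}
        for row in build_rows:
            self.table.setdefault(row[build_key_idx], []).append(row)

    def probe(self, row):
        return self.table.get(row[self.probe_key_idx], [])


def run_pipeline(source_batches, build_tables):
    """Pull a batch from the source, probe it through each build table in turn.
    Reordering build_tables between batches changes how much intermediate work
    is done, but not which rows come out the other end."""
    for batch in source_batches:
        rows = batch
        for bt in build_tables:             # the current join order
            rows = [r + m for r in rows for m in bt.probe(r)]
        yield rows
```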
So how do we do that?
As I said, we assumed a pipeline plan that we get as output from the optimizer and then
implement an additional optimizer rule.
So we want to look at every
pipeline that contains more than one join, because those are the pipelines where we can
potentially reorder joins. So what we do then is to generate additional join orders and
insert them into the pipeline. How we do this generation, I will say in a minute. And to
connect these join orders, then we also insert a multiplexer
operator before those join orders.
So that one is then connected with each of the join orders and can send tuples to any
of them.
And for each batch of input tuples, it basically just decides which join order will be fed
with it.
And after the join orders, we insert an adaptive union, and that one just reorders the columns
to match the column order from the original join order to just unify the pipeline output
and to not create some unexpected behavior.
And with this design, we have all the plan generation components during the optimization phase, and in execution
we can solely focus on the tuple routing through these previously defined join orders without
rerunning the optimizer and generating new ones.
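Roughly, the two inserted operators could be sketched like this in Python; the class and method names are hypothetical, not DuckDB's API. The multiplexer picks one of the pre-generated join orders per input batch based on its current routing weights, and the adaptive union permutes each alternative's output columns back to the original column order.

```python
import random

class Multiplexer:
    """Sketch: per input batch, pick one of the pre-generated join orders."""
    def __init__(self, num_join_orders, seed=0):
        self.weights = [1.0 / num_join_orders] * num_join_orders
        self.rng = random.Random(seed)

    def route(self, batch):
        # Weights come from the measured path resistances (see the regret
        # sketch further down); they start uniform.
        return self.rng.choices(range(len(self.weights)), weights=self.weights)[0]

    def update_weights(self, new_weights):
        # In POLAR the reported resistances feed the regret-bound weight
        # calculation; this sketch just installs the recomputed weights.
        self.weights = new_weights


class AdaptiveUnion:
    """Sketch: restore the column order of the original join order, so the
    rest of the plan never notices which alternative a batch went through."""
    def __init__(self, column_permutations):
        self.perms = column_permutations    # one permutation per join order

    def unify(self, idx, rows):
        perm = self.perms[idx]
        return [[row[i] for i in perm] for row in rows]
```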
So I mentioned the join order selection already, which is the way we generate the additional join orders for a Polar Pipeline.
And the goal here for us is to cover a diverse set of join orders.
And diverse here means that it's supposed to work well for different join selectivities
that we might see during the query. So for that, we consider the selectivities that we might see in our join pipeline as
a d-dimensional space, and each selectivity in that space is one dimension.
And if we now discretize that space with an exponential decay, we have a grid that covers extreme values.
And we do that because it is much more common to have very low selectivities in the real
world workloads than, let's say, a uniform number distribution.
And yeah, so we have that grid.
And that enables us to just point into it and to take a number of samples from there.
And each sample is just a set of assumed selectivities. And we take that and invoke
some dynamic programming algorithm to find the optimal join order in terms of minimal
intermediate results for that. We just do that a couple of times to enhance the polar pipeline
with additional join orders. And the beautiful thing about this approach is that if there is a join order that is optimal
for a large space of selectivities, then it's much more likely for this join order to be
selected than another one that only works for a very narrow set of selectivities.
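Here is a rough Python sketch of that selection step, assuming a helper optimal_order_for(selectivities) that stands in for the host optimizer's dynamic-programming call; the decay factor and sample count are made-up defaults, not the paper's values.

```python
import itertools
import random

def selectivity_grid(num_joins, points_per_dim=5, decay=0.5):
    """Discretize the d-dimensional selectivity space with exponentially
    decaying grid points, biasing towards the very low selectivities that
    dominate realistic workloads (e.g. 1.0, 0.5, 0.25, 0.125, ...)."""
    axis = [decay ** i for i in range(points_per_dim)]
    return list(itertools.product(axis, repeat=num_joins))

def select_join_orders(num_joins, optimal_order_for, num_samples=16, seed=1):
    """Sample assumed-selectivity vectors from the grid and ask the host
    optimizer for the best order under each assumption. Orders that are
    optimal over a large region of the space are hit (and kept) more often."""
    rng = random.Random(seed)
    grid = selectivity_grid(num_joins)
    samples = rng.sample(grid, min(num_samples, len(grid)))
    orders = []
    for assumed_selectivities in samples:
        order = optimal_order_for(assumed_selectivities)
        if order not in orders:             # keep each distinct order once
            orders.append(order)
    return orders
```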
So we then have this polar pipeline and now it only needs to get executed.
And what happens there is that during the execution, the pipeline just pulls a set of
tuples from its source and then the multiplexer passes that set to one of the join orders.
And during this join processing, we need to do some bookkeeping. And what we do is we measure the so-called path resistance.
And that is the number of intermediate results
that occur per input tuple of the first join.
So that's basically just a relative measure of work
that was done per tuple flowing
through one join order.
And this path resistance then is reported back to the multiplexer and the multiplexer
then just uses this new information to make the next routing decision.
So how does the multiplexer make this routing decision?
Yeah, so to get an initial overview of these resistances, the multiplexer needs to feed
each join order with a few tuples
and then we get reported a resistance for each of the join orders in the pipeline.
And from then on, we can use this notion of a probabilistic regret bound.
So this bound relates to the additional number of intermediate results compared to the best join order that is present
in the pipeline.
And it is no hard guarantee but probabilistic because we look at the maximum number of additional
intermediate results assuming the path resistances that we have measured so far.
So we use that notion because we only know resistances that we have seen before and simply
don't know yet how the data characteristics might change for the remaining tuples.
So to make a routing decision now, we have to give the multiplexer a regret budget.
So how much do we want to overstep this probabilistic optimum? So let's say we give it 10%.
And it also needs to know the measured resistances for each join order. But then we can find a set
of weights for each join order so that the summation of each resistance multiplied with
its weight is lower or equal to the minimal join order resistance that we
have in our set plus our regret budget.
And these weights that we get from that are sort of the relative amounts of tuples that
we would need to send to each join order to satisfy this probabilistic regret budget.
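A small Python sketch of that weight calculation, under the assumption that the exploration weight is split evenly over the non-optimal join orders (one simple way to satisfy the bound; POLAR's exact formulation may differ): find per-join-order weights so that the weighted average resistance stays within the regret budget of the best measured resistance.

```python
def regret_weights(resistances, regret_budget=0.10):
    """Sketch: give a small, evenly split exploration weight to the non-optimal
    join orders, as large as the regret budget allows, and give the rest to the
    join order with the least measured resistance (intermediates per tuple)."""
    k = len(resistances)
    best = min(range(k), key=lambda i: resistances[i])
    r_min = resistances[best]
    slack = r_min * regret_budget             # allowed extra intermediates per tuple
    excess = sum(resistances) - r_min - (k - 1) * r_min
    if excess <= 0:                           # all orders measured (roughly) equal
        return [1.0 / k] * k
    w_explore = min(slack / excess, 1.0 / k)
    weights = [w_explore] * k
    weights[best] = 1.0 - (k - 1) * w_explore
    return weights

# Example: measured resistances of 2, 5 and 10 intermediates per input tuple
# with a 10% budget give roughly [0.96, 0.018, 0.018]: almost all tuples go
# to the best order, the rest keep the other orders' measurements fresh.
```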
And then when we have these weights, we still have to decide how to translate this into
an actual routing decision.
So how many tuples need to go to which path next?
And that's what the different routing strategies are for.
And we have two of those that use this probabilistic regret budget.
So the first one is called adapt tuple count.
And that's, I guess, the straightforward, very obvious method
where you would just set a number of n tuples and then assign the number of tuples that
go to each join order respective to the weights and then just route one tuple set after the
other and then recalculate the weights after that with new resistances and repeat.
And with that, we have this fixed window of n tuples and adapt the input sizes for the
join orders within each of these windows.
Now, there is a second strategy, which is called adapt window size, which does exactly
what it sounds like.
So it's the other way around. So we set a low number of tuples n,
so a number that should be just enough to measure the new resistances for the bad
join orders basically. And we assign n to each of those join orders, except the one with
the least resistance, and then calculate the number of tuples that we need to send to the best join order to stay within our budget.
And why is that useful?
So if our system, like DuckDB, uses vectorized execution,
so we perform operations on multiple values with a single instruction,
then this system strongly prefers processing values in batches, right?
And the performance in turn suffers
if join orders change all the time, because then you're only processing a few tuples at a time.
And the previous strategy may struggle with that if, for example, two join orders perform equally
well, because that limits the number of tuples that go into the best join order. And the adapt window size strategy maximizes that number so that we can route
most tuples to a single join order with the least intermediates.
And that's then more suited for a vectorized execution.
And in our prototype in DuckDB, we actually saw that this is true, but adapt tuple
count might potentially work better in a tuple-at-a-time execution model,
for example, for Postgres.
Yeah.
And that's it.
Then you should have your result.
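Under the same made-up helper names as above, the two routing strategies could be sketched like this; the window sizes, budgets, and the exact bound used for adapt window size are illustrative assumptions, not the paper's values.

```python
import math

def adapt_tuple_count(weights, window_size=2048):
    """Strategy 1 (sketch): fix the window of input tuples and split it across
    the join orders in proportion to the regret-bound weights; after each
    window, resistances are re-measured and the weights recomputed."""
    return [int(round(w * window_size)) for w in weights]

def adapt_window_size(resistances, regret_budget=0.10, explore_tuples=64):
    """Strategy 2 (sketch): send a small fixed batch to every non-optimal join
    order (just enough to refresh its resistance) and grow the batch for the
    best order until the regret budget is met -- friendlier to a vectorized
    engine, which wants long uninterrupted runs on a single path."""
    k = len(resistances)
    best = min(range(k), key=lambda i: resistances[i])
    r_min = resistances[best]
    # Extra intermediates incurred by exploring the slower orders:
    extra = sum(explore_tuples * (resistances[i] - r_min)
                for i in range(k) if i != best)
    if r_min <= 0:
        n_best = explore_tuples              # degenerate case, nothing to amortize
    else:
        # Grow the best order's batch until the exploration overhead fits
        # within regret_budget of the ideal (all-to-best) intermediate count.
        needed = math.ceil(extra / (regret_budget * r_min)) - (k - 1) * explore_tuples
        n_best = max(explore_tuples, needed)
    counts = [explore_tuples] * k
    counts[best] = int(n_best)
    return counts
```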
Nice.
A journey through a Polar pipeline for the listener.
Cool.
Just to recap then.
So we get our query plan out, comes out the optimizer and it is as it is.
But at this point we haven't done anything.
We've been totally non-invasive. What we do is we look in that plan for the left-deep join pipelines with more than one join, because
that's where we can actually change the join order.
And then we replace those with Polar pipelines. That means introducing two new operators.
One of them does the routing, which kind of would send tuples to one of
these different join orders we've generated. And at the end, we kind of have, it's not
a merge, right? But it's kind of, we bring things back together. Oh, we have the
exact term for it. Where's it gone? The adaptive union. There
we go. Okay, cool. So we have a few different ways of selecting the join orders.
You mentioned the selectivity sampling, that's kind of the main contribution, the new contribution
in your paper, right?
The selectivity-space sampling algorithm.
But there are a few other approaches you use in your paper as well, which I don't
think you mentioned, but they're more for comparison, right?
These heuristic-based approaches.
Yeah, yeah, exactly.
I mentioned those a little bit, and then we'll talk about them in the results.
And when we're kind of sending tuples through these different join orders, we're calculating some
score, this resistance metric, which we're then using to influence our routing decisions.
And there are two approaches you mentioned there of how we actually go about routing
the tuples.
So hopefully I've given a very coarse overview of how this works.
But yeah, so given that you mentioned the vectorized execution there, I was going to ask you
kind of how would this work
if you didn't have a vectorized database engine and you didn't have DuckDB. But you said
that you think it would work reasonably well, or the other approach would work reasonably well for, I don't know,
say if you did this in Postgres, for example.
Yes.
So I think that you might have to configure Polar a little differently.
For example, you might have to choose a different routing strategy, and
apart from that, probably the regret budget might be changed as well.
But apart from that, if you would switch to a tuple-at-a-time database system, I think
it would actually get easier even because then you don't have this whole problem of
not being able to actually compare two different join orders by their processing time, for
example, because you can't
really do that in a vectorized execution engine because it just takes much longer in a relative
sense to process a small number of tuples than it does with a large number of tuples. So yeah,
I think it would be perfectly suitable for that. But yeah, I actively chose to do that prototype in DuckDB because I thought,
okay, that's what people want to use for analytics.
And that's why we should also look at these more complicated things that are
necessary to actually integrate it.
Cool.
So yeah, let's talk about the DuckDB implementation a little bit more.
So what were the primary features that you kind of relied on
from DuckDB to actually make Polar a reality?
Yeah.
So, I mean, as I just said,
the vectorized execution model heavily influenced
the design of this prototype and the configuration as well
because it ultimately introduces these new requirements
of do not switch join orders all the time, try to stick with one join order as long as possible and so on.
And also, as I said, it has this limitation that leads us to only look at intermediate
results.
And another feature that we make use of is DuckDB's morsel-driven parallelism.
In that each pipeline just has multiple instances and each instance pulls chunks of data independently
from some source, processes it and pushes it into some sink and there's no need to
coordinate in between the threads or wait for them or shuffle data. And we had to adopt that, of course, then as well in order to make Polar happen in DuckDB.
And for us, that meant that each pipeline instance has a thread local multiplexer state.
So each multiplexer actually makes independent routing decisions.
So it does not look at what the other threads are doing and what their path resistances are, for example.
Alternatively, we could have created a global multiplexer state that takes all the local resistances into account,
but we chose to keep it local to avoid all this coordination overhead and just to keep it as simple as possible.
And regarding DuckDB's extension framework, Polar can actually be very simply implemented
as a combination of an optimizer extension and an extension operator.
So the optimizer can be extended with new rules that have basically a query plan as
input and a query plan as output.
And then this new rule would be executed after the internal optimizer rules.
And that's exactly where this polar pipeline compilation would happen.
And there we do then this join order selection and replace the joins with a dedicated polar
operator.
So everything that happens in a polar pipeline happens in this operator.
And internally, this would then execute the multiplexer and the join operators. So it
has a bit of pipeline logic in it and does the path resistance measuring. Yeah.
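As a shape-of-the-code illustration only: DuckDB's real optimizer-extension and extension-operator APIs are C++ and plan-in/plan-out, so every name below is a hypothetical Python stand-in, reusing the select_join_orders sketch from earlier. The rule boils down to rewriting each amenable pipeline into a single Polar operator.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class JoinPipeline:
    """Hypothetical stand-in for a pipelined fragment of a logical plan."""
    joins: List[str]                      # join identifiers, in original order
    polar_op: object = None               # set once the pipeline is rewritten

@dataclass
class PolarOperator:
    """Stand-in for the dedicated extension operator: it owns the multiplexer,
    the alternative join orders and the adaptive union."""
    original_order: List[str]
    join_orders: List[List[str]]

def polar_optimizer_rule(pipelines, generate_orders):
    """Illustrative shape of the optimizer rule only: rewrite every pipeline
    that has at least two joins, leave everything else untouched, and return
    the (now modified) plan to the host system."""
    for p in pipelines:
        if len(p.joins) < 2:
            continue                      # nothing to reorder here
        alternatives = generate_orders(p.joins)   # e.g. the sampling sketch above
        p.polar_op = PolarOperator(original_order=p.joins, join_orders=alternatives)
        p.joins = []                      # the joins now live inside the Polar operator
    return pipelines
```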
How long did it take you to actually implement this? Like how big was the task? How easy
did you find it as well?
Well, it took me actually quite a long time, but I think that is also because I needed
to learn DuckDB first. So from the first time that this idea of Polar came up until we submitted the paper was, I think, one and a half years or so or a little more.
And I think one hurdle for me as well was that back then there was no extension marketplace
and no real documentation for the extensibility features. So I sort of improvised a lot. And
actually this prototype that is used in the
paper is a very invasive one because I hard coded all these elements into the pipeline
executor and only now we're working on transforming that into a DuckDB extension, which we want
to publish at a later point.
Awesome. Yeah. And I guess it'd be cool to talk about how much performance gains we actually get from using Polar. So when this extension module does arrive, people can be,
people can start getting excited, right? So yeah, tell us how you went about evaluating Polar and
yeah, how much better is it over, I don't know,
some of the current state-of-the-art AQP approaches and yeah, the current way things work in
DuckDB.
Yeah, sure.
So before we did any performance testing, we actually did an applicability study because this pipeline focus gives us
this low overhead risk that we wanted, but comes with a trade-off in applicability. Because
if your optimizer just gives you query plans that are right deep trees or very bushy, that
means that you will have no pipelines or very few pipelines that have multiple joins. So
Polar cannot do anything.
And indeed, what we saw is that for TPC-H, for example, and therefore also JCC-H, which
is a skewed version of TPC-H, DuckDB does not produce any amenable pipelines for Polar,
so we needed to exclude that benchmark.
But we also evaluated the join order benchmark, which also gives us fairly bushy trees, but
a couple of short pipelines.
And we spent at least a third or so of the processing time in pipelines with more
than one join.
So these are the ones where we can potentially change something.
And then we looked at the star schema benchmark, which is an adapted version of TPC-H that
changes it into a star schema and transforms the line item and order table into one large
line order fact table.
And DuckDB actually produces left-deep plans here.
So that means that we have a lot of opportunity for impact, but we also have this very well
behaved synthetic
benchmark data that we talked about before. And that means that the cardinalities are
very easy to estimate. So there's very likely no need to ever reorder those joins. And that's
finally why we introduced this benchmark SSB-Skew that adds skew and correlation to the star schema benchmark data
and tries to make it more realistic. And with that, we have a benchmark with high applicability
because we still have these left deep trees and the potential for join order improvement
because the cardinality estimates actually tend to be a bit more challenging.
And yeah, what we did then is we looked at the total execution
times for each of those benchmarks and compared that to vanilla DuckDB and a couple of other systems.
So total execution time is all query times summed up. And we saw that, for example,
for join order benchmark, we have mild improvement, like with one thread, instead of 134 seconds, only 115 seconds.
Then the star schema benchmark is about the same as we would have expected it because
DuckDB's cardinality estimation does a pretty good job here already, but at least we don't
add any overhead to that.
And then for SSB-Skew, we saw around a 2x improvement in total compared to vanilla DuckDB in
total execution time.
But we also compared that then with a state-of-the-art adaptive query processing approach, which is
called Skinner MT.
And Skinner MT is a really cool work that builds hash tables for each input table and
then uses reinforcement learning for each
individual query to evolve from some random left deep join tree towards an optimal left
deep join tree.
And with eight threads on join order benchmarks, Skinner MT actually has 25% less execution
time than Polar and also performed best in our comparison. But when we look at
the star schema benchmark and SSB-Skew at scale factor 10 or so, then Skinner MT suddenly is four times
slower. And what we see there is that this large applicability and high impact really can get you
to unacceptable overheads for some types of workloads. And what was even more interesting then is to drill down on the execution times of each
individual query in those benchmarks.
So for the join order benchmark, for example, we saw that the execution time does not change
at all for the vast majority of queries.
So no improvement, but also at least no overhead.
But there's a few individual queries that show very notable
improvements, where one query is around 9x faster than without Polar. Then we looked
at the star schema benchmark, where most queries get a bit slower because the join orders are
good already. And every additional work that we do just deteriorates performance. So that's
a tough place to be in. But the worst query that we measured took around 7% longer than without Polar and the majority of queries around 1-2%.
So the overhead in total is sort of mild. And then finally, we got to SSB-Skew of course,
and there we have this applicability meets challenging cardinality estimates situation. So we see
notable improvements for most queries with some of the queries faster up to 4x. And the conclusion
of that for me personally would be that Polar really works best for these typical data warehouse
workloads where you have tables arranged in a star schema and
yeah sort of realistic data with skewing correlations.
Yeah nice.
I guess what you were saying there at the start about sort of when you did this applicability
study and some of these really popular benchmarks didn't actually have the necessary, didn't
meet the criteria for Polar to be able to exploit.
I wonder, do you have a sort of a general
intuition, if you imagine all of the
workloads in the world, for how many of them Polar would be
able to demonstrate performance improvements?
Like how common is having a left-deep tree in a query plan?
What's your general
intuition? Although obviously, I think maybe it's more a function of how many of these types of star schema workloads you kind
of have out there, which is probably the actual question. Well, it's very hard for me to put a number on
that; for that it would actually be very interesting to work in the industry for a bit. But I do assume
that it is quite a lot of these data warehouse workloads basically
that use the star schema because it's such a well-known optimization. And that's really
where I would expect Polar to shine. But yeah, that's why I can only encourage everyone in
the industry to prototype Polar for their systems so that they can check it out if it
makes any difference at all.
Because I was going to say, obviously you mentioned there that there is a plan
to make this into an extension module.
And I don't know if there is a way for you to anonymously collect statistics of people
who have installed it and see how frequently that actually is happening or actually isn't
happening.
I mean, there probably is some bias there because if someone is installing it, they maybe think they'd have those types
of query plans anyway. So that might be a bit of confirmation bias kind of sort of thing.
But yeah, it would be interesting to see if you could collect statistics to see how frequently
this is sort of being used.
That's actually a cool idea. But I might have to check if I'm able to get on the DuckDB
extension marketplace if I trick people to report data to me.
But, I don't know.
Yeah.
Um, make it opt-out rather than opt-in.
I don't know.
Yeah.
That's a good idea.
Cool.
So yeah, I guess, where do you go next with Polar then?
So obviously there is this move to make it into an extension
module so people can start using it in a production
environment and seeing if it helps for them.
Is there any other plans or directions you wish to take it in?
Well, I mean, there's a couple of future work items that might happen in the future and
that are definitely interesting directions.
So I think for me, the main thing would be to investigate a more holistic performance
measuring approach because we have this path resistance right now which only looks at the
number of intermediate results but really there are many more important factors that
influence the performance, such as the size of the build sides.
Do they fit in the cache or do they not? How long does it take
to probe a tuple or the join implementation? Is it a hash join or an index nested loop join and so on
and so on? So you could really either build sort of a cost model for that or just try again to
measure the processing time, which we did in the very beginning, but stopped
doing that because of this high variance, because you don't have full control over
the background processes on your system. So you can have your path resistance spiking
for no reason. And it's also, as I said, very difficult to compare different join orders
when they have different numbers of input tuples, because with this vectorized execution, processing a few tuples is relatively much slower than
processing many.
And then you have this buffering if one join does not produce enough
output tuples.
And there's just a lot of stuff that plays into that.
But it would be very interesting if someone, if not myself, would look into that.
Cool.
So let's talk about the DuckDB experience then. So you mentioned that originally you
would probably have maybe gone for Hyrise because it was a system you were more familiar
with but that was not an option. So if you reflect on your experience of using DuckDB
and the DuckDB extension module and what was your general overall sort of take on that experience?
Well I think one cool thing is that while developing this prototype I actually learned
a lot about databases because I just debugged DuckDB all the time to really understand how
it works and it's really very state of the art.
So if you want to see how an implementation
of a modern dynamic programming approach for join ordering works, you can just go and check
because it's right there or yeah, all the other cool stuff. So that I think is very
rewarding also. Yeah, or working in the system that so many people are excited about is also, I think,
really great because it opens opportunities.
Of course, you get invited to a podcast, for example, and people tend to care more about
your research.
So I can only encourage other people to do what I did.
Nice.
Yeah.
I was gonna say, what would be the advice you'd give to would-be
developers? I guess that'd be just to go out there and do it, and get involved.
Absolutely. And also the DuckDB people are super great. Like they're all very nice. I met a couple
of them and always had a great time. And yeah, they were always very happy to give me some feedback
or tell me here, this is the extension operator. Did you try this? Did you try that? So that was great.
Nice. Yeah. Let's talk about some impact then. So has your work on Polar had
any impact back on the core product? Obviously you've had conversations with people who are actually
engineers at DuckDB, but has there been any influence there? Have they thought about incorporating this
into the core product in any way?
Well I mean we're still at this point where we don't have the actual extension, but instead
only have this prototypical implementation.
And I had this talk once with Hannes Mühleisen, the DuckDB Labs CEO, and I think a community
extension really would be the way to go here.
But of course, Polar in itself is still very researchy and there's probably a lot of corner
cases that you should figure out before using that in your million dollar business. But
at VLDB, I actually discussed with a lot of people and many people showed very great
interest and I partly attribute that to the fact that this idea and the paper is fairly simple and
easy to understand. So really, many people could easily grasp how this works and what this paper
is about. And I also talked with people from the industry and found that most rewarding because
actually one engineer told me that this should be fairly simple to implement in their
database system and they might want to try to create a rough prototype and
evaluate its potential.
And yeah, since this non-invasiveness for adaptive query processing was the key
goal of my paper, that was obviously a very great feedback for me to hear.
Yeah, for sure.
I mean, I really like the non-invasive design goal as well, because I think that
it's just going to help increase adoption, right?
If you don't have to go in there and intertwine things and create
this big sort of interconnected blob of code, it just gets complex otherwise, right?
If you can keep it simple and it's not invasive, then it becomes easy to
be able to prototype as well, right?
And to really demonstrate the benefits of such an approach. So yeah, I think that is
a good design goal to have, sure, if you want to make your work impactful. Awesome. So yeah,
I guess kind of the next question I want to ask is about surprises. And whilst working
on Polar and with DuckDB, what's been the most interesting lesson that
you've learned during that process?
Well, I think when it comes to DuckDB, I actually saw that many approaches in this whole problem
field of join ordering and adaptive query processing for analytics are actually implemented
in Postgres.
And that has good sides to it because these approaches can be easily compared.
They have the same baseline performance of Postgres.
But on the other side, in practice, if you start your new endeavor, you don't necessarily
want to use Postgres for analytics if you can get much better performance with actual
analytics systems like DuckDB.
And the problem that this brings is that some approaches that work well in Postgres might
actually not work that well in systems like that, because you have to look at additional
things like vectorized execution. I know it's a meme by now, but it's really crucial to
think about things like that for good performance, because
in the end, this can be a make or break. And therefore, I think really one of my key learnings
was if you want to implement something in your analytics research, then you might want
to prototype in a system that people actually want to use for the use case. So it makes
you consider these additional aspects of modern systems.
And I think that in my opinion can substantially decrease the gap between your research and
the real world.
Yeah, for sure. Completely agree with you on that. So we're kind of getting towards
the end of the show now. And before we do go on to the last word, I'm not sure if I
sent you this, so apologies if you've not got an answer primed for this, but I wanted to have a section where you recommend, other
than obviously your own research and go and play with Polar, but other plugins, extensions
and features of DuckDB that you would recommend for the listener to go and try out, whilst
you've been working with DuckDB. What's your favourite plugin or extension, if you've got
one?
Well, I'm not sure if I've got a favorite plugin, but if I had to choose one, it would probably be the Parquet one, which is very basic, but I definitely have a favorite feature.
That might be the out-of-core execution for aggregations and joins. So, Laurens Kuiper has this ICDE paper on
this, I think from this year. And the essential thing is that when hash tables become too
large for your available memory, then instead of switching to a disk-based algorithm, their
operators have this clever design that keeps as much of the data as possible in memory and only spills the
remaining stuff to disk.
And the effect of that is that you have a very, very graceful performance degradation
with growing hash tables instead of a performance cliff where you would suffer from the switch
of, okay, now we don't do this in memory algorithm anymore, but switch to disk base
and maybe even start from the beginning.
And I even benchmarked this feature a bit for a different project another time, and
I'm actually amazed how well that works.
So I find that it's a very cool contribution to making DuckDB's query execution more robust.
Nice.
Yeah, Laurens has actually been on the podcast.
I don't think it was this paper he mentioned.
He had one called These Rows Are Made for Sorting, I think, which was the year before,
maybe.
So yeah, I can't remember, what was the title of this paper?
Oh, I would have to look that up, but it's probably something with robust and aggregations.
Yeah, probably.
That'd be a good place.
Yeah.
Good, some good search terms there to stick in.
Cool.
Yeah.
So yeah, let's get time for the last word now.
So yeah, what's the one thing, David, that you want the listener to get from this podcast episode today?
Well, as a takeaway, I think I would like to share my favorite learning from this
paper that is not really technical.
So Polar was actually a huge collaboration endeavor.
So you can see that on the paper there are many, many names from academia, but also
from the industry from Snowflake, Google, InfluxDB, SAP. And the reason for that is
that this paper originates from a seminar on robust query processing, which is, I think,
organized every two years or so by Goetz Graefe. And that happens at Schloss Dagstuhl,
which is like a very remote place in the German outback.
But they have these amazing international guests from academia and industry. And I saw that this
is happening during my master's while I was working on my master's thesis in the robust
query processing field. And I saw the event page and I thought, wow, okay, I got to go there.
But the guests are usually invited by the organizers. So I thought, okay, I'll just try and send them an application.
And surprisingly, they accepted me.
So in the first week of my PhD, I attended this seminar with these database legends.
And this just developed into this amazing paper collaboration, which was a huge kickstart
for my PhD. And it did not stop there.
And like during the process, I got the chance to discuss my work with many people and their
feedback really helped me a lot to bring this paper to life.
So if I really learned one thing from this project, then it's to reach out to people,
discuss your ideas with others.
And the people in the database community are usually super nice and it's very beneficial
to exchange your ideas.
Oh, what a fantastic story that is there.
What a really nice message to end on.
I mean, that's lovely.
I'd love to get to a Dagstuhl as well.
I've never been to one, but what was the experience like?
Kind of, I guess, life changing almost.
I mean, for me personally, it seemed to be, right?
It's really interesting because I didn't know anyone of those guys and haven't been before.
And yeah, it's sort of a very intimate setting, I would say.
So I played pool against Thomas Neumann and Peter Boncz and actually won, and told my family
about it.
I'm so proud of it. So you just get together and discuss ideas for like five days or so
and yeah, it's a great experience. So if you get invited, definitely do that. If you
don't, maybe it's worth applying.
Yeah, just send an application, right? What's the worst that can happen? Cool. Yeah. So
let's wrap things up there then, David. Before we go, where can we find you on social media?
Anything on X, LinkedIn?
How is it best for people to contact you?
Well, I think the best idea is probably to go to my website, which is d-justen.github.io
because there you have it all.
You can find me there on LinkedIn.
You can find additional
resources to the paper, like the slides and the poster and stuff like that. And that's,
yeah, I think is probably the best starting point. But yeah, I'm also on all the usual
socials of course. So if you look for my name, you'll find me.
Awesome stuff. And we'll wrap things up there then. Thank you so much, David. It's been
a great chat. I'm sure the listener will have really enjoyed it and have learned a lot from
it as well. So yeah, thank you for coming on the show.
Well, thank you.