Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Query Execution Models
Episode Date: May 14, 2024...
Transcript
Welcome to this week's Hardware Conscious Data Processing.
My name is Martin. Professor Tilmann Rabl is at a conference this week.
I'm going to be presenting today's and tomorrow's lectures.
Where are we right now?
In the last weeks, after the introduction, we talked about CPU architectures and
how modern CPUs work, how they execute code, and how they do that increasingly
efficiently and more and more in parallel. So now the question is: since
we are talking about hardware-conscious data processing, how do we execute
data processing? How do we execute, for example, database queries, and how can we incorporate the features of modern hardware?
So how can we use modern CPUs and execute our data processing in a way that is efficient
on modern CPUs and that is, for example, as fast as possible.
We are right now here in the sixth row.
Today, on the 14th, we are talking about execution models, so
query execution models.
Tomorrow we talk about data structures, and then next week Professor Rabl will be back
for a profiling session with Marcel, and multicore.
Okay, so what is this lecture about? First, we're going to talk about the iterator model,
which is basically the oldest, most standard model for data processing or query processing.
Afterwards, we are talking about the materialization model, which is one of the first models that was directly tuned
for analytical processing, and then we have two models that basically improve
on this materialization model, which are vectorization and code generation. And again, the sources: we used sources from
Jana Giceva, Carsten Binnig, and Sebastian Breß.
Okay, so query processing.
In the beginning, we talked about query optimizers
and how a query that you send to a database is processed:
it is parsed, it is optimized, and then it's executed.
We are not going to talk about the optimizer today, so we don't care about how the
database comes up with an efficient plan. We talk
about how the database processes a plan. So we assume we now have the perfect
plan, or at least a good plan. This is the plan that the database, or rather the
optimizer, came up with. So now the question is, how do we execute this plan?
In this example here, we have five tables, A, B, C, D, and E, and we have four joins.
So now the question is, how do we process a query, for example, here with four joins?
And we have four selections on four of the five tables.
And how do we actually do that — how do we really put
this query plan on a CPU? The most classical approach, one that is still widely
used today, is the iterator approach. It is also called the Volcano model,
and sometimes tuple-at-a-time processing.
While it is still used heavily in transactional processing, in recent years people have increasingly
seen that this model has a high overhead when we talk about in-memory database systems.
Modern servers today often have hundreds of gigabytes of main memory, which is often enough, so to say,
to keep at least most of the data that is frequently accessed in main memory. So we should optimize for
this situation, or try to exploit the situation, that a lot of data is in main memory.
And if we now use this Volcano model, the iterator model,
we see that there's a high per-tuple overhead
and a high interpretation overhead,
as we'll talk about in a minute.
So people needed to come up with better ideas
for how this can be executed more efficiently.
So, for main-memory database systems, the processing models that we also want to
talk about are, first, the materialization model.
This materialization model is also called operator-at-a-time or column-at-a-time. The successor of this model is the vectorization
model — vector-at-a-time, sometimes also called batch-wise or block-wise
processing. And then, last but not least, we are going to talk about code
generation. Okay, so the first model is the Volcano model, the iterator model.
And how that basically works is: you start at the top of your operator tree, your query
tree, and starting at the top you ask for the next tuple.
If there is no next tuple yet, the next operator, in this case operator 2, again tries to
get the next tuple and calls next.
So this next is the iterator interface.
The calls to next go down the query tree
until we finally get a tuple,
for example, by reading the first tuple from disk.
And then, once we have this tuple,
it basically goes all the way up to the first operator.
So you have this next chain going down
from the very top, from the root of the query tree,
all the way down, and then the tuple bubbles up.
So we process one tuple at a time.
For this reason, we call it tuple-at-a-time processing.
And this is still mainly used today
for disk-based database systems, mainly for transactional systems.
So you find this model in Postgres, in your standard Oracle system, in SQL Server, and so on. Because people realized that those Volcano-model database systems
are not efficient for analytical processing, another model was proposed: the materialization model.
This model materializes intermediates.
Here we have an operator-at-a-time model.
We don't say, okay, get next, get next, get next, to get the next tuple. In this case, we execute every operator from the bottom up
and process the entire input.
So the first operator, for example, a scan,
does not yield single tuples.
It processes the entire input and then forwards
its output to the next operator higher up in the query tree.
And this has been mainly used
for column-oriented database systems.
The vectorized execution model, again,
uses a Volcano-style model,
so we're using an iterator-based model
that uses the next call,
but we don't have a single tuple at a time now. We have batches of tuples,
as you can see here. And this has mainly been used since DRAM capacities got so large that
we can increasingly talk about in-memory database systems. And yeah, the last one is query compilation.
And the idea here is that we get rid of
all the interpretation overhead
that a database system might have.
So we want to take the query and compile
the minimal code that is necessary to execute this query.
So there's no interpretation anymore:
as soon as we know the query,
we know exactly, okay, this column has,
for example, the data type integer.
It is not nullable and so on.
So we can generate code that does just exactly that.
We don't need code that is interpreted at runtime
to check, okay, what is my data type,
and then call the function for an integer column.
Instead, we can compile exactly the code that we need.
And this execution model splits the plan into pipelines,
as we will see later.
This is mainly used nowadays
on modern many-core systems,
and it is basically the most recent
query processing model that we have.
A short excursion: this is a very interesting paper. It's a little bit outdated.
It was about a database on a smart card. A couple of years ago, there were these things
called smart cards, and they were very limited. This was really just a card,
and they were very limited in their computational capabilities.
So they had very limited memory and a very limited CPU.
The question was now: can we actually have a database running on a smart
card?
And what the authors did in this paper was to say,
okay, we can build extreme right-deep trees. And with some tricks on these trees, they found a
model that basically has no additional memory consumption. The idea is
that for each single record, we walk the entire tree.
That means for every join, a single tuple is compared to every
tuple of the join partner.
That also means that in the end we don't have any hash joins, because a hash table would consume
memory; all the joins would be nested-loop joins.
Obviously, there are some drawbacks. It is less efficient than other models. It does
not work for all the aggregates that we have — for example, a median, where we have to keep some
state to be able to determine it; that does not work here. The same goes for sorting.
But still, they managed to do, I would say,
surprisingly much.
So yeah, they got a lot done
with these extreme right-deep trees.
So it's kind of a really interesting read
if you're interested in that.
So this is one extreme: what you need to do if your hardware is really limited —
a situation we are probably not in today,
or at least not our focus in this lecture.
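To make the idea concrete, here is a minimal C++ sketch of such an extreme right-deep plan (not the paper's actual code, and with made-up tables and predicate): every record walks the whole tree of nested loops, so no hash tables or intermediate buffers are needed.

```cpp
// Sketch of extreme right-deep execution: each outer record walks the entire
// tree of nested-loop joins, so the only extra memory is a few loop variables.
// The tables and the predicate are made up purely for illustration.
#include <cstdio>
#include <vector>

struct Order   { int id; int customer_id; };
struct Invoice { int order_id; int amount; };

int main() {
    std::vector<Order>   orders   = {{1, 10}, {2, 20}, {3, 10}};
    std::vector<Invoice> invoices = {{1, 500}, {3, 250}};

    for (const Order& o : orders) {            // outermost relation
        if (o.customer_id != 10) continue;     // selection, evaluated per record
        for (const Invoice& i : invoices) {    // nested-loop join, no hash table
            if (i.order_id == o.id)
                std::printf("order %d -> amount %d\n", o.id, i.amount);
        }
    }
}
```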
Okay, so iterator model.
Again, the idea is data is processed a tuple at a time.
And each operator basically tries to get the next tuple
that it can process by calling the next function
on the previous operator.
There can be multiple previous operators — in a join, for example,
you have two input operators.
And the result of the next call is either a tuple
or a simple null or a marker that basically says,
okay, I ran out of tuples, so I have processed all the input tuples
that my previous operator can give me.
In this model, every operator also needs to keep its own state.
For example, a hash table: let's say operator 2 in this example
is a hash join.
So we would get more and more tuples
by calling next, next, next.
And internally, we would need to keep some state,
for example, our hash table
that we can later probe in the join.
Okay, we have some example here.
We're joining tables R and S.
We have an explicit join between R and S on the id column,
and we filter the S table on the value column
for values larger than 100.
You can see the query here,
and a simple projection on top,
so nothing too surprising here.
So how would the operators right now work?
So the first operator here that we have,
when we go bottom to top,
would be an operator that
basically scans R. This operator just takes all the tuples, which might be on
disk, and as long as it can get tuples, it emits them whenever next is called
on it. And once there are no more tuples — we have read the entire table — it emits a null. The
same for table S. For the selection, you can now see
this next call. The selection calls next on the scan of S.
And as long as I get a next tuple, I evaluate my predicate —
in this case, that value is larger than 100.
And if this predicate evaluates to true, I emit my tuple.
For the join here, it's a little bit more interesting.
So first, we process the entire left side.
So we call next, next, next on the left side
to build our hash table.
And then once we have built the entire hash table,
we will start with processing the right input
and probe our hash table,
and then only emit the result in case it matches.
And the projection is a simple projection.
In this case, we don't do any calculations.
We just take the requested columns of our join result and emit them.
Now, how we would actually process this in the volcano model would be that we start from the very top.
So we don't start from the bottom and push up the result.
In the iterator model, you would begin at the top.
The projection says, okay, please, operator below me — the join — can I have the next tuple?
The join does not yet have a tuple.
So it first calls next on the left input to start building the hash table.
And now we can see that R is called the first time,
and now R would emit the first tuple,
in case there are tuples in R. We would return this tuple,
so we would emit this tuple.
And now the hash join can start building the hash table.
Then, for the other side — the right input of this join —
we would go to the next operator below.
This would be the filter.
We ask the filter for next.
The filter again calls next on S, and so on and so forth.
So you see that single tuples bubble up until here we have a breaker in the pipeline.
So the join needs to break
because we need to have this hash table
entirely prepared for the probe phase.
So this is a so-called pipeline breaker.
The pipeline is breaking here,
so tuples don't go up all the way to the projection.
They first go up to the join, and they only get to the projection as soon as we have the first
matching join tuples.
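To illustrate the interface, here is a minimal C++ sketch of the iterator model for this example query (hypothetical operator names and a simplified tuple layout, not any real system's code): every operator implements next(), and the hash join is a pipeline breaker that drains its build side before emitting anything.

```cpp
// Minimal sketch of the iterator (Volcano) interface: next() yields one tuple
// or std::nullopt as the "no more tuples" marker; tuples bubble up one at a time.
#include <cstddef>
#include <cstdio>
#include <optional>
#include <unordered_map>
#include <vector>

struct Tuple { int id; int value; };

struct Operator {
    virtual std::optional<Tuple> next() = 0;   // the iterator interface
    virtual ~Operator() = default;
};

struct Scan : Operator {                       // emits one stored tuple per call
    std::vector<Tuple> rows;
    std::size_t pos = 0;
    explicit Scan(std::vector<Tuple> r) : rows(std::move(r)) {}
    std::optional<Tuple> next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Operator {                     // pulls until the predicate matches
    Operator& child;
    explicit Filter(Operator& c) : child(c) {}
    std::optional<Tuple> next() override {
        while (auto t = child.next())
            if (t->value > 100) return t;      // S.value > 100
        return std::nullopt;
    }
};

struct HashJoin : Operator {                   // pipeline breaker: build, then probe
    Operator& build_side;
    Operator& probe_side;
    std::unordered_multimap<int, Tuple> table; // operator-local state
    bool built = false;
    HashJoin(Operator& b, Operator& p) : build_side(b), probe_side(p) {}
    std::optional<Tuple> next() override {
        if (!built) {                          // drain the whole build side first
            while (auto t = build_side.next()) table.emplace(t->id, *t);
            built = true;
        }
        while (auto t = probe_side.next())
            if (table.count(t->id) > 0) return t;  // simplified: emit probe tuple only
        return std::nullopt;
    }
};

int main() {
    Scan r({{1, 5}, {2, 7}});
    Scan s({{1, 150}, {2, 50}, {3, 200}});
    Filter sel(s);
    HashJoin join(r, sel);
    while (auto t = join.next())               // the root pulls tuple by tuple
        std::printf("match: id=%d value=%d\n", t->id, t->value);
}
```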
And as we have said, the iterator model is widely used.
It is efficient.
In most cases, it is efficient.
For example, for typical transactional database systems, where we don't
talk about huge scans, we don't talk about huge analytical processing, we don't have
huge joins.
So what you usually see there is a very restrictive selection, for example, on an ID.
Let's say we get all the items of an order.
There might be tens or a few dozen,
and we usually have an index on it. So we have a couple of dozen tuples, and processing
them this way is actually quite efficient, as the runtime is mostly still bound by disk access.
So the iterator model works quite well there. This is why many database systems still use it.
And it allows for tuple pipelining, again,
but there might be some blocking operators
where we can't just pipeline the entire query.
For example, there are joins, subqueries, sorts, and so on.
But one thing where it is not really efficient
is if we have a lot of data —
large data sets that we process.
In this case, we don't get the best cache usage.
One reason, for example, is that
all the operators run interleaved.
That means that we have all the instruction code for all the operators if we have long
pipelines and large OLAP queries.
We often have very large queries with many operators, several joins, and so on.
So all this instruction code is basically running at the same time if we
process tuple by tuple.
That means we probably don't have many cache hits,
or often the code is just too large for the instruction
caches.
We have a large function-call overhead,
because every operator calls the next operator,
which calls the next operator, and so on.
We have no tight loops.
And also, the combined state of the operators can be quite large —
easily too large to fit into the caches.
If you think about analytical processing,
we often have several joins
and would need to keep all the hash tables
in main memory; this can easily be too large for the caches. And there's been one really nice paper on this, listed below at the bottom.
What they did was test and profile MySQL for TPC-H Query 1.
Again, TPC-H is an analytical benchmark.
So this is not a single index access
where we just process a couple of tuples and are done.
In this query, for scale factor 1, I think we have six million rows —
so a lot of rows.
The filter you can see here on the ship date does not filter out many tuples.
So I think almost six million tuples go into this quite large aggregation
that you see on top. We aggregate, or group by, two columns, return flag and line status,
and then have a lot of sums, averages, counts, and so on. So you see there's
a lot going on, and there's a large number of tuples being
processed in this aggregation.
And what the authors did in this paper
was profile the code and basically check
where the time goes if we execute this analytical query on MySQL.
First of all, there are a lot of function calls, right?
We have six million tuples,
and much of the time goes into function calls —
we have 800 million function calls.
Which makes a lot of sense:
if you have a complex tree
with many functions, and you process tuple by tuple, you
will just have a lot of individual functions being called.
And the most drastic number, I think,
is that we only spend 10% of the overall time actually
processing the query.
So aside from actually doing the aggregates, the sums,
the counts, and so on, all the other time
is spent moving tuples to the next operator,
getting some field of our tuple, and only then doing the actual calculation.
What we can also see here, in the fourth column, is that we have
a very low instructions-per-cycle ratio.
So we can't process a lot per cycle.
And one reason is that we have a row store.
We can't have specialized aggregation functions
for each and every tuple layout that we might encounter,
because in a row store you can have an arbitrary number
of columns per tuple,
and every field in this tuple can have a different type.
So we need to have polymorphic operators
that can deal with that —
that can deal with arbitrary-looking tuples.
So you will always have a lot of function calls
unless you compile the code.
And for compilers, such a situation
is really hard to optimize.
We usually have a low IPC rate,
so we have empty pipelines —
there are not many instructions
that the CPU can run concurrently.
It's hard for the compiler
to optimize across functions,
and we can't use SIMD, or at least the compiler can't auto-vectorize.
There is work on how to use SIMD in those database systems, but it's just much harder
than in other cases.
If you take a look at this plus function — let me go back here.
You can see it here —
can you see it?
Yeah.
There's one addition in all these functions.
There's one plus operation.
So you can see here that there is this plus function,
which is called six million times,
which kind of makes sense, right,
for the six million rows that we process.
But if you now just take this addition,
this plus function, it takes 38 instructions.
And we have a relatively low ratio of instructions per cycle here.
That means executing this plus function
takes 48 cycles in this case.
Compare that to something like doing it in assembly,
knowing where the data actually is, and with some other assumptions:
we could theoretically do that in three instructions.
This database system — MySQL in this case — takes 48 cycles.
So that shows there are a lot of things being done that need to be done, but that we could
probably do a little bit more efficiently.
We don't need to have that many dependent instructions, which alone cost us 20 cycles
here, and we should find a way to
get rid of this overhead.
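As an illustration of where those extra instructions come from, here is a hypothetical C++ sketch (not MySQL's actual implementation): an interpreted, row-store-style plus pays virtual dispatch, null handling, and generic value boxing per tuple, while generated or specialized code can boil the same work down to a tight loop of plain additions.

```cpp
// Hypothetical illustration (not MySQL's actual code) of why an interpreted
// per-tuple "+" costs far more than the addition itself.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Value { int64_t i; bool is_null; };

struct Item {                          // generic expression node, as in a row store
    virtual Value eval() = 0;
    virtual ~Item() = default;
};

struct ConstItem : Item {              // stand-in for a generic column accessor
    Value v;
    explicit ConstItem(int64_t x) : v{x, false} {}
    Value eval() override { return v; }
};

struct PlusItem : Item {               // interpreted "+": evaluated once per tuple
    Item* left; Item* right;
    PlusItem(Item* l, Item* r) : left(l), right(r) {}
    Value eval() override {
        Value l = left->eval();        // virtual call
        Value r = right->eval();       // virtual call
        if (l.is_null || r.is_null) return {0, true};
        return {l.i + r.i, false};     // the actual work: a single add
    }
};

// What generated or specialized code can do instead: a tight loop of plain adds
// that the compiler can unroll and auto-vectorize.
void add_columns(const int64_t* a, const int64_t* b, int64_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

int main() {
    ConstItem a(40), b(2);
    PlusItem plus(&a, &b);
    std::printf("interpreted: %lld\n", static_cast<long long>(plus.eval().i));

    int64_t x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, z[4];
    add_columns(x, y, z, 4);
    std::printf("compiled-style: %lld\n", static_cast<long long>(z[3]));
}
```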
One way to do that is the materialization model.
The idea here is that we don't have a single tuple bubbling up the query tree every time;
each operator in the query tree processes its entire input at once.
So only one operator is running —
well, if we have different paths in the query tree,
there might be multiple operators
that run concurrently,
but within a single pipeline, only one operator is running.
And this operator processes all of its input
until it has completely calculated its output.
There is no pipelining of tuples, obviously, and we need to materialize the entire intermediate result,
which means we have a lot of data
residing in main memory. An example can be seen here. Instead of calling, as at the very top
here, 'for my child, get next', which would call some generator function
that gives us just one tuple, we now call the output function, which basically blocks and says, okay,
I need the entire output of my input operator.
In this case, for the join:
to build the hash table, the join says,
I need the tuples of the left input,
so I ask for the entire output
of the left input operator.
And this is the scan marked with a three.
And this scan would not emit single tuples.
It would process the entire table R
and then forward this table R to our join operator.
There are some tricks that are usually done
so you don't have to materialize the entire table,
but in general, that's basically the approach
that is being taken here.
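A minimal C++ sketch of this operator-at-a-time style, with made-up column data and simplified operators (not MonetDB's actual code): each operator consumes its whole input and materializes its whole output before the next one runs.

```cpp
// Sketch of operator-at-a-time (column-at-a-time) execution: full input in,
// full output out, intermediate results live entirely in memory.
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Column { std::vector<int> id, value; };

// Selection: one tight loop over the whole input, full output materialized.
Column select_gt(const Column& in, int threshold) {
    Column out;
    for (std::size_t i = 0; i < in.id.size(); ++i)
        if (in.value[i] > threshold) {
            out.id.push_back(in.id[i]);
            out.value.push_back(in.value[i]);
        }
    return out;                                   // intermediate result stays in memory
}

// Hash join: again whole input in, whole output out
// (simplified: the output carries only probe-side columns).
Column join_on_id(const Column& build, const Column& probe) {
    std::unordered_multimap<int, int> ht;         // id -> value of the build side
    for (std::size_t i = 0; i < build.id.size(); ++i) ht.emplace(build.id[i], build.value[i]);
    Column out;
    for (std::size_t i = 0; i < probe.id.size(); ++i) {
        auto range = ht.equal_range(probe.id[i]);
        for (auto it = range.first; it != range.second; ++it) {
            out.id.push_back(probe.id[i]);
            out.value.push_back(probe.value[i]);
        }
    }
    return out;
}

int main() {
    Column r{{1, 2, 3}, {5, 7, 9}};
    Column s{{1, 2, 3}, {150, 50, 200}};
    Column filtered = select_gt(s, 100);          // operator 1 runs to completion
    Column joined   = join_on_id(r, filtered);    // only then operator 2 runs
    std::printf("result rows: %zu\n", joined.id.size());
}
```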
Okay, so the advantage here is that we have much fewer function calls.
Go back to the scan here, for example, and take a look at the operator marked
with a four.
You can see that we now have this very, very tight loop.
We have the entire input that we got from the child operator,
and now we just have a very tight loop over each single tuple.
If it matches, we put the tuple into the output buffer.
If it doesn't match, we immediately go to the next tuple.
This should probably look familiar from the current exercise that you're doing
on SIMD.
Such a tight loop is something that the compiler likes, and it's
much better at optimizing the code for such tight loops.
And we have much fewer function calls, because the interface
between operators is just 'give me your output', and I get the entire output. Yeah,
so when we have such loops, the optimizer can do all the tricks that it usually
does: loop unrolling and auto-vectorization. And the CPU, on the other hand, can do hardware prefetching
and better branch prediction.
So, on the one hand, there's the compiler, for
which it is now much easier to generate efficient code.
And on the other hand, the code that is generated is just better for the CPU to process than in the iterator model.
But of course, there are also problems here.
We have tight loops, so that's good,
but the intermediate results might be really large.
Assume you have a large join that is not as selective as you might have expected,
so the join result is really huge.
Now, if we process operator at a time — the entire input,
the entire output of this operator — then we have a huge intermediate result
being stored somewhere in memory.
From a memory-consumption perspective alone, this can be a problem,
and it also means that the first outputs that you have produced
are probably no longer in the caches if the intermediate result is large.
So the question was: is there something like a middle ground?
Can we still have our tight loops,
which the optimizer likes and the CPU likes,
but without the data that we hand to the next operator
always falling out of the cache?
And the approach that people came up with
is vectorized execution.
I think we're going to have...
Let's take a break a little bit later.
So, the vectorization model,
or vectorized execution.
The idea here is that we still use,
or again use, the Volcano-style model.
So we have this next call: we start at the top and then call next on the
previous operator, on our input operator. But in
contrast to the Volcano model, we don't pass single tuples, so we don't have
this function-call overhead for each single tuple.
Instead, we retrieve batches.
So the previous operator, when next is called, tries to give us a batch of a certain number of rows.
The first system that proposed this vectorization model was MonetDB/X100.
They basically came up with this idea.
We'll later see that it's actually not that easy to find this batch size. The
motivation here is that when you create a batch of tuples, for example, with your filter,
ideally the next operator should retrieve this batch,
and when the next operator works on this batch, the batch should still be in the caches.
So ideally you have a batch size that fits into your L1 or L2 cache —
probably not into registers, but into some cache that you can
efficiently access. But depending on how your query looks, how large or
how wide each tuple is, it can be quite hard to find this batch size. So this is
configurable, and the task of the system is now to find
this batch size per query,
so that, once an operator has finished a batch
and the next operator processes it,
this batch is still entirely in the caches.
Contrast this with the materialization model, right?
If you remember, when we process the entire table
or the entire input of a large table, the first tuple that your selection yielded is probably not going to be in any cache anymore. Okay, so how does it look for our
example query here? We see again that we call output on the left input, but now, at the very
bottom here in three, we basically have this output buffer, and now batches
bubble up — not single tuples, but batches, or vectors, of tuples.
And the same for both sides.
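As a small illustration, here is a minimal C++ sketch of vector-at-a-time execution (the batch size, operator names, and data are illustrative assumptions, not X100's code): next() fills a batch instead of returning one tuple, and the filter runs a tight loop per batch.

```cpp
// Sketch of vector-at-a-time execution: the iterator interface stays, but each
// next() call hands over a (cache-sized) batch of tuples instead of one tuple.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t kBatchSize = 1024;          // assumed, tuned per query in practice
struct Batch { std::vector<int> id, value; };     // small columnar chunk

struct VecOperator {
    virtual bool next(Batch& out) = 0;            // fills out, false when exhausted
    virtual ~VecOperator() = default;
};

struct VecScan : VecOperator {
    std::vector<int> id, value;
    std::size_t pos = 0;
    bool next(Batch& out) override {
        out.id.clear(); out.value.clear();
        if (pos == id.size()) return false;
        std::size_t end = std::min(pos + kBatchSize, id.size());
        out.id.assign(id.begin() + pos, id.begin() + end);
        out.value.assign(value.begin() + pos, value.begin() + end);
        pos = end;
        return true;
    }
};

struct VecFilter : VecOperator {
    VecOperator& child;
    explicit VecFilter(VecOperator& c) : child(c) {}
    bool next(Batch& out) override {
        Batch in;
        while (child.next(in)) {
            out.id.clear(); out.value.clear();
            for (std::size_t i = 0; i < in.value.size(); ++i)   // tight, vectorizable loop
                if (in.value[i] > 100) {
                    out.id.push_back(in.id[i]);
                    out.value.push_back(in.value[i]);
                }
            if (!out.id.empty()) return true;     // hand one cache-sized batch upwards
        }
        return false;
    }
};

int main() {
    VecScan scan;
    for (int i = 0; i < 5000; ++i) { scan.id.push_back(i); scan.value.push_back(i % 200); }
    VecFilter filter(scan);
    Batch b;
    std::size_t total = 0;
    while (filter.next(b)) total += b.id.size();  // the root still pulls, batch by batch
    std::printf("qualifying tuples: %zu\n", total);
}
```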
Right, okay.
And yeah, this is still the Volcano model,
so we still pull.
We don't, as in the materialization model,
take the entire input and push it upwards;
we still pull.
So from the root, we go down
and call next, next, next.
So this vectorization model was, or is, an improvement. It
really tries to combine the advantages of the materialization model and the
iterator model. It reduces the number of invocations per operator — not as much as the materialization model, but usually it's enough to amortize the costs.
As we still have tight loops, we can use SIMD, and the CPU can do its prefetching and branch-prediction magic.
So we still have code that is efficiently executable on modern CPUs,
but the size of the data that we send around in the query plan is so small
that it's usually within the caches. And yeah, again, the vector size is really
the crucial thing here: how to select the vector size.
It should be large enough
to amortize the iteration overhead.
If we have a vector size of one,
we just have the Volcano model again, right?
And if we have just maybe a dozen,
the runtime is probably still dominated
by calling next and by the interpretation code.
If it is too large and
our batches are too large, then the next operator will have to go to DRAM because the batch is no
longer in the caches. The paper this comes from is more than 10 years old, I think,
so this number might be a little bit outdated
for modern CPUs, but what they found on this older CPU
was that for Query 1 —
remember, these were pretty wide tuples, right?
The last operator needed like five, six columns
to calculate all the sums and the averages and so on,
so we have rather large, or wide, tuples
that go all the way up — the authors found that vector sizes, or batch sizes, of roughly 1,000 to 4,000 tuples
give the best runtime in this case. And this is a logarithmic scale, so you can see that the batch size of one —
the classical Volcano model — is 100 times slower in this case.
And it also gets worse again if your batches are too large; remember, this is logarithmic.
So yeah, there's a huge difference between well-chosen and badly chosen batch sizes.
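As a back-of-envelope illustration (all numbers here are assumptions, not measurements from the paper), one can estimate an upper bound for the batch size from the cache budget and the tuple width:

```cpp
// Back-of-envelope sketch: pick the vector size so that a batch produced by one
// operator still fits in the cache when the next operator consumes it.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t l2_bytes        = 512 * 1024;  // assumed per-core L2 budget
    const std::size_t bytes_per_tuple = 6 * 8;       // e.g. six 8-byte attributes in flight
    const std::size_t buffers_live    = 2;           // producer output + consumer input

    std::size_t max_batch = l2_bytes / (bytes_per_tuple * buffers_live);
    std::printf("batch size should stay below roughly %zu tuples\n",
                max_batch);                          // ~5461 here; wider tuples -> smaller batches
}
```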
And this table basically shows a comparison
of the three models:
the iterator model — the Volcano model —
the materialization model, and the vectorization model.
For query plans, we see that, since vectorization uses the same
interface — this iterator-style next call of the Volcano model —
query plans are quite simple, while they can be more
complex in the materialization model. Instruction cache utilization: this is the problem that we found in the
iterator model, so there it is extremely poor. For the materialization model, it is
extremely good, because we have only tight loops and the minimum amount
of calls between operators — we only pass the result and never come back
to an operator. And for vectorization, depending on the batch size, it's still very
good. Function calls, similar: we have many function calls in the iterator model, the fewest
in the materialization model, and a little bit more in the vectorization model. Then,
attribute access. The iterator model is usually run on row stores,
so this is rather complex.
If you have a column store,
you can create code that only accepts,
or that is optimized for, certain input data,
and so the attribute access is direct in the other two cases.
And where is most time being spent? Again, we're talking about analytical TPC-H queries here, right?
Most time in the Volcano model is spent on interpretation,
while it is spent on processing, as we ideally want, in the other two models.
In other cases — and this is the reason why the Volcano model is still used — you are probably bound by transaction processing, you have concurrency control,
you have your disk accesses. So please bear in mind that, depending on what your actual
workload is, this table might look very different. We're talking about analytical systems right now. CPU utilization for the iterator model is quite poor because we
have a lot of stalls. This is much better for the materialization and the vectorized
model. Compiler optimizations are now possible with both modern models because
we have those tight loops that compilers like and can automatically unroll and optimize. But now the main point,
the reason why the materialization model is no longer used that much, is the
materialization overhead. It is very low for the iterator model, it's
cheap for vectorization because it's only the batch size,
but it can be really expensive for the materialization model —
for example, if you have large joins or large input tables.
Okay.
I'd say we take a short break now, for five minutes. Are there any questions concerning the first three models
that we talked about?
Okay.
So, the next model, or the most recent model
that there is for database systems, is code generation.
What people observed when working
with this vectorized model was that there is
still room for improvement.
We now don't have this interpretation overhead anymore —
or rather, we have much fewer function calls
compared to the Volcano model,
and we don't have those large intermediate result sets anymore —
but we still have interpretation.
We still have code that checks, for an aggregate,
which code do I need to call for this data type?
Is it nullable? Is it not nullable?
If we have expressions like the one shown in this example here — the extended price column
multiplied by one minus the discount column —
there's still interpretation.
This code still needs to be interpreted,
so there's still interpretation overhead.
And ideally, we would generate,
if possible, code
that does just exactly this calculation
without any interpretation.
So this was the idea of code generation:
people wanted to generate
just exactly the code required to process a query.
And the first database that did this
was HyPer, a database that was developed at TU Munich,
the Technische Universität München.
It was bought by Tableau
and is now part of Salesforce.
So it's also quite successful, and it was the first
prototype, and later a real database, that heavily used code generation for queries. What they did in this project,
and still do, is try to find pipelines they can merge,
or fuse. So they take the pipelines in a query,
and where possible they fuse all the operators of a single pipeline and
generate code per pipeline, until there is a pipeline breaker. For example, if
there is a hash join, again, we need to have the hash table
before we can probe in the join.
This would be a typical pipeline breaker.
But they would generate code — tight loops — for the pipelines,
and then compile this query plan into machine code.
Also, they used a push-based model.
So we don't have this next call;
we push directly into the next operator.
When we are done with our processing, we push our result, for example, into the
hash join.
You can see an example here.
The query starts,
and the first thing we do — since we would like to pipeline the join tuples through, we need to have both hash tables already in place —
is the first pipeline: a tight loop that
iterates over R1, checks if our predicate matches, and if so,
materializes this tuple of R1 in the hash table.
The same is done for the second hash table,
so for the hash table of table R2.
And as soon as we have all the hash tables in place, we generate, again,
tight loops here.
So we generate the minimal necessary code, a loop
that goes over all the tuples in R3.
Then, for example, if there is a predicate, we check the predicate,
and if it matches, we don't materialize the tuple somewhere or push it somewhere
with a function-call overhead.
We directly go to the hash table and check:
is there a join partner for this tuple, yes or no?
If there is one, we go to the next lookup,
check for the other join, and if there is again a matching partner,
we can now emit this combined join result.
And this is very efficient.
This is just the minimal code
that is necessary to process this query —
or not minimal, but the most efficient code.
And what we can also see here is that the blocks
that are generated
don't really resemble typical operator boundaries. So we switch from an operator-centric
model — before, we always had the clear interface of operators in the query tree — to
a so-called data-centric execution. Because now the build and the probe can be parts of different
pipelines that are executed at different stages of this query.
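To make the data-centric idea concrete, here is a hand-written C++ sketch of roughly what the generated code for such a three-way join looks like (an illustration of the idea, not HyPer's actual output; data and predicate are made up): one tight loop per pipeline, operators fused within each pipeline, and results pushed directly to the consuming hash table or output.

```cpp
// Sketch of fused, data-centric pipelines: the loop boundaries follow the
// pipelines of the plan, not the operator boundaries.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Row { int key; int payload; };

int main() {
    std::vector<Row> r1 = {{1, 10}, {2, 20}};             // illustrative data
    std::vector<Row> r2 = {{1, 100}, {3, 300}};
    std::vector<Row> r3 = {{1, 1000}, {1, 1001}, {2, 2000}};

    // Pipeline 1: scan R1, filter, materialize into hash table HT1 (pipeline breaker).
    std::unordered_multimap<int, int> ht1;
    for (const Row& t : r1)
        if (t.payload >= 10) ht1.emplace(t.key, t.payload); // fused scan + filter + build

    // Pipeline 2: scan R2, build hash table HT2.
    std::unordered_multimap<int, int> ht2;
    for (const Row& t : r2) ht2.emplace(t.key, t.payload);

    // Pipeline 3: scan R3 and probe both hash tables in one fused loop;
    // matching tuples are pushed straight to the output, nothing is materialized.
    for (const Row& t : r3) {
        auto m1 = ht1.equal_range(t.key);
        for (auto it1 = m1.first; it1 != m1.second; ++it1) {
            auto m2 = ht2.equal_range(t.key);
            for (auto it2 = m2.first; it2 != m2.second; ++it2)
                std::printf("join result: key=%d %d %d %d\n",
                            t.key, it1->second, it2->second, t.payload);
        }
    }
}
```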
What this HyPer database did was directly generate low-level virtual
machine code — LLVM code, or LLVM IR. It is somewhat similar to assembly:
they would yield not assembly code, but something very low-level,
comparable to assembly, yet still platform independent. With this LLVM IR, you would then
use the LLVM compiler infrastructure and compile it for the machine you're running on.
So you would still have almost assembly code, but you are still platform agnostic.
And the compilation is done when you have your query plan at hand.
So you have your query plan, you know which data types you have, you know exactly
what the current query does, you know all the predicates, so you can directly compile this into your
LLVM IR code.
And there's an interesting demo paper that talked about how efficient this is and what
can be done to make it more efficient.
Because, as we will see in a second, this takes some time.
Compilation takes some time; it is not free to compile code.
And you need to be aware of the compilation
time if you do something like that.
But we don't compile all the code that we need for
a single query. There's obviously also code that
we just reuse and link against — for example, the memory
management. This is all just whatever we have in our C++ libraries;
it is not compiled every time.
We only compile the code that is necessary
to compute the current query.
Yeah, this table shows HyPer's performance
and compares it against VectorWise.
VectorWise is the commercial product that came out of this X100 prototype that we have just discussed.
So VectorWise is the vectorization engine, so to say.
And here we have seven exemplary TPC-H queries.
You see the compile times on the left,
in milliseconds, and, well, they are quite low, right?
You see the runtimes on the very right for HyPer,
and in the second rightmost column the runtimes for comparison,
and you see that for this TPC-H example
we are in the range of seconds.
So each query takes seconds
to execute for HyPer, and for each query we pay an overhead
of a few dozen milliseconds for compilation. And in most cases we are faster than the
vectorized engine. So in this case the compilation overhead is very small and we have nice results.
If you look, for example, at Query 1, we improved from 33 seconds to 9 seconds.
So that's quite a result, but there are also queries where we are slower. Still, the compilation time is mostly low,
and in this case, we are better than VectorWise.
And HyPer is not the only system that does this.
HyPer was the pioneering system —
the first system that really did it
all the way for analytical queries,
not just in small steps;
they really did it for the entire query processing.
And nowadays, there are also other systems.
For example, if you use SQL Server —
the Microsoft SQL Server —
there's the Hekaton engine,
which also uses query compilation;
they create C code as the intermediate language
and then compile the C code.
So they really generate readable code;
they don't generate assembly.
Cloudera does that for parts:
they don't compile the entire query, but,
for example, they compile the predicate evaluation
if it is a complex predicate.
Postgres has JIT compilation.
SAP HANA has a new engine that uses query compilation,
the HEX engine. And many other systems do so nowadays as well.
But what people found in the years after HyPer was that for certain queries, especially
OLTP — imagine a large OLTP system where
you have tens of thousands of queries per second —
a millisecond of overhead for compiling is not acceptable.
You can't do that.
If a query runs much faster than a millisecond,
this compilation overhead is just too large.
So people needed to come up with ideas for
a middle ground. For large
OLAP queries — for queries that run for seconds or for a long time —
the compilation overhead is usually worth it because the code
is more efficient, so it's a good trade-off to invest in this.
For other queries, this might not be the
case, and then interpretation can be faster because there's no compilation at all.
And what you can see on the right here is what the state of the art is —
I hope you can see that. For example, Hekaton is shown here. You have a query plan,
and what Hekaton does is generate
C code and compile it.
So this is expensive,
even though Hekaton is built for transactions.
They do this mostly for stored procedures,
so they don't do it on the fly.
In general, they do it when you create
your stored procedure — in this case, you prepare the function,
and there's enough time to do that.
And then, hopefully, when you have a stored procedure,
it is called quite often.
So yeah, the compilation overhead
is worth it in this case.
And Umbra, on the right side here in red,
is basically the successor of HyPer at TU Munich.
What they now do is they have this Umbra IR,
so they no longer generate LLVM IR directly.
And they now have different ways to execute it. They can say,
okay, I compile to C++
and let the compiler do all the magic —
all the loop optimizations
and all the tricks that compilers do.
This is the slowest path, but it
probably gives you the most efficient code.
They can also go to LLVM IR,
which can be interpreted —
so they can run this LLVM code without compilation —
or which can be compiled,
which is a little bit faster to execute. Or they directly execute their IR code.
So they have different stages, and whatever they think they need — if they
think it's an expensive query and investing in compilation is worth it,
they go this way and use all the capabilities of modern
compilers; otherwise, they don't.
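A hypothetical sketch of that adaptivity trade-off (not Umbra's actual logic; the thresholds and backend names are made up): choose a backend by weighing the expected query cost against the compilation latency.

```cpp
// Hypothetical backend selection: cheap queries are interpreted, expensive
// queries amortize the cost of a fully optimizing compilation path.
#include <cstdio>
#include <initializer_list>
#include <string>

std::string choose_backend(double estimated_query_ms) {
    if (estimated_query_ms < 1.0)   return "interpret IR directly";      // OLTP-style queries
    if (estimated_query_ms < 100.0) return "fast, unoptimized codegen";  // small OLAP queries
    return "fully optimized compilation";                                // long-running queries
}

int main() {
    for (double cost : {0.2, 25.0, 5000.0})
        std::printf("%8.1f ms -> %s\n", cost, choose_backend(cost).c_str());
}
```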
So yeah, there's a typical trade-off here:
is it worth the added latency
of compilation or not?
And when we talk about this whole query compilation topic,
there are also other projects, for example, Mutable,
which is also a German database project
that has been presented recently.
They generate WebAssembly, which can be executed very efficiently by most
modern web browsers nowadays, which is also super interesting —
just as a side note. And this comparison is probably a little
bit hard to read, but what the authors did here is compare different
generation targets. As I already said, we have bytecode, which is comparable to
SQLite. We have those fast-start engines, for example Flying Start, which
have a small overhead. Or we go all the way with C++
and let the compiler do all the magic.
And what you can, I hope, see a little bit here is that the fast engines — the interpreted code
and this Flying Start engine — don't do a lot of optimizations.
Their aim is just to get the query
running quickly.
This is a logarithmic scale:
you have very small overheads for compilation,
but you don't get the best,
the highest throughput possible,
at least not for all queries.
And then we have the other two variants,
for example, the optimized LLVM one,
or the yellow one here, C++
with -O3 optimizations —
so all the optimization passes we have in modern compilers.
This takes significantly longer to compile,
but also gives you overall the best throughput.
Whether it's worth it totally depends, right?
The throughput increase
is often not very large,
but if your queries run for seconds or minutes,
you probably don't care.
If your queries run for a shorter time, then yeah,
you have to decide whether the red one or the blue one
is the generation target
that you want to go for.
Okay, so now the question is,
what is the better approach?
There is VectorWise, which uses the vectorization model,
and there is HyPer, which does the query compilation.
And it is really hard to compare
two different full systems, because they differ in
the way they access disk,
how well their query optimizers work,
and so on and so forth —
it's really hard to compare the systems.
So researchers came up with
and built a new system, and wrote a paper about it, where they combined the two
approaches. This is not a fully fledged database; they just built a minimal
system that could mimic both approaches, and they coined those approaches 'Typer',
for HyPer, and 'Tectorwise', for the VectorWise code. The paper is called 'Everything
you always wanted to know about compiled and vectorized queries,
but were afraid to ask' —
so, a very short title.
So very short title.
And the results that they found was that it really
depends on your queries.
So scale factor one is quite small.
So, it's just one gigabyte of data.
Probably, so, there's already an overhead
for a query compilation that you need to consider here.
But depending on what the query,
extra query does, different systems are better.
So, for example, we have Q1 that you have seen before.
We have a lot of, very have a very large aggregation.
We have QV6, where the aggregation is just minimal.
We have very selective filters.
So we have three filters in this QV,
on a single table, no join.
And these filters, they remove most rows pretty early.
We have QV queries three and nine
which have multiple joins, large joins.
And we have query 18 which has an aggregate
and the huge generaties, so it is almost a key column.
So one and a half million groups.
And you can see the results here.
And again, there's no clear winner if you look at the results.
For Quby 1, the generated code of this type engine is faster because the expressions are
all fused and generated very efficiently. But for other QBs, for example, QBs 6, 3, and 9 here,
the TectorWiseEngine1.
So it always depends on how the QB looks,
which engine might be better.
And in this table, they show the results of profiling the two engines.
The reason why Typer,
the generated code, the code generation,
was so much better here is that, since we have
this very tight loop with all the expressions
already fused into it
and code generated for the expressions, we have very good
data locality — we have the data already in the registers when we calculate the results.
Tectorwise, on the other hand, often needs to materialize, which can be really expensive in this case.
But,
well, it depends on the queries.
For example, here, for Query 1,
which is very compute-bound, code generation
is much more efficient.
But on the other hand, we have Query 9, and here we have an example where the Tectorwise, so the vectorized, engine is more
efficient. And the reason for that is that this query is dominated not by an aggregation,
where we can generate very efficient code, but by probing the hash table.
And here, the Tectorwise engine has a tight loop
over the probing phase.
And in this tight loop, the CPU can keep many outstanding loads in flight.
So we are less bound by the DRAM bandwidth,
and we can basically hide the probing latency, the access latencies, because we can issue many outstanding reads, memory loads, in this case, if we have this tight loop.
But if our merged, generated code does all the operators at once, it's harder for the CPU to issue those outstanding loads.
So this is an example where the vectorized code performs better.
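Here is a small C++ sketch of why the vectorized probe hides memory latency better (the hash-table layout, hash function, and sizes are illustrative assumptions): hashing and probing run as separate tight loops over a batch, so the probe loads are independent of each other and the out-of-order CPU can keep many of them in flight, whereas fused per-tuple code interleaves the probe with downstream work and serializes the misses.

```cpp
// Vectorized-style probe: stage 1 computes all bucket positions, stage 2 probes
// them in a tight loop of independent loads.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Bucket { int key; int payload; };

void probe_batch(const std::vector<Bucket>& table, const std::vector<int>& keys,
                 std::vector<int>& out) {
    const std::size_t mask = table.size() - 1;         // table size is a power of two
    std::vector<std::size_t> pos(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)      // tight loop #1: hash only
        pos[i] = static_cast<std::size_t>(keys[i]) & mask;
    for (std::size_t i = 0; i < keys.size(); ++i) {    // tight loop #2: independent loads
        const Bucket& b = table[pos[i]];
        if (b.key == keys[i]) out.push_back(b.payload);
    }
}

int main() {
    std::vector<Bucket> table(1024, Bucket{-1, 0});
    table[42 & 1023] = {42, 4200};
    std::vector<int> keys = {42, 7, 42};
    std::vector<int> out;
    probe_batch(table, keys, out);
    std::printf("matches: %zu\n", out.size());         // fused per-tuple code would
                                                       // interleave this probe with later
                                                       // operators and serialize the misses
}
```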
Here is a comparison of these two approaches, code generation and vectorization,
now along different metrics. Computation-wise,
we usually get more efficient code
if we compile the code,
because we can very often keep most data
in the registers,
and we don't have any interpretation overhead
for expressions and so on. So computation is very efficient if we generate code just for this particular query.
But when we talk about parallel data access,
depending on how complex the query is
and how well the CPU is able to hide memory latencies,
code generation can be slower.
In this case, the vectorized engine, Tectorwise, might be able
to issue more concurrent memory loads,
which can be better here.
SIMD vectorization is usually where vectorized engines
are better, because, since the code was compiled beforehand,
the compiler can run all of its optimization passes —
there's a lot of time for the compiler
to do this auto-vectorization.
Compilation-based engines usually don't do it.
They could generate SIMD code similarly, right,
or take the time to compile it with full optimizations,
but then again you have the problem of long compilation times.
Parallelization is usually good for both.
For OLTP, it depends, right?
If you can generate your stored procedures upfront
with code generation, this is probably the best you can get.
But if you can't generate them upfront,
then you have to pay the compilation overhead.
Language support depends. And I think the most important point here is profiling and
debugging. For the
vectorized engine, it's just your C++ code as you know it, as you have to write it
in the exercises, right? Next week we will talk about profiling,
but this is probably what you all know — how debugging works and so on. Now, if you
generate code on the fly, this is a very different beast to debug and
to profile. So you need
a lot of additional tooling around that to be able to debug and probably profile it.
Any questions about code generation?
Okay.
I think I'll go back to the table in a minute.
Okay, so now a short summary.
First of all,
we talked about the iterator model today.
This is the standard model.
For analytical systems,
it is a slow model;
I think it isn't used in any of the modern systems
for OLAP workloads,
but it's still the model used
for other database systems,
for a typical Postgres system and so on.
So this is still a model that is widely used,
but it's usually not a good fit if we talk about OLAP workloads.
Then we talked about three models that are geared towards analytical systems,
so models that are better and more efficient for those workloads.
The first is the materialization model,
which was the first approach — the MonetDB approach.
And then we have vectorization and code generation,
which are basically the two approaches
that all the state-of-the-art systems use.
So if you think about Snowflake, Databricks,
whatever database you have in mind there,
they usually either generate code
or use something like this vectorization model.
Are there any questions so far?
If not, we will continue with data structures tomorrow.
But first, let me ask you a few questions.
Let's assume there were a fourth column.
This table here comes from a PhD thesis from 2009,
so there was, I think, no code-generation database
at that point in time.
But can you help me fill it in?
What do you think the fourth column would look like
if we now added code generation here?
How would that look for code generation?
So let's start with instruction cache utilization.
How would that look for code generation?
I'd like to have this table filled before we finish.
So yeah.
The problem of the iterator model was that we have a lot of code that is running concurrently, right?
Basically, we go through all the operators, and each operator is usually huge because it needs to be capable of processing whatever data there is.
For code generation, we only generate the minimal amount of code, because we know the column that we are going to process
is, say, an integer column encoded in this and that way,
so we don't have to have any switches for data types,
for encodings, and so on, or for null checks.
So there are very few instructions,
and, since we also have tight loops,
there is usually very good instruction cache utilization.
Function calls?
Nobody?
Guesses?
Yeah?
I think it depends on the code that we generate, of course.
And we can do it in a good way and in a bad way.
But I think we can do it in a way that the function calls don't hurt very much.
Yeah, exactly.
So we usually try to — yeah, as I said, it depends on which code is generated.
In some cases, it might make sense to still have function calls for complex things, and
we still have to call functions if we access external data, SSDs, and so on.
But for the main code that we process there, we try to avoid any function calls.
We try to inline as much as possible, so this will be as few as possible, basically.
Attribute access is direct. Where do we spend most time for code generation?
There could be two answers.
Where do we spend most of the time?
Is it interpretation?
Mm-hmm. I think if the data that we have to process is small,
then we will spend most of our time on generating the code
and making the code efficient.
If the data that we have to process is large,
then probably most of the time will go into processing and loading.
Exactly.
Yeah.
So, for small queries, OLTP, we spend most time on preparing the query — generating and compiling the query.
If we have a complex query, we spend the time on processing.
Exactly.
CPU utilization — I guess it should be clear
that this is usually very good here.
Compiler optimizations are applicable, right?
This was basically the last thing we talked about:
depending on which backend you use,
you can basically say, okay, I want to use all
of the compiler optimizations
if you think it's worth investing the time, or you don't do that.
The materialization overhead is very cheap, because you usually try not to materialize anything.
Just to maybe make that clear again: the materialization overhead was huge for the materialization model — or can be; it also depends on the query. But there are some cases where we just have to
materialize. If you have a classical hash join — and most databases
use hash joins because they're the most efficient join — and your build side is
huge, then you need to build the entire hash table before you can continue.
In that case, the other approaches also materialize: if this pipeline breaker,
the hash join, has a huge input, then they all have a huge input to materialize —
in this case, the hash table. So other models, apart from the
materialization model, can also have huge intermediates. As usual, it always depends.
But the largest materialization overhead
is usually in the materialization model.
And scalability is good as well.
Okay.
This was it for today.
Tomorrow we see each other again for data structures.
Have a good day. Bye bye.