Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Query Execution Models
Episode Date: May 17, 2023...
Transcript
Okay, so let's get started. Welcome everybody to today's session. We're going to talk about query execution models. And before we do so: I told you last time, and I'm going to tell you again this time, and that's going to be the last time. If you're interested in a student position, working on stuff like this, working on lectures, etc., meet me after the lecture, send me an email, or chat with somebody in the group.
We're always looking for people, so that would be great.
Where are we right now? We're in execution models.
So we finished up vectorized execution yesterday and today we're going to look at how we actually can execute
complete queries. So there's different ways how we execute them and so we're going to look at a range
of different approaches and then on two approaches more specifically that are working well with modern hardware.
And this is still kind of single threaded, can be multi-threaded, can be multi-socket, etc.
But we're not going out there that far.
So we're still in memory, etc.
And then we're going to continue with data structures. And after the data structures, next week we're going to have this profiling session where we talk about how you can profile your code in a bit more detail.
So same thing, still all the same.
Also on the sixth, we want to do a Q&A
and a bit of a feedback session.
We'll probably send out a questionnaire soon.
So this is also going to be, of course, for the tasks.
So you can ask questions.
You can ask questions for the lecture.
But also we want to get some feedback.
This is still a fresh course.
And we're midway, let's say.
So if you have some feedback on how we can change things and what is good, we'll be happy to hear it. Otherwise, no news today; we should be able to finish this up today. We're going to talk about the different kinds of models. First I'm going to give you an overview and then step through the different models, basically in increasing complexity and also in increasing efficiency, though there are some trade-offs.
And most of this today is based on Jana Giceva's lecture on data processing on modern hardware.
So that's a TU Munich lecture.
So just so you know, I mean, if you look for her slides, you're going to find some resemblance there.
Okay, so query processing.
So far, we've looked at small code snippets, individual small stuff.
So how do we put stuff in memory?
How do we write code more efficiently?
Things like that.
Now we want to look at the complete query.
And the question basically is: if we have a query like this, say a four- or five-way join, so five tables that we want to join, and we have selections in there, maybe aggregations, groupings, stuff like this, how will we execute this?
How do we process the individual tuples in there?
There's many different ways to do so.
And the processing model, typically a database has a single processing model.
The processing model will define how this is done. So when to read which table,
what to do with the intermediate results, etc. So that's the core. It's not so much about what kind of operators we have, which is of course also specific to a database, but today we want to look at, given certain kinds of operators, how we invoke them, basically.
And the most classical approach, I think I told you about this already, is the so-called iterator model, also known as the Volcano model, named after an early execution engine, or the tuple-at-a-time model. Meaning, we're using each operator like an iterator: each operator has an iterator interface, and we're invoking this for every single tuple, basically.
And for main memory database management systems, this typically has too high an overhead. One problem is basically that we have a per-tuple overhead, and we have all this interpretation overhead. So for each tuple, we basically have to go through the whole code base of all of the individual operators. These might be very complex because you have many different types, etc., and that makes the processing very expensive. I told you in the past this was not so much of a problem, because the database was always limited by the disk, and the disk basically has a bigger granularity than per tuple: we would have larger pages, which contain many tuples. And then it really doesn't matter, right? We're retrieving whole pages, and then we can iterate over those tuple by tuple, because the overhead is really just shipping these pages in and out. Now, if everything's in memory,
that's a different story. All of a sudden, executing the
code makes a difference. And this is where the iterator model just becomes too slow.
And for main memory databases, there's basically three typical models that you can use. It's the
materialization model, which means you're executing each operator individually. So,
one operator at a time, and
you're materializing the whole intermediate results. And this could, for example, be also
if you have a column store, you're basically executing column by column. So a complete
column at a time. Then there is the so-called vectorization model that's different from SIMD vectorization.
It's kind of, I mean, you can use SIMD using the vectorization model, but it's a different kind of vectorization.
The idea is that you, rather than processing individual tuples, you're processing vectors of tuples.
These vectors can then internally be processed with SIMD instructions, but the idea is you're basically using the same kind of technique: rather than individual tuples, you're doing multiple tuples at a time, and here you can amortize these overheads better. So all of this individual interpretation, like calling the individual operators, getting the code, etc., you can amortize if you're running this on multiple tuples at a time.
And then the final one, and let's say there's always a struggle between the two,
is the code generation.
So there are different, let's say, camps among database researchers. Some say vectorization is the way to go. Some say code generation is the way to go. We'll also see a comparison towards the end. Of course, they also talk to each other and do research together. And there are basically trade-offs between the two. Sometimes vectorization is better, sometimes code generation is better. Code generation basically means that, rather than having operators that you're interpreting, you're emitting code that is specific to a query and that will execute the query most efficiently.
And there's different levels. We'll also see this towards the end of the lecture.
So let me give you an overview of the different processing models, so you basically have an idea of what we're going to go through today. We're going to start with the Volcano model, also called the tuple-at-a-time model or the iterator model, so these three names. And it basically uses pipeline parallelism.
So that's what we can use here.
Because we have a pipeline of operators, we can basically have multiple operators active at a time, but each operator will just execute on a single tuple at a time.
So that's basically all the parallelism
that we get in there.
And this is really used for disk-based database management systems.
And because most of the traditional database management systems were disk-based,
a lot of these systems still use this.
So Postgres, for example, uses exactly this model.
So you have multiple operators: it could be, say, your projection on top, then your join, then your selection, something like this, and a scan at the bottom. Then the projection will basically call the join, that will call the selection, that will call the scan, for each individual tuple. So you basically have a call stack, and then the scan will produce the next tuple. This goes up to the next operator. If the operator cannot fully process it, say for example a selection will not emit a tuple for every input tuple because some don't match, then it will continue to ask for new tuples until it can emit something.
We'll see this also.
And, well, you can see there's basically a long call stack,
and it's per individual tuple.
The, let's say, orthogonal approach to that: rather than doing tuple by tuple, you can do operator by operator. That also works. Meaning we basically say that this operator completely consumes its input, produces its output, and then we let the next operator work on that output as its input. And the intermediates will go to memory, will be stored, or something like that.
So say, for example, first we do our scan.
The scan can do the filtering right away.
Then we have the filtered results.
On these filtered results, we can do the next operator.
And here, we can basically have intra- or inter-operator parallelism, meaning that, since we're looking at the whole data set or the whole intermediate result at a time, we can actually make these operators parallel. So for each operator, like a scan, we can make the scan parallel, we can make the selection parallel, everything. This is found in column-oriented database management systems, so systems that, rather than storing rows, store their data in columns, and that often also store their data on disk. And we'll see, I don't want to give all the details away yet, because we're going to go through this in a bit more detail.
Then we have the vectorized execution.
That's kind of the middle ground between the two
where rather than saying we're going to give or get individual tuples, we're going to ask for a series of tuples. So say a batch of tuples would be a thousand tuples, something like this. And we work on those per operator. So this means we have a fixed size
or like a smaller data set per operator. So we're not going to go through the whole table,
for example, in a scan,
but we're just going to go in batches through the table.
But we don't do an execution for each individual tuple. So executing a single operator will be amortized, or we get this kind of amortization of loading the code and having these call stacks, et cetera, because we still do a lot of computation rather than just calling functions back and forth.
And here, I didn't put tuples.
It should actually be like we're still calling the next here,
but the next will ask for multiple tuples at a time.
So a large number of tuples at a time.
So it's still the same kind of volcano iterator model,
but we're going to go in batches.
And then the final one is query compilation
or code generation.
Here, we're basically trying to get rid of all this interpretation overhead.
So the interpretation overhead means
we have our operators, which is lots of code, basically.
And we say, please operator now work on this type of tuple,
which has this data type.
We have to load everything so that the operator will be
quite complex in order to deal with all the special cases that we might have to deal with.
And rather than doing this, we're going to say, well, this is a kind of query that does a scan
and a join and aggregation, for example. So let's produce code that does exactly this
and nothing else with these data types
specialized exactly for this.
So we'll have a much smaller code.
We'll have a tight loop, meaning we're going to iterate over the tuples only as many times as we need, not in different operators, not in different functions, but in a single function. And we're splitting the code up into pipelines until we have pipeline breakers.
So say, for example, we have to build a hash table, something like this, we have to materialize.
So there, we're going to have separate code parts, but the rest will be in one tight loop. And that means our code is going
to be much more efficient. It's going to be much smaller, but we have to pay by producing this code
and compiling this code. And this is used in many-core, main memory database management systems. The same is also true for the vectorized execution.
So as I said, there are kind of these two camps, where one says, well, let's go with vectorization, and the other says, let's go with query compilation.
And there's trade-offs.
As an excursion, so just so you've seen this, there is also something which I actually kind of like.
This is an older paper from 2000 where people built a database management system for smart cards.
So basically the idea was: if we have our data, say our health data, on the smart card, let's have a database on there, and we have to organize the data there. So you might know that a smart card actually has a processor. So it's not just storage. Depending on the type of card, many cards have a processor that is powered when you're either tapping it via RFID or when you put it into the reader, and while it's in there, you can basically do some processing. So people built this.
But of course, the memory is very limited.
So the idea was how to do processing in this super constrained setup,
where we don't have any memory, and they came up with the idea of an extreme right deep tree.
So the idea here is that we don't use memory at all.
So we're not going to materialize at all.
Meaning, say for example, if we do a join, for each individual tuple on the left input,
we're going to go through the whole pipeline,
through every single tuple in all of the other tables.
So, of course, there's going to be more work,
but we don't need any memory at all.
So, we also have only pipelining.
All joins, except if we have some kind of indexes
or if the data is pre-sorted on the join key,
all joins will always be nested loops,
because we cannot materialize.
So usually we would today use a hash join,
but then we need to build a hash table.
So that consumes memory.
So in this case, we're saying, OK, let's look at the single tuple and go through everything
else.
And then let's look at the next tuple, go through everything else.
Of course, it doesn't work for everything.
So you cannot do a holistic aggregation, for example.
A median, you need to materialize.
There's no way to basically find this without storing something.
Also, you cannot produce sorted results, unless the data is already sorted, because you would have to store the intermediates somewhere.
Okay.
So that was just an excursion.
I think it's interesting.
So the idea is that rather than having these
typically left deep trees, for example,
so usually you would produce like your query like this,
where you do the selection on your scan
and then do your joins,
and you will have intermediate results, or at least individual tuples as intermediate results, in every step.
Here, we're not producing any intermediate results at all.
We're just looking at the current tuple.
If we emit it, we emit it after going through the whole pipeline. Otherwise, we're just going to drop it and go to the next tuple.
So we always just need basically one tuple plus all the extensions.
Basically, the complete row that we might have from all of the tables.
So meaning one tuple from each input table, basically, but that's it.
So not more than that.
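To make this concrete, here is a minimal C++ sketch of what such an extreme right-deep plan boils down to; the tables, values, and predicates are invented for illustration, every join is a nested loop, and the only state is the current tuple of each table:

    #include <cstdio>
    #include <vector>

    struct Row { int id; int value; };

    int main() {
        std::vector<Row> R = {{1, 10}, {2, 20}};
        std::vector<Row> S = {{1, 150}, {2, 50}};
        std::vector<Row> T = {{1, 7},  {3, 9}};

        // For each tuple of the outer table, walk the entire remaining pipeline.
        // State: only the current tuple from each table, nothing is materialized.
        for (const Row& r : R)
            for (const Row& s : S)
                if (s.id == r.id && s.value > 100)        // join condition + selection, no hash table
                    for (const Row& t : T)
                        if (t.id == s.id)
                            std::printf("%d %d %d\n", r.id, s.value, t.value);
        return 0;
    }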
As I said, just because I think it's an interesting idea
and it kind of takes the iterator model even further to the extreme. I cannot really imagine a use case right now, but maybe in the future there comes a new use case, and then you're going to think, well, I know this extreme right-deep tree that I can use for this kind of setup.
Okay, so with that, let's go back to the iterator model. Well, as I said, in the iterator model data is processed tuple at a time. It's also called the pipeline model, because we can pipeline, or the Volcano model, because of a system that was called Volcano. And basically each operator has an open, a next, and a close function, and that's basically it. That's all that you need, and then of course you need to feed the output into the next operator. So, let's put it this way: you can request the next tuple from an operator by calling the next function, and on each invocation the operator either returns a single tuple, or it returns null, which means that's it.
We're done with the table, basically,
or we're done with the query,
at least on this side of the table.
And each operator keeps its own state.
That's the other important thing.
So say, for example, you want to do a hash join,
then the operator, the join operator, will keep the hash table as its state. Or if we want to do some aggregation, then the operator will keep some kind of aggregate as its state. So again, say for example a median: we would keep all the values and then in the end find out, okay, this is our median value, and output this median value as soon as we can.
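As a rough sketch of what such an open/next/close interface might look like in C++ (the names and the simplified Tuple type are illustrative, not taken from any particular system):

    #include <cstddef>
    #include <optional>
    #include <vector>

    using Tuple = std::vector<int>;  // simplified: one row of integer attributes

    // The open/next/close interface that each operator exposes.
    struct Operator {
        virtual void open() = 0;
        // Returns the next tuple, or std::nullopt once the input is exhausted.
        virtual std::optional<Tuple> next() = 0;
        virtual void close() = 0;
        virtual ~Operator() = default;
    };

    // A selection keeps only its child and the predicate as state; on each call
    // it pulls from the child until a tuple qualifies or the child is done.
    struct Selection : Operator {
        Operator& child;
        std::size_t column;
        int threshold;
        Selection(Operator& c, std::size_t col, int t) : child(c), column(col), threshold(t) {}
        void open() override { child.open(); }
        std::optional<Tuple> next() override {
            while (auto t = child.next()) {
                if ((*t)[column] > threshold) return t;  // e.g. value > 100
            }
            return std::nullopt;
        }
        void close() override { child.close(); }
    };

A hash join or an aggregation operator would keep its hash table or partial aggregates as state in exactly the same way.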
And let's look at an example.
So say for example, we have this very simple query. We have two tables, R and S. We're joining R and S on an ID attribute. We're selecting everything that's larger than 100 in the value attribute of table S. And then we're outputting, or projecting, the result to R.id and S.state.
So what this means is basically we have a query plan,
something like this.
So this is something that we could actually, for example,
also have in an extreme right deep tree.
There we would basically, for each tuple of R, go through the whole rest. But in this case we're looking at the iterator model right now. So this means that for each of the operators that we have here in our query plan, we'll have a matching operator in our database system that we will then utilize, or that we will call, for each individual tuple basically.
And there is not necessarily always a one-to-one matching.
So say for example, here in this plan, we see that we're just inputting the S relation, right, and the R relation, but we need an operator for this. So we basically need a scan here. Then the selection could be a separate operator; often this is also already included in the scan operator. Then we have the join, and we have the projection in the end.
The projection often is already inside other operators.
So just so you know, because I mean, for a simple projection, we're just going to throw
away some of the attributes in the output.
In an extended projection, we might do some kind of additional computation.
So you remember this.
We could also do multiplications, additions, whatever in there.
And this is, say, a simple translation of this query. And here we already stated the join explicitly. We don't really need to do this; typically we could also just write R, S, which the database would still optimize, because it sees that we have a join condition here, so it will still produce this join. But we're not about query optimization, we're about query execution. So what happens here is basically
that we will have different kinds of operators. So this is very simple code, and this is basically what would be executed. We have our table, which ends up being a scan operator. So R would have a scan, or an interface that says hasNext: do we have a next, and if we have a next one, then let's emit this next one. Or we could just say: if next is not null, then emit next, which would be the same thing. The same is true of course for S, so we'll have the same code there.
Then we have the predicate, so the selection or filter, which works on the tuples that we get from the child. We'll basically find the next tuple for which the predicate matches, then we emit it, and then we wait for the next call.
Then our join could be, say for example, a hash join.
So in this case, our hash join would basically mean
we're first consuming the left side,
building a hash table on the left side,
and then consuming the right side tuple by tuple,
and if we find a match on the right side,
then we'll emit this tuple.
While consuming the left side, we cannot emit anything yet, because we don't know any join partners yet.
So this means basically in this kind of query plan,
we'll first go completely through the R table and then tuple by tuple through the S
table, even though the operator on top will just call for individual tuples here.
And then we have the projection.
So the projection will basically just emit the individual tuples, tuple by tuple, reduced to the attributes that we're interested in.
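Continuing that sketch, a hash join operator in this model might look roughly like the following; the column indexes are illustrative, and for brevity this version only returns the first match per probe tuple instead of all matches:

    #include <cstddef>
    #include <optional>
    #include <unordered_map>
    #include <vector>

    using Tuple = std::vector<int>;

    struct Operator {  // same open/next/close interface as in the sketch above
        virtual void open() = 0;
        virtual std::optional<Tuple> next() = 0;
        virtual void close() = 0;
        virtual ~Operator() = default;
    };

    struct HashJoin : Operator {
        Operator& left;        // build side (R)
        Operator& right;       // probe side (the selection over S)
        std::size_t leftKey, rightKey;
        std::unordered_multimap<int, Tuple> table;  // operator state: the hash table
        bool built = false;

        HashJoin(Operator& l, Operator& r, std::size_t lk, std::size_t rk)
            : left(l), right(r), leftKey(lk), rightKey(rk) {}

        void open() override { left.open(); right.open(); }

        std::optional<Tuple> next() override {
            if (!built) {                            // consume the whole build side first
                while (auto t = left.next())
                    table.emplace((*t)[leftKey], *t);
                built = true;
            }
            while (auto probe = right.next()) {      // then probe tuple by tuple
                auto it = table.find((*probe)[rightKey]);
                if (it != table.end()) {             // simplification: emit only the first match
                    Tuple out = it->second;
                    out.insert(out.end(), probe->begin(), probe->end());
                    return out;
                }
            }
            return std::nullopt;                     // both sides exhausted
        }

        void close() override { left.close(); right.close(); }
    };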
So how is this actually executed?
So in the first step, we'll have basically
the projection call next on its child.
And in this case, it's a join basically.
And the join operator will then call next on the left side
in order to build the hash table.
The left side is the R table, which basically will just read the table tuple by tuple, and the join operator, in step two, will then consume a single tuple. When it consumes a single tuple, it will put it in the hash table and ask for the next tuple, until we get no more tuples, so until the next tuple is basically null on this side. Okay, so this is kind of the first part. And then in the second part,
we're doing the same thing more or less on the right side. So on the right side, we'll call next on the selection. The selection will basically call the scan until it finds a tuple that matches the selection. So it will ask for a single tuple. The single tuple will be sent back to the selection.
If the predicate matches the selection, the tuple will go back further to the join.
If the selection doesn't match, then it will ask for the next tuple.
So, it will continue basically down here until it finds a matching tuple.
As soon as it finds a matching tuple,
it will emit this up to the join.
The join will basically check, does it match?
So do we find something in the hash table?
So we're probing the hash table with our tuple.
If we find a match, then we'll emit this match.
If we don't find a match, we're going to ask for the next tuple.
So then basically we're going to ask the selection.
The selection will ask the scan again.
If we match, we're going to continue.
If we don't match, we're going to iterate here until we go here.
And we continue as long as we get new tuples in here, until both sides are basically null, or until the join here produces null; then our query is basically done. So that's the last step, all the way to the top. And you can see this is kind of a lot of function calling.
So in every step, even if everything is good, right, if we have matching tuples right away, this still means on the left side we have three function calls plus building the hash table, etc.
On the right side we basically have four function calls,
if we have a match.
If we don't have a match, we're going to iterate further down there.
So, this is the iterator model,
and it's used almost in every database today,
at least in almost every traditional database.
So, classical versions, or let's say Postgres, of course,
MySQL, same story, classical versions of DB2,
Oracle, et cetera, would use this model.
We have this option for tuple pipelining.
So if we look here, while one operator is emitting a tuple, the next operator can basically already prepare its next tuple. So it can already say, well, I'm going to have the next tuple ready when I'm asked. I can do this independently; I don't have to wait for a call all the way from the top. So then I can get kind of a pipeline parallelism, but only in the order of the number of operators, or individual functions, that we have. But that's basically it.
And we still have a lot of code that we have to walk through.
And, I mean, of course queries can get quite complex, so we can probably get enough parallelism, but it's not going to be super efficient, because we always have this kind of communication between the different processes here.
And some operators must block until their children emit all their tuples.
So say, for example, a hash join will have to block until it has all of the tuples of the left side, the build side. Before it has all these tuples, if we're using a hash join, we cannot emit anything, unless we're using a double-pipelined hash join, which will use more memory again. Subqueries, sorting, grouping, etc. have the same problem. And then we have these
kinds of problems with caching, right? All operators run sort of in an interleaved fashion anyway, but especially if we're doing pipelining, we have all this code active at the same time.
And the code might actually be large because it's interpreted, right?
So this means like the operators will not be specialized for one data type.
They will be general for all kinds of queries that you can think of.
And this means this is a lot of code.
This might not necessarily fit into your instruction cache. So this means you get kind of
instruction cache misses, which makes your program again slow.
Then we have all this function call overhead. So I mean, this is a simple query, but we have a huge call stack for each individual tuple already here.
That costs a lot because we always have to switch between these different functions.
If we have many operators, say we do five joins in a row,
each of these joins will have to build their hash tables.
Different kinds of aggregations will need these intermediate results, so the combined
operator state might be huge.
That basically then means if it doesn't fit into cache, it's going to be slow again and
produce a lot of data cache misses.
So that's not good, basically.
And this is also why it's not great in main memory databases.
And these are kind of the things that we
want to tackle with other kind of execution models.
And so there are some examples and some experimentation. There's the X100 execution engine, which is based on MonetDB and came out of CWI, and which is a vectorized engine, and they of course tested all this stuff. MonetDB is a main memory database and originally used this iterator model with a column format, but then they basically came up with this vectorized execution, that's the X100 engine. And they, for example, tested: okay, what's the overhead, or where does the time go, in something like MySQL?
So, query 1 of the TPC-H benchmark,
so TPC-H is the OLAP benchmark, a simple OLAP benchmark,
and this is one of the simpler queries
where you basically have one table that we're just touching on. You have a complex
extended projection, as you can see. This is what I told you earlier, right? So, all this up here
is basically just aggregations and calculations over the single table. So, I mean, it's not just extended projection,
but it's also aggregations in here.
And we have a simple selection,
so we're just looking for a certain date and the grouping.
So this will produce some intermediate results.
There's no join in here.
There's calculation.
So we would assume that the system needs to materialize something, but if the system is fast, it should spend a lot of its time on the calculations, right? That's what's heavily used here.
And if we profile this, so this is taken from the paper basically, then you can see that every call of the database only processes a single tuple. The table has, at scale factor one, I think about 6 million tuples, so this is going to be millions of calls, and for each of those, you will have the individual calls through the whole database code.
And you can see that only 10% is spent on actual query tasks.
Everything else is kind of management around this.
This happens often.
But what you can also see is that you have a very low instructions-per-cycle ratio. We're below 1, meaning that per cycle not a lot gets done. The profile basically gives us the instructions per call and the instructions per cycle for these instructions, and if we divide the instructions per call by the instructions per cycle, we get the cycles per call. And with such a low instructions-per-cycle value, the system is basically mostly waiting.
So a lot of the time is basically spent on the field accesses, on going through the fields of the tuples. This is where a lot of the time goes.
Then the individual tuple functions, they are hard to optimize.
So there's not much to do per individual tuple, but there's a lot of code.
So the compiler cannot really do much.
So this also means that our pipelines on the CPU will probably be empty most of the time or many times.
And that makes the CPU basically stall.
So this is why we get such a low instructions-per-cycle count.
There is no possibility to optimize across functions, because these are very different code blocks that will be interpreted.
The function call overhead is very high.
And you cannot use any vectorization, or only in a very limited way,
because you're just going to look at individual tuples.
And most likely, your individual tuples
won't contain any arrays that you
could do your processing on.
And so, for example, this addition: the addition is basically the Item_func_plus::val function in here, so if you go back, this is somewhere down here, right, so this is the instruction. We have 0.8 instructions per cycle, so this is how many of these 38 instructions we will execute per cycle, and for each individual addition in here we need 38 instructions. So we're basically dividing the 38 by 0.8, that way around. And that means rather than just 38 cycles, we're going to have about 48 cycles (38 / 0.8 ≈ 48) that we have to walk through for a single addition in here. And that's just the addition, and
this is in contrast to three instructions if we just had load, add, and store in assembly. So for the kind of code that we looked at earlier, this would be three instructions, and it would to some degree be pipelined, so we would probably get an instructions per cycle above one, right? And we have multiple units at a time, so we would expect an instructions per cycle of two to three, something like that.
So this should be much faster. One problem that we have is we don't have any loop pipelining and we have dependent instructions.
So this basically takes a lot of time and we have these call overheads.
So calling these individual, different functions, that's an overhead of again 20 cycles that cannot be amortized.
So basically we have to load new code,
we have to go to different parts of the code,
which just costs a lot of time and a lot of cycles in here.
So one way to go against this: now we know, right, that individual tuples are bad. And so then all of a sudden people said, well, individual tuples are bad, so let's do all of the tuples at a time.
So that's the materialization model, right?
So rather than doing one tuple,
we're going to do all tuples per operator and then give all of the results basically to the next operator.
And that means each operator runs exactly once,
at least each instance of an operator.
So, say we're scanning one table,
we're going to do the complete scan and then do the next operator.
Of course, we're not going to just do a scan and load the data, because that maybe doesn't make much sense on its own. We're going to do, say for example, a selection, and then have the selection result as the next step, and that sub-result will be fully materialized, right? And there is no pipelining, because we're going to do one operator at a time, but we can do intra-operator parallelism. So we can parallelize the operator itself.
And so say, for example, here,
same kind of idea.
If we look at the same example,
so here I removed the animation,
or I did not add the animation,
let me put it this way.
So rather than saying I'm going to call for each individual tuple the next function,
I'm going to call something like an output function for my child. So I'm going to say, please, child operator, give me your complete output.
So this is basically what I'm asking for. This is done for the projection, and then the same is done in the join.
The join will need both inputs.
The join will basically ask the scan, please give me all your output.
The scan will basically read all of the tuples and produce an output in memory that the join operator can then consume. The join operator will consume the complete R table, produce the build side, then ask on the other side, ask the selection: selection, please give me your complete output. The selection will then basically,
that's already in the next slide,
the selection will basically ask the scan again,
the scan will produce the complete output,
like basically load the table,
the selection will filter the whole table,
send the filtered results, give the filtered results, to the join. The join will consume the whole filtered right side and probe the hash table with it, produce the join result, and then the join result will basically be projected with the projection.
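A minimal sketch of this operator-at-a-time style, assuming a simple columnar layout (names and types are illustrative): each function consumes its complete input and returns a fully materialized result.

    #include <cstddef>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Column { std::vector<int> data; };   // one attribute, fully in memory

    // Selection: whole input column in, fully materialized list of qualifying row ids out.
    std::vector<int> select_greater(const Column& col, int threshold) {
        std::vector<int> ids;
        for (std::size_t i = 0; i < col.data.size(); ++i)
            if (col.data[i] > threshold) ids.push_back((int)i);
        return ids;                              // intermediate result, fully materialized
    }

    // Hash join: build on the complete R key column, probe with the complete
    // (already filtered) S key column, materialize all matching row-id pairs.
    std::vector<std::pair<int, int>> hash_join(const Column& rKeys, const Column& sKeys,
                                               const std::vector<int>& sIds) {
        std::unordered_multimap<int, int> build;
        for (std::size_t i = 0; i < rKeys.data.size(); ++i)
            build.emplace(rKeys.data[i], (int)i);
        std::vector<std::pair<int, int>> out;
        for (int sid : sIds) {
            auto range = build.equal_range(sKeys.data[(std::size_t)sid]);
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(it->second, sid);
        }
        return out;                              // again a fully materialized intermediate
    }

Every one of these returned vectors is a fully materialized intermediate, which is exactly where the memory problem discussed below comes from.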
As I said, in a practical implementation, most likely some of this will be merged.
This, of course, makes the code even bigger.
If you have the projection in every operator, for example,
and you have some filtering condition stuff in every operator,
probably you cannot merge everything,
but some stuff you will do right away.
So while you're scanning through the data,
when you're already touching the data,
some basic filtering optimizations you can do right away.
Or if you can do a projection pushdown,
you will do the projection pushdown right away
because you don't want to read everything.
And as I said, this is mainly a model for column stores.
Column stores store the individual attributes.
So you will only read the attributes that you need
and not the full table, which basically is already
like using the projection in the first place. Okay, so how is this good?
Well, we have much fewer function calls.
So we basically don't pay all of this function call overhead.
We have function call overhead, but that's very limited or it's amortized by processing
so many tuples at a time.
The compiler can better optimize the code
because the loops are actually tight
and we're looping over many tuples at a time.
So we can use loop unrolling,
we can use vectorization
and the compiler can do this also for us.
So, I mean, remember the small programs that we used.
If you look at the compiled assembly,
you will always basically see some kind of loop unrolling
if you have a for loop that does something.
You will see some auto vectorization.
And it can, like in general, use some modern CPU features.
It can do some prefetching if it knows
we're reading this from memory, right?
Rather than getting this from another function.
So then, okay, when reading from memory and from cache, we have all this logic for prefetching, etc.
So the materialization is kind of cool, much more efficient in a way, but we have the problem that we might get these huge intermediate results.
Especially, I mean think about a query that
produces a Cartesian product.
If we're materializing a Cartesian product,
then probably our memory will explode.
And in some queries, you cannot really optimize this away
very well.
So then doing a pure materialization model
will basically break your system.
Because then all of a sudden, you have to page out
your memory.
So this will be slow.
So the question is, can we aim for something
in the middle ground?
Can we aim for something in between tuple-at-a-time and operator-at-a-time?
And that's where the vectorization comes.
And this we'll do after the break.
A five minute break.
Do we have questions so far?
Yes?
I can imagine that another problem of the materialized model is the handling of limit
clauses, maybe?
Let's say we have a limit 10 clause in the end?
It's going to, I mean, the question is,
can we use limit clauses in materialization?
We cannot optimize for limit clauses exactly.
So what will happen is basically that we're producing intermediate results that we don't need.
So we'll produce, I mean, to some degree, of course,
the database can push down some of the limit clauses.
So then we can produce smaller intermediate results.
But like in a pure, if you think about a pure materialization
model, you're absolutely right.
We're going to produce a huge result that we don't need.
In the iterator model, we're just
going to stop asking for more as soon as we have reached our limit.
Completely correct.
Other questions?
Okay, good.
Then let's do a five-minute break.
Vectorization model.
So the idea is, and I already alluded to it to some degree: rather than doing the full materialization, and also rather than doing a single tuple at a time, let's still use the iterator model. This Volcano-style model is a nice and clean way of representing a query and structuring your system.
But instead of for each individual call to next, let's not do a single tuple, let's do
a batch of tuples.
This is called vectorization in this MonetDB/X100 terminology. And in this case, in the internal loop, we're not just going to process a single tuple, say in the scan or in the selection, but we'll have a large batch of tuples.
And this means, or this batch, of course,
can be configurable, but then we have
more to do for the processor.
So the processor can actually do the same kind of operation,
the same kind of loop on many tuples at a time.
And so it's the same idea basically: you can see that the basic layout of the plan is very similar to the initial layout of the iterator model, with the exception that in each step we have to handle a larger batch of tuples, with a configurable batch size.
So basically our projection will call the join and ask the join to give us a batch of tuples. And then the join will basically ask the scan: please give me a batch of tuples, a vector, that I can process. This vector will then go back to the join, and the join will build the hash table with a larger batch, so it can do all this hashing one after the other. The code will be more optimized; we're not switching to completely different code paths, but do all of the processing in a small loop, so that the compiler can help you and the CPU can work more efficiently, because it can use the same kind of operation over and over again. So we still have the function call overhead, but we're amortizing this over many tuples at a time. Then the same thing of course happens on the other side.
So once we've built the hash table we'll probe the hash table. For this we're going to ask the
selection to give us a batch of tuples. The selection will ask the scan for a batch of tuples, and the selection will basically keep asking for batches until it can produce a complete batch that it will send to the join. And the same goes for the join: the join will produce join results until the batch is large enough that it can send it to the projection.
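A minimal sketch of what such a batched next() could look like, assuming a single-attribute batch and an illustrative batch size of 1024:

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kBatchSize = 1024;     // typical vector sizes are in the thousands

    struct Batch {
        std::vector<int> values;                 // one attribute, columnar within the batch
    };

    struct VecOperator {
        // Returns an empty batch once the input is exhausted.
        virtual Batch next() = 0;
        virtual ~VecOperator() = default;
    };

    struct VecSelection : VecOperator {
        VecOperator& child;
        int threshold;
        VecSelection(VecOperator& c, int t) : child(c), threshold(t) {}
        Batch next() override {
            Batch out;
            while (out.values.size() < kBatchSize) {
                Batch in = child.next();
                if (in.values.empty()) break;    // child exhausted
                // One tight loop over the whole batch: easy for the compiler to
                // unroll and auto-vectorize; the call overhead is amortized.
                for (int v : in.values)
                    if (v > threshold) out.values.push_back(v);
            }
            return out;                          // may slightly overshoot the target size
        }
    };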
And so it kind of uses the best of both worlds. We have the iterator and the materialization model included.
So we have a lower number of invocations per operator. We're not invoking this for each tuple, but say for a thousand tuples or something like that.
We can use SIMD instructions because we're going to operate on multiple tuples at a time.
We can do the same kind of processing that you're doing right now
because we will look at, say, a thousand tuples at a time,
meaning we'll easily fill up our SIMD registers and do some processing there.
And we'll do this over multiple loop iterations. So it also means we don't do this just for a single instruction, but over many iterations we do the same kind of SIMD instructions.
So the code will be ready; we're not going to have this cache miss problem in the code, for example. So basically we want a vector size that's large enough to amortize the iteration overhead, so all of these function calls and operator overheads, but small enough to not completely thrash the data caches.
So we don't want to basically work on a data set where we're going out of the caches all of the time.
And so here, for example, you can see in this X100 on an older machine, so an Athlon and an Itanium,
so probably today the vector sizes could even be larger with larger caches, but you can
see that with vector sizes in the thousands, like 1,000 or 4,000, something like that, you get the optimum performance, up to 16,000 or so.
So today, probably this would be the range
for TPC-H query one, in this case.
So the one that you saw earlier already.
So this kind of means that this many tuples fit into our cache; if we go larger than that, then our tuples will basically always have to go to memory again.
We're going to run out of the cache and then we're going to lose some of the performance.
And you can see this is a logarithmic scale, right? So this also means, from a single tuple, we're actually getting almost two orders of magnitude
performance improvement down here.
So factor of 100.
That's also why the system is called X100, because it's 100 times faster than the original
system just by using these vectors.
So if we compare these three models,
the iterator model, the materialization model, and the vectorization model, then we can see that in the query plans for the iterator model
and the vectorization model, they're basically very simple, right? So we're just
stitching these operators together. We don't have to do much else. They just
call each other. And so that's actually a good thing. In the materialization model, the per-operator query plans get more complex, because we're basically going to produce the complete output and we will have to stack the operators next to each other. The instruction cache behavior, which was one of the big problems for the iterator model, where of course it's poor, is great in the materialization model, because our operators will be small and they
will hopefully fit into the instruction cache.
At least much more of the operator will fit into the instruction cache than for the iterator
model.
So, it's going to be great.
In the vectorization model, we'll have these function call overheads.
So we're not going to be as good as in the materialization, but we're amortizing this
over the vectors.
So there we'll still have a good instruction cache utilization. Because for a single batch of tuples, the instruction cache utilization will be good.
When switching from operator to operator, we'll have some instruction cache misses, and the reduced performance in comparison to the materialization will basically be proportional to the number of batches that we're processing.
But as we said, we're going to try to optimize this here.
We have many function calls in the iterator, but very few in the materialization and a few more, of course, again, in the number of batches
in the vectorization.
And, well, yeah, so then the attribute access is complex in the iterator model, while in
the materialization model, we can directly access each of the attributes, basically. And these are mainly the two reasons why we're looking at this, because this is basically what we're looking for on modern hardware. So what do we spend our time on? In the iterator model, we're mostly spending our time on this interpretation.
So we have to basically load the code, have to basically configure the code, or let's
say work on a single tuple.
So this is basically where time goes and we're not doing useful work.
In the materialization and the vectorization, most of the time is spent on the processing.
Materialization, even more.
And this means we have good CPU utilization, both in the vectorization and the materialization.
In the materialization, we might have caching problems
because our intermediate results can be large.
So we will probably go to memory.
All of a sudden, we're basically spending more time on memory loads and stores because of the large intermediate results, and then we're not utilizing the CPU as well anymore as if we had the proper vectorization size.
So if we have a good vector size, then we can still keep everything in the caches.
And then we're just going to utilize the CPU ideally, basically.
And well, then, well, compiler optimizations,
we can do this in materialization and vectorization.
We have expensive materialization overhead
for the materialization, of course.
And this also limits our scalability,
which is better in the iterator and vectorization.
Okay.
So now we have kind of this span of how to deal with intermediate results, right?
And how to deal with batches of tuples.
So between the iterator and the materialization model, with the vector size we can kind of move in between.
We can move all the way to materialization if we have very large vectors.
We can move all the way to iterators
if we have vectors of size 1.
Now something different, a completely different way of processing data, kind of another dimension of how we do the query execution, is code generation.
And in the vectorization, we already heard,
we're reducing this function call overhead.
And this is one of the things that's really killing us in the iterator model. Of course, the individual cache accesses, etc., hurt as well, but having a huge amount of code that doesn't fit into our cache and jumping back and forth in the code through these function calls, this is what really makes it slow.
And in the vectorization, we still have this function call overhead.
We're just reducing it and amortizing it
by not calling the functions as often.
So we can basically iterate over some data
in a certain function,
but we still jump in between different functions.
So we still do this overhead.
And something else that we can do is rather than having these functions and this interpreted code, we can actually directly emit
code that does exactly what we want. So looking back at this MySQL example, where we said there's a very low IPC because there
are about 48 cycles that need to be spent for a simple addition, we can just
say, well, let's just produce the assembly that does exactly this addition and then just
execute this assembly.
This means we basically try to keep everything in the registers, we try to just
generate code that is as small as possible and as specialized as possible and as optimized as
possible for exactly this kind of processing. This can be single threaded,
but of course it can also be parallelized.
We can still use SIMD instructions. We can still use multi-core if we want. We just need to produce code
that does exactly what we want there
and then let the compiler do its magic
and do the kind of parallelization.
And so this was pioneered in Hyper.
So that's a system that was built at the TU Munich,
was later on sold to Tableau.
And so the idea here was that they saw, okay, all this function call overhead is super expensive.
We have servers with many cores and large amounts of memory
where potentially everything fits in there.
So let's just basically fuse everything that needs to be done within the operators.
Let's fuse as many operators as we can into a single pipeline,
generate code for that, and just emit this code and run this code.
And the pipelines basically go all the way until we have a pipeline breaker.
And as I said earlier already, whenever we have a hash table, for example, we have to materialize something.
This is a pipeline breaker.
So this is basically where we'll stop.
But everything else will be merged. And then we can basically produce this code
and produce machine code for this.
Rather than having operators that we stitch together
that need to be interpreted,
we actually emit something like C++ code
that then we compile and get super fast executables.
And rather than having this iterator model where we basically pull data, we can also
push the data.
So we can basically say, well, let's use blocks of data that will be pushed to the next operator, from one pipeline breaker to the next, basically. So this means, if we process this part here, for example,
then this pipeline will basically read blocks of the data and push it to the join.
And then the join will start processing at some point.
So what does this look like?
So basically, we'll actually create something like this, right?
So we'll really create small code blocks that say: for each tuple in R, if it matches, then we'll have another loop; for each tuple in R2, if it matches, then for each tuple in the aggregation, and so on. This is basically what comes from here. So we'll have these tight loops, we'll just produce these small kinds of loops
and let the compiler optimize those.
The code blocks, so this is kind of the tricky part,
the code blocks don't really match the operator boundaries.
So, in the iterator model, you remember this is super simple, right?
So, we basically have this direct translation.
We have a join here, we're going to have a join operator there. Here we have to kind of split up
the build and the probe part in the join operator because the build basically is a pipeline breaker.
So we can basically push data into the build until the hash table is built and then start the other pipeline.
So say for example, here we'll build a hash table. This pipeline will basically build the hash table for the grouping and for the aggregation. This pipeline will then build the hash table for this join.
And once we have these two hash tables ready,
then we can execute this pipeline
without any other problems.
So basically, then we'll just push the data through here,
the additional data that comes from R3.
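As a rough illustration, the generated code for a plan like the earlier R-join-S example could boil down to something like this (names are invented; a real system would emit this per query, as C++ or LLVM IR, and compile it just in time):

    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Row { int id; int value; };

    void run_query(const std::vector<Row>& R, const std::vector<Row>& S,
                   std::vector<std::pair<int, int>>& result) {
        // Pipeline 1 (ends in a pipeline breaker): scan R and build the join hash table.
        std::unordered_multimap<int, int> ht;
        for (const Row& r : R)
            ht.emplace(r.id, r.value);

        // Pipeline 2: scan S, filter, probe, project -- fused into one tight loop,
        // with no operator boundaries and no per-tuple function calls left.
        for (const Row& s : S) {
            if (s.value <= 100) continue;                 // selection
            auto range = ht.equal_range(s.id);            // probe
            for (auto it = range.first; it != range.second; ++it)
                result.emplace_back(it->second, s.value); // projection
        }
    }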
In HyPer, it's not C++, but actually LLVM code, so low-level virtual machine code, that's generated. This is something that's inside Clang, let's say an internal representation that they use. It's sort of assembly, but it's somewhat platform independent, and it can then be just-in-time compiled to executable code.
And there's kind of different ways how you can then also execute or how you can compile this.
So then basically you could say, OK, let's
use all the optimizations that the compiler has,
so multiple passes, et cetera.
Or you can say, oh, I'm just going
to do a single pass compilation.
So: give me the compiled code as fast as you can, based on this code.
And the system, the database system,
basically already emits this intermediate representation.
It doesn't emit like these for loops,
but it emits this kind of assembly style representation.
There's also other systems.
Whoops, somewhere I jumped in the wrong direction.
So this is kind of a mix of generated code and some C++ libraries.
So stuff like memory management, et cetera,
this will be in the libraries.
But this means everything that's in the libraries, like the memory management, etc., will only be compiled once, while the individual query will be compiled whenever you want to execute it.
And this will be very small code, a very small executable.
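And just to illustrate the generation side itself, a very simplified sketch of a generator that walks one plan node and emits specialized source text; HyPer emits LLVM IR rather than C++ text, and the plan node and the emitted snippet here are purely illustrative:

    #include <string>

    struct FilterNode { std::string column; int threshold; };  // one node of a toy plan

    // Emit specialized C++ source text for the probe pipeline of the example query.
    std::string emit_probe_pipeline(const FilterNode& f) {
        std::string code;
        code += "for (const Row& s : S) {\n";
        code += "  if (s." + f.column + " <= " + std::to_string(f.threshold) + ") continue;\n";
        code += "  auto range = ht.equal_range(s.id);\n";
        code += "  for (auto it = range.first; it != range.second; ++it)\n";
        code += "    result.emplace_back(it->second, s." + f.column + ");\n";
        code += "}\n";
        return code;   // handed to a compiler (or emitted as LLVM IR) before the query runs
    }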
And here, for example, this is an example where vector-wise and hyper are compared.
And you can see that for the TPC-H queries, so you remember TPC-H query 1, you need 13 milliseconds to actually compile, so going from the SQL text to LLVM and then from LLVM to an executable, that's 13 milliseconds. And the code is quite small: this is a few kilobytes, basically, versus a code base of multiple megabytes to gigabytes in something like Vectorwise, and Vectorwise is basically MonetDB/X100.
So this means you actually have
very small code that will fit into your instruction caches.
And then basically the runtime will be fast, and it will actually be faster than something like Vectorwise in many cases.
Not in all cases, but in many cases.
And we'll see when it doesn't fully work in a bit.
But in general, you can see for these kind
of OLAP style queries,
the compilation time is actually negligible, right?
So it's very low, so in the milliseconds, while we have seconds of execution, it doesn't
really matter. However, if we're looking at an OLTP type, so very small transactional
queries, then this might actually be costly already. So if our queries are in the milliseconds in execution, then paying for the compilation in milliseconds
will be problematic.
And this we'll also see,
so this is also something
that people already started working on.
Okay, but you can see on average, we're faster here.
Hyper is not the only system.
There are many different ones. Once this came up, HyPer was the first system, but then many other systems noticed that this actually makes sense: let's do some code generation in our system as well. Microsoft has a special engine. Cloudera has an engine. Postgres added some JIT compilation support to their system in order to get this faster execution and not have all this interpretation overhead.
There are different kinds of strategies I'm also listing here, where the Munich guys compared at which stages and at which levels in the compiler you can actually produce code.
And I mean, very simple, of course, you always need your SQL parser,
right, so the compiler doesn't understand SQL,
so you need to translate this to some kind
of code. You also want to
do the query optimization, because that's also
something that the compiler doesn't do.
But then, you could
directly produce something
like C++.
Then, if you have
C++, this will go all the way through the compiler front end. The compiler does all the parsing, etc. again. So then you need to basically go through the
intermediate code generation, code optimizer, and then code generator.
If you're going the LLVM way, so this is this low-level machine code, low-level virtual machine code.
Basically, this would go then into the code optimizer.
This is this general purpose IR.
You can also basically have your own code optimizer.
You can basically also say, I'm doing my own bytecode representation.
And then I could either interpret this in my own virtual machine,
or I can actually also emit code myself.
So, why not just produce the code in the end myself, and basically have the complete compiler in my database system?
And this is basically where people and the Munich guys to some degree are actually going right now.
So having these different kind of strategies.
So on the one end, you can produce source code in something like Scala, C, or C++. And the newer system, Umbra, can produce all of these different things: it can produce C++, it can produce LLVM IR, and it can directly emit machine code.
And there are different trade-offs. On the one hand, there's latency: the higher up in the compiler stack you enter with your code, the higher the compilation latency, because the more the compiler needs to do, the more the code generation will cost you. On the other hand, the more optimization the compiler does, so the more time you give it for compilation, the lower the execution latency of your code will typically be. If the compiler does a lot of optimization, multiple passes, it can find better ways to execute the code, and the code will probably be faster.
And then if you're emitting your own code, well, this is probably the fastest path, so you can get to a binary very quickly. You have a slimmed-down compiler that's faster than a regular compiler, but it's probably highly platform-specific, or you need to build a complete compiler yourself again. So this means you're paying in flexibility. There's also interesting work from Saarland University, where they built a compilation system that compiles to WebAssembly. That's also a strategy you could take. WebAssembly is basically a virtual machine format that runs in essentially every browser these days and on all kinds of platforms.
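Going back to the "produce C++ source" strategy for a moment, here is a minimal, hand-wavy sketch, not HyPer's or Umbra's actual code generator: a plan operator appends C++ text for a scan-filter pipeline, and that text would then be handed to a regular compiler (or you would emit LLVM IR or bytecode instead). All names here are made up for illustration.

#include <iostream>
#include <string>

// Toy code generator: plan operators append C++ text that would later be
// handed to a regular compiler. The operator set and the helper names
// (generate_pipeline, consume) are assumptions for illustration only.
struct Codegen {
    std::string code;
};

// Generates one tight loop for a scan -> filter pipeline; the parent
// operator's code would be spliced in where consume(row) is called.
void generate_pipeline(Codegen& cg, const std::string& table,
                       const std::string& predicate) {
    cg.code += "for (size_t i = 0; i < " + table + ".size(); ++i) {\n";
    cg.code += "  auto& row = " + table + "[i];\n";
    cg.code += "  if (!(" + predicate + ")) continue;  // selection\n";
    cg.code += "  consume(row);                        // parent operator\n";
    cg.code += "}\n";
}

int main() {
    Codegen cg;
    generate_pipeline(cg, "lineitem", "row.quantity < 10");
    std::cout << cg.code;  // this text would go to clang/gcc, or you emit LLVM IR instead
}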
Okay, so briefly, just as an overview: this is from a recent paper that is going to be presented this year. Here they test, on the one hand, interpreted code, something like a bytecode; then a direct code emitter, so basically producing assembly right away in the system itself, for which they built two backends that can produce code for x86 and for ARM architectures; then LLVM, so this low-level representation, optimized and unoptimized, where optimized means multiple passes of optimization; and finally C++ code.
And what we can see, the chart is a bit hard to read, is the following. This axis is the throughput that we get, so the higher, the better. This axis is how long it takes to compile, so the compilation latency. And we can see that interpreting the whole thing is actually the fastest to start, right? We don't need to do any compilation, we just directly execute the query. However, the throughput is low, because we execute much more code and the code is not as optimized. Then there is the directly emitted code, so a backend that produces binaries directly from the queries, and you can see the compilation latency there is also very low. This is, I think, a logarithmic scale down here.
Then, handing the intermediate representation to the compiler without optimizations is faster in terms of compilation, but of course worse in terms of how fast the resulting binaries will be. If you do more optimization, your queries will in the end be faster. And if you're using C++ code, you get, and this really depends on the architecture, roughly the same kind of performance as with the LLVM approach, while paying even more, because the compilation is just much more expensive in this case.
Right, and here you can look up the paper.
I just put it here for reference.
You can see they do it for all kinds of different architectures. Yes?
In the table you showed earlier, the compilation was only a fraction of the execution though. Is this then aimed at OLTP?
Yes, this is also aimed at OLTP kinds of workloads. This is why people, also the Munich group, started to work on this: to address situations where the compilation time is already a problem. Or if your data sets are small and your queries are fast, the compilation time might already be too much.
So here you can see that with C++ we're already somewhere close to the second range, for example, for simple types of queries. I think this is TPC-H; I don't remember exactly which queries, whether it's all of them, etc., but you can check out the paper for the details.
OK, so very quickly, or are there more questions? Then, very quickly, I just want to show you some results where, let's say, the two camps came together and compared VectorWise and HyPer in a single code base. Because if you compare two separate systems, it's also about who's the better programmer, right? If you're the better programmer, then maybe your vectorization strategy always beats my code generation strategy, just because you know all the tricks. So that's why they built two systems, Typer and Tectorwise, which use the exact same code base to really compare the two approaches. They used a subset of the TPC-H queries, and there's a nice paper called "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask".
So it's a good read, actually.
And if you look at these different queries and compare, you will see, well, there is no clear winner. Sometimes the compilation is faster, sometimes the vectorization is faster, and you need to do an in-depth analysis. Sometimes they're more or less the same, and sometimes there can be significant differences. You can also see that for scale factor one on modern CPUs, the queries are already in the millisecond range, so there even the compilation will already be significant. Modern CPUs can actually be quite fast on scale factor one.
So, for example, in query 1, Tectorwise, so the vectorization approach, is much slower than the code generation approach. And the reason is that the vectorization approach executes many more instructions. You remember this query had all these aggregations and so on, so the operators that need to be used are much more complex than the code coming out of the code generator, because that will be quite tight: we have a single table, a single loop that we iterate over, and we don't really need that many intermediate results. In the code-generated version, the data can actually stay in registers, so we get very fast execution over the data and very few cache misses.
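To give a feel for why the generated code for a Q1-style query is so friendly to the CPU, here is a rough sketch, not the real TPC-H Q1 and not HyPer's actual output, of a fused scan-plus-grouped-aggregation loop where the grouping key has only a handful of distinct values; the column names and the key domain of four groups are assumptions.

#include <array>
#include <cstdint>
#include <vector>

// Hypothetical column layout; names are made up, not the real lineitem schema.
struct LineitemLike {
    std::vector<uint8_t> group;     // tiny key domain, e.g. values 0..3
    std::vector<int64_t> quantity;
    std::vector<int64_t> price;
};

// Sketch of the kind of fused loop a code generator could emit for a
// Q1-style grouped aggregation: one pass over the table, all aggregates
// fit into a few registers / one or two cache lines, so there are very
// few cache misses and no materialized intermediate results.
void q1_like_compiled(const LineitemLike& t,
                      std::array<int64_t, 4>& sum_qty,
                      std::array<int64_t, 4>& sum_price) {
    for (size_t i = 0; i < t.group.size(); ++i) {
        const uint8_t g = t.group[i];
        sum_qty[g]   += t.quantity[i];
        sum_price[g] += t.price[i];
    }
}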
In contrast, for example, in Q3 and Q9, it's mainly about hash table probing. And there, all of a sudden, Tectorwise is better at hiding these cache miss latencies, because it can operate on a whole batch and then do the other operations. The complex loops that are generated for these queries produce more memory stalls and branch misses. Because we have one very complex loop, it's more problematic to execute than if we're just using batches of tuples in an operator-at-a-time style, because then the code executed per batch can be more efficient, or will at least be smaller and fit better into the caches.
So, well, we can briefly go through this comparison. If we compare compilation to vectorization: with compilation we typically get more efficient code, because the generated code is very small and we can ideally keep the data we're working on, the individual tuples, in registers, so we don't have to go out to the caches or even to memory. The vectorized engine, on the other hand, still has these large operators that need to be read in, and it has to materialize intermediate results between the operators, so we cannot run the whole pipeline on a single tuple with vectorization.
For parallel data access, without extra work in the code generation, compilation is not as good at hiding these memory latencies as vectorization is. Of course, you could build vectorization into the code generation, so you can also generate code that is vectorized, but with pure code generation you're going to get less performant memory access.
In both, you can do SIMD vectorization. It's a bit more complex with compilation, because you have to generate the SIMD code properly, while for the vectorized engine this is basically straightforward, right? You have a bunch of vectors and you run one operator on them, which is the way we've done SIMD in this lecture.
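And just as a reminder of why SIMD is so natural for the vectorized style, here is a small sketch of a primitive that adds two int32 columns of one batch with AVX2 intrinsics; it assumes an x86 CPU with AVX2, and a code-generating engine would have to emit the equivalent of this itself.

#include <immintrin.h>   // AVX2 intrinsics; assumes an x86 CPU with AVX2
#include <cstddef>
#include <cstdint>

// Vectorized primitive: element-wise addition of two int32 columns for one
// batch. The primitive works on plain arrays, so applying SIMD is direct.
void add_columns(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {          // 8 x 32-bit lanes per 256-bit register
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i) {                  // scalar tail for the remainder
        out[i] = a[i] + b[i];
    }
}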
We can do parallelization in both. For OLTP setups, generating code can produce very fast stored procedures, because they're going to be very small and we only have to compile them once if we have a stored procedure. If we have to compile every query individually, then the compilation might actually bite us.
Regarding language support for compilation-based engines: if we target an intermediate representation like LLVM IR, we're in good shape, because we get portable code for everything the compiler already supports. If we're emitting machine code ourselves, we have to build a new backend for every individual CPU architecture, and we also have to optimize for every individual architecture, which the compiler would already do for us. And the same is kind of true for the vectorized engine: we have to adapt it to the architecture to make sure the vectorization works well.
And in general, profiling is more complex with code generation. Of course we can profile each individual query, but what if we have a new query? It's much harder to reason about generated code than about operators that are completely there, right? If I have a complete operator, I know what it's doing, and I can step through it with the profiler. If I'm generating code, I have to generate different kinds of queries and see what's happening in there, and I don't know if I generate all the queries that I actually need. The same kind of problem applies to debugging. For adaptivity, both systems can basically be made adaptive.
And with that, that's basically it for today, almost in time. We talked about these different kinds of execution models, and as a takeaway, the two models that you want to go with today in modern engines are either code generation or vectorization, because there you can actually really utilize the CPU. You can also mix the two in one way or the other, but this is really what you want to do. There's still a lot of research being done on how to ideally utilize the CPU, but for sure we know that the traditional way of structuring individual operators and processing one tuple at a time is not the ideal way. Okay, with that, thank you very much. Questions? No questions? Okay, so then next time we'll talk about data structures. Thanks and see you next week!