Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Query Execution Models
Episode Date: May 17, 2023...
Transcript
Okay, so let's get started. Welcome everybody to today's session. We're going to talk about query execution models. And before we do so: I told you last time, and I'm going to tell you again this time, and that's going to be the last time. If you're interested in a student position, working on stuff like this, working on lectures, etc., meet me after the lecture, send me an email, or chat with somebody in the group.
We're always looking for people, so that would be great.
Where are we right now? We're in execution models.
So we finished up vectorized execution yesterday and today we're going to look at how we actually can execute
complete queries. So there's different ways how we execute them and so we're going to look at a range
of different approaches and then on two approaches more specifically that are working well with modern hardware.
And this is still kind of single threaded, can be multi-threaded, can be multi-socket, etc.
But we're not going out there that far.
So we're still in memory, etc.
And then we're going to continue with data structures. And after the data structures, next week we're going to have this profiling session where we talk about how you can profile your code in a bit more detail.
So same thing, still all the same.
Also on the sixth, we want to do a Q&A
and a bit of a feedback session.
We'll probably send out a questionnaire soon.
So this is also going to be, of course, for the tasks.
So you can ask questions.
You can ask questions for the lecture.
But also we want to get some feedback.
This is still a fresh course.
And we're midway, let's say.
So if you have some feedback on how we can change things and what is good, we'll be happy to hear it. Otherwise, no news today; we should be able to finish this up today. We're going to talk about the different kinds of models. First I'm going to give you an overview and then step through the different models, basically in increasing complexity and also in increasing efficiency, though there are some trade-offs.
And most of this today is based on Jana Giceva's lecture on data processing on modern hardware.
So that's a TU Munich lecture.
So just so you know, I mean, if you look for her slides, you're going to find some resemblance there.
Okay, so query processing.
So far, we've looked at small code snippets, individual small stuff.
So how do we put stuff in memory?
How do we write code more efficiently?
Things like that.
Now we want to look at the complete query.
And the question basically is: if we have a query like this, say a four- or five-way join, so five tables that we want to join, and we have selections in there, maybe aggregations, groupings, stuff like this, how will we execute this?
How do we process the individual tuples in there?
There's many different ways to do so.
And the processing model, typically a database has a single processing model.
The processing model will define how this is done. So when to read which table,
what to do with the intermediate results, etc. So that's the core. It's not so much about what kind of operators we have, which is of course also specific to a database, but today we want to look at, given certain kinds of operators, how we invoke them, basically.
And the most classical approach, I think I told you about this already, is the so-called iterator model, also known as the Volcano model, named after an early execution engine, or the tuple-at-a-time model. Meaning, we're using each operator like an iterator: each operator has an iterator interface, and we're invoking this for every single tuple, basically.
And for main memory database management systems, this typically has too high an overhead. One problem is basically that we have a per-tuple overhead, and we have all this interpretation overhead. So for each tuple, we basically have to go through the whole code base of all of the individual operators. These might be very complex because you have many different types, etc., and that makes the processing very expensive. I told you in the past this was not so much of a problem, because the database was always limited by the disk, and the disk basically has a bigger granularity than per tuple: we would have larger pages, which contain many tuples. And then it really doesn't matter, right? We're retrieving whole pages, and then we can iterate over those tuple by tuple, because the overhead is really just shipping these pages in and out. Now, if everything's in memory,
that's a different story. All of a sudden, executing the
code makes a difference. And this is where the iterator model just becomes too slow.
And for main memory databases, there's basically three typical models that you can use. It's the
materialization model, which means you're executing each operator individually. So,
one operator at a time, and
you're materializing the whole intermediate results. And this could, for example, be also
if you have a column store, you're basically executing column by column. So a complete
column at a time. Then there is the so-called vectorization model that's different from SIMD vectorization.
It's kind of, I mean, you can use SIMD using the vectorization model, but it's a different kind of vectorization.
The idea is that you, rather than processing individual tuples, you're processing vectors of tuples.
These vectors can then internally be processed with SIMD instructions, but the idea is you're basically using the same kind of technique: rather than individual tuples, you're doing multiple tuples at a time, and here you can amortize these overheads better. So all of this individual interpretation, like calling the individual operators, getting the code, etc., you can amortize if you're running this on multiple tuples at a time.
And then the final one, and let's say there's always a struggle between the two,
is the code generation.
So there are different, let's say, camps among database researchers. Some say vectorization is the way to go. Some say code generation is the way to go. We'll also see a comparison towards the end. Of course, they also talk to each other and do research together. And there are basically trade-offs between the two. Sometimes vectorization is better, sometimes code generation is better. Code generation basically means that, rather than having operators that you're interpreting, you're emitting code that is specific to a query and that will execute the query most efficiently.
And there's different levels. We'll also see this towards the end of the lecture.
So let me give you an overview of the different processing models, so you basically have an idea of what we're going to go through today. We're going to start with the Volcano model, also called the tuple-at-a-time model or the iterator model, so these three names. And it basically uses pipeline parallelism.
So that's what we can use here.
Because we have a pipeline of operators, we can basically have multiple operators active at a time, but each operator will just execute on a single tuple at a time.
So that's basically all the parallelism
that we get in there.
And this is really used for disk-based database management systems.
And because most of the traditional database management systems were disk-based,
a lot of these systems still use this.
So Postgres, for example, uses exactly this model.
So you have multiple operators: it could be, say, your projection on top, then your join, then your selection, something like this, and a scan at the bottom. Then the projection will basically call the join, that will call the selection, that will call the scan, for each individual tuple. So you basically have a call stack, and then the scan will produce the next tuple. This goes up to the next operator. If the operator cannot fully process it, say for example a selection will not emit a tuple for every input tuple because some don't match, then it will continue to ask for new tuples until it can emit something.
We'll see this also.
And, well, you can see there's basically a long call stack,
and it's per individual tuple.
The, let's say, orthogonal approach to that: rather than doing tuple by tuple, you can do operator by operator. That also works. Meaning we basically say that this operator completely consumes its input, produces its output, and then we let the next operator work on that output as its input. And the intermediates will go to memory, will be stored, or something like that.
So say, for example, first we do our scan.
The scan can do the filtering right away.
Then we have the filtered results.
On these filtered results, we can do the next operator.
And here, we can basically have intra- or inter-operator parallelism, meaning that, since we're looking at the whole data set or the whole intermediate result at a time, we can actually make these operators parallel. So for each operator, like a scan, we can make the scan parallel, we can make the selection parallel, everything. This is found in column-oriented database management systems, so systems that, rather than storing rows, store their data in columns, and that often also store their data on disk. And we'll see, I don't want to give all the details away yet, because we're going to go through this in a bit more detail.
Then we have the vectorized execution.
That's kind of the middle ground between the two
where rather than saying we're going to give or get individual tuples, we're going to ask for a series of tuples. So say a batch of tuples would be a thousand tuples, something like this. And we work on those per operator. So this means we have a fixed size
or like a smaller data set per operator. So we're not going to go through the whole table,
for example, in a scan,
but we're just going to go in batches through the table.
But we don't do an execution for each individual tuple. So executing a single operator will be amortized, or we get this kind of amortization of loading the code and having these call stacks, et cetera, because we still do a lot of computation rather than just calling functions back and forth.
And here, I didn't put tuples.
It should actually be like we're still calling the next here,
but the next will ask for multiple tuples at a time.
So a large number of tuples at a time.
So it's still the same kind of volcano iterator model,
but we're going to go in batches.
And then the final one is query compilation
or code generation.
Here, we're basically trying to get rid of all this interpretation overhead.
So the interpretation overhead means
we have our operators, which is lots of code, basically.
And we say, please operator now work on this type of tuple,
which has this data type.
We have to load everything so that the operator will be
quite complex in order to deal with all the special cases that we might have to deal with.
And rather than doing this, we're going to say, well, this is a kind of query that does a scan
and a join and aggregation, for example. So let's produce code that does exactly this
and nothing else with these data types
specialized exactly for this.
So we'll have a much smaller code.
We'll have a tight loop, meaning we're going to iterate over the tuples only as many times as we need, not in different operators, not in different functions, but in a single function. And we're splitting the code up into pipelines until we have pipeline breakers.
So say, for example, we have to build a hash table, something like this, we have to materialize.
So there, we're going to have separate code parts, but the rest will be in one tight loop. And that means our code is going
to be much more efficient. It's going to be much smaller, but we have to pay by producing this code
and compiling this code. And this is used in many-core, main memory database management systems. The same is also true for the vectorized execution.
So as I said, there are kind of these two camps, where one says, well, let's go with vectorization, and the other says, let's go with query compilation.
And there's trade-offs.
As an excursion, so just so you've seen this, there is also something which I actually kind of like.
This is an older paper from 2000 where people built a database management system for smart cards.
So basically the idea was: if we have our data, say our health data, on the smart card, let's have a database on there, and we have to organize the data there. So you might know that a smart card actually has a processor. So it's not just storage. Depending on the type of card, many cards have a processor that is powered when you're either tapping it via RFID or when you put it into the reader, and while it's in there, you can basically do some processing. So people built this.
But of course, the memory is very limited.
So the idea was how to do processing in this super constrained setup,
where we don't have any memory, and they came up with the idea of an extreme right deep tree.
So the idea here is that we don't use memory at all.
So we're not going to materialize at all.
Meaning, say for example, if we do a join, for each individual tuple on the left input,
we're going to go through the whole pipeline,
through every single tuple in all of the other tables.
So, of course, there's going to be more work,
but we don't need any memory at all.
So, we also have only pipelining.
All joins, except if we have some kind of indexes
or if the data is pre-sorted on the join key,
all joins will always be nested loops,
because we cannot materialize.
So usually we would today use a hash join,
but then we need to build a hash table.
So that consumes memory.
So in this case, we're saying, OK, let's look at the single tuple and go through everything
else.
And then let's look at the next tuple, go through everything else.
Of course, it doesn't work for everything.
So you cannot do a holistic aggregation, for example.
A median, you need to materialize.
There's no way to basically find this without storing something.
Also, you cannot produce sorted results, unless the data is already sorted, because you would have to store the intermediates somewhere.
Okay.
So that was just an excursion.
I think it's interesting.
So the idea is that rather than having these
typically left deep trees, for example,
so usually you would produce like your query like this,
where you do the selection on your scan
and then do your joins,
and you will have intermediate results, or at least individual tuples as intermediate results, in every step.
Here, we're not producing any intermediate results at all.
We're just looking at the current tuple.
If we emit it, we emit it after going through the whole pipeline. Otherwise, we're just going to drop it and go to the next tuple.
So we always just need basically one tuple plus all the extensions.
Basically, the complete row that we might have from all of the tables.
So meaning one tuple from each input table, basically, but that's it.
So not more than that.
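To make this concrete, here is a minimal C++ sketch of what such an extreme right-deep plan boils down to; the tables, values, and predicates are invented for illustration, every join is a nested loop, and the only state is the current tuple of each table:

    #include <cstdio>
    #include <vector>

    struct Row { int id; int value; };

    int main() {
        std::vector<Row> R = {{1, 10}, {2, 20}};
        std::vector<Row> S = {{1, 150}, {2, 50}};
        std::vector<Row> T = {{1, 7},  {3, 9}};

        // For each tuple of the outer table, walk the entire remaining pipeline.
        // State: only the current tuple from each table, nothing is materialized.
        for (const Row& r : R)
            for (const Row& s : S)
                if (s.id == r.id && s.value > 100)        // join condition + selection, no hash table
                    for (const Row& t : T)
                        if (t.id == s.id)
                            std::printf("%d %d %d\n", r.id, s.value, t.value);
        return 0;
    }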
As I said, just because I think it's an interesting idea
and it kind of takes the iterator model even further to the extreme. I cannot really imagine a use case right now, but maybe in the future there comes a new use case, and then you're going to think, well, I know this extreme right-deep tree that I can use for this kind of setup.
Okay, so with that, let's go back to the iterator model. Well, as I said, in the iterator model data is processed tuple at a time. It's also called the pipeline model, because we can pipeline, or the Volcano model, because of a system that was called Volcano. And basically each operator has an open, a next, and a close function, and that's basically it. That's all that you need, and then of course you need to feed the output into the next operator. So, let's put it this way: you can request the next tuple from an operator by calling the next function, and on each invocation the operator either returns a single tuple, or it returns null, which means that's it.
We're done with the table, basically,
or we're done with the query,
at least on this side of the table.
And each operator keeps its own state.
That's the other important thing.
So say, for example, you want to do a hash join,
then the operator, the join operator, will keep the hash table as its state. Or if we want to do some aggregation, then the operator will keep some kind of aggregate as its state. So again, say for example a median: we would keep all the values and then in the end find out, okay, this is our median value, and output this median value as soon as we can.
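As a rough sketch of what such an open/next/close interface might look like in C++ (the names and the simplified Tuple type are illustrative, not taken from any particular system):

    #include <cstddef>
    #include <optional>
    #include <vector>

    using Tuple = std::vector<int>;  // simplified: one row of integer attributes

    // The open/next/close interface that each operator exposes.
    struct Operator {
        virtual void open() = 0;
        // Returns the next tuple, or std::nullopt once the input is exhausted.
        virtual std::optional<Tuple> next() = 0;
        virtual void close() = 0;
        virtual ~Operator() = default;
    };

    // A selection keeps only its child and the predicate as state; on each call
    // it pulls from the child until a tuple qualifies or the child is done.
    struct Selection : Operator {
        Operator& child;
        std::size_t column;
        int threshold;
        Selection(Operator& c, std::size_t col, int t) : child(c), column(col), threshold(t) {}
        void open() override { child.open(); }
        std::optional<Tuple> next() override {
            while (auto t = child.next()) {
                if ((*t)[column] > threshold) return t;  // e.g. value > 100
            }
            return std::nullopt;
        }
        void close() override { child.close(); }
    };

A hash join or an aggregation operator would keep its hash table or partial aggregates as state in exactly the same way.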
And let's look at an example.
So say for example, we have this very simple query. We have two tables, R and S. We're joining R and S on an ID attribute. We're selecting everything that's larger than 100 in the value attribute of table S. And then we're outputting, or projecting, the result to R.id and S.state.
So what this means is basically we have a query plan,
something like this.
So this is something that we could actually, for example,
also have in an extreme right deep tree.
There we would basically, for each tuple of R, go through the whole rest. But in this case we're looking at the iterator model right now. So this means that for each of the operators that we have here in our query plan, we'll have a matching operator in our database system that we will then utilize, or that we will call, for each individual tuple basically.
And there is not necessarily always a one-to-one matching.
So say for example, here in this plan, we see that we're just inputting the S relation, right, and the R relation, but we need an operator for this. So we basically need a scan here. Then the selection could be a separate operator; often this is also already included in the scan operator. Then we have the join, and we have the projection in the end.
The projection often is already inside other operators.
So just so you know, because I mean, for a simple projection, we're just going to throw
away some of the attributes in the output.
In an extended projection, we might do some kind of additional computation.
So you remember this.
We could also do multiplications, additions, whatever in there.
And this is, say, a simple translation of this query. And here we already stated the join explicitly. We don't really need to do this; typically we could also just write R, S, which the database would still optimize, because it sees that we have a join condition here, so it will still produce this join. But we're not about query optimization, we're about query execution. So what happens here is basically
that we will have different kinds of operators. So this is very simple code, and this is basically what would be executed. We have our table, which ends up being a scan operator. So R would have a scan, or an interface that says hasNext: do we have a next, and if we have a next one, then let's emit this next one. Or we could just say: if next is not null, then emit next, which would be the same thing. The same is true of course for S, so we'll have the same code there.
Then we have the predicate, so the selection or filter, which works on the tuples that we get from the child. We'll basically find the next tuple for which the predicate matches, then we emit it, and then we wait for the next call.
Then our join could be, say for example, a hash join.
So in this case, our hash join would basically mean
we're first consuming the left side,
building a hash table on the left side,
and then consuming the right side tuple by tuple,
and if we find a match on the right side,
then we'll emit this tuple.
While consuming the left side, we cannot emit anything yet, because we don't know any join partners yet.
So this means basically in this kind of query plan,
we'll first go completely through the R table and then tuple by tuple through the S
table, even though the operator on top will just call for individual tuples here.
And then we have the projection.
So the projection will basically just emit the individual tuples, tuple by tuple, reduced to the attributes that we're interested in.
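Continuing that sketch, a hash join operator in this model might look roughly like the following; the column indexes are illustrative, and for brevity this version only returns the first match per probe tuple instead of all matches:

    #include <cstddef>
    #include <optional>
    #include <unordered_map>
    #include <vector>

    using Tuple = std::vector<int>;

    struct Operator {  // same open/next/close interface as in the sketch above
        virtual void open() = 0;
        virtual std::optional<Tuple> next() = 0;
        virtual void close() = 0;
        virtual ~Operator() = default;
    };

    struct HashJoin : Operator {
        Operator& left;        // build side (R)
        Operator& right;       // probe side (the selection over S)
        std::size_t leftKey, rightKey;
        std::unordered_multimap<int, Tuple> table;  // operator state: the hash table
        bool built = false;

        HashJoin(Operator& l, Operator& r, std::size_t lk, std::size_t rk)
            : left(l), right(r), leftKey(lk), rightKey(rk) {}

        void open() override { left.open(); right.open(); }

        std::optional<Tuple> next() override {
            if (!built) {                            // consume the whole build side first
                while (auto t = left.next())
                    table.emplace((*t)[leftKey], *t);
                built = true;
            }
            while (auto probe = right.next()) {      // then probe tuple by tuple
                auto it = table.find((*probe)[rightKey]);
                if (it != table.end()) {             // simplification: emit only the first match
                    Tuple out = it->second;
                    out.insert(out.end(), probe->begin(), probe->end());
                    return out;
                }
            }
            return std::nullopt;                     // both sides exhausted
        }

        void close() override { left.close(); right.close(); }
    };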
So how is this actually executed?
So in the first step, we'll have basically
the projection call next on its child.
And in this case, it's a join basically.
And the join operator will then call next on the left side
in order to build the hash table.
The left side is the R table, which basically will just read the table tuple by tuple, and the join operator, in step two, will then consume a single tuple. When it consumes a single tuple, it will put it in the hash table and ask for the next tuple, until we get no more tuples, so until the next tuple is basically null on this side. Okay, so this is kind of the first part. And then in the second part,
we're doing the same thing more or less on the right side. So on the right side, we'll call next on the selection. The selection will basically call the scan until it finds a tuple that matches the selection. So it will ask for a single tuple. The single tuple will be sent back to the selection.
If the predicate matches the selection, the tuple will go back further to the join.
If the selection doesn't match, then it will ask for the next tuple.
So, it will continue basically down here until it finds a matching tuple.
As soon as it finds a matching tuple,
it will emit this up to the join.
The join will basically check, does it match?
So do we find something in the hash table?
So we're probing the hash table with our tuple.
If we find a match, then we'll emit this match.
If we don't find a match, we're going to ask for the next tuple.
So then basically we're going to ask the selection.
The selection will ask the scan again.
If we match, we're going to continue.
If we don't match, we're going to iterate here until we go here.
And we continue as long as we get new tuples in here, until both sides are basically null, or until the join here produces null; then our query is basically done. So that's the last step, all the way to the top. And you can see this is kind of a lot of function calling.
So in every step, even if everything is good, right, if we have matching tuples right away, this still means on the left side we have three function calls plus building the hash table, etc.
On the right side we basically have four function calls,
if we have a match.
If we don't have a match, we're going to iterate further down there.
So, this is the iterator model,
and it's used almost in every database today,
at least in almost every traditional database.
So, classical versions, or let's say Postgres, of course,
MySQL, same story, classical versions of DB2,
Oracle, et cetera, would use this model.
We have this option for tuple pipelining.
So if we look here, while one operator is emitting a tuple, the next operator can basically already prepare its next tuple. So it can already say, well, I'm going to have the next tuple ready when I'm asked. I can do this independently; I don't have to wait for a call all the way from the top. So then I can get kind of a pipeline parallelism, but only in the order of the number of operators, or individual functions, that we have. But that's basically it.
And we still have a lot of code that we have to walk through.
And, I mean, of course queries can get quite complex, so we can probably get enough parallelism, but it's not going to be super efficient, because we always have this kind of communication between the different processes here.
And some operators must block until their children emit all their tuples.
So say, for example, a hash join will have to block until it has all of the tuples of the left side, the build side. Before it has all these tuples, if we're using a hash join, we cannot emit anything, unless we're using a double-pipelined hash join, which will use more memory again. Subqueries, sorting, grouping, etc. have the same problem. And then we have these
kinds of problems with caching, right? All operators run sort of in an interleaved fashion anyway, but especially if we're doing pipelining, we have all this code active at the same time.
And the code might actually be large because it's interpreted, right?
So this means like the operators will not be specialized for one data type.
They will be general for all kinds of queries that you can think of.
And this means this is a lot of code.
This might not necessarily fit into your instruction cache. So this means you get kind of
instruction cache misses, which makes your program again slow.
Then we have all this function call overhead. So I mean, this is a simple query, but we have a huge call stack for each individual tuple already here.
That costs a lot because we always have to switch between these different functions.
If we have many operators, say we do five joins in a row,
each of these joins will have to build their hash tables.
Different kinds of aggregations will need these intermediate results, so the combined
operator state might be huge.
That basically then means if it doesn't fit into cache, it's going to be slow again and
produce a lot of data cache misses.
So that's not good, basically.
And this is also why it's not great in main memory databases.
And these are kind of the things that we
want to tackle with other kind of execution models.
And so there are some examples and some experimentation. There's the X100 execution engine, which is based on MonetDB and came out of CWI, and which is a vectorized engine, and they of course tested all this stuff. MonetDB is a main memory database and originally used this iterator model with a column format, but then they basically came up with this vectorized execution, that's the X100 engine. And they, for example, tested: okay, what's the overhead, or where does the time go, in something like MySQL?
So, query 1 of the TPC-H benchmark,
so TPC-H is the OLAP benchmark, a simple OLAP benchmark,
and this is one of the simpler queries
where you basically have one table that we're just touching on. You have a complex
extended projection, as you can see. This is what I told you earlier, right? So, all this up here
is basically just aggregations and calculations over the single table. So, I mean, it's not just extended projection,
but it's also aggregations in here.
And we have a simple selection,
so we're just looking for a certain date and the grouping.
So this will produce some intermediate results.
There's no join in here.
There's calculation.
So we would assume that the system needs to materialize something, but if the system is fast, it should spend a lot of its time on the calculations, right? That's what's heavily used here.
And if we profile this, so this is taken from the paper basically, then you can see that every call of the database only processes a single tuple. The table has, at scale factor one, I think about 6 million tuples, so this is going to be millions of calls, and for each of those, you will have the individual calls through the whole database code.
And you can see that only 10% is spent on actual query tasks.
Everything else is kind of management around this.
This happens often.
But what you can also see is that you have a very low instructions-per-cycle ratio. We're below 1, meaning that per cycle not a lot gets done. The profile basically gives us the instructions per call and the instructions per cycle for these instructions, and if we divide the instructions per call by the instructions per cycle, we get the cycles per call. And with such a low instructions-per-cycle value, the system is basically mostly waiting.
So a lot of the time is basically spent on the field accesses, on going through the fields of the tuples. This is where a lot of the time goes.
Then the individual tuple functions, they are hard to optimize.
So there's not much to do per individual tuple, but there's a lot of code.
So the compiler cannot really do much.
So this also means that our pipelines on the CPU will probably be empty most of the time or many times.
And that makes the CPU basically stall.
So this is why we get such a low instructions-per-cycle count.
There is no possibility to optimize across functions, because these are very different code blocks that will be interpreted.
The function call overhead is very high.
And you cannot use any vectorization, or only in a very limited way,
because you're just going to look at individual tuples.
And most likely, your individual tuples
won't contain any arrays that you
could do your processing on.
And so, for example, this addition: the addition is basically the Item_func_plus::val function in here, so if you go back, this is somewhere down here, right, so this is the instruction. We have 0.8 instructions per cycle, so this is how many of these 38 instructions we will execute per cycle, and for each individual addition in here we need 38 instructions. So we're basically dividing the 38 by 0.8, that way around. And that means rather than just 38 cycles, we're going to have about 48 cycles (38 / 0.8 ≈ 48) that we have to walk through for a single addition in here. And that's just the addition, and
this is in contrast to three instructions if we just had load, add, and store in assembly. So for the kind of code that we looked at earlier, this would be three instructions, and it would to some degree be pipelined, so we would probably get an instructions per cycle above one, right? And we have multiple units at a time, so we would expect an instructions per cycle of two to three, something like that.
So this should be much faster. One problem that we have is we don't have any loop pipelining and we have dependent instructions.
So this basically takes a lot of time and we have these call overheads.
So calling these individual, different functions, that's an overhead of again 20 cycles that cannot be amortized.
So basically we have to load new code,
we have to go to different parts of the code,
which just costs a lot of time and a lot of cycles in here.
So one way to go against this: now we know, right, that individual tuples are bad. And so then all of a sudden people said, well, individual tuples are bad, so let's do all of the tuples at a time.
So that's the materialization model, right?
So rather than doing one tuple,
we're going to do all tuples per operator and then give all of the results basically to the next operator.
And that means each operator runs exactly once,
at least each instance of an operator.
So, say we're scanning one table,
we're going to do the complete scan and then do the next operator.
Of course, we're not going to just do a scan and load the data, because that maybe doesn't make much sense on its own. We're going to do, say for example, a selection, and then have the selection result as the next step, and that sub-result will be fully materialized, right? And there is no pipelining, because we're going to do one operator at a time, but we can do intra-operator parallelism. So we can parallelize the operator itself.
And so say, for example, here,
same kind of idea.
If we look at the same example,
so here I removed the animation,
or I did not add the animation,
let me put it this way.
So rather than saying I'm going to call for each individual tuple the next function,
I'm going to call something like an output function for my child. So I'm going to say, please, child operator, give me your complete output.
So this is basically what I'm asking for. This is done for the projection, and then the same is done in the join.
The join will need both inputs.
The join will basically ask the scan, please give me all your output.
The scan will basically read all of the tuples and produce an output in memory that the join operator can then consume. The join operator will consume the complete R table, produce the build side, then ask on the other side, ask the selection: selection, please give me your complete output. The selection will then basically,
that's already in the next slide,
the selection will basically ask the scan again,
the scan will produce the complete output,
like basically load the table,
the selection will filter the whole table,
send the filtered results, give the filtered results, to the join. The join will consume the whole filtered right side and probe the hash table with it, produce the join result, and then the join result will basically be projected with the projection.
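A minimal sketch of this operator-at-a-time style, assuming a simple columnar layout (names and types are illustrative): each function consumes its complete input and returns a fully materialized result.

    #include <cstddef>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Column { std::vector<int> data; };   // one attribute, fully in memory

    // Selection: whole input column in, fully materialized list of qualifying row ids out.
    std::vector<int> select_greater(const Column& col, int threshold) {
        std::vector<int> ids;
        for (std::size_t i = 0; i < col.data.size(); ++i)
            if (col.data[i] > threshold) ids.push_back((int)i);
        return ids;                              // intermediate result, fully materialized
    }

    // Hash join: build on the complete R key column, probe with the complete
    // (already filtered) S key column, materialize all matching row-id pairs.
    std::vector<std::pair<int, int>> hash_join(const Column& rKeys, const Column& sKeys,
                                               const std::vector<int>& sIds) {
        std::unordered_multimap<int, int> build;
        for (std::size_t i = 0; i < rKeys.data.size(); ++i)
            build.emplace(rKeys.data[i], (int)i);
        std::vector<std::pair<int, int>> out;
        for (int sid : sIds) {
            auto range = build.equal_range(sKeys.data[(std::size_t)sid]);
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(it->second, sid);
        }
        return out;                              // again a fully materialized intermediate
    }

Every one of these returned vectors is a fully materialized intermediate, which is exactly where the memory problem discussed below comes from.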
As I said, in a practical implementation, most likely some of this will be merged.
This, of course, makes the code even bigger.
If you have the projection in every operator, for example,
and you have some filtering condition stuff in every operator,
probably you cannot merge everything,
but some stuff you will do right away.
So while you're scanning through the data,
when you're already touching the data,
some basic filtering optimizations you can do right away.
Or if you can do a projection pushdown,
you will do the projection pushdown right away
because you don't want to read everything.
And as I said, this is mainly a model for column stores.
Column stores store the individual attributes.
So you will only read the attributes that you need
and not the full table, which basically is already
like using the projection in the first place. Okay, so how is this good?
Well, we have much fewer function calls.
So we basically don't pay all of this function call overhead.
We have function call overhead, but that's very limited or it's amortized by processing
so many tuples at a time.
The compiler can better optimize the code
because the loops are actually tight
and we're looping over many tuples at a time.
So we can use loop unrolling,
we can use vectorization
and the compiler can do this also for us.
So, I mean, remember the small programs that we used.
If you look at the compiled assembly,
you will always basically see some kind of loop unrolling
if you have a for loop that does something.
You will see some auto vectorization.
And it can, like in general, use some modern CPU features.
It can do some prefetching if it knows
we're reading this from memory, right?
Rather than getting this from another function.
So then, okay, when reading from memory and from cache, we have all this logic for prefetching, etc.
So the materialization is kind of cool, much more efficient in a way, but we have the problem that we might get these huge intermediate results.
Especially, I mean think about a query that
produces a Cartesian product.
If we're materializing a Cartesian product,
then probably our memory will explode.
And in some queries, you cannot really optimize this away
very well.
So then doing a pure materialization model
will basically break your system.
Because then all of a sudden, you have to page out
your memory.
So this will be slow.
So the question is, can we aim for something
in the middle ground?
Can we aim for something in between tuple-at-a-time and operator-at-a-time?
And that's where the vectorization comes.
And this we'll do after the break.
A five minute break.
Do we have questions so far?
Yes?
I can imagine that another problem of the materialized model is the handling of limit
clauses, maybe?
Let's say we have a limit 10 clause in the end?
It's going to, I mean, the question is,
can we use limit clauses in materialization?
We cannot optimize for limit clauses exactly.
So what will happen is basically that we're producing intermediate results that we don't need.
So we'll produce, I mean, to some degree, of course,
the database can push down some of the limit clauses.
So then we can produce smaller intermediate results.
But like in a pure, if you think about a pure materialization
model, you're absolutely right.
We're going to produce a huge result that we don't need.
In the iterator model, we're just
going to stop asking for more as soon as we have reached our limit.
Completely correct.
Other questions?
Okay, good.
Then let's do a five-minute break.
Vectorization model.
So the idea is, and I already alluded to it to some degree: rather than doing the full materialization, and also rather than doing a single tuple at a time, let's still use the iterator model. This Volcano-style model is a nice and clean way of representing a query and structuring your system.
But instead of for each individual call to next, let's not do a single tuple, let's do
a batch of tuples.
This is called vectorization in this MonetDB/X100 terminology. And in this case, in the internal loop, we're not just going to process a single tuple, say in the scan or in the selection, but we'll have a large batch of tuples.
And this means, or this batch, of course,
can be configurable, but then we have
more to do for the processor.
So the processor can actually do the same kind of operation,
the same kind of loop on many tuples at a time.
And so it's the same idea basically: you can see that the basic layout of the plan is very similar to the initial layout of the iterator model, with the exception that in each step we have to handle a larger batch of tuples, with a configurable batch size.
So basically our projection will call the join and ask the join to give us a batch of tuples. And then the join will basically ask the scan: please give me a batch of tuples, a vector, that I can process. This vector will then go back to the join, and the join will build the hash table with a larger batch, so it can do all this hashing one after the other. The code will be more optimized; we're not switching to completely different code paths, but do all of the processing in a small loop, so that the compiler can help you and the CPU can work more efficiently, because it can use the same kind of operation over and over again. So we still have the function call overhead, but we're amortizing this over many tuples at a time. Then the same thing of course happens on the other side.
So once we've built the hash table we'll probe the hash table. For this we're going to ask the
selection to give us a batch of tuples. The selection will ask the scan for a batch of tuples, and the selection will basically keep asking for batches until it can produce a complete batch that it will send to the join. And the same goes for the join: the join will produce join results until the batch is large enough that it can send it to the projection.
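A minimal sketch of what such a batched next() could look like, assuming a single-attribute batch and an illustrative batch size of 1024:

    #include <cstddef>
    #include <vector>

    constexpr std::size_t kBatchSize = 1024;     // typical vector sizes are in the thousands

    struct Batch {
        std::vector<int> values;                 // one attribute, columnar within the batch
    };

    struct VecOperator {
        // Returns an empty batch once the input is exhausted.
        virtual Batch next() = 0;
        virtual ~VecOperator() = default;
    };

    struct VecSelection : VecOperator {
        VecOperator& child;
        int threshold;
        VecSelection(VecOperator& c, int t) : child(c), threshold(t) {}
        Batch next() override {
            Batch out;
            while (out.values.size() < kBatchSize) {
                Batch in = child.next();
                if (in.values.empty()) break;    // child exhausted
                // One tight loop over the whole batch: easy for the compiler to
                // unroll and auto-vectorize; the call overhead is amortized.
                for (int v : in.values)
                    if (v > threshold) out.values.push_back(v);
            }
            return out;                          // may slightly overshoot the target size
        }
    };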
And so it kind of uses the best of both worlds. We have the iterator and the materialization model included.
So we have a lower number of invocations per operator. We're not invoking this for each tuple, but say for a thousand tuples or something like that.
We can use SIMD instructions because we're going to operate on multiple tuples at a time.
We can do the same kind of processing that you're doing right now
because we will look at, say, a thousand tuples at a time,
meaning we'll easily fill up our SIMD registers and do some processing there.
And we'll do this over multiple loop iterations. So it also means we don't do this just for a single instruction, but over many iterations we do the same kind of SIMD instructions.
So the code will be ready; we're not going to have this cache miss problem in the code, for example. So basically we want a vector size that's large enough to amortize the iteration overhead, so all of these function calls and operator overheads, but small enough to not completely thrash the data caches.
So we don't want to basically work on a data set where we're going out of the caches all of the time.
And so here, for example, you can see in this X100 on an older machine, so an Athlon and an Itanium,
so probably today the vector sizes could even be larger with larger caches, but you can
see that with vector sizes in the thousands, like 1,000 or 4,000, something like that, you get the optimum performance, up to 16,000 or so.
So today, probably this would be the range
for TPC-H query one, in this case.
So the one that you saw earlier already.
So this kind of means that this many tuples fit into our cache; if we go larger than that, then our tuples will basically always have to go to memory again.
We're going to run out of the cache and then we're going to lose some of the performance.
And you can see this is a logarithmic scale, right? So this also means, from a single tuple, we're actually getting almost two orders of magnitude
performance improvement down here.
So factor of 100.
That's also why the system is called X100, because it's 100 times faster than the original
system just by using these vectors.
So if we compare these three models,
the iterator model, the materialization model, and the vectorization model, then we can see that in the query plans for the iterator model
and the vectorization model, they're basically very simple, right? So we're just
stitching these operators together. We don't have to do much else. They just
call each other. And so that's actually a good thing. In the materialization model, the per-operator query plans get more complex, because we're basically going to produce the complete output and we will have to stack the operators next to each other. The instruction cache behavior, which was one of the big problems for the iterator model, where of course it's poor, is great in the materialization model, because our operators will be small and they
will hopefully fit into the instruction cache.
At least much more of the operator will fit into the instruction cache than for the iterator
model.
So, it's going to be great.
In the vectorization model, we'll have these function call overheads.
So we're not going to be as good as in the materialization, but we're amortizing this
over the vectors.
So there we'll still have a good instruction cache utilization. Because for a single batch of tuples, the instruction cache utilization will be good.
When switching from operator to operator, we'll have some instruction cache misses, and the reduced performance in comparison to the materialization will basically be proportional to the number of batches that we're processing.
But as we said, we're going to try to optimize this here.
We have many function calls in the iterator, but very few in the materialization and a few more, of course, again, in the number of batches
in the vectorization.
And, well, yeah, so then the attribute access is complex in the iterator model, while in
the materialization model, we can directly access each of the attributes, basically. And these are mainly the two reasons why we're looking at this, because this is basically what we're looking for on modern hardware. So what do we spend our time on? In the iterator model, we're mostly spending our time on this interpretation.
So we have to basically load the code, have to basically configure the code, or let's
say work on a single tuple.
So this is basically where time goes and we're not doing useful work.
In the materialization and the vectorization, most of the time is spent on the processing.
Materialization, even more.
And this means we have good CPU utilization, both in the vectorization and the materialization.
In the materialization, we might have caching problems
because our intermediate results can be large.
So we will probably go to memory.
All of a sudden, we're basically spending more time on memory loads and stores because of the large intermediate results, and then we're not utilizing the CPU as well anymore as if we had the proper vectorization size.
So if we have a good vector size, then we can still keep everything in the caches.
And then we're just going to utilize the CPU ideally, basically.
And well, then, well, compiler optimizations,
we can do this in materialization and vectorization.
We have expensive materialization overhead
for the materialization, of course.
And this also limits our scalability,
which is better in the iterator and vectorization.
Okay.
So now we have kind of this span of how to deal with intermediate results, right?
And how to deal with batches of tuples.
So between the iterator and the materialization model, with the vector size we can kind of move in between.
We can move all the way to materialization if we have very large vectors.
We can move all the way to iterators
if we have vectors of size 1.
Now something different, a completely different way of processing data, kind of another dimension of how we do the query execution, is code generation.
And in the vectorization, we already heard,
we're reducing this function call overhead.
And this is one of the things that's really killing us in the iterator model. Of course, the individual cache accesses, etc., hurt as well, but having a huge amount of code that doesn't fit into our cache and jumping back and forth in the code through these function calls, this is what really makes it slow.
And in the vectorization, we still have this function call overhead.
We're just reducing it and amortizing it
by not calling the functions as often.
So we can basically iterate over some data
in a certain function,
but we still jump in between different functions.
So we still do this overhead.
And something else that we can do is rather than having these functions and this interpreted code, we can actually directly emit
code that does exactly what we want. So looking back at this MySQL example, where we said there's a very low IPC because there
are about 48 cycles that need to be spent for a simple addition, we can just
say, well, let's just produce the assembly that does exactly this addition and then just
execute this assembly.
This means we basically try to keep everything in the registers, we try to just
generate code that is as small as possible and as specialized as possible and as optimized as
possible for exactly this kind of processing. This can be single threaded,
but of course it can also be parallelized.
We can still use SIMD instructions. We can still use multi-core if we want. We just need to produce code
that does exactly what we want there
and then let the compiler do its magic
and do the kind of parallelization.
And so this was pioneered in Hyper.
So that's a system that was built at the TU Munich,
was later on sold to Tableau.
And so the idea here was that they saw, okay, all this function call overhead is super expensive.
We have servers with many cores and large amounts of memory
where potentially everything fits in there.
So let's just basically fuse everything that needs to be done within the operators.
Let's fuse as many operators as we can into a single pipeline,
generate code for that, and just emit this code and run this code.
And the pipelines basically go all the way until we have a pipeline breaker.
And as I said earlier already, whenever we have a hash table, for example, we have to materialize something.
This is a pipeline breaker.
So this is basically where we'll stop.
But everything else will be merged. And then we can basically produce this code
and produce machine code for this.
Rather than having operators that we stitch together
that need to be interpreted,
we actually emit something like C++ code
that then we compile and get super fast executables.
And rather than having this iterator model where we basically pull data, we can also
push the data.
So we can basically say, well, let's use blocks of data that will be pushed to the next operator, from one pipeline breaker to the next, basically. So this means, if we process this part here, for example,
then this pipeline will basically read blocks of the data and push it to the join.
And then the join will start processing at some point.
So what does this look like?
So basically, we'll actually create something like this, right?
So we'll really create small code blocks that say: for each tuple in R, if it matches, then we'll have another loop; for each tuple in R2, if it matches, then for each tuple in the aggregation, and so on. This is basically what comes from here. So we'll have these tight loops, we'll just produce these small kinds of loops
and let the compiler optimize those.
The code blocks, so this is kind of the tricky part,
the code blocks don't really match the operator boundaries.
So, in the iterator model, you remember this is super simple, right?
So, we basically have this direct translation.
We have a join here, we're going to have a join operator there. Here we have to kind of split up
the build and the probe part in the join operator because the build basically is a pipeline breaker.
So we can basically push data into the build until the hash table is built and then start the other pipeline.
So say for example, here we'll build a hash table. This pipeline will basically build the hash table for the grouping and for the aggregation. This pipeline will then build the hash table for this join.
And once we have these two hash tables ready,
then we can execute this pipeline
without any other problems.
So basically, then we'll just push the data through here,
the additional data that comes from R3.
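As a rough illustration, the generated code for a plan like the earlier R-join-S example could boil down to something like this (names are invented; a real system would emit this per query, as C++ or LLVM IR, and compile it just in time):

    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Row { int id; int value; };

    void run_query(const std::vector<Row>& R, const std::vector<Row>& S,
                   std::vector<std::pair<int, int>>& result) {
        // Pipeline 1 (ends in a pipeline breaker): scan R and build the join hash table.
        std::unordered_multimap<int, int> ht;
        for (const Row& r : R)
            ht.emplace(r.id, r.value);

        // Pipeline 2: scan S, filter, probe, project -- fused into one tight loop,
        // with no operator boundaries and no per-tuple function calls left.
        for (const Row& s : S) {
            if (s.value <= 100) continue;                 // selection
            auto range = ht.equal_range(s.id);            // probe
            for (auto it = range.first; it != range.second; ++it)
                result.emplace_back(it->second, s.value); // projection
        }
    }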
In HyPer, it's not C++, but actually LLVM code, so low-level virtual machine code, that's generated. This is something that's inside Clang, let's say an internal representation that they use. It's sort of assembly, but it's somewhat platform independent, and it can then be just-in-time compiled to executable code.
And there's kind of different ways how you can then also execute or how you can compile this.
So then basically you could say, OK, let's
use all the optimizations that the compiler has,
so multiple passes, et cetera.
Or you can say, oh, I'm just going
to do a single pass compilation.
So: give me the compiled code as fast as you can, based on this code.
And the system, the database system,
basically already emits this intermediate representation.
It doesn't emit like these for loops,
but it emits this kind of assembly style representation.
There's also other systems.
Whoops, somewhere I jumped in the wrong direction.
So this is kind of a mix of generated code and some C++ libraries.
So stuff like memory management, et cetera,
this will be in the libraries.
But this means everything that's in the libraries, like the memory management, etc., will only be compiled once, while the individual query will be compiled whenever you want to execute it.
And this will be very small code, a very small executable.
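And just to illustrate the generation side itself, a very simplified sketch of a generator that walks one plan node and emits specialized source text; HyPer emits LLVM IR rather than C++ text, and the plan node and the emitted snippet here are purely illustrative:

    #include <string>

    struct FilterNode { std::string column; int threshold; };  // one node of a toy plan

    // Emit specialized C++ source text for the probe pipeline of the example query.
    std::string emit_probe_pipeline(const FilterNode& f) {
        std::string code;
        code += "for (const Row& s : S) {\n";
        code += "  if (s." + f.column + " <= " + std::to_string(f.threshold) + ") continue;\n";
        code += "  auto range = ht.equal_range(s.id);\n";
        code += "  for (auto it = range.first; it != range.second; ++it)\n";
        code += "    result.emplace_back(it->second, s." + f.column + ");\n";
        code += "}\n";
        return code;   // handed to a compiler (or emitted as LLVM IR) before the query runs
    }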
And here, for example, this is an example where vector-wise and hyper are compared.
And you can see that for the TPC-H queries, so you remember TPC-H query 1, you need 13 milliseconds to actually compile, so going from the SQL text to LLVM and then from LLVM to an executable, that's 13 milliseconds. And the code is quite small: this is a few kilobytes, basically, versus a code base of multiple megabytes to gigabytes in something like Vectorwise, and Vectorwise is basically MonetDB/X100.
So this means you actually have
very small code that will fit into your instruction caches.
And then basically the runtime will be fast, and it will actually be faster than something like Vectorwise in many cases.
Not in all cases, but in many cases.
And we'll see when it doesn't fully work in a bit.
But in general, you can see for these kind
of OLAP style queries,
the compilation time is actually negligible, right?
So it's very low, so in the milliseconds, while we have seconds of execution, it doesn't
really matter. However, if we're looking at an OLTP type, so very small transactional
queries, then this might actually be costly already. So if our queries are in the milliseconds in execution, then paying for the compilation in milliseconds
will be problematic.
And this we'll also see,
so this is also something
that people already started working on.
Okay, but you can see on average, we're faster here.
Hyper is not the only system.
There are many different ones. Once this came up, HyPer was the first system, but then many other systems noticed that this actually makes sense: let's do some code generation in our system as well. Microsoft has a special engine. Cloudera has an engine. Postgres added some JIT compilation support to their system in order to get this faster execution and not have all this interpretation overhead.
There are different kinds of strategies I'm also listing here, where the Munich guys compared at which stages and at which levels in the compiler you can actually produce code.
And I mean, very simple, of course, you always need your SQL parser,
right, so the compiler doesn't understand SQL,
so you need to translate this to some kind
of code. You also want to
do the query optimization, because that's also
something that the compiler doesn't do.
But then, you could
directly produce something
like C++.
Then, if you have
C++, this will go all the way through the compiler front end. The compiler does all the parsing, etc. again. So then you need to basically go through the
intermediate code generation, code optimizer, and then code generator.
If you're going the LLVM way, so this is this low-level machine code, low-level virtual machine code.
Basically, this would go then into the code optimizer.
This is this general purpose IR.
You can also basically have your own code optimizer.
You can basically also say, I'm doing my own bytecode representation.
And then I could either interpret this in my own virtual machine,
or I can actually also emit code myself.
So, why not just produce the code in the end myself, and basically have the complete compiler in my database system?
And this is basically where people and the Munich guys to some degree are actually going right now.
So having these different kind of strategies.
So on the one end, you can produce source code in something like Scala, C, or C++. And the newer system, Umbra, can produce all of these different things: it can produce C++, it can produce LLVM IR, and it can directly emit machine code.
And there are different trade-offs. On the one hand, there's latency: the higher up in the compiler stack you enter with your code, the higher the compilation latency, because the more the compiler needs to do, the more the code generation will cost you. On the other hand, the more optimization the compiler does, so the more time you give it for compilation, the lower the execution latency of your code will typically be. If the compiler does a lot of optimization, multiple passes, it can find better ways to execute the code, and the code will probably be faster.
And then if you're emitting your own code, well, this is probably the fastest path, so you can get to a binary very quickly. You have a slimmed-down compiler that's faster than a regular compiler, but it's probably highly platform-specific, or you need to build a complete compiler yourself again. So this means you're paying in flexibility. There's also interesting work from Saarland University, where they built a compilation system that compiles to WebAssembly. That's also a strategy you could take. WebAssembly is basically a virtual machine format that runs in essentially every browser these days and on all kinds of platforms.
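Going back to the "produce C++ source" strategy for a moment, here is a minimal, hand-wavy sketch, not HyPer's or Umbra's actual code generator: a plan operator appends C++ text for a scan-filter pipeline, and that text would then be handed to a regular compiler (or you would emit LLVM IR or bytecode instead). All names here are made up for illustration.

#include <iostream>
#include <string>

// Toy code generator: plan operators append C++ text that would later be
// handed to a regular compiler. The operator set and the helper names
// (generate_pipeline, consume) are assumptions for illustration only.
struct Codegen {
    std::string code;
};

// Generates one tight loop for a scan -> filter pipeline; the parent
// operator's code would be spliced in where consume(row) is called.
void generate_pipeline(Codegen& cg, const std::string& table,
                       const std::string& predicate) {
    cg.code += "for (size_t i = 0; i < " + table + ".size(); ++i) {\n";
    cg.code += "  auto& row = " + table + "[i];\n";
    cg.code += "  if (!(" + predicate + ")) continue;  // selection\n";
    cg.code += "  consume(row);                        // parent operator\n";
    cg.code += "}\n";
}

int main() {
    Codegen cg;
    generate_pipeline(cg, "lineitem", "row.quantity < 10");
    std::cout << cg.code;  // this text would go to clang/gcc, or you emit LLVM IR instead
}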
Okay, so briefly, just as an overview: this is from a recent paper that is going to be presented this year. Here they test, on the one hand, interpreted code, something like a bytecode; then a direct code emitter, so basically producing assembly right away in the system itself, for which they built two backends that can produce code for x86 and for ARM architectures; then LLVM, so this low-level representation, optimized and unoptimized, where optimized means multiple passes of optimization; and finally C++ code.
And what we can see, the chart is a bit hard to read, is the following. This axis is the throughput that we get, so the higher, the better. This axis is how long it takes to compile, so the compilation latency. And we can see that interpreting the whole thing is actually the fastest to start, right? We don't need to do any compilation, we just directly execute the query. However, the throughput is low, because we execute much more code and the code is not as optimized. Then there is the directly emitted code, so a backend that produces binaries directly from the queries, and you can see the compilation latency there is also very low. This is, I think, a logarithmic scale down here.
Then, handing the intermediate representation to the compiler without optimizations is faster in terms of compilation, but of course worse in terms of how fast the resulting binaries will be. If you do more optimization, your queries will in the end be faster. And if you're using C++ code, you get, and this really depends on the architecture, roughly the same kind of performance as with the LLVM approach, while paying even more, because the compilation is just much more expensive in this case.
Right, and here you can look up the paper.
I just put it here for reference.
You can see they do it for all kinds of different architectures. Yes?
In the table you showed earlier, the compilation was only a fraction of the execution though. Is this then aimed at OLTP?
Yes, this is also aimed at OLTP kinds of workloads. This is why people, also the Munich group, started to work on this: to address situations where the compilation time is already a problem. Or if your data sets are small and your queries are fast, the compilation time might already be too much.
So here you can see that with C++ we're already somewhere close to the second range, for example, for simple types of queries. I think this is TPC-H; I don't remember exactly which queries, whether it's all of them, etc., but you can check out the paper for the details.
OK, so very quickly, or are there more questions? Then, very quickly, I just want to show you some results where, let's say, the two camps came together and compared VectorWise and HyPer in a single code base. Because if you compare two separate systems, it's also about who's the better programmer, right? If you're the better programmer, then maybe your vectorization strategy always beats my code generation strategy, just because you know all the tricks. So that's why they built two systems, Typer and Tectorwise, which use the exact same code base to really compare the two approaches. They used a subset of the TPC-H queries, and there's a nice paper called "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask".
So it's a good read, actually.
And if you look at these different queries and compare, you will see, well, there is no clear winner. Sometimes the compilation is faster, sometimes the vectorization is faster, and you need to do an in-depth analysis. Sometimes they're more or less the same, and sometimes there can be significant differences. You can also see that for scale factor one on modern CPUs, the queries are already in the millisecond range, so there even the compilation will already be significant. Modern CPUs can actually be quite fast on scale factor one.
So, for example, in query 1, Tectorwise, so the vectorization approach, is much slower than the code generation approach. And the reason is that the vectorization approach executes many more instructions. You remember this query had all these aggregations and so on, so the operators that need to be used are much more complex than the code coming out of the code generator, because that will be quite tight: we have a single table, a single loop that we iterate over, and we don't really need that many intermediate results. In the code-generated version, the data can actually stay in registers, so we get very fast execution over the data and very few cache misses.
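To give a feel for why the generated code for a Q1-style query is so friendly to the CPU, here is a rough sketch, not the real TPC-H Q1 and not HyPer's actual output, of a fused scan-plus-grouped-aggregation loop where the grouping key has only a handful of distinct values; the column names and the key domain of four groups are assumptions.

#include <array>
#include <cstdint>
#include <vector>

// Hypothetical column layout; names are made up, not the real lineitem schema.
struct LineitemLike {
    std::vector<uint8_t> group;     // tiny key domain, e.g. values 0..3
    std::vector<int64_t> quantity;
    std::vector<int64_t> price;
};

// Sketch of the kind of fused loop a code generator could emit for a
// Q1-style grouped aggregation: one pass over the table, all aggregates
// fit into a few registers / one or two cache lines, so there are very
// few cache misses and no materialized intermediate results.
void q1_like_compiled(const LineitemLike& t,
                      std::array<int64_t, 4>& sum_qty,
                      std::array<int64_t, 4>& sum_price) {
    for (size_t i = 0; i < t.group.size(); ++i) {
        const uint8_t g = t.group[i];
        sum_qty[g]   += t.quantity[i];
        sum_price[g] += t.price[i];
    }
}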
In contrast, for example, in Q3 and Q9, it's mainly about hash table probing. And there, all of a sudden, Tectorwise is better at hiding these cache miss latencies, because it can operate on a whole batch and then do the other operations. The complex loops that are generated for these queries produce more memory stalls and branch misses. Because we have one very complex loop, it's more problematic to execute than if we're just using batches of tuples in an operator-at-a-time style, because then the code executed per batch can be more efficient, or will at least be smaller and fit better into the caches.
So, well, we can briefly go through this comparison. If we compare compilation to vectorization: with compilation we typically get more efficient code, because the generated code is very small and we can ideally keep the data we're working on, the individual tuples, in registers, so we don't have to go out to the caches or even to memory. The vectorized engine, on the other hand, still has these large operators that need to be read in, and it has to materialize intermediate results between the operators, so we cannot run the whole pipeline on a single tuple with vectorization.
For parallel data access, without extra work in the code generation, compilation is not as good at hiding these memory latencies as vectorization is. Of course, you could build vectorization into the code generation, so you can also generate code that is vectorized, but with pure code generation you're going to get less performant memory access.
In both, you can do SIMD vectorization. It's a bit more complex with compilation, because you have to generate the SIMD code properly, while for the vectorized engine this is basically straightforward, right? You have a bunch of vectors and you run one operator on them, which is the way we've done SIMD in this lecture.
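And just as a reminder of why SIMD is so natural for the vectorized style, here is a small sketch of a primitive that adds two int32 columns of one batch with AVX2 intrinsics; it assumes an x86 CPU with AVX2, and a code-generating engine would have to emit the equivalent of this itself.

#include <immintrin.h>   // AVX2 intrinsics; assumes an x86 CPU with AVX2
#include <cstddef>
#include <cstdint>

// Vectorized primitive: element-wise addition of two int32 columns for one
// batch. The primitive works on plain arrays, so applying SIMD is direct.
void add_columns(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {          // 8 x 32-bit lanes per 256-bit register
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i) {                  // scalar tail for the remainder
        out[i] = a[i] + b[i];
    }
}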
We can do parallelization in both. For OLTP setups, generating code can produce very fast stored procedures, because they're going to be very small and we only have to compile them once if we have a stored procedure. If we have to compile every query individually, then the compilation might actually bite us.
Regarding language support for compilation-based engines: if we target an intermediate representation like LLVM IR, we're in good shape, because we get portable code for everything the compiler already supports. If we're emitting machine code ourselves, we have to build a new backend for every individual CPU architecture, and we also have to optimize for every individual architecture, which the compiler would already do for us. And the same is kind of true for the vectorized engine: we have to adapt it to the architecture to make sure the vectorization works well.
And in general, profiling is more complex with code generation. Of course we can profile each individual query, but what if we have a new query? It's much harder to reason about generated code than about operators that are completely there, right? If I have a complete operator, I know what it's doing, and I can step through it with the profiler. If I'm generating code, I have to generate different kinds of queries and see what's happening in there, and I don't know if I generate all the queries that I actually need. The same kind of problem applies to debugging. For adaptivity, both systems can basically be made adaptive.
And with that, that's basically it for today, almost in time. We talked about these different kinds of execution models, and as a takeaway, the two models that you want to go with today in modern engines are either code generation or vectorization, because there you can actually really utilize the CPU. You can also mix the two in one way or the other, but this is really what you want to do. There's still a lot of research being done on how to ideally utilize the CPU, but for sure we know that the traditional way of structuring individual operators and processing one tuple at a time is not the ideal way. Okay, with that, thank you very much. Questions? No questions? Okay, so then next time we'll talk about data structures. Thanks and see you next week!