Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Query Execution Models
Episode Date: May 14, 2024...
Transcript
Welcome to this week's Hardware Conscious Data Processing.
My name is Martin. Professor Tilmann Rabl is at a conference this week.
I'm going to be presenting today's and tomorrow's lectures.
Where are we right now?
In the last weeks, after the introduction, we talked about CPU architectures and
how modern CPUs work, how they execute code, and how they do that increasingly
efficiently and more and more in parallel. So now the question is: since
we are talking about hardware-conscious data processing, how do we execute
data processing? How do we execute, for example, database queries, and how can we incorporate the features of modern hardware?
So how can we use modern CPUs and execute our data processing in a way that is efficient
on modern CPUs and that is, for example, as fast as possible.
We are right now here in the sixth row.
Today, on the 14th, we are talking about execution models, so
query execution models.
Tomorrow we talk about data structures, and then next week Professor Rabl will be back
for a profiling session with Marcel, and multicore.
Okay, so what is this lecture about? First, we're going to talk about the iterator model,
which is basically the oldest, most standard model for data processing or query processing.
Afterwards, we are talking about the materialization model, which is one of the first models that was directly tuned
for analytical processing, and then we have two models that basically improve
on this materialization model, which are vectorization and code generation. And again, the sources: we used sources from
Jana Giceva, Carsten Binnig, and Sebastian Breß.
Okay, so query processing.
In the beginning, we talked about query optimizers
and how a query that you send to a database is processed:
it is parsed, it is optimized, and then it's executed.
We are not going to talk about the optimizer today, so we don't care about how the
database comes up with an efficient plan. We talk
about how the database processes a plan. So we assume we now have the perfect
plan, or at least a good plan. This is the plan that the database, or rather the
optimizer, came up with. So now the question is, how do we execute this plan?
In this example here, we have five tables, A, B, C, D, and E, and we have four joins.
So now the question is, how do we process a query, for example, here with four joins?
And we have four selections on four of the five tables.
And how do we actually do that — how do we really put
this query plan on a CPU? The most classical approach, one that is still widely
used today, is the iterator approach. It is also called the Volcano model,
and sometimes tuple-at-a-time processing.
While it is still used heavily in transactional processing, in recent years people have increasingly
seen that this model has a high overhead when we talk about in-memory database systems.
Modern servers today often have hundreds of gigabytes of main memory, which is often enough, so to say,
to keep at least most of the data that is frequently accessed in main memory. So we should optimize for
this situation, or try to exploit the situation, that a lot of data is in main memory.
And if we now use this Volcano model, the iterator model,
we see that there's a high per-tuple overhead
and a high interpretation overhead,
as we'll talk about in a minute.
So people needed to come up with better ideas
for how this can be executed more efficiently.
So, for main-memory database systems, the processing models that we also want to
talk about are, first, the materialization model.
This materialization model is also called operator-at-a-time or column-at-a-time. The successor of this model is the vectorization
model — vector-at-a-time, sometimes also called batch-wise or block-wise
processing. And then, last but not least, we are going to talk about code
generation. Okay, so the first model is the Volcano model, the iterator model.
And how that basically works is: you start at the top of your operator tree, your query
tree, and starting at the top you ask for the next tuple.
If there is no next tuple yet, the next operator, in this case operator 2, again tries to
get the next tuple and calls next.
So this next is the iterator interface.
The calls to next go down the query tree
until we finally get a tuple,
for example, by reading the first tuple from disk.
And then, once we have this tuple,
it basically goes all the way up to the first operator.
So you have this next chain going down
from the very top, from the root of the query tree,
all the way down, and then the tuple bubbles up.
So we process one tuple at a time.
For this reason, we call it tuple-at-a-time processing.
And this is still mainly used today
for disk-based database systems, mainly for transactional systems.
So you find this model in Postgres, in your standard Oracle system, in SQL Server, and so on. Because people realized that those Volcano-model database systems
are not efficient for analytical processing, another model was proposed: the materialization model.
This model materializes intermediates.
Here we have an operator-at-a-time model.
We don't say, okay, get next, get next, get next, to get the next tuple. In this case, we execute every operator from the bottom up
and process the entire input.
So the first operator, for example, a scan,
does not yield single tuples.
It processes the entire input and then forwards
its output to the next operator higher up in the query tree.
And this has been mainly used
for column-oriented database systems.
The vectorized execution model, again,
uses a Volcano-style model,
so we're using an iterator-based model
that uses the next call,
but we don't have a single tuple at a time now. We have batches of tuples,
as you can see here. And this has mainly been used since DRAM capacities got so large that
we can increasingly talk about in-memory database systems. And yeah, the last one is query compilation.
And the idea here is that we get rid of
all the interpretation overhead
that a database system might have.
So we want to take the query and compile
the minimal code that is necessary to execute this query.
So there's no interpretation anymore:
as soon as we know the query,
we know exactly, okay, this column has,
for example, the data type integer.
It is not nullable and so on.
So we can generate code that does just exactly that.
We don't need code that is interpreted at runtime
to check, okay, what is my data type,
and then call the function for an integer column.
Instead, we can compile exactly the code that we need.
And this execution model splits the plan into pipelines,
as we will see later.
This is mainly used nowadays
on modern many-core systems,
and it is basically the most recent
query processing model that we have.
A short excursion: this is a very interesting paper. It's a little bit outdated.
It was about a database on a smart card. A couple of years ago, there were these things
called smart cards, and they were very limited. This was really just a card,
and they were very limited in their computational capabilities.
So they had very limited memory and a very limited CPU.
The question was now: can we actually have a database running on a smart
card?
And what the authors did in this paper was to say,
okay, we can build extreme right-deep trees. And with some tricks on these trees, they found a
model that basically has no additional memory consumption. The idea is
that for each single record, we walk the entire tree.
That means for every join, a single tuple is compared to every
tuple of the join partner.
That also means that in the end we don't have any hash joins, because a hash table would consume
memory; all the joins would be nested-loop joins.
Obviously, there are some drawbacks. It is less efficient than other models. It does
not work for all the aggregates that we have — for example, a median, where we have to keep some
state to be able to determine it; that does not work here. The same goes for sorting.
But still, they managed to do, I would say,
surprisingly much.
So yeah, they got a lot done
with these extreme right-deep trees.
So it's kind of a really interesting read
if you're interested in that.
So this is one extreme: what you need to do if your hardware is really limited —
a situation we are probably not in today,
or at least not our focus in this lecture.
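To make the idea concrete, here is a minimal C++ sketch of such an extreme right-deep plan (not the paper's actual code, and with made-up tables and predicate): every record walks the whole tree of nested loops, so no hash tables or intermediate buffers are needed.

```cpp
// Sketch of extreme right-deep execution: each outer record walks the entire
// tree of nested-loop joins, so the only extra memory is a few loop variables.
// The tables and the predicate are made up purely for illustration.
#include <cstdio>
#include <vector>

struct Order   { int id; int customer_id; };
struct Invoice { int order_id; int amount; };

int main() {
    std::vector<Order>   orders   = {{1, 10}, {2, 20}, {3, 10}};
    std::vector<Invoice> invoices = {{1, 500}, {3, 250}};

    for (const Order& o : orders) {            // outermost relation
        if (o.customer_id != 10) continue;     // selection, evaluated per record
        for (const Invoice& i : invoices) {    // nested-loop join, no hash table
            if (i.order_id == o.id)
                std::printf("order %d -> amount %d\n", o.id, i.amount);
        }
    }
}
```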
Okay, so iterator model.
Again, the idea is data is processed a tuple at a time.
And each operator basically tries to get the next tuple
that it can process by calling the next function
on the previous operator.
There can be multiple previous operators — in a join, for example,
you have two input operators.
And the result of the next call is either a tuple
or a simple null or a marker that basically says,
okay, I ran out of tuples, so I have processed all the input tuples
that my previous operator can give me.
In this model, every operator also needs to keep its own state.
For example, a hash table: let's say operator 2 in this example
is a hash join.
So we would get more and more tuples
by calling next, next, next.
And internally, we would need to keep some state,
for example, our hash table
that we can later probe in the join.
Okay, we have some example here.
We're joining tables R and S.
We have an explicit join between R and S on the id column,
and we filter the S table on the value column
for values larger than 100.
You can see the query here,
and a simple projection on top,
so nothing too surprising here.
So how would the operators right now work?
So the first operator here that we have,
when we go bottom to top,
would be an operator that
basically scans R. This operator just takes all the tuples, which might be on
disk, and as long as it can get tuples, it emits them whenever next is called
on it. And once there are no more tuples — we have read the entire table — it emits a null. The
same for table S. For the selection, you can now see
this next call. The selection calls next on the scan of S.
And as long as I get a next tuple, I evaluate my predicate —
in this case, that value is larger than 100.
And if this predicate evaluates to true, I emit my tuple.
For the join here, it's a little bit more interesting.
So first, we process the entire left side.
So we call next, next, next on the left side
to build our hash table.
And then once we have built the entire hash table,
we will start with processing the right input
and probe our hash table,
and then only emit the result in case it matches.
And the projection is a simple projection.
In this case, we don't do any calculations.
We just take the requested columns of our join result and emit them.
Now, how we would actually process this in the volcano model would be that we start from the very top.
So we don't start from the bottom and push up the result.
In the iterator model, you would begin at the top.
The projection says, okay, please, operator below me — the join — can I have the next tuple?
The join does not yet have a tuple.
So it first calls next on the left input to start building the hash table.
And now we can see that R is called the first time,
and now R would emit the first tuple,
in case there are tuples in R. We would return this tuple,
so we would emit this tuple.
And now the hash join can start building the hash table.
Then, for the other side — the right input of this join —
we would go to the next operator below.
This would be the filter.
We ask the filter for next.
The filter again calls next on S, and so on and so forth.
So you see that single tuples bubble up until here we have a breaker in the pipeline.
So the join needs to break
because we need to have this hash table
entirely prepared for the probe phase.
So this is a so-called pipeline breaker.
The pipeline is breaking here,
so tuples don't go up all the way to the projection.
They first go up to the join, and they only get to the projection as soon as we have the first
matching join tuples.
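To illustrate the interface, here is a minimal C++ sketch of the iterator model for this example query (hypothetical operator names and a simplified tuple layout, not any real system's code): every operator implements next(), and the hash join is a pipeline breaker that drains its build side before emitting anything.

```cpp
// Minimal sketch of the iterator (Volcano) interface: next() yields one tuple
// or std::nullopt as the "no more tuples" marker; tuples bubble up one at a time.
#include <cstddef>
#include <cstdio>
#include <optional>
#include <unordered_map>
#include <vector>

struct Tuple { int id; int value; };

struct Operator {
    virtual std::optional<Tuple> next() = 0;   // the iterator interface
    virtual ~Operator() = default;
};

struct Scan : Operator {                       // emits one stored tuple per call
    std::vector<Tuple> rows;
    std::size_t pos = 0;
    explicit Scan(std::vector<Tuple> r) : rows(std::move(r)) {}
    std::optional<Tuple> next() override {
        if (pos == rows.size()) return std::nullopt;
        return rows[pos++];
    }
};

struct Filter : Operator {                     // pulls until the predicate matches
    Operator& child;
    explicit Filter(Operator& c) : child(c) {}
    std::optional<Tuple> next() override {
        while (auto t = child.next())
            if (t->value > 100) return t;      // S.value > 100
        return std::nullopt;
    }
};

struct HashJoin : Operator {                   // pipeline breaker: build, then probe
    Operator& build_side;
    Operator& probe_side;
    std::unordered_multimap<int, Tuple> table; // operator-local state
    bool built = false;
    HashJoin(Operator& b, Operator& p) : build_side(b), probe_side(p) {}
    std::optional<Tuple> next() override {
        if (!built) {                          // drain the whole build side first
            while (auto t = build_side.next()) table.emplace(t->id, *t);
            built = true;
        }
        while (auto t = probe_side.next())
            if (table.count(t->id) > 0) return t;  // simplified: emit probe tuple only
        return std::nullopt;
    }
};

int main() {
    Scan r({{1, 5}, {2, 7}});
    Scan s({{1, 150}, {2, 50}, {3, 200}});
    Filter sel(s);
    HashJoin join(r, sel);
    while (auto t = join.next())               // the root pulls tuple by tuple
        std::printf("match: id=%d value=%d\n", t->id, t->value);
}
```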
And as we have said, the iterator model is widely used.
It is efficient.
In most cases, it is efficient.
For example, for typical transactional database systems, where we don't
talk about huge scans, we don't talk about huge analytical processing, we don't have
huge joins.
So what you usually see there is a very restrictive selection, for example, on an ID.
Let's say we get all the items of an order.
There might be tens or a few dozen,
and we usually have an index on it. So we have a couple of dozen tuples, and processing
them this way is actually quite efficient, as the runtime is mostly still bound by disk access.
So the iterator model works quite well there. This is why many database systems still use it.
And it allows for tuple pipelining, again,
but there might be some blocking operators
where we can't just pipeline the entire query.
For example, there are joins, subqueries, sorts, and so on.
But one thing where it is not really efficient
is if we have a lot of data —
large data sets that we process.
In this case, we don't get the best cache usage.
One reason, for example, is that
all the operators run interleaved.
That means that we have all the instruction code for all the operators if we have long
pipelines and large OLAP queries.
We often have very large queries with many operators, several joins, and so on.
So all this instruction code is basically running at the same time if we
process tuple by tuple.
That means we probably don't have many cache hits,
or often the code is just too large for the instruction
caches.
We have a large function-call overhead,
because every operator calls the next operator,
which calls the next operator, and so on.
We have no tight loops.
And also, the combined state of the operators can be quite large —
easily too large to fit into the caches.
If you think about analytical processing,
we often have several joins
and would need to keep all the hash tables
in main memory; this can easily be too large for the caches. And there's been one really nice paper on this, listed below at the bottom.
What they did was test and profile MySQL for TPC-H Query 1.
Again, TPC-H is an analytical benchmark.
So this is not a single index access
where we just process a couple of tuples and are done.
In this query, for scale factor 1, I think we have six million rows —
so a lot of rows.
The filter you can see here on the ship date does not filter out many tuples.
So I think almost six million tuples go into this quite large aggregation
that you see on top. We aggregate, or group by, two columns, return flag and line status,
and then have a lot of sums, averages, counts, and so on. So you see there's
a lot going on, and there's a large number of tuples being
processed in this aggregation.
And what the authors did in this paper
was profile the code and basically check
where the time goes if we execute this analytical query on MySQL.
First of all, there are a lot of function calls, right?
We have six million tuples,
and much of the time goes into function calls —
we have 800 million function calls.
Which makes a lot of sense:
if you have a complex tree
with many functions, and you process tuple by tuple, you
will just have a lot of individual functions being called.
And the most drastic number, I think,
is that we only spend 10% of the overall time actually
processing the query.
So aside from actually doing the aggregates, the sums,
the counts, and so on, all the other time
is spent moving tuples to the next operator,
getting some field of our tuple, and only then doing the actual calculation.
What we can also see here, in the fourth column, is that we have
a very low instructions-per-cycle ratio.
So we can't process a lot per cycle.
And one reason is that we have a row store.
We can't have specialized aggregation functions
for each and every tuple layout that we might encounter,
because in a row store you can have an arbitrary number
of columns per tuple,
and every field in this tuple can have a different type.
So we need to have polymorphic operators
that can deal with that —
that can deal with arbitrary-looking tuples.
So you will always have a lot of function calls
unless you compile the code.
And for compilers, such a situation
is really hard to optimize.
We usually have a low IPC rate,
so we have empty pipelines —
there are not many instructions
that the CPU can run concurrently.
It's hard for the compiler
to optimize across functions,
and we can't use SIMD, or at least the compiler can't auto-vectorize.
There is work on how to use SIMD in those database systems, but it's just much harder
than in other cases.
If you take a look at this plus function — let me go back here.
You can see it here —
can you see it?
Yeah.
There's one addition in all these functions.
There's one plus operation.
So you can see here that there is this plus function,
which is called six million times,
which kind of makes sense, right,
for the six million rows that we process.
But if you now just take this addition,
this plus function, it takes 38 instructions.
And we have a relatively low ratio of instructions per cycle here.
That means executing this plus function
takes 48 cycles in this case.
Compare that to something like doing it in assembly,
knowing where the data actually is, and with some other assumptions:
we could theoretically do that in three instructions.
This database system — MySQL in this case — takes 48 cycles.
So that shows there are a lot of things being done that need to be done, but that we could
probably do a little bit more efficiently.
We don't need to have that many dependent instructions, which alone cost us 20 cycles
here, and we should find a way to
get rid of this overhead.
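As an illustration of where those extra instructions come from, here is a hypothetical C++ sketch (not MySQL's actual implementation): an interpreted, row-store-style plus pays virtual dispatch, null handling, and generic value boxing per tuple, while generated or specialized code can boil the same work down to a tight loop of plain additions.

```cpp
// Hypothetical illustration (not MySQL's actual code) of why an interpreted
// per-tuple "+" costs far more than the addition itself.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Value { int64_t i; bool is_null; };

struct Item {                          // generic expression node, as in a row store
    virtual Value eval() = 0;
    virtual ~Item() = default;
};

struct ConstItem : Item {              // stand-in for a generic column accessor
    Value v;
    explicit ConstItem(int64_t x) : v{x, false} {}
    Value eval() override { return v; }
};

struct PlusItem : Item {               // interpreted "+": evaluated once per tuple
    Item* left; Item* right;
    PlusItem(Item* l, Item* r) : left(l), right(r) {}
    Value eval() override {
        Value l = left->eval();        // virtual call
        Value r = right->eval();       // virtual call
        if (l.is_null || r.is_null) return {0, true};
        return {l.i + r.i, false};     // the actual work: a single add
    }
};

// What generated or specialized code can do instead: a tight loop of plain adds
// that the compiler can unroll and auto-vectorize.
void add_columns(const int64_t* a, const int64_t* b, int64_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

int main() {
    ConstItem a(40), b(2);
    PlusItem plus(&a, &b);
    std::printf("interpreted: %lld\n", static_cast<long long>(plus.eval().i));

    int64_t x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, z[4];
    add_columns(x, y, z, 4);
    std::printf("compiled-style: %lld\n", static_cast<long long>(z[3]));
}
```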
One way to do that is the materialization model.
The idea here is that we don't have a single tuple bubbling up the query tree every time;
each operator in the query tree processes its entire input at once.
So only one operator is running —
well, if we have different paths in the query tree,
there might be multiple operators
that run concurrently,
but within a single pipeline, only one operator is running.
And this operator processes all of its input
until it has completely calculated its output.
There is no pipelining of tuples, obviously, and we need to materialize the entire intermediate result,
which means we have a lot of data
residing in main memory. An example can be seen here. Instead of calling, as at the very top
here, 'for my child, get next', which would call some generator function
that gives us just one tuple, we now call the output function, which basically blocks and says, okay,
I need the entire output of my input operator.
In this case, for the join:
to build the hash table, the join says,
I need the tuples of the left input,
so I ask for the entire output
of the left input operator.
And this is the scan marked with a three.
And this scan would not emit single tuples.
It would process the entire table R
and then forward this table R to our join operator.
There are some tricks that are usually done
so you don't have to materialize the entire table,
but in general, that's basically the approach
that is being taken here.
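A minimal C++ sketch of this operator-at-a-time style, with made-up column data and simplified operators (not MonetDB's actual code): each operator consumes its whole input and materializes its whole output before the next one runs.

```cpp
// Sketch of operator-at-a-time (column-at-a-time) execution: full input in,
// full output out, intermediate results live entirely in memory.
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Column { std::vector<int> id, value; };

// Selection: one tight loop over the whole input, full output materialized.
Column select_gt(const Column& in, int threshold) {
    Column out;
    for (std::size_t i = 0; i < in.id.size(); ++i)
        if (in.value[i] > threshold) {
            out.id.push_back(in.id[i]);
            out.value.push_back(in.value[i]);
        }
    return out;                                   // intermediate result stays in memory
}

// Hash join: again whole input in, whole output out
// (simplified: the output carries only probe-side columns).
Column join_on_id(const Column& build, const Column& probe) {
    std::unordered_multimap<int, int> ht;         // id -> value of the build side
    for (std::size_t i = 0; i < build.id.size(); ++i) ht.emplace(build.id[i], build.value[i]);
    Column out;
    for (std::size_t i = 0; i < probe.id.size(); ++i) {
        auto range = ht.equal_range(probe.id[i]);
        for (auto it = range.first; it != range.second; ++it) {
            out.id.push_back(probe.id[i]);
            out.value.push_back(probe.value[i]);
        }
    }
    return out;
}

int main() {
    Column r{{1, 2, 3}, {5, 7, 9}};
    Column s{{1, 2, 3}, {150, 50, 200}};
    Column filtered = select_gt(s, 100);          // operator 1 runs to completion
    Column joined   = join_on_id(r, filtered);    // only then operator 2 runs
    std::printf("result rows: %zu\n", joined.id.size());
}
```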
Okay, so the advantage here is that we have much fewer function calls.
Go back to the scan here, for example, and take a look at the operator marked
with a four.
You can see that we now have this very, very tight loop.
We have the entire input that we got from the child operator,
and now we just have a very tight loop over each single tuple.
If it matches, we put the tuple into the output buffer.
If it doesn't match, we immediately go to the next tuple.
This should probably look familiar from the current exercise that you're doing
on SIMD.
Such a tight loop is something that the compiler likes, and it's
much better at optimizing the code for such tight loops.
And we have much fewer function calls, because the interface
between operators is just 'give me your output', and I get the entire output. Yeah,
so when we have such loops, the optimizer can do all the tricks that it usually
does: loop unrolling and auto-vectorization. And the CPU, on the other hand, can do hardware prefetching
and better branch prediction.
So, on the one hand, there's the compiler, for
which it is now much easier to generate efficient code.
And on the other hand, the code that is generated is just better for the CPU to process than in the iterator model.
But of course, there are also problems here.
We have tight loops, so that's good,
but the intermediate results might be really large.
Assume you have a large join that is not as selective as you might have expected,
so the join result is really huge.
Now, if we process operator at a time — the entire input,
the entire output of this operator — then we have a huge intermediate result
being stored somewhere in memory.
From a memory-consumption perspective alone, this can be a problem,
and it also means that the first outputs that you have produced
are probably no longer in the caches if the intermediate result is large.
So the question was: is there something like a middle ground?
Can we still have our tight loops,
which the optimizer likes and the CPU likes,
but without the data that we hand to the next operator
always falling out of the cache?
And the approach that people came up with
is vectorized execution.
I think we're going to have...
Let's take a break a little bit later.
So, the vectorization model,
or vectorized execution.
The idea here is that we still use,
or again use, the Volcano-style model.
So we have this next call: we start at the top and then call next on the
previous operator, on our input operator. But in
contrast to the Volcano model, we don't pass single tuples, so we don't have
this function-call overhead for each single tuple.
Instead, we retrieve batches.
So the previous operator, when next is called, tries to give us a batch of a certain number of rows.
The first system that proposed this vectorization model was MonetDB/X100.
They basically came up with this idea.
We'll later see that it's actually not that easy to find this batch size. The
motivation here is that when you create a batch of tuples, for example, with your filter,
ideally the next operator should retrieve this batch,
and when the next operator works on this batch, the batch should still be in the caches.
So ideally you have a batch size that fits into your L1 or L2 cache —
probably not into registers, but into some cache that you can
efficiently access. But depending on how your query looks, how large or
how wide each tuple is, it can be quite hard to find this batch size. So this is
configurable, and the task of the system is now to find
this batch size per query,
so that, once an operator has finished a batch
and the next operator processes it,
this batch is still entirely in the caches.
Contrast this with the materialization model, right?
If you remember, when we process the entire table
or the entire input of a large table, the first tuple that your selection yielded is probably not going to be in any cache anymore. Okay, so how does it look for our
example query here? We see again that we call output on the left input, but now, at the very
bottom here in three, we basically have this output buffer, and now batches
bubble up — not single tuples, but batches, or vectors, of tuples.
And the same for both sides.
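As a small illustration, here is a minimal C++ sketch of vector-at-a-time execution (the batch size, operator names, and data are illustrative assumptions, not X100's code): next() fills a batch instead of returning one tuple, and the filter runs a tight loop per batch.

```cpp
// Sketch of vector-at-a-time execution: the iterator interface stays, but each
// next() call hands over a (cache-sized) batch of tuples instead of one tuple.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::size_t kBatchSize = 1024;          // assumed, tuned per query in practice
struct Batch { std::vector<int> id, value; };     // small columnar chunk

struct VecOperator {
    virtual bool next(Batch& out) = 0;            // fills out, false when exhausted
    virtual ~VecOperator() = default;
};

struct VecScan : VecOperator {
    std::vector<int> id, value;
    std::size_t pos = 0;
    bool next(Batch& out) override {
        out.id.clear(); out.value.clear();
        if (pos == id.size()) return false;
        std::size_t end = std::min(pos + kBatchSize, id.size());
        out.id.assign(id.begin() + pos, id.begin() + end);
        out.value.assign(value.begin() + pos, value.begin() + end);
        pos = end;
        return true;
    }
};

struct VecFilter : VecOperator {
    VecOperator& child;
    explicit VecFilter(VecOperator& c) : child(c) {}
    bool next(Batch& out) override {
        Batch in;
        while (child.next(in)) {
            out.id.clear(); out.value.clear();
            for (std::size_t i = 0; i < in.value.size(); ++i)   // tight, vectorizable loop
                if (in.value[i] > 100) {
                    out.id.push_back(in.id[i]);
                    out.value.push_back(in.value[i]);
                }
            if (!out.id.empty()) return true;     // hand one cache-sized batch upwards
        }
        return false;
    }
};

int main() {
    VecScan scan;
    for (int i = 0; i < 5000; ++i) { scan.id.push_back(i); scan.value.push_back(i % 200); }
    VecFilter filter(scan);
    Batch b;
    std::size_t total = 0;
    while (filter.next(b)) total += b.id.size();  // the root still pulls, batch by batch
    std::printf("qualifying tuples: %zu\n", total);
}
```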
Right, okay.
And yeah, this is still the Volcano model,
so we still pull.
We don't, as in the materialization model,
take the entire input and push it upwards;
we still pull.
So from the root, we go down
and call next, next, next.
So this vectorization model was, or is, an improvement. It
really tries to combine the advantages of the materialization model and the
iterator model. It reduces the number of invocations per operator — not as much as the materialization model, but usually it's enough to amortize the costs.
As we still have tight loops, we can use SIMD, and the CPU can do its prefetching and branch-prediction magic.
So we still have code that is efficiently executable on modern CPUs,
but the size of the data that we send around in the query plan is so small
that it's usually within the caches. And yeah, again, the vector size is really
the crucial thing here: how to select the vector size.
It should be large enough
to amortize the iteration overhead.
If we have a vector size of one,
we just have the Volcano model again, right?
And if we have just maybe a dozen,
the runtime is probably still dominated
by calling next and by the interpretation code.
If it is too large and
our batches are too large, then the next operator will have to go to DRAM because the batch is no
longer in the caches. The paper this comes from is more than 10 years old, I think,
so this number might be a little bit outdated
for modern CPUs, but what they found on this older CPU
was that for Query 1 —
remember, these were pretty wide tuples, right?
The last operator needed like five, six columns
to calculate all the sums and the averages and so on,
so we have rather large, or wide, tuples
that go all the way up — the authors found that vector sizes, or batch sizes, of roughly 1,000 to 4,000 tuples
give the best runtime in this case. And this is a logarithmic scale, so you can see that the batch size of one —
the classical Volcano model — is 100 times slower in this case.
And it also gets worse again if your batches are too large; remember, this is logarithmic.
So yeah, there's a huge difference between well-chosen and badly chosen batch sizes.
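As a back-of-envelope illustration (all numbers here are assumptions, not measurements from the paper), one can estimate an upper bound for the batch size from the cache budget and the tuple width:

```cpp
// Back-of-envelope sketch: pick the vector size so that a batch produced by one
// operator still fits in the cache when the next operator consumes it.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t l2_bytes        = 512 * 1024;  // assumed per-core L2 budget
    const std::size_t bytes_per_tuple = 6 * 8;       // e.g. six 8-byte attributes in flight
    const std::size_t buffers_live    = 2;           // producer output + consumer input

    std::size_t max_batch = l2_bytes / (bytes_per_tuple * buffers_live);
    std::printf("batch size should stay below roughly %zu tuples\n",
                max_batch);                          // ~5461 here; wider tuples -> smaller batches
}
```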
And this table basically shows a comparison
of the three models:
the iterator model — the Volcano model —
the materialization model, and the vectorization model.
For query plans, we see that, since vectorization uses the same
interface — this iterator-style next call of the Volcano model —
query plans are quite simple, while they can be more
complex in the materialization model. Instruction cache utilization: this is the problem that we found in the
iterator model, so there it is extremely poor. For the materialization model, it is
extremely good, because we have only tight loops and the minimum amount
of calls between operators — we only pass the result and never come back
to an operator. And for vectorization, depending on the batch size, it's still very
good. Function calls, similar: we have many function calls in the iterator model, the fewest
in the materialization model, and a little bit more in the vectorization model. Then,
attribute access. The iterator model is usually run on row stores,
so this is rather complex.
If you have a column store,
you can create code that only accepts,
or that is optimized for, certain input data,
and so the attribute access is direct in the other two cases.
And where is most time being spent? Again, we're talking about analytical TPC-H queries here, right?
Most time in the Volcano model is spent on interpretation,
while it is spent on processing, as we ideally want, in the other two models.
In other cases — and this is the reason why the Volcano model is still used — you are probably bound by transaction processing, you have concurrency control,
you have your disk accesses. So please bear in mind that, depending on what your actual
workload is, this table might look very different. We're talking about analytical systems right now. CPU utilization for the iterator model is quite poor because we
have a lot of stalls. This is much better for the materialization and the vectorized
model. Compiler optimizations are now possible with both modern models because
we have those tight loops that compilers like and can automatically unroll and optimize. But now the main point,
the reason why the materialization model is no longer used that much, is the
materialization overhead. It is very low for the iterator model, it's
cheap for vectorization because it's only the batch size,
but it can be really expensive for the materialization model —
for example, if you have large joins or large input tables.
Okay.
I'd say we take a short break now, for five minutes. Are there any questions concerning the first three models
that we talked about?
Okay.
So, the next model, or the most recent model
that there is for database systems, is code generation.
What people observed when working
with this vectorized model was that there is
still room for improvement.
We now don't have this interpretation overhead anymore —
or rather, we have much fewer function calls
compared to the Volcano model,
and we don't have those large intermediate result sets anymore —
but we still have interpretation.
We still have code that checks, for an aggregate,
which code do I need to call for this data type?
Is it nullable? Is it not nullable?
If we have expressions like the one shown in this example here — the extended price column
multiplied by one minus the discount column —
there's still interpretation.
This code still needs to be interpreted,
so there's still interpretation overhead.
And ideally, we would generate,
if possible, code
that does just exactly this calculation
without any interpretation.
So this was the idea of code generation:
people wanted to generate
just exactly the code required to process a query.
And the first database that did this
was HyPer, a database that was developed at TU Munich,
the Technische Universität München.
It was bought by Tableau
and is now part of Salesforce.
So it's also quite successful, and it was the first
prototype, and later a real database, that heavily used code generation for queries. What they did in this project,
and still do, is try to find pipelines they can merge,
or fuse. So they take the pipelines in a query,
and where possible they fuse all the operators of a single pipeline and
generate code per pipeline, until there is a pipeline breaker. For example, if
there is a hash join, again, we need to have the hash table
before we can probe in the join.
This would be a typical pipeline breaker.
But they would generate code — tight loops — for the pipelines,
and then compile this query plan into machine code.
Also, they used a push-based model.
So we don't have this next call;
we push directly into the next operator.
When we are done with our processing, we push our result, for example, into the
hash join.
You can see an example here.
The query starts,
and the first thing we do — since we would like to pipeline the join tuples through, we need to have both hash tables already in place —
is the first pipeline: a tight loop that
iterates over R1, checks if our predicate matches, and if so,
materializes this tuple of R1 in the hash table.
The same is done for the second hash table,
so for the hash table of table R2.
And as soon as we have all the hash tables in place, we generate, again,
tight loops here.
So we generate the minimal necessary code, a loop
that goes over all the tuples in R3.
Then, for example, if there is a predicate, we check the predicate,
and if it matches, we don't materialize the tuple somewhere or push it somewhere
with a function-call overhead.
We directly go to the hash table and check:
is there a join partner for this tuple, yes or no?
If there is one, we go to the next lookup,
check for the other join, and if there is again a matching partner,
we can now emit this combined join result.
And this is very efficient.
This is just the minimal code
that is necessary to process this query —
or not minimal, but the most efficient code.
And what we can also see here is that the blocks
that are generated
don't really resemble typical operator boundaries. So we switch from an operator-centric
model — before, we always had the clear interface of operators in the query tree — to
a so-called data-centric execution. Because now the build and the probe can be parts of different
pipelines that are executed at different stages of this query.
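To make the data-centric idea concrete, here is a hand-written C++ sketch of roughly what the generated code for such a three-way join looks like (an illustration of the idea, not HyPer's actual output; data and predicate are made up): one tight loop per pipeline, operators fused within each pipeline, and results pushed directly to the consuming hash table or output.

```cpp
// Sketch of fused, data-centric pipelines: the loop boundaries follow the
// pipelines of the plan, not the operator boundaries.
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Row { int key; int payload; };

int main() {
    std::vector<Row> r1 = {{1, 10}, {2, 20}};             // illustrative data
    std::vector<Row> r2 = {{1, 100}, {3, 300}};
    std::vector<Row> r3 = {{1, 1000}, {1, 1001}, {2, 2000}};

    // Pipeline 1: scan R1, filter, materialize into hash table HT1 (pipeline breaker).
    std::unordered_multimap<int, int> ht1;
    for (const Row& t : r1)
        if (t.payload >= 10) ht1.emplace(t.key, t.payload); // fused scan + filter + build

    // Pipeline 2: scan R2, build hash table HT2.
    std::unordered_multimap<int, int> ht2;
    for (const Row& t : r2) ht2.emplace(t.key, t.payload);

    // Pipeline 3: scan R3 and probe both hash tables in one fused loop;
    // matching tuples are pushed straight to the output, nothing is materialized.
    for (const Row& t : r3) {
        auto m1 = ht1.equal_range(t.key);
        for (auto it1 = m1.first; it1 != m1.second; ++it1) {
            auto m2 = ht2.equal_range(t.key);
            for (auto it2 = m2.first; it2 != m2.second; ++it2)
                std::printf("join result: key=%d %d %d %d\n",
                            t.key, it1->second, it2->second, t.payload);
        }
    }
}
```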
What this HyPer database did was directly generate low-level virtual
machine code — LLVM code, or LLVM IR. It is somewhat similar to assembly:
they would yield not assembly code, but something very low-level,
comparable to assembly, yet still platform independent. With this LLVM IR, you would then
use the LLVM compiler infrastructure and compile it for the machine you're running on.
So you would still have almost assembly code, but you are still platform agnostic.
And the compilation is done when you have your query plan at hand.
So you have your query plan, you know which data types you have, you know exactly
what the current query does, you know all the predicates, so you can directly compile this into your
LLVM IR code.
And there's an interesting demo paper that talked about how efficient this is and what
can be done to make it more efficient.
Because, as we will see in a second, this takes some time.
Compilation takes some time; it is not free to compile code.
And you need to be aware of the compilation
time if you do something like that.
But we don't compile all the code that we need for
a single query. There's obviously also code that
we just reuse and link against — for example, the memory
management. This is all just whatever we have in our C++ libraries;
it is not compiled every time.
We only compile the code that is necessary
to compute the current query.
Yeah, this table shows HyPer's performance
and compares it against VectorWise.
VectorWise is the commercial product that came out of this X100 prototype that we have just discussed.
So VectorWise is the vectorization engine, so to say.
And here we have seven exemplary TPC-H queries.
You see the compile times on the left,
in milliseconds, and, well, they are quite low, right?
You see the runtimes on the very right for HyPer,
and in the second rightmost column the runtimes for comparison,
and you see that for this TPC-H example
we are in the range of seconds.
So each query takes seconds
to execute for HyPer, and for each query we pay an overhead
of a few dozen milliseconds for compilation. And in most cases we are faster than the
vectorized engine. So in this case the compilation overhead is very small and we have nice results.
If you look, for example, at Query 1, we improved from 33 seconds to 9 seconds.
So that's quite a result, but there are also queries where we are slower. Still, the compilation time is mostly low,
and in this case, we are better than VectorWise.
And HyPer is not the only system that does this.
HyPer was the pioneering system —
the first system that really did it
all the way for analytical queries,
not just in small steps;
they really did it for the entire query processing.
And nowadays, there are also other systems.
For example, if you use SQL Server —
the Microsoft SQL Server —
there's the Hekaton engine,
which also uses query compilation;
they create C code as the intermediate language
and then compile the C code.
So they really generate readable code;
they don't generate assembly.
Cloudera does that for parts:
they don't compile the entire query, but,
for example, they compile the predicate evaluation
if it is a complex predicate.
Postgres has JIT compilation.
SAP HANA has a new engine that uses query compilation,
the HEX engine. And many other systems do so nowadays as well.
But what people found in the years after HyPer was that for certain queries, especially
OLTP — imagine a large OLTP system where
you have tens of thousands of queries per second —
a millisecond of overhead for compiling is not acceptable.
You can't do that.
If a query runs much faster than a millisecond,
this compilation overhead is just too large.
So people needed to come up with ideas for
a middle ground. For large
OLAP queries — for queries that run for seconds or for a long time —
the compilation overhead is usually worth it because the code
is more efficient, so it's a good trade-off to invest in this.
For other queries, this might not be the
case, and then interpretation can be faster because there's no compilation at all.
And what you can see on the right here is what the state of the art is —
I hope you can see that. For example, Hekaton is shown here. You have a query plan,
and what Hekaton does is generate
C code and compile it.
So this is expensive,
even though Hekaton is built for transactions.
They do this mostly for stored procedures,
so they don't do it on the fly.
In general, they do it when you create
your stored procedure — in this case, you prepare the function,
and there's enough time to do that.
And then, hopefully, when you have a stored procedure,
it is called quite often.
So yeah, the compilation overhead
is worth it in this case.
And Umbra, on the right side here in red,
is basically the successor of HyPer at TU Munich.
What they now do is they have this Umbra IR,
so they no longer generate LLVM IR directly.
And they now have different ways to execute it. They can say,
okay, I compile to C++
and let the compiler do all the magic —
all the loop optimizations
and all the tricks that compilers do.
This is the slowest path, but it
probably gives you the most efficient code.
They can also go to LLVM IR,
which can be interpreted —
so they can run this LLVM code without compilation —
or which can be compiled,
which is a little bit faster to execute. Or they directly execute their IR code.
So they have different stages, and whatever they think they need — if they
think it's an expensive query and investing in compilation is worth it,
they go this way and use all the capabilities of modern
compilers; otherwise, they don't.
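A hypothetical sketch of that adaptivity trade-off (not Umbra's actual logic; the thresholds and backend names are made up): choose a backend by weighing the expected query cost against the compilation latency.

```cpp
// Hypothetical backend selection: cheap queries are interpreted, expensive
// queries amortize the cost of a fully optimizing compilation path.
#include <cstdio>
#include <initializer_list>
#include <string>

std::string choose_backend(double estimated_query_ms) {
    if (estimated_query_ms < 1.0)   return "interpret IR directly";      // OLTP-style queries
    if (estimated_query_ms < 100.0) return "fast, unoptimized codegen";  // small OLAP queries
    return "fully optimized compilation";                                // long-running queries
}

int main() {
    for (double cost : {0.2, 25.0, 5000.0})
        std::printf("%8.1f ms -> %s\n", cost, choose_backend(cost).c_str());
}
```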
So yeah, there's a typical trade-off here:
is it worth the added latency
of compilation or not?
And when we talk about this whole query compilation topic,
there are also other projects, for example, Mutable,
which is also a German database project
that has been presented recently.
They generate WebAssembly, which can be executed very efficiently by most
modern web browsers nowadays, which is also super interesting —
just as a side note. And this comparison is probably a little
bit hard to read, but what the authors did here is compare different
generation targets. As I already said, we have bytecode, which is comparable to
SQLite. We have those fast-start engines, for example Flying Start, which
have a small overhead. Or we go all the way with C++
and let the compiler do all the magic.
And what you can, I hope, see a little bit here is that the fast engines — the interpreted code
and this Flying Start engine — don't do a lot of optimizations.
Their aim is just to get the query
running quickly.
This is a logarithmic scale:
you have very small overheads for compilation,
but you don't get the best,
the highest throughput possible,
at least not for all queries.
And then we have the other two variants,
for example, the optimized LLVM one,
or the yellow one here, C++
with -O3 optimizations —
so all the optimization passes we have in modern compilers.
This takes significantly longer to compile,
but also gives you overall the best throughput.
Whether it's worth it totally depends, right?
The throughput increase
is often not very large,
but if your queries run for seconds or minutes,
you probably don't care.
If your queries run for a shorter time, then yeah,
you have to decide whether the red one or the blue one
is the generation target
that you want to go for.
Okay, so now the question is,
what is the better approach?
There is VectorWise, which uses the vectorization model,
and there is HyPer, which does the query compilation.
And it is really hard to compare
two different full systems, because they differ in
the way they access disk,
how well their query optimizers work,
and so on and so forth —
it's really hard to compare the systems.
So researchers came up with
and built a new system, and wrote a paper about it, where they combined the two
approaches. This is not a fully fledged database; they just built a minimal
system that could mimic both approaches, and they coined those approaches 'Typer',
for HyPer, and 'Tectorwise', for the VectorWise code. The paper is called 'Everything
you always wanted to know about compiled and vectorized queries,
but were afraid to ask' —
so, a very short title.
So very short title.
And the results that they found was that it really
depends on your queries.
So scale factor one is quite small.
So, it's just one gigabyte of data.
Probably, so, there's already an overhead
for a query compilation that you need to consider here.
But depending on what the query,
extra query does, different systems are better.
So, for example, we have Q1 that you have seen before.
We have a lot of, very have a very large aggregation.
We have QV6, where the aggregation is just minimal.
We have very selective filters.
So we have three filters in this QV,
on a single table, no join.
And these filters, they remove most rows pretty early.
We have QV queries three and nine
which have multiple joins, large joins.
And we have query 18 which has an aggregate
and the huge generaties, so it is almost a key column.
So one and a half million groups.
And you can see the results here.
And again, there's no clear winner if you look at the results.
For Quby 1, the generated code of this type engine is faster because the expressions are
all fused and generated very efficiently. But for other QBs, for example, QBs 6, 3, and 9 here,
the TectorWiseEngine1.
So it always depends on how the QB looks,
which engine might be better.
And in this table, they show the results of profiling the two engines.
The reason why Typer,
the generated code, the code generation,
was so much better here is that, since we have
this very tight loop with all the expressions
already fused into it
and code generated for the expressions, we have very good
data locality — we have the data already in the registers when we calculate the results.
Tectorwise, on the other hand, often needs to materialize, which can be really expensive in this case.
But,
well, it depends on the queries.
For example, here, for Query 1,
which is very compute-bound, code generation
is much more efficient.
But on the other hand, we have Query 9, and here we have an example where the Tectorwise, so the vectorized, engine is more
efficient. And the reason for that is that this query is dominated not by an aggregation,
where we can generate very efficient code, but by probing the hash table.
And here, the Tectorwise engine has a tight loop
over the probing phase.
And in this tight loop, the CPU can keep many outstanding loads in flight.
So we are less bound by the DRAM bandwidth,
and we can basically hide the probing latency, the access latencies, because we can issue many outstanding reads, memory loads, in this case, if we have this tight loop.
But if our merged, generated code does all the operators at once, it's harder for the CPU to issue those outstanding loads.
So this is an example where the vectorized code performs better.
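Here is a small C++ sketch of why the vectorized probe hides memory latency better (the hash-table layout, hash function, and sizes are illustrative assumptions): hashing and probing run as separate tight loops over a batch, so the probe loads are independent of each other and the out-of-order CPU can keep many of them in flight, whereas fused per-tuple code interleaves the probe with downstream work and serializes the misses.

```cpp
// Vectorized-style probe: stage 1 computes all bucket positions, stage 2 probes
// them in a tight loop of independent loads.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Bucket { int key; int payload; };

void probe_batch(const std::vector<Bucket>& table, const std::vector<int>& keys,
                 std::vector<int>& out) {
    const std::size_t mask = table.size() - 1;         // table size is a power of two
    std::vector<std::size_t> pos(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)      // tight loop #1: hash only
        pos[i] = static_cast<std::size_t>(keys[i]) & mask;
    for (std::size_t i = 0; i < keys.size(); ++i) {    // tight loop #2: independent loads
        const Bucket& b = table[pos[i]];
        if (b.key == keys[i]) out.push_back(b.payload);
    }
}

int main() {
    std::vector<Bucket> table(1024, Bucket{-1, 0});
    table[42 & 1023] = {42, 4200};
    std::vector<int> keys = {42, 7, 42};
    std::vector<int> out;
    probe_batch(table, keys, out);
    std::printf("matches: %zu\n", out.size());         // fused per-tuple code would
                                                       // interleave this probe with later
                                                       // operators and serialize the misses
}
```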
Here is a comparison of these two approaches, code generation and vectorization,
now along different metrics. Computation-wise,
we usually get more efficient code
if we compile the code,
because we can very often keep most data
in the registers,
and we don't have any interpretation overhead
for expressions and so on. So computation is very efficient if we generate code just for this particular query.
But when we talk about parallel data access,
depending on how complex the query is
and how well the CPU is able to hide memory latencies,
code generation can be slower.
In this case, the vectorized engine, Tectorwise, might be able
to issue more concurrent memory loads,
which can be better here.
SIMD vectorization is usually where vectorized engines
are better, because, since the code was compiled beforehand,
the compiler can run all of its optimization passes —
there's a lot of time for the compiler
to do this auto-vectorization.
Compilation-based engines usually don't do it.
They could generate SIMD code similarly, right,
or take the time to compile it with full optimizations,
but then again you have the problem of long compilation times.
Parallelization is usually good for both.
For OLTP, it depends, right?
If you can generate your stored procedures upfront
with code generation, this is probably the best you can get.
But if you can't generate them upfront,
then you have to pay the compilation overhead.
Language support depends. And I think the most important point here is profiling and
debugging. For the
vectorized engine, it's just your C++ code as you know it, as you have to write it
in the exercises, right? Next week we will talk about profiling,
but this is probably what you all know — how debugging works and so on. Now, if you
generate code on the fly, this is a very different beast to debug and
to profile. So you need
a lot of additional tooling around that to be able to debug and probably profile it.
Any questions about code generation?
Okay.
I think I'll go back to the table in a minute.
Okay, so now a short summary.
First of all,
we talked about the iterator model today.
This is the standard model.
For analytical systems,
it is a slow model;
I think it isn't used in any of the modern systems
for OLAP workloads,
but it's still the model used
for other database systems,
for a typical Postgres system and so on.
So this is still a model that is widely used,
but it's usually not a good fit if we talk about OLAP workloads.
Then we talked about three models that are geared towards analytical systems,
so models that are better and more efficient for those workloads.
The first is the materialization model,
which was the first approach — the MonetDB approach.
And then we have vectorization and code generation,
which are basically the two approaches
that all the state-of-the-art systems use.
So if you think about Snowflake, Databricks,
whatever database you have in mind there,
they usually either generate code
or use something like this vectorization model.
Are there any questions so far?
If not, we will continue with data structures tomorrow.
But first, let me ask you a few questions.
Let's assume there were a fourth column.
This table here comes from a PhD thesis from 2009,
so there was, I think, no code-generation database
at that point in time.
But can you help me fill it in?
What do you think the fourth column would look like
if we now added code generation here?
How would that look for code generation?
So let's start with instruction cache utilization.
How would that look for code generation?
I'd like to have this table filled before we finish.
So yeah.
The problem of the iterator model was that we have a lot of code that is running concurrently, right?
Basically, we go through all the operators, and each operator is usually huge because it needs to be capable of processing whatever data there is.
For code generation, we only generate the minimal amount of code, because we know the column that we are going to process
is, say, an integer column encoded in this and that way,
so we don't have to have any switches for data types,
for encodings, and so on, or for null checks.
So there are very few instructions,
and, since we also have tight loops,
there is usually very good instruction cache utilization.
Function calls?
Nobody?
Guesses?
Yeah?
I think it depends on the code that we generate, of course.
And we can do it in a good way and in a bad way.
But I think we can do it in a way that the function calls don't hurt very much.
Yeah, exactly.
So we usually try to — yeah, as I said, it depends on which code is generated.
In some cases, it might make sense to still have function calls for complex things, and
we still have to call functions if we access external data, SSDs, and so on.
But for the main code that we process there, we try to avoid any function calls.
We try to inline as much as possible, so this will be as few as possible, basically.
Attribute access is direct. Where do we spend most time for code generation?
There could be two answers.
Where do we spend most of the time?
Is it interpretation?
Mm-hmm. I think if the data that we have to process is small,
then we will spend most of our time on generating the code
and making the code efficient.
If the data that we have to process is large,
then probably most of the time will go into processing and loading.
Exactly.
Yeah.
So, for small queries, OLTP, we spend most time on preparing the query — generating and compiling the query.
If we have a complex query, we spend the time on processing.
Exactly.
CPU utilization — I guess it should be clear
that this is usually very good here.
Compiler optimizations are applicable, right?
This was basically the last thing we talked about:
depending on which backend you use,
you can basically say, okay, I want to use all
of the compiler optimizations
if you think it's worth investing the time, or you don't do that.
The materialization overhead is very cheap, because you usually try not to materialize anything.
Just to maybe make that clear again: the materialization overhead was huge for the materialization model — or can be; it also depends on the query. But there are some cases where we just have to
materialize. If you have a classical hash join — and most databases
use hash joins because they're the most efficient join — and your build side is
huge, then you need to build the entire hash table before you can continue.
In that case, the other approaches also materialize: if this pipeline breaker,
the hash join, has a huge input, then they all have a huge input to materialize —
in this case, the hash table. So other models, apart from the
materialization model, can also have huge intermediates. As usual, it always depends.
But the largest materialization overhead
is usually in the materialization model.
And scalability is good as well.
Okay.
This was it for today.
Tomorrow we see each other again for data structures.
Have a good day. Bye bye.