Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Vectorized Execution (2)
Episode Date: May 16, 2023...
Transcript
Okay, so welcome everybody to our next session on vectorization.
So today we're going to finish up SIMD.
So you've also already started to program this.
You know the task, what you have to do.
And we'll talk a bit more about the kind of operations that you can do with SIMD.
So this is kind of the last part of the...
Where do I have the overview?
This lecture...
Let me find my cursor.
So, we're going to be in this kind of space here.
And even though we're probably not going to go all the way to the end of the lecture slot,
this is going to be everything that I'm going to do today,
because I feel a bit under the weather.
So, we're going to have an earlier stop today,
which I guess you also don't mind, right?
So, one thing that I still wanted to make you aware of is the following.
I already reached out to some of you, but we're still looking for student assistants. I guess most of you
already know this, but if you don't, and if you're interested in working on
these kinds of topics, and also in teaching them, let me know, because I think we have some interesting stuff
going on in the group and we could really use some help.
So with that, let's go back and go right...
Let me just find this... Right to where we stopped. So we talked about generic intrinsics.
No, this is too far off.
We talked about the explicit vectorization.
So this is where I left off last time,
where you already saw how you can use these SIMD registers
and how you can use the individual intrinsics.
And these intrinsics, these SIMD ISAs,
are specific to the different compilers
and to the different CPUs, as we said, right?
So this would be the x86 way of doing SIMD, using SSE in this case, so
using the 128-bit registers. And if you want to use different register widths — 256 or 512 bits — you will again have to program against those registers in order to really use these units.
The compiler won't do much with these intrinsics, so this is basically just a function wrapping
the assembly code.
The compiler will just directly translate this.
And if you do it right, you will get good performance,
but only on the specific CPU
and only on the specific units
that you're programming against.
So one thing that we use a lot in database operations is masks.
So that's something that I want to give a bit of or spend a bit of time on today.
And this is basically a way to do conditional execution in SIMD.
Because, as I already touched on last time, you cannot really do branches in SIMD, right?
Because, I mean, you would have to execute all different branches depending on what are the individual branching conditions in these different registers.
So it doesn't really make sense to have multiple code paths.
Won't work.
So what you have to do is adopt this predication way of
thinking about the data, and one way to do this is to work with masks. That
basically means saying: I'm only selecting a subset of the elements of the
vector that I'm working on. So you can say, for example, I only want to add a certain subset of the elements in my vector in a certain operation.
And from AVX-512 on, basically almost all operations support some kind of masking.
So one way would be zero masking.
You can basically say, I'm selecting certain elements in my vector that I want to work on,
and the others are all going to be zero after the operation.
And something else is you can also just keep the values that you had before.
So say, for example, you want the vector to stay the same in certain locations.
Other locations, you want to do an addition.
So this would be an example that we have here.
But here, for example, with zero masking:
say we have two vectors, A and B,
and we're adding them with a mask, using zero masking.
So we're saying that the elements that are zero in the mask
are going to be zero in the result,
and the other ones are going to be added up.
So then we have these two vectors,
we have our mask,
and the result will basically only contain zeros and the added-up elements,
with the added-up elements being where our mask is 1.
And this is basically one of these operations
that you could have here.
So this is an AVX-512 operation.
As you can see, in this case we're
going to do an integer addition with zero masking.
And this means we give this operation the mask, which
can have these zeros and ones,
and the two input vectors, and the result will be our output vector, which will be written into the register.
So this also gives us a way of saying: okay, we're selectively working with some of the data. So if our condition is
right, then we can select the elements in the vector where our
condition holds, and the other ones we can ignore for the rest
of the processing. So this is basically the same idea as doing this kind of predication.
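As a minimal sketch of what that looks like with the AVX-512 intrinsics (the mask value here is just an example):

```cpp
#include <immintrin.h>

// Zero masking: lanes whose mask bit is 0 become 0 in the result,
// lanes whose mask bit is 1 get a[i] + b[i].
__m512i masked_add_zero(__m512i a, __m512i b) {
    __mmask16 mask = 0b0000000000001010;     // example mask: only lanes 1 and 3 are active
    return _mm512_maskz_add_epi32(mask, a, b);
}
```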
And we can also do blending, so we can say we're selectively merging different
kind of vectors, right?
So basically just merging two vectors, given a mask where we keep some elements of the old vector,
or we just place in some elements of the new vector.
Or we could say, for example, selectively add and keep the rest as we had.
Say, for example, here we have a mask with blending. So see here, for
example: for the source vector, those elements which are zero in the mask
we're just going to keep; those which are one in the mask we're going to add up
with the other vector. So we only look at the elements in
the second vector where the mask is one.
And there we're going to do an addition.
The rest we're just going to keep as is.
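Again as a small, hedged sketch — the merge-masking variant, where the unselected lanes keep the values from a source vector:

```cpp
#include <immintrin.h>

// Merge masking: lanes whose mask bit is 0 keep the value from src,
// lanes whose mask bit is 1 get a[i] + b[i].
__m512i masked_add_blend(__m512i src, __m512i a, __m512i b) {
    __mmask16 mask = 0b0000000000001010;     // example mask, as above
    return _mm512_mask_add_epi32(src, mask, a, b);
}
```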
Actually quite simple,
but translating your operations
or your database operators
to something like this
requires a bit of rethinking of how you actually do these database operators. And we'll come to this in a bit.
One extra thing that's super useful in this setup,
and that you will need as well, are compress and expand operations.
These are, let's say, orthogonal to each other.
So the compress operation basically says:
I'm going to select just some of the elements in my vector,
and I'm going to move them to the front of the vector.
This is, for example, if you're doing something
like a scan, right?
Or if you want to select some elements,
you're doing some conditional operations,
so you're trying to figure out which elements actually match whatever you want in your operation.
And then in the next step, you just want to work with those, right?
So you actually want to have them somehow close together,
or you want to store them in a tight array.
So you will compress them towards the front of the array.
And in the other direction, you might again just do some selective merging:
for example, if you already know, okay, we're just going to have some positions in the array
where we want to put the data from the vector, then we can decompress and basically move the data to different parts of our vector, right?
Of our register, in that sense.
So that's decompression, or expansion.
So there's specific operations just basically doing this.
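In AVX-512 these exist as dedicated intrinsics; a minimal sketch, with example masks:

```cpp
#include <immintrin.h>

// Compress: the lanes selected by the mask are packed to the front of the result,
// the remaining lanes are zeroed.
__m512i compress_example(__m512i v) {
    __mmask16 keep = 0b0000000000000110;              // keep lanes 1 and 2 ("B" and "C")
    return _mm512_maskz_compress_epi32(keep, v);      // B, C, 0, 0, ...
}

// Expand: the values from the front of the input are spread to the lanes
// selected by the mask, the other lanes are zeroed.
__m512i expand_example(__m512i v) {
    __mmask16 positions = 0b0000000000000110;
    return _mm512_maskz_expand_epi32(positions, v);   // 0, v[0], v[1], 0, ...
}
```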
Okay. And so this one I actually showed you last time already,
I just moved it to improve the flow here.
So we basically talked about this, right?
So all of these operations are super specific to the CPUs.
Right? So this is really, if I'm using AVX512, this means I can only do this on x86 CPUs and the
code will only run on x86 CPUs that have these units.
They won't run on any ARM, so if you write this code, you cannot run it on this machine,
because this is not an x86.
This is an ARM architecture.
You will basically need to rewrite this code with new instruction sets, new intrinsics.
So we had this, right? Basically our x86 version with 128 bits, and this would be the same thing in the ARM/Neon version.
And it's basically the same code, but you just need to duplicate everything.
So in order to get away or to move away from this, there's a couple of different strategies.
I mean, one way would be what we did all the way
in the first place is this autovectorization, right?
So one step is basically I'm just writing code
in a way that the compiler understands it.
I'm writing scalar code,
but the compiler can directly see this can be translated.
And for the simple addition, this will work.
So that's the most generic way.
And of course, then you need to have a compiler
that understands this.
And if the compiler doesn't see it,
well, tough luck, it won't work, right?
So, but you can check with Godbolt or stuff like that
if it worked or not.
So that's one way.
However, you can also use generic vector classes.
So there are different kinds of abstractions out there
so that you don't have to write specific intrinsics
for single CPU types or single CPU architectures. And this is either library-based or compiler-based.
And there's also a trend or there's an effort
to standardize within C++.
So then this would be like an extra way again,
but that's not there yet.
And if you do this, then basically you have a way to go back
or to do your coding across different kind of architectures.
And there's many different libraries, for example.
There's xsimd, the Vector Class Library, libsimdpp, etc.
And these are all linked, so you can actually try this out.
And they basically provide
classes representing vectors, which can then be manipulated using overloaded operators. So you can
basically use them just as you would use regular operators and classes in C++. And then underneath,
they provide an implementation,
so they do the same thing.
They basically have the same kind of code underneath here
as we would write it by hand,
but they will have this for the different kind of operations.
And if, say, for example,
the architecture doesn't have a certain type of instruction,
the library will provide an implementation of that instruction.
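As a rough illustration of the idea — not the API of any particular library, just a toy wrapper that maps an overloaded operator to an intrinsic on one platform and to a scalar fallback elsewhere:

```cpp
#include <cstdint>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// A toy 4 x int32 vector class in the spirit of these libraries.
struct Vec4i {
#ifdef __SSE2__
    __m128i data;
    Vec4i(__m128i d) : data(d) {}
    friend Vec4i operator+(Vec4i a, Vec4i b) {
        return Vec4i(_mm_add_epi32(a.data, b.data));   // maps to one SIMD add
    }
#else
    int32_t data[4];
    friend Vec4i operator+(Vec4i a, Vec4i b) {         // scalar fallback
        Vec4i r = a;
        for (int i = 0; i < 4; ++i) r.data[i] += b.data[i];
        return r;
    }
#endif
};
```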
And another thing, a different approach, again, is the compiler extension approach.
Meaning there are certain compiler extensions,
and these are used also in the compiler internally.
Whenever the compiler checks whether your code can be auto-vectorized, it will basically translate it to this internal representation and do its optimizations on that.
And you can directly program against these compiler extensions.
So these are the GCC vector extensions.
They are also adopted by Clang, but not, for example, by Microsoft Visual C++.
Let's go back, look at the libraries.
So we'll look at both independently. And I have to say,
both are cool, right? It's a nice way of doing this. But for the tasks that you
have to do, you will still have to touch intrinsics — the CPU-specific
intrinsics — to get really good performance, unfortunately. There are
some operations that are not in the libraries
and not in the compiler extensions,
so unfortunately you still have to touch the intrinsics.
In some next iteration of the course,
maybe this will be possible.
OK, so the SIMD libraries, for example,
they are an extra dependency.
And they can only work with the features
that the compiler supports.
Say, for example, AVX2 does not support 64-bit integer
multiplication;
then the library implements this using 32-bit multiplications.
An example would be the Vector Class Library.
When you're compiling for AVX-512,
the library may still keep the slow code:
the library will still have this translation,
and the compiler will use the manual multiplication logic.
So then, rather than using the AVX-512 operation,
you're going to fall back to the 32-bit multiplications.
So we can actually look at this.
So here we have the same kind of multiplication. This is basically the 64-bit multiplication that you would build from 32-bit operations:
you split each operand into two halves, multiply the separate parts, and then
add up the partial products. That's what you can see here. And if we're
doing the same kind of operation with this multiplication, it
will still come out with the same kind of code. So here we're
basically mimicking this library approach:
we're creating a function that does the same thing.
And even if we say you can use larger registers here,
we're still going to end up with the same kind of code,
with the same kind of complicated code rather than faster code.
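A hedged sketch of what such an emulation typically looks like with AVX2 intrinsics — the standard split-into-halves trick described here, not necessarily the exact code of any specific library:

```cpp
#include <immintrin.h>

// 64-bit lane-wise multiply emulated with 32-bit multiplies (AVX2 has no vpmullq):
// a * b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)
__m256i mul_epi64_avx2(__m256i a, __m256i b) {
    __m256i a_hi  = _mm256_srli_epi64(a, 32);         // high halves of a
    __m256i b_hi  = _mm256_srli_epi64(b, 32);         // high halves of b
    __m256i lo_lo = _mm256_mul_epu32(a, b);           // lo(a) * lo(b), full 64-bit
    __m256i lo_hi = _mm256_mul_epu32(a, b_hi);        // lo(a) * hi(b)
    __m256i hi_lo = _mm256_mul_epu32(a_hi, b);        // hi(a) * lo(b)
    __m256i cross = _mm256_add_epi64(lo_hi, hi_lo);
    return _mm256_add_epi64(lo_lo, _mm256_slli_epi64(cross, 32));
}
```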
However, there are efforts to standardize this in std::simd.
And there's also a link to the C++ reference, which
would solve these problems, basically, if the feature set is large enough.
So if basically all of the different implementations
are there, then this problem would be solved.
The other way of doing this explicit vectorization
in a generic fashion are the GCC vector extensions.
So what the compiler basically does internally is map
platform-specific types to some kind of internal representation. That's
done anyway: even though the intrinsics will be translated more or less directly
to assembly, there is a wrapper, a mapping to some
canonical internal representation, right? Because we want to be able to translate
different things.
And this basically decouples the compiler implementation,
so the different kind of optimizations,
how the compiler can do the autovectorization internally,
right?
So this basically is used for this framework
internally in the compiler.
And then basically, the compiler doesn't
have to be rewritten for the different kind of,
or at least the optimizations don't
have to be rewritten for the different kind of architectures
because we can do the same kind of optimizations
on different architectures in the auto vectorization
approach.
Or let's say the compiler can do this.
And we can reuse this by directly targeting this canonical representation in our code.
So the compiler can then do all the platform-specific translation while we're just writing our code generically in a vector-based way.
Then the compiler does the same thing as the libraries would do.
There's basically a natural representation using the standard operators
like multiplication, addition, etc.
and all kinds of comparisons.
So then we can just use them
as we would use them regularly in our code —
in our vectorized code, I have to say. Okay, so say for example we have our addition here.
So this is the same addition example as we always see it.
However, we're already using the GCC vector extensions here.
So this basically means we're saying we're using a vector type,
and this is done here.
And then we have our addition function, where we say, OK,
we're going to go in steps of four,
adding up these vectors, and we can use the vectors just as we would
in regular operations, right?
So it's basically the same as if we did it the manual way
or let the auto-vectorization do its work, right?
It's more or less the same code,
but we're already specifying that we want to have vector types.
It also looks very similar to writing the SIMD operations with intrinsics,
but without any platform-specific specialization.
And it will produce the same kind of platform-specific code —
let's say code identical to using the platform-specific intrinsics.
So if you look at what this produces when we compile it,
it will look the same as if we had written it the x86 way, or the ARM way, right?
The results are more or less the same, or actually the same,
but we only have to write one code base, right? And this will be directly translated.
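A minimal sketch of this kind of code with the GCC vector extensions (the array length is assumed to be a multiple of four to keep it short):

```cpp
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension type: 4 x 32-bit int in one 128-bit vector.
typedef int v4si __attribute__((vector_size(16)));

void add(const int* a, const int* b, int* out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        v4si va, vb;
        std::memcpy(&va, a + i, sizeof(va));   // unaligned-safe loads
        std::memcpy(&vb, b + i, sizeof(vb));
        v4si vc = va + vb;                     // compiles to one SIMD add per platform
        std::memcpy(out + i, &vc, sizeof(vc));
    }
}
```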
And the compiler supports all kinds of operations on these. Unfortunately, it's not feature complete, so it doesn't have all the operations that you would want for the kind of work
that we do — we'll see in a bit that there are some operations it doesn't support.
But it does have, for example, this shuffle operation,
so we can reorder the elements in a vector.
We have an input vector and a second vector
that specifies a new order, and the operation
will produce this new order —
it just copies the individual elements around.
And then we have a conversion builtin, for example, where we can change the vector types into
different element types. So, for example,
from 32-bit integers, or 32-bit numbers in general,
we can convert to 16-bit numbers.
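A small sketch using the corresponding GCC builtins (the values are illustrative; Clang's shuffle equivalent, __builtin_shufflevector, has a slightly different signature):

```cpp
typedef int   v4si __attribute__((vector_size(16)));   // 4 x 32-bit int
typedef short v4hi __attribute__((vector_size(8)));    // 4 x 16-bit int

v4si reorder_and_narrow_demo() {
    v4si a   = {10, 20, 30, 40};
    v4si idx = {3, 0, 2, 1};                    // the new order we want
    v4si r   = __builtin_shuffle(a, idx);       // {40, 10, 30, 20}

    v4hi narrowed = __builtin_convertvector(r, v4hi);   // 32-bit -> 16-bit elements
    (void)narrowed;
    return r;
}
```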
But something like this compression that I showed earlier,
that's super useful in database operations.
That's something that right now cannot be expressed within
these vector extensions. So if,
for example, you just want to keep the elements B and C, as we had earlier, and
have them at the beginning of the array, you cannot express that — and the same goes for the zero masking.
Okay, so far so good.
Now let's look at how to use this in database operations. So we'll first go through for some simple stuff
and then more explicit or more advanced examples.
And let me just quickly fix this here.
The very simple thing that I already told you about,
that's kind of straightforward to vectorize, is a scan.
So if we're doing some kind of scan or a selective scan,
this is something where you can directly see,
okay, instead of checking every individual tuple,
we can basically check multiple tuples at a time
to see whether they match our predicate or not.
So that's very simple.
Other things that are easily vectorizable are joins,
sorting using sorting networks, et cetera, and then filters.
And there's actually a nice read for this —
how can you rethink database operations
using SIMD vectorization?
This is linked here,
and some of the examples that I have are from this paper.
So let's basically look at the scan first.
So a scan often also has some arithmetics in there,
so this is basically what we would do in the previous examples as well.
I mean, it's not a simple scan,
it's already some kind of extended projection here,
so we're basically looking
at certain attributes of our tuples and adding them up. This is what
the extended projection typically does. But underneath, we're just going to scan through
all the data once and add up the elements of the different attributes.
And, well, that's easy to see that we can do this in parallel, right?
Another thing is aggregations.
Well, this is something that we can also do.
I mean, for the minimum, we have to do it block-wise
and then combine the blocks further.
Similarly, we can do this for sum and for max, etc.
So, but at least for some subsets, we can do this quite efficiently.
Right?
So, this is quite clear.
For each block, working on multiple elements at a time,
we can find the minimum, we can find the maximum, we can build the sum,
and then we do this iteratively for the rest, basically.
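A minimal sketch of such a block-wise aggregation with the GCC vector extensions — a sum here, assuming the length is a multiple of four and ignoring overflow of the 32-bit partial sums:

```cpp
#include <cstddef>
#include <cstring>
#include <cstdint>

typedef int32_t v4si __attribute__((vector_size(16)));

int64_t sum_column(const int32_t* data, size_t n) {
    v4si acc = {0, 0, 0, 0};                  // four partial sums
    for (size_t i = 0; i < n; i += 4) {
        v4si v;
        std::memcpy(&v, data + i, sizeof(v));
        acc += v;                             // block-wise, element-wise add
    }
    // Final scalar reduction of the four partial sums.
    return (int64_t)acc[0] + acc[1] + acc[2] + acc[3];
}
```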
However, the selection — well, I already basically gave that away, right?
So selection is a bit more tricky. I mean, if we're just doing an addition,
that's straightforward.
You're just going to go through the whole dataset.
We're just going to do this extended projection.
We're just going to go step by step, do everything and we can do this in blocks
fitting our SIMD registers. So that's quite easy. The selection is
more tricky because we don't have branching, right? So we cannot say:
if this condition, then do that. We have to express this somehow
in another way,
and that means using this kind of masking idea.
And the other thing is we don't really want to move data in and out of the SIMD registers,
because that's quite expensive.
So basically, if we have to move the individual items,
then this needs to be done individually, right?
We cannot move all of the elements at one time;
we have to either go through memory or do one item at a time.
But what we can do is we can extract a mask from the registers, from the SIMD register,
and just copy that into a single scalar register.
So that's something that's possible.
And the mask is small enough to fit into a regular register, unlike
the complete SIMD register, right?
So if we have like a 512-bit register,
we cannot copy this into a single register.
So we somehow have to compress this,
and one way is using this kind of mask.
So the idea for creating some kind of scan
or selection like this
is to actually generate a bit vector,
and then interpret it in scalar mode.
So, basically, rather than doing
branching for every individual tuple,
or some weird branching logic for a whole vector, we're going to create
a bit vector and then interpret it in scalar mode, where we can actually do the branching.
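A small sketch of that mask extraction with AVX2 intrinsics (the greater-than predicate is just an example):

```cpp
#include <immintrin.h>

// Compare 8 ints against a threshold and pull the per-lane result
// into an 8-bit mask in a regular scalar register.
int match_mask(const int* block, int threshold) {
    __m256i vals = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
    __m256i cmp  = _mm256_cmpgt_epi32(vals, _mm256_set1_epi32(threshold));
    // One bit per 32-bit lane; the scalar code can now branch on this cheaply.
    return _mm256_movemask_ps(_mm256_castsi256_ps(cmp));
}
```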
So we already had this selective store and expansion,
so this will be required for these kind of operations again.
So, let's just briefly touch on this again.
And of course, we think of this here with small examples,
but for larger vectors, for 512-bit vectors,
you can handle more elements at the same time.
And what the selective store does again is basically selects the subset of elements that we're interested in
and compresses this to either to the beginning of the register or also directly to memory.
So this can also be directly then written to memory.
And the same thing we can do the other way around.
Rather than basically saying we want the whole data set that we have in our array,
we can just load a selective subset.
And then put this compressed at the beginning of our vector.
So then we can also efficiently work on that.
There's additional operations, and these exist as intrinsics,
but some of them are actually not implemented in hardware, but will be translated into multiple rounds of individual operations on the CPU.
Basically, there is no dedicated unit on the CPU for this operation, so it will take
multiple cycles to execute.
And one of those would basically be a gather operation.
So this is different in that you can say:
I want to directly load a certain set of values
into certain lanes.
You basically load values from non-contiguous memory regions.
You can say in my first element,
for example, in my target vector,
I want to have the element number 2 in my base vector,
so in my base data set. So this basically means:
the first index says 2, so I count to position
0, 1, 2, and that element will be the first element in the load.
And then we have 9 — so 0, 1, 2, 3, 4, 5, 6, 7, 8,
9, we're here, the tenth position basically —
and that element will go to the second position.
And this gives you a way to basically do sorting —
so a shuffle at the same time as some kind of compression, etc.,
like we had earlier.
But as I said, some of this is not implemented
as a single circuit in hardware,
but will be executed as multiple internal operations built from these earlier instructions.
They still exist as a single instruction, as an intrinsic, in AVX-512, for example.
And the same is true for a scatter operation.
So that's kind of the other way around.
So we have a contiguous region
that we basically load from,
and then we write this out
to scattered memory locations,
based on some kind of index vector
that tells us where the data needs to go.
And yeah, these will also not necessarily be executed fully in parallel, because of the cache: we cannot have an unlimited number of distinct memory accesses per cycle.
So that will basically hinder some of the parallelization there.
And that means, as I said, this will need multiple rounds of execution.
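In AVX-512, gather and scatter are single intrinsics; a hedged sketch with an element scale of four bytes:

```cpp
#include <immintrin.h>
#include <cstdint>

// Gather: lane i of the result is loaded from base[idx[i]].
__m512i gather_example(const int32_t* base, __m512i idx) {
    return _mm512_i32gather_epi32(idx, base, 4);       // scale = 4 bytes per element
}

// Scatter: lane i of 'vals' is written to base[idx[i]].
void scatter_example(int32_t* base, __m512i idx, __m512i vals) {
    _mm512_i32scatter_epi32(base, idx, vals, 4);
}
```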
So let's look at the filtered column scan.
So this is basically the traditional way of a simple filtered column scan if you would
implement it in a scalar way.
And here the result that we want to have is basically, we want to get the positions in the array which match our filter.
So, basically, we're not outputting the results or the values that match, but we're outputting the rows that match,
so the row numbers.
And that's also important
because that's the same thing
that we're going to do later on
in the vectorized version.
So basically, we have our column —
which is just an array —
and we go through the column value by value
up to the column size. For each value in the column,
so for each row basically, we're checking the filter
condition, which here is a greater-than comparison,
and if we have a matching row,
then we add it to our output vector:
at the next position of the output vector,
we write the row number. So that's what's done here.
So if we're filtering this array — 4, 13, 18, 4, 2 — and our filter condition is greater than 9,
so we look for everything that's greater than 9,
then we get the positions 1 and 2, because 13 and 18 are greater than nine.
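A minimal sketch of this scalar version (the names are illustrative, and the output buffer is assumed to be large enough):

```cpp
#include <cstddef>
#include <cstdint>

// Returns the number of matches; 'out' receives the matching row numbers.
size_t filter_scan(const int32_t* column, size_t n, int32_t threshold, uint32_t* out) {
    size_t out_pos = 0;
    for (size_t row = 0; row < n; ++row) {
        if (column[row] > threshold) {
            out[out_pos++] = (uint32_t)row;   // store the row number, not the value
        }
    }
    return out_pos;
}
```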
And that's something that we can now translate into a SIMD expression.
And so doing these comparisons as a vector
basically means we're using our vector types.
So here we're saying: yes, we want a vector data type — this is the GCC vector type.
And we need a second one, which is basically our Boolean vector,
telling us whether the individual row matched or not. And then we have our filter function.
And we start by getting the right values,
in this case four at a time.
We read our value vector,
then do the comparison,
where the result will not be a branch but a Boolean vector.
So this is our result vector, the Boolean vector.
This will be converted into a bitmask.
And then we're doing this left-pack operation —
that's a function that we have to define separately; I will show it on the next slide.
And then we increase our output position.
The output is basically the same idea: it's again a vector, containing the compressed
results, so we're only adding the rows that actually match our condition.
And the left-pack operation does exactly this:
using the result bitmask, it selects only those rows that matched
and moves them to the front.
And then we can increase the output position
by the number of matching tuples.
For that we count the number of ones in the bitmask —
these are all the matches —
and that's how much we advance our output pointer,
so that in the next loop iteration
we continue the output from the next empty position.
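A hedged sketch of that loop with the GCC vector extensions (4-lane case; n is assumed to be a multiple of four, the output buffer needs a few elements of slack because a full vector is stored each iteration, and left_pack is the helper discussed next — it is sketched after the following discussion):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef int32_t  v4si __attribute__((vector_size(16)));
typedef uint32_t v4ui __attribute__((vector_size(16)));

// Moves the lanes of 'rows' selected by 'mask' to the front (defined further below).
v4ui left_pack(v4ui rows, int mask);

size_t filter_scan_simd(const int32_t* column, size_t n, int32_t threshold, uint32_t* out) {
    size_t out_pos = 0;
    v4si thresh = {threshold, threshold, threshold, threshold};
    for (size_t row = 0; row < n; row += 4) {
        v4si values;
        std::memcpy(&values, column + row, sizeof(values));
        v4si match = values > thresh;             // lane-wise: -1 (all ones) or 0
        // Boolean vector -> 4-bit scalar bitmask, one bit per lane.
        int mask = (match[0] & 1) | (match[1] & 2) | (match[2] & 4) | (match[3] & 8);
        v4ui rows = {(uint32_t)row, (uint32_t)row + 1, (uint32_t)row + 2, (uint32_t)row + 3};
        v4ui packed = left_pack(rows, mask);      // matching row ids moved to the front
        std::memcpy(out + out_pos, &packed, sizeof(packed));   // extra lanes get overwritten later
        out_pos += __builtin_popcount(mask);      // advance by the number of matches
    }
    return out_pos;
}
```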
Okay, so this is where we have the problem
that we cannot really work with the compiler extensions —
or we can, but it's not going to be as efficient —
because this is where we want to use this compress function.
So if you remember, we said for the compress: we have our vector, which has some values that are relevant for us and some values that are not,
and with the mask we move those that are relevant to the beginning of the vector.
Then we can just say: only those at the beginning of the vector
we're going to add to the output; the rest we throw away and don't look at anymore. And the AVX-512 operation, this _mm512_mask_compress, will do exactly this. However, there is no GCC
builtin for this. If we want to do it with the GCC extensions, then we need to use shuffling. And the shuffle
basically says, okay, we're going to shuffle somehow —
but then we also need to know which elements we actually want to select. So we need
a vector that tells the shuffle to move those that we want
to the beginning, and we somehow need to remember which ones we want. So the main problem is that we
need to create this kind of translation vector
to move these elements to the front.
And with the compress function,
we get something like a 30-times speedup over a scalar version
with 512-bit registers.
With the shuffling, you still get a speedup,
but it's basically half of that,
because we need to create a lookup table
that tells us which shuffle mask to use.
And what we do is exactly this. On AVX-512
we have the compress function, so that would be the
implementation of this left pack on AVX-512. With the GCC vector extensions, the code is generic,
will run on any instruction set, but we have this bitmask, and
using this bitmask we translate to a shuffle vector
that then gives us the right shuffle result.
So this is basically just a lookup:
for each bitmask that can occur, we store the correct
shuffle vector. It's just 16
entries that translate to the right shuffle.
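A hedged sketch of the two variants for the 4-lane case (the table entries follow the pattern described: one shuffle vector per possible 4-bit mask, with the trailing lanes as don't-cares):

```cpp
#include <immintrin.h>
#include <cstdint>

typedef uint32_t v4ui __attribute__((vector_size(16)));

// Variant 1: AVX-512 (with AVX-512VL for 128-bit registers) — a single compress.
__m128i left_pack_avx512(__m128i rows, __mmask8 mask) {
    return _mm_maskz_compress_epi32(mask, rows);
}

// Variant 2: generic, via shuffle plus a 16-entry lookup table.
// Entry i lists the lanes whose bits are set in i, in order, packed to the front.
static const v4ui shuffle_table[16] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

v4ui left_pack(v4ui rows, int mask) {
    return __builtin_shuffle(rows, shuffle_table[mask]);
}
```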
Okay, questions so far? No questions? Then we're going to do... oh yeah?
What's the way to go for AVX2 regarding the left pack?
Good question.
I have to check.
So I think then, again, if you don't have AVX512,
you would probably use this generic implementation.
I have to check if there is...
I think the compress store is only in AVX-512, though.
Well, I would have to look it up. More such good questions.
Oh, okay. Then we're going to do a quick five-minute break here and then continue with some more examples.
Okay, so let's get started again. So now I want to show you two
more examples of like slightly advanced ways of using SIMD instructions in database operations.
And then from there, you can basically go and think about different things.
So one thing, for example, is if you have kind of bit-packed integers,
then, so say, for example, you have reduced precision,
or you basically have, say, 9-bit numbers that you want to store
or that you have stored densely in your memory
then we can use SIMD to
decompress this. This happens in database systems
if you know you don't need the full range of integers,
but you need, say for example, 9 bits to fully represent the data —
then you really want to pack the values densely in memory.
Of course, they are not aligned.
If you want to work with the data, then you have to decompress it again.
And this is also a problem because memory is only byte-addressable,
so we need to do some shifting back and forth
in order to get proper output, that is,
32-bit numbers that we can then continue working on with the CPU.
And the scalar approach to this — so the idea is really that your data is packed:
the first number is 9 bits, then the next 9-bit number, and so on —
is basically to track the used bits,
copy the 4 bytes containing all the bits of one 9-bit number into a register,
and then shift the bits so that you have only the data
for the individual number that you need.
You shift out any unused leading bit,
and the trailing bits, which of course already belong to the following numbers
because everything is densely packed, you zero out.
So you copy data over, shift out everything that's not part of the current number,
and zero the remaining bits.
And you're going to continue this with every next number. So you basically always have to copy a
certain amount of data into your registers and then move the data around. And the same thing
we can also do with a vector approach. So initially — and I maybe should have
animated this — if we load the input, these are basically our lanes of
four bytes each, and the individual numbers do not adhere to the byte boundaries or the byte width,
and also not to the lane boundaries.
So this is number one with nine bits, number two with nine bits and so on.
And you can see this is basically messed up in the memory.
So the first thing we do is basically we shuffle the bytes
in order to move the numbers
to the target lanes.
And the idea here is that in each lane,
in each word, we want to end up with one number, right?
So we want to have this nine-bit number in 32 bits.
And this is why, while initially loading the data,
we're not using anything beyond number four —
only number one to number four
will actually be used in this load,
because everything else will be shifted out.
We cannot keep it in our register at this point,
because we only want to have four numbers in our register later on.
So we're moving, we're shuffling,
in such a way that the bytes are at the right order.
So we get number two in the second lane, number three in the third lane,
number four in the fourth lane.
And we're just copying the bytes.
So this means we still have these empty spaces here,
or this is basically the unused leading
bits from the previous copy.
So here, for example, for number 2, there's still one bit from number 1 in there.
For number 3, it's then two bits, and correspondingly more for the others as well.
So then in the next step, we do a shift to remove the unused bits,
so the unused bits end up at the end — or rather, we shift them out.
And then we mask off the most significant bits,
so we have leading zeros.
I'm also always getting confused here:
of course, in memory,
this end is the most significant bit,
so we basically have to read it backwards,
and then we just have leading zeros —
we're filling up the lane with leading zeros from here on.
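A hedged sketch of one such unpacking step — four 9-bit values into four 32-bit lanes — assuming little-endian packing with the first value in the lowest bits and at least 16 readable bytes at the input; the shuffle control and shift counts follow the scheme just described:

```cpp
#include <immintrin.h>
#include <cstdint>

// Requires SSSE3 (byte shuffle) and AVX2 (per-lane variable shift).
__m128i unpack4_9bit(const uint8_t* packed) {
    // Load 16 bytes; only the first 5 bytes (4 * 9 = 36 bits) are actually needed.
    __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(packed));

    // 1. Shuffle: copy, into each 32-bit lane, the two bytes that contain its 9 bits.
    //    Value i starts at bit offset 9*i, i.e. in bytes (9*i)/8 and (9*i)/8 + 1.
    const __m128i shuffle = _mm_setr_epi8(0, 1, -1, -1,   // value 0: bytes 0..1
                                          1, 2, -1, -1,   // value 1: bytes 1..2
                                          2, 3, -1, -1,   // value 2: bytes 2..3
                                          3, 4, -1, -1);  // value 3: bytes 3..4
    __m128i lanes = _mm_shuffle_epi8(in, shuffle);

    // 2. Shift each lane right by (9*i) mod 8 to drop the leading bits that still
    //    belong to the previous value.
    const __m128i shifts = _mm_setr_epi32(0, 1, 2, 3);
    lanes = _mm_srlv_epi32(lanes, shifts);

    // 3. Mask off everything above bit 8 (the trailing bits of the next values).
    return _mm_and_si128(lanes, _mm_set1_epi32(0x1FF));
}
```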
And after... Yes?
So in the first step we loaded
like 16 bytes, I think.
Is it like
preferred to always load
as much memory as fits into the
big register, or
is it better to only load...
That's a good question.
So the question is whether we could also load just less than that —
we basically only need the first four numbers.
I'm assuming it doesn't really make a difference in terms of like performance.
We would need some extra logic to only load less than that.
And so that's my assumption.
We won't see much of a difference here.
So then, after processing four numbers of the input,
we have all the data:
all individual numbers are correctly in the register
and we can continue with the next round.
With 256-bit vectors we can of course do 8 numbers at a time, and then this actually works out better,
because 9 times 8 is 72 bits, which is divisible by 8.
So after each block we're byte-aligned again and we can repeat more easily.
Okay, so the last thing I want to present to you today
is the SIMD tree search.
So this is kind of a cool way also of using SIMD
in data structures.
So, I mean, basically if you have something like a binary tree, I mean, this is like very
simple binary tree.
So a basic scalar binary tree implementation would put the individual elements into an array: you start
with the root, of course, and then the first child is on the second
position, so position one, and the second child on position two. And in general,
the binary tree is usually laid out in memory such that the children of the node at position n are at positions 2n+1 and 2n+2.
This way, you can also easily traverse the binary tree.
So you're just jumping to these different positions.
But that also means that the children are actually
quite far away in the array.
Initially, of course, the root
is close to its first children,
but then the jumps to the next positions get larger and larger.
Still, you can quickly traverse this
in a scalar fashion: if you're
searching the tree, you're basically just jumping
to the right positions and trying to find the right element
through these jumps and comparisons.
And now the question,
if we're doing these lookups,
can we somehow do the lookup more in parallel?
So rather than checking one position at a time,
can we check multiple positions at a time? Can we somehow, yeah, get some SIMDification in there?
Of course, wherever we have branching,
we cannot really do any kind of SIMD.
So these large jumps in memory,
where we have to find something else depending on an if condition,
we cannot SIMDify directly.
But we can make sure that we work on a subpart of the data set —
and then the other question is: can we also do a vectorized inner loop?
First of all, we're basically going through multiple items, so we're finding multiple
items in our data set, so this would be okay.
But then in the inner loop, we're basically doing the actual lookups in the tree. So this is tougher to optimize because there we have data dependencies and we basically
have to speculatively somehow maneuver to the levels.
But if we have large jumps, we're also going to need a lot of different data in there.
So let's basically look what this would look like.
So how could we do this in parallel?
And like a speculative tree navigation would mean rather than just looking or checking
one individual item at the tree, we can also check multiple at a time. So say for example, we do two levels at a time
here, then we can basically
do three checks at a single round, in a
single step basically, then find the one that actually
matches, and then do the next round. Like in the next round
we can do three checks again. So if we
have four values in our registers, for example, then three nodes would actually fit at
the same time in our SIMD register.
Do we need to check node one if we check nodes two and three?
So that's a good question:
why do we need to check node one if we check nodes two and three?
Oh, I see. Never mind. I got it.
OK.
So of course, we need to basically continue in the tree.
So we need to know where to continue in the tree.
Or we could just be not in the tree at all.
Okay, so how can we get there?
So, with the initial layout that I told you about, this part here would still be easy —
it is close in memory. But from here on, these
elements will come after all the other elements of that level in the tree. So basically,
between the 6 and the 12, the 7, 8, 9, 10, 11 will be in memory. This is
inefficient in terms of loading, right? Because
we would somehow have to gather this from memory, and there's
no way this can be done efficiently through the cache. So rather than laying it out this way
in the memory, we can also lay it out in a kind of a packed way.
So we have to reorganize the data in memory such that we have individual blocks.
So we're blocking the binary tree into sub-blocks and making sure that these sub-blocks are
adjacent in memory.
And conceptually, this means: this is like one node,
and then we have essentially four sub-nodes of this one node, which again are each
closely packed in memory. So the memory layout changes, but then
everything that we need for the comparison is tight in memory.
And we can do a quick look up, or we
can do a quick check on these.
So the way how we then would basically do the comparison,
rather than doing this step by step scalar way,
we're going here, here, here, here, here, for example,
or another way, we can basically do a complete block at a time.
So we use just the SIMD compare.
The SIMD compare will give us a mask.
It will basically say which ones fit and which ones don't. So here, for example, we have this top part of the tree:
our binary tree starts with 41, and we're looking for 59. 41 is smaller than 59, yes, so
we need to go to the right.
So this will be smaller, this will be smaller,
this will be larger.
So we know we have to traverse this way.
So we would represent this, for example,
the ones that are smaller with a 1,
the ones that are larger with a 0.
So the SIMD compare will basically tell us: okay, this is smaller, this is smaller, this is larger.
And now we can move the mask into a bit vector that can then be loaded directly into a scalar register.
This is what I told you earlier, right:
we don't want to copy the data itself from the SIMD registers to the scalar registers,
because we would have to do that one element at a time.
Yes?
Do you think the main benefit of this
is that we can do multiple comparisons at once
or is it that we don't jump around in memory as much?
Because I would believe that the memory is the slow part,
and that we could just as well do scalar comparisons.
Both, actually.
So both contribute.
So the question is which one really helps:
is the memory blocking — that's what this is called — beneficial,
or is the SIMD comparison beneficial?
With the SIMD comparison we won't get a perfect speedup: we're looking at three numbers rather than
four at a time, but in a single operation, hypothetically, and of course
we need some additional operations. So we reduce the number of
cycles per individual comparison, but not by that
much. At the same time, the blocking helps with a better memory
layout. And this is in general a good idea to do. So even beyond using SIMD registers,
blocking the data, having a better memory layout
makes a lot of sense.
So it's a good point.
I actually have it in the slides as well.
So towards the end.
Okay, so we're copying the comparison vector:
with a move-mask operation, we can copy it directly into a scalar register.
And using the scalar register, we can then say,
okay, with a lookup table —
so we're also going to have a lookup table —
we can see which child we have to go to.
And for the lookup table: the comparison has only a few possible results.
Basically, either we don't find anything at all — even the root doesn't match —
or we have to go to the left, to the right, and so on.
So the possible results are: the key is smaller than everything;
it's larger than the left child but not larger than the root;
it's larger than the root but not larger than the right child;
or it's even larger than the right child. Anything
else is not possible — we cannot have it smaller than the left child and larger than the right
child, or something like that, right? These combinations would not work. So the bit mask
will only have a few possible configurations. Everything else would be an error in our programming, basically.
So what we need is a lookup table
that has just these four different options,
which correspond to the four different children we can go to.
Either the value is smaller than everything,
so we go to the left from the left child; or the value is
larger than the left child but not larger than the root, so then we go to the right
from the left child; or it's larger than the root — then it will also be larger than the left child —
but not larger than the right child, so we go to the left from the right child; or it's even larger
than the right child, so we go to the right from the right child. And this is
what we can represent in the lookup table: for each valid bit mask,
it directly gives us the index of the child where we continue. And then, of course, depending on the level
that we're in, we have to multiply accordingly, which gives us the memory position that we have to go to, where we find the next block.
And we can continue to do the same kind of SIMD operations.
We load the next block in our SIMD register, we do the same kind of SIMD comparison,
we create the move mask, do the lookup, which gives us the next child,
and get the next position, and so on until we're all the way at the end.
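A hedged sketch of one such descent step (SSE intrinsics; a block here holds three keys laid out as [root, left, right] plus one padding lane — the exact block layout and addressing are an assumption for illustration):

```cpp
#include <immintrin.h>
#include <cstdint>

// For each valid compare mask, the child subtree (0..3) to follow.
// Bit i of the mask is set if key > block[i]; only 0b000, 0b010, 0b011, 0b111 can occur.
static const int child_of_mask[8] = { 0, -1, 1, 2, -1, -1, -1, 3 };

// 'block' points to 4 ints: [root, left, right, padding].
int descend(const int32_t* block, int32_t key) {
    __m128i keys  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    __m128i probe = _mm_set1_epi32(key);
    __m128i gt    = _mm_cmpgt_epi32(probe, keys);               // lane i: key > block[i] ?
    int mask = _mm_movemask_ps(_mm_castsi128_ps(gt)) & 0x7;     // only the 3 relevant bits
    return child_of_mask[mask];        // index of the child block to load next
}
```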
And in general, this is where basically your question comes in. The moving or the blocking is a good idea in general.
So having a binary tree in memory that has very long jumps to the children
will be bad in terms of finding data quickly
because with every memory load, we will load whole cache lines,
and they will contain a lot of information that we don't need.
With this kind of blocked layout, when we
load data, at least most of it will be needed for the comparison — if we
have three values at a time, then only one lane is basically unused;
all the other comparisons actually make sense. We can also do
multi-level blocking: we can do the blocking for SIMD, we can
do cache-line blocking, and if we're using disk or larger memory pages, we can also do
page blocking, making sure at every level
that what we're working with directly matches
the granularity of the unit we're loading. And that will give us
better performance in any case. So here's an example of the results. This is, again, from this slightly older SIGMOD paper,
where they also ran this on GPUs,
so a similar way of executing
where you can also do the tree search faster — and you can also see that the trees are not that big
in the end.
But you can see: if you do the blocking —
the numbers here are normalized to a plain scalar search —
then just using page blocking already gives a decent speedup,
cache-line blocking gives an additional speedup,
and then using SIMDification, so using SIMD instructions, improves the speedup again.
And if we do software pipelining —
that's an additional step that you can look up in the paper —
we get even faster.
So what we're looking at is
roughly a three-fold speedup over the scalar version,
where a lot of it actually comes from the blocking.
The SIMD will also give us some, let's say,
10% additional speedup. So it's not going to be a threefold performance improvement
out of SIMD alone, but it will be an improvement in the end. And if we're looking at very small trees — 64,000 keys, so that's not a lot —
then the blocking won't help as much, because most of the tree will be in the caches anyway,
but the SIMDification will still give us something like a 15-20% performance increase.
Okay, so much for today.
So I hope you can see,
it's kind of a different way of thinking about how to program SIMD:
forget about branches, basically.
Think about multiple items
at a time, and then about how to deal with whatever you would want
to do with conditions by using masks and by moving
data around in memory.
So that's kind of the gist of SIMD programming, I would say.
And then you have to see if you can somehow have enough data.
And of course, it needs to be properly in memory.
So can you lay the data out in memory well enough
that you can quickly iterate over it in the registers
and in the SIMD unit?
Okay, questions so far?
No questions right now?
OK, next time we're going to talk about execution models,
meaning given a query plan, how do we actually
want to execute this in the database?
So do we compile it to C++ code and use a compiler?
Do we interpret it in some kind of bytecode?
Do we do other kinds of iterative approaches?
And we'll talk about the different trade-offs in there.
Okay, with that, thank you very much and see you all tomorrow.