Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Vectorized Execution (2)
Episode Date: May 8, 2024...
Transcript
Welcome everybody to our next session on hardware conscious data processing.
We'll continue and finish up vectorized execution today.
And before we continue where we were, let me make one announcement.
So task one is online. We had the task description yesterday.
Marcel presented this to you. The deadline is May 27. And then for some of
you, just as a reminder, there will be an interview on this. So most likely on the 30th.
If we can manage everybody on the same day, we'll send out information about this soon.
I think the assignment is already done online,
so you can see if you're going to be interviewed on this task or on any other task.
If you do this, you have to commit.
And if you sign up and create your repo, you have to commit and push your changes so we can
see them and actually evaluate them. So make sure the final submission that you want to be
tested is actually pushed at some point so we can see it. The runner then executes automatically.
There's one runner, meaning that if you push a lot, this will get slow for everybody.
So this is not so much for trial and error.
Only push when you have a new solution, or a couple
of solutions.
Don't push hundreds of solutions,
because then everything will be slow for everybody.
If you have questions, use the forum. We saw that there's already some solutions up,
mostly scalar.
Of course, the scalar solutions are fine
for testing and for evaluating.
But for the final solution,
you should do some kind of vectorized execution.
And of course, as always, no cheating.
So if you cheat, you fail.
Very simple.
There's also no discussion
about it, but that's the same every time, right? So we're looking forward to this. And just
as a reminder, basically the interview is just a check, right, discussing the solution with you. If it doesn't go well, you will have a second interview.
Okay. And if we figure out you didn't do it yourself,
then,
well, you also fail, right? That's also one form of cheating, of course.
Okay, so with this,
we were in vectorization. So everybody hopefully remembers we have these special kinds of units on our cores,
like the second type of parallelism that we're discussing.
So what was the first type of parallelism that we discussed?
Anybody listening?
Yes?
Pipelining, exactly.
Thank you very much.
So we did instruction execution,
and the instructions are broken up,
like bigger instructions are broken into pipeline stages,
and these we can pipeline, or execute in parallel,
in a pipeline-parallel fashion.
And now we're looking at data
parallelism, so we can do the same instruction on multiple data items at the same time, in
specialized functional units in the CPU, right, on a single core. So we're still looking at parallelism
on a single core here. Okay, and last time we talked about auto-vectorization,
so how vectorization works in general,
how this is executed,
and how the compiler will do some of the stuff for you.
And in many cases,
if your code has the right properties,
like you're processing arrays, and there
is something that can be vectorized,
the compiler will do this for you.
So if your CPU supports it, and it's
an easy-to-detect pattern that can be parallelized,
the compiler will automatically do some parallelization here for you, so some vectorization.
However, it cannot do everything.
So this is restricted to very simple patterns.
And today, we're going to look into how we can actually then
do this explicitly, so when we actually
want to tell the compiler or the CPU
to use these functional units through specialized intrinsics.
So there's a different way of doing this.
And yeah, so basically we can explicitly tell the CPU: load this into the vector register and then use these vector intrinsics.
So the CPU can move data between regular and SIMD registers,
load data directly into SIMD registers from memory or cache, and then execute vectorized instructions. These are specific to every CPU architecture, right?
So this means you have different intrinsics for x86
and different, we already had this, right?
So we have the 128 bits, 256 bit and 512 bit on x86.
You don't have 512 on every CPU, but this really is more like on those
bigger server CPUs. So your regular CPUs, newer generations will have it, but like slightly older
generations might not have this. If you have an ARM CPU like this machine here, again, it's
different, right? So you don't have the same intrinsics, you have different kind
of intrinsics. And so this is potentially not portable. We're going to mostly work with Intel
intrinsics or x86 intrinsics. And these are actually not assembly instructions, right? They
feel like assembly instructions, but they're not. This is really like a wrapper for something that will later on be translated to assembly
instruction. And we can use these, like the assembly instructions, by calling a function
with the right parameters. And this is usually built out of like a pattern, these instructions.
And the pattern basically has three parts.
So we have the vector size as one part.
So what is the size of the register that we're using?
Again, 128, 256, or 512 on x86 typically.
There's also different kind of vector processes
that can do far wider vectors, vector sizes,
but on regular CPUs, this is what you get.
Then you have the actual operation that you need to do
or that you want to do, and then a suffix basically telling the data type.
So we have on the one end, we have the register width,
and we have the data type.
So that also depends on what's the parallelism
that we will get.
So if you have a very large data type,
we will get less parallelism. If you have a very large data type, we will get less parallelism.
If you have a very small data type,
we get higher degree of parallelism.
And since we have all these combinations
and like for the different data types
and all the different vector sizes, et cetera,
the more vector sizes and the wider the registers,
actually the more functions we have, right?
Or more intrinsics we have.
So let's, I mean, again, look at concrete examples in x86:
we have _mm for 128 bit, and I told you this is for multimedia.
We have _mm256 for 256 bit, _mm512 for 512 bit.
Then things like addition, subtraction,
multiplication, masking, things like that.
We'll also see that.
And then things like float, double,
so PS is float, PD is double.
EPI32 would be a signed 32-bit integer.
EPU16 is an unsigned 16-bit integer, et cetera.
So there's many different things.
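Just to make this naming pattern concrete, here are a few real x86 intrinsic names and how they decompose; this is a small illustration, not an exhaustive list:

    // <width>_<operation>_<data type>, all declared in <immintrin.h>:
    _mm_add_epi32      // 128-bit register, addition, signed 32-bit integers (4 lanes)
    _mm256_add_epi32   // 256-bit register, addition, signed 32-bit integers (8 lanes)
    _mm512_add_ps      // 512-bit register, addition, single-precision floats (16 lanes)
    _mm256_mul_pd      // 256-bit register, multiplication, double-precision floats (4 lanes)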
And this is actually the thing that you
will have to figure out.
So this is in the task that we're giving you.
This is what you have to find.
And it's a bit of a navigation
through the different intrinsics,
so you have to search for the best intrinsics
that you can use.
Maybe we've also not found it, right?
So there's many different, you might come up
with a better plan even than we have
or better intrinsics for the tasks that we gave you.
But you will find those.
I'll also show you some hints on how to find this.
And Marcel already did this as well yesterday.
Right, so let's look at some examples here.
If we have 128-bit intrinsics, then we need 128-bit registers.
So, for example, an integer register for 128 bit would be called __m128i.
That's a register for integer types.
So this is where we can put all of our integer values
in different sizes.
If we want to load,
like, unaligned 128 bits from memory,
then we have this load operation or load intrinsic.
And here you can see we have it aligned and we have it unaligned, again, two versions regarding
alignment. Again, important: if you can guarantee alignment, then the processor
doesn't have to do anything extra to basically shift the data into the
register. If you cannot, then you need to do an unaligned read, and
then basically the CPU will do some extra work here to make sure that this
is actually later on aligned in the register. So this is loading 128
bits from memory, and there's an equivalent assembly instruction that this
will directly be translated to. So, I mean, the intrinsic is not the assembly instruction,
it's basically a wrapper, a function wrapper for this assembly instruction.
Here we have, like the third one that we have is an addition. Let me see if this works.
So this would be addition.
Again, 128-bit.
We can tell it from this part.
Then we just have an addition.
Nothing special about this.
And a signed 32-bit integer.
Again, there's a direct correspondence to the assembly instruction.
And finally, later on, we want to store this again.
So then we have a store function and again, unaligned store for 128 bits to memory, for example.
And now, if we look at this, let's use our example that we've always been using so far.
We can do the same kind of addition, the array addition that we had earlier, with explicit vectorization.
So here we basically have the same function.
Here we're assuming that max is divisible by 4.
You remember why that is important.
If it's not divisible by 4, we will have a problem.
We're basically not getting the last elements,
like one, two, or three elements, potentially.
So for that, we would need some kind of epilogue.
But if we know that max is divisible by four, then we can basically do it like this.
We don't have to treat this separately.
Also, max might be smaller than four, right?
So then this will never work, et cetera.
So this is an assumption that we already have in here.
So what we do is: we have three 128-bit registers that
hold parts of the arrays.
And we have an addition in here:
we're loading the two vectors, x and y,
we add them up into a result register,
and then we're storing the result register back to memory,
to the target array.
Right?
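A minimal sketch of what such an explicitly vectorized addition can look like; the function and variable names here are illustrative rather than the exact code from the slides, and it assumes max is divisible by 4:

    #include <immintrin.h>

    void add_arrays(const int* x, const int* y, int* target, int max) {
        for (int i = 0; i < max; i += 4) {
            __m128i xv  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));  // unaligned load
            __m128i yv  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
            __m128i sum = _mm_add_epi32(xv, yv);                                     // 4 additions at once
            _mm_storeu_si128(reinterpret_cast<__m128i*>(target + i), sum);           // unaligned store
        }
    }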
So, I mean, it's basically the same thing
that we've seen earlier,
and we can also check this in Godbolt.
So in the Compiler Explorer;
let me find my cursor.
I hope you can see this.
So here, this is exactly this code, right?
And you can see, I mean, of course, we have like the loop stuff, etc.
But the actual execution here, we basically, we're moving the values into memory.
And then we have the parallel execution or the vectorized addition here.
And then we're writing the values back.
So here we're loading the two vectors into two registers,
and we're adding the two registers,
and we're writing them back into memory.
The rest basically is just loop code,
basically incrementing loop counters, et cetera.
And you can see this is basically all the code that there is because we don't need a prologue, we don't need an epilogue
in order to fix anything.
There's a question, yes?
If we don't cast the pointers to the 128-bit type,
can we then just do an increment
instead of writing this cast
for all the pointer arithmetic;
is that fine, or will we get a compilation error?
So the question is, if we don't cast a pointer,
can we just increase by four, or will we
get a compilation error?
And to be honest, I don't know.
We would have to check.
I mean, we could check here.
So where is it?
Yeah, I'm not going to do it now.
But basically, I have the link in the code.
You can just try it out.
So you can basically just change the code here, compile,
and you will see if it works or doesn't.
So what the compiler does here.
And just as a reminder, so this is basically, we see it,
this is our vectorized
code. So we have the two moves, we have the addition, and we have the load and
the store. And this was our auto-vectorized code, right? So here we have the same thing,
basically. And we can see this is really exactly the same code that will be generated. However, in
contrast, we have all the other code for epilogue and prologue to just make sure that everything is
sound and safe: that max is not too small, that max is divisible by 4, and if it's not, then we have to deal with the rest separately.
Okay?
Good. Good.
And so this is basically, there's an Intel Intrinsics Guide that you can use to figure out what to use.
And on top of that, also, there's a cheat sheet that's helpful.
I can also show this to you.
Just a second.
I think this is here. Yeah.
So this is the TU Munich
x86 Intrinsics Cheat Sheet. There is basically a description.
This is also linked in the task.
It shows how these are set up, for example
how an individual comparison is set up,
what the data type suffixes are,
what the instruction sets are, what the data types are.
And then you have the kinds of conversions,
you have special algorithms, you have register loads, et cetera, fences, bit operations, and you can
basically see how these would be put together. You have a short description of what each one does,
like a move mask, for example; this would be something that we will look at in a
bit. And, yeah, you see what kind of data types it will work with. So that's actually
quite helpful. Alternatively, you can go to the Intrinsics Guide and browse your
way through it. And as I said,
like there's lots and lots of instructions
with very specific operations,
also with very, let's say also some kind of algorithms
and there's some caveats.
And this is basically where you then have to actually read
the guides, because not all of them are actually a
single instruction, or are translated into a single instruction. Some of the
intrinsics actually have an algorithm behind them, which means
they will use multiple cycles to actually be executed. So this means you
might have shorter code in your program, but actually it executes longer because multiple steps need
to be done.
So it's just something to be aware of.
However, if there is a special instruction, in many cases,
the algorithm is actually not so bad.
So this might still be faster than if you code it yourself. OK.
So one important thing that you will also need is masks. This is something we do a lot in vectorized execution:
basically having a mask that selects part of the vector. And this is also how we do branching in vectorized execution.
I'll come to this in a bit more detail later.
So from AVX 512, almost all operations support masking,
which means we can selectively apply an operation to some of the items in the vector.
And there's zero masking or just masking.
Basically, either we just apply it to some
and we don't care about the rest,
or we apply it to some and we're zeroing out the rest. This is basically just to make sure that
we can select a subset of the vector that we're working with. Say, for example, if we know,
okay, we filtered out something, then this would help us. So here, for example, if we do zero-masked
addition, then we have, well, let me do this here, right, so this is an example:
we have an addition of 32-bit integers with zero masking. So with this, the intrinsic actually
even gets wider, right? So it's not just a single operation anymore, it's like two parts of the actual operation.
So with the addition and the zero masking.
So then we have a mask that's basically a bitmap
that selects which elements of the vector will be added
and which are zeroed out.
So here we have an example:
we have two arrays of four, or two vectors of four elements,
that we want to add up.
And we have, sorry about this,
we have a mask that's 0, 1, 0, 1.
So we basically say we're adding only position one and position three, not
position zero and position two. Positions zero and two are zeroed out, right? So
these will have zero in them in the result vector; the other two will be added up, right? So then,
basically, you can see the result: we're adding position one and
we're adding position three, and the other two are zero.
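As a small sketch with the 128-bit AVX-512 intrinsics (this needs AVX-512F plus AVX-512VL; the concrete values are just for illustration), the zero-masked addition could look like this:

    #include <immintrin.h>

    __m128i zero_masked_add_example() {
        __m128i a  = _mm_setr_epi32(1, 2, 3, 4);     // positions 0..3
        __m128i b  = _mm_setr_epi32(10, 20, 30, 40);
        __mmask8 k = 0b1010;                         // select positions 1 and 3
        return _mm_maskz_add_epi32(k, a, b);         // result: {0, 22, 0, 44}
    }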
Does somebody already have an idea why we want to have something like this?
Yes? Oh, yes.
To be able to select.
Very good, yes.
So, to be able to select, yes.
So, this is basically, with this, we can say,
these other positions we don't care about in the vector.
So, we'll see how this is useful.
And we can basically blend and we can merge.
So say, for example, we want to keep one array
and replace some of it, or we basically merge some parts.
So there are different ways of doing this.
So here we basically have an addition where we keep some positions;
below is basically a merge example.
So below we're basically adding up some positions,
and we're not zeroing out the rest,
but we're keeping those parts of the vector and won't add them up.
And we can, of course, also do it the other way around, right?
So keeping the other part of the vector, et cetera.
So this is basically also if we selectively want to do additions
or selectively apply some kind of operation to the vector.
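The merge variant, sketched under the same assumptions as above, takes an extra source vector whose values survive in the unselected positions:

    #include <immintrin.h>

    // Positions where the mask bit is 0 keep the value from src instead of becoming 0.
    __m128i merge_masked_add_example(__m128i src, __m128i a, __m128i b) {
        __mmask8 k = 0b1010;                      // select positions 1 and 3
        return _mm_mask_add_epi32(src, k, a, b);  // position i: mask bit set ? a[i]+b[i] : src[i]
    }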
And with this, then basically all of a sudden, we might end up with a vector where we don't actually need
all of the elements anymore.
So we might want to actually reduce the amount of elements
in our vector, or we have reduced the amount of elements
in our vector, right?
So we put most of the vector or some of the vector to zero, and these elements are not
interesting anymore.
So then it might make sense in further steps to actually compress this again or to pack
these elements again to be efficient in the next kind of computations, right?
So say, for example, we're selecting half of the elements of our vectors and then we're
doing yet another couple of operations on those, say some logical operations or something
like that.
We don't want to carry all the zero values with us that will not be used anymore.
So in that sense or in that case we can compress these elements to the beginning of the vector in order then to, say, for example, pack multiple vectors again into fully packed vectors.
So that will be a compress operation.
Again, works with a mask.
So basically telling the compress operation which parts will be in the result vector and which parts are not interesting to us.
And we also have an expand or decompress operation
where we can do the reverse.
So we basically take as many elements from the beginning of the vector
as there are non-zero values in our bitmask,
and place them across the vector.
So here's an example again.
So we have like a compress operation
where we have in our vector,
we have two values that are actually of interest.
The other ones could be zero, could be anything.
So we apply the compress operation with this bitmask, so 0, 1, 0, 1. So
position 1 and position 3 are interesting. If we compress, these two values at those positions will be at the beginning, at 0 and 1. If we decompress or expand with the same bitmask,
then the value from position zero will go to position one,
and the value from position one will go to position three.
Of course, you basically need to know:
this can at maximum be the number
of elements in the vector.
And this is done from the beginning of the vector here, right?
So if you have three positions, or three elements, that you want to expand somehow,
then you will need three one-values in your bitmask.
Okay? So this is cool, right?
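A small sketch of compress and expand with the 128-bit AVX-512 intrinsics (again AVX-512F plus AVX-512VL; the values are illustrative):

    #include <immintrin.h>

    void compress_expand_example() {
        __m128i input      = _mm_setr_epi32(7, 42, 9, 99);
        __mmask8 k         = 0b1010;                                 // positions 1 and 3 are interesting
        __m128i compressed = _mm_maskz_compress_epi32(k, input);     // {42, 99, 0, 0}
        __m128i expanded   = _mm_maskz_expand_epi32(k, compressed);  // {0, 42, 0, 99}
        (void)expanded;                                              // only here to silence unused-variable warnings
    }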
And this is what you will use in the task. But there's a problem, because say, for example, you're like me
and you work on this kind of laptop. Then
you cannot try it. I mean, of course, you can do this in the
compiler, like you can compile,
but you cannot execute it on this machine, because this machine doesn't support x86 intrinsics.
Yeah, so if you want to run it on ARM, then you need a separate implementation, right?
So this also means: if you use these intrinsics and you want to support multiple platforms,
this means code duplication.
And here we have this example again, right?
So we have our addition with the x86 intrinsics.
If we want to do this on ARM,
then we would need the neon intrinsics.
And they, I mean, they look closer
to what the compiler output looks like
or the assembly looks like, but it's really
like a completely different intrinsic.
And they're also not completely matching.
So some intrinsics don't have an expression on the other side.
So there might be some intrinsics
that you have in x86 that you don't have in Neon and vice versa.
So, I mean, if you want to do this,
you basically have to create separate code,
or duplicate code, for multiple platforms.
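To give an impression of that duplication, here is a sketch of the same four-integer addition written twice, once for NEON and once for SSE; the function and variable names are my own:

    #include <cstdint>
    #if defined(__aarch64__)
      #include <arm_neon.h>
      void add4(const int32_t* x, const int32_t* y, int32_t* out) {
          int32x4_t xv = vld1q_s32(x);        // NEON load
          int32x4_t yv = vld1q_s32(y);
          vst1q_s32(out, vaddq_s32(xv, yv));  // NEON add and store
      }
    #else
      #include <immintrin.h>
      void add4(const int32_t* x, const int32_t* y, int32_t* out) {
          __m128i xv = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
          __m128i yv = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
          _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(xv, yv));
      }
    #endif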
So, and actually also these intrinsics
are quite hard to read, right?
So these are some examples.
So maybe you have an idea of what it means,
but also for most of them,
I have to look up what they actually mean
because they're somehow encrypted
in like very short abbreviations
that are not super explicit.
Right?
So, I mean, while we will try for, or aim for, the best throughput
or best performance that we can get in general,
using these kinds of intrinsics might not always be the best idea,
because the platform-specific code reduces the maintainability, reduces the readability,
and is more error-prone. So you can get errors here quite easily. So what can we do? There is a solution in terms of using generic intrinsics.
And there's different ways how we can do this.
And maybe just as a reminder, for the task,
you can try generic intrinsics, but you probably
will get best performance with platform-specific intrinsics.
Also, you can try auto-vectorization, but you won't get far enough with auto-vectorization alone.
If you just do that, I mean, there you can get some performance benefits,
but you won't be able to fully complete the task just with auto-vectorization.
And with generics, I'm actually not sure, but I mean, no, I am sure: you don't have all of the operations that you need.
So just as a reminder. But in general, it actually makes sense to think if you really need these specific intrinsics,
because it's just much harder to maintain.
Also, even on a single CPU or a single architecture,
you still have these different register widths.
So if you use
512 bit, it will have better performance,
but it's not supported on all x86 CPUs.
So all of a sudden, your code will not run on all x86 CPUs.
If you're explicitly using 128 bit, it will be supported on more CPUs,
but you're not going to hit the 512-bit units.
So that just means lots of duplication
if you want to get the best performance.
And so for this, you can also use generic vectors.
So rather than writing explicit intrinsics,
we can do these generic operations,
as we did with auto-vectorization.
And this can then either be done by the compiler
or using a library.
There are many libraries out there,
xsimd being one of the major ones.
These are basically just wrappers
on top of the intrinsics.
And then there's different kind of implementations.
So this is one approach that you can try.
So they basically provide classes representing vectors.
And then you can manipulate these vectors
using overloaded operators.
So it feels like regular vector arithmetic
rather than using this very specialized intrinsics.
And then there is also a compiler extension approach.
So GCC has vector extensions where basically the compiler
provides an API.
And that actually also makes sense
because technically what happens internally
is that your intrinsic that you're using
will be translated to a compiler vector extension
that then will be translated again to the actual assembly.
So we're basically going an extra step,
but the compiler only has one path from the intrinsic through the compiler representation to the actual assembly. There is no magic in between going from one intrinsic to a different kind of
assembly instruction on a different architecture.
So this is basically not a solution
for general, let's say, platform-independent execution
if you're using x86 intrinsics, for example.
But we can directly use the compiler extensions, or vector
extensions.
And since they are directly translated to assembly instructions, internally they actually
have these different paths.
So using the compiler extensions, we can write platform-independent code as well, while explicitly
vectorizing, not relying on auto-vectorization.
This is available in GCC.
It's also available in Clang, but say, for example,
not in Microsoft Compiler.
So SIMD libraries are one way to do it.
There is an extra dependency, and they only work with the SIMD features
that the compiler supports.
Say, for example, AVX2 does not support 64-bit integer
multiplication, and then the libraries
use 32-bit multiplication.
So if you're doing AVX-512 using xsimd, for example,
or another library, the library will still
go back to the slower multiplication.
And so this basically won't give you the best performance
in these cases. There is a standardization effort for SIMD in C++.
And this would solve this kind of problem.
So if this comes and the feature set is large enough,
then you basically have the SIMD features in the standard C++.
And then these kind of operations,
if the feature set is large enough,
will be efficient again.
And the other thing, the way we can do is,
as I already alluded to a bit, are the vector extensions.
So the compiler maps these platform-specific types
and operations to an internal representation.
And this is basically just to keep the compiler simpler.
So there is basically an internal representation
for all of the optimization patterns.
So this is basically so that the compiler doesn't have to be rewritten,
or the code for these optimizations duplicated,
for all the separate kinds of intrinsic types.
There's an internal representation
where then optimization will go on, which then
is translated to the actual intrinsics
or to the actual assembly.
And we can use this directly, right?
And this is really for reusing the analysis and optimization.
And then rather than doing like x86 intrinsics, we can use the compiler extensions. And then we also have the natural code
with arithmetic and comparison operators.
So this is an example.
So rather than basically using, say, x86 intrinsics
and attributes, we can use the compiler extensions.
So basically create vectors that the compiler already knows
are vectorized and can be vectorized.
And just by doing this, basically,
then if we're using this vectorized addition,
we get the same kind of vectorized code, right?
So this, the compiler already knows this is vectorized code.
And if we then compile this, this
will create the platform specific
or use the platform specific intrinsics when compiled
and really produces identical code
on different kind of platforms.
So identical code in terms of execution.
And we've also used this for load and stores before.
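As a sketch, such a vector-extension version might look roughly like this (the typedef and function names are illustrative):

    // GCC/Clang vector extensions: the compiler picks the platform's vector instructions itself.
    typedef int v4si __attribute__((vector_size(16)));  // 4 x 32-bit int = 128 bits

    v4si add4(v4si x, v4si y) {
        return x + y;  // lowered to, e.g., paddd on x86 or a NEON vector add on ARM
    }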
And so this you can basically see here.
This will translate directly into this kind of code
where we use the x86 addition
or in the ARM code, so the Neon code,
where we use the Neon addition.
So this is basically what the output,
I mean the assembly, will look like.
Let me see, maybe I have an example.
Open up. No, I don't have it open.
Sorry.
I think we probably linked it here somewhere.
So this is basically the ARM version, right? So here we're loading
the vector, and this is the ARM code, right, where we basically
actually do the intrinsic.
And we can see this here.
Let me show the code here.
This really is, let's see, this is the actual addition here.
Here we have, let me find, this should be the actual addition.
And you can see this looks very much the same as this code here.
So this is basically the ARM assembly for this kind of vectorized execution.
And the same is true for x86.
So it's basically exactly the same code. So here on top we
have the vector extensions, and we're just doing the vectorized execution. Here on the bottom,
we're using intrinsics with vector registers and the vectorized execution. And the actual code for this should
be here, right? So this is basically, well, this is unrolled. So this is basically what
we see here. Again, we see the unrolled code. This is exactly the same code, just using the
generic intrinsics. I mean, that's the beauty of it, right?
This code is exactly the same that we're using on ARM
and we're using on x86.
So we don't have to change anything.
While if we want to explicitly vectorize this,
we have to do it separately for both.
So we have specialized register descriptions
and an addition that is different from the x86 one here.
And you can play around with this,
because the links are in the slides here.
OK. So there are also special operations
like shuffles and conversions.
So a shuffle basically reorders a vector.
And this is basically what we can also
use in order to do these compressions and decompressions,
for example, or just to reorder the elements.
So say, for example, in this case,
we have a shuffle with four elements, A, B, C, D,
and a shuffle vector that basically tells the operation
that position 0 should go to position 2,
position 3 should go to position 1,
position 1 should go to position 3,
and position 2 should go to position 0. And this basically then just reorders the vector
according to this index list.
So this basically, given the input,
will produce some kind of output.
And with the convert vector operation,
we can also convert the numbers, or the vector data types,
from larger to smaller data types, or vice versa. So if we have 32-bit
numbers, we can convert them to 16-bit numbers, et cetera. And this is cool, right? Because,
again, this kind of stuff is basically generic. So it will work on x86.
We'll work on ARM.
We'll also translate to the correct register width that can be used.
So if we have 512 bits or AVX 512,
and our code supports this or shows that this can be used,
then this will be translated to AVX 512.
If we only have 128-bit, it will be translated
to 128-bit operations.
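For shuffles and conversions, the generic route goes through compiler builtins. Here is a sketch using the GCC spelling (Clang's __builtin_shufflevector differs slightly), with illustrative index values:

    typedef int   v4si __attribute__((vector_size(16)));  // 4 x 32-bit int
    typedef short v4hi __attribute__((vector_size(8)));   // 4 x 16-bit int

    v4si reorder(v4si in) {
        v4si idx = {2, 3, 0, 1};                   // result[i] = in[idx[i]]
        return __builtin_shuffle(in, idx);
    }

    v4hi narrow(v4si in) {
        return __builtin_convertvector(in, v4hi);  // 32-bit lanes become 16-bit lanes
    }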
However, of course, not all of the combinations work well,
so you can always find some, let's say, special cases
where the conversion doesn't really work well,
and not all instructions can be expressed.
So especially this compression and decompression or expansion,
this is an operation that we'll also use in the task.
This is not expressible in generic vector extensions.
So for this kind of stuff,
you will have to go back again to the original intrinsics.
So this is something that you cannot do generically.
Okay.
Questions so far?
Yes.
What about the PowerPC architecture?
Like, is it also supported by GCC vector?
That's a good question.
Actually, I don't know.
So the question is, is PowerPC also supported?
And I would have to check, to be honest.
I know there are vector extensions,
but presumably there is some support,
but I don't know how far it goes.
Other questions so far?
No other questions?
Then we'll do a quick break here.
So let's see how we use this in database operations.
So I've talked a bit about this already,
and we've discussed a few things.
But now let's get a bit more concrete.
So typical database operations go through lots of data,
say, for example, in a scan.
And this we can also SIMDify.
So these kinds of operations, I mean, they do the same kind of
operation on multiple data items, so there it makes sense to use SIMD operations. But there are
also other kinds of operations. So whenever we're working with a lot of data, or at least, say, multiple data items at the same time, then we can vectorize.
And usually, I mean, we have to load multiple data items anyway,
or we have to do some lookups,
then vectorization always makes sense.
Only if we're using like a single value,
like our final result, we're doing some operations there,
there it doesn't make sense to vectorize, right?
Because there's nothing to vectorize.
But in general, we have to look, search for something,
we have to scan through something,
we have to process many data items at the same time.
And as soon as we can pack them into a register,
into a SIMD register, it makes sense to SIMDify.
And so joins are SIMDifiable, sorting, filters, etc.
And there's a nice paper, an older one already,
where basically people from Columbia University
discussed what kinds of things can be SIMDified,
or how things can be SIMDified in a database.
And so let's look at a scan first as a simple example. I mean, technically, if we look at
an extended projection, then this is what we've done so far already. So if we're selecting price and tax, price plus tax,
from a table, and we're doing an addition in here,
and we're filtering on this, this
is more or less what we've done so far.
At least the addition is exactly the same kind of stuff.
So we basically walked through an array of elements, or a vector of elements,
and did an addition here. Now we just have to filter on top of this. So all of the code examples
earlier are in essence already a reduced version of this. But we can also do aggregations,
right? So if we do a minimum or a maximum,
that's also very efficient.
These are simple comparisons.
This we can SIMDify.
And this can be done efficiently using basically
these comparison operators and bitmasks.
And later on, this again ends up similar:
sums, et cetera, can be done efficiently.
The selection part is a bit more tricky, right?
So we said there's no branching.
And of course, if you would have branching
in SIMD instructions, what would be the semantics?
If you think about it, you have a vector size of only four,
and you have branching conditions in there.
Well, all of a sudden, you have
16 different ways the branches can go.
It's not just true or false.
It's true or false for every individual element
in the vector.
So this is basically going in all different directions.
So branch prediction won't help you much anymore,
because we have so many choices.
So branching doesn't make sense.
So how do we do this?
You could generate a mask that
selects the data.
Thank you.
Very good.
So we can basically use masks.
So for this, we can use masks.
And on the other hand, moving data between SIMD
and scalar registers is quite expensive.
So we don't want to go back and forth for branching.
So again, we're basically using masks
in order to do these kinds of things.
So if we want to do something like this, where we're selecting the quantity from lineitem again,
you remember, this is TPC-H,
where the price is larger than 42, then this we can do with a mask, right?
So we can basically load our vector, we do a comparison in parallel,
we create a mask, and then either we
interpret this in scalar mode, or we're using this mask to compress the vector, basically
compress all of the elements that we're actually interested in, merge the vectors
such that we have full vectors again,
and then continue working with these full vectors
rather than going back and forth between the registers.
Okay, so some of the fundamental database operations
for this kind of stuff would then be, say, for example,
a selective store.
So this is the selective store, or the compression,
that I just mentioned, right?
So if we're filtering, then with the mask
we basically know what's interesting to us,
or what passes the filter condition.
And then we can selectively store this
in memory or in a register,
and then basically continue further on with only the subset.
And this is basically a compression operation.
So we have here our vector from A to P.
And we only want to store the highlighted ones, so the gray ones, B, D, F, G,
et cetera.
So for that, we basically create a mask.
And this can be done through a comparison, for example.
So then we have this mask.
And with the mask, we can selectively
store and compress the result. So we could do a zero mask,
so we're only storing the ones we're keeping, and the values
or positions that we don't want are set to zero; or we directly compress and just keep
whatever we need in the end. And the same we can also do vice versa, right? So we can do it the
other way around. Again, this is this expand or decompress operation
that we can also do during a load: basically we're loading memory locations,
like selective memory locations,
into our vectors, right? So rather than overwriting everything, or just loading
a compressed part, we can also say, OK,
we already have a vector.
And some parts we want to rewrite
with the newly loaded elements.
And the parts that we were not interested in then
will remain with the
old values, right? Or that are not selectively loaded.
Another important operation for these kinds of things are scatter and gather operations.
So a gather operation is basically picking individual positions: like, we have an array of elements,
and then we select individual positions in the array and place them at certain positions in a result array. So rather than just saying, OK, we're compressing or decompressing, we're
really shuffling while compressing at the same time,
or selectively loading at the same time.
So it's a bit more of an advanced operation,
but it basically allows us to directly reorder the vector
or shuffle the vector while we're selectively loading.
So this loads from non-contiguous locations.
So what we basically need for this is our input vector with the values that we're
loading, and an index vector that tells us which position should be loaded where.
So on position 0, this index vector says we load the element from position 2.
So this is the C, basically.
On position 1, we load the element from position 9.
So we're loading the J, basically.
On position 2, we're loading the element from position 0.
So this is basically how this works.
The same also works the other way around, as a scatter operation: we're basically storing the elements at separate, non-contiguous positions in an array.
So again, we have an input array, or a value vector,
and we have an index vector.
The index vector basically tells us
where to store the elements in the target vector,
or in the target memory.
And this can again be multiple locations.
So on the one hand, the gather is loading
from non-contiguous locations within a contiguous array.
The scatter operation loads from a contiguous vector,
but stores into non-contiguous locations.
So it's basically doing a very similar thing;
just where stuff is stored is different.
And these are not executed in parallel.
So this is one of the operations.
This exists as an intrinsic, but there is an algorithm behind it.
So this will translate into multiple CPU cycles,
because the cache cannot answer all of these individual requests
or so many distinct requests at the same time.
That's just something to be aware of.
Still, the operation exists, so it's easier to write, basically.
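A sketch of what that looks like as intrinsics: the gather needs AVX2, the 128-bit scatter needs AVX-512F plus AVX-512VL, and the index values are just illustrative (src and dst must of course be large enough for them):

    #include <immintrin.h>

    void gather_scatter_example(const int* src, int* dst) {
        __m128i idx      = _mm_setr_epi32(2, 9, 0, 5);        // which positions to touch
        __m128i gathered = _mm_i32gather_epi32(src, idx, 4);  // loads src[2], src[9], src[0], src[5]; scale 4 = sizeof(int)
        _mm_i32scatter_epi32(dst, idx, gathered, 4);          // stores the four lanes to dst[2], dst[9], dst[0], dst[5]
    }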
And so if we want to use this
in a practical application: say we have a filtered column
scan.
So this is basically what we had earlier, right? So we basically filter
an array using greater-than, the same example as before. So everything that's greater than
42, for example, is what we want to have in our result. Then we can SIMDify this. And this is a scalar implementation of it:
So we have our filter function, which basically takes as input the column that we want to filter
and the filter value, and then we have an output. And then, for the column size,
so for all of the values, we go through them one by one,
row by row. This is column-wise execution:
we don't have the full row,
we just have the column that we're interested in.
And for each row in this column,
we're basically checking: does the filter condition hold?
And if it holds, yes, then we will output an index for this row.
That's in order to later on work with the full row again, maybe,
or just to know which positions we're actually interested in.
And these positions we could then later, for example, use in a gather operation
in order to then have the
full column somewhere, to have the actual values somewhere.
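In code, a minimal scalar sketch of this filter might look like the following; the names are illustrative, but it matches the spirit of the version on the slide:

    #include <cstddef>
    #include <cstdint>

    // Writes the indices of all matching rows to out_indices and returns how many matched.
    size_t filter_gt(const int* column, size_t column_size, int filter_value, uint32_t* out_indices) {
        size_t out_count = 0;
        for (size_t row = 0; row < column_size; ++row) {
            if (column[row] > filter_value) {
                out_indices[out_count++] = static_cast<uint32_t>(row);  // remember the matching position
            }
        }
        return out_count;
    }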
As an example, if we have the filter, we're inputting five values here, 4, 13, 18, 4,
and 2 in an array, and we have the filter condition 9.
So everything greater than 9 is matching.
Then we know, okay, 4 doesn't match.
13 does match.
18 does match.
4 doesn't match.
2 doesn't match.
So positions 1 and 2 in the array match our result, match our condition.
So this is the output.
And this we can also do as a vector, right?
So we can do the same thing.
Basically, we have the input and the comparison vector.
And, I mean, we're comparing the complete vector
at the same time.
And then, rather than individually outputting the indexes,
our output will be a bitmask.
So we get as an output a bitmask
which tells us which elements in the vector
match our filter condition.
So here we're using the built-ins.
So we're basically checking this:
I mean, with the generics, we're checking, OK,
is the value vector greater than the comparison vector?
So we're building a comparison vector
with our comparison
value on each position, and we're comparing, and then our output will be this result bitmask.
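A sketch of that comparison step with the generic vector extensions, producing the bitmask but not yet doing the compression (names are illustrative):

    #include <cstdint>

    typedef int v4si __attribute__((vector_size(16)));  // 4 x 32-bit int

    // Compare four column values against the filter value at once and collapse
    // the lane-wise result (-1 for a match, 0 otherwise) into a 4-bit mask.
    uint8_t match_mask(v4si values, int filter_value) {
        v4si cmp  = {filter_value, filter_value, filter_value, filter_value};
        v4si hits = values > cmp;
        uint8_t mask = 0;
        for (int i = 0; i < 4; ++i) {
            mask |= (hits[i] ? 1u : 0u) << i;
        }
        return mask;
    }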
And now, what we actually want to have in the end is this kind of result,
right? So we only want the positions that we're interested in. We don't want the columns,
the rows, that do not match.
So for this, we need to do
some kind of compression again.
And whenever we use the generics,
this is something that we cannot express in the generics.
So here we have a separate function in here
that basically
compresses the rows, or left-packs the rows, in this vector, so that we have all of the rows that we need
at the beginning, right? And so that raises the question: how can we implement this? Well,
in AVX-512, this can be implemented with the compression method
that we saw earlier.
And we can also do the compress and store in one instruction,
because we don't need to first compress in a separate SIMD
register and later store, but we can directly say compress it
and store it right away.
And this cannot be expressed in GCC vector extensions.
So if we're using this mask compress in AVX-512,
this gives us a factor-35 speedup over the scalar version.
So that's actually quite significant,
because we don't have to iterate through all
of the individual elements one by one
with scalar integer operations,
but we can do this at once in a single instruction.
And if we want to do this in the GCC vector extension,
so we want to do this in a generic way,
then we need to do the shuffling.
So for that, we can then basically create a shuffling
mask and convert the matching mask into a shuffling mask
using a lookup table.
So I'll show this to you in a bit or in the next slide.
If we're doing the same thing using shuffling,
then we do need to do some extra operations.
So we basically get half of the throughput.
So we have one extra step, which means our throughput
or our speedup is basically half of the speedup
that we get if we just use the single intrinsics in AVX 512.
So here, if we do a left pack in AVX 512. So here, if we do a left pack in AVX 512,
then this is really just this mass compress for.
So we have our bit mask that is basically
the output from the comparison.
And we have our input.
And we're compressing this directly.
And we can directly store basically already.
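As a sketch, the AVX-512 left pack then boils down to a single compress-store (AVX-512F plus AVX-512VL for the 128-bit form; the function name is illustrative):

    #include <immintrin.h>

    // Store only the lanes selected by mask, packed to the front of out; returns how many were written.
    int left_pack_avx512(int* out, __mmask8 mask, __m128i values) {
        _mm_mask_compressstoreu_epi32(out, mask, values);
        return _mm_popcnt_u32(mask);
    }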
And in GCC vector extensions, what we can do
is use this bitmask
and transform it into a shuffle mask.
So basically, depending on the bitmask,
so how many of the elements match,
or which of the elements match,
we're shuffling them to the front of the array.
So then we just need to remember, basically, the shuffle mask for each case.
This first one would basically mean, okay, we don't have to do anything, no shuffling.
The second one is the same thing, right? So
if only the first one matches, for example, well, we're still good, right? We don't
need to change anything. If the first and the second one match, again, we don't need to
change anything. If, say for example, only the last three match, then we need to shift.
So then we basically need to move the first element to the last position,
because we want to compress, so that only the first three elements in our vector are the interesting ones.
So then we need to do some shuffling.
And we could use different kind of masks here. But the logical way of doing this
is basically shifting everything to the left.
So the first element goes to the back,
because that's not going to be interesting to us.
The second element goes to the first position.
Third element goes to the second position, and so on,
and so forth.
Make sense so far?
So basically, using the shuffling, we can do the compression.
And for the compression, in the second step,
we then just selectively store the part of the vector that's interesting to us.
The rest will basically be gone then.
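A sketch of this generic left pack, using GCC's __builtin_shuffle and an illustrative lookup table that maps each 4-bit match mask to a permutation moving the matching lanes to the front (any permutation with that property would do):

    #include <cstdint>

    typedef int v4si __attribute__((vector_size(16)));

    // For each possible 4-bit match mask, a shuffle index vector that moves the
    // matching lanes to the front; the order of the non-matching tail does not matter.
    static const v4si kShuffleTable[16] = {
        {0, 1, 2, 3}, {0, 1, 2, 3}, {1, 0, 2, 3}, {0, 1, 2, 3},
        {2, 0, 1, 3}, {0, 2, 1, 3}, {1, 2, 0, 3}, {0, 1, 2, 3},
        {3, 0, 1, 2}, {0, 3, 1, 2}, {1, 3, 0, 2}, {0, 1, 3, 2},
        {2, 3, 0, 1}, {0, 2, 3, 1}, {1, 2, 3, 0}, {0, 1, 2, 3},
    };

    int left_pack_generic(int* out, uint8_t mask, v4si values) {
        v4si packed = __builtin_shuffle(values, kShuffleTable[mask]);  // move matches to the front
        int count = __builtin_popcount(mask);
        for (int i = 0; i < count; ++i) {
            out[i] = packed[i];  // selectively store only the packed prefix
        }
        return count;
    }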
And we can use this, as you will see, to do some kind of bit-packed integer decompression,
as an example.
So if we say, for example, databases
want to be efficient, space efficient, or memory efficient.
So if we have 9-bit integers in our
data set, we want to store them packed. So this is not going to be aligned in any form,
but it's going to be memory efficient. And if we want to work with it in SIMD registers,
for example, we have to unpack this again. So we have to make sure that we can calculate with the numbers in our regular,
whatever integer size we want to use.
So 16 bit or 32 bit integers then later on.
So we have this tightly packed and we want to say,
for example, have 32 bit numbers that are zero extended
then and do some kind of operations on those.
And, I mean, a scalar approach is basically:
well, we track the leading unused bits,
and we're copying the four bytes
containing the nine number bits into a register,
and then shift, basically depending on the position,
in order to remove any unused leading bits.
And we're doing a bitwise AND to zero the remaining bits,
the remaining 23 bits, and repeat
until we have all of the elements extracted.
So this is maybe a bit too abstract.
So let me show you this in a concrete example.
So here we have our individual numbers.
So number 1, 2, 3, 4, 5, et cetera, 9-bit numbers.
We have the byte boundaries,
so byte 1, byte 2, byte 3, byte 4.
And we have the lanes, so the 32-bit lanes
where we actually want to have the final numbers in.
In our result, essentially, number 1
should be in lane 1, number 2 should be in lane 2, number 3
in lane 3, and number 4 in lane 4. But if we're loading 32, 64, or 128 bits into our register, then we have many more numbers
that we cannot use here actually, and we will basically have to move something.
So in order to work with this, what we're doing is basically we first load the data in here and then we have to shift data.
So first then we have to basically move the data.
So the first one is already correct, the first element.
The second element needs to be moved to the second lane, and we're going to do this on byte ranges, right?
So this is basically then moved to the second,
like from byte two, we're moving to lane two.
From byte three, we're moving to lane three.
And from byte four, we're moving to lane four.
So this basically makes sure that we capture everything.
But then we have these unused bits in the beginning, right?
So this is basically something that we cannot work with,
or that's actually wrong, right?
This is part of the data that we don't want to have.
So then we basically need to shift every element
or in every lane separately, correctly,
in order then to have the number at the beginning.
And then in the next step, I mean, this is, whoops,
sorry about this.
Then in the next step, we basically
mask off the most significant 32 minus 9,
so 23, bits using
a bitwise AND, and we have the final number.
So we just zero this out, and this is the final number.
And the somewhat confusing part of this is that in memory,
the most significant bit is basically
here.
So the number is basically written from left to right.
So let's say our number is 123.
Then what is in memory is 3, 2, 1.
And then we basically mask off the rest.
And after this,
with these individual operations,
we have the four numbers in our register,
and we can continue with the next lanes.
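As a sketch, the three steps for one group of four values could look like this, assuming a little-endian layout where value i starts at bit 9*i; the byte shuffle needs SSSE3 and the per-lane variable shift needs AVX2, and the constants are for this first group only:

    #include <cstdint>
    #include <immintrin.h>

    __m128i unpack4_9bit(const uint8_t* packed) {  // assumes at least 16 readable bytes
        __m128i raw = _mm_loadu_si128(reinterpret_cast<const __m128i*>(packed));

        // Step 1: copy, per 32-bit lane, the two bytes that contain value i's 9 bits
        // (value i starts at byte (9*i)/8); the -1 entries produce zero bytes.
        const __m128i byte_shuffle = _mm_setr_epi8(0, 1, -1, -1,   // value 0: bytes 0-1
                                                   1, 2, -1, -1,   // value 1: bytes 1-2
                                                   2, 3, -1, -1,   // value 2: bytes 2-3
                                                   3, 4, -1, -1);  // value 3: bytes 3-4
        __m128i lanes = _mm_shuffle_epi8(raw, byte_shuffle);

        // Step 2: shift each lane right by (9*i) mod 8 so the value starts at bit 0.
        lanes = _mm_srlv_epi32(lanes, _mm_setr_epi32(0, 1, 2, 3));

        // Step 3: mask off everything above the lowest 9 bits.
        return _mm_and_si128(lanes, _mm_set1_epi32(0x1FF));
    }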
This is with 128-bit.
If we have 256-bit vectors, then we can process, rather than 4,
we can process 8 numbers at a time.
And we're processing 9 times 8 bits at a time,
so which is divisible by 8.
So that's actually nice, because the next input element is already
byte aligned, so we don't need to do any shifting
in the beginning or any copying there.
So that makes it a bit faster.
So that's one type of operation.
So here you can see that using these SIMD instructions, we basically move data around.
So this is a lot of the stuff that we need to do.
We need to do some alignment, data movement, et cetera, to properly work with the individual values then.
And filter out things that we don't need, et cetera.
And a similar operation, or let's say,
a similar way of thinking about it,
but very different application is a SIMD tree search.
So if we have a binary tree, like a scalar implementation
of a binary tree: if you're implementing this efficiently,
then you have your binary tree in an array,
and you basically have the root at the first position, then
the second level at the second and third positions,
and the third level, and so on and so forth.
With this you can easily calculate
where each individual position in the tree is
and you can quickly jump to the correct position
in the tree and just compare.
And so this is basically usually an efficient way
of representing this.
And so for each node n, the children then are in positions 2n and 2n plus 1.
So that's actually quite easy to find.
So there's a very simple calculation.
But the unfortunate part when traversing this is that we basically have to jump a lot in memory,
right? So especially the further down in the tree we are, the further away the children are
from the nodes. So that means if we're doing the next comparison, this is actually quite a large
jump. I mean, the two children are next to each other, which is good,
but then the next level will again be far away, and the more levels, the further away.
So if we're looking up a couple of items, then we basically will always compare with the root, and then with the current node: if the current node is larger,
then we go to the left child.
If the current node is smaller,
then we go to the right child,
or we might already have found it, right?
So it could also be that
the current node is the node that we're looking for;
then it's also good.
And so if we do this in parallel,
or we want to do some parallelization,
like a lookup for multiple items:
the outer loop we cannot really parallelize well,
at least not in a SIMDified fashion,
because the lookups are completely independent, right?
So, I mean, we can do some,
but here we
would need some scatter-gather operations to properly do this.
And can we vectorize the inner loop?
Well, there we have a data dependency
between the loop iterations.
But we can do multiple levels in parallel.
So we can basically do multiple comparisons across the
levels in parallel. But the problem is, if we do these multiple comparisons in parallel, then,
well, we're jumping around a lot in memory, which is not good for SIMD. So this would basically
need some scatter or gather operation to move things together. So, I mean, if we say, for example,
we want to do comparison of two levels in parallel,
then we can do three comparisons in parallel at one time.
So here, for example, these would be the positions
in our tree, or the array positions in our tree.
So array position one is the root,
two is the left child, three is the right child,
four is the left child of node two, five is the right child of node two, then eight, and so on. So you see
the jumps get larger and larger here. And if we do these comparisons in
parallel for the root, it actually works quite nicely, because these are tightly aligned: we can just load them and directly compare them. On the next level here, if we're going left here, for
example, this is not tightly packed anymore. So this is actually not good, right? So this we cannot do
well anymore. So what we actually can do, in order to make this work, is pack these levels tightly. So we're packing subtree
regions into some SIMD registers. We're basically giving up the way the tree is structured in
the array, or rearranging how the tree is structured in the array, in favor of getting these
tightly packed subregions, right?
So what we can do is basically, I mean, for the first level, this was already correct.
Then we can say, well, rather than packing basically these two and then these two,
we can pack these three and these three next to each other,
such that then we can load them at once
and do the SIMD comparison again.
So we need to rearrange this in memory,
and then we can do a SIMDified search operation;
we can basically do a SIMDified comparison operation.
This will again give us this kind of bitmask.
And so if we say, for example, we're searching for key 59, then we're starting up here,
so we can compare all three elements at the same time. So there's no match in here, but that
already helps us: okay, we basically need to go further. So in this case, if we do the comparison
with 59, then we know, okay, this one is definitely smaller, this one is
smaller as well, and this one is larger. So we would have to continue in this direction, right? And
so our vector comparison will basically tell us: okay, this one was smaller, one; this one was smaller, one; and this one was larger, zero.
So we need to move here.
An alternative would be all of them are smaller,
then we need to move here.
This one is larger, this one is smaller,
then we need to move here, or we need to move here.
So there's basically four options,
as the four children that we have under this subtree.
And so we can basically
move this mask into a scalar register; like, the comparison,
the result of the comparison, can be
moved to a scalar register.
And we can use this mask as a lookup,
to basically say, okay, where do we have to continue from here? And while we could
have all kinds of masks, the mask that we're getting out here is basically 110. I mean, this
is abbreviated.
So this is basically our mask,
and it tells us the direction.
It means this one is smaller, this one is smaller, and of course then this one also needs to be smaller.
Something like, say, this one is larger but these two are smaller, we can't even get.
This doesn't work, right? So this is a wrong configuration.
So this doesn't interest us; only the ones that actually lead to a correct lookup we're using.
And using this, we basically can then directly know, or we can look up in a table, where
we need to go from here. So we're comparing all of these, and depending on the result of the
comparison, we know where to continue in our tree. And, well, the blocking we can also do
hierarchically. So we saw this blocking for SIMD makes sense, right?
So in SIMDified fashion,
we can then do the comparison directly.
So we did like in 128 bits, we might get two levels.
In 256 bits, we might get three levels already and so on.
So we can actually speed up the lookup in here,
but being fast in loading this
also makes sense for memory, right? So we can also do, on top of the
SIMD blocking, cache-line blocking, making sure the cache lines
have complete blocks in them,
and we can also do this with memory pages, or pages on disk
as well, right? In order to
always load only what's relevant for us, or, let's say, have all of the local data together where we will
probably do the next lookups. And if we do this, then we can see that for larger trees, where the blocking
makes sense, we get quite significant speedups by doing this kind of blocking. So
here we are basically in the megabytes, I assume.
So here, we're basically doing nothing, like a normalized one.
If we're doing page blocking, this already gives us 30%.
If we're doing cache line blocking, another 10-ish
percent.
And then SIMDifying this gives us another 10%, right?
And then we can also use software pipelining.
So there's a paper on this, how to do this efficiently.
And we see that here we can get
like a tenfold performance improvement
just through this, like shifting the data
or aligning the data properly in memory
and then SIMDifying the execution.
For very small trees, this doesn't work so well
because 64,000 keys, this is something that fits in cache.
So here, like doing additional rearrangements, et cetera,
won't give you much performance improvement here.
Okay. So, with this: questions so far, up to here, on SIMDification of your code?
So the idea here is really to give you an idea how to think in SIMD operations.
So this is what you will need for your task.
It's a kind of different way of writing code,
because you don't use branches.
But you're actually doing everything in parallel.
It's similar to predication, essentially.
So we're always executing the code.
And all of the things that we do,
we're just continuing with the parts
that we're interested in then later on.
So we talked about SIMD basics last time and some vectorization
options and database operations today.
Next week, I'm not here.
So Martin will present execution models and data structures to you.
Also fun topics.
Yes?
That's a good question.
Is it aggregated?
I guess it's aggregated across all of the cores.
So that would give us, like if it's four cores,
it would be 4.2, right?
That might make sense.
Six cores, maybe something like this, presumably.
And what is the best reported number?
Is it the current?
Best reported number
is basically what other people have reported before
in the paper.
So I mean, all the details you can actually
find in the paper that's linked in here,
but the best previously reported number from other research,
basically related work.
OK, thank you.
Oh, you're welcome.
More questions?
No more questions?
Well, then, thank you very much, and see you in two weeks.