Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Vectorized Execution

Episode Date: May 9, 2023

...

Transcript
Starting point is 00:00:00 Okay, so let's get going. Welcome everybody to our next session. Today, mostly we're going to talk about vectorized execution, but in the beginning I'll stop or I'll finish up the last lecture with prefetching. But before that, a quick announcement. Tomorrow there will be another session of the bi-weekly joint seminar with TU Darmstadt, and one PhD student out of my group, Ricardo, will present his current work. So that should be interesting. It's not super related to this
Starting point is 00:00:35 class, but if you're curious about this kind of work, feel free to join, and in the next couple of sessions of the seminar we're also going to have some nice industry talks and other research talks. I'll keep you updated on it. We're still here. A quick announcement regarding timing. We had some additional small changes, not that much. You already know tomorrow you will learn about the first task, your SIMD exercise, and we'll finish up SIMD then, but we move the profiling session to a bit later because that still needs some preparation. So that's not
Starting point is 00:01:19 going to be next week but the week after basically. And with that, that's all announcements I have so far. So let's switch to the other slides and see, talk about prefetching. If it works, yes, it works. Okay, great. So we talked about instruction execution and different kind of data hazards or general pipeline hazards. And one of the hazards are data hazards, basically the problem, or also one problem that we need to still load stuff from memory or into the caches.
Starting point is 00:02:09 And basically, our CPU has to wait until this is loaded. And then, well, then basically the CPU needs to wait, right? And in order to not make it wait so long, we can also be smart about it. I mean, also the CPU and the compiler tries to be smart about it by basically preloading data that will be used soon. And this we call prefetching.
Starting point is 00:02:37 So rather than just basically reacting on the memory accesses, we can also say, well, this is gonna be data that we're gonna need, so let's get this beforehand. And I mean, this is quite obvious if we're just running through an array one by one, right? We're not gonna just wait until we hit the end of our cache line and get the next cache line, but we know the next array positions will be loaded next
Starting point is 00:03:08 or will be accessed next, so we can already access this or we can already load it. But if this is more complex, the pattern, then the compiler won't see it, the CPU won't necessarily see it, and we can basically tell the compiler or tell the CPU what to load next already. So we can be smart about it.
Starting point is 00:03:35 The CPU has a hardware prefetcher, so the CPU already tries to be smart about it and tries to get the next cache lines and also tries to see the stride pattern and load the data according to the stride pattern. But the prefetchers don't necessarily prefetch across page boundaries. So, you might get everything in a page, but not necessarily to the next page, because the next page might be somewhere completely different in memory, as we know by now. However, rather than just waiting for the hardware prefetcher, or if we have a weird access pattern, which is weird in the sense of not just very regular. Then we can also help the CPU, help the program to be executed more efficiently and prefetch the data explicitly. So there's a built-in that basically says,
Starting point is 00:04:41 well, these memory addresses, please prefetch them. They can be prefetched for reading or for writing, and then they can also be prefetched with a certain access pattern. So, with low temporal locality, meaning we're only going to access this once, or we access this quite frequently, so please keep it in cache, or something in between. We basically can also say this is the way we will access this in the future. That's one way.
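For reference, here is a minimal sketch of what such an explicit prefetch can look like with the GCC/Clang builtin __builtin_prefetch(addr, rw, locality); the indirect access pattern, the function name sum_indirect, and the lookahead distance of 16 are illustrative assumptions, not taken from the slides:

```cpp
#include <cstddef>

// Sum values reached through an index array -- an access pattern the
// hardware prefetcher cannot predict. We therefore prefetch the element
// we will need a few iterations from now.
// __builtin_prefetch(addr, rw, locality): rw = 0 for read, 1 for write;
// locality = 0 (no temporal locality) up to 3 (keep in all cache levels).
long sum_indirect(const long* data, const std::size_t* idx, std::size_t n) {
    constexpr std::size_t kAhead = 16;  // prefetch distance, would need tuning
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
            __builtin_prefetch(&data[idx[i + kAhead]], /*rw=*/0, /*locality=*/3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```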
Starting point is 00:05:27 And this sometimes helps us to make things faster, right? So basically avoiding all these cache misses, essentially. Something else that we can also do are so-called helper threads. This is, let's say, faking the same thing, right? Rather than actually having some kind of instructions that tell the CPU, please load this already, I'm going to use this soon, I can basically have an extra thread
Starting point is 00:05:57 that will access the data that I will later use for processing. So the idea is basically you have a thread that just produces cache misses all the time, and basically already by that then has the data in cache once I need it. I mean, this is somewhat tricky, but it also works. I mean, it's basically you have your main thread and you have this helper thread, and the helper thread does nothing, just touch data, that's all it needs to do. And by touching the data, then we will have it in cache already. It of course needs to work ahead of the main thread, so it needs to access the data that the main thread does not
Starting point is 00:06:40 access yet otherwise it doesn't really make sense, right? Otherwise we're just producing extra noise. And the main thread somehow has to tell the worker thread or the helper thread, or at least this needs to somehow be in sync so that the helper thread knows what to prefetch. And if the main thread has some kind of worker headset, then the helper thread can already touch all this data. It's going to be in cache. And of course, this sometimes can help a lot if the main thread has enough to do. If the main thread will just be as fast as the helper thread, then we have a problem. The biggest problem with this kind of implementation is that main thread and helper thread are
Starting point is 00:07:33 actually accessing the same cache line. Because then we have two threads touching the same cache line, well, if this is not properly in order, then basically, well, we have two threads touching the same cache lines. We have basically conflicting access. And then the system or the CPU will raise a memory order machine clear event, which basically flushes the pipeline. And this is really bad.
Starting point is 00:08:09 So then this means all of our instructions are flushed. We have kind of a serial access to these cache lines. And our helper thread is not helping us really, but our helper thread is just destroying our parallelism in the CPU. This can even be worse if our helper thread will then just basically spin on the data to basically try to get access to this cache line and by that just basically creating these conflicts all the time. So an idea is what you can do is rather than having the helper thread work forward, like in the same direction,
Starting point is 00:08:51 you can basically have it work backward, like in the opposite direction. You will still, you have a chance of creating these conflicts, but there's only few basically in each work set. And you still have a lot less cache misses in the main thread because you already have all the data in your caches. Or the helper thread at least gets a lot of the data in the caches.
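To make the helper-thread idea a bit more concrete, here is a toy sketch (my own illustration, not the code from the experiments shown next): the helper only touches one value per cache line, walking the array backwards, while the main thread does the actual work front to back. It assumes a shared cache between the two threads and that the helper runs on its own core.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Toy helper-thread prefetching: the helper does no useful work, it only
// touches the data so it ends up in the (shared) cache before the main
// thread gets there. Walking backwards reduces the chance that both
// threads fight over the same cache line.
long process(const std::vector<long>& data) {
    volatile long sink = 0;  // keeps the helper's loads from being optimized away

    std::thread helper([&] {
        // Touch one long per 64-byte cache line (8 longs), back to front.
        for (std::size_t i = data.size(); i >= 8; i -= 8) {
            sink = data[i - 1];
        }
    });

    long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i];  // the real work, front to back
    }
    helper.join();
    return sum;
}
```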
Starting point is 00:09:20 So, well, you can basically see some experiments that people did with different kinds of prefetch. I'm not going to go through this in detail, but you can see that working backwards will basically give you a better performance than working forwards, where you can have more of these conflicts. The most penalty you basically get if you have kind of the spin loop on the cache line. You basically try to access the cache line in the helper thread, even if you have a conflict,
Starting point is 00:10:01 which then means basically you're fighting with your main thread for the same cache line. So, that's kind of the worst. This is even worse than running it just single-threaded. While a regular forward helper thread might give you a benefit for large working sets, a backward helper thread can always give you some benefits. Okay, so much for this, for the prefetching. It's at least good to know, I mean,
Starting point is 00:10:36 it's something special. You probably won't frequently use this like doing your own prefetching, but it's something good to know that this exists, right? So you will see this in your if you do low-level experiments you will see that the prefetch is doing something for you and in some cases might help you or it will in most cases it will help you in some cases it might actually be detrimental for your performance or performance will go down because of the prefetching because The CPU basically tries to get data that you don't really want into your caches your performance. So performance will go down because of the prefetching, because the CPU basically tries to get data that you don't really want into your caches.
Starting point is 00:11:11 Okay, so I promised I'll show you some pipeline architectures, some current pipeline architectures, and these are just schematics, but what a current core more or less looks like, what a current core features. And so this is the Sunny Cove architecture. This is kind of the core architecture. So there's basically, if you look at modern Intel or whatever server CPUs, you have, of course, multiple cores. And then you have kind of this overall layout
Starting point is 00:11:47 for the CPU and then the individual cores have a certain layout. So there's kind of an architecture for each core that then will be somehow arranged for different types of CPUs and that will also be used across different generations of CPUs. And the Sunny Cove, for example, would be the Ice Lake core architecture. This means each of the individual cores features basically this kind of layout. And you can see that, well, there's basically, well, now you can see that it's written here. You have up to six micro-operations per cycle. So this is already parallelism in the CPU.
Starting point is 00:12:35 So it's basically, you can have more or less six pipelines in parallel. And this is done through 10 different execution ports. You cannot use all 10 of them in parallel, but there's 10 different ones that can basically be filled. There's different reservation stations where the data will be, or not the data, the instructions will basically be prepared and then go to the different pipelines.
Starting point is 00:13:05 And as I said, up to six in parallel. And then there's different kind of, we said there's different kind of units on the CPU, different kind of processing unit. So you have the integer processing units or the regular ALUs. You have the vectorized execution units, which are somewhere down here. So you have typically two vectorized units that you can run in parallel.
Starting point is 00:13:38 Then you have the load and store units. So here, for example, and you have the different kind of caches that basically, and all of this will be done in parallel, as I said, up to six operations in parallel. And then of course you have these multiple pipeline stages and you have the front end, which basically has all this, what we earlier said, does the branch prediction, caches the instructions and decodes the instructions as well. So this is what the Intel core would look like. This is, for example, again, for my laptop here,
Starting point is 00:14:18 this is like a M1 Firestorm core. So we said there's two different ones. We have the ICE something, something, I forgot, and the firestorm, so the efficient cores and the fast cores. And the fast cores look like this. So here you have up to eight instructions per cycle, so this gives you a bit more parallelism again. And you have quite a large reorder buffer where basically all the different instructions
Starting point is 00:14:52 can be placed and then, according to what's possible, can be reordered. So you have an out-of-order execution here. And then you have the same thing. You have six, seven integer ports and up to eight operations. We already set this eight micro operations per cycle. And again, you will have like multiple ALUs here. You have vectorized. Let me see this. Yeah.
Starting point is 00:15:22 Here you have your vectorized registers and vectorized execution units which do what we will see next basically in the next round, in the next part of the session lecture. Okay so far so this is basically what the individual cores look like and where our instructions then are actually executed. What I want you to take from this is that there is parallelism already in a single core. Even in a single thread, basically, you get parallelism just because of the architecture, because we can do multiple things at a time and we will do multiple things at a time in a single core. Okay, so we talked about how the instructions are executed,
Starting point is 00:16:13 and the pipelining, etc., and the pipeline hazards especially, and then some current architectures. And now we're going to switch back to our vectorization. And this is basically, well, let me go back here, right? So, if we're here in this, where are we? Here in this vectorization, this is basically where all of the stuff is executed that we're going to talk about today. Okay, so today I will only be able to cover the basics. Next time we will get into deeper detail about the programming. If you're really eager in starting the task, you can already
Starting point is 00:17:09 sneak a peek into the rest of the slides. There's everything in there, how it's implemented, etc. Not everything, not all the details, but basically the different ways of parallelizing. But we will also cover this next week. Today, I just want to give you some overview of what this actually means and the kind of thinking that we need. And first of all it's about parallelism. So we already talked about the pipeline parallelism and we also had this general parallelism inside a single core. Now we're going to go into data parallelism, also still in the single core. Well, pipeline parallelism is kind of one simple technique to leverage a hardware parallelism, and it basically always depends on the length of the pipeline.
Starting point is 00:18:03 So usually we cannot get a lot of parallelism through pipeline parallelism. We still want to use it because we can, like some things usually take more time. We want to break down our task into smaller chunks. And we have different units on our CPU, as we saw earlier, right? So for all of the different execution units on the CPU,
Starting point is 00:18:29 if we can only do one thing at a time, then we can only utilize one unit at a time. So then a lot of the die is basically idling, not doing anything. So it makes sense to somehow leverage this separately. And for this, we can leverage this pipeline parallelism. And the pipeline parallelism is nice because it gives us some parallelism, but it still keeps the sequential execution order,
Starting point is 00:19:00 at least at the front end. So the instruction stream can still be the same as in the program, but we're just chunking it or cutting it up into smaller pieces and executing different parts of these pieces in parallel. Well, we discussed the problems around the hazards, etc. in the previous part. And here, just because of the chip design and because we need kind of the parallelism
Starting point is 00:19:35 is only in the length of the pipeline, this gives us a clear limit on how much we can parallelize here. Because just getting more parallelism means ever longer pipelines and it basically takes more time than also to go through the pipelines and it takes more chip space as well. So there's other ways in basically doing parallelism on the hardware and of course you already know this, but we can basically just do different tasks in parallel. And this basically means we can have a different chip space for different things, but even thinking about multiple cores, we have multiple chip areas
Starting point is 00:20:24 that are identical and they can basically do either do the same thing or could can do different things in parallel. So being basically we basically send them different kind of instruction streams, we have different processing units, the same kind of processing units and by just giving them different kinds of instruction streams and different kinds of data, they can do things in parallel, different tasks in parallel. And this can be, I mean, we could think of this as different cores, but we can also think of this as different ALUs, so ALUs on the same core. So they can also use different registers, get different micro-operations, and will produce
Starting point is 00:21:11 different outputs. We call this multiple instructions, multiple data, so we use working on, or MIMD, we're working on different data, we're working with different instructions we get different output. On our CPU we also have specialized units or so-called vector units or SIMD units so single instruction multiple data units and these are basically processing units that can execute like a single instruction, the same kind of instruction on multiple data items in parallel. Because of that single instruction, multiple data, but because we're basically working on multiple data items, which we usually have in sort of an
Starting point is 00:22:07 array or a vector this is also why we call it vector processing or a vector unit and there are like complete processes that are just built on this idea so the the title image that I have, this is an old vector processor. That's basically like a huge system just based on this kind of programming model where you say, okay, I'm going to do the same kind of operation on many data items in parallel. So in a nutshell, we differentiate if we're talking about vector processing or SIMD, we're differentiating between vectorized execution or scalar execution. So scalar meaning one instruction, a single operation at a time on a single data item or a data item set. So an addition would be two data items, one result, for example. So that's a scalar addition.
Starting point is 00:23:08 A vectorized addition or vectorized execution would mean we have a vector, an input vector, and we're doing the same, or two input vectors, we're doing the same execution on each of the vector items in parallel. So the multiple data items add a single step. So going back and basically talking a bit more about parallelism. So there's different kind of parallelism that we can look at. And one way of looking at parallelism in applications
Starting point is 00:23:48 is basically differentiating between data level parallelism and task level parallelism. So data level parallelism means we have multiple data items and we're processing them at the same time. And I mean, in a database, this makes total sense, right? Because if we say we do a scan, for example, we have many data items and we can do all of them in parallel.
Starting point is 00:24:15 And often there's not even any dependency in between. So this can really nicely be parallelized. And we're doing the same instruction all over all of them. And task-level parallelism is different because there we're basically doing different things at the same time. So this would basically be in a database. We have different kind of database operators that we want to execute at the same time.
Starting point is 00:24:41 So thinking about some kind of query tree, we have multiple scans, for example, or different kind of operations that we need to do in parallel, or we want to do in parallel. If we do different operations, then we're looking at task-level parallelism. And in computer hardware, we can have four major ways of how to exploit them.
Starting point is 00:25:10 So we can have the instruction level parallelism. And this is basically what we looked at when we looked at instruction execution. So we can have this pipelining in the instructions. And we can have speculative execution in order to get some parallelism. Then we have the vector architecture, so SIMD instruction execution units and registers and say for example GPUs as well, where we have this data level parallelism by executing the same instruction over multiple
Starting point is 00:25:47 data items in parallel. This is also what the GPU basically does. The GPU works in lockstep, does the same operation in many threads in parallel on multiple or on different data, basically, or on, let's say, different variables, for example, different parameters. Then we have thread-level parallelism. So there we can have either task-level parallelism or data-level parallelism by using multiple threads.
Starting point is 00:26:20 So this means then basically using often, ideally, multiple cores. We can also have, of course, multiple threads on a single core, but then they will be scheduled. But like actual real parallelism, we'll have like multiple cores doing either working on the same kind of data or multiple data items, but with the same kind of instruction.
Starting point is 00:26:45 Or they can do completely different things, right? Say, for example, our helper thread, that's parallelism where we have task level parallelism because the helper thread executes a different program than our main thread. And then basically, we can also have like a request level parallelism. This is kind of bigger level where we have completely undecoupled tasks which are somehow specified by the programmer.
Starting point is 00:27:21 Okay, so there is also Flynn's taxonomy on different kinds of parallelism. So the classical execution, if you think about, I don't know, 30 years or more back, you have single instruction, single instruction stream, single data. So this is the standard sequential computer and this is also how we think if we're just programming single threaded usually. So then we can have some instructional level parallelism but we're not going to deal much about it or do much about it but it's just basically one instruction at a time on one data item at a time. Then we have this SIMD. This is what we're talking about today. We have single instruction stream, but multiple data items at a time. So we're working in a vectorized way, basically
Starting point is 00:28:22 on a complete array or multiple data items at a time with the same kind of instruction. And this is basically, we have vector processes for this, GPUs work like this, and then we have these special instruction sets on modern CPUs. So basically all modern CPUs, like laptop CPUs, server CPUs, have these kind of instructions. And so this is what we're going to talk about today mostly. Then there's also multiple instruction streams, single data streams. So there's MISD, whatever.
Starting point is 00:29:05 I never really pronounced this before, I think, or at least I'm not trying hard enough. So this is kind of obscure, but it's basically, you're working on a single data item or a single data stream, but you're doing different things on it. So this is more or less something that you would already have in task parallelism. So this means, well, we're just working on a single data set, but we're doing different things on it.
Starting point is 00:29:31 It usually will be implemented in the same way as MIMD, which means multiple instructions, multiple data streams. Think about multiple threads doing their own thing, working on different data. So the multiple instruction, single data stream, you could think about the helper thread being an instance of this, because we're working on the same data stream, but with different kind of instructions. But in the end, how it's implemented
Starting point is 00:30:06 is more also like a MIMD, multiple instructions, multiple data. And this is basically multi-core, multi-sockets, multiple servers, for example. So this is everything there. While up here in SISD or SIMD, we're still in a single CPU or a single core even. Okay, so SIMD instructions. So, in modern CPUs, we basically have a special class of instructions that allow the processor to process multiple values simultaneously. So, this is really done in parallel.
Starting point is 00:30:59 It's not only instructions, there's special hardware on the CPU to do this in parallel, say for example edition. And all CPUs from major vendors have these extensions, have kind of SIMD instructions. And Intel, for example, has MMX, SSE, and AVX. And SSE and AVX is basically what you're going to be using today. ARM has NEON and SVE, and Spark, for example, has VIS. So you can find different kinds of instruction sets everywhere. And that's actually also a bit of a problem, because this is not standardized in any way. You will find the naming schemes are different, etc.
Starting point is 00:31:44 So it's a bit tricky to navigate. But, most of it is done automatically. Or, if you don't use it yourself, a lot of it is automatically done by the compiler for you. So, we'll see this probably in the next lecture. So, how do these work? So, here's a simple example. Say for example we have two areas and we want to add them up in a third area. And this is going to be our example for most of the time today. And think about this as vectors basically, or areas or whatever.
Starting point is 00:32:19 So you have the area X and the area Y and our array Z is basically just like you add up the X and Y values individually. This is the same thing that will be done here. In a scalar addition, a single instruction, single data addition, so the way you would do this on the ALU would basically be, if you're just using one ALU, I must say, would basically be, you're starting with the first element, adding it up, writing it to the result area. You're doing the second, the third, etc. So, this is basically how the processor will consume this and execute this. Again, since we're having multiple ALUs on our single CPU, what actually will happen is that at least
Starting point is 00:33:14 the compiler will try, I mean, if it doesn't use SIMD instruction, the compiler will typically try to unroll this to some degree, or the CPU will already see, well, I have to do the same thing multiple times, will then execute multiple of these additions already in parallel through the multiple ALUs that it has. However, we can also say, well, let's use our SIMD registers, right? And let's use our SIMD unit on the CPU.
Starting point is 00:33:50 And rather than doing an individual addition, what we're doing, we're loading four of these integers in our 128-bit SIMD register. And then we can add them directly. So this will basically then be like one addition and the SIMD unit does the addition, the individual additions, all of them at a single time. So this will basically then will be done in only two steps, this execution. So this means basically a lot less processing steps, a lot less cycles that you need to do because like one SIMD addition could be done in, well of course,
Starting point is 00:34:35 pipeline length, right? We know multiple cycles, but basically in one cycle on average. coverage. Okay, so we still have time, so let's continue. So let's talk about how this basically evolved over time, the SIMD instructions. And so initially, these operations were basically built for multimedia execution. So, Intel basically started with the MMX stuff, but then they had the streaming SIMD extensions with 128-bit registers. So, this means you can have 128- bits and you can load your data in there and there's multiple different kind of instructions that you can run on there and typically you would load them with 32-bit scalars and
Starting point is 00:35:38 which so say for example int values and they then can be then you can do a four-way parallel execution on these registers. So, say for example, addition, other kind of multiplication, whatever you can think of, and there's different instructions for that. And so, basically in each operation, you can then, or in each cycle, you can then basically do one SIMD operation on each of these elements in parallel. So, this is what we just saw earlier. However, you always need the data in these SIMD registers, right? So, you cannot just load them anywhere. I mean, you always need to load your data into registers, of course, but then there's specific registers just for the SIMD
Starting point is 00:36:33 execution. And these are typically sort of 128-bit wide registers would be XMM0 to XMM15, for example. So this means 16 registers just for this addition. And today, you cannot only use them for these 32-bit scalars, but you can also use them for different other kind of operations. So rather than only going for 4 32-bit, you can also go for 16 8-bit values, or 8 16-bit, or 2 64-bit values. And so, if you have 2 64-bit, well, you have like a parallelism of 2. If you have 16 8-bit values, you'll have a parallelism of 16 already, with of course less
Starting point is 00:37:21 precision or smaller data. But you have these basic arithmetic operations, but also logical instructions. And each of them then, as I said, can be done in parallel across. And besides that, you also have comparison and shuffle instructions. This means you load everything in your registers and you can then compare all of them to a certain value. And this would be something you could, for example, use for a scan. If you're scanning data, you compare all of the data saying,
Starting point is 00:38:00 is this what I want or is this smaller, greater than what I want? So this would be a typical way how we use this in a database operation. And then you can also sort or shuffle them in a way, or not sort, but shuffle them. You can basically say, based on a bit mask, for example, I only want these elements, and then I want them to be moved towards the front of my register. So I again have a dense representation of the data that I'm actually interested in rather than having like a sparse area with lots of holes of the data that I don't want to see. And of course you need to be able to
Starting point is 00:38:43 transform between the SIMD registers and the regular registers. So giving the data in and out. And you can also, you don't necessarily, or there's some way to directly move the data from the SIMD registers to the memory, rather than going through the caches. So say if you're doing this kind of scan-like thing, you might not want to keep the data in the caches all the time. So, there's these 128-bit registers, but then there's even wider. So, there's also two 56-bits, so that's called AVX2. So first in 2009, we had the SSE, the streaming extensions. Then we had AVX in 2014, AVX2.
Starting point is 00:39:36 And now there's even AVX512, which has up to 512-bit registers. And well, there's even more options then, again. So, we have, we can, so there's 32 of these registers. So, set MM0 to set MM31. And you will later also, if we go there, you will see this even in the assembly. You can see that these registers are actually used for loading and storing the data.
Starting point is 00:40:13 And then since they're 512, you can even do like up to 64 8-bit integers or 32 16-bit and so on and so forth. So hypothetically, here you can get up to 64-way parallelism in comparison to a scalar execution. So if you're doing like 8-bit addition, say for example, on the regular ALU, then in contrast to that on the AVX512 registers, you could do 64 in parallel.
Starting point is 00:40:52 However, if you're using AVX512, at least on the current CPUs, the CPU will clock down. So basically the instructions are that the unit is so heavy, so there is so much circuitry around this, that the CPU needs to slow down a bit to make sure that it's processed in time. I'm not sure about the newest generation, so that might change, but we'll also see that's why the compiler doesn't really like those instructions. The compiler basically says, I'm assuming this is slow, so it rarely uses the AVX 512. Yes? In the recent HSE videos, there's the performance and the efficiency course.
Starting point is 00:41:36 Do you know if the efficiency course also support these instructions? So the question is regarding performance efficiency cores. I don't know, honestly. So one would have to check the layout of the two cores. So I know it's on all the server cores. I don't know regarding the larger. I would assume there is something on the efficiency cores, but I'm assuming not like the very
Starting point is 00:42:06 large 512. So I'm not even sure if the 512 are on the desktop CPUs at all. So I think these are mostly on the server CPUs. I guess Marcel is checking. If you find something out, let me know. Okay. So say, for example, on the Skylake CPUs, there's two AVX 512 units that you can work with in parallel, actually, again. And the other thing that we see, we get more instructions. So we get more functionality and there's a link to some cheat sheets to look up what's doing what.
Starting point is 00:42:50 But there's a lot of different things that you can do with these instructions or with these units. But it also gets a lot more complicated. Because basically for each of these you get like separate instructions right so for each complex basically this explodes in combinations so you individually have to specify different instructions for each individual combination so say for the integers for floats for like you have to specify which width, and there's a separate instruction for each of those. So, in SIMD programming, you basically, a lot of, like, in SIMD programming, you explicitly, there's no data hazards or something like this. Because you explicitly do this vectorized execution. So usually if you have the scalar execution,
Starting point is 00:44:06 a lot of the control logic basically deals with making sure that the instructions are properly scheduled, that the instructions get the right registers, and et cetera, and all the right registers, etc. And all the dependencies are tracked properly. And in the SIMD instructions, you have to deal with this yourself. So basically, there is a lot less logic around this. Then you basically have to say,
Starting point is 00:44:40 well, these are my registers that I'm using. Or you explicitly say you're using the vectorized registers, and in there you don't have any branching or anything within the instructions. So basically in the registers or in the vectorized instruction, there is no conditional branching or something. You can just do like the predicated, like the way we did the predicated execution, right? So these examples. This is the only way how we can do any kind of comparisons. We will always execute everything. We cannot say in a vector, well, with this part, I'm going to deal this way. With that
Starting point is 00:45:24 part, I'm going to deal this way, with that part I'm going to deal that way, because we cannot basically split up the vector in different parts. It will always operate on the same vector. And this means we can only check for data hazards or stuff like that between the vectors. And in theory, we can have like an n-fold parallelism, meaning, say, if we have this large, this 512-bit registers, we could have a 64-fold parallelism in theory, but in practice, we're not going to quite achieve it. But at least we get close, we can get within a mortar of magnitude, let's say. Okay, so with this I would say we'll do a quick break here and then look into a bit
Starting point is 00:46:20 more detail regarding the SIMD programming in, let's say, five minutes. Did you already find out? So I found out that M1 also supports 128-bit SIMD instructions in AR. I'm not sure about where it's located on the instruction. So, yeah, the M1 has. I know the M1 has instructions. But if the Intel efficiency cores have it, the M1 has. I know the M1 has instructions. But if the Intel efficiency cores have it, I don't know.
Starting point is 00:46:50 I can check offline, basically. OK, so five minute break. OK, so I could just quickly check. So I couldn't find any detailed info about AVX etc. on the course, but it seems that at least on the efficiency course AVX 512 doesn't exist. But I don't know about the smaller one. Not sure. That's all I could find quickly.
Starting point is 00:47:21 Okay, so SIMD programming. So there's different ways how we can actually use the SIMD units on your CPU. And I mean there's two basic ways, like I said,, the general ways and then there are multiple ways how you can do this explicitly. On the one hand, the compiler will try to use this for you. If you are using a modern compiler, GCC, Clang, etc., they will try to utilize the SIMD instructions on their CPU. Sometimes this requires some hints and some guarantees by the programmer for this to work, because the compiler always needs to basically take all options into account. Anything that can happen, the compiler needs to be sure that it covers that. Anything, let's say legal, that can happen.
Starting point is 00:48:32 And then you need to say, well, I'm pretty sure this is the way the data will basically be laid out. For example, in memory, not pretty sure, I am certain. I guarantee that this will happen. And then the compiler can be more flexible in the optimizations. That's one way. And the other way is explicit vectorization.
Starting point is 00:48:55 So the automatic vectorization works great for simple stuff, so simple loops, et cetera, but doesn't work for everything. So, for more complex stuff, and say, for example, the task that you need to perform, it won't work. So, you can get some vectorization, but you won't get enough vectorization here. And then we need to explicitly use vectorization. And, well, the automatic way is called autovectorization. And this is basically as soon as we write simple code, like just this addition that I showed you earlier, this is something that the compiler can see,
Starting point is 00:49:40 and the compiler can directly basically optimize this. And LLVM, for example, this is one of the parts in Clang. So LLVM has certain optimizations or certain patterns that it can detect and then automatically optimize. I gave you the link here. Let me see if this works quickly.
Starting point is 00:50:09 So if, so here, for example, you can see simple patterns is a bit small. There's basically different kind of examples where the compiler will directly see some kind of reductions, inductions, conversions, etc. that the compiler will check for. The compiler is basically a long case statement, and if it finds one of these instructions and there's a list that scatter, gather, etc. If it finds one of those, it will then automatically try to vectorize this. The cool thing about this is it's platform independent. You still write your scalar code and the compiler does it for you.
Starting point is 00:51:06 The compiler will utilize the SIMD units and execute this in parallel. And the compiler does all the tricky edge cases for you. And we will see why this actually is a good thing. So here's a simple example. so here this is basically the same loop that we had earlier with x y and z and we're adding them up so we have a loop but here we have these individual loads and stores separate right so rather than doing this in a single line, so usually you would just say x i plus y i equals z i. So that would be the way you typically write this output here. Rather than doing this, we're explicitly writing the loads in order to also have these loads.
Starting point is 00:51:59 We can also transform this into a vectorized version. So this is basically, it's of course still scalar code here. But we're working on vectors here. So this is something that if the compiler sees this, should directly be able to translate this into these SSE instructions, right? Because we're doing a simple addition, or we're reloading four values into these areas, and then we're adding all the values,
Starting point is 00:52:33 and we're storing them again. So this is in scalar code, a vectorized addition. Now the question is, can the compiler translate this into this? Could the compiler just replace our scalar code that we write like this, or even like in the shortened form, translate it directly to this? I guess only if the max is divisible by 4. Exactly.
Starting point is 00:53:07 So that's one of the things. So, of course, we need to check all the boundaries, right? So is this max really, can this be divided by 4? Because otherwise, if it's not a multiple of 4, we need sort of an epilogue. We need basically to somehow deal with the remaining iterations. Another thing is x and y alias, so they overlap. Yes, exactly. So that's another problem. It's pointers, right? So this means we could actually look at the same memory addresses. And so if we have pointer aliasing, so then in the loop, we somehow might change stuff in a future iteration. So even like within the same thing, and then we would actually need to execute this in a scalar fashion.
Starting point is 00:53:58 If there's an aliasing, so if the pointers point to the same memory addresses, then we need, there might be something, I mean, of course, you need to program this in a weird way for this to happen, but you could program it in a weird way, so that you cannot parallelize it in this way. So in order to actually, if you want to do this by hand, you would have to write something like this. So, you read to write a function, aliases, that basically checks for you if the pointer is alias. So, if there's an overlap in the pointers, meaning that either X or Y has an overlap with Z. So, that where you're writing to is something
Starting point is 00:54:47 what you're reading from, basically. That's essentially the problem. Then you run through your loop all the way until you've basically consumed a multiple of 4, but of course not too much, because max might not be divisible by 4, so then we need to do this epilogue, where we basically handle the last couple of items that do not fit, or that is not divisible by 4, right, this remainder, basically. So that would be a way. And then, of course, we see we have this nice jump here
Starting point is 00:55:30 in there, which you go to, which you should never use in practice. That's the direct way how one would implement this. But even this aliases is already tricky to implement. So that's something where you have to do some checks, et cetera, to actually implement this properly. So this is not something that the language gives you. And all these offsets are also easy to get wrong, right?
Starting point is 00:55:58 So this still might break if say, for example, max is basically like integer max, something like that. Then i plus 3 could overflow, for example, stuff like that. And max minus 3 could underflow, for example. So that's another. So if we write this in another way, so you can basically always get into troubles
Starting point is 00:56:29 and the compiler will check all of this for you, right? So the compiler will always make sure that this works properly. So if, well, basically manual is easy to get wrong and scalar code is much more maintainable. If, well, basically manual is easy to get wrong and scalar code is much more maintainable. So in this way, we can basically just use the compiler. And this is basically the same function again, right?
Starting point is 00:56:57 You can see here, it's basically our addition now shortened with all of this removed, all of this explicit loading and storing. And then there's a nice website. So this is something that I actually really like. And my students, at least some of them, use a lot. This is basically, this is called Godbolt or the Compiler Explorer. And it's basically a website where you can directly look at the assembly that the compiler generates.
Starting point is 00:57:31 So you give it your function. So here we see this is basically the same function. I have a lot of links in here, which you can directly access, and you can look at the different kind of functions. And then it gives you the output that the compiler produces so then we can basically see this quite a bit here but the hot loop actually would be this here so this should be the hot loop here which is basically you can see that we have our addition. We have first removed the data into the registers and here you can see these SIMD registers,
Starting point is 00:58:15 so the SSE registers, then the parallel SIMD addition, and then we move the data out. And of course, we have a lot of other tests as well. Then we have the scalar execution of the same functions, or the same with different kind of loop condition. And somewhere here, we would have kind of a scalar execution of the same thing, which is basically used for the epilogue. And if you can also look here at somewhere
Starting point is 00:58:51 that the control flow graph, this would basically be, oops, exactly what I'm showing you. I'm struggling with the touchpad, I'm sorry. Okay. You can check it out yourself. Okay, so I have the link here. This is basically what comes out.
Starting point is 00:59:14 It should be the same thing. You can try it with different kinds of compilers and different versions. And you can also use all of the compiler flags. So you can directly see what will be the output. And then, of course, you need to read assembly and figure out, but if you see somewhere, okay, there's like SIMD instructions, then you know, okay, this worked, right? So, there's some SIMD instructions here. So, in this case, this would be the hot loop. So, if you look at the hot loop, and this is slightly different executed, we see it should be more or less the same.
Starting point is 00:59:49 We see here we have kind of this move again. So we're moving data like double quad words unaligned. So basically 128 bits. So double word would be 64, and quad would be 64, and double quad would be 128. So we're moving 128 bits unaligned because we're not sure if the original data is actually properly aligned.
Starting point is 01:00:26 We're moving it into the register and we need to do this for both registers. Then let me get pointer. So then we're doing the addition here and then we're moving it out. Then, of course, we're incrementing probably our loop counter and jumping out either to the next loop or to somewhere else, basically to the next check. Also, the compiler, if you are using the compiler, the compiler can emit you some information about So rpass or rpass analysis and foptinfo for the GCC. And you can see why or why it didn't vectorize a certain,
Starting point is 01:01:36 or why it didn't vectorize a certain loop. So say, for example, if we have the same loop again here, but we have only eight steps, so we're doing only eight computations, we're only adding up eight elements, then Clang will not vectorize this. And we can actually check out why. How do I get rid of this now? I can do the same thing again.
Starting point is 01:02:23 And if we go to this now, so then basically we see the way this is translated here. Let me maybe make it a bit smaller. And you can see this is basically, I mean, it's also interesting because it's completely unrolled here. So you can basically see the compiler, rather than using the SIMD registers and SIMD unit, the compiler just completely unrolled the loop and just did all of the additions one after the other. So moving in the data, adding out, moving it out, et cetera. And the compiler also will tell us why. So if we go here, it basically said, well, this might be a clobbered store, or this might be clobbered, and this is basically this aliasing problem that we had earlier.
Starting point is 01:03:16 So the data right here now, the compiler is not sure that there is not an aliasing problem in here. So set might conflict with x or y. And then there might be a lot of checks necessary. It will also check if we do the same thing with a larger loop. It will still need to do these checks, but in a larger loop, this might actually still make sense. But if we only have these eight individual, or two SIMD instructions, or eight individual operations, then all these checks probably won't be beneficial.
Starting point is 01:04:01 So the compiler says, oh, it doesn't make sense. I'm not using it this time. Okay, so basically this is what it says. It's basically missed. So it missed this operation. So the simple operation is not used because of this clobbering. This is compiler speech, basically. X or Y might be overwritten by the previous store.
Starting point is 01:04:31 So we have this pointer aliasing problem. And the cost model tells the compiler there's only eight iterations. Well, aliasing might not be. Because of the checking that the compiler needs to do at runtime, the SIMD registers might not be beneficial. So what we can do if the compiler, often the compiler is actually right about this. So you think you might be smarter than the compiler, but often you're not. The cost model of the compiler is not
Starting point is 01:05:05 so bad. But often the compiler cannot make certain assumptions that you actually can make. So you might actually know that there's no clobbering, there's no aliasing, no pointer aliasing happening. And then you can tell the compiler. And the compiler will then basically know, oh, OK, no problem here. I don't have to deal with all these runtime checks. So the costs of those runtime checks will basically not be necessary. I can do these optimizations.
Starting point is 01:05:44 And so programmers can actually use certain compiler hints for these kind of optimizations or for guaranteeing certain things. And then the compiler can use these hints for optimization. However, the big issue here is you really have to know that these guarantees that you're giving are holding, right? So, if you're saying there's no aliasing and there is aliasing, the compiler doesn't care, right? So, then the compiler will just write code that will produce wrong results, like random stuff, basically. So it's really up to the developer that this holds. And one way of doing this compiler hints is basically restrict.
Starting point is 01:06:34 So the keyword restrict, it's not standardized, but most compilers support it. It basically guarantees that pointers do not alias to any other pointer. So with this, basically, I can tell the compiler that my set array is not aliasing with any other array. And then the compiler knows, well, no aliasing. I don't have to test for aliasing. Then basically, maybe the compiler will actually optimize. This is something that we can check again.
Starting point is 01:07:08 We have basically the same instructions. I'm just going to use this again. What I can basically say is that here, I'm saying, I just have to see that I can read this. I can basically say this is restrict. Compilation failed. Obviously, I wrote mystery. Sorry, yeah, exactly.
Starting point is 01:07:42 This is so far away. I need glasses. Okay, so if we put it here, all of a sudden the compiler knows no aliasing happening here. So it can just basically use the SIMD instructions here. Again, it will do an unrolling rather than doing like two loop steps. We'll just do this in one step.
Starting point is 01:08:12 And then this code basically also could be, of course, be paralyzed by the CPU again, because we have two SIMD units, both of these instructions could be, or we could reorder them in order to execute basically both additions in parallel. Okay, so I'm fast today, so we can also touch on the explicit vectorization still today. Rather than doing this implicit and auto vectorization, so there's also other different ways we can give hints.
Starting point is 01:08:57 So there's something in the slides that I uploaded. But this is a bit more, let's say, involved and even more easy to get wrong. So I don't really want to present this. And for the task that you have to solve, you will actually need some explicit vectorization. Because not everything can be solved with the autovectorization. And, well, there is CPU intrinsics, basically, that help you to explicitly tell the CPU, also the compiler, the compiler will just basically
Starting point is 01:09:41 pass these through these instructions through and directly tell the CPU how to deal with the SIMD registers and the SIMD instructions. These intrinsics are just wrapped up assembly instructions for the compiler. So the compiler more or less just passes them through, at least the ones that are CPU-specific. And we can basically use direct compiler or assembly instructions by using the right functions with the right parameters.
Starting point is 01:10:23 So typically, they have a certain naming scheme, being like a vector size, then the actual intrinsics operation, then with the suffix again. So, the vector size would be in Intel speech, for example, mm for 128-bit, mm256, and MM for multimedia. So as I said, this was initially designed for multimedia applications. So that's why it's MM256 for 256-bit and 512 for 512-bit. And then you have the individual operations,
Starting point is 01:11:07 so add, sub, mult, etc. But not all of them are as easy to understand as these. And then you have the suffix for the type. So PS for float, PD for double, EP32 for signed int, EPU for unsigned 16-bit int, etc. And this means, as I said, this is basically adding up, right? So for each unit, you have separate instructions. For each data type and size of the data type,
Starting point is 01:11:37 you have separate instructions. And the wider the instruction, basically the wider the registers, the more intrinsics you basically get. And so this is also a bit of this problem why you constantly need to look up these names, because there's so many of them, basically. At a certain point, you get like, I'm not there at all, but you get used to some of these, but still a lot of them are not super intuitive. So for example, explicit instructions for SSE,
Starting point is 01:12:14 so the 128-bit intrinsics, then the registers, so underscore, underscore, M128I would be a register for integer types. Then MM load USI128 would be an unaligned load U for unaligned load of 128 bit from memory. Which is kind of this assembly instruction that we saw earlier. So we had this, if we go back to the code here, this is basically the same instruction. So this is basically, we're moving a double quad unaligned from memory. Then we have a vector addition, for example, which would be this packed addition. So vector at four signed 32-bit integers, et cetera.
Starting point is 01:13:21 So this is basically how these, if you wanna explicitly write those, so you're not gonna write these assembly instructions, but you're gonna write functions that directly map to these assembly instructions. Yes? So why, like, we usually prefer the unaligned instructions because the integers
Starting point is 01:13:44 are only like format aligned, but we did 28 alignments. Exactly. And is it worth it to maybe like get a, not epilogue, but like a? Prologue. Something that would basically pre-align it. I don't know.
Starting point is 01:14:02 You could try. I mean, if you can guarantee that it's aligned, then again, you don't need to unalign. The aligned execution will be a bit faster. Because the CPU doesn't have to do the alignment again. So if you can guarantee it across all executions, then it makes sense. Then you can do the aligned.
Starting point is 01:14:32 You need kind of checks, of course, but then you can do it. So say, for example, here we have, again, kind of the same loop which we had before. So we have this kind of our three buffers, so X, Y, and Z. And we're explicitly like we need these registers basically to store our data. So we're adding up X, Y, and Z. Let me find my pointer here. It's the same code, basically. And we're loading the data in these SIMD registers. So this is basically the explicit load into the SIMD register. So our 128-bit integer registers, and then we do our loop again.
Starting point is 01:15:31 Now, since we're doing four integers at a time, we divide by four. Again, here we're assuming that our max can be divided by four, that we don't have an underflow or something like that by doing the division, etc. So this is already with some strong assumptions. It basically creates the same code that we would have created if we run the other code. So here, this is basically more or less
Starting point is 01:16:06 would look similar to this. So it's within the loop, of course. So we can also look at the code. So we're loading in the registers, we're invoking the intrinsics, and we're writing, we store the result into memory. So that also, of course, needs to be done. And I can actually show you the code briefly.
Starting point is 01:16:30 And this is, of course, a bit shorter than we would expect because we're not doing the checks with the epilogue, etc. Because we're directly telling the CPU that this basically can be divided. And the actual execution is basically in here. So here we're doing the addition, we're moving the data into the registers, we're doing the addition and we're writing it out again. And then we have some loop variables, which basically is this kind of stuff here. And here we also have a link to the Intel Intrinsics Guide,
Starting point is 01:17:18 which basically gives you an overview of all the intrinsics that are out there. And there is also a nice cheat sheet from TU Munich, which gives you a hint based on the naming scheme. It's quite neat to find quickly what are the different operations, also to get a good overview quickly. Okay, so there, but the explicit vectorization, and I mean, this is somewhat working against the task actually that you have to perform because there you will need to do best performance,
Starting point is 01:17:55 but in practice it's actually annoying because all of the instructions are different, right? So for the different kind of vector sizes, so different registers, you have different instructions. For different architectures, you have different instructions. So if you want to run this, say, for example, on the M1, then you need to use completely different instructions. So in M1 or in ARM, the instructions look differently. So here again, we have the comparison.
Starting point is 01:18:29 So this would be our addition in Intel, for example, where we directly load in the registers here, for example. Here, if we do the same addition in ARM, then the addition has a completely different naming scheme. And the registers are called differently, et cetera. I mean, maybe even a bit more intuitive, but just different. So you need to rewrite the code.
Starting point is 01:18:56 If you want to have code that can be run on multiple platforms, you will need to do a lot of case distinctions depending on the platform if you want to use these intrinsics. So for each platform, you will basically need to write different code for this to work.
Starting point is 01:19:20 And the intrinsics are actually hard to read. So if you, I mean, we've put some up here. So it's just some random stuff. So shift, write, logic, integer, whatever. So all kinds of stuff. But it's not something that you would easily memorize unless you write this every day, I guess. So explicitly writing this up easily memorize unless you write this every day, I guess.
Starting point is 01:19:45 So explicitly writing this up actually reduces maintenance or maintainability quite a lot. And we've seen this in projects actually as well. But before continuing with more specific different kind of instructions,, etc., I would say let's stop here. And are there questions so far? Yes? You mentioned that database systems can utilize some deprocessing to see the computations. Yes. I can imagine that, let's say, for a count or some operation,
Starting point is 01:20:31 you can use a lot of such operations. But do you know of some more sophisticated techniques? Yes. So this is a good question. So how can we use this in the database properly rather than just doing addition, like simple aggregations? And say, for example, a scan is something where you can easily use it. Also, an example that I have later in the slide is a compressed scan. So if you basically have your data stored densely, so in the example, in nine bits, for example, densely packed.
Starting point is 01:21:10 And you're loading. You want them to work with the data. Again, you need to somehow decompress the data. So that's something that you can do. Also hashing, for example, you can speed up in the SIMD instructions. Hashing is something that you need all the time in a database for all kinds of operations. So we'll see some examples in the next lecture.
Starting point is 01:21:33 More questions? No more questions? Well, then, more or less perfectly in time. Thank you very much.
