Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Vectorized Execution
Episode Date: April 30, 2024...
Transcript
Okay, so despite the nice weather, we have a couple of people here.
Great, thanks for coming.
I'd rather be outside in the sun right now, to be honest.
But I also enjoyed this.
Okay, let's get started.
So we're going to talk about vectorized execution later today.
So first we're going to go back to where we left off last time, which is prefetching.
But before that, one quick announcement: we'll have the lecture series tonight again,
with Ralf Herbrich on AI and sustainability. So again, it could be interesting
to hear about a project there. It's going to be a bit of an overview,
not so much mini projects,
but one or two dedicated projects.
And today we're going to dive into vectorized execution.
Tomorrow there is no class.
So for those of you who have not heard this,
there's a holiday tomorrow, so you can actually enjoy the sun.
But then next week, we're also not going to start with a lecture, but we're going to start
with a task.
So that's also going to be interesting.
You will do some SIMD programming, so we'll touch on this today, finish up next week.
So then you should know enough to
actually solve
the task. It's not super hard. It's
just a bit tricky to find the commands, etc.
But you will
manage. And then we'll
go on. So I'm actually
not going to be around on the 14th
and 15th. So then Martin will tell
you all the funny and nice things about
execution models and data structures, which is also a topic that I like a lot, but because I like
it a lot, of course I share it.
Okay, so with this, let's go back to where we left off last time, which is prefetching.
So you remember we talked about instruction execution, so how instructions are broken down into smaller instructions
and then executed on the core.
And we've heard about the functional units, et cetera.
And we've heard about special operations for loading data, and this is basically where a lot
of time goes, right? So you've heard before that we have this memory bottleneck,
which in data processing is something that we have to be aware of. We cannot just do lots
and lots of code execution, we cannot jump around in the code a lot, because that will stall our instruction execution just
because we don't have the code ready.
And the same thing is true for the data.
So if we have lots of data that we need to read,
then ideally somehow the data should
be already available in the caches.
If we have to go to memory, the CPU
will have to wait many cycles, so 100, 200 cycles
or something, just idling around doing nothing.
And in order to somehow mitigate this,
there's a concept called prefetching.
And there are two ways this can be done:
the CPU does it for us itself,
or we can do it ourselves.
So this is something you can try out and we can see.
So there's a hardware prefetcher.
And the idea here is that basically, I mean,
we already know, right?
So the CPU is loading cache lines, not individual bytes, and if we can use everything
that's within a cache line, then the CPU will have
something to do for a couple of cycles, which is good,
because we don't have to load every individual byte
from memory.
At the same time, if we already know, or the CPU can guess,
that we're going to use the next
cache line as well, then it can prefetch it. That's the so-called adjacent
cache line prefetcher, or also the stride prefetcher. So the
CPU can prefetch the next cache line, guessing that this is a cache line
that we will also be using: if we've used the first cache line, just read through it, and there's a pattern,
then let's take the next one as well. The same is true if we hop through an array to some degree;
then there might be a stride pattern, right? So we use every other cache line, for example.
So this is also something that the CPU can do by itself.
But often, the prefetcher does not prefetch
across page boundaries.
If we're in virtual memory, the next page
might be on disk or something,
so then there's additional stuff to do,
and this might be costly.
So let's keep it within a page, basically.
The other thing we can do is we can also prefetch ourselves
or at least instruct the CPU to do so.
We can basically use explicit operations or instructions
that tell the CPU it might make sense to do some prefetching here.
So there are so-called built-ins.
These are basically compiler hints
or instructions that will hopefully lead to
some desired behavior, and
nothing would actually break if it doesn't happen.
So this is basically just an optimization that we're doing.
Here, for example, this built-in
prefetch basically takes a memory address that we want to prefetch, and
then whether it's a read or a write, because that makes a difference in terms of
cache coherency, right, how we are locking this, and then
the access pattern.
So basically there is temporal locality or there is no temporal locality,
meaning we're just going to read or access this once,
or we want to access it multiple times.
So if we're just doing a scan through a lot of data, then we'll only access this once,
meaning once we've used it, the CPU can replace it in the cache.
It's not going to be used again.
If we indicate that this will definitely be used multiple times,
then the CPU will try to keep it in the cache,
and we can make use of that multiple times.
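To make this concrete, here is a minimal sketch in C++ of what such a software prefetch can look like with GCC's and Clang's __builtin_prefetch. The scan function, the data layout, and the lookahead distance of 16 elements are made up for illustration; the right distance depends on the machine and on how much work is done per element.

```cpp
#include <cstddef>
#include <cstdint>

// Sum over a large array while prefetching a fixed distance ahead.
uint64_t sum_with_prefetch(const uint64_t* data, std::size_t n) {
    constexpr std::size_t kDistance = 16;  // elements to run ahead; tune per machine
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kDistance < n) {
            // 2nd argument: 0 = prefetch for read, 1 = prefetch for write.
            // 3rd argument: temporal locality, 0 = none (streaming scan), 3 = high.
            __builtin_prefetch(&data[i + kDistance], /*rw=*/0, /*locality=*/0);
        }
        sum += data[i];
    }
    return sum;
}
```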
And there's yet another option.
These approaches both basically go to the hardware prefetcher:
we're telling the hardware prefetcher to do some prefetching here, please.
But we can also mimic the same behavior
by building something called a helper thread.
So the idea is, I mean, what does a prefetcher do?
It just gets some data from memory into our caches.
And so we can basically instrument or write some code
that does the same thing just by basically accessing data.
We can ensure that the data is in cache at some point.
So if we know what our code will access next,
we can basically have a separate thread, of course not necessarily on the same core,
but in the same socket.
If we have it on a different socket,
that might actually be very bad.
But if we have it on the same socket,
or the same NUMA region,
this means we have the data in the caches,
and we can basically create a thread that
accesses the data and gets the data into the right caches,
at least the shared lower-level caches, not the per-core caches if it's a separate core.
So the idea is we create basically a second thread that performs the same operations as
the first thread, but only in terms of data accesses.
So we're basically doing the same kind of data accesses
without the computation, so this thread will be faster.
It will run before the actual thread that does the work,
and it will warm our caches.
So the data is already in a cache,
which means we have faster access to this.
So it needs to somehow work ahead.
So we have our main thread, let me see if this works, right?
So we have the main thread that works on the caches, and we have the helper thread that basically says:
okay, this is the data that we will access next,
so I'm just going to access it and we'll have it in the caches.
So this means the data is already in some level of cache, at least in the shared last-level cache if this is on another core.
And if they're on the same core, then this might even
work through scheduling.
So we're definitely saving these main memory accesses,
and the main thread then
works with the data already there.
The main thread and the helper thread can also share a data structure,
a so-called work-ahead set, which has the pointers that the helper thread has to
prefetch. This is very much similar to
using the built-in from before: there I can also say, well, these are
the addresses that I want to access next, so I'm using this
built-in prefetch to warm the caches in the same way.
The difference is that here, the helper thread will definitely access this.
So this will definitely be executed.
Now, this can also lead to a problem.
Because if they're both accessing
the same cache line at the same time,
then we have a conflict.
So all of a sudden, basically,
they're conflicting on the same cache line.
So there basically might be cache flushing.
So the individual cache line will
have to be reloaded because we want to have in-order processing
on the CPU.
So we have basically what's called a memory order machine
clear event, which completely flushes the pipeline
and will be super slow.
So this basically all of a sudden
means our program gets a lot slower than before,
because all of a sudden we have these conflicts on the cache
line. Or if the helper thread spins on a lock to wait for new data,
then the helper thread is basically waiting.
We have a lot of these memory order machine clear events,
so basically we're reducing performance.
So an idea is, rather than working in lockstep,
we can do this backwards.
Then we have these conflicts much less often,
because if we have a larger working set,
the helper thread works its way through the working set from the other end,
and only at the point where the two meet might we have such an event.
And this will hopefully be much less frequent.
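As a rough illustration of this idea, here is a sketch of a helper thread that walks a shared work-ahead set of pointers backwards and only touches the data, while the main thread does the actual computation. The Row struct, the placeholder computation, and the missing core pinning are all simplifications I introduced; a real implementation would pin both threads to the same socket or NUMA region and tune the work-ahead distance carefully.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Row {
    uint64_t key;
    uint64_t payload[7];  // roughly one cache line per row
};

uint64_t process(const std::vector<const Row*>& work_ahead_set) {
    std::atomic<bool> done{false};

    // Helper thread: the same data accesses as the main thread, but no
    // computation, so it runs ahead and pulls the rows into the shared caches.
    std::thread helper([&] {
        volatile uint64_t sink = 0;  // keep the loads from being optimized away
        // Walk the working set backwards so the two threads only meet once
        // instead of repeatedly conflicting on the same cache lines.
        for (auto it = work_ahead_set.rbegin();
             it != work_ahead_set.rend() && !done.load(std::memory_order_relaxed);
             ++it) {
            sink += (*it)->key;
        }
    });

    // Main thread: the actual work, hopefully hitting warm caches.
    uint64_t result = 0;
    for (const Row* row : work_ahead_set) {
        for (uint64_t v : row->payload) {
            result += row->key * v;  // placeholder computation
        }
    }
    done.store(true, std::memory_order_relaxed);
    helper.join();
    return result;
}
```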
So here, out of a paper, you can see an example of the performance that you might
see, I think it was some kind of scanning workload.
So what we see: we have a single-threaded execution.
So this would be the blue line, right?
Oops, this is hard to steer.
Let me try it this way.
This is the single-threaded execution.
Then if we have the forward helper thread,
that's basically okay for a small workahead size.
If it gets larger, it might actually be a problem.
We might get into conflicts, and we have these memory order machine clear events.
We might have a single forward helper thread with a spin loop, so that's even worse.
So here we get more basically clearings and more conflicts.
And if we have the backwards helper thread,
we have actually a bit of a performance boost.
It's not a lot, but we get 10, 15% performance improvement
through this helper thread, through this prefetching, basically.
This is where we're saving by not having to access the memory
and instead just getting the data from the caches.
Okay? So, again, similar to using the prefetcher: if there's more to do in our main thread,
more execution, then the performance improvement might even be higher.
At a certain point, if we're just scanning through the data,
we'll always be bottlenecked by the memory.
So this will always be a bottleneck,
but we can take out some of the additional overhead of synchronously accessing the memory.
So this is basically where we get kind of a pipeline improvement here.
Okay so much for prefetching.
Not that much. It's just a good thing to know that this is happening.
This might also bite you if you're not using it properly, right?
As we've seen. And with this,
let's finally, for this part, look at current pipeline architecture.
So just so you have an idea of what a core typically
looks like these days.
So this is Sunny Cove.
So Ice Lake, it's not the latest.
So I don't remember what's the current core architecture.
Something, something, Cove.
But right now, we would be at the Sapphire Rapids CPU, so
this is two or three generations old. And this is the per-core
architecture. This is the way the chips are
designed: you basically have a chip architecture, the overall layout, and
then for each core you have a core architecture. Depending on how
the chips are produced and what the design strategies are, the core
architectures might be used for multiple generations of chips, and then there are
advancements in how they are interconnected, how the caches are
laid out, et cetera.
But the individual cores will typically stay the same for a couple of generations.
And this is one of these core architectures.
And I mean, the changes are not significant.
So what it looks like: you have this front end, which is basically your branch predictor, the micro-operation cache, the translation lookaside buffers
for the internal caches, the instruction cache, and the decoders.
And then you have these reservation stations.
So you have multiple different functional units on there,
and you also have multiple functional units
of the same type.
So here we can see there's basically integer units
and vector units.
And then we have somewhere we should
have floating point units.
I don't see it right now.
And we have load and store units and some kind of storage.
And the ALUs are basically the arithmetic logical units.
The vectorized computations are what we'll look at next.
And say a basic computation, basic addition, could go to any of the ALUs.
So there's basically four different ports where an individual addition can go to.
And multiple of these ports can work at the same time.
So this means we can actually also execute multiple additions at the same time.
So within a single core we can also do different kinds
of things: we could do a logical operation while we're doing some kind of addition,
and we can start a load or a store while doing an arithmetic operation at the same
time. So, as we're saying, there are six micro-operations at the same time.
We have eight, or here even ten, ports.
So there's a bit of flexibility;
not everything can work at the same time.
And we have a micro-operation queue up there,
basically in the front end, where
we can have up to 70 micro operations queued
which can be reordered to some degree.
So depending on whether there are data dependencies in between: say, for example, we
have a load, and the load will take a bit of time, then we can do some addition
that doesn't depend on the load before doing the dependent operation.
Okay, so this is one example.
This is another example.
So this is what I have here in my laptop, again, as an example.
So here we have the cache for an individual core.
So this is, again, single core architecture, right? And you can see here you have up to eight instructions per cycle rather than six instructions.
So here we can do a bit more even in parallel on a single cycle.
And we have a much larger reorder buffer, so 630 micro-operations in here rather than the 70, and seven integer ports
instead of the four that we had on the other example. And again we have some vectorized
execution parts and up to eight micro-operations per cycle.
Right.
So this is also why, if you run the same kind of code on this machine, you
will get a bit better throughput per core than you would have on the Intel core from the
example that I showed you.
OK.
Questions so far?
So just as a quick summary, so we've
talked about the execution and how this is broken down
into these multiple small parts and how these operations then
can be pipelined.
I mean, pipeline parallelism, we'll
see that a couple of times again, or parallelism in general.
We talked about the pipelining hazards,
and then I showed you two different kind of architectures
that you would actually see in practice a lot.
OK, with this, let's switch
and continue with SIMD. So, vectorized execution. We saw these parts on the core already, right?
There's special functional units for vectorized execution or single instruction multiple data.
So today I'll start with the basics.
So how the parallelism works, how vectorized execution works in general, and how this evolves,
because that basically helps you maybe to understand how to find instructions, et cetera,
to get a basic overview of where this is going.
Okay, so first, let's again talk about parallelism a bit.
So, this is basically what we're diving into explicitly right now.
So, we already have some implicit parallelism in our cores just because the CPU decides to do so and because each individual core can do some parallelism,
but we don't really have any handle on this.
So this is basically within the core.
I mean, we can be friendly to that parallelism or unfriendly,
so the core and the compiler can use it,
but we don't have an explicit way
of doing something in there.
And with SIMD, this is basically the first time
that, within a single core,
we can explicitly say: here, I actually
want to have some kind of parallelism.
And first, let's get an overview of the different kind
of parallelisms again.
So we had this pipeline parallelism already last time. So I still want to quickly repeat what this
is about. And so pipelining is basically one way of getting parallelism in the hardware. And this is
basically if we have multiple steps of computation
where each is done one after the other.
And if we have separate functional units for this,
so not the same function, but separate units,
somehow separate resources that we're using for these
individual tasks, then we can build a pipeline and basically
utilize each of the resources independently.
So this is basically, we still have the same kind of sequential execution that we had before.
So even if we have this pipeline parallelism on a modern CPU,
the execution could just be the exact same sequential execution that we have before.
I mean, the CPU will use this reordering, but that's not necessary.
Even without the reordering, we get some parallelism just by having these pipelines, right?
By doing something like the instruction decode while we're doing instruction execution already.
And well, we've talked about the problems around hazards.
And so here, basically, the amount of parallelism
that we get is limited by the length of the pipeline.
And the length of the pipeline, at the same time,
also determines the latency that we get.
So the longer the pipeline, the more cycles we need to execute an individual instruction.
So it's also not useful to have an ever longer pipeline.
If you have a pipeline of length 100 or something, it will take 100 cycles for an individual instruction to execute.
So that doesn't really make sense, right? You don't want this length to be ever increasing.
But this is basically one way to at least harness
or fully utilize all different functional units at the CPU
and don't have the costly units, like the cache load and instruction executions,
or ALU, et cetera, idling while we're doing something else.
So let's use them at the same time.
This is what the pipelining does.
But there's more hardware parallelism that we can use.
So we can also use multiple different parts of the chip area independently, right? Not in a pipeline fashion, but by saying we have, for example, multiple cores on the hardware.
So let's use those in parallel.
Or we have multiple functional units on the hardware.
We saw this.
We have different kind of or multiple ALUs.
So we can use these multiple ALUs at the same time
for different individual operations.
So that's also a form of execution or of parallelism.
And we call this multiple instruction, multiple data,
or MIMD.
A very simple example of this is multi-core.
So we have multiple cores.
Each of the cores does something different, a different task.
So we have different instructions.
We have different data.
They can work independently in parallel.
For this, usually you will have, on the core level or on the multi-core level, the same type
of core multiple times, and it's very flexible,
and so then we can basically speed up programs
where we either have different things to do in parallel
or we do the same kind of
operation in parallel.
But now, a very special kind of parallelism is the vector units.
And this is basically where we have a single instruction, but we work on multiple data
items at the same time.
So we do the same kind of computation on different parts of the data.
We can also do this in a multi-core environment.
So if you think about large-scale data processing systems,
which say Spark or Flink or MapReduce, we do the same thing.
So we have multiple tasks that go through larger data
blocks in parallel.
So this is also using the same instruction,
but it's not a single instruction.
It's basically lots of different instructions on larger blocks.
But now we can also break this down
to individual instruction level.
So this is where the CPUs today in hardware
have individual units, the so-called vector units, that allow you to do this.
So we get a single hardware instruction that will be executed on a vector of individual elements.
So we execute the same assembly instruction on a set of values, and this is called vector units or I mean there's even vector processors that
have like a whole system around this idea.
And so what this looks like in a nutshell, so rather than doing like an individual scalar
operation, so we have two inputs, say we have an addition, so we have two inputs here, where's
my pointer?
There we go.
We have two scalars, so individual integers for example.
And we have a single addition, so this is your standard addition, we get one output. In a vectorized execution, we take two vectors as input, so two, four, eight integers for
example. We do the vectorized addition and we do all of these additions at the
same time in hardware. So there's basically an adder in the hardware that
does the same operation within a single cycle, basically.
Of course, there's still this pipeline, right?
So we have multiple pipeline steps for saying,
okay, we have to load the data into the register.
We have to first decode the instruction,
fetch the instruction, decode the instruction,
load the data in the registers,
execute the instruction, and then write it back.
But we would do the same in the scalar addition.
And here, in the execution step, we do this on the complete vector.
And then we get an output that's also basically a vector of elements. Okay, so getting a bit more into parallelism, there are two types
of parallelism that we can have in applications; I basically already hinted at that.
So there's data level parallelism.
So we're basically doing the same instructions or same kind of operations on multiple data items
at the same time, or there's task level parallelism.
So this is basically, we have different tasks
that do different things independently in parallel.
And typically if you have a complex system,
you will have both, right?
So if you have a data processing system,
you will have multiple tasks that need to do
different things. Say, for example, we have buffer management, we have logging, et cetera, so these
will be in different threads. Or we have multiple queries at the same time;
then these will be different tasks. But within each individual query, in order to speed this up
and not just run on a single core, we will have this data level parallelism.
And even on a single core, we can have data level parallelism
through these vectorized instructions.
And there's many different ways or four major ways how to use this.
So we can have the instructional level parallelism through pipelining and speculative execution.
So this is basically what the core individually does, right?
So we have the pipeline of operations.
So we have pipeline parallelism and we have speculative execution.
So basically we have multiple ports.
We're executing the next operation already before, or while, executing
the previous operation, through a different port.
And we're hoping this not-strictly-ordered execution of our code doesn't bite us.
In many cases we know it doesn't.
And if we're predicting a branch wrongly, it might actually be a problem, and then we
know we have to trace back.
Then we have the vector architectures, or SIMD,
so single instruction, multiple data, and also GPUs.
So GPUs are actually exactly working
in the same way on a larger level, larger scale,
but the idea is basically the same.
We do the same kind of operation,
the same kind of instructions on many data items in parallel. And you'll see this also later in
the lecture. And we basically do same instruction to a collection of data. And then it depends on
how large this collection is and how much hardware we have, right? So how much of the same kind of unit
or how much parallelism this unit has.
This is basically the level of parallelism that we get.
Then we can have thread-level parallelism,
so having multiple threads,
having multiple cores in here,
and/or hyper-threading,
where we get some additional parallelism.
And then of course there's request-level parallelism,
where we have data-level parallelism and task-level parallelism decoupled by the programmer,
by us saying: okay, individual queries will be done on individual cores,
individual tasks within our program will be done
by individual cores, individual threads.
Okay, so for this, there's also Flynn's taxonomy.
That's basically if we have something sequential,
this is basically single instruction stream,
single data stream.
This is basically standard sequential computer model that you would, like if you're just
writing spaghetti code, this is basically what the program will then do, and the CPU
will execute, and you will still get some kind of parallelism on a modern CPU just through
this instruction level parallelism.
Then we have SIMD, which we will talk about today, where we have the same instruction that is executed by multiple processors or special functional units. Multiple instruction
streams on a single data stream, that doesn't really exist. You can hypothesize over this;
this is where people always think: where would this make sense?
It doesn't really make that much sense.
And then we have multiple instruction, multiple data streams.
This is basically your task level parallelism.
So you have individual processors, individual cores doing things on their own. So, on multi-core, multi-socket, complete racks of computers, where everything is
basically completely in parallel. So, there you have the most flexibility, but of course, you also
need to do the most management and keep things in line. Okay, so now let's dive more deeply into yet another example of vectorized execution.
So, we basically have these SIMD instructions that tell the processor to perform the
same operation on multiple values at the same time.
And these exist in all major chip architectures.
Now the tricky thing is, they are different in every chip architecture, right?
There's a different instruction set.
Typically for the basic operations you have them everywhere,
so addition for example, things like that, but they're named differently so they have
to be used differently. What is supported also differs. So on Intel
you have MMX, the multimedia extensions, you have the streaming extensions, and AVX,
the advanced vector extensions. On ARM you have NEON, I don't
remember what that exactly stands for, and the Scalable Vector Extensions, SVE. On SPARC
there's VIS, and you might even find more. And these are all individual instruction sets.
If you directly want to program them, you basically have to use different kind of intrinsics in there.
And our standard example, again, would be simple addition.
So if we do single instructions, single data,
so we're basically running this kind of loop
where we have two vectors that are basically added up
in another vector.
So our basic loop would look something like this.
So we have x and y that are added up into the area z.
And we'll see this many times today, exactly this addition,
because it's not that easy as you think.
So it seems super straightforward.
But you will see there is a couple of catches,
even in this very simple example.
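For reference, the scalar loop under discussion looks roughly like this (the names x, y, z, and max follow the slide; the exact signature is my own sketch):

```cpp
// Element-wise addition of two arrays into a third one.
void add(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; ++i) {
        z[i] = x[i] + y[i];
    }
}
```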
And the very basic way of operating on this, if you run this on your standard ALU
and you don't optimize through the compiler or anything, then what the CPU will basically do
is fetch the first item in the arrays, run it through the ALU,
output the first item, then the second item, and so on, right?
It will just go item by item through the two arrays.
So, in this case, we have eight array elements.
They all have to be added up,
and then we have the final result.
So, I mean, this basically should take us
round about eight cycles.
Of course, we have basically a couple of cycles
in our pipeline, so it will be a bit more
than eight cycles, but this is basically
roughly what it costs.
Now an alternative to that is using the SIMD registers
or the SIMD units, right?
Where we can have multiple data items at the same time. So here, rather than loading the elements into the small registers where individual items or individual variables are loaded, so the integer registers,
we can also use a SIMD register, or vector register, and load four elements.
So with a 128-bit SIMD instruction,
we can load four integers at the same time
and then do the addition across all four at the same time.
So there's basically a functional unit
that will let you add four items at the same time.
And then this will basically be added
and written into an output register.
And that's basically a single calculation.
Of course, as before, we need to load the data into the registers,
but then we do the addition and we write the result back.
The actual execution of the SIMD instruction itself will just be a single cycle,
and we only have to do this twice.
So for the calculations, for the execution, we get basically a single cycle for four items at a time.
So, if all goes well,
we get a performance improvement of a factor of four here:
instead of using eight cycles for the pure execution, we're only going to use two cycles.
And there's even wider than that.
So we have 128-bit, but we also have 256-bit.
And on the modern CPUs, larger server CPUs, we have 512 bits.
So that means we can even do more than that.
So up to 16 values at the same time.
And depending on the size of the values, of course.
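Written out with SSE intrinsics, the 128-bit version of the addition could look like the following sketch. It assumes max is a multiple of 4 and that z does not alias x or y; exactly the edge cases we will come back to below.

```cpp
#include <immintrin.h>

// Add four 32-bit integers per instruction using 128-bit SIMD registers.
void add_sse(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        __m128i vz = _mm_add_epi32(vx, vy);  // four additions in one instruction
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), vz);
    }
}
```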
Okay, so this is clear, right? So this is kind of straightforward. Okay, so then let's briefly
check the SIMD evolution: how did this come about and where are we right now? So there are the Intel Streaming SIMD Extensions.
That's basically a set of SIMD instructions for 128-bit SIMD registers,
which were introduced in '99.
And these build on the MMX extensions, the multimedia extensions. So the initial idea was basically: oh, we
have all these GPUs now, right?
So it kind of makes sense to have this parallel,
or this general graphics-style, execution.
We have this parallel processing,
or the same kind of processing on multiple data items
over and over again.
So let's also have this on the CPU to some extent.
So let's have functional units where we can get at least a bit of the speedup that you
would see on a GPU.
So the same kind of operations, just on a smaller scale.
That's why they're called multimedia extensions or MMX.
You can also still see this in the registers: the way we program this, they're still called mm, for multimedia. So the idea is
we have these 128-bit registers where we can load up to four 32-bit scalar
values, and then we can have a single execution,
a single instruction, on all of the four elements
simultaneously.
And then, of course, we need to move the data
in and out of these registers.
I mean, the same we need to do with other,
like for other kind of operations,
we also need to move the data into the registers.
But here we have these wide registers.
And typically, for 128-bit, there are 16 registers for this.
So this is XMM0 to XMM15. We can use them for these four 32-bit values,
but you can also use them for two 64-bit values
or for 16 8-bit values.
So this is basically where you have
all kinds of different combinations
how you can use them.
And if you use them for 16 8-bit values and do addition on those,
then you will get a 16-fold performance improvement
over a single 8-bit instruction on the regular ALU.
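As a small illustration of this flexibility (not from the slides), the same 128-bit registers can be interpreted at different element widths just by choosing a different intrinsic:

```cpp
#include <immintrin.h>

void element_widths(__m128i a, __m128i b, __m128i* out) {
    out[0] = _mm_add_epi8(a, b);   // 16 additions on  8-bit lanes
    out[1] = _mm_add_epi32(a, b);  //  4 additions on 32-bit lanes
    out[2] = _mm_add_epi64(a, b);  //  2 additions on 64-bit lanes
}
```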
And then you basically have basic arithmetic operations.
So you can do addition, subtraction, et cetera.
And you have logical operations.
And then some kind of management operations
to find out what the result of your vectorized execution is.
What you don't have, and we'll come to this later on,
you don't have branches.
Because if one of these would be a branch,
or we say we have 16 branch conditions,
what would happen next?
So if all of them go in different directions,
then we all of a sudden have to trace
a lot of different stuff, like all kinds of combinations
of potential solutions.
So that doesn't really make sense.
So we'll have to deal with this another way.
And there are comparison instructions and miscellaneous instructions for transferring data between x86
and SIMD registers and moving the data directly into the registers or to memory.
So this is basically everything that you would have for regular registers as well. And besides the streaming extensions or streaming SIMD
extensions, there's also AVX2 and AVX512.
So AVX2 has, rather than 128-bit,
256-bit wide registers.
And AVX512 is then 512-bit registers.
And this is basically 32 of those.
And these now can be used for 64 8-bit integers.
So here, we might get a 64-fold performance improvement
for a single instruction over a scalar.
Or you see all the different steps, right?
And the lowest would be 8 64-bit integers or 8 64-bit
floats that we can process in parallel with these.
But here, all of a sudden, some of the operations
are so complex that the CPU takes a bit longer.
So if you're using 512-bit, then this might, on the one hand, take multiple cycles.
But also, the CPU might clock down a bit
in order to just get through these functional units in time.
On the Skylake CPU, for example,
you would have two of these 512-bit units,
and you can see they can do 128 bytes per cycle.
That should basically be processing the 512 bits
in a single cycle. Okay.
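The same addition with AVX-512 intrinsics then processes 16 32-bit integers per instruction. This is a sketch under the same simplifying assumptions as before (max a multiple of 16, no aliasing), and it needs an AVX-512-capable CPU plus a compiler flag such as -mavx512f:

```cpp
#include <immintrin.h>

void add_avx512(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; i += 16) {
        __m512i vx = _mm512_loadu_si512(x + i);
        __m512i vy = _mm512_loadu_si512(y + i);
        _mm512_storeu_si512(z + i, _mm512_add_epi32(vx, vy));  // 16 adds at once
    }
}
```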
So basically, yeah, let's talk about the SIMD programming model.
So I mean, in general, the pipeline parallelism,
so the parallelism that we get through the pipelining
and through the instruction reordering
basically depends a lot on the in-flight instructions.
So what do I see next?
And then there's this reorder buffer in order
to basically reorder something so I can actually
utilize this parallelism.
But it really doesn't depend on the size of the registers.
Now, in SIMD, this is different.
Here, basically, we say: these couple of operations are really independent.
So these four additions are independent.
You can do this in parallel.
And so there is no hazard within this individual execution.
So within this individual instruction, this can be done in parallel.
I mean, if I have a loop of four additions, one after the other, there might be some kind of hazards.
So we saw this earlier, the data hazard. Like the next loop step might depend
on the previous loop step.
In SIMD, I know this won't happen.
So this is basically, I know this,
or the CPU knows this can be safely executed
in parallel in hardware.
And with this, we can basically get this n-fold performance advantage, so meaning on a 128-bit
SIMD register, basically a four times performance improvement when we're using 32-bit values.
In practice, this is not quite achievable, because we have some additional management to do,
some additional copying.
If we have the very large registers,
the CPU might be a bit slower.
But we can get a definite performance increase.
So we've done some work on hash maps,
where with AVX-512 we get a 16- to 18-fold performance improvement, which
is significant.
So this is basically speeding up hash maps quite significantly.
So with this, we'll go into SIMD programming next.
But first, let me ask you if there's questions so far.
Yes?
Just to clarify, are these 32 512-bit registers per core or per socket?
Per core.
These are per core, like two functional units per core,
basically.
I mean, if we go back to the previous, if I find it.
So you can see that there are basically two,
or here even three, vectorized ALUs on this one,
multiple different vectorized units
on a single core.
So for the larger scale, you have these 128-bit units,
the 256, and the 512-bit units.
Not all of them can be used at the same time.
But say, for example, you have two AVX512 units that can be used in parallel per core.
Okay.
More questions?
No questions?
Then let's do a quick four minute break here.
Okay, so let's get started again.
So SIMD programming.
So there's, I mean, there's multiple different ways
how you can use SIMD programming.
Not all of them will be applicable for the task,
but you can try all of them.
So the first thing that you can do is basically
do nothing and just hope that the compiler does it for you. And that also works surprisingly
well in many cases. It's actually, let's say, the safest way of using SIMD, because there are
a lot of ways you can get this wrong yourself; the compiler, most likely, will do everything correctly.
Meaning the compiler will find some opportunities for SIMDifying your code.
But it might need some guarantees from you.
Meaning it might need some assurance how it can work with the data,
that there's no dependencies, etc.
Then it will try to do this.
We'll see a few things.
And then there's explicit vectorization.
And there's intrinsics, there's operations, there's libraries, etc.
So then there's, again, multiple different ways of how you can do this.
And autovectorization, as I said, it's a good thing
because it works across machines.
Basically, the compiler does it for you.
It's only limited by the patterns that the compiler knows.
So the compiler will basically look at your code, try to find some kind of patterns in the code.
Say a loop, for example, that accesses the same region of data
in every loop step in a clearly identifiable pattern,
there are no dependencies, et cetera,
and the loop is long enough: then the compiler
will try to optimize it itself, will try
to introduce vectorization if it thinks it's making this faster. And that really depends on
what else needs to be done. So, for example, in LLVM, that's one compiler framework,
there's auto-vectorization; there's a link here, you can check how LLVM does auto-vectorization
if you follow the link. It's platform-independent. You still have your scalar code, which is easy to read
and maintainable. And the compiler handles all the tricky cases for you.
And this is basically what we'll go into in a bit more detail soon.
This is basically what you otherwise have to deal with: if you don't let the compiler do it,
then you have to deal with all the tricky cases yourself.
Meaning if your loops are not really well aligned
with the vector sizes, then you have to deal with this.
So if there's some kind of aliasing in your code,
you have to deal with this.
The compiler also knows this and will do this for you.
So this is basically the edge cases, right? So let's look at a simple example here.
It's essentially what we've seen before, right?
So we have an array; it's a bit more complicated here.
We're explicitly loading and storing
the values out of the arrays.
So we're loading.
So we have an array of an arbitrary length, which is max.
And we're first loading an element out of x, then
an element out of y.
We're adding the two, and then we're writing it to z,
to the array z.
So a simple way of, let's say, vectorizing this
is basically making this in a vectorized fashion.
So here we're basically doing four of these calculations
at the same time.
So we're loading small vectors of four.
And then we're just adding these vectors.
And this is essentially what then, if we're vectorizing
or if we're SIMDifying the code, this is what will be done.
So we have basically a vector of four, two vectors of four.
We're adding them up, writing them in a new vector of four.
So this directly maps to SIMD registers, basically,
and will be executed.
If the compiler sees this and it's safe, the compiler will directly translate this
into SIMD code.
So the question is, can the compiler also do this here?
And well, there's a couple of things
you have to think about.
So first, what if max is not a multiple of 4?
Well, if max is not a multiple of 4, then this won't work.
So then we have to at least stop
before the last few items that
no longer make up a full group of 4.
So that needs to be handled separately.
So we need some kind of scalar epilogue
for the remaining loop iterations.
What if z points to the same array as x or y?
So we're basically reusing or doing some in place
computation here.
We're basically writing this result to the same vector.
Maybe this is shifted somehow.
All of a sudden, we might overwrite something
that we need in the later loop iteration.
Well, then basically, a write to z at point i
might change the value of a future x i.
So then all of a sudden, our parallel execution
won't be the same as our scalar execution.
That's, of course, not good.
So we need the same kind of output.
So this is basically, if there's pointer aliasing,
then we have to do it differently.
I mean, OK, we could find out
how the pointers are laid out, whether they overlap, or something like this.
But the compiler won't go through this hassle.
So the compiler will basically say, if I don't know,
I'm going to use scalar execution.
So if there's some aliasing, then it will get scalar.
So if we're doing this hand-vectorized,
doing the same thing ourselves, then what we can do
is: we have a prologue which basically checks
whether there is aliasing or not.
If there is aliasing, well, then there might be some clobbering, right?
So we might overwrite values needed in future loop iterations.
So then let's do the scalar version.
As soon as we're not divisible by 4,
we also have to go to the epilogue, basically.
Otherwise, we can basically compute or do
the computation within our loop,
like where we basically do the computation vectorized.
So our epilogue is basically for all of the items
that are not divisible by 4 anymore.
The main loop will be this.
And this just checks if we have some alias.
So this aliases check is kind of tricky to implement.
The loop conditions and offsets are easy to get wrong, right?
So this is something that you can easily mess up somehow,
and if you mess it up, then your code will be wrong.
And this aliases method here,
this is not something that the compiler gives you;
this is something you have to implement yourself.
This is where you basically have to make sure
that it is correct.
So yeah, you might get it wrong.
Also, i plus 3 could overflow,
so this you have to handle.
Max minus 3, so the part computing the remainder,
this might underflow.
So all of these kinds of problems
you basically have to deal with yourself.
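Putting the pieces together, a hand-vectorized version of the loop could look like the sketch below: a prologue with an aliasing check, a vectorized main loop over full groups of four, and a scalar epilogue. The aliases() helper is exactly the kind of thing you have to write yourself here, and the whole sketch is illustrative, not a drop-in implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Do the address ranges [a, a+n) and [b, b+n) overlap?
static bool aliases(const int* a, const int* b, int n) {
    const auto lo_a = reinterpret_cast<std::uintptr_t>(a);
    const auto hi_a = lo_a + static_cast<std::uintptr_t>(n) * sizeof(int);
    const auto lo_b = reinterpret_cast<std::uintptr_t>(b);
    const auto hi_b = lo_b + static_cast<std::uintptr_t>(n) * sizeof(int);
    return lo_a < hi_b && lo_b < hi_a;
}

void add_hand_vectorized(const int* x, const int* y, int* z, int max) {
    // Prologue: if z may overlap x or y, a vectorized store could clobber
    // inputs of later iterations, so fall back to the scalar loop.
    if (aliases(z, x, max) || aliases(z, y, max)) {
        for (int i = 0; i < max; ++i) z[i] = x[i] + y[i];
        return;
    }
    // Main loop: whole groups of four. Careful with the bound; formulations
    // like "i < max - 3" are exactly where over/underflow bugs sneak in.
    int i = 0;
    for (; i + 4 <= max; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), _mm_add_epi32(vx, vy));
    }
    // Scalar epilogue for the remaining 0 to 3 elements.
    for (; i < max; ++i) z[i] = x[i] + y[i];
}
```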
So what we can do is we can basically utilize the compiler.
So let's just let the compiler do this.
And this is basically what then the compiler would put out.
So we can actually check this.
So you can't really see it well here,
but there's a nice website.
Who of you has heard of Godbolt?
This is something you can use a lot
in the future for your programming.
So here what you can see is basically
the same kind of loop, and the Compiler Explorer,
or Godbolt, there are two pages you can go to.
You give it a small code snippet, and you get the assembly
for this, depending on the compiler that you're using,
depending on the architecture that you're using,
and the compiler flags that you're using.
And this, for example, would be the loop that we've just seen,
this exact same loop here.
Let me find my mouse again here.
So it's really just this.
We have the two arrays, or rather the three arrays, and the max integer.
And then we're just doing the addition.
And you can see the compiler generates quite a bit of code
just for this.
And this is basically for the different cases.
I mean, you can see there are basic calculations,
different kinds of conditions, and then jump points depending on what max is.
So if max is smaller than a certain number, so we can't really fill a full vector,
then we have this epilogue that we need to calculate with.
There might be aliasing or not.
So depending on all of these different points, the compiler will produce separate code for these
and then have like one hot loop that if everything is nice,
we can actually use these individual instructions.
I think this is something like, let me see,
I think L7 or something. That's here.
Here you can basically see we're actually
doing a vectorized instruction in here.
So this would be a vectorized addition in here.
So if everything goes nice, all of the conditions are fine,
then we're doing a vectorized instruction.
Otherwise, we might just use regular instructions,
regular additions.
We might unroll the additions to some degree, et cetera.
And you can see this: there's lots of different code
for different kinds of branches, but then there's this one hot loop.
I mean, here we even have, hmm, I didn't scroll to the right hot loop.
Here we see that we're basically loading the registers, we're adding, and then we're
incrementing the counter, et cetera, and continuing the loop, right?
So this is basically: all is good,
we're within the loop, and we're just
adding up these two arrays in a vectorized fashion.
And so sometimes this doesn't really work.
So sometimes the compiler won't give you
such a nice vectorized loop.
And then the compiler, or the vectorization pass,
can even emit some information on why this didn't work.
In Clang, you can use -Rpass or -Rpass-analysis;
in GCC, -fopt-info. Say, for example, you're using Clang 16, and we
have this kind of loop here; it's basically the same kind of loop again that we had earlier,
with the small change that our loop is fairly short here.
So then the compiler all of a sudden
And we can actually, using this analysis information,
we can actually check what's going on.
So this is also something I can show you.
So here we have exactly the same loop.
And you can see the code that comes out
is just basically an unrolled loop here.
So the compiler doesn't do nothing:
it still does some optimization,
not by building a loop with some kind of condition,
but by just completely unrolling this loop.
So we're just doing eight individual additions here,
one after the other.
And the compiler also tells us why.
So it doesn't vectorize, because the store might be clobbered.
So there might be aliasing, meaning we might overwrite something:
storing the data into z
might overwrite something in y or x.
So this is the problem we had before.
And out of this small code snippet,
there's no way we can actually tell that this doesn't happen.
I hope you agree with this.
So here, x and y might point to the same array as z.
They all might point to the same array, essentially.
So then we're just doubling each value in the array,
a weird way of doubling each value,
and shifting it to make it a bit more exciting,
something like this.
So then this won't work,
and then we have a problem.
And the compiler notices this.
So the compiler can't say if this is safe or not.
So the compiler won't vectorize
and will just do loop unrolling here.
But we can help the compiler.
Rather than leaving it at this, we might know that this is not true.
So we might know that there is no aliasing in here:
the z pointer, the pointer to array z, is not reused anywhere else.
The compiler sees that there is this potential clobbering because of the store,
and the cost model knows that there are only eight loop iterations,
so it just unrolls this rather than doing
lots and lots of checks.
So in the previous version, we saw that the compiler actually
does for the same loop.
So this, I mean, if you look at the previous loop here,
that's the previous loop.
It's the same loop.
There's no change here.
But it might be executed much more frequently.
So this might run for 100 or 1,000 or a million iterations.
So there, if you have a million iterations,
checking all these things might still make sense.
If it's only eight iterations,
all the checks will be costly.
But if we don't need some of the checks,
this might still make sense, right?
So the compiler doesn't rule out aliasing.
So we can tell the compiler, you can actually
rule out aliasing.
So you can use a compiler hint to provide
this additional guarantee.
And the compiler can use this guarantee.
But here, if you're providing this kind of guarantee, it's your
duty to make sure that it's actually true, right? Because the compiler then doesn't check anymore.
The compiler says: you're going to be right. So if you say there's no aliasing here, I believe
you. If all of a sudden there is aliasing, you're going to be in trouble. You're probably going to get some runtime problem,
a segmentation fault or whatever.
And here the compiler hint is restrict.
So here we basically give a guarantee
that the pointer does not alias any other pointer.
And this is not standardized in C++,
but most compilers support it.
So you just say: well, we have our compiler hints here.
We have our input arrays here, x and y,
and we have our z array,
and z is not aliasing any of x or y.
And if we use this here, then the compiler says, OK, cool.
I don't need to do any checks, at least for that part.
Also, I don't need to do any checks for whether
we need an epilogue because the count is not divisible by 4,
for example.
So this nicely fits into my registers,
into my multimedia extension or vector registers.
So I can just directly use the code as is.
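As a sketch, the hinted version of the short loop could look like this; GCC and Clang spell the hint __restrict__ (MSVC uses __restrict), and the remark flags in the comment are the diagnostics mentioned above for checking what the compiler actually did:

```cpp
// Compile with optimization and vectorization remarks, e.g.:
//   clang++ -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize add.cpp
//   g++     -O3 -fopt-info-vec-missed add.cpp
//
// The __restrict__ qualifiers promise that z does not alias x or y.
// If that promise is broken, the behavior is undefined; the compiler
// no longer checks.
void add_restrict(const int* __restrict__ x,
                  const int* __restrict__ y,
                  int* __restrict__ z) {
    for (int i = 0; i < 8; ++i) {  // the short, fixed-length loop from the example
        z[i] = x[i] + y[i];
    }
}
```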
And we can look at this as well.
So here we can also do a comparison with the same thing.
So here on top, we have the version with restrict,
on the bottom we have the previous version.
So the previous version is still the same, right?
So it's just basically this unrolled loop.
On the top we have the same addition,
but what we do is we're moving the data into the SIMD registers. So we're basically loading the first half of x
and the first half of y into registers here,
then we're doing the addition,
and then the result is written back to z here.
And this is also unrolled, so we don't have a loop in this code, because it
doesn't make sense to do a loop for two steps, right?
And with this, because we have at least two SIMD units on our CPU, this might actually
run completely in parallel here.
So if everything goes well, these two additions
here will run completely in parallel within a single cycle.
And hopefully this will then give a significant performance improvement over the lower code.
Although, why won't we get an eight-fold performance improvement here?
Who has an idea?
Why do I not expect an eight-fold performance improvement here?
I mean, here I can use two SIMD registers with four additions each at the same time.
What's going to happen down here?
Is the core just going to do one addition at a time? Most likely not.
Because if there's no data dependency in the code, no data hazard, then, on the one hand, we get some pipeline parallelism, but we also have
multiple functional units.
We have multiple ALUs on the same core, so most likely the CPU
will process them on multiple ports at the
same time. So with this we'll also get some performance improvement; we'll have
more than a single instruction per cycle, maybe two instructions
per cycle or something like this,
because we can do them in parallel due to the multiple functional units on the same core.
Makes sense? So the top one will be faster, but it's not going to be linearly faster than the
lower one, because the CPU is also trying to be smart about the scalar part.
Okay? Good. Let me check how much time I have here.
Where's my mouse? Yeah, maybe it makes sense to take a break here, because the next part will be a bit
lengthy, and then we'll do this next week.
So next week then we'll talk about explicit vectorization.
On Tuesday, you will get an introduction to the task, so what you have to do.
This is basically the SIMD scan, right?
So something where you do a true database operation, a couple of SIMD operations,
and where you can actually get a nice performance improvement through SIMDification of your code.
Yeah, questions so far?
No questions?
Well, then check out Godbolt.
The links are in the slides.
This is super helpful as soon as you're
starting doing some kind of vectorization.
And with that, well, enjoy the rest of the day
and the holiday tomorrow, and see you next week.
Thank you.