Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Vectorized Execution
Episode Date: April 30, 2024...
Transcript
Okay, so despite the nice weather, we have a couple of people here.
Great, thanks for coming.
I'd rather be outside in the sun right now, to be honest.
But I also enjoyed this.
Okay, let's get started.
So we're going to talk about vectorized execution later today.
So first we're going to go back to where we left off last time, which is prefetching.
But before that, one quick announcement: we'll have the lecture series tonight again,
with Ralf Herbrich on AI and sustainability. So again, it could be interesting
to hear about a project there. It's going to be a bit of an overview,
not so much mini projects,
but one or two dedicated projects.
And today we're going to dive into vectorized execution.
Tomorrow there is no class.
So for those of you who have not heard this,
there's a holiday tomorrow, so you can actually enjoy the sun.
But then next week, we're also not going to start with a lecture, but we're going to start
with a task.
So that's also going to be interesting.
You will do some SIMD programming, so we'll touch on this today, finish up next week.
So then you should know enough to
actually solve
the task. It's not super hard. It's
just a bit tricky to find the commands, etc.
But you will
manage. And then we'll
go on. So I'm actually
not going to be around on the 14th
and 15th. So then Martin will tell
you all the funny and nice things about
execution models and data structures, which is also a topic that I like a lot, but because I like
it a lot, of course I share it.
Okay, so with this, let's go back to where we left off last time, which is prefetching.
So you remember we talked about instruction execution, so how instructions are broken down into smaller instructions
and then executed on the core.
And we've heard about the functional units, et cetera.
And we've heard about special operations for loading data, and this is basically where a lot
of time goes, right? So you've heard before that we have this memory bottleneck,
which in data processing is something that we have to be aware of. We cannot just do lots
and lots of code execution, we cannot jump around in the code a lot, because that will stall our instruction execution just
because we don't have the code ready.
And the same thing is true for the data.
So if we have lots of data that we need to read,
then ideally somehow the data should
be already available in the caches.
If we have to go to memory, the CPU
will have to wait many cycles, so 100, 200 cycles
or something, just idling around doing nothing.
And in order to somehow mitigate this,
there's a concept called prefetching.
And there are two ways this can be done:
the CPU does it for us itself,
or we can do it ourselves.
So this is something you can try out and we can see.
So there's a hardware prefetcher.
And the idea here is that basically, I mean,
we already know, right?
So the CPU is loading cache lines, not individual bytes, and if we can use everything
that's within a cache line, then the CPU will have
something to do for a couple of cycles, which is good,
because we don't have to load every individual byte
from memory.
At the same time, if we already know, or the CPU can guess,
that we're going to use the next
cache line as well, then it can prefetch it. That's the so-called adjacent
cache line prefetcher, or also the stride prefetcher. So the
CPU can prefetch the next cache line, guessing that this is a cache line
that we will also be using: if we've used the first cache line, just read through it, and there's a pattern,
then let's take the next one as well. The same is true if we hop through an array to some degree;
then there might be a stride pattern, right? So we use every other cache line, for example.
So this is also something that the CPU can do by itself.
But often, the prefetcher does not prefetch
across page boundaries.
If we're in virtual memory, the next page
might be on disk or something,
so then there's additional stuff to do,
and this might be costly.
So let's keep it within a page, basically.
The other thing we can do is we can also prefetch ourselves
or at least instruct the CPU to do so.
We can basically use explicit operations or instructions
that tell the CPU it might make sense to do some prefetching here.
So there are so-called built-ins.
These are basically compiler hints
or instructions that will hopefully lead to
some desired behavior, and
nothing would actually break if it doesn't happen.
So this is basically just an optimization that we're doing.
Here, for example, this built-in
prefetch basically takes a memory address that we want to prefetch, and
then whether it's a read or a write, because that makes a difference in terms of
cache coherency, right, how we are locking this, and then
the access pattern.
So basically there is temporal locality or there is no temporal locality,
meaning we're just going to read or access this once,
or we want to access it multiple times.
So if we're just doing a scan through a lot of data, then we'll only access this once,
meaning once we've used it, the CPU can replace it in the cache.
It's not going to be used again.
If we indicate that this will definitely be used multiple times,
then the CPU will try to keep it in the cache,
and we can make use of that multiple times.
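To make this concrete, here is a minimal sketch in C++ of what such a software prefetch can look like with GCC's and Clang's __builtin_prefetch. The scan function, the data layout, and the lookahead distance of 16 elements are made up for illustration; the right distance depends on the machine and on how much work is done per element.

```cpp
#include <cstddef>
#include <cstdint>

// Sum over a large array while prefetching a fixed distance ahead.
uint64_t sum_with_prefetch(const uint64_t* data, std::size_t n) {
    constexpr std::size_t kDistance = 16;  // elements to run ahead; tune per machine
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kDistance < n) {
            // 2nd argument: 0 = prefetch for read, 1 = prefetch for write.
            // 3rd argument: temporal locality, 0 = none (streaming scan), 3 = high.
            __builtin_prefetch(&data[i + kDistance], /*rw=*/0, /*locality=*/0);
        }
        sum += data[i];
    }
    return sum;
}
```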
And there's yet another option.
These approaches both basically go to the hardware prefetcher:
we're telling the hardware prefetcher to do some prefetching here, please.
But we can also mimic the same behavior
by building something called a helper thread.
So the idea is, I mean, what does a prefetcher do?
It just gets some data from memory into our caches.
And so we can basically instrument or write some code
that does the same thing just by basically accessing data.
We can ensure that the data is in cache at some point.
So if we know what our code will access next,
we can basically have a separate thread, of course not necessarily on the same core,
but in the same socket.
If we have it on a different socket,
that might actually be very bad.
But if we have it on the same socket,
or the same NUMA region,
this means we have the data in the caches,
and we can basically create a thread that
accesses the data and gets the data into the right caches,
at least the shared lower-level caches, not the per-core caches if it's a separate core.
So the idea is we create basically a second thread that performs the same operations as
the first thread, but only in terms of data accesses.
So we're basically doing the same kind of data accesses
without the computation, so this thread will be faster.
It will run before the actual thread that does the work,
and it will warm our caches.
So the data is already in a cache,
which means we have faster access to this.
So it needs to somehow work ahead.
So we have our main thread, let me see if this works, right?
So we have the main thread that works on the caches, and we have the helper thread that basically says:
okay, this is the data that we will access next,
so I'm just going to access it and we'll have it in the caches.
So this means the data is already in some level of cache, at least in the shared last-level cache if this is on another core.
And if they're on the same core, then this might even
work through scheduling.
So we're definitely saving these main memory accesses,
and the main thread then
works with the data already there.
The main thread and the helper thread can also share a data structure,
a so-called work-ahead set, which has the pointers that the helper thread has to
prefetch. This is very much similar to
using the built-in from before: there I can also say, well, these are
the addresses that I want to access next, so I'm using this
built-in prefetch to warm the caches in the same way.
The difference is that here, the helper thread will definitely access this.
So this will definitely be executed.
Now, this can also lead to a problem.
Because if they're both accessing
the same cache line at the same time,
then we have a conflict.
So all of a sudden, basically,
they're conflicting on the same cache line.
So there basically might be cache flushing.
So the individual cache line will
have to be reloaded because we want to have in-order processing
on the CPU.
So we have basically what's called a memory order machine
clear event, which completely flushes the pipeline
and will be super slow.
So this basically all of a sudden
means our program gets a lot slower than before,
because all of a sudden we have these conflicts on the cache
line. Or if the helper thread spins on a lock to wait for new data,
then the helper thread is basically waiting.
We have a lot of these memory order machine clear events,
so basically we're reducing performance.
So an idea is, rather than working in lockstep,
we can do this backwards.
Then we have these conflicts much less often,
because if we have a larger working set,
the helper thread works its way through the working set from the other end,
and only at the point where the two meet might we have such an event.
And this will hopefully be much less frequent.
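As a rough illustration of this idea, here is a sketch of a helper thread that walks a shared work-ahead set of pointers backwards and only touches the data, while the main thread does the actual computation. The Row struct, the placeholder computation, and the missing core pinning are all simplifications I introduced; a real implementation would pin both threads to the same socket or NUMA region and tune the work-ahead distance carefully.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Row {
    uint64_t key;
    uint64_t payload[7];  // roughly one cache line per row
};

uint64_t process(const std::vector<const Row*>& work_ahead_set) {
    std::atomic<bool> done{false};

    // Helper thread: the same data accesses as the main thread, but no
    // computation, so it runs ahead and pulls the rows into the shared caches.
    std::thread helper([&] {
        volatile uint64_t sink = 0;  // keep the loads from being optimized away
        // Walk the working set backwards so the two threads only meet once
        // instead of repeatedly conflicting on the same cache lines.
        for (auto it = work_ahead_set.rbegin();
             it != work_ahead_set.rend() && !done.load(std::memory_order_relaxed);
             ++it) {
            sink += (*it)->key;
        }
    });

    // Main thread: the actual work, hopefully hitting warm caches.
    uint64_t result = 0;
    for (const Row* row : work_ahead_set) {
        for (uint64_t v : row->payload) {
            result += row->key * v;  // placeholder computation
        }
    }
    done.store(true, std::memory_order_relaxed);
    helper.join();
    return result;
}
```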
So here, out of a paper, you can see an example of the performance that you might
see, I think it was some kind of scanning workload.
So what we see: we have a single-threaded execution.
So this would be the blue line, right?
Oops, this is hard to steer.
Let me try it this way.
This is the single-threaded execution.
Then if we have the forward helper thread,
that's basically okay for a small workahead size.
If it gets larger, it might actually be a problem.
We might get into conflicts, and we have these memory order machine clear events.
We might have a single forward helper thread with a spin loop, so that's even worse.
So here we get more basically clearings and more conflicts.
And if we have the backwards helper thread,
we have actually a bit of a performance boost.
It's not a lot, but we get 10, 15% performance improvement
through this helper thread, through this prefetching, basically.
This is where we're saving by not having to access the memory
and instead just getting the data from the caches.
Okay? So, again, similar to using the prefetcher: if there's more to do in our main thread,
more execution, then the performance improvement might even be higher.
At a certain point, if we're just scanning through the data,
we'll always be bottlenecked by the memory.
So this will always be a bottleneck,
but we can take out some of the additional overhead of synchronously accessing the memory.
So this is basically where we get kind of a pipeline improvement here.
Okay so much for prefetching.
Not that much. It's just a good thing to know that this is happening.
This might also bite you if you're not using it properly, right?
As we've seen. And with this,
let's finally, for this part, look at current pipeline architecture.
So just so you have an idea of what a core typically
looks like these days.
So this is Sunny Cove.
So Ice Lake, it's not the latest.
So I don't remember what's the current core architecture.
Something, something, Cove.
But right now, we would be at the Sapphire Rapids CPU, so
this is two or three generations old. And this is the per-core
architecture. This is the way the chips are
designed: you basically have a chip architecture, the overall layout, and
then for each core you have a core architecture. Depending on how
the chips are produced and what the design strategies are, the core
architectures might be used for multiple generations of chips, and then there are
advancements in how they are interconnected, how the caches are
laid out, et cetera.
But the individual cores will typically stay the same for a couple of generations.
And this is one of these core architectures.
And I mean, the changes are not significant.
So what it looks like: you have this front end, which is basically your branch predictor, the micro-operation cache, the translation lookaside buffers
for the internal caches, the instruction cache, and the decoders.
And then you have these reservation stations.
So you have multiple different functional units on there,
and you also have multiple functional units
of the same type.
So here we can see there's basically integer units
and vector units.
And then we have somewhere we should
have floating point units.
I don't see it right now.
And we have load and store units and some kind of storage.
And the ALUs are basically the arithmetic logical units.
The vectorized computations are what we'll look at next.
And say a basic computation, basic addition, could go to any of the ALUs.
So there's basically four different ports where an individual addition can go to.
And multiple of these ports can work at the same time.
So this means we can actually also execute multiple additions at the same time.
So within a single core we can also do different kinds
of things: we could do a logical operation while we're doing some kind of addition,
and we can start a load or a store while doing an arithmetic operation at the same
time. So, as we're saying, there are six micro-operations at the same time.
We have eight, or here even ten, ports.
So there's a bit of flexibility;
not everything can work at the same time.
And we have a micro-operation queue up there,
basically in the front end, where
we can have up to 70 micro operations queued
which can be reordered to some degree.
So depending on whether there are data dependencies in between: say, for example, we
have a load, and the load will take a bit of time, then we can do some addition
that doesn't depend on the load before doing the dependent operation.
Okay, so this is one example.
This is another example.
So this is what I have here in my laptop, again, as an example.
So here we have the cache for an individual core.
So this is, again, single core architecture, right? And you can see here you have up to eight instructions per cycle rather than six instructions.
So here we can do a bit more even in parallel on a single cycle.
And we have a much larger reorder buffer, so 630 micro-operations in here rather than the 70, and seven integer ports
instead of the four that we had on the other example. And again we have some vectorized
execution parts and up to eight micro-operations per cycle.
Right.
So this is also why, if you run the same kind of code on this machine, you
will get a bit better throughput per core than you would have on the Intel core from the
example that I showed you.
OK.
Questions so far?
So just as a quick summary, so we've
talked about the execution and how this is broken down
into these multiple small parts and how these operations then
can be pipelined.
I mean, pipeline parallelism, we'll
see that a couple of times again, or parallelism in general.
We talked about the pipelining hazards,
and then I showed you two different kind of architectures
that you would actually see in practice a lot.
OK, with this, let's switch
and continue with SIMD. So, vectorized execution. We saw these parts on the core already, right?
There's special functional units for vectorized execution or single instruction multiple data.
So today I'll start with the basics.
So how the parallelism works, how vectorized execution works in general, and how this evolves,
because that basically helps you maybe to understand how to find instructions, et cetera,
to get a basic overview of where this is going.
Okay, so first, let's again talk about parallelism a bit.
So, this is basically what we're diving into explicitly right now.
So, we already have some implicit parallelism in our cores just because the CPU decides to do so and because each individual core can do some parallelism,
but we don't really have any handle on this.
So this is basically within the core.
I mean, we can be friendly to that parallelism or unfriendly,
so the core and the compiler can use it,
but we don't have an explicit way
of doing something in there.
And with SIMD, this is basically the first time
that, within a single core,
we can explicitly say: here, I actually
want to have some kind of parallelism.
And first, let's get an overview of the different kind
of parallelisms again.
So we had this pipeline parallelism already last time. So I still want to quickly repeat what this
is about. And so pipelining is basically one way of getting parallelism in the hardware. And this is
basically if we have multiple steps of computation
where each is done one after the other.
And if we have separate functional units for this,
so not the same function, but separate units,
somehow separate resources that we're using for these
individual tasks, then we can build a pipeline and basically
utilize each of the resources independently.
So this is basically, we still have the same kind of sequential execution that we had before.
So even if we have this pipeline parallelism on a modern CPU,
the execution could just be the exact same sequential execution that we have before.
I mean, the CPU will use this reordering, but that's not necessary.
Even without the reordering, we get some parallelism just by having these pipelines, right?
By doing something like the instruction decode while we're doing instruction execution already.
And well, we've talked about the problems around hazards.
And so here, basically, the amount of parallelism
that we get is limited by the length of the pipeline.
And the length of the pipeline, at the same time,
also determines the latency that we get.
So the longer the pipeline, the more cycles we need to execute an individual instruction.
So it's also not useful to have an ever longer pipeline.
If you have a pipeline of length 100 or something, it will take 100 cycles for an individual instruction to execute.
So that doesn't really make sense, right? You don't want this length to be ever increasing.
But this is basically one way to at least harness
or fully utilize all different functional units at the CPU
and don't have the costly units, like the cache load and instruction executions,
or ALU, et cetera, idling while we're doing something else.
So let's use them at the same time.
This is what the pipelining does.
But there's more hardware parallelism that we can use.
So we can also use multiple different parts of the chip area independently, right? Not in a pipeline fashion, but by saying we have, for example, multiple cores on the hardware.
So let's use those in parallel.
Or we have multiple functional units on the hardware.
We saw this.
We have different kind of or multiple ALUs.
So we can use these multiple ALUs at the same time
for different individual operations.
So that's also a form of execution or of parallelism.
And we call this multiple instruction, multiple data,
or MIMD.
A very simple example of this is multi-core.
So we have multiple cores.
Each of the cores does something different, a different task.
So we have different instructions.
We have different data.
They can work independently in parallel.
For this, usually you will have, on the core level or on the multi-core level, the same type
of core multiple times, and it's very flexible,
and so then we can basically speed up programs
where we either have different things to do in parallel
or we do the same kind of
operation in parallel.
But now, a very special kind of parallelism is the vector units.
And this is basically where we have a single instruction, but we work on multiple data
items at the same time.
So we do the same kind of computation on different parts of the data.
We can also do this in a multi-core environment.
So if you think about large-scale data processing systems,
which say Spark or Flink or MapReduce, we do the same thing.
So we have multiple tasks that go through larger data
blocks in parallel.
So this is also using the same instruction,
but it's not a single instruction.
It's basically lots of different instructions on larger blocks.
But now we can also break this down
to individual instruction level.
So this is where the CPUs today in hardware
have individual units, the so-called vector units, that allow you to do this.
So we get a single hardware instruction that will be executed on a vector of individual elements.
So we execute the same assembly instruction on a set of values, and this is called vector units or I mean there's even vector processors that
have like a whole system around this idea.
And so what this looks like in a nutshell, so rather than doing like an individual scalar
operation, so we have two inputs, say we have an addition, so we have two inputs here, where's
my pointer?
There we go.
We have two scalars, so individual integers for example.
And we have a single addition, so this is your standard addition, we get one output. In a vectorized execution, we take two vectors as input, so two, four, eight integers for
example. We do the vectorized addition and we do all of these additions at the
same time in hardware. So there's basically an adder in the hardware that
does the same operation within a single cycle, basically.
Of course, there's still this pipeline, right?
So we have multiple pipeline steps for saying,
okay, we have to load the data into the register.
We have to first decode the instruction,
fetch the instruction, decode the instruction,
load the data in the registers,
execute the instruction, and then write it back.
But we would do the same in the scalar addition.
And here, in the execution step, we do this on the complete vector.
And then we get an output that's also basically a vector of elements. Okay, so getting a bit more into parallelism, there are two types
of parallelism that we can have in applications; I basically already hinted at that.
So there's data level parallelism.
So we're basically doing the same instructions or same kind of operations on multiple data items
at the same time, or there's task level parallelism.
So this is basically, we have different tasks
that do different things independently in parallel.
And typically if you have a complex system,
you will have both, right?
So if you have a data processing system,
you will have multiple tasks that need to do
different things. Say, for example, we have buffer management, we have logging, et cetera, so these
will be in different threads. Or we have multiple queries at the same time;
then these will be different tasks. But within each individual query, in order to speed this up
and not just run on a single core, we will have this data level parallelism.
And even on a single core, we can have data level parallelism
through these vectorized instructions.
And there's many different ways or four major ways how to use this.
So we can have the instructional level parallelism through pipelining and speculative execution.
So this is basically what the core individually does, right?
So we have the pipeline of operations.
So we have pipeline parallelism and we have speculative execution.
So basically we have multiple ports.
We're executing the next operation already before, or while, executing
the previous operation, through a different port.
And we're hoping this not-strictly-ordered execution of our code doesn't bite us.
In many cases we know it doesn't.
And if we're predicting a branch wrongly, it might actually be a problem, and then we
know we have to trace back.
Then we have the vector architectures, or SIMD,
so single instruction, multiple data, and also GPUs.
So GPUs are actually exactly working
in the same way on a larger level, larger scale,
but the idea is basically the same.
We do the same kind of operation,
the same kind of instructions on many data items in parallel. And you'll see this also later in
the lecture. And we basically do same instruction to a collection of data. And then it depends on
how large this collection is and how much hardware we have, right? So how much of the same kind of unit
or how much parallelism this unit has.
This is basically the level of parallelism that we get.
Then we can have thread-level parallelism,
so having multiple threads,
having multiple cores in here,
and/or hyper-threading,
where we get some additional parallelism.
And then of course there's request-level parallelism,
where we have data-level parallelism and task-level parallelism decoupled by the programmer,
by us saying: okay, individual queries will be done on individual cores,
individual tasks within our program will be done
by individual cores, individual threads.
Okay, so for this, there's also Flynn's taxonomy.
That's basically if we have something sequential,
this is basically single instruction stream,
single data stream.
This is basically standard sequential computer model that you would, like if you're just
writing spaghetti code, this is basically what the program will then do, and the CPU
will execute, and you will still get some kind of parallelism on a modern CPU just through
this instruction level parallelism.
Then we have SIMD, which we will talk about today, where we have the same instruction that is executed by multiple processors or special functional units. Multiple instruction
streams on a single data stream, that doesn't really exist. You can hypothesize over this;
this is where people always think: where would this make sense?
It doesn't really make that much sense.
And then we have multiple instruction, multiple data streams.
This is basically your task level parallelism.
So you have individual processors, individual cores doing things on their own. So, on multi-core, multi-socket, complete racks of computers, where everything is
basically completely in parallel. So, there you have the most flexibility, but of course, you also
need to do the most management and keep things in line. Okay, so now let's dive more deeply into yet another example of vectorized execution.
So, we basically have these SIMD instructions that tell the processor to perform the
same operation on multiple values at the same time.
And these exist in all major chip architectures.
Now the tricky thing is, they are different in every chip architecture, right?
There's a different instruction set.
Typically for the basic operations you have them everywhere,
so addition for example, things like that, but they're named differently so they have
to be used differently. What is supported also differs. So on Intel
you have MMX, the multimedia extensions, you have the streaming extensions, and AVX,
the advanced vector extensions. On ARM you have NEON, I don't
remember what that exactly stands for, and the Scalable Vector Extensions, SVE. On SPARC
there's VIS, and you might even find more. And these are all individual instruction sets.
If you directly want to program them, you basically have to use different kind of intrinsics in there.
And our standard example, again, would be simple addition.
So if we do single instructions, single data,
so we're basically running this kind of loop
where we have two vectors that are basically added up
in another vector.
So our basic loop would look something like this.
So we have x and y that are added up into the area z.
And we'll see this many times today, exactly this addition,
because it's not that easy as you think.
So it seems super straightforward.
But you will see there is a couple of catches,
even in this very simple example.
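For reference, the scalar loop under discussion looks roughly like this (the names x, y, z, and max follow the slide; the exact signature is my own sketch):

```cpp
// Element-wise addition of two arrays into a third one.
void add(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; ++i) {
        z[i] = x[i] + y[i];
    }
}
```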
And the very basic way of operating on this, if you run this on your standard ALU
and you don't optimize through the compiler or anything, then what the CPU will basically do
is fetch the first item in the arrays, run it through the ALU,
output the first item, then the second item, and so on, right?
It will just go item by item through the two arrays.
So, in this case, we have eight array elements.
They all have to be added up,
and then we have the final result.
So, I mean, this basically should take us
round about eight cycles.
Of course, we have basically a couple of cycles
in our pipeline, so it will be a bit more
than eight cycles, but this is basically
roughly what it costs.
Now an alternative to that is using the SIMD registers
or the SIMD units, right?
Where we can have multiple data items at the same time. So here, rather than loading the elements into the small registers where individual items or individual variables are loaded, so the integer registers,
we can also use a SIMD register, or vector register, and load four elements.
So with a 128-bit SIMD instruction,
we can load four integers at the same time
and then do the addition across all four at the same time.
So there's basically a functional unit
that will let you add four items at the same time.
And then this will basically be added
and written into an output register.
And that's basically a single calculation.
Of course, as before, we need to load the data into the registers,
but then we do the addition and we write the result back.
The actual execution of the SIMD instruction itself will just be a single cycle,
and we only have to do this twice.
So for the calculations, for the execution, we get basically a single cycle for four items at a time.
So, if all goes well,
we get a performance improvement of a factor of four here:
instead of using eight cycles for the pure execution, we're only going to use two cycles.
And there's even wider than that.
So we have 128-bit, but we also have 256-bit.
And on the modern CPUs, larger server CPUs, we have 512 bits.
So that means we can even do more than that.
So up to 16 values at the same time.
And depending on the size of the values, of course.
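Written out with SSE intrinsics, the 128-bit version of the addition could look like the following sketch. It assumes max is a multiple of 4 and that z does not alias x or y; exactly the edge cases we will come back to below.

```cpp
#include <immintrin.h>

// Add four 32-bit integers per instruction using 128-bit SIMD registers.
void add_sse(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        __m128i vz = _mm_add_epi32(vx, vy);  // four additions in one instruction
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), vz);
    }
}
```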
Okay, so this is clear, right? So this is kind of straightforward. Okay, so then let's briefly
check the SIMD evolution: how did this come about and where are we right now? So there are the Intel Streaming SIMD Extensions.
That's basically a set of SIMD instructions for 128-bit SIMD registers,
which were introduced in '99.
And these build on the MMX extensions, the multimedia extensions. So the initial idea was basically: oh, we
have all these GPUs now, right?
So it kind of makes sense to have this parallel,
or this general graphics-style, execution.
We have this parallel processing,
or the same kind of processing on multiple data items
over and over again.
So let's also have this on the CPU to some extent.
So let's have functional units where we can get at least a bit of the speedup that you
would see on a GPU.
So the same kind of operations, just on a smaller scale.
That's why they're called multimedia extensions or MMX.
You can also still see this in the registers: the way we program this, they're still called mm, for multimedia. So the idea is
we have these 128-bit registers where we can load up to four 32-bit scalar
values, and then we can have a single execution,
a single instruction, on all of the four elements
simultaneously.
And then, of course, we need to move the data
in and out of these registers.
I mean, the same we need to do with other,
like for other kind of operations,
we also need to move the data into the registers.
But here we have these wide registers.
And typically, for 128-bit, there are 16 registers for this.
So this is XMM0 to XMM15. We can use them for these four 32-bit values,
but you can also use them for two 64-bit values
or for 16 8-bit values.
So this is basically where you have
all kinds of different combinations
how you can use them.
And if you use them for 16 8-bit values and do addition on those,
then you will get a 16-fold performance improvement
over a single 8-bit instruction on the regular ALU.
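As a small illustration of this flexibility (not from the slides), the same 128-bit registers can be interpreted at different element widths just by choosing a different intrinsic:

```cpp
#include <immintrin.h>

void element_widths(__m128i a, __m128i b, __m128i* out) {
    out[0] = _mm_add_epi8(a, b);   // 16 additions on  8-bit lanes
    out[1] = _mm_add_epi32(a, b);  //  4 additions on 32-bit lanes
    out[2] = _mm_add_epi64(a, b);  //  2 additions on 64-bit lanes
}
```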
And then you basically have basic arithmetic operations.
So you can do addition, subtraction, et cetera.
And you have logical operations.
And then some kind of management operations
to find out what the result of your vectorized execution is.
What you don't have, and we'll come to this later on,
you don't have branches.
Because if one of these would be a branch,
or we say we have 16 branch conditions,
what would happen next?
So if all of them go in different directions,
then we all of a sudden have to trace
a lot of different stuff, like all kinds of combinations
of potential solutions.
So that doesn't really make sense.
So we'll have to deal with this another way.
And there are comparison instructions and miscellaneous instructions for transferring data between x86
and SIMD registers and moving the data directly into the registers or to memory.
So this is basically everything that you would have for regular registers as well. And besides the streaming extensions or streaming SIMD
extensions, there's also AVX2 and AVX512.
So AVX2 has, rather than 128-bit,
256-bit wide registers.
And AVX512 is then 512-bit registers.
And this is basically 32 of those.
And these now can be used for 64 8-bit integers.
So here, we might get a 64-fold performance improvement
for a single instruction over a scalar.
Or you see all the different steps, right?
And the lowest would be 8 64-bit integers or 8 64-bit
floats that we can process in parallel with these.
But here, all of a sudden, some of the operations
are so complex that the CPU takes a bit longer.
So if you're using 512-bit, then this might, on the one hand, take multiple cycles.
But also, the CPU might clock down a bit
in order to just get through these functional units in time.
On the Skylake CPU, for example,
you would have two of these 512-bit units,
and you can see they can do 128 bytes per cycle.
That should basically be processing the 512 bits
in a single cycle. Okay.
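The same addition with AVX-512 intrinsics then processes 16 32-bit integers per instruction. This is a sketch under the same simplifying assumptions as before (max a multiple of 16, no aliasing), and it needs an AVX-512-capable CPU plus a compiler flag such as -mavx512f:

```cpp
#include <immintrin.h>

void add_avx512(const int* x, const int* y, int* z, int max) {
    for (int i = 0; i < max; i += 16) {
        __m512i vx = _mm512_loadu_si512(x + i);
        __m512i vy = _mm512_loadu_si512(y + i);
        _mm512_storeu_si512(z + i, _mm512_add_epi32(vx, vy));  // 16 adds at once
    }
}
```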
So basically, yeah, let's talk about the SIMD programming model.
So I mean, in general, the pipeline parallelism,
so the parallelism that we get through the pipelining
and through the instruction reordering
basically depends a lot on the in-flight instructions.
So what do I see next?
And then there's this reorder buffer in order
to basically reorder something so I can actually
utilize this parallelism.
But it really doesn't depend on the size of the registers.
Now, in SIMD, this is different.
Here, basically, we say: these couple of operations are really independent.
So these four additions are independent.
You can do this in parallel.
And so there is no hazard within this individual execution.
So within this individual instruction, this can be done in parallel.
I mean, if I have a loop of four additions, one after the other, there might be some kind of hazards.
So we saw this earlier, the data hazard. Like the next loop step might depend
on the previous loop step.
In SIMD, I know this won't happen.
So this is basically, I know this,
or the CPU knows this can be safely executed
in parallel in hardware.
And with this, we can basically get this n-fold performance advantage, so meaning on a 128-bit
SIMD register, basically a four times performance improvement when we're using 32-bit values.
In practice, this is not quite achievable, because we have some additional management to do,
some additional copying.
If we have the very large registers,
the CPU might be a bit slower.
But we can get a definite performance increase.
So we've done some work on hash maps,
where with AVX-512 we get a 16- to 18-fold performance improvement, which
is significant.
So this is basically speeding up hash maps quite significantly.
So with this, we'll go into SIMD programming next.
But first, let me ask you if there's questions so far.
Yes?
Just to clarify, are these 32 512-bit registers per core or per socket?
Per core.
These are per core, like two functional units per core,
basically.
I mean, if we go back to the previous, if I find it.
So you can see that there are basically two,
or here even three, vectorized ALUs on this one,
multiple different vectorized units
on a single core.
So for the larger scale, you have these 128-bit units,
the 256, and the 512-bit units.
Not all of them can be used at the same time.
But say, for example, you have two AVX512 units that can be used in parallel per core.
Okay.
More questions?
No questions?
Then let's do a quick four minute break here.
Okay, so let's get started again.
So SIMD programming.
So there's, I mean, there's multiple different ways
how you can use SIMD programming.
Not all of them will be applicable for the task,
but you can try all of them.
So the first thing that you can do is basically
do nothing and just hope that the compiler does it for you. And that also works surprisingly
well in many cases. It's actually, let's say, the safest way of using SIMD, because there are
a lot of ways you can get this wrong yourself; the compiler, most likely, will do everything correctly.
Meaning the compiler will find some opportunities for SIMDifying your code.
But it might need some guarantees from you.
Meaning it might need some assurance how it can work with the data,
that there's no dependencies, etc.
Then it will try to do this.
We'll see a few things.
And then there's explicit vectorization.
And there's intrinsics, there's operations, there's libraries, etc.
So then there's, again, multiple different ways of how you can do this.
And autovectorization, as I said, it's a good thing
because it works across machines.
Basically, the compiler does it for you.
It's only limited by the patterns that the compiler knows.
So the compiler will basically look at your code, try to find some kind of patterns in the code.
Say a loop, for example, that accesses the same region of data
in every loop step in a clearly identifiable pattern,
there are no dependencies, et cetera,
and the loop is long enough: then the compiler
will try to optimize it itself, will try
to introduce vectorization if it thinks it's making this faster. And that really depends on
what else needs to be done. So, for example, in LLVM, that's one compiler framework,
there's auto-vectorization; there's a link here, you can check how LLVM does auto-vectorization
if you follow the link. It's platform-independent. You still have your scalar code, which is easy to read
and maintainable. And the compiler handles all the tricky cases for you.
And this is basically what we'll go into in a bit more detail soon.
This is basically what you otherwise have to deal with: if you don't let the compiler do it,
then you have to deal with all the tricky cases yourself.
Meaning if your loops are not really well aligned
with the vector sizes, then you have to deal with this.
So if there's some kind of aliasing in your code,
you have to deal with this.
The compiler also knows this and will do this for you.
So this is basically the edge cases, right? So let's look at a simple example here.
It's essentially what we've seen before, right?
So we have an array; it's a bit more complicated here.
We're explicitly loading and storing
the values out of the arrays.
So we're loading.
So we have an array of an arbitrary length, which is max.
And we're first loading an element out of x, then
an element out of y.
We're adding the two, and then we're writing it to z,
to the array z.
So a simple way of, let's say, vectorizing this
is basically making this in a vectorized fashion.
So here we're basically doing four of these calculations
at the same time.
So we're loading small vectors of four.
And then we're just adding these vectors.
And this is essentially what then, if we're vectorizing
or if we're SIMDifying the code, this is what will be done.
So we have basically a vector of four, two vectors of four.
We're adding them up, writing them in a new vector of four.
So this directly maps to SIMD registers, basically,
and will be executed.
If the compiler sees this and it's safe, the compiler will directly translate this
into SIMD code.
So the question is, can the compiler also do this here?
And well, there's a couple of things
you have to think about.
So first, what if max is not a multiple of 4?
Well, if max is not a multiple of 4, then this won't work.
So then we have to at least stop
before the last few items that
no longer make up a full group of 4.
So that needs to be handled separately.
So we need some kind of scalar epilogue
for the remaining loop iterations.
What if z points to the same array as x or y?
So we're basically reusing or doing some in place
computation here.
We're basically writing this result to the same vector.
Maybe this is shifted somehow.
All of a sudden, we might overwrite something
that we need in the later loop iteration.
Well, then basically, a write to z at point i
might change the value of a future x i.
So then all of a sudden, our parallel execution
won't be the same as our scalar execution.
That's, of course, not good.
So we need the same kind of output.
So this is basically, if there's pointer aliasing,
then we have to do it differently.
I mean, OK, we could find out
how the pointers are laid out, whether they overlap, or something like this.
But the compiler won't go through this hassle.
So the compiler will basically say, if I don't know,
I'm going to use scalar execution.
So if there's some aliasing, then it will get scalar.
So if we're doing this hand-vectorized,
doing the same thing ourselves, then what we can do
is: we have a prologue which basically checks
whether there is aliasing or not.
If there is aliasing, well, then there might be some clobbering, right?
So we might overwrite values needed in future loop iterations.
So then let's do the scalar version.
As soon as we're not divisible by 4,
we also have to go to the epilogue, basically.
Otherwise, we can basically compute or do
the computation within our loop,
like where we basically do the computation vectorized.
So our epilogue is basically for all of the items
that are not divisible by 4 anymore.
The main loop will be this.
And this just checks if we have some alias.
So this aliases check is kind of tricky to implement.
The loop conditions and offsets are easy to get wrong, right?
So this is something that you can easily mess up somehow,
and if you mess it up, then your code will be wrong.
And this aliases method here,
this is not something that the compiler gives you;
this is something you have to implement yourself.
This is where you basically have to make sure
that it is correct.
So yeah, you might get it wrong.
Also, i plus 3 could overflow,
so this you have to handle.
Max minus 3, so the part computing the remainder,
this might underflow.
So all of these kinds of problems
you basically have to deal with yourself.
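Putting the pieces together, a hand-vectorized version of the loop could look like the sketch below: a prologue with an aliasing check, a vectorized main loop over full groups of four, and a scalar epilogue. The aliases() helper is exactly the kind of thing you have to write yourself here, and the whole sketch is illustrative, not a drop-in implementation.

```cpp
#include <immintrin.h>
#include <cstdint>

// Do the address ranges [a, a+n) and [b, b+n) overlap?
static bool aliases(const int* a, const int* b, int n) {
    const auto lo_a = reinterpret_cast<std::uintptr_t>(a);
    const auto hi_a = lo_a + static_cast<std::uintptr_t>(n) * sizeof(int);
    const auto lo_b = reinterpret_cast<std::uintptr_t>(b);
    const auto hi_b = lo_b + static_cast<std::uintptr_t>(n) * sizeof(int);
    return lo_a < hi_b && lo_b < hi_a;
}

void add_hand_vectorized(const int* x, const int* y, int* z, int max) {
    // Prologue: if z may overlap x or y, a vectorized store could clobber
    // inputs of later iterations, so fall back to the scalar loop.
    if (aliases(z, x, max) || aliases(z, y, max)) {
        for (int i = 0; i < max; ++i) z[i] = x[i] + y[i];
        return;
    }
    // Main loop: whole groups of four. Careful with the bound; formulations
    // like "i < max - 3" are exactly where over/underflow bugs sneak in.
    int i = 0;
    for (; i + 4 <= max; i += 4) {
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), _mm_add_epi32(vx, vy));
    }
    // Scalar epilogue for the remaining 0 to 3 elements.
    for (; i < max; ++i) z[i] = x[i] + y[i];
}
```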
So what we can do is we can basically utilize the compiler.
So let's just let the compiler do this.
And this is basically what then the compiler would put out.
So we can actually check this.
So you can't really see it well here,
but there's a nice website.
Who of you has heard of Godbolt?
This is something you can use a lot
in the future for your programming.
So here what you can see is basically
the same kind of loop, and the Compiler Explorer,
or Godbolt, there are two pages you can go to.
You give it a small code snippet, and you get the assembly
for this, depending on the compiler that you're using,
depending on the architecture that you're using,
and the compiler flags that you're using.
And this, for example, would be the loop that we've just seen,
this exact same loop here.
Let me find my mouse again here.
So it's really just this.
We have the two arrays, or rather the three arrays, and the max integer.
And then we're just doing the addition.
And you can see the compiler generates quite a bit of code
just for this.
And this is basically for the different cases.
I mean, you can see there are basic calculations,
different kinds of conditions, and then jump points depending on what max is.
So if max is smaller than a certain number, so we can't really fill a full vector,
then we have this epilogue that we need to calculate with.
There might be aliasing or not.
So depending on all of these different points, the compiler will produce separate code for these
and then have like one hot loop that if everything is nice,
we can actually use these individual instructions.
I think this is something like, let me see,
I think L7 or something. That's here.
Here you can basically see we're actually
doing a vectorized instruction in here.
So this would be a vectorized addition in here.
So if everything goes nice, all of the conditions are fine,
then we're doing a vectorized instruction.
Otherwise, we might just use regular instructions,
regular additions.
We might unroll the additions to some degree, et cetera.
And you can see this: there's lots of different code
for different kinds of branches, but then there's this one hot loop.
I mean, here we even have, hmm, I didn't scroll to the right hot loop.
Here we see that we're basically loading the registers, we're adding, and then we're
incrementing the counter, et cetera, and continuing the loop, right?
So this is basically: all is good,
we're within the loop, and we're just
adding up these two arrays in a vectorized fashion.
And so sometimes this doesn't really work.
So sometimes the compiler won't give you
such a nice vectorized loop.
And then the compiler, or the vectorization pass,
can even emit some information on why this didn't work.
In Clang, you can use -Rpass or -Rpass-analysis;
in GCC, -fopt-info. Say, for example, you're using Clang 16, and we
have this kind of loop here; it's basically the same kind of loop again that we had earlier,
with the small change that our loop is fairly short here.
So then the compiler all of a sudden
And we can actually, using this analysis information,
we can actually check what's going on.
So this is also something I can show you.
So here we have exactly the same loop.
And you can see the code that comes out
is just basically an unrolled loop here.
So the compiler doesn't do nothing:
it still does some optimization,
not by building a loop with some kind of condition,
but by just completely unrolling this loop.
So we're just doing eight individual additions here,
one after the other.
And the compiler also tells us why.
So it doesn't vectorize, because the store might be clobbered.
So there might be aliasing, meaning we might overwrite something:
storing the data into z
might overwrite something in y or x.
So this is the problem we had before.
And out of this small code snippet,
there's no way we can actually tell that this doesn't happen.
I hope you agree with this.
So here, x and y might point to the same array as z.
They all might point to the same array, essentially.
So then we're just doubling each value in the array,
a weird way of doubling each value,
and shifting it to make it a bit more exciting,
something like this.
So then this won't work,
and then we have a problem.
And the compiler notices this.
So the compiler can't say if this is safe or not.
So the compiler won't vectorize
and will just do loop unrolling here.
But we can help the compiler.
Rather than leaving it at this, we might know that this is not true.
So we might know that there is no aliasing in here:
the z pointer, the pointer to array z, is not reused anywhere else.
The compiler sees that there is this potential clobbering because of the store,
and the cost model knows that there are only eight loop iterations,
so it just unrolls this rather than doing
lots and lots of checks.
So in the previous version, we saw that the compiler actually
does for the same loop.
So this, I mean, if you look at the previous loop here,
that's the previous loop.
It's the same loop.
There's no change here.
But it might be executed much more frequently.
So this might run for 100 or 1,000 or a million iterations.
So there, if you have a million iterations,
checking all these things might still make sense.
If it's only eight iterations,
all the checks will be costly.
But if we don't need some of the checks,
this might still make sense, right?
So the compiler doesn't rule out aliasing.
So we can tell the compiler, you can actually
rule out aliasing.
So you can use a compiler hint to provide
this additional guarantee.
And the compiler can use this guarantee.
But here, if you're providing this kind of guarantee, it's your
duty to make sure that it's actually true, right? Because the compiler then doesn't check anymore.
The compiler says: you're going to be right. So if you say there's no aliasing here, I believe
you. If all of a sudden there is aliasing, you're going to be in trouble. You're probably going to get some runtime problem,
a segmentation fault or whatever.
And here the compiler hint is restrict.
So here we basically give a guarantee
that the pointer does not alias any other pointer.
And this is not standardized in C++,
but most compilers support it.
So you just say: well, we have our compiler hints here.
We have our input arrays here, x and y,
and we have our z array,
and z is not aliasing any of x or y.
And if we use this here, then the compiler says, OK, cool.
I don't need to do any checks, at least for that part.
Also, I don't need to do any checks for whether
we need an epilogue because the count is not divisible by 4,
for example.
So this nicely fits into my registers,
into my multimedia extension or vector registers.
So I can just directly use the code as is.
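As a sketch, the hinted version of the short loop could look like this; GCC and Clang spell the hint __restrict__ (MSVC uses __restrict), and the remark flags in the comment are the diagnostics mentioned above for checking what the compiler actually did:

```cpp
// Compile with optimization and vectorization remarks, e.g.:
//   clang++ -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize add.cpp
//   g++     -O3 -fopt-info-vec-missed add.cpp
//
// The __restrict__ qualifiers promise that z does not alias x or y.
// If that promise is broken, the behavior is undefined; the compiler
// no longer checks.
void add_restrict(const int* __restrict__ x,
                  const int* __restrict__ y,
                  int* __restrict__ z) {
    for (int i = 0; i < 8; ++i) {  // the short, fixed-length loop from the example
        z[i] = x[i] + y[i];
    }
}
```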
And we can look at this as well.
So here we can also do a comparison with the same thing.
So here on top, we have the version with restrict,
on the bottom we have the previous version.
So the previous version is still the same, right?
So it's just basically this unrolled loop.
On the top we have the same addition,
but what we do is we're moving the data into the SIMD registers. So we're basically loading the first half of x
and the first half of y into registers here,
then we're doing the addition,
and then the result is written back to z here.
And this is also unrolled, so we don't have a loop in this code, because it
doesn't make sense to do a loop for two steps, right?
And with this, because we have at least two SIMD units on our CPU, this might actually
run completely in parallel here.
So if everything goes well, these two additions
here will run completely in parallel within a single cycle.
And hopefully this will then give a significant performance improvement over the lower code.
Although, why won't we get an eight-fold performance improvement here?
Who has an idea?
Why do I not expect an eight-fold performance improvement here?
I mean, here I can use two SIMD registers with four additions each at the same time.
What's going to happen down here?
Is the core just going to do one addition at a time? Most likely not.
Because if there's no data dependency in the code, no data hazard, then, on the one hand, we get some pipeline parallelism, but we also have
multiple functional units.
We have multiple ALUs on the same core, so most likely the CPU
will process them on multiple ports at the
same time. So with this we'll also get some performance improvement; we'll have
more than a single instruction per cycle, maybe two instructions
per cycle or something like this,
because we can do them in parallel due to the multiple functional units on the same core.
Makes sense? So the top one will be faster, but it's not going to be linearly faster than the
lower one, because the CPU is also trying to be smart about the scalar part.
Okay? Good. Let me check how much time I have here.
Where's my mouse? Yeah, maybe it makes sense to take a break here, because the next part will be a bit
lengthy, and then we'll do this next week.
So next week then we'll talk about explicit vectorization.
On Tuesday, you will get an introduction to the task, so what you have to do.
This is basically the SIMD scan, right?
So something where you do a true database operation, a couple of SIMD operations,
and where you can actually get a nice performance improvement through SIMDification of your code.
Yeah, questions so far?
No questions?
Well, then check out Godbolt.
The links are in the slides.
This is super helpful as soon as you're
starting doing some kind of vectorization.
And with that, well, enjoy the rest of the day
and the holiday tomorrow, and see you next week.
Thank you.