Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Vectorized Execution
Episode Date: May 9, 2023...
Transcript
Okay, so let's get going.
Welcome everybody to our next session.
Today, mostly we're going to talk about vectorized execution,
but in the beginning I'll stop or I'll finish up the last lecture with prefetching.
But before that, a quick announcement.
Tomorrow there will be another session of the bi-weekly joint seminar with TU Darmstadt, and one PhD student out of my group, Ricardo, will present his current work. So that should be interesting. It's not super related to this class, but if you're curious about this kind of work, feel free to join. And in the next couple of sessions of the seminar, we're also going to have some nice industry talks and other research talks.
I'll keep you updated on it.
We're still here. A quick announcement regarding timing.
We had some additional small changes, not that much.
You already know that tomorrow you will learn about the first task, your SIMD exercise, and we'll finish up SIMD then. But we moved the profiling session a bit later
because that still needs some preparation. So that's not
going to be next week but the week after basically. And with that, that's all announcements I
have so far. So let's switch to the other slides and see, talk about prefetching. If it works, yes, it works.
Okay, great. So we talked about instruction execution
and different kind of data hazards
or general pipeline hazards.
And one of those hazards are data hazards, basically the problem that we still need to load stuff from memory into the caches. And our CPU has to wait until this is loaded, right?
And in order to not make it wait so long,
we can also be smart about it.
I mean, the CPU and the compiler also try to be smart about it by preloading data that will be used soon.
And this we call prefetching.
So rather than just basically reacting on the memory accesses,
we can also say, well, this is gonna be data
that we're gonna need, so let's get this beforehand.
And I mean, this is quite obvious
if we're just running through an array one by one, right? We're not gonna just wait until we hit the end of our cache line and then get the next cache line; we know the next array positions will be accessed next, so we can already load them.
But if the access pattern is more complex, then the compiler or the CPU won't necessarily see it, and we can basically tell the CPU ourselves what to load next.
So we can be smart about it.
The CPU has a hardware prefetcher, so the CPU already tries to be smart about it and tries to get the next cache lines and also tries
to see the stride pattern and load the data according to the stride pattern.
But the prefetchers don't necessarily prefetch across page boundaries. So, you might get
everything in a page, but not necessarily to the next page, because the next page might be somewhere completely different in memory, as we know by now.
However, rather than just waiting for the hardware prefetcher, or if we have a weird access pattern (weird in the sense of not just very regular), we can also help the CPU execute the program more efficiently and prefetch the data explicitly.
So there's a built-in that basically says,
well, these memory addresses, please prefetch them.
They can be prefetched for reading or for writing,
and then they can also be prefetched with a certain access pattern.
So, with low temporal locality, meaning we're only going to access this once,
or we access this quite frequently, so please keep it in cache,
or something in between.
We basically can also say this is the way we will access this in the future.
That's one way.
And this sometimes helps us to make things faster, right?
So basically avoiding all these cache misses, essentially.
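To make this concrete, here is a minimal sketch using GCC/Clang's __builtin_prefetch, which is the kind of built-in meant here. The prefetch distance of 8 elements and the loop itself are my own illustration, not tuning advice from the lecture.

    #include <cstddef>

    long sum_with_prefetch(const int* data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + 8 < n) {
                // Prefetch a few elements ahead: rw = 0 (read),
                // locality = 3 (high temporal locality, keep it in cache).
                __builtin_prefetch(&data[i + 8], 0, 3);
            }
            sum += data[i];
        }
        return sum;
    }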
Something else that we can also do are so-called helper threads.
This is, let's say, faking the same thing, right?
Rather than actually having some kind of instructions that tell the CPU, please load this already, I'm going to use this soon, I can basically have an extra thread that will access the data that I will later use for processing.
So the idea is basically you have a thread that just produces
cache misses all the time and basically already by that then has the data in cache once I need it.
I mean, this is somewhat tricky, but it also works. You have your main thread and you have this helper thread, and the helper thread does nothing but touch data; that's all it needs to do. By touching the data, we will have it in cache already. It of course needs to work ahead of the main thread, so it needs to access data that the main thread does not access yet; otherwise it doesn't really make sense, right? Otherwise we're just producing extra noise. And the main thread somehow has to tell the helper thread, or at least the two need to be somewhat in sync, so that the helper thread knows what to prefetch. And if the main thread has some kind of work-ahead set, then the helper thread can already touch all this data, and it's going to be in cache. Of course, this only helps if the main thread has enough to do. If the main thread is just as fast as the helper thread, then we have a problem.
The biggest problem with this kind of implementation is when the main thread and the helper thread are actually accessing the same cache line. Then we have two threads touching the same cache line, and if this is not properly ordered, we have conflicting accesses. And then the CPU will raise a memory order machine clear event, which basically flushes the pipeline.
And this is really bad.
So then this means all of our instructions are flushed.
We have kind of a serial access to these cache lines.
And our helper thread is not helping us really,
but our helper thread is just destroying
our parallelism in the CPU.
This can be even worse if our helper thread just spins on the data, trying to get access to this cache line, and by that creates these conflicts all the time. So an idea is, rather than having
the helper thread work forward, like in the same direction,
you can basically have it work backward,
like in the opposite direction.
You still have a chance of creating these conflicts, but only a few per working set.
And you still have a lot less cache misses in the main thread
because you already have all the data in your caches.
Or the helper thread at least gets a lot of the data in the caches.
So, well, you can basically see some experiments
that people did with different kinds of prefetch.
I'm not going to go through this in detail, but you can see that working backwards will
basically give you a better performance than working forwards, where you can have more
of these conflicts.
The biggest penalty you get if you have this kind of spin loop on the cache line: you keep trying to access the cache line in the helper thread, even if you have a conflict, which means you're fighting with your main thread for the same cache line.
So, that's kind of the worst.
This is even worse than running it just single-threaded.
While a regular forward helper thread might give you a benefit for large working sets,
a backward helper thread can always give you some benefits.
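Just to make the idea tangible, here is a very rough sketch of a helper thread in C++. This is a simplification I'm adding, not the scheme from the referenced experiments: one thread only touches the data (walking backwards, as discussed) while the main thread does the real work concurrently.

    #include <cstddef>
    #include <thread>
    #include <vector>

    volatile long g_sink; // keeps the compiler from optimizing the touches away

    void helper_touch(const std::vector<int>& data) {
        long tmp = 0;
        for (std::size_t i = data.size(); i-- > 0;)  // backward direction
            tmp += data[i];                          // just touch the data
        g_sink = tmp;
    }

    long process(const std::vector<int>& data) {
        long sum = 0;
        for (int v : data) sum += v * v;             // the actual processing
        return sum;
    }

    long run(const std::vector<int>& data) {
        std::thread helper([&data] { helper_touch(data); });
        long result = process(data);                 // runs concurrently with the helper
        helper.join();
        return result;
    }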
Okay, so much for this, for the prefetching. It's at least good to know, I mean,
it's something special. You probably won't frequently use this like doing your own
prefetching, but it's something good to know that this exists, right?
So if you do low-level experiments, you will see that the prefetcher is doing something for you. In most cases it will help you, but in some cases it might actually be detrimental, and performance will go down because of the prefetching, because the CPU basically tries to get data that you don't really want into your caches.
Okay, so I promised I'll show you some current pipeline architectures. These are just schematics, but they show what a current core sort of looks like, what a current core features.
And so this is Sunny Cove architecture.
This is kind of the core architecture.
So there's basically, if you look at modern Intel
or whatever server CPUs, you have, of course, multiple cores.
And then you have kind of this overall layout
for the CPU and then the individual cores have a certain layout.
So there's kind of an architecture for each core that then will be somehow arranged for
different types of CPUs and that will also be used across different generations of CPUs.
And the Sunny Cove, for example, would be the Ice Lake core architecture.
This means each of the individual cores features basically this kind of layout.
And you can see, well, it's written here: you have up to six micro-operations per cycle. So this is already parallelism in the CPU; you can have more or less six pipelines in parallel.
And this is done through 10 different execution ports.
You cannot use all 10 of them in parallel,
but there's 10 different ones that can basically be filled.
There's different reservation stations where the data will be, or not the data, the instructions will basically be prepared and then go to the different pipelines. And as I said, up to six in parallel.
And then, as we said, there's different kinds of units on the CPU, different kinds of processing units.
So you have the integer processing units
or the regular ALUs.
You have the vectorized execution units, which are somewhere down here.
So you have typically two vectorized units that you can run in parallel.
Then you have the load and store units, here for example, and you have the different kinds of caches. And all of this will be done in parallel, as I said, up to six operations in parallel.
And then of course you have these multiple pipeline stages and you have the front end,
which basically has all this, what we earlier said, does the branch prediction, caches the instructions
and decodes the instructions as well.
So this is what the Intel core would look like.
This is, for example, again, for my laptop here, this is an M1 Firestorm core. So we said there's two different ones: the Icestorm and the Firestorm cores, so the efficiency cores and the fast cores.
And the fast cores look like this.
So here you have up to eight instructions per cycle,
so this gives you a bit more parallelism again.
And you have quite a large reorder buffer where basically all the different instructions
can be placed and then, according to what's possible, can be reordered.
So you have an out-of-order execution here.
And then you have the same thing: you have six, seven integer ports and up to eight operations. We already said this, eight micro-operations per cycle.
And again, you will have like multiple ALUs here.
You have vectorized.
Let me see this.
Yeah.
Here you have your vectorized registers and vectorized execution units, which do what we will see in the next part of the lecture. Okay, so far, this is basically what
the individual cores look like and where our instructions then are
actually executed. What I want you to take from this is that there is parallelism already in a
single core. Even in a single thread, basically, you get parallelism just because of the architecture,
because we can do multiple things at a time and we will do multiple things at a time in a single core.
Okay, so we talked about how the instructions are executed,
and the pipelining, etc., and the pipeline hazards especially,
and then some current architectures.
And now we're going to switch back to our
vectorization. And this is basically, well, let me go back here, right? So, if we're here in this,
where are we? Here in this vectorization, this is basically where all of the stuff
is executed that we're going to talk about today.
Okay, so today I will only be able to cover the basics. Next time we will get into deeper
detail about the programming. If you're really eager to start the task, you can already
sneak a peek into the rest of the slides. There's
everything in there, how it's implemented, etc. Not everything, not all the details, but basically the different ways of
parallelizing. But we will also cover this next week. Today, I just want to give you some
overview of what this actually means and the kind of thinking that we need. And first of all it's about parallelism. So we already
talked about the pipeline parallelism and we also had this general parallelism inside a single core.
Now we're going to go into data parallelism, also still in the single core.
Well, pipeline parallelism is one simple technique to leverage hardware parallelism, and it always depends on the length of the pipeline. So usually we cannot get a lot of parallelism through pipeline parallelism. We still want to use it because we can; some things usually take more time, and we want to break down our task into smaller chunks.
And we have different units on our CPU, as we saw earlier, right? So for all of the different execution units on the CPU, if we can only do one thing at a time, then we can only utilize one unit at a time. So then a lot of the die is basically idling, not doing anything. So it makes sense to somehow leverage these separately, and for this, we can use pipeline parallelism.
And the pipeline parallelism is nice
because it gives us some parallelism,
but it still keeps the sequential execution order,
at least at the front end.
So the instruction stream can still be the same as in the program,
but we're just chunking it or cutting it up into smaller pieces
and executing different parts of these pieces in parallel.
Well, we discussed the problems around the hazards, etc.
in the previous part.
And here, just because of the chip design, and because the parallelism is only in the length of the pipeline, this gives us a clear limit on how much we can parallelize. Because getting more parallelism means ever longer pipelines, and it then also takes more time to go through the pipelines, and it takes more chip space as
well. So there's other ways of doing parallelism in hardware, and of course you already know this: we can just do different tasks in parallel. This means we can have different chip space for different things, but even thinking about multiple cores, we have multiple chip areas that are identical, and they can either do the same thing or do different things in parallel. So we basically have multiple of the same kind of processing unit, and by just giving them different instruction streams and different data, they can do different tasks in parallel.
And this can be, I mean, we could think of this as different cores,
but we can also think of this as different ALUs, so ALUs on the same core. So they can also use different registers, get different micro-operations, and will produce
different outputs.
We call this multiple instructions, multiple data, or MIMD: we're working on different data, we're working with different instructions, we get different output. On our CPU we also have
specialized units, so-called vector units or SIMD units, single instruction, multiple data units. These are basically processing units that can execute a single instruction, the same kind of instruction, on multiple data items in parallel. Hence the name single instruction, multiple data. And because we're working on multiple data items, which we usually have in sort of an array or a vector, this is also why we call it vector processing or a vector unit. There are complete processors that are just built on this idea; the title image that I have, this is an old vector processor. That's basically a huge system just based on this kind of programming model, where you say, okay, I'm going to do the same kind of operation on many data items in parallel.
So in a nutshell, we differentiate if we're talking about vector processing or SIMD, we're differentiating between vectorized execution or scalar execution.
So scalar meaning one instruction, a single operation at a time, on a single data item or pair of data items.
So an addition would be two data items, one result,
for example.
So that's a scalar addition.
A vectorized addition or vectorized execution
would mean we have a vector, an input vector,
and we're doing the same, or two input vectors,
we're doing the same execution on each of the vector items
in parallel.
So multiple data items in a single step.
So going back and basically talking a bit more about parallelism.
So there's different kind of parallelism that we can look at. And one way of looking at parallelism in applications
is basically differentiating between data level parallelism
and task level parallelism.
So data level parallelism means we have multiple data items
and we're processing them at the same time.
And I mean, in a database, this makes total sense, right?
Because if we say we do a scan, for example,
we have many data items
and we can do all of them in parallel.
And often there's not even any dependency in between.
So this can really nicely be parallelized.
And we're doing the same instruction over all of them.
And task-level parallelism is different
because there we're basically doing different things at the same time.
So this would basically be in a database.
We have different kind of database operators
that we want to execute at the same time.
So thinking about some kind of query tree,
we have multiple scans, for example,
or different kind of operations that we
need to do in parallel, or we want to do in parallel.
If we do different operations, then we're
looking at task-level parallelism.
And in computer hardware, we can have four major ways
of how to exploit them.
So we can have the instruction level parallelism.
And this is basically what we looked at when we looked
at instruction execution.
So we can have this pipelining in the instructions.
And we can have speculative execution in order
to get some parallelism.
Then we have the vector architecture, so SIMD instruction execution units and registers
and say for example GPUs as well, where we have this data level parallelism by executing the same instruction over multiple
data items in parallel.
This is also what the GPU basically does.
The GPU works in lockstep, does the same operation in many threads in parallel on multiple or
on different data, basically, or on, let's say, different variables, for example,
different parameters.
Then we have thread-level parallelism.
So there we can have either task-level parallelism
or data-level parallelism by using multiple threads.
So this means then basically using, ideally, multiple cores. We can of course also have multiple threads on a single core, but then they will be scheduled. For actual real parallelism, we'll have multiple cores either working on the same or on multiple data items, but with the same kind of instruction.
Or they can do completely different things, right?
Say, for example, our helper thread, that's parallelism where we have task level parallelism
because the helper thread executes a different program than our main thread.
And then basically, we can also have
like a request level parallelism.
This is kind of bigger level
where we have completely undecoupled tasks
which are somehow specified by the programmer.
Okay, so there is also Flynn's taxonomy on different kinds of parallelism. The classical execution, if you think about, I don't know, 30 years or more back, is single instruction stream, single data. This is the standard sequential computer, and this is also how we usually think if we're just programming single threaded. There can still be some instruction-level parallelism underneath, but we don't have to do much about it; it's basically one instruction at a time on one data item at a time.
Then we have this SIMD. This is what we're talking about today. We have single instruction stream,
but multiple data items at a time. So we're working in a vectorized way, basically
on a complete array or multiple data items at a time with the same kind of
instruction. And this is basically, we have vector processes for this, GPUs work like this,
and then we have these special instruction sets on modern CPUs. So basically all modern CPUs, like laptop CPUs, server CPUs,
have these kind of instructions.
And so this is what we're going to talk about today mostly.
Then there's also multiple instruction streams,
single data streams.
So there's MISD, whatever.
I never really pronounced this before, I think,
or at least I'm not trying hard enough.
So this is kind of obscure, but it's basically,
you're working on a single data item or a single data stream,
but you're doing different things on it.
So this is more or less something that you would already
have in task parallelism. So
this means, well, we're just working on a single data set, but we're doing different things on it.
It usually will be implemented in the same way as MIMD, which means multiple instructions, multiple
data streams. Think about multiple threads doing their own thing, working on different data.
So the multiple instruction, single data stream,
you could think about the helper thread
being an instance of this,
because we're working on the same data stream,
but with different kind of instructions.
But in the end, how it's implemented
is more also like a MIMD, multiple instructions, multiple data.
And this is basically multi-core, multi-sockets,
multiple servers, for example.
So this is everything there.
While up here in SISD or SIMD, we're still in a single CPU or a single core even.
Okay, so SIMD instructions. In modern CPUs, we basically have a special class of instructions that allow the processor to process multiple values simultaneously. This is really done in parallel; it's not only instructions, there's special hardware on the CPU to do this in parallel, say for example an addition. And all CPUs from major vendors have these extensions, have some kind of SIMD instructions. Intel, for example, has MMX, SSE, and AVX, and SSE and AVX are basically what you're going to be using today. ARM has NEON and SVE, and SPARC, for example, has VIS.
So you can find different kinds of instruction sets everywhere.
And that's actually also a bit of a problem,
because this is not standardized in any way.
You will find the naming schemes are different, etc.
So it's a bit tricky to navigate.
But, most of it is done automatically.
Or, if you don't use it yourself, a lot of it is automatically done by the compiler for you.
So, we'll see this probably in the next lecture.
So, how do these work?
So, here's a simple example. Say we have two arrays and we want to add them up into a third array. This is going to be our example for most of the time today. Think about these as vectors, basically, or arrays or whatever. So you have the array x and the array y, and our array z is basically just the x and y values added up individually.
This is the same thing that will be done here.
In a scalar addition, a single instruction, single data addition, the way you would do this on the ALU, if you're just using one ALU, I must say, would basically be: you start with the first element, add it up, write it to the result array, then you do the second, the third, etc. So this is basically how the processor will consume and execute this. Again, since we have multiple ALUs on our single CPU, what actually will happen is that the compiler, even if it doesn't use SIMD instructions, will typically try to unroll this to some degree, or the CPU will already see, well, I have to do the same thing multiple times, and will then execute multiple of these additions in parallel through the multiple ALUs that it has.
However, we can also say,
well, let's use our SIMD registers, right?
And let's use our SIMD unit on the CPU. Rather than doing individual additions, we're loading four of these integers into our 128-bit SIMD register, and then we can add them directly. So this will basically be one addition, and the SIMD unit does all of the individual additions at a single time. So this execution will be done in only two steps. This means a lot less processing steps, a lot less cycles that you need, because one SIMD addition could be done in, well, of course, pipeline length, right? We know it takes multiple cycles, but basically one cycle on average. Okay, so we still have time, so let's continue.
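For reference, this is the running example written out in plain scalar C++ (my own minimal version; the exact slide code may differ slightly):

    void add_arrays(const int* x, const int* y, int* z, int n) {
        // One scalar addition per iteration; a SIMD unit can instead do
        // four of these 32-bit additions in a single 128-bit operation.
        for (int i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
        }
    }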
So let's talk about how this basically evolved over time, the SIMD instructions.
And so initially, these operations were basically built for multimedia execution.
So, Intel basically started with the MMX stuff,
but then they had the streaming SIMD extensions with 128-bit registers.
So, this means you have 128 bits, you can load your data in there, and there are multiple different kinds of instructions that you can run on it. Typically you would load them with 32-bit scalars, so say for example int values, and then you can do a four-way parallel execution on these registers. So, say for example, addition, multiplication, whatever you can think of; there are different instructions for that. And basically in each cycle, you can then do one SIMD operation on each of these elements in parallel.
So, this is what we just saw earlier.
However, you always need the data in these SIMD registers, right?
So, you cannot just load them anywhere. I mean, you always need to load your data into registers, of course, but there are specific registers just for the SIMD execution. These 128-bit wide registers would be XMM0 to XMM15, for example. So this means 16 registers just for this.
And today, you cannot only use them for these 32-bit scalars,
but you can also use them for different other kind
of operations.
So rather than only going for 4 32-bit,
you can also go for 16 8-bit values, or 8 16-bit, or 2 64-bit values.
And so, if you have 2 64-bit, well, you have like a parallelism of 2. If you have
16 8-bit values, you'll have a parallelism of 16 already, with of course less
precision or smaller data. But you have these basic arithmetic operations,
but also logical instructions.
And each of them then, as I said, can be done in parallel across the register.
And besides that, you also have comparison and shuffle instructions.
This means you load everything in your registers
and you can then compare all of them to a certain value.
And this would be something you could, for example, use for a scan.
If you're scanning data, you compare all of the data saying,
is this what I want or is this smaller, greater than what I want?
So this would be a typical way how we use this in a database operation.
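As a hedged illustration of that scan idea (my own sketch, not code from the slides), one can compare four 32-bit keys against a constant at once and turn the result into a small bitmask:

    #include <immintrin.h>
    #include <cstdint>

    // Returns a 4-bit mask with bit i set if keys[i] > threshold.
    uint32_t scan_block(const int32_t* keys, int32_t threshold) {
        __m128i data  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
        __m128i bound = _mm_set1_epi32(threshold);
        __m128i gt    = _mm_cmpgt_epi32(data, bound);     // 0xFFFFFFFF where true
        return _mm_movemask_ps(_mm_castsi128_ps(gt));     // one bit per 32-bit lane
    }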
And then you can also shuffle them. You can basically say, based on a bit mask, for example, I only want these elements, and I want them moved towards the front of my register. So I again have a dense representation of the data that I'm actually interested in, rather than a sparse array with lots of holes of data that I don't want to see. And of course you need to be able to
transform between the SIMD registers
and the regular registers, so getting the data in and out. And there's also a way to move the data directly from the SIMD registers to memory, rather than going through the caches. So say if you're doing this kind of scan-like thing,
you might not want to keep the data in the caches all the time.
So, there's these 128-bit registers, but then there are even wider ones. There's also 256 bits, that's AVX and AVX2. So first we had SSE, the streaming SIMD extensions, then AVX and AVX2, and now there's even AVX-512, which has 512-bit registers. And well, there's even more options then, again.
So, there's 32 of these registers, ZMM0 to ZMM31. And you will later also, if we go there,
you will see this even in the assembly.
You can see that these registers are actually used
for loading and storing the data.
And then since they're 512,
you can even do like up to 64 8-bit integers
or 32 16-bit and so on and so forth.
So hypothetically, here you can get up to 64-way parallelism
in comparison to a scalar execution.
So if you're doing like 8-bit addition, say for example,
on the regular ALU, then in contrast to that on the AVX512 registers,
you could do 64 in parallel.
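As a small hedged sketch of that 64-way case (my own example, assuming a CPU and compiler with AVX-512BW support; not code from the slides):

    #include <immintrin.h>
    #include <cstdint>

    // Adds 64 unsigned 8-bit integers element-wise in one instruction.
    void add64_u8(const uint8_t* a, const uint8_t* b, uint8_t* out) {
        __m512i va  = _mm512_loadu_si512(a);
        __m512i vb  = _mm512_loadu_si512(b);
        __m512i sum = _mm512_add_epi8(va, vb);   // 64 byte-wise additions at once
        _mm512_storeu_si512(out, sum);
    }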
However, if you're using AVX-512, at least on the current CPUs, the CPU will clock down. The unit is so heavy, there is so much circuitry around this, that the CPU needs to slow down a bit to make sure everything is processed in time. I'm not sure about the newest generation, so that might change, but we'll also see that this is why the compiler doesn't really like those instructions. The compiler basically says, I'm assuming this is slow, so it rarely uses AVX-512.
Yes?
In the recent CPUs, there are the performance and the efficiency cores. Do you know if the efficiency cores also support these instructions?
So the question is regarding performance efficiency cores.
I don't know, honestly.
So one would have to check the layout of the two cores.
So I know it's on all the server cores.
I don't know regarding the larger.
I would assume there is something on the efficiency
cores, but I'm assuming not like the very
large 512.
So I'm not even sure if the 512 are on the desktop CPUs at all.
So I think these are mostly on the server CPUs.
I guess Marcel is checking.
If you find something out, let me know.
Okay. So say, for example,
on the Skylake CPUs, there's two AVX 512 units that you can work with in parallel, actually,
again. And the other thing that we see, we get more instructions. So we get more functionality and there's a link to some cheat sheets to look up what's doing what.
But there's a lot of different things that you can do with these instructions or with these units.
But it also gets a lot more complicated, because for each of these you get separate instructions. This basically explodes in combinations: you have to specify a different instruction for each individual combination, say for integers, for floats, and for each width; there's a separate instruction for each of those. And in SIMD programming, there are, let's say, no data hazards or anything like this to be handled for you.
Because you explicitly do this vectorized execution. So usually if you have the scalar execution,
a lot of the control logic basically deals with making sure that the instructions are properly scheduled, that the instructions get the right registers, and that all the dependencies are tracked properly. With the SIMD instructions, you have to deal with this yourself; there is a lot less logic around this.
You basically have to say, well, these are my registers that I'm using; you explicitly say you're using the vectorized registers. And within the vectorized instructions, there is no conditional branching or anything like that. You can only do something like the predicated execution that we saw in the earlier examples, right?
This is the only way how we can do any kind of comparisons. We will
always execute everything. We cannot say in a vector, well, with this part I'm going to deal this way and with that part I'm going to deal that way, because we cannot split up the vector into different parts. The instruction will always operate on the whole vector. And this means we only check for data hazards or stuff like that between the vectors. In theory, we can have an n-fold parallelism, meaning, say, with these large 512-bit registers, we could have a 64-fold parallelism. In practice, we're not going to quite achieve it, but at least we get close, within an order of magnitude, let's say.
Okay, so with this I would say we'll do a quick break here and then look into a bit
more detail regarding the SIMD programming in, let's say, five minutes.
Did you already find out?
So I found out that the M1 also supports 128-bit SIMD instructions in ARM. I'm not sure where exactly that sits in the instruction set. So, yeah, the M1 has them; I know the M1 has these instructions. But whether the Intel efficiency cores have it, I don't know. I can check offline, basically.
OK, so five minute break.
OK, so I could just quickly check. I couldn't find any detailed info about AVX etc. on the cores, but it seems that at least on the efficiency cores AVX-512 doesn't exist. But I don't know about the smaller ones. Not sure.
That's all I could find quickly.
Okay, so SIMD programming.
So there's different ways how we can actually use the SIMD units on your CPU. I mean, there are two basic ways, like I said: the general way, and then there are multiple ways how you can do this explicitly. On the one hand, the compiler will try to use this for you. If you are using a modern compiler, GCC, Clang, etc., they will try to utilize the SIMD instructions on your CPU.
Sometimes this requires some hints and some guarantees by the programmer for this to work,
because the compiler always needs to basically take all options into account.
Anything that can happen, the compiler needs to be sure that it covers that.
Anything, let's say legal, that can happen.
And then you need to say, well, I'm pretty sure
this is the way the data will basically be laid out.
For example, in memory, not pretty sure, I am certain.
I guarantee that this will happen.
And then the compiler can be more flexible
in the optimizations.
That's one way.
And the other way is explicit vectorization.
So the automatic vectorization works great for simple stuff,
so simple loops, et cetera, but doesn't work for everything.
So, for more complex stuff, and say, for example, the task that you need to perform, it won't work.
So, you can get some vectorization, but you won't get enough vectorization here.
And then we need to explicitly use vectorization. And, well, the automatic way is called autovectorization.
And this is basically as soon as we write simple code,
like just this addition that I showed you earlier,
this is something that the compiler can see,
and the compiler can directly basically optimize this.
And LLVM, for example,
this is one of the parts in Clang.
So LLVM has certain optimizations
or certain patterns that it can detect
and then automatically optimize.
I gave you the link here.
Let me see if this works quickly.
So here, for example, you can see simple patterns; it's a bit small. There are different kinds of examples that the compiler will check for: reductions, inductions, conversions, etc. The vectorizer is basically a long case statement, and there's a list with scatter, gather, and so on; if it finds one of those patterns, it will then automatically try to vectorize it. The cool thing about this is it's platform independent. You still write your scalar code and the compiler does it for you.
The compiler will utilize the SIMD units and execute this in parallel.
And the compiler does all the tricky edge cases for you.
And we will see why this actually is a good thing.
So here's a simple example. This is basically the same loop that we had earlier, with x, y, and z, and we're adding them up. But here we have the individual loads and stores written out separately, rather than doing this in a single line; usually you would just write z[i] = x[i] + y[i]. That would be the way you typically write this. Rather than doing that, we're explicitly writing the loads in order to make them visible.
We can also transform this into a vectorized version. So this is basically, it's of course still scalar code here.
But we're working on vectors here.
So this is something that if the compiler sees this,
should directly be able to translate this
into these SSE instructions, right?
Because we're doing a simple addition: we're loading four values at a time from these arrays, then we're adding all the values,
and we're storing them again.
So this is in scalar code, a vectorized addition.
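A hedged sketch of what such a "vectorized but still scalar" form can look like (my approximation, assuming max is a multiple of 4 and no aliasing; the slide's exact code may differ):

    void add4(const int* x, const int* y, int* z, int max) {
        for (int i = 0; i < max; i += 4) {
            // Four explicit loads per input, four additions, four stores:
            // exactly the shape of one 128-bit SSE addition.
            int x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
            int y0 = y[i], y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3];
            z[i]     = x0 + y0;
            z[i + 1] = x1 + y1;
            z[i + 2] = x2 + y2;
            z[i + 3] = x3 + y3;
        }
    }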
Now the question is, can the compiler translate this
into this?
Could the compiler just replace our scalar code that we write like this,
or even like in the shortened form, translate it directly to this?
I guess only if the max is divisible by 4.
Exactly.
So that's one of the things.
So, of course, we need to check all the boundaries, right?
So is this max really, can this be divided by 4?
Because otherwise, if it's not a multiple of 4, we need sort of an epilogue. We need basically to somehow deal with the remaining iterations.
Another thing is x and y alias, so they overlap.
Yes, exactly. So that's another problem. It's pointers, right? So this means
we could actually look at the same memory addresses. And so if we have pointer aliasing, so then in the loop, we somehow might change stuff in a future iteration.
So even like within the same thing, and then we would actually need to execute this in a scalar fashion.
If there's an aliasing, so if the pointers point to the same memory addresses, then we need, there
might be something, I mean, of course, you need to program this in a weird way for this
to happen, but you could program it in a weird way, so that you cannot parallelize it in
this way.
So if you want to do this by hand, you would have to write something like this. You need to write a function, aliases, that basically checks for you if the pointers alias, so if there's an overlap in the pointers, meaning that either x or y overlaps with z. So that what you're writing to is something you're reading from, basically. That's essentially the problem.
Then you run through your loop all the way
until you've basically consumed a multiple of 4,
but of course not too much, because max might not be divisible by 4, so
then we need to do this epilogue, where we basically handle the last couple of items
that do not fit, or that is not divisible by 4, right, this remainder, basically.
So that would be a way. And then, of course, we see we have this nice jump in there, a goto, which you should never use in practice.
That's the direct way how one would implement this.
But even this aliases is already tricky to implement.
So that's something where you have to do some checks,
et cetera, to actually implement this properly.
So this is not something that the language gives you.
And all these offsets are also easy to get wrong, right?
So this still might break if say, for example,
max is basically like integer max, something like
that.
Then i plus 3 could overflow, for example, stuff like that.
And max minus 3 could underflow, for example. So that's another issue.
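As a rough sketch of what this hand-written guarded version can look like (my own approximation; the slide's version uses a goto and its own aliases() helper, both of which I'm only imitating here):

    #include <cstdint>

    // Hypothetical overlap check; writing this portably is itself tricky.
    bool aliases(const int* a, const int* b, int n) {
        uintptr_t a_lo = reinterpret_cast<uintptr_t>(a);
        uintptr_t a_hi = a_lo + n * sizeof(int);
        uintptr_t b_lo = reinterpret_cast<uintptr_t>(b);
        uintptr_t b_hi = b_lo + n * sizeof(int);
        return a_lo < b_hi && b_lo < a_hi;
    }

    void add(const int* x, const int* y, int* z, int max) {
        if (aliases(x, z, max) || aliases(y, z, max)) {
            for (int i = 0; i < max; ++i) z[i] = x[i] + y[i];   // scalar fallback
            return;
        }
        int i = 0;
        for (; i + 4 <= max; i += 4) {                          // vectorizable main loop
            z[i]     = x[i]     + y[i];
            z[i + 1] = x[i + 1] + y[i + 1];
            z[i + 2] = x[i + 2] + y[i + 2];
            z[i + 3] = x[i + 3] + y[i + 3];
        }
        for (; i < max; ++i) z[i] = x[i] + y[i];                // epilogue for the remainder
    }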
So if we write this in another way, we can basically always get into trouble, and the compiler will check all of this for you, right? The compiler will always make sure that this works properly. So basically, manual is easy to get wrong and scalar code is much more maintainable. In this way, we can basically just use the compiler.
And this is basically the same function again, right?
You can see here, it's basically our addition now shortened
with all of this removed,
all of this explicit loading and storing.
And then there's a nice website.
So this is something that I actually really like.
And my students, at least some of them, use a lot.
This is basically, this is called Godbolt or the Compiler Explorer. And it's basically a website where you can directly look
at the assembly that the compiler generates.
So you give it your function.
So here we see this is basically the same function.
I have a lot of links in here, which you can directly access,
and you can look at the different kind of functions.
And then it gives you the output that the compiler produces. So we can see quite a bit here, but the hot loop would actually be this here. This should be the hot loop, where you can see that we have our addition. We first move the data into the registers, and here you can see these SIMD registers, so the SSE registers, then the parallel SIMD addition, and then we move the data out. And of course, we have a lot of other tests as well.
Then we have the scalar execution
of the same functions, or the same with different kind
of loop condition.
And somewhere here, we would have
kind of a scalar execution of the same thing, which
is basically used for the epilogue.
And you can also look at the control flow graph here, which would basically be, oops, exactly what I'm showing you.
I'm struggling with the touchpad, I'm sorry.
Okay.
You can check it out yourself.
Okay, so I have the link here.
This is basically what comes out.
It should be the same thing.
You can try it with different kinds of compilers and different versions.
And you can also use all of the compiler flags.
So you can directly see what will be the output.
And then, of course, you need to read assembly and figure out, but if you see somewhere, okay,
there's like SIMD instructions, then you know, okay, this worked, right? So, there's some SIMD
instructions here. So, in this case, this would be the hot loop. So, if you look at the hot loop,
and this is executed slightly differently, but it should be more or less the same. We see here we have this move again. So we're moving data as double quad words, unaligned, so basically 128 bits. A word would be 16 bits, a double word 32, a quad word 64, and a double quad word 128. So we're moving 128 bits unaligned
because we're not sure if the original data is actually properly aligned.
We're moving it into the register and we need to do this for both registers.
Then, let me get my pointer.
So then we're doing the addition here and then we're moving it out.
Then, of course, we're incrementing probably our loop counter
and jumping out either to the next loop or to somewhere else,
basically to the next check.
Also, the compiler can emit some information about its vectorization decisions, with -Rpass and -Rpass-analysis for Clang and -fopt-info for GCC. And you can see whether, or why, it didn't vectorize a certain loop.
So say, for example, if we have the same loop again here,
but we have only eight steps, so we're
doing only eight computations, we're only
adding up eight elements, then Clang will not vectorize this.
And we can actually check out why.
How do I get rid of this now?
I can do the same thing again.
And if we go to this now,
so then basically we see the way this is translated here.
Let me maybe make it a bit smaller.
And you can see this is basically, I mean, it's also interesting because it's completely unrolled here.
So you can basically see the compiler, rather than using the SIMD registers and SIMD unit, the compiler just completely unrolled the loop and just did all of the additions one after the other.
So moving in the data, adding out, moving it out, et cetera.
And the compiler will also tell us why. If we go here, it basically says, well, this might be a clobbered store, or this might be clobbered, and this is basically the aliasing problem that we had earlier. The compiler is not sure that there is no aliasing problem in here; z might conflict with x or y. And then there might be a lot of checks necessary.
It will also check if we do the same thing with a larger loop.
It will still need to do these checks, but
in a larger loop, this might actually still make sense.
But if we only have these eight individual, or two SIMD instructions, or eight individual
operations, then all these checks probably won't be beneficial.
So the compiler says, oh, it doesn't make sense. I'm not using it this time.
Okay, so basically this is what it says: it's a "missed" remark. So this optimization was missed; the SIMD operation is not used because of this possible clobbering. That's compiler speak, basically: x or y might be overwritten by the previous store, so we have this pointer aliasing problem. And the cost model tells the compiler there are only eight iterations, so because of the checking that the compiler would need to do at runtime, using the SIMD registers might not be beneficial.
So what can we do? Often, the compiler is actually right about this.
So you think you might be smarter than the compiler, but often you're not.
The cost model of the compiler is not
so bad. But often the compiler cannot make certain assumptions that you actually can make.
So you might actually know that there's no clobbering, there's no aliasing, no pointer
aliasing happening. And then you can tell the compiler. And the compiler will then basically know,
oh, OK, no problem here.
I don't have to deal with all these runtime checks.
So the costs of those runtime checks
will basically not be necessary.
I can do these optimizations.
And so programmers can actually use certain compiler hints
for these kind of optimizations or for guaranteeing certain things.
And then the compiler can use these hints for optimization.
However, the big issue here is you really have to know that these guarantees
that you're giving are holding, right? So, if you're saying there's no aliasing and there is
aliasing, the compiler doesn't care, right? So, then the compiler will just write code
that will produce wrong results, like random stuff, basically. So it's really up to the developer that this holds.
And one way of giving these compiler hints is basically restrict.
So the keyword restrict, it's not standardized, but most compilers support it.
It basically guarantees that pointers do not alias to any other pointer.
So with this, basically, I can tell the compiler
that my set array is not aliasing with any other array.
And then the compiler knows, well, no aliasing.
I don't have to test for aliasing.
Then basically, maybe the compiler will actually optimize.
This is something that we can check again.
We have basically the same instructions.
I'm just going to use this again.
What I can basically say is that here,
I'm saying, I just have to see that I can read this.
I can basically say this is restrict.
Compilation failed. Obviously, I misspelled it.
Sorry, yeah, exactly.
This is so far away.
I need glasses.
Okay, so if we put it here,
all of a sudden the compiler knows no aliasing happening here.
So it can just basically use the SIMD instructions here.
Again, it will do an unrolling
rather than doing like two loop steps.
We'll just do this in one step.
And then this code could, of course, also be parallelized by the CPU again, because we have two SIMD units; both of these instructions could be reordered in order to execute both additions in parallel.
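A minimal sketch of that hint (GCC and Clang spell it __restrict__ in C++, restrict in C99; this is my example, not the slide's code): the programmer guarantees that z does not overlap x or y, so the compiler can drop the runtime aliasing checks and vectorize freely.

    void add(const int* __restrict__ x, const int* __restrict__ y,
             int* __restrict__ z, int max) {
        // With the no-aliasing guarantee, the compiler is free to emit
        // SSE/AVX additions without a scalar fallback path.
        for (int i = 0; i < max; ++i) {
            z[i] = x[i] + y[i];
        }
    }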
Okay, so I'm fast today, so we can also touch on the explicit vectorization still today.
Rather than doing this implicit and auto vectorization, so there's also other different ways we can give hints.
So there's something in the slides that I uploaded.
But this is a bit more, let's say,
involved and even more easy to get wrong.
So I don't really want to present this.
And for the task that you have to solve, you will actually need some explicit vectorization.
Because not everything can be solved with autovectorization. And, well, there are CPU intrinsics, basically, that let you explicitly tell the CPU, and also the compiler, how to deal with the SIMD registers and the SIMD instructions. These intrinsics are just wrapped-up assembly instructions for the compiler, so the compiler more or less just passes them through, at least the ones that are CPU-specific.
And we can basically use direct compiler
or assembly instructions by using the right functions
with the right parameters.
So typically, they have a certain naming scheme: a vector size, then the actual intrinsic operation, then a suffix. The vector size, in Intel speak, would be for example _mm for 128-bit, and MM stands for multimedia; as I said, this was initially designed for multimedia applications. Then it's _mm256 for 256-bit and _mm512 for 512-bit.
And then you have the individual operations,
so add, sub, mult, etc.
But not all of them are as easy to understand as these.
And then you have the suffix for the type.
So ps for float, pd for double, epi32 for signed 32-bit int, epu16 for unsigned 16-bit int, etc. And this means, as I said, this all adds up, right?
So for each unit, you have separate instructions.
For each data type and size of the data type,
you have separate instructions.
And the wider the instruction, basically the wider the registers, the more intrinsics you
basically get.
And so this is also a bit of this problem why you constantly need to look up these names,
because there's so many of them, basically.
At a certain point, you get like, I'm not there at all, but you get used to some of these,
but still a lot of them are not super intuitive.
So for example, explicit intrinsics for SSE, the 128-bit intrinsics: the register type __m128i would be a register for integer types. Then _mm_loadu_si128 would be an unaligned load, the u stands for unaligned, of 128 bits from memory, which is kind of the assembly instruction that we saw earlier. If we go back to the code here, this is basically the same instruction: we're moving a double quad word unaligned from memory. Then we have a vector addition, for example, which would be this packed addition, so a vector add of four signed 32-bit integers, et cetera.
So this is basically how these,
if you wanna explicitly write those,
so you're not gonna write these assembly instructions,
but you're gonna write functions
that directly map to these assembly instructions.
Yes?
So why do we usually prefer the unaligned instructions? Because the integers are only 4-byte aligned, but we'd need 128-bit alignment?
Exactly.
And is it worth it to maybe like get a, not epilogue,
but like a?
Prologue.
Something that would basically pre-align it.
I don't know.
You could try.
I mean, if you can guarantee that it's aligned, then you don't need the unaligned load. The aligned execution will be a bit faster.
Because the CPU doesn't have to do the alignment again.
So if you can guarantee it across all executions,
then it makes sense.
Then you can do the aligned.
You need kind of checks, of course, but then you can do it.
So say, for example, here we have, again,
kind of the same loop which we had before.
So we have this kind of our three buffers, so X, Y, and Z.
And we explicitly need these registers to store our data. So we're adding up x and y into z. Let me find my pointer here. It's the same code, basically, and we're loading the data into these SIMD registers. So this is the explicit load into the SIMD register, our 128-bit integer registers,
and then we do our loop again.
Now, since we're doing four integers at a time,
we divide by four.
Again, here we're assuming that our max
can be divided by four,
that we don't have an underflow or something like
that by doing the division, etc.
So this is already with some strong assumptions.
It basically creates the same code that we would have created if we run the other code. So here, this is basically more or less
would look similar to this.
So it's within the loop, of course.
So we can also look at the code.
So we're loading in the registers,
we're invoking the intrinsics,
and we're writing, we store the result into memory.
So that also, of course, needs to be done.
And I can actually show you the code briefly.
And this is, of course, a bit shorter than we would expect
because we're not doing the checks with the epilogue, etc.
Because we're directly telling the CPU that this basically can be divided.
And the actual execution is basically in here.
So here we're doing the addition, we're moving the data into the registers, we're doing the
addition and we're writing it out again. And then we have some loop variables,
which basically is this kind of stuff here.
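A hedged reconstruction of that explicit SSE version of the running example (my own sketch, assuming max is a multiple of 4 and the buffers do not alias; the slide's code may differ in details):

    #include <emmintrin.h>  // SSE2 intrinsics

    void add_sse(const int* x, const int* y, int* z, int max) {
        for (int i = 0; i < max; i += 4) {
            __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
            __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
            __m128i vz = _mm_add_epi32(vx, vy);   // four 32-bit additions at once
            _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), vz);
        }
    }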
And here we also have a link to the Intel Intrinsics Guide,
which basically gives you an overview
of all the intrinsics that are out there.
And there is also a nice cheat sheet from TU Munich,
which gives you a hint based on the naming scheme.
It's quite neat to find quickly what are the different operations,
also to get a good overview quickly.
Okay, so the explicit vectorization. I mean, this somewhat works against the task that you have to perform, because there you will need it for the best performance, but in practice it's actually annoying, because all of the instructions are different, right?
So for the different kind of vector sizes,
so different registers, you have different instructions.
For different architectures, you have different instructions.
So if you want to run this, say, for example, on the M1,
then you need to use completely different instructions.
So in M1 or in ARM, the instructions look differently.
So here again, we have the comparison.
So this would be our addition in Intel, for example,
where we directly load in the registers here, for example.
Here, if we do the same addition in ARM,
then the addition has a completely different naming
scheme.
And the registers are called differently, et cetera.
I mean, maybe even a bit more intuitive, but just different.
So you need to rewrite the code.
If you want to have code that can
be run on multiple platforms, you
will need to do a lot of case distinctions
depending on the platform
if you want to use these intrinsics.
So for each platform,
you will basically need to write different code
for this to work.
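For comparison, here is a hedged sketch of the same addition with ARM NEON intrinsics (my own example, assuming max is a multiple of 4): the operations match the SSE version above, only the names and register types differ.

    #include <arm_neon.h>
    #include <cstdint>

    void add_neon(const int32_t* x, const int32_t* y, int32_t* z, int max) {
        for (int i = 0; i < max; i += 4) {
            int32x4_t vx = vld1q_s32(x + i);   // load four 32-bit ints
            int32x4_t vy = vld1q_s32(y + i);
            int32x4_t vz = vaddq_s32(vx, vy);  // four additions at once
            vst1q_s32(z + i, vz);              // store four results
        }
    }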
And the intrinsics are actually hard to read.
I mean, we've put some up here, just some random examples: shift right logical, integer operations, all kinds of stuff. But it's not something that you would easily memorize unless you write this every day, I guess. So explicitly writing this out actually reduces maintainability quite a lot.
And we've seen this in projects actually as well.
But before continuing with more specific kinds of instructions, etc., I would say let's stop here.
And are there questions so far?
Yes?
You mentioned that database systems can utilize SIMD processing to speed up computations.
Yes. I can imagine that, let's say, for a count or some simple operation, you can use a lot of such operations.
But do you know of some more sophisticated techniques?
Yes.
So this is a good question.
So how can we use this in the database properly rather than just doing addition, like simple aggregations?
And say, for example, a scan is something where you can easily use it.
Also, an example that I have later in the slide is a compressed scan.
So if you basically have your data stored densely, in the example in nine bits, for example, densely packed, and you load it and want to work with the data, then you need to somehow decompress the data again. That's something where you can use SIMD.
Also hashing, for example, you can speed up
in the SIMD instructions.
Hashing is something that you need all the time in a database for all kinds of operations.
So we'll see some examples in the next lecture.
More questions? No more questions?
Well, then, more or less perfectly in time. Thank you very much.