Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Vectorized Execution (2)
Episode Date: May 8, 2024...
Transcript
Welcome everybody to our next session on hardware conscious data processing.
We'll continue and finish up vectorized execution today.
And before we continue where we were, let me make one announcement.
So task one is online. We had the task description yesterday.
Marcel presented this to you. The deadline is May 27. And then for some of
you, just as a reminder, there will be an interview on this. So most likely on the 30th.
If we can manage everybody on the same day, we'll send out information about this soon.
I think the assignment is already done online,
so you can see if you're going to be interviewed on this task or on any other task.
If you do this, you have to commit.
And if you sign up and create your repo, you have to commit and push your changes so we can
see them and actually evaluate them. So make sure the final submission that you want to be
tested is actually pushed at some point so we can see it. The runner then executes automatically.
There's one runner, meaning that if you push a lot, this will get slow for everybody.
So this is not so much for trial and error.
Only push when you have a new solution, or a couple
of solutions.
Don't push hundreds of solutions,
because then everything will be slow for everybody.
If you have questions, use the forum. We saw that there's already some solutions up,
mostly scalar.
Of course, the scalar solutions are fine
for testing and for evaluating.
But for the final solution,
you should do some kind of vectorized execution.
And of course, as always, no cheating.
So if you cheat, you fail.
Very simple.
There's also no discussion
about it, but that's the same every time, right? So we're looking forward to this. And just
as a reminder, basically the interview is just a check, right, discussing the solution with you. If it doesn't go well, you will have a second interview.
Okay. And if we figure out you didn't do it yourself,
then,
well, you also fail, right? That's also one form of cheating, of course.
Okay, so with this,
we were in vectorization. So everybody hopefully remembers we have these special kinds of units on our cores,
like the second type of parallelism that we're discussing.
So what was the first type of parallelism that we discussed?
Anybody listening?
Yes?
Pipelining, exactly.
Thank you very much.
So we did instruction execution,
and the instructions are broken up,
like bigger instructions are broken into pipeline stages,
and these we can pipeline, or execute in parallel,
in a pipeline-parallel fashion.
And now we're looking at data
parallelism, so we can do the same instruction on multiple data items at the same time, in
specialized functional units in the CPU, right, on a single core. So we're still looking at parallelism
on a single core here. Okay, and last time we talked about auto-vectorization,
so how vectorization works in general,
how this is executed,
and how the compiler will do some of the stuff for you.
And in many cases,
if your code has the right properties,
like you're processing arrays, and there
is something that can be vectorized,
the compiler will do this for you.
So if your CPU supports it, and it's
an easy-to-detect pattern that can be parallelized,
the compiler will automatically do some parallelization here for you, so some vectorization.
However, it cannot do everything.
So this is restricted to very simple patterns.
And today, we're going to look into how we can actually then
do this explicitly, so when we actually
want to tell the compiler or the CPU
to use these functional units through specialized intrinsics.
So there's a different way of doing this.
And yeah, so basically we can explicitly tell the CPU: load this into the vector register and then use these vector intrinsics.
So the CPU can move data between regular and SIMD registers,
load data directly into SIMD registers from memory or cache, and then execute vectorized instructions. These are specific to every CPU architecture, right?
So this means you have different intrinsics for x86
and different, we already had this, right?
So we have the 128 bits, 256 bit and 512 bit on x86.
You don't have 512 on every CPU, but this really is more like on those
bigger server CPUs. So your regular CPUs, newer generations will have it, but like slightly older
generations might not have this. If you have an ARM CPU like this machine here, again, it's
different, right? So you don't have the same intrinsics, you have different kind
of intrinsics. And so this is potentially not portable. We're going to mostly work with Intel
intrinsics or x86 intrinsics. And these are actually not assembly instructions, right? They
feel like assembly instructions, but they're not. This is really like a wrapper for something that will later on be translated to assembly
instruction. And we can use these, like the assembly instructions, by calling a function
with the right parameters. And this is usually built out of like a pattern, these instructions.
And the pattern basically has three parts.
So we have the vector size as one part.
So what is the size of the register that we're using?
Again, 128, 256, or 512 on x86 typically.
There's also different kind of vector processes
that can do far wider vectors, vector sizes,
but on regular CPUs, this is what you get.
Then you have the actual operation that you need to do
or that you want to do, and then a suffix basically telling the data type.
So we have on the one end, we have the register width,
and we have the data type.
So that also depends on what's the parallelism
that we will get.
So if you have a very large data type,
we will get less parallelism. If you have a very large data type, we will get less parallelism.
If you have a very small data type,
we get higher degree of parallelism.
And since we have all these combinations
and like for the different data types
and all the different vector sizes, et cetera,
the more vector sizes and the wider the registers,
actually the more functions we have, right?
Or more intrinsics we have.
So let's, I mean, again, look at concrete examples in x86:
we have _mm for 128 bit, and I told you this is for multimedia.
We have _mm256 for 256 bit, _mm512 for 512 bit.
Then things like addition, subtraction,
multiplication, masking, things like that.
We'll also see that.
And then things like float, double,
so PS is float, PD is double.
EPI32 would be a signed 32-bit integer.
EPU16 is an unsigned 16-bit integer, et cetera.
So there's many different things.
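Just to make this naming pattern concrete, here are a few real x86 intrinsic names and how they decompose; this is a small illustration, not an exhaustive list:

    // <width>_<operation>_<data type>, all declared in <immintrin.h>:
    _mm_add_epi32      // 128-bit register, addition, signed 32-bit integers (4 lanes)
    _mm256_add_epi32   // 256-bit register, addition, signed 32-bit integers (8 lanes)
    _mm512_add_ps      // 512-bit register, addition, single-precision floats (16 lanes)
    _mm256_mul_pd      // 256-bit register, multiplication, double-precision floats (4 lanes)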
And this is actually the thing that you
will have to figure out.
So this is in the task that we're giving you.
This is what you have to find.
And it's a bit of a navigation
through the different intrinsics,
so you have to search for the best intrinsics
that you can use.
Maybe we've also not found it, right?
So there's many different, you might come up
with a better plan even than we have
or better intrinsics for the tasks that we gave you.
But you will find those.
I'll also show you some hints on how to find this.
And Marcel already did this as well yesterday.
Right, so let's look at some examples here.
If we have 128-bit intrinsics, then we need 128-bit registers.
So, for example, an integer register for 128 bit would be called __m128i.
That's a register for integer types.
So this is where we can put all of our integer values
in different sizes.
If we want to load,
like, unaligned 128 bits from memory,
then we have this load operation or load intrinsic.
And here you can see we have it aligned and we have it unaligned, again, two versions regarding
alignment. Again, important: if you can guarantee alignment, then the processor
doesn't have to do anything extra to basically shift the data into the
register. If you cannot, then you need to do an unaligned read, and
then basically the CPU will do some extra work here to make sure that this
is actually later on aligned in the register. So this is loading 128
bits from memory, and there's an equivalent assembly instruction that this
will directly be translated to. So, I mean, the intrinsic is not the assembly instruction,
it's basically a wrapper, a function wrapper for this assembly instruction.
Here we have, like the third one that we have is an addition. Let me see if this works.
So this would be addition.
Again, 128-bit.
We can tell it from this part.
Then we just have an addition.
Nothing special about this.
And a signed 32-bit integer.
Again, there's a direct correspondence to the assembly instruction.
And finally, later on, we want to store this again.
So then we have a store function and again, unaligned store for 128 bits to memory, for example.
And now, if we look at this, let's use our example that we've always been using so far.
We can do the same kind of addition, the array addition that we had earlier, with explicit vectorization.
So here we basically have the same function.
Here we're assuming that max is divisible by 4.
You remember why that is important.
If it's not divisible by 4, we will have a problem.
We're basically not getting the last elements,
like one, two, or three elements, potentially.
So for that, we would need some kind of epilogue.
But if we know that max is divisible by four, then we can basically do it like this.
We don't have to treat this separately.
Also, max might be smaller than four, right?
So then this will never work, et cetera.
So this is an assumption that we already have in here.
So what we do is: we have three 128-bit registers that
hold parts of the arrays.
And we have an addition in here:
we're loading the two vectors, x and y,
we add them up into a result register,
and then we're storing the result register back to memory,
to the target array.
Right?
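A minimal sketch of what such an explicitly vectorized addition can look like; the function and variable names here are illustrative rather than the exact code from the slides, and it assumes max is divisible by 4:

    #include <immintrin.h>

    void add_arrays(const int* x, const int* y, int* target, int max) {
        for (int i = 0; i < max; i += 4) {
            __m128i xv  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));  // unaligned load
            __m128i yv  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
            __m128i sum = _mm_add_epi32(xv, yv);                                     // 4 additions at once
            _mm_storeu_si128(reinterpret_cast<__m128i*>(target + i), sum);           // unaligned store
        }
    }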
So, I mean, it's basically the same thing
that we've seen earlier,
and we can also check this in Godbolt.
So in the Compiler Explorer;
let me find my cursor.
I hope you can see this.
So here, this is exactly this code, right?
And you can see, I mean, of course, we have like the loop stuff, etc.
But the actual execution here, we basically, we're moving the values into memory.
And then we have the parallel execution or the vectorized addition here.
And then we're writing the values back.
So here we're loading the two vectors into two registers,
and we're adding the two registers,
and we're writing them back into memory.
The rest basically is just loop code,
basically incrementing loop counters, et cetera.
And you can see this is basically all the code that there is because we don't need a prologue, we don't need an epilogue
in order to fix anything.
There's a question, yes?
If we don't cast the pointers to the 128-bit type,
can we then just do an increment
instead of writing this cast
for all the pointer arithmetic;
is that fine, or will we get a compilation error?
So the question is, if we don't cast a pointer,
can we just increase by four, or will we
get a compilation error?
And to be honest, I don't know.
We would have to check.
I mean, we could check here.
So where is it?
Yeah, I'm not going to do it now.
But basically, I have the link in the code.
You can just try it out.
So you can basically just change the code here, compile,
and you will see if it works or doesn't.
So what the compiler does here.
And just as a reminder, so this is basically, we see it,
this is our vectorized
code. So we have the two moves, we have the addition, and we have the load and
the store. And this was our auto-vectorized code, right? So here we have the same thing,
basically. And we can see this is really exactly the same code that will be generated. However, in
contrast, we have all the other code for epilogue and prologue to just make sure that everything is
sound and safe: that max is not too small, that max is divisible by 4, and if it's not, then we have to deal with the rest separately.
Okay?
Good. Good.
And so this is basically, there's an Intel Intrinsics Guide that you can use to figure out what to use.
And on top of that, also, there's a cheat sheet that's helpful.
I can also show this to you.
Just a second.
I think this is here. Yeah.
So this is the TU Munich
x86 Intrinsics Cheat Sheet. There is basically a description.
This is also linked in the task.
It shows how these are set up, for example
how an individual comparison is set up,
what the data type suffixes are,
what the instruction sets are, what the data types are.
And then you have the kinds of conversions,
you have special algorithms, you have register loads, et cetera, fences, bit operations, and you can
basically see how these would be put together. You have a short description of what each one does,
like a move mask, for example; this would be something that we will look at in a
bit. And, yeah, you see what kind of data types it will work with. So that's actually
quite helpful. Alternatively, you can go to the Intrinsics Guide and browse your
way through it. And as I said,
like there's lots and lots of instructions
with very specific operations,
also with very, let's say also some kind of algorithms
and there's some caveats.
And this is basically where you then have to actually read
the guides, because not all of them are actually a
single instruction, or are translated into a single instruction. Some of the
intrinsics actually have an algorithm behind them, which means
they will use multiple cycles to actually be executed. So this means you
might have shorter code in your program, but actually it executes longer because multiple steps need
to be done.
So it's just something to be aware of.
However, if there is a special instruction, in many cases,
the algorithm is actually not so bad.
So this might still be faster than if you code it yourself. OK.
So one important thing that you will also need is masks. This is something we do a lot in vectorized execution:
basically having a mask that selects part of the vector. And this is also how we do branching in vectorized execution.
I'll come to this in a bit more detail later.
So from AVX 512, almost all operations support masking,
which means we can selectively apply an operation to some of the items in the vector.
And there's zero masking or just masking.
Basically, either we just apply it to some
and we don't care about the rest,
or we apply it to some and we're zeroing out the rest. This is basically just to make sure that
we can select a subset of the vector that we're working with. Say, for example, if we know,
okay, we filtered out something, then this would help us. So here, for example, if we do zero-masked
addition, then we have, well, let me do this here, right, so this is an example:
we have an addition of 32-bit integers with zero masking. So with this, the intrinsic actually
even gets wider, right? So it's not just a single operation anymore, it's like two parts of the actual operation.
So with the addition and the zero masking.
So then we have a mask that's basically a bitmap
that selects which elements of the vector will be added
and which are zeroed out.
So here we have an example:
we have two arrays of four, or two vectors of four elements,
that we want to add up.
And we have, sorry about this,
we have a mask that's 0, 1, 0, 1.
So we basically say we're adding only position one and position three, not
position zero and position two. Positions zero and two are zeroed out, right? So
these will have zero in them in the result vector; the other two will be added up, right? So then,
basically, you can see the result: we're adding position one and
we're adding position three, and the other two are zero.
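As a small sketch with the 128-bit AVX-512 intrinsics (this needs AVX-512F plus AVX-512VL; the concrete values are just for illustration), the zero-masked addition could look like this:

    #include <immintrin.h>

    __m128i zero_masked_add_example() {
        __m128i a  = _mm_setr_epi32(1, 2, 3, 4);     // positions 0..3
        __m128i b  = _mm_setr_epi32(10, 20, 30, 40);
        __mmask8 k = 0b1010;                         // select positions 1 and 3
        return _mm_maskz_add_epi32(k, a, b);         // result: {0, 22, 0, 44}
    }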
Does somebody already have an idea why we want to have something like this?
Yes? Oh, yes.
To be able to select.
Very good, yes.
So, to be able to select, yes.
So, this is basically, with this, we can say,
these other positions we don't care about in the vector.
So, we'll see how this is useful.
And we can basically blend and we can merge.
So say, for example, we want to keep one array
and replace some of it, or we basically merge some parts.
So there are different ways of doing this.
So here we basically have an addition where we keep some positions;
below is basically a merge example.
So below we're basically adding up some positions,
and we're not zeroing out the rest,
but we're keeping those parts of the vector and won't add them up.
And we can, of course, also do it the other way around, right?
So keeping the other part of the vector, et cetera.
So this is basically also if we selectively want to do additions
or selectively apply some kind of operation to the vector.
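The merge variant, sketched under the same assumptions as above, takes an extra source vector whose values survive in the unselected positions:

    #include <immintrin.h>

    // Positions where the mask bit is 0 keep the value from src instead of becoming 0.
    __m128i merge_masked_add_example(__m128i src, __m128i a, __m128i b) {
        __mmask8 k = 0b1010;                      // select positions 1 and 3
        return _mm_mask_add_epi32(src, k, a, b);  // position i: mask bit set ? a[i]+b[i] : src[i]
    }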
And with this, then basically all of a sudden, we might end up with a vector where we don't actually need
all of the elements anymore.
So we might want to actually reduce the amount of elements
in our vector, or we have reduced the amount of elements
in our vector, right?
So we put most of the vector or some of the vector to zero, and these elements are not
interesting anymore.
So then it might make sense in further steps to actually compress this again or to pack
these elements again to be efficient in the next kind of computations, right?
So say, for example, we're selecting half of the elements of our vectors and then we're
doing yet another couple of operations on those, say some logical operations or something
like that.
We don't want to carry all the zero values with us that will not be used anymore.
So in that sense or in that case we can compress these elements to the beginning of the vector in order then to, say, for example, pack multiple vectors again into fully packed vectors.
So that will be a compress operation.
Again, works with a mask.
So basically telling the compress operation which parts will be in the result vector and which parts are not interesting to us.
And we also have an expand or decompress operation
where we can do the reverse.
So we basically take as many elements from the beginning of the vector
as there are non-zero values in our bitmask,
and place them across the vector.
So here's an example again.
So we have like a compress operation
where we have in our vector,
we have two values that are actually of interest.
The other ones could be zero, could be anything.
So we apply the compress operation with this bitmask, so 0, 1, 0, 1. So
position 1 and position 3 are interesting. If we compress, these two values at those positions will be at the beginning, at 0 and 1. If we decompress or expand with the same bitmask,
then the value from position zero will go to position one,
and the value from position one will go to position three.
Of course, you basically need to know:
this can at maximum be the number
of elements in the vector.
And this is done from the beginning of the vector here, right?
So if you have three positions, or three elements, that you want to expand somehow,
then you will need three one-values in your bitmask.
Okay? So this is cool, right?
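A small sketch of compress and expand with the 128-bit AVX-512 intrinsics (again AVX-512F plus AVX-512VL; the values are illustrative):

    #include <immintrin.h>

    void compress_expand_example() {
        __m128i input      = _mm_setr_epi32(7, 42, 9, 99);
        __mmask8 k         = 0b1010;                                 // positions 1 and 3 are interesting
        __m128i compressed = _mm_maskz_compress_epi32(k, input);     // {42, 99, 0, 0}
        __m128i expanded   = _mm_maskz_expand_epi32(k, compressed);  // {0, 42, 0, 99}
        (void)expanded;                                              // only here to silence unused-variable warnings
    }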
And this is what you will use in the task. But there's a problem, because say, for example, you're like me
and you work on this kind of laptop. Then
you cannot try it. I mean, of course, you can do this in the
compiler, like you can compile,
but you cannot execute it on this machine, because this machine doesn't support x86 intrinsics.
Yeah, so if you want to run it on ARM, then you need a separate implementation, right?
So this also means: if you use these intrinsics and you want to support multiple platforms,
this means code duplication.
And here we have this example again, right?
So we have our addition with the x86 intrinsics.
If we want to do this on ARM,
then we would need the neon intrinsics.
And they, I mean, they look closer
to what the compiler output looks like
or the assembly looks like, but it's really
like a completely different intrinsic.
And they're also not completely matching.
So some intrinsics don't have an expression on the other side.
So there might be some intrinsics
that you have in x86 that you don't have in Neon and vice versa.
So, I mean, if you want to do this,
you basically have to create separate code,
or duplicate code, for multiple platforms.
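To give an impression of that duplication, here is a sketch of the same four-integer addition written twice, once for NEON and once for SSE; the function and variable names are my own:

    #include <cstdint>
    #if defined(__aarch64__)
      #include <arm_neon.h>
      void add4(const int32_t* x, const int32_t* y, int32_t* out) {
          int32x4_t xv = vld1q_s32(x);        // NEON load
          int32x4_t yv = vld1q_s32(y);
          vst1q_s32(out, vaddq_s32(xv, yv));  // NEON add and store
      }
    #else
      #include <immintrin.h>
      void add4(const int32_t* x, const int32_t* y, int32_t* out) {
          __m128i xv = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
          __m128i yv = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
          _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(xv, yv));
      }
    #endif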
So, and actually also these intrinsics
are quite hard to read, right?
So these are some examples.
So maybe you have an idea of what it means,
but also for most of them,
I have to look up what they actually mean
because they're somehow encrypted
in like very short abbreviations
that are not super explicit.
Right?
So, I mean, while we will try for, or aim for, the best throughput
or best performance that we can get in general,
using these kinds of intrinsics might not always be the best idea,
because the platform-specific code reduces the maintainability, reduces the readability,
and is more error-prone. So you can get errors here quite easily. So what can we do? There is a solution in terms of using generic intrinsics.
And there's different ways how we can do this.
And maybe just as a reminder, for the task,
you can try generic intrinsics, but you probably
will get best performance with platform-specific intrinsics.
Also, you can try auto-vectorization, but you won't get far enough with auto-vectorization alone.
If you just do that, I mean, there you can get some performance benefits,
but you won't be able to fully complete the task just with auto-vectorization.
And with generics, I'm actually not sure, but I mean, no, I am sure: you don't have all of the operations that you need.
So just as a reminder. But in general, it actually makes sense to think if you really need these specific intrinsics,
because it's just much harder to maintain.
Also, even on a single CPU or a single architecture,
you still have these different register widths.
So if you use
512 bit, it will have better performance,
but it's not supported on all x86 CPUs.
So all of a sudden, your code will not run on all x86 CPUs.
If you're explicitly using 128 bit, it will be supported on more CPUs,
but you're not going to hit the 512-bit units.
So that just means lots of duplication
if you want to get the best performance.
And so for this, you can also use generic vectors.
So rather than writing explicit intrinsics,
we can do these generic operations,
as we did with auto-vectorization.
And this can then either be done by the compiler
or using a library.
There are many libraries out there,
xsimd being one of the major ones.
These are basically just wrappers
on top of the intrinsics.
And then there's different kind of implementations.
So this is one approach that you can try.
So they basically provide classes representing vectors.
And then you can manipulate these vectors
using overloaded operators.
So it feels like regular vector arithmetic
rather than using this very specialized intrinsics.
And then there is also a compiler extension approach.
So GCC has vector extensions where basically the compiler
provides an API.
And that actually also makes sense
because technically what happens internally
is that your intrinsic that you're using
will be translated to a compiler vector extension
that then will be translated again to the actual assembly.
So we're basically going an extra step,
but the compiler only has one path from the intrinsic through the compiler representation to the actual assembly. There is no magic in between going from one intrinsic to a different kind of
assembly instruction on a different architecture.
So this is basically not a solution
for general, let's say, platform-independent execution
if you're using x86 intrinsics, for example.
But we can directly use the compiler extensions, or vector
extensions.
And since they are directly translated to assembly instructions, internally they actually
have these different paths.
So using the compiler extensions, we can write platform-independent code as well, while explicitly
vectorizing, not relying on auto-vectorization.
This is available in GCC.
It's also available in Clang, but say, for example,
not in Microsoft Compiler.
So SIMD libraries are one way to do it.
There is an extra dependency, and they only work with the SIMD features
that the compiler supports.
Say, for example, AVX2 does not support 64-bit integer
multiplication, and then the libraries
use 32-bit multiplication.
So if you're doing AVX-512 using xsimd, for example,
or another library, the library will still
go back to the slower multiplication.
And so this basically won't give you the best performance
in these cases. There is a standardization effort for SIMD in C++.
And this would solve this kind of problem.
So if this comes and the feature set is large enough,
then you basically have the SIMD features in the standard C++.
And then these kind of operations,
if the feature set is large enough,
will be efficient again.
And the other thing, the way we can do is,
as I already alluded to a bit, are the vector extensions.
So the compiler maps these platform-specific types
and operations to an internal representation.
And this is basically just to keep the compiler simpler.
So there is basically an internal representation
for all of the optimization patterns.
So this is basically so that the compiler doesn't have to be rewritten,
or the code for these optimizations duplicated,
for all the separate kinds of intrinsic types.
There's an internal representation
where then optimization will go on, which then
is translated to the actual intrinsics
or to the actual assembly.
And we can use this directly, right?
And this is really for reusing the analysis and optimization.
And then rather than doing like x86 intrinsics, we can use the compiler extensions. And then we also have the natural code
with arithmetic and comparison operators.
So this is an example.
So rather than basically using, say, x86 intrinsics
and attributes, we can use the compiler extensions.
So basically create vectors that the compiler already knows
are vectorized and can be vectorized.
And just by doing this, basically,
then if we're using this vectorized addition,
we get the same kind of vectorized code, right?
So this, the compiler already knows this is vectorized code.
And if we then compile this, this
will create the platform specific
or use the platform specific intrinsics when compiled
and really produces identical code
on different kind of platforms.
So identical code in terms of execution.
And we've also used this for load and stores before.
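As a sketch, such a vector-extension version might look roughly like this (the typedef and function names are illustrative):

    // GCC/Clang vector extensions: the compiler picks the platform's vector instructions itself.
    typedef int v4si __attribute__((vector_size(16)));  // 4 x 32-bit int = 128 bits

    v4si add4(v4si x, v4si y) {
        return x + y;  // lowered to, e.g., paddd on x86 or a NEON vector add on ARM
    }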
And so this you can basically see here.
This will translate directly into this kind of code
where we use the x86 addition
or in the ARM code, so the Neon code,
where we use the Neon addition.
So this is basically what the output,
I mean the assembly, will look like.
Let me see, maybe I have an example.
Open up. No, I don't have it open.
Sorry.
I think we probably linked it here somewhere.
So this is basically the ARM version, right? So here we're loading
the vector, and this is the ARM code, right, where we basically
actually do the intrinsic.
And we can see this here.
Let me show the code here.
This really is, let's see, this is the actual addition here.
Here we have, let me find, this should be the actual addition.
And you can see this looks very much the same as this code here.
So this is basically the ARM assembly for this kind of vectorized execution.
And the same is true for x86.
So it's basically exactly the same code. So here on top we
have the vector extensions, and we're just doing the vectorized execution. Here on the bottom,
we're using intrinsics with vector registers and the vectorized execution. And the actual code for this should
be here, right? So this is basically, well, this is unrolled. So this is basically what
we see here. Again, we see the unrolled code. This is exactly the same code, just using the
generic intrinsics. I mean, that's the beauty of it, right?
This code is exactly the same that we're using on ARM
and we're using on x86.
So we don't have to change anything.
While if we want to explicitly vectorize this,
we have to do it separately for both.
So we have specialized register descriptions
and an addition that is different from the x86 one here.
And you can play around with this,
because the links are in the slides here.
OK. So there are also special operations
like shuffles and conversions.
So a shuffle basically reorders a vector.
And this is basically what we can also
use in order to do these compressions and decompressions,
for example, or just to reorder the elements.
So say, for example, in this case,
we have a shuffle with four elements, A, B, C, D,
and a shuffle vector that basically tells the operation
that position 0 should go to position 2,
position 3 should go to position 1,
position 1 should go to position 3,
and position 2 should go to position 0. And this basically then just reorders the vector
according to this index list.
So this basically, given the input,
will produce some kind of output.
And with the convert vector operation,
we can also convert the numbers, or the vector data types,
from larger to smaller data types, or vice versa. So if we have 32-bit
numbers, we can convert them to 16-bit numbers, et cetera. And this is cool, right? Because,
again, this kind of stuff is basically generic. So it will work on x86.
We'll work on ARM.
We'll also translate to the correct register width that can be used.
So if we have 512 bits or AVX 512,
and our code supports this or shows that this can be used,
then this will be translated to AVX 512.
If we only have 128-bit, it will be translated
to 128-bit operations.
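For shuffles and conversions, the generic route goes through compiler builtins. Here is a sketch using the GCC spelling (Clang's __builtin_shufflevector differs slightly), with illustrative index values:

    typedef int   v4si __attribute__((vector_size(16)));  // 4 x 32-bit int
    typedef short v4hi __attribute__((vector_size(8)));   // 4 x 16-bit int

    v4si reorder(v4si in) {
        v4si idx = {2, 3, 0, 1};                   // result[i] = in[idx[i]]
        return __builtin_shuffle(in, idx);
    }

    v4hi narrow(v4si in) {
        return __builtin_convertvector(in, v4hi);  // 32-bit lanes become 16-bit lanes
    }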
However, of course, not all of the combinations work well,
so you can always find some, let's say, special cases
where the conversion doesn't really work well,
and not all instructions can be expressed.
So especially this compression and decompression or expansion,
this is an operation that we'll also use in the task.
This is not expressible in generic vector extensions.
So for this kind of stuff,
you will have to go back again to the original intrinsics.
So this is something that you cannot do generically.
Okay.
Questions so far?
Yes.
What about the PowerPC architecture?
Like, is it also supported by GCC vector?
That's a good question.
Actually, I don't know.
So the question is, is PowerPC also supported?
And I would have to check, to be honest.
I know there are vector extensions,
but presumably there is some support,
but I don't know how far it goes.
Other questions so far?
No other questions?
Then we'll do a quick break here.
So let's see how we use this in database operations.
So I've talked a bit about this already,
and we've discussed a few things.
But now let's get a bit more concrete.
So typical database operations go through lots of data,
say, for example, in a scan.
And this we can also SIMDify.
So these kinds of operations, I mean, they do the same kind of
operation on multiple data items, so there it makes sense to use SIMD operations. But there are
also other kinds of operations. So whenever we're working with a lot of data, or at least, say, multiple data items at the same time, then we can vectorize.
And usually, I mean, we have to load multiple data items anyway,
or we have to do some lookups,
then vectorization always makes sense.
Only if we're using like a single value,
like our final result, we're doing some operations there,
there it doesn't make sense to vectorize, right?
Because there's nothing to vectorize.
But in general, we have to look, search for something,
we have to scan through something,
we have to process many data items at the same time.
And as soon as we can pack them into a register,
into a SIMD register, it makes sense to SIMDify.
And so joins are SIMDifiable, sorting, filters, etc.
And there's a nice paper, an older one already,
where basically people from Columbia University
discussed what kinds of things can be SIMDified,
or how things can be SIMDified in a database.
And so let's look at a scan first as a simple example. I mean, technically, if we look at
an extended projection, then this is what we've done so far already. So if we're selecting price and tax, price plus tax,
from a table, and we're doing an addition in here,
and we're filtering on this, this
is more or less what we've done so far.
At least the addition is exactly the same kind of stuff.
So we basically walked through an array of elements, or a vector of elements,
and did an addition here. Now we just have to filter on top of this. So all of the code examples
earlier are in essence already a reduced version of this. But we can also do aggregations,
right? So if we do a minimum or a maximum,
that's also very efficient.
These are simple comparisons.
This we can SIMDify.
And this can be done efficiently using basically
these comparison operators and bitmasks.
And later on, this again ends up similar:
sums, et cetera, can be done efficiently.
The selection part is a bit more tricky, right?
So we said there's no branching.
And of course, if you would have branching
in SIMD instructions, what would be the semantics?
If you think about it, you have a vector size of only four,
and you have branching conditions in there.
Well, all of a sudden, you have
16 different ways the branches can go.
It's not just true or false.
It's true or false for every individual element
in the vector.
So this is basically going in all different directions.
So branch prediction won't help you much anymore,
because we have so many choices.
So branching doesn't make sense.
So how do we do this?
You could generate a mask that
selects the data.
Thank you.
Very good.
So we can basically use masks.
So for this, we can use masks.
And on the other hand, moving data between SIMD
and scalar registers is quite expensive.
So we don't want to go back and forth for branching.
So again, we're basically using masks
in order to do these kinds of things.
So if we want to do something like this, where we're selecting the quantity from lineitem again,
you remember, this is TPC-H,
where the price is larger than 42, then this we can do with a mask, right?
So we can basically load our vector, we do a comparison in parallel,
we create a mask, and then either we
interpret this in scalar mode, or we're using this mask to compress the vector, basically
compress all of the elements that we're actually interested in, merge the vectors
such that we have full vectors again,
and then continue working with these full vectors
rather than going back and forth between the registers.
Okay, so some of the fundamental database operations
for this kind of stuff would then be, say, for example,
a selective store.
So this is the selective store, or the compression,
that I just mentioned, right?
So if we're filtering, then with the mask
we basically know what's interesting to us,
or what passes the filter condition.
And then we can selectively store this
in memory or in a register,
and then basically continue further on with only the subset.
And this is basically a compression operation.
So we have here our vector from A to P.
And we only want to store the highlighted ones, so the gray ones, B, D, F, G,
et cetera.
So for that, we basically create a mask.
And this can be done through a comparison, for example.
So then we have this mask.
And with the mask, we can selectively
store and compress the result. So we could do a zero mask,
so we're only storing the ones we're keeping, and the values
or positions that we don't want are set to zero; or we directly compress and just keep
whatever we need in the end. And the same we can also do vice versa, right? So we can do it the
other way around. Again, this is this expand or decompress operation
that we can also do during a load: basically we're loading memory locations,
like selective memory locations,
into our vectors, right? So rather than overwriting everything, or just loading
a compressed part, we can also say, OK,
we already have a vector.
And some parts we want to rewrite
with the newly loaded elements.
And the parts that we were not interested in then
will remain with the
old values, right? Or that are not selectively loaded.
Another important operation for these kinds of things are scatter and gather operations.
So a gather operation is basically picking individual positions: like, we have an array of elements,
and then we select individual positions in the array and place them at certain positions in a result array. So rather than just saying, OK, we're compressing or decompressing, we're
really shuffling while compressing at the same time,
or selectively loading at the same time.
So it's a bit more of an advanced operation,
but it basically allows us to directly reorder the vector
or shuffle the vector while we're selectively loading.
So this loads from non-contiguous locations.
So what we basically need for this is our input vector with the values that we're
loading, and an index vector that tells us which position should be loaded where.
So on position 0, this index vector says we load the element from position 2.
So this is the C, basically.
On position 1, we load the element from position 9.
So we're loading the J, basically.
On position 2, we're loading the element from position 0.
So this is basically how this works.
The same also works the other way around, as a scatter operation: we're basically storing the elements at separate, non-contiguous positions in an array.
So again, we have an input array, or a value vector,
and we have an index vector.
The index vector basically tells us
where to store the elements in the target vector,
or in the target memory.
And this can again be multiple locations.
So on the one hand, the gather is loading
from non-contiguous locations within a contiguous array.
The scatter operation loads from a contiguous vector,
but stores into non-contiguous locations.
So it's basically doing a very similar thing;
just where stuff is stored is different.
And these are not executed in parallel.
So this is one of the operations.
This exists as an intrinsic, but there is an algorithm behind it.
So this will translate into multiple CPU cycles,
because the cache cannot answer all of these individual requests
or so many distinct requests at the same time.
That's just something to be aware of.
Still, the operation exists, so it's easier to write, basically.
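A sketch of what that looks like as intrinsics: the gather needs AVX2, the 128-bit scatter needs AVX-512F plus AVX-512VL, and the index values are just illustrative (src and dst must of course be large enough for them):

    #include <immintrin.h>

    void gather_scatter_example(const int* src, int* dst) {
        __m128i idx      = _mm_setr_epi32(2, 9, 0, 5);        // which positions to touch
        __m128i gathered = _mm_i32gather_epi32(src, idx, 4);  // loads src[2], src[9], src[0], src[5]; scale 4 = sizeof(int)
        _mm_i32scatter_epi32(dst, idx, gathered, 4);          // stores the four lanes to dst[2], dst[9], dst[0], dst[5]
    }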
And so if we want to use this
in a practical application: say we have a filtered column
scan.
So this is basically what we had earlier, right? So we basically filter
an array using greater-than, the same example as before. So everything that's greater than
42, for example, is what we want to have in our result. Then we can SIMDify this. And this is a scalar implementation of it:
So we have our filter function, which basically takes as input the column that we want to filter
and the filter value, and then we have an output. And then, for the column size,
so for all of the values, we go through them one by one,
row by row. This is column-wise execution:
we don't have the full row,
we just have the column that we're interested in.
And for each row in this column,
we're basically checking: does the filter condition hold?
And if it holds, yes, then we will output an index for this row.
That's in order to later on work with the full row again, maybe,
or just to know which positions we're actually interested in.
And these positions we could then later, for example, use in a gather operation
in order to then have the
full column somewhere, to have the actual values somewhere.
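In code, a minimal scalar sketch of this filter might look like the following; the names are illustrative, but it matches the spirit of the version on the slide:

    #include <cstddef>
    #include <cstdint>

    // Writes the indices of all matching rows to out_indices and returns how many matched.
    size_t filter_gt(const int* column, size_t column_size, int filter_value, uint32_t* out_indices) {
        size_t out_count = 0;
        for (size_t row = 0; row < column_size; ++row) {
            if (column[row] > filter_value) {
                out_indices[out_count++] = static_cast<uint32_t>(row);  // remember the matching position
            }
        }
        return out_count;
    }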
As an example, if we have the filter, we're inputting five values here, 4, 13, 18, 4,
and 2 in an array, and we have the filter condition 9.
So everything greater than 9 is matching.
Then we know, okay, 4 doesn't match.
13 does match.
18 does match.
4 doesn't match.
2 doesn't match.
So positions 1 and 2 in the array match our result, match our condition.
So this is the output.
And this we can also do as a vector, right?
So we can do the same thing.
Basically, we have the input and the comparison vector.
And, I mean, we're comparing the complete vector
at the same time.
And then, rather than individually outputting the indexes,
our output will be a bitmask.
So we get as an output a bitmask
which tells us which elements in the vector
match our filter condition.
So here we're using the built-ins.
So we're basically checking this:
I mean, with the generics, we're checking, OK,
is the value vector greater than the comparison vector?
So we're building a comparison vector
with our comparison
value on each position, and we're comparing, and then our output will be this result bitmask.
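A sketch of that comparison step with the generic vector extensions, producing the bitmask but not yet doing the compression (names are illustrative):

    #include <cstdint>

    typedef int v4si __attribute__((vector_size(16)));  // 4 x 32-bit int

    // Compare four column values against the filter value at once and collapse
    // the lane-wise result (-1 for a match, 0 otherwise) into a 4-bit mask.
    uint8_t match_mask(v4si values, int filter_value) {
        v4si cmp  = {filter_value, filter_value, filter_value, filter_value};
        v4si hits = values > cmp;
        uint8_t mask = 0;
        for (int i = 0; i < 4; ++i) {
            mask |= (hits[i] ? 1u : 0u) << i;
        }
        return mask;
    }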
And now, what we actually want to have in the end is this kind of result,
right? So we only want the positions that we're interested in. We don't want the columns,
the rows, that do not match.
So for this, we need to do
some kind of compression again.
And whenever we use the generics,
this is something that we cannot express in the generics.
So here we have a separate function in here
that basically
compresses the rows, or left-packs the rows, in this vector, so that we have all of the rows that we need
at the beginning, right? And so that raises the question: how can we implement this? Well,
in AVX-512, this can be implemented with the compression method
that we saw earlier.
And we can also do the compress and store in one instruction,
because we don't need to first compress in a separate SIMD
register and later store, but we can directly say compress it
and store it right away.
And this cannot be expressed in GCC vector extensions.
So if we're using this mask compress in AVX-512,
this gives us a factor-35 speedup over the scalar version.
So that's actually quite significant,
because we don't have to iterate through all
of the individual elements one by one
with scalar integer operations,
but we can do this at once in a single instruction.
And if we want to do this in the GCC vector extension,
so we want to do this in a generic way,
then we need to do the shuffling.
So for that, we can then basically create a shuffling
mask and convert the matching mask into a shuffling mask
using a lookup table.
So I'll show this to you in a bit or in the next slide.
If we're doing the same thing using shuffling,
then we do need to do some extra operations.
So we basically get half of the throughput.
So we have one extra step, which means our throughput
or our speedup is basically half of the speedup
that we get if we just use the single intrinsics in AVX 512.
So here, if we do a left pack in AVX 512. So here, if we do a left pack in AVX 512,
then this is really just this mass compress for.
So we have our bit mask that is basically
the output from the comparison.
And we have our input.
And we're compressing this directly.
And we can directly store basically already.
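As a sketch, the AVX-512 left pack then boils down to a single compress-store (AVX-512F plus AVX-512VL for the 128-bit form; the function name is illustrative):

    #include <immintrin.h>

    // Store only the lanes selected by mask, packed to the front of out; returns how many were written.
    int left_pack_avx512(int* out, __mmask8 mask, __m128i values) {
        _mm_mask_compressstoreu_epi32(out, mask, values);
        return _mm_popcnt_u32(mask);
    }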
And in GCC vector extensions, what we can do
is use this bitmask
and transform it into a shuffle mask.
So basically, depending on the bitmask,
so how many of the elements match,
or which of the elements match,
we're shuffling them to the front of the array.
So then we just need to remember, basically, the shuffle mask for each case.
This first one would basically mean, okay, we don't have to do anything, no shuffling.
The second one is the same thing, right? So
if only the first one matches, for example, well, we're still good, right? We don't
need to change anything. If the first and the second one match, again, we don't need to
change anything. If, say for example, only the last three match, then we need to shift.
So then we basically need to move the first element to the last position,
because we want to compress, so that only the first three elements in our vector are the interesting ones.
So then we need to do some shuffling.
And we could use different kind of masks here. But the logical way of doing this
is basically shifting everything to the left.
So the first element goes to the back,
because that's not going to be interesting to us.
The second element goes to the first position.
Third element goes to the second position, and so on,
and so forth.
Make sense so far?
So basically, using the shuffling, we can do the compression.
And for the compression, in the second step,
we then just selectively store the part of the vector that's interesting to us.
The rest will basically be gone then.
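A sketch of this generic left pack, using GCC's __builtin_shuffle and an illustrative lookup table that maps each 4-bit match mask to a permutation moving the matching lanes to the front (any permutation with that property would do):

    #include <cstdint>

    typedef int v4si __attribute__((vector_size(16)));

    // For each possible 4-bit match mask, a shuffle index vector that moves the
    // matching lanes to the front; the order of the non-matching tail does not matter.
    static const v4si kShuffleTable[16] = {
        {0, 1, 2, 3}, {0, 1, 2, 3}, {1, 0, 2, 3}, {0, 1, 2, 3},
        {2, 0, 1, 3}, {0, 2, 1, 3}, {1, 2, 0, 3}, {0, 1, 2, 3},
        {3, 0, 1, 2}, {0, 3, 1, 2}, {1, 3, 0, 2}, {0, 1, 3, 2},
        {2, 3, 0, 1}, {0, 2, 3, 1}, {1, 2, 3, 0}, {0, 1, 2, 3},
    };

    int left_pack_generic(int* out, uint8_t mask, v4si values) {
        v4si packed = __builtin_shuffle(values, kShuffleTable[mask]);  // move matches to the front
        int count = __builtin_popcount(mask);
        for (int i = 0; i < count; ++i) {
            out[i] = packed[i];  // selectively store only the packed prefix
        }
        return count;
    }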
And we can use this, as you will see, to do some kind of bit-packed integer decompression,
as an example.
So if we say, for example, databases
want to be efficient, space efficient, or memory efficient.
So if we have 9-bit integers in our
data set, we want to store them packed. So this is not going to be aligned in any form,
but it's going to be memory efficient. And if we want to work with it in SIMD registers,
for example, we have to unpack this again. So we have to make sure that we can calculate with the numbers in our regular,
whatever integer size we want to use.
So 16 bit or 32 bit integers then later on.
So we have this tightly packed and we want to say,
for example, have 32 bit numbers that are zero extended
then and do some kind of operations on those.
And, I mean, a scalar approach is basically:
well, we track the leading unused bits,
and we're copying the four bytes
containing the nine number bits into a register,
and then shift, basically depending on the position,
in order to remove any unused leading bits.
And we're doing a bitwise AND to zero the remaining bits,
the remaining 23 bits, and repeat
until we have all of the elements extracted.
So this is maybe a bit too abstract.
So let me show you this in a concrete example.
So here we have our individual numbers.
So number 1, 2, 3, 4, 5, et cetera, 9-bit numbers.
We have the byte boundaries,
so byte 1, byte 2, byte 3, byte 4.
And we have the lanes, so the 32-bit lanes
where we actually want to have the final numbers in.
In our result, essentially, number 1
should be in lane 1, number 2 should be in lane 2, number 3
in lane 3, and number 4 in lane 4. But if we're loading 32, 64, or 128 bits into our register, then we have many more numbers
that we cannot use here actually, and we will basically have to move something.
So in order to work with this, what we're doing is basically we first load the data in here and then we have to shift data.
So first then we have to basically move the data.
So the first one is already correct, the first element.
The second element needs to be moved to the second lane, and we're going to do this on byte ranges, right?
So this is basically then moved to the second,
like from byte two, we're moving to lane two.
From byte three, we're moving to lane three.
And from byte four, we're moving to lane four.
So this basically makes sure that we capture everything.
But then we have these unused bits in the beginning, right?
So this is basically something that we cannot work with,
or that's actually wrong, right?
This is part of the data that we don't want to have.
So then we basically need to shift every element
or in every lane separately, correctly,
in order then to have the number at the beginning.
And then in the next step, I mean, this is, whoops,
sorry about this.
Then in the next step, we basically
mask off the most significant 32 minus 9,
so 23, bits using
a bitwise AND, and we have the final number.
So we just zero this out, and this is the final number.
And the somewhat confusing part of this is that in memory,
the most significant bit is basically
here.
So the number is basically written from left to right.
So let's say our number is 123.
Then what is in memory is 3, 2, 1.
And then we basically mask off the rest.
And after this,
with these individual operations,
we have the four numbers in our register,
and we can continue with the next lanes.
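As a sketch, the three steps for one group of four values could look like this, assuming a little-endian layout where value i starts at bit 9*i; the byte shuffle needs SSSE3 and the per-lane variable shift needs AVX2, and the constants are for this first group only:

    #include <cstdint>
    #include <immintrin.h>

    __m128i unpack4_9bit(const uint8_t* packed) {  // assumes at least 16 readable bytes
        __m128i raw = _mm_loadu_si128(reinterpret_cast<const __m128i*>(packed));

        // Step 1: copy, per 32-bit lane, the two bytes that contain value i's 9 bits
        // (value i starts at byte (9*i)/8); the -1 entries produce zero bytes.
        const __m128i byte_shuffle = _mm_setr_epi8(0, 1, -1, -1,   // value 0: bytes 0-1
                                                   1, 2, -1, -1,   // value 1: bytes 1-2
                                                   2, 3, -1, -1,   // value 2: bytes 2-3
                                                   3, 4, -1, -1);  // value 3: bytes 3-4
        __m128i lanes = _mm_shuffle_epi8(raw, byte_shuffle);

        // Step 2: shift each lane right by (9*i) mod 8 so the value starts at bit 0.
        lanes = _mm_srlv_epi32(lanes, _mm_setr_epi32(0, 1, 2, 3));

        // Step 3: mask off everything above the lowest 9 bits.
        return _mm_and_si128(lanes, _mm_set1_epi32(0x1FF));
    }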
This is with 128-bit.
If we have 256-bit vectors, then we can process, rather than 4,
we can process 8 numbers at a time.
And we're processing 9 times 8 bits at a time,
so which is divisible by 8.
So that's actually nice, because the next input element is already
byte aligned, so we don't need to do any shifting
in the beginning or any copying there.
So that makes it a bit faster.
So that's one type of operation.
So here you can see that using these SIMD instructions, we basically move data around.
So this is a lot of the stuff that we need to do.
We need to do some alignment, data movement, et cetera, to properly work with the individual values then.
And filter out things that we don't need, et cetera.
And a similar operation, or let's say,
a similar way of thinking about it,
but very different application is a SIMD tree search.
So if we have a binary tree, like a scalar implementation
of a binary tree: if you're implementing this efficiently,
then you have your binary tree in an array,
and you basically have the root at the first position, then
the second level at the second and third positions,
and the third level, and so on and so forth.
With this you can easily calculate
where each individual position in the tree is
and you can quickly jump to the correct position
in the tree and just compare.
And so this is basically usually an efficient way
of representing this.
And so for each node n, the children then are in positions 2n and 2n plus 1.
So that's actually quite easy to find.
So there's a very simple calculation.
But the unfortunate part when traversing this is that we basically have to jump a lot in memory,
right? So especially the further down in the tree we are, the further away the children are
from the nodes. So that means if we're doing the next comparison, this is actually quite a large
jump. I mean, the two children are next to each other, which is good,
but then the next level will again be far away, and the more levels, the further away.
So if we're looking up a couple of items, then we basically will always compare with the root, and then with the current node: if the current node is larger,
then we go to the left child.
If the current node is smaller,
then we go to the right child,
or we might already have found it, right?
So it could also be that
the current node is the node that we're looking for;
then it's also good.
And so if we do this in parallel,
or we want to do some parallelization,
like a lookup for multiple items:
the outer loop we cannot really parallelize well,
at least not in a SIMDified fashion,
because the lookups are completely independent, right?
So, I mean, we can do some,
but here we
would need some scatter-gather operations to properly do this.
And can we vectorize the inner loop?
Well, there we have a data dependency
between the loop iterations.
But we can do multiple levels in parallel.
So we can basically do multiple comparisons across the
levels in parallel. But the problem is, if we do these multiple comparisons in parallel, then,
well, we're jumping around a lot in memory, which is not good for SIMD. So this would basically
need some scatter or gather operation to move things together. So, I mean, if we say, for example,
we want to do comparison of two levels in parallel,
then we can do three comparisons in parallel at one time.
So here, for example, these would be the positions
in our tree, or the array positions in our tree.
So array position one is the root,
two is the left child, three is the right child,
four is the left child of node two, five is the right child of node two, then eight, and so on. So you see
the jumps get larger and larger here. And if we do these comparisons in
parallel for the root, it actually works quite nicely, because these are tightly aligned: we can just load them and directly compare them. On the next level here, if we're going left here, for
example, this is not tightly packed anymore. So this is actually not good, right? So this we cannot do
well anymore. So what we actually can do, in order to make this work, is pack these levels tightly. So we're packing subtree
regions into some SIMD registers. We're basically giving up the way the tree is structured in
the array, or rearranging how the tree is structured in the array, in favor of getting these
tightly packed subregions, right?
So what we can do is basically, I mean, for the first level, this was already correct.
Then we can say, well, rather than packing basically these two and then these two,
we can pack these three and these three next to each other,
such that then we can load them at once
and do the SIMD comparison again.
So we need to rearrange this in memory,
and then we can do a SIMDified search operation;
we can basically do a SIMDified comparison operation.
This will again give us this kind of bitmask.
And so if we say, for example, we're searching for key 59, then we're starting up here,
so we can compare all three elements at the same time. So there's no match in here, but that
already helps us: okay, we basically need to go further. So in this case, if we do the comparison
with 59, then we know, okay, this one is definitely smaller, this one is
smaller as well, and this one is larger. So we would have to continue in this direction, right? And
so our vector comparison will basically tell us: okay, this one was smaller, one; this one was smaller, one; and this one was larger, zero.
So we need to move here.
An alternative would be all of them are smaller,
then we need to move here.
This one is larger, this one is smaller,
then we need to move here, or we need to move here.
So there's basically four options,
as the four children that we have under this subtree.
And so we can basically
move this mask into a scalar register; like, the comparison,
the result of the comparison, can be
moved to a scalar register.
And we can use this mask as a lookup,
to basically say, okay, where do we have to continue from here? And while we could
have all kinds of masks, the mask that we're getting out here is basically 110. I mean, this
is abbreviated.
So this is basically our mask,
and it tells us the direction.
It means this one is smaller, this one is smaller, and of course then this one also needs to be smaller.
Something like, say, this one is larger but these two are smaller, we can't even get.
This doesn't work, right? So this is a wrong configuration.
So this doesn't interest us; only the ones that actually lead to a correct lookup we're using.
And using this, we basically can then directly know, or we can look up in a table, where
we need to go from here. So we're comparing all of these, and depending on the result of the
comparison, we know where to continue in our tree. And, well, the blocking we can also do
hierarchically. So we saw this blocking for SIMD makes sense, right?
So in SIMDified fashion,
we can then do the comparison directly.
So we did like in 128 bits, we might get two levels.
In 256 bits, we might get three levels already and so on.
So we can actually speed up the lookup in here,
but being fast in loading this
also makes sense for memory, right? So we can also do, on top of the
SIMD blocking, cache-line blocking, making sure the cache lines
have complete blocks in them,
and we can also do this with memory pages, or pages on disk
as well, right? In order to
always load only what's relevant for us, or, let's say, have all of the local data together where we will
probably do the next lookups. And if we do this, then we can see that for larger trees, where the blocking
makes sense, we get quite significant speedups by doing this kind of blocking. So
here we are basically in the megabytes, I assume.
So here, we're basically doing nothing, like a normalized one.
If we're doing page blocking, this already gives us 30%.
If we're doing cache line blocking, another 10-ish
percent.
And then SIMDifying this gives us another 10%, right?
And then we can also use software pipelining.
So there's a paper on this, how to do this efficiently.
And we see that here we can get
like a tenfold performance improvement
just through this, like shifting the data
or aligning the data properly in memory
and then SIMDifying the execution.
For very small trees, this doesn't work so well
because 64,000 keys, this is something that fits in cache.
So here, like doing additional rearrangements, et cetera,
won't give you much performance improvement here.
Okay. So, with this: questions so far, up to here, on SIMDification of your code?
So the idea here is really to give you an idea how to think in SIMD operations.
So this is what you will need for your task.
It's a kind of different way of writing code,
because you don't use branches.
But you're actually doing everything in parallel.
It's similar to predication, essentially.
So we're always executing the code.
And all of the things that we do,
we're just continuing with the parts
that we're interested in then later on.
So we talked about SIMD basics last time and some vectorization
options and database operations today.
Next week, I'm not here.
So Martin will present execution models and data structures to you.
Also fun topics.
Yes?
That's a good question.
Is it aggregated?
I guess it's aggregated across all of the cores.
So that would give us, like if it's four cores,
it would be 4.2, right?
That might make sense.
Six cores, maybe something like this, presumably.
And what is the best reported number?
Is it the current?
Best reported number
is basically what other people have reported before
in the paper.
So I mean, all the details you can actually
find in the paper that's linked in here,
but the best previously reported number from other research,
basically related work.
OK, thank you.
Oh, you're welcome.
More questions?
No more questions?
Well, then, thank you very much, and see you in two weeks.