Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Vectorized Execution (2)
Episode Date: May 16, 2023...
Transcript
Okay, so welcome everybody to our next session on vectorization.
So today we're going to finish up SIMD.
So you've also already started to program this.
You know the task, what you have to do.
And we'll talk a bit more about the kind of operations that you can do with SIMD.
So this is kind of the last part of the...
Where do I have the overview?
This lecture...
Let me find my cursor.
So, we're going to be in this kind of space here.
And even though we're probably not going to go all the way to the end of the lecture slot,
this is going to be everything that I'm going to do today,
because I feel a bit under the weather.
So, we're going to have an earlier stop today,
which I guess you also don't mind, right?
So, one thing that I still wanted to make you aware of is the following.
I already reached out to some of you, but we're still looking for student assistants. I guess most of you
already know this, but if you don't, and if you're interested in working on
these kinds of topics, and also in teaching them, let me know, because I think we have some interesting stuff
going on in the group and we could really use some help.
So with that, let's go back and go right...
Let me just find this... Right to where we stopped. So we talked about generic intrinsics.
No, this is too far off.
We talked about the explicit vectorization.
So this is where I left off last time,
where you already saw how you can use these SIMD registers
and how you can use the individual intrinsics.
And these intrinsics, these SIMD ISAs,
are specific to the different compilers
and to the different CPUs, as we said, right?
So this would be the x86 way of doing SIMD, using SSE in this case, so
using the 128-bit registers. And if you want to use different register widths — 256 or 512 bits — you will again have to program against those registers in order to really use these units.
The compiler won't do much with these intrinsics, so this is basically just a function wrapping
the assembly code.
The compiler will just directly translate this.
And if you do it right, you will get good performance,
but only on the specific CPU
and only on the specific units
that you're programming against.
So one thing that we use a lot in database operations is masks.
So that's something that I want to give a bit of or spend a bit of time on today.
And this is basically a way to do conditional execution in SIMD.
Because, as I already touched on last time, you cannot really do branches in SIMD, right?
Because, I mean, you would have to execute all different branches depending on what are the individual branching conditions in these different registers.
So it doesn't really make sense to have multiple code paths.
Won't work.
So what you have to do is adopt this predication way of
thinking about the data, and one way to do this is to work with masks. That
basically means saying: I'm only selecting a subset of the elements of the
vector that I'm working on. So you can say, for example, I only want to add a certain subset of the elements in my vector in a certain operation.
And from AVX-512 on, basically almost all operations support some kind of masking.
So one way would be zero masking.
You can basically say, I'm selecting certain elements in my vector that I want to work on,
and the others are all going to be zero after the operation.
And something else is you can also just keep the values that you had before.
So say, for example, you want the vector to stay the same in certain locations.
Other locations, you want to do an addition.
So this would be an example that we have here.
But here, for example, with zero masking:
say we have two vectors, A and B,
and we're adding them with a mask, using zero masking.
So we're saying that the elements that are zero in the mask
are going to be zero in the result,
and the other ones are going to be added up.
So then we have these two vectors,
we have our mask,
and the result will basically only contain zeros and the added-up elements,
with the added-up elements being where our mask is 1.
And this is basically one of these operations
that you could have here.
So this is an AVX-512 operation.
As you can see, in this case we're
going to do an integer addition with zero masking.
And this means we give this operation the mask, which
can have these zeros and ones,
and the two input vectors, and the result will be our output vector, which will be written into the register.
So this also gives us a way of saying: okay, we're selectively working with some of the data. So if our condition is
right, then we can select the elements in the vector where our
condition holds, and the other ones we can ignore for the rest
of the processing. So this is basically the same idea as doing this kind of predication.
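As a minimal sketch of what that looks like with the AVX-512 intrinsics (the mask value here is just an example):

```cpp
#include <immintrin.h>

// Zero masking: lanes whose mask bit is 0 become 0 in the result,
// lanes whose mask bit is 1 get a[i] + b[i].
__m512i masked_add_zero(__m512i a, __m512i b) {
    __mmask16 mask = 0b0000000000001010;     // example mask: only lanes 1 and 3 are active
    return _mm512_maskz_add_epi32(mask, a, b);
}
```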
And we can also do blending, so we can say we're selectively merging different
kind of vectors, right?
So basically just merging two vectors, given a mask where we keep some elements of the old vector,
or we just place in some elements of the new vector.
Or we could say, for example, selectively add and keep the rest as we had.
Say, for example, here we have a mask with blending. So see here, for
example: for the source vector, those elements which are zero in the mask
we're just going to keep; those which are one in the mask we're going to add up
with the other vector. So we only look at the elements in
the second vector where the mask is one.
And there we're going to do an addition.
The rest we're just going to keep as is.
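Again as a small, hedged sketch — the merge-masking variant, where the unselected lanes keep the values from a source vector:

```cpp
#include <immintrin.h>

// Merge masking: lanes whose mask bit is 0 keep the value from src,
// lanes whose mask bit is 1 get a[i] + b[i].
__m512i masked_add_blend(__m512i src, __m512i a, __m512i b) {
    __mmask16 mask = 0b0000000000001010;     // example mask, as above
    return _mm512_mask_add_epi32(src, mask, a, b);
}
```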
Actually quite simple,
but translating your operations
or your database operators
to something like this
requires a bit of rethinking of how you actually do these database operators. And we'll come to this in a bit.
One extra thing that's super useful in this setup,
and that you will need as well, are compress and expand operations.
These are, let's say, orthogonal to each other.
So the compress operation basically says:
I'm going to select just some of the elements in my vector,
and I'm going to move them to the front of the vector.
This is, for example, if you're doing something
like a scan, right?
Or if you want to select some elements,
you're doing some conditional operations,
so you're trying to figure out which elements actually match whatever you want in your operation.
And then in the next step, you just want to work with those, right?
So you actually want to have them somehow close together,
or you want to store them in a tight array.
So you will compress them towards the front of the array.
And in the other direction, you might again just do some selective merging:
for example, if you already know, okay, we're just going to have some positions in the array
where we want to put the data from the vector, then we can decompress and basically move the data to different parts of our vector, right?
Of our register, in that sense.
So that's decompression, or expansion.
So there's specific operations just basically doing this.
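In AVX-512 these exist as dedicated intrinsics; a minimal sketch, with example masks:

```cpp
#include <immintrin.h>

// Compress: the lanes selected by the mask are packed to the front of the result,
// the remaining lanes are zeroed.
__m512i compress_example(__m512i v) {
    __mmask16 keep = 0b0000000000000110;              // keep lanes 1 and 2 ("B" and "C")
    return _mm512_maskz_compress_epi32(keep, v);      // B, C, 0, 0, ...
}

// Expand: the values from the front of the input are spread to the lanes
// selected by the mask, the other lanes are zeroed.
__m512i expand_example(__m512i v) {
    __mmask16 positions = 0b0000000000000110;
    return _mm512_maskz_expand_epi32(positions, v);   // 0, v[0], v[1], 0, ...
}
```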
Okay. And so this one I actually showed you last time already,
I just moved it to improve the flow here.
So we basically talked about this, right?
So all of these operations are super specific to the CPUs.
Right? So this is really, if I'm using AVX512, this means I can only do this on x86 CPUs and the
code will only run on x86 CPUs that have these units.
They won't run on any ARM, so if you write this code, you cannot run it on this machine,
because this is not an x86.
This is an ARM architecture.
You will basically need to rewrite this code with new instruction sets, new intrinsics.
So we had this, right? Basically our x86 version with 128 bits, and this would be the same thing in the ARM/Neon version.
And it's basically the same code, but you just need to duplicate everything.
So in order to get away or to move away from this, there's a couple of different strategies.
I mean, one way would be what we did all the way
in the first place is this autovectorization, right?
So one step is basically I'm just writing code
in a way that the compiler understands it.
I'm writing scalar code,
but the compiler can directly see this can be translated.
And for the simple addition, this will work.
So that's the most generic way.
And of course, then you need to have a compiler
that understands this.
And if the compiler doesn't see it,
well, tough luck, it won't work, right?
So, but you can check with Godbolt or stuff like that
if it worked or not.
So that's one way.
However, you can also use generic vector classes.
So there are different kinds of abstractions out there
so that you don't have to write specific intrinsics
for single CPU types or single CPU architectures. And this is either library-based or compiler-based.
And there's also a trend or there's an effort
to standardize within C++.
So then this would be like an extra way again,
but that's not there yet.
And if you do this, then basically you have a way to go back
or to do your coding across different kind of architectures.
And there's many different libraries, for example.
There's xsimd, the Vector Class Library, libsimdpp, etc.
And these are all linked, so you can actually try this out.
And they basically provide
classes representing vectors, which can then be manipulated using overloaded operators. So you can
basically use them just as you would use regular operators and classes in C++. And then underneath,
they provide an implementation,
so they do the same thing.
They basically have the same kind of code underneath here
as we would write it by hand,
but they will have this for the different kind of operations.
And if, say, for example,
the architecture doesn't have a certain type of instruction,
the library will provide an implementation of that instruction.
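As a rough illustration of the idea — not the API of any particular library, just a toy wrapper that maps an overloaded operator to an intrinsic on one platform and to a scalar fallback elsewhere:

```cpp
#include <cstdint>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

// A toy 4 x int32 vector class in the spirit of these libraries.
struct Vec4i {
#ifdef __SSE2__
    __m128i data;
    Vec4i(__m128i d) : data(d) {}
    friend Vec4i operator+(Vec4i a, Vec4i b) {
        return Vec4i(_mm_add_epi32(a.data, b.data));   // maps to one SIMD add
    }
#else
    int32_t data[4];
    friend Vec4i operator+(Vec4i a, Vec4i b) {         // scalar fallback
        Vec4i r = a;
        for (int i = 0; i < 4; ++i) r.data[i] += b.data[i];
        return r;
    }
#endif
};
```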
And another thing, a different approach, again, is the compiler extension approach.
Meaning there are certain compiler extensions,
and these are used also in the compiler internally.
Whenever the compiler checks whether your code can be auto-vectorized, it will basically translate it to this internal representation and do its optimizations on that.
And you can directly program against these compiler extensions.
So these are the GCC vector extensions.
They are also adopted by Clang, but not, for example, by Microsoft Visual C++.
Let's go back, look at the libraries.
So we'll look at both independently. And I have to say,
both are cool, right? It's a nice way of doing this. But for the tasks that you
have to do, you will still have to touch intrinsics — the CPU-specific
intrinsics — to get really good performance, unfortunately. There are
some operations that are not in the libraries
and not in the compiler extensions,
so unfortunately you still have to touch the intrinsics.
In some next iteration of the course,
maybe this will be possible.
OK, so the SIMD libraries, for example,
they are an extra dependency.
And they can only work with the features
that the compiler supports.
Say, for example, AVX2 does not support 64-bit integer
multiplication;
then the library implements this using 32-bit multiplications.
An example would be the Vector Class Library.
When you're compiling for AVX-512,
the library may still keep the slow code:
the library will still have this translation,
and the compiler will use the manual multiplication logic.
So then, rather than using the AVX-512 operation,
you're going to fall back to the 32-bit multiplications.
So we can actually look at this.
So here we have the same kind of multiplication. This is basically the 64-bit multiplication that you would build from 32-bit operations:
you split each operand into two halves, multiply the separate parts, and then
add up the partial products. That's what you can see here. And if we're
doing the same kind of operation with this multiplication, it
will still come out with the same kind of code. So here we're
basically mimicking this library approach:
we're creating a function that does the same thing.
And even if we say you can use larger registers here,
we're still going to end up with the same kind of code,
with the same kind of complicated code rather than faster code.
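A hedged sketch of what such an emulation typically looks like with AVX2 intrinsics — the standard split-into-halves trick described here, not necessarily the exact code of any specific library:

```cpp
#include <immintrin.h>

// 64-bit lane-wise multiply emulated with 32-bit multiplies (AVX2 has no vpmullq):
// a * b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)
__m256i mul_epi64_avx2(__m256i a, __m256i b) {
    __m256i a_hi  = _mm256_srli_epi64(a, 32);         // high halves of a
    __m256i b_hi  = _mm256_srli_epi64(b, 32);         // high halves of b
    __m256i lo_lo = _mm256_mul_epu32(a, b);           // lo(a) * lo(b), full 64-bit
    __m256i lo_hi = _mm256_mul_epu32(a, b_hi);        // lo(a) * hi(b)
    __m256i hi_lo = _mm256_mul_epu32(a_hi, b);        // hi(a) * lo(b)
    __m256i cross = _mm256_add_epi64(lo_hi, hi_lo);
    return _mm256_add_epi64(lo_lo, _mm256_slli_epi64(cross, 32));
}
```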
However, there are efforts to standardize this in std::simd.
And there's also a link to the C++ reference, which
would solve these problems, basically, if the feature set is large enough.
So if basically all of the different implementations
are there, then this problem would be solved.
The other way of doing this explicit vectorization
in a generic fashion are the GCC vector extensions.
So what the compiler basically does internally is map
platform-specific types to some kind of internal representation. That's
done anyway: even though the intrinsics will be translated more or less directly
to assembly, there is a wrapper, a mapping to some
canonical internal representation, right? Because we want to be able to translate
different things.
And this basically decouples the compiler implementation,
so the different kind of optimizations,
how the compiler can do the autovectorization internally,
right?
So this basically is used for this framework
internally in the compiler.
And then basically, the compiler doesn't
have to be rewritten for the different kind of,
or at least the optimizations don't
have to be rewritten for the different kind of architectures
because we can do the same kind of optimizations
on different architectures in the auto vectorization
approach.
Or let's say the compiler can do this.
And we can reuse this by directly targeting this canonical representation in our code.
So the compiler can then do all the platform-specific translation while we're just writing our code generically in a vector-based way.
Then the compiler does the same thing as the libraries would do.
There's basically a natural representation using the standard operators
like multiplication, addition, etc.
and all kinds of comparisons.
So then we can just use them
as we would use them regularly in our code —
in our vectorized code, I have to say. Okay, so say for example we have our addition here.
So this is the same addition example as we always see it.
However, we're already using the GCC vector extensions here.
So this basically means we're saying we're using a vector type,
and this is done here.
And then we have our addition function, where we say, OK,
we're going to go in steps of four,
adding up these vectors, and we can use the vectors just as we would
in regular operations, right?
So it's basically the same as if we did it the manual way
or let the auto-vectorization do its work, right?
It's more or less the same code,
but we're already specifying that we want to have vector types.
It also looks very similar to writing the SIMD operations with intrinsics,
but without any platform-specific specialization.
And it will produce the same kind of platform-specific code —
let's say code identical to using the platform-specific intrinsics.
So if you look at what this produces when we compile it,
it will look the same as if we had written it the x86 way, or the ARM way, right?
The results are more or less the same, or actually the same,
but we only have to write one code base, right? And this will be directly translated.
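A minimal sketch of this kind of code with the GCC vector extensions (the array length is assumed to be a multiple of four to keep it short):

```cpp
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension type: 4 x 32-bit int in one 128-bit vector.
typedef int v4si __attribute__((vector_size(16)));

void add(const int* a, const int* b, int* out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        v4si va, vb;
        std::memcpy(&va, a + i, sizeof(va));   // unaligned-safe loads
        std::memcpy(&vb, b + i, sizeof(vb));
        v4si vc = va + vb;                     // compiles to one SIMD add per platform
        std::memcpy(out + i, &vc, sizeof(vc));
    }
}
```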
And the compiler supports all kinds of operations on these. Unfortunately, it's not feature complete, so it doesn't have all the operations that you would want for the kind of work
that we do — we'll see in a bit that there are some operations it doesn't support.
But it does have, for example, this shuffle operation,
so we can reorder the elements in a vector.
We have an input vector and a second vector
that specifies a new order, and the operation
will produce this new order —
it just copies the individual elements around.
And then we have a conversion builtin, for example, where we can change the vector types into
different element types. So, for example,
from 32-bit integers, or 32-bit numbers in general,
we can convert to 16-bit numbers.
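A small sketch using the corresponding GCC builtins (the values are illustrative; Clang's shuffle equivalent, __builtin_shufflevector, has a slightly different signature):

```cpp
typedef int   v4si __attribute__((vector_size(16)));   // 4 x 32-bit int
typedef short v4hi __attribute__((vector_size(8)));    // 4 x 16-bit int

v4si reorder_and_narrow_demo() {
    v4si a   = {10, 20, 30, 40};
    v4si idx = {3, 0, 2, 1};                    // the new order we want
    v4si r   = __builtin_shuffle(a, idx);       // {40, 10, 30, 20}

    v4hi narrowed = __builtin_convertvector(r, v4hi);   // 32-bit -> 16-bit elements
    (void)narrowed;
    return r;
}
```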
But something like this compression that I showed earlier,
that's super useful in database operations.
That's something that right now cannot be expressed within
these vector extensions. So if,
for example, you just want to keep the elements B and C, as we had earlier, and
have them at the beginning of the array, you cannot express that — and the same goes for the zero masking.
Okay, so far so good.
Now let's look at how to use this in database operations. So we'll first go through for some simple stuff
and then more explicit or more advanced examples.
And let me just quickly fix this here.
The very simple thing that I already told you about,
that's kind of straightforward to vectorize, is a scan.
So if we're doing some kind of scan or a selective scan,
this is something where you can directly see,
okay, instead of checking every individual tuple,
we can basically check multiple tuples at a time
to see whether they match our predicate or not.
So that's very simple.
Other things that are easily vectorizable are joins,
sorting using sorting networks, et cetera, and then filters.
And there's actually a nice read for this —
how can you rethink database operations
using SIMD vectorization?
This is linked here,
and some of the examples that I have are from this paper.
So let's basically look at the scan first.
So a scan often also has some arithmetics in there,
so this is basically what we would do in the previous examples as well.
I mean, it's not a simple scan,
it's already some kind of extended projection here,
so we're basically looking
at certain attributes of our tuples and adding them up. This is what
the extended projection typically does. But underneath, we're just going to scan through
all the data once and add up the elements of the different attributes.
And, well, that's easy to see that we can do this in parallel, right?
Another thing is aggregations.
Well, this is something that we can also do.
I mean, for the minimum, we have to do it block-wise
and then combine the blocks further.
Similarly, we can do this for sum and for max, etc.
So, but at least for some subsets, we can do this quite efficiently.
Right?
So, this is quite clear.
For each block, working on multiple elements at a time,
we can find the minimum, we can find the maximum, we can build the sum,
and then we do this iteratively for the rest, basically.
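A minimal sketch of such a block-wise aggregation with the GCC vector extensions — a sum here, assuming the length is a multiple of four and ignoring overflow of the 32-bit partial sums:

```cpp
#include <cstddef>
#include <cstring>
#include <cstdint>

typedef int32_t v4si __attribute__((vector_size(16)));

int64_t sum_column(const int32_t* data, size_t n) {
    v4si acc = {0, 0, 0, 0};                  // four partial sums
    for (size_t i = 0; i < n; i += 4) {
        v4si v;
        std::memcpy(&v, data + i, sizeof(v));
        acc += v;                             // block-wise, element-wise add
    }
    // Final scalar reduction of the four partial sums.
    return (int64_t)acc[0] + acc[1] + acc[2] + acc[3];
}
```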
However, the selection — well, I already basically gave that away, right?
So selection is a bit more tricky. I mean, if we're just doing an addition,
that's straightforward.
You're just going to go through the whole dataset.
We're just going to do this extended projection.
We're just going to go step by step, do everything and we can do this in blocks
fitting our SIMD registers. So that's quite easy. The selection is
more tricky because we don't have branching, right? So we cannot say:
if this condition, then do that. We have to express this somehow
in another way,
and that means using this kind of masking idea.
And the other thing is we don't really want to move data in and out of the SIMD registers,
because that's quite expensive.
So basically, if we have to move the individual items,
then this needs to be done individually, right?
We cannot move all of the elements at one time;
we have to either go through memory or do one item at a time.
But what we can do is we can extract a mask from the registers, from the SIMD register,
and just copy that into a single scalar register.
So that's something that's possible.
And the mask is small enough to fit into a regular register, unlike
the complete SIMD register, right?
So if we have like a 512-bit register,
we cannot copy this into a single register.
So we somehow have to compress this,
and one way is using this kind of mask.
So the idea for creating some kind of scan
or selection like this
is to actually generate a bit vector,
and then interpret it in scalar mode.
So, basically, rather than doing
branching for every individual tuple,
or some weird branching logic for a whole vector, we're going to create
a bit vector and then interpret it in scalar mode, where we can actually do the branching.
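A small sketch of that mask extraction with AVX2 intrinsics (the greater-than predicate is just an example):

```cpp
#include <immintrin.h>

// Compare 8 ints against a threshold and pull the per-lane result
// into an 8-bit mask in a regular scalar register.
int match_mask(const int* block, int threshold) {
    __m256i vals = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
    __m256i cmp  = _mm256_cmpgt_epi32(vals, _mm256_set1_epi32(threshold));
    // One bit per 32-bit lane; the scalar code can now branch on this cheaply.
    return _mm256_movemask_ps(_mm256_castsi256_ps(cmp));
}
```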
So we already had this selective store and expansion,
so this will be required for these kind of operations again.
So, let's just briefly touch on this again.
And of course, we think of this here with small examples,
but for larger vectors, for 512-bit vectors,
you can handle more elements at the same time.
And what the selective store does again is basically selects the subset of elements that we're interested in
and compresses this to either to the beginning of the register or also directly to memory.
So this can also be directly then written to memory.
And the same thing we can do the other way around.
Rather than basically saying we want the whole data set that we have in our array,
we can just load a selective subset.
And then put this compressed at the beginning of our vector.
So then we can also efficiently work on that.
There's additional operations, and these exist as intrinsics,
but some of them are actually not implemented in hardware, but will be translated into multiple rounds of individual operations on the CPU.
Basically, there is no dedicated unit on the CPU for this operation, so it will take
multiple cycles to execute.
And one of those would basically be a gather operation.
So this is different in that you can say:
I want to directly load a certain set of values
into certain lanes.
You basically load values from non-contiguous memory regions.
You can say in my first element,
for example, in my target vector,
I want to have the element number 2 in my base vector,
so in my base data set. So this basically means:
the first index says 2, so I count to position
0, 1, 2, and that element will be the first element in the load.
And then we have 9 — so 0, 1, 2, 3, 4, 5, 6, 7, 8,
9, we're here, the tenth position basically —
and that element will go to the second position.
And this gives you a way to basically do sorting —
so a shuffle at the same time as some kind of compression, etc.,
like we had earlier.
But as I said, some of this is not implemented
as a single circuit in hardware,
but will be executed as multiple internal operations built from these earlier instructions.
They still exist as a single instruction, as an intrinsic, in AVX-512, for example.
And the same is true for a scatter operation.
So that's kind of the other way around.
So we have a contiguous region
that we basically load from,
and then we write this out
to scattered memory locations,
based on some kind of index vector
that tells us where the data needs to go.
And yeah, these will also not necessarily be executed fully in parallel, because of the cache: we cannot have an unlimited number of distinct memory accesses per cycle.
So that will basically hinder some of the parallelization there.
And that means, as I said, this will need multiple rounds of execution.
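In AVX-512, gather and scatter are single intrinsics; a hedged sketch with an element scale of four bytes:

```cpp
#include <immintrin.h>
#include <cstdint>

// Gather: lane i of the result is loaded from base[idx[i]].
__m512i gather_example(const int32_t* base, __m512i idx) {
    return _mm512_i32gather_epi32(idx, base, 4);       // scale = 4 bytes per element
}

// Scatter: lane i of 'vals' is written to base[idx[i]].
void scatter_example(int32_t* base, __m512i idx, __m512i vals) {
    _mm512_i32scatter_epi32(base, idx, vals, 4);
}
```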
So let's look at the filtered column scan.
So this is basically the traditional way of a simple filtered column scan if you would
implement it in a scalar way.
And here the result that we want to have is basically, we want to get the positions in the array which match our filter.
So, basically, we're not outputting the results or the values that match, but we're outputting the rows that match,
so the row numbers.
And that's also important
because that's the same thing
that we're going to do later on
in the vectorized version.
So basically, we have our column —
which is just an array —
and we go through the column value by value
up to the column size. For each value in the column,
so for each row basically, we're checking the filter
condition, which here is a greater-than comparison,
and if we have a matching row,
then we add it to our output vector:
at the next position of the output vector,
we write the row number. So that's what's done here.
So if we're filtering this array — 4, 13, 18, 4, 2 — and our filter condition is greater than 9,
so we look for everything that's greater than 9,
then we get the positions 1 and 2, because 13 and 18 are greater than nine.
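A minimal sketch of this scalar version (the names are illustrative, and the output buffer is assumed to be large enough):

```cpp
#include <cstddef>
#include <cstdint>

// Returns the number of matches; 'out' receives the matching row numbers.
size_t filter_scan(const int32_t* column, size_t n, int32_t threshold, uint32_t* out) {
    size_t out_pos = 0;
    for (size_t row = 0; row < n; ++row) {
        if (column[row] > threshold) {
            out[out_pos++] = (uint32_t)row;   // store the row number, not the value
        }
    }
    return out_pos;
}
```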
And that's something that we can now translate into a SIMD expression.
And so doing these comparisons as a vector
basically means we're using our vector types.
So here we're saying: yes, we want a vector data type — this is the GCC vector type.
And we need a second one, which is basically our Boolean vector,
telling us whether the individual row matched or not. And then we have our filter function.
And we start by getting the right values,
in this case four at a time.
We read our value vector,
then do the comparison,
where the result will not be a branch but a Boolean vector.
So this is our result vector, the Boolean vector.
This will be converted into a bitmask.
And then we're doing this left-pack operation —
that's a function that we have to define separately; I will show it on the next slide.
And then we increase our output position.
The output is basically the same idea: it's again a vector, containing the compressed
results, so we're only adding the rows that actually match our condition.
And the left-pack operation does exactly this:
using the result bitmask, it selects only those rows that matched
and moves them to the front.
And then we can increase the output position
by the number of matching tuples.
For that we count the number of ones in the bitmask —
these are all the matches —
and that's how much we advance our output pointer,
so that in the next loop iteration
we continue the output from the next empty position.
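A hedged sketch of that loop with the GCC vector extensions (4-lane case; n is assumed to be a multiple of four, the output buffer needs a few elements of slack because a full vector is stored each iteration, and left_pack is the helper discussed next — it is sketched after the following discussion):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef int32_t  v4si __attribute__((vector_size(16)));
typedef uint32_t v4ui __attribute__((vector_size(16)));

// Moves the lanes of 'rows' selected by 'mask' to the front (defined further below).
v4ui left_pack(v4ui rows, int mask);

size_t filter_scan_simd(const int32_t* column, size_t n, int32_t threshold, uint32_t* out) {
    size_t out_pos = 0;
    v4si thresh = {threshold, threshold, threshold, threshold};
    for (size_t row = 0; row < n; row += 4) {
        v4si values;
        std::memcpy(&values, column + row, sizeof(values));
        v4si match = values > thresh;             // lane-wise: -1 (all ones) or 0
        // Boolean vector -> 4-bit scalar bitmask, one bit per lane.
        int mask = (match[0] & 1) | (match[1] & 2) | (match[2] & 4) | (match[3] & 8);
        v4ui rows = {(uint32_t)row, (uint32_t)row + 1, (uint32_t)row + 2, (uint32_t)row + 3};
        v4ui packed = left_pack(rows, mask);      // matching row ids moved to the front
        std::memcpy(out + out_pos, &packed, sizeof(packed));   // extra lanes get overwritten later
        out_pos += __builtin_popcount(mask);      // advance by the number of matches
    }
    return out_pos;
}
```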
Okay, so this is where we have the problem
that we cannot really work with the compiler extensions —
or we can, but it's not going to be as efficient —
because this is where we want to use this compress function.
So if you remember, we said for the compress: we have our vector, which has some values that are relevant for us and some values that are not,
and with the mask we move those that are relevant to the beginning of the vector.
Then we can just say: only those at the beginning of the vector
we're going to add to the output; the rest we throw away and don't look at anymore. And the AVX-512 operation, this _mm512_mask_compress, will do exactly this. However, there is no GCC
builtin for this. If we want to do it with the GCC extensions, then we need to use shuffling. And the shuffle
basically says, okay, we're going to shuffle somehow —
but then we also need to know which elements we actually want to select. So we need
a vector that tells the shuffle to move those that we want
to the beginning, and we somehow need to remember which ones we want. So the main problem is that we
need to create this kind of translation vector
to move these elements to the front.
And with the compress function,
we get something like a 30-times speedup over a scalar version
with 512-bit registers.
With the shuffling, you still get a speedup,
but it's basically half of that,
because we need to create a lookup table
that tells us which shuffle mask to use.
And what we do is exactly this. On AVX-512
we have the compress function, so that would be the
implementation of this left pack on AVX-512. With the GCC vector extensions, the code is generic,
will run on any instruction set, but we have this bitmask, and
using this bitmask we translate to a shuffle vector
that then gives us the right shuffle result.
So this is basically just a lookup:
for each bitmask that can occur, we store the correct
shuffle vector. It's just 16
entries that translate to the right shuffle.
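A hedged sketch of the two variants for the 4-lane case (the table entries follow the pattern described: one shuffle vector per possible 4-bit mask, with the trailing lanes as don't-cares):

```cpp
#include <immintrin.h>
#include <cstdint>

typedef uint32_t v4ui __attribute__((vector_size(16)));

// Variant 1: AVX-512 (with AVX-512VL for 128-bit registers) — a single compress.
__m128i left_pack_avx512(__m128i rows, __mmask8 mask) {
    return _mm_maskz_compress_epi32(mask, rows);
}

// Variant 2: generic, via shuffle plus a 16-entry lookup table.
// Entry i lists the lanes whose bits are set in i, in order, packed to the front.
static const v4ui shuffle_table[16] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

v4ui left_pack(v4ui rows, int mask) {
    return __builtin_shuffle(rows, shuffle_table[mask]);
}
```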
Okay, questions so far? No questions? Then we're going to do... oh yeah?
What's the way to go for AVX2 regarding the left pack?
Good question.
I have to check.
So I think then, again, if you don't have AVX512,
you would probably use this generic implementation.
I have to check if there is...
I think the compress store is only in AVX-512, though.
Well, I would have to look it up. More such good questions.
Oh, okay. Then we're going to do a quick five-minute break here and then continue with some more examples.
Okay, so let's get started again. So now I want to show you two
more examples of like slightly advanced ways of using SIMD instructions in database operations.
And then from there, you can basically go and think about different things.
So one thing, for example, is if you have kind of bit-packed integers,
then, so say, for example, you have reduced precision,
or you basically have, say, 9-bit numbers that you want to store
or that you have stored densely in your memory
then we can use SIMD to
decompress this. This happens in database systems
if you know you don't need the full range of integers,
but you need, say for example, 9 bits to fully represent the data —
then you really want to pack the values densely in memory.
Of course, they are not aligned.
If you want to work with the data, then you have to decompress it again.
And this is also a problem because memory is only byte-addressable,
so we need to do some shifting back and forth
in order to get proper output, that is,
32-bit numbers that we can then continue working on with the CPU.
And the scalar approach to this — so the idea is really that your data is packed:
the first number is 9 bits, then the next 9-bit number, and so on —
is basically to track the used bits,
copy the 4 bytes containing all the bits of one 9-bit number into a register,
and then shift the bits so that you have only the data
for the individual number that you need.
You shift out any unused leading bit,
and the trailing bits, which of course already belong to the following numbers
because everything is densely packed, you zero out.
So you copy data over, shift out everything that's not part of the current number,
and zero the remaining bits.
And you're going to continue this with every next number. So you basically always have to copy a
certain amount of data into your registers and then move the data around. And the same thing
we can also do with a vector approach. So initially — and I maybe should have
animated this — if we load the input, these are basically our lanes of
four bytes each, and the individual numbers do not adhere to the byte boundaries or the byte width,
and also not to the lane boundaries.
So this is number one with nine bits, number two with nine bits and so on.
And you can see this is basically messed up in the memory.
So the first thing we do is basically we shuffle the bytes
in order to move the numbers
to the target lanes.
And the idea here is that in each lane,
in each word, we want to end up with one number, right?
So we want to have this nine-bit number in 32 bits.
And this is why, while initially loading the data,
we're not using anything beyond number four —
only number one to number four
will actually be used in this load,
because everything else will be shifted out.
We cannot keep it in our register at this point,
because we only want to have four numbers in our register later on.
So we're moving, we're shuffling,
in such a way that the bytes are at the right order.
So we get number two in the second lane, number three in the third lane,
number four in the fourth lane.
And we're just copying the bytes.
So this means we still have these empty spaces here,
or this is basically the unused leading
bits from the previous copy.
So here, for example, for number 2, there's still one bit from number 1 in there.
For number 3, it's then two bits, and correspondingly more for the others as well.
So then in the next step, we do a shift to remove the unused bits,
so the unused bits end up at the end — or rather, we shift them out.
And then we mask off the most significant bits,
so we have leading zeros.
I'm also always getting confused here:
of course, in memory,
this end is the most significant bit,
so we basically have to read it backwards,
and then we just have leading zeros —
we're filling up the lane with leading zeros from here on.
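A hedged sketch of one such unpacking step — four 9-bit values into four 32-bit lanes — assuming little-endian packing with the first value in the lowest bits and at least 16 readable bytes at the input; the shuffle control and shift counts follow the scheme just described:

```cpp
#include <immintrin.h>
#include <cstdint>

// Requires SSSE3 (byte shuffle) and AVX2 (per-lane variable shift).
__m128i unpack4_9bit(const uint8_t* packed) {
    // Load 16 bytes; only the first 5 bytes (4 * 9 = 36 bits) are actually needed.
    __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(packed));

    // 1. Shuffle: copy, into each 32-bit lane, the two bytes that contain its 9 bits.
    //    Value i starts at bit offset 9*i, i.e. in bytes (9*i)/8 and (9*i)/8 + 1.
    const __m128i shuffle = _mm_setr_epi8(0, 1, -1, -1,   // value 0: bytes 0..1
                                          1, 2, -1, -1,   // value 1: bytes 1..2
                                          2, 3, -1, -1,   // value 2: bytes 2..3
                                          3, 4, -1, -1);  // value 3: bytes 3..4
    __m128i lanes = _mm_shuffle_epi8(in, shuffle);

    // 2. Shift each lane right by (9*i) mod 8 to drop the leading bits that still
    //    belong to the previous value.
    const __m128i shifts = _mm_setr_epi32(0, 1, 2, 3);
    lanes = _mm_srlv_epi32(lanes, shifts);

    // 3. Mask off everything above bit 8 (the trailing bits of the next values).
    return _mm_and_si128(lanes, _mm_set1_epi32(0x1FF));
}
```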
And after... Yes?
So in the first step we loaded
like 16 bytes, I think.
Is it like
preferred to always load
as much memory as fits into the
big register, or
is it better to only load...
That's a good question.
So the question is whether we could also load just less than that —
we basically only need the first four numbers.
I'm assuming it doesn't really make a difference in terms of like performance.
We would need some extra logic to only load less than that.
And so that's my assumption.
We won't see much of a difference here.
So then, after processing four numbers of the input,
we have all the data:
all individual numbers are correctly in the register
and we can continue with the next round.
With 256-bit vectors we can of course do 8 numbers at a time, and then this actually works out better,
because 9 times 8 is 72 bits, which is divisible by 8.
So after each block we're byte-aligned again and we can repeat more easily.
Okay, so the last thing I want to present to you today
is the SIMD tree search.
So this is kind of a cool way also of using SIMD
in data structures.
So, I mean, basically if you have something like a binary tree, I mean, this is like very
simple binary tree.
So a basic scalar binary tree implementation would put the individual elements into an array: you start
with the root, of course, and then the first child is on the second
position, so position one, and the second child on position two. And in general,
the binary tree is usually laid out in memory such that the children of the node at position n are at positions 2n+1 and 2n+2.
This way, you can also easily traverse the binary tree.
So you're just jumping to these different positions.
But that also means that the children are actually
quite far away in the array.
Initially, of course, the root
is close to its first children,
but then the jumps to the next positions get larger and larger.
Still, you can quickly traverse this
in a scalar fashion: if you're
searching the tree, you're basically just jumping
to the right positions and trying to find the right element
through these jumps and comparisons.
And now the question,
if we're doing these lookups,
can we somehow do the lookup more in parallel?
So rather than checking one position at a time,
can we check multiple positions at a time? Can we somehow, yeah, get some SIMDification in there?
Of course, wherever we have branching,
we cannot really do any kind of SIMD.
So these large jumps in memory,
where we have to find something else depending on an if condition,
we cannot SIMDify directly.
But we can make sure that we work on a subpart of the data set —
and then the other question is: can we also do a vectorized inner loop?
First of all, we're basically going through multiple items, so we're finding multiple
items in our data set, so this would be okay.
But then in the inner loop, we're basically doing the actual lookups in the tree. So this is tougher to optimize because there we have data dependencies and we basically
have to speculatively somehow maneuver to the levels.
But if we have large jumps, we're also going to need a lot of different data in there.
So let's basically look what this would look like.
So how could we do this in parallel?
And like a speculative tree navigation would mean rather than just looking or checking
one individual item at the tree, we can also check multiple at a time. So say for example, we do two levels at a time
here, then we can basically
do three checks at a single round, in a
single step basically, then find the one that actually
matches, and then do the next round. Like in the next round
we can do three checks again. So if we
have four values in our registers, for example, then three nodes would actually fit at
the same time in our SIMD register.
Do we need to check node one if we check nodes two and three?
So that's a good question:
why do we need to check node one if we check nodes two and three?
Oh, I see. Never mind. I got it.
OK.
So of course, we need to basically continue in the tree.
So we need to know where to continue in the tree.
Or we could just be not in the tree at all.
Okay, so how can we get there?
So, with the initial layout that I told you about, this part here would still be easy —
it is close in memory. But from here on, these
elements will come after all the other elements of that level in the tree. So basically,
between the 6 and the 12, the 7, 8, 9, 10, 11 will be in memory. This is
inefficient in terms of loading, right? Because
we would somehow have to gather this from memory, and there's
no way this can be done efficiently through the cache. So rather than laying it out this way
in the memory, we can also lay it out in a kind of a packed way.
So we have to reorganize the data in memory such that we have individual blocks.
So we're blocking the binary tree into sub-blocks and making sure that these sub-blocks are
adjacent in memory.
And conceptually, this means: this is like one node,
and then we have essentially four sub-nodes of this one node, which again are each
closely packed in memory. So the memory layout changes, but then
everything that we need for the comparison is tight in memory.
And we can do a quick look up, or we
can do a quick check on these.
So the way how we then would basically do the comparison,
rather than doing this step by step scalar way,
we're going here, here, here, here, here, for example,
or another way, we can basically do a complete block at a time.
So we use just the SIMD compare.
The SIMD compare will give us a mask.
It will basically say which ones fit and which ones don't. So here, for example, we have this top part of the tree:
our binary tree starts with 41, and we're looking for 59. 41 is smaller than 59, yes, so
we need to go to the right.
So this will be smaller, this will be smaller,
this will be larger.
So we know we have to traverse this way.
So we would represent this, for example,
the ones that are smaller with a 1,
the ones that are larger with a 0.
So the SIMD compare will basically tell us: okay, this is smaller, this is smaller, this is larger.
And now we can move the mask into a bit vector that can then be loaded directly into a scalar register.
This is what I told you earlier, right:
we don't want to copy the data itself from the SIMD registers to the scalar registers,
because we would have to do that one element at a time.
Yes?
Do you think the main benefit of this
is that we can do multiple comparisons at once
or is it that we don't jump around in memory as much?
Because I would believe that the memory is the slow part,
and that we could just as well do scalar comparisons.
Both, actually.
So both contribute.
So the question is which one really helps:
is the memory blocking — that's what this is called — beneficial,
or is the SIMD comparison beneficial?
With the SIMD comparison we won't get a perfect speedup: we're looking at three numbers rather than
four at a time, but in a single operation, hypothetically, and of course
we need some additional operations. So we reduce the number of
cycles per individual comparison, but not by that
much. At the same time, the blocking helps with a better memory
layout. And this is in general a good idea to do. So even beyond using SIMD registers,
blocking the data, having a better memory layout
makes a lot of sense.
So it's a good point.
I actually have it in the slides as well.
So towards the end.
Okay, so we're copying the comparison vector:
with a move-mask operation, we can copy it directly into a scalar register.
And using the scalar register, we can then say,
okay, with a lookup table —
so we're also going to have a lookup table —
we can see which child we have to go to.
And for the lookup table: the comparison has only a few possible results.
Basically, either we don't find anything at all — even the root doesn't match —
or we have to go to the left, to the right, and so on.
So the possible results are: the key is smaller than everything;
it's larger than the left child but not larger than the root;
it's larger than the root but not larger than the right child;
or it's even larger than the right child. Anything
else is not possible — we cannot have it smaller than the left child and larger than the right
child, or something like that, right? These combinations would not work. So the bit mask
will only have a few possible configurations. Everything else would be an error in our programming, basically.
So what we need is a lookup table
that has just these four different options,
which correspond to the four different children we can go to.
Either the value is smaller than everything,
so we go to the left from the left child; or the value is
larger than the left child but not larger than the root, so then we go to the right
from the left child; or it's larger than the root — then it will also be larger than the left child —
but not larger than the right child, so we go to the left from the right child; or it's even larger
than the right child, so we go to the right from the right child. And this is
what we can represent in the lookup table: for each valid bit mask,
it directly gives us the index of the child where we continue. And then, of course, depending on the level
that we're in, we have to multiply accordingly, which gives us the memory position that we have to go to, where we find the next block.
And we can continue to do the same kind of SIMD operations.
We load the next block in our SIMD register, we do the same kind of SIMD comparison,
we create the move mask, do the lookup, which gives us the next child,
and get the next position, and so on until we're all the way at the end.
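A hedged sketch of one such descent step (SSE intrinsics; a block here holds three keys laid out as [root, left, right] plus one padding lane — the exact block layout and addressing are an assumption for illustration):

```cpp
#include <immintrin.h>
#include <cstdint>

// For each valid compare mask, the child subtree (0..3) to follow.
// Bit i of the mask is set if key > block[i]; only 0b000, 0b010, 0b011, 0b111 can occur.
static const int child_of_mask[8] = { 0, -1, 1, 2, -1, -1, -1, 3 };

// 'block' points to 4 ints: [root, left, right, padding].
int descend(const int32_t* block, int32_t key) {
    __m128i keys  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    __m128i probe = _mm_set1_epi32(key);
    __m128i gt    = _mm_cmpgt_epi32(probe, keys);               // lane i: key > block[i] ?
    int mask = _mm_movemask_ps(_mm_castsi128_ps(gt)) & 0x7;     // only the 3 relevant bits
    return child_of_mask[mask];        // index of the child block to load next
}
```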
And in general, this is where basically your question comes in. The moving or the blocking is a good idea in general.
So having a binary tree in memory that has very long jumps to the children
will be bad in terms of finding data quickly
because with every memory load, we will load whole cache lines,
and they will contain a lot of information that we don't need.
With this kind of blocked layout, when we
load data, at least most of it will be needed for the comparison — if we
have three values at a time, then only one lane is basically unused;
all the other comparisons actually make sense. We can also do
multi-level blocking: we can do the blocking for SIMD, we can
do cache-line blocking, and if we're using disk or larger memory pages, we can also do
page blocking, making sure at every level
that what we're working with directly matches
the granularity of the unit we're loading. And that will give us
better performance in any case. So here's an example of the results. This is, again, from this slightly older SIGMOD paper,
where they also ran this on GPUs,
so a similar way of executing
where you can also do the tree search faster — and you can also see that the trees are not that big
in the end.
But you can see: if you do the blocking —
the numbers here are normalized to a plain scalar search —
then just using page blocking already gives a decent speedup,
cache-line blocking gives an additional speedup,
and then using SIMDification, so using SIMD instructions, improves the speedup again.
And if we do software pipelining —
that's an additional step that you can look up in the paper —
we get even faster.
So what we're looking at is
roughly a three-fold speedup over the scalar version,
where a lot of it actually comes from the blocking.
The SIMD will also give us some, let's say,
10% additional speedup. So it's not going to be a threefold performance improvement
out of SIMD alone, but it will be an improvement in the end. And if we're looking at very small trees — 64,000 keys, so that's not a lot —
then the blocking won't help as much, because most of the tree will be in the caches anyway,
but the SIMDification will still give us something like a 15-20% performance increase.
Okay, so much for today.
So I hope you can see,
it's kind of a different way of thinking about how to program SIMD:
forget about branches, basically.
Think about multiple items
at a time, and then about how to deal with whatever you would want
to do with conditions by using masks and by moving
data around in memory.
So that's kind of the gist of SIMD programming, I would say.
And then you have to see if you can somehow have enough data.
And of course, it needs to be properly in memory.
So can you lay the data out in memory well enough
that you can quickly iterate over it in the registers
and in the SIMD unit?
Okay, questions so far?
No questions right now?
OK, next time we're going to talk about execution models,
meaning given a query plan, how do we actually
want to execute this in the database?
So do we compile it to C++ code and use a compiler?
Do we interpret it in some kind of bytecode?
Do we do other kinds of iterative approaches?
And we'll talk about the different trade-offs in there.
Okay, with that, thank you very much and see you all tomorrow.