Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Multicore Parallelism
Episode Date: May 31, 2023...
Transcript
Okay, so I guess we can get started.
Welcome everybody to our next session.
Today we're going to talk about multi-core parallelism once we've finished up ART.
But before that, I want to make two or three announcements.
One important one: we have an anonymous feedback form in Moodle.
You can see it right beneath the multi-core slides.
It's very short, just five questions.
Just a short midterm evaluation. The idea is that you give us feedback
on whether the course matches your expectations and everything is the way you expected,
or if there's something where you think we can improve.
Because if we get the feedback now we can actually still do something, right?
And this will also help us adjust and make sure that the
expectations for the course are clear. So please fill this out and help us improve the course.
It's anonymous, so we're not going to trace it back to you. And if you give us
critique, try to make it actionable.
So I mean, if you say, well, the course is stupid,
there's little we can do.
If you tell us what exactly is stupid
and maybe even what you think would be better,
then we can actually improve ourselves in the course.
And that's what we actually want to do,
to make sure that the course is something
that you enjoy and we enjoy.
Okay, so today we're going to talk about multi-core parallelism, but before that we'll finish up
ART, because we kind of got stuck in there last week. Yesterday you already had the task
description, so you maybe already have a bit of an idea.
One additional thing: on Tuesday we would normally have a Q&A session, but I'm not going to be here,
so there will be no class at all.
And then we're going to continue. I'm pretty sure I'm not going to finish multi-core today, so we're going to continue with multi-core, then start locking, maybe finish up locking,
and then continue as before.
Then we'll have something about NUMA and so forth.
But before we go there, we're going to go back to art.
So, the Adaptive Radix Tree. Before I get there, just briefly as a reminder: what is a trie, or radix tree, or prefix tree, or digital tree?
This is a special type of search index or tree structure where we're not deciding on the full key at every
tree node, but on prefixes of the key. That's the important difference.
If we have a constant key length, so all of our keys are the same length,
then all of our leaf nodes, all of our values, will be at the same height. So we automatically get a balanced tree, just because we're only indexing the key piece by piece.
So we're splitting up the key into different parts and indexing those.
In general, these pieces could be, let's say, letters of the alphabet.
But we already said this doesn't really work that well.
So typically we're using a radix tree, meaning we're splitting up the key into bytes or into groups of bits, and at each node we're deciding which way to go based on those bits.
And we don't have to store the keys in every node, because they're implicitly stored in the path from
the root to the leaf. So we have this example here: we're starting with an A, for example, as the first digit.
And since all of the values that we have start with an A,
we don't have any additional children, just one.
So there's no search required here.
We're just directly following the one path to the next node.
And there we can already split up based on N and R.
Typically we have lexicographical order, but if we're working with bits we're just interpreting them as binary
numbers. So then we have two children, and from there we have three children: D,
T, Y. These could directly be pointers to the values,
or there could be one additional indirection, so addresses that point to the values themselves.
So one additional pointer, basically.
And we already said in general how this works, and there is this trade-off between
height and space.
And this is kind of the problem.
So we said in a radix tree, rather than looking at letters of the alphabet
or something like that, we're actually looking at bits.
So let's say six bits, for example, or two bits. And that directly decides how many
children we will have for each node. And then we have a trade-off: if we're
only looking at one bit, this means we'll have a binary tree,
so each node will only have two children.
And that means automatically we get a higher tree.
But we're using less space, because in a binary tree
we waste at most half of the node, essentially.
If we only have one child,
we'll basically have one pointer wasted. But if we have larger nodes,
meaning more bits that we're looking at, then the tree will be
flatter, but in each individual node we will, or potentially will, waste more space.
So this is the efficiency trade-off in the end, and it depends on the type of values that we want to store,
whether they're dense or sparse.
Sometimes larger nodes are better, sometimes smaller nodes are better.
If you have very sparse values, you'll probably have many empty or almost empty
nodes, and then very wide nodes will be inefficient. In order to be more flexible there, the ART index was created.
So as I said, I mean, this is kind of this space problem and an alternative to that is
having multiple different node sizes.
That's what the ART or the Adaptive Radix Tree does. So depending on the number of children
that we have, so the number of different values
we have for a certain prefix, we can decide
what kind of node we're going to use here.
And in ART,
we have a fixed subset of the key that we're using.
We're always going to use 1 byte, or 8 bits, which means each node can potentially distinguish 256 different values.
And we'll split up any kind of key into exactly these subsets.
Say, for example, you have this integer key: 218,237,439.
This is our key.
We're going to convert this into a bit representation
and we'll split up this bit representation.
So this is a 32-bit unsigned
and we'll split it up into blocks of 8 bits.
So here you can see this is basically our 8 bits
and it's a bit small.
I mean, you get the idea and you can see it in the slides if you check them later.
We can see, okay, this is the complete bit representation of the integer key.
And then we take the subsets of 8 bits, and we interpret those as the individual prefixes.
And these, I mean, we can think of as individual numbers in the end.
So, say, for example, this is something that we can easily still see.
This would be 2, for example.
This is also still easy to see.
It's 255, all ones, basically. And this is
our byte representation of the key. Since we have this integer key, we have four bytes,
meaning we'll have four prefixes. This also means, if we don't allow any longer keys, that this is also the maximum tree height. In this case the tree can be at most four levels deep.
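Just to make this concrete (this snippet is my own illustration, not from the slides), extracting the four 8-bit prefixes from a 32-bit key might look like this in C++:

```cpp
#include <cstdint>
#include <cstdio>

// Splitting a 32-bit key into its four 8-bit prefixes (most significant byte
// first), exactly as a radix tree with 8-bit spans would consume it.
int main() {
    uint32_t key = 218237439;  // the example key from the lecture
    int prefix[4];
    for (int level = 0; level < 4; ++level) {
        prefix[level] = (key >> (8 * (3 - level))) & 0xFF;  // shift and mask out one byte
    }
    printf("%d %d %d %d\n", prefix[0], prefix[1], prefix[2], prefix[3]);  // 13 2 9 255
}
```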
Okay, so far so good. This is still basic radix tree stuff: any
radix tree would work like this if our span size is 8 bits.
Now the Adaptive Radix Tree says, okay, we're
going to have four physical representations of inner nodes for this
quite high fanout. So, fanout means how many children can one node have.
And in this case, because we're using 8-bit,
each node can have 256 different children.
So from 0 to 255, all possible combinations.
And in some cases, this might actually be justified. If you think about random numbers,
or even not random numbers, say a consecutive range of numbers, then
most likely the least significant bits will be dense if you have the keys
consecutively.
So basically all of these values will exist somewhere. However,
the most significant bits, if we're starting from zero, might not be. So in some cases we might have dense levels, meaning all of the values will be set.
So we will have values for all bit representations.
And in some cases we don't.
Then we might have zero children, or just a few, and it makes sense to have smaller
nodes that don't need room for all 256 different values. The most general case, a
regular radix tree, would basically say: on every node I need to be able to store 256 values. But now we're going to have these
different representations, where depending on how many children we actually have,
we're going to use a different representation. And this actually makes sense. If you look
at the previous slide briefly, we can see that in this tree there are situations where certain nodes will not have all of the different values in them.
For example, here I could have the whole alphabet, but I
only have N and R as pointers. So that's the case that we're looking at.
Additionally, what can happen is that we have sub-paths in our tree where nothing changes. Like in the initial example,
everything starts with an A, so we only have a straight path, we don't have
additional children, and in this case there's no decision to be made.
So we would just need a chain of nodes that
eventually point to the same thing.
And in there, there's an additional optimization that the adaptive radix tree has, which is path compression.
So these kind of paths, they can be collapsed.
We just say, okay, all of the nodes below this will start with the same prefix.
So we're just going to combine all these prefixes
and only store the subchildren.
All of a sudden, the tree is not balanced anymore, right?
Because we're basically losing some entries
or we're losing some children in there.
But that's fine, right?
So in the end, we still have this maximum depth
that depends on the key length.
We just avoid the additional inefficiency of hopping through a couple of nodes that don't actually make any decision.
And we're also using lazy expansion, meaning we're not going to expand a path unless
we actually need to. And the Adaptive Radix Tree tries to align all of the nodes with cache lines, so
that we have efficient lookups in cache and we're not going to hit memory all
the time, and it's SIMD friendly, so we can use SIMD code for the lookups.
Okay. To have these four different
representations, we basically have four different node types, which
will be used based on how many values we actually have in there.
So how many children does a node have?
And the smallest one is Node4,
which can store up to four child pointers.
Typically two to four, because if we have only one child,
we'll basically do this path compression. With
only one child, we'll just remove the whole node completely and go directly to the next node
where we actually have to decide something. One thing to keep in mind: if we do this path
compression, we actually need to somehow store the prefix again, because we're jumping. And if we're looking for something
that doesn't match the path that we're jumping over,
we still have to check the key,
because it's not necessarily stored anymore,
or not completely stored anymore, in our path.
Anyway, so here, the Node4, as we said, has from 1 to 4, or really 2 to 4, children.
Most of these nodes have two parts:
first we have the keys, and then we have the child pointers.
And the keys here are directly stored in an array of size 4.
It's a sorted array, and the
first child pointer is stored in the first array element, the second
in the second, and so forth. So this directly says: for the child that starts on this level
with the prefix 0, we have the pointer that goes here. The two arrays are just
directly aligned to each other. If we have the prefix 2, this one goes from here.
If we have the prefix 255, this one goes from here.
And if we don't have a match in here,
then this prefix doesn't exist on this level.
In order to find something,
we have to go through this array
until we find our value.
And it's SIMD friendly, right?
So this is something we could actually do
in a single operation in a SIMD register.
It's four elements, four pointers, so this would be something that fits nicely, for example.
But if we have a scalar implementation, we just go through linearly.
And it's only four values, so it's not really that costly in the end.
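As a rough sketch (my own simplified layout, not necessarily the exact one from the paper or the programming task), a Node4 lookup might look like this:

```cpp
#include <cstdint>

// Hypothetical sketch of an ART Node4: a sorted array of partial keys plus a
// parallel array of child pointers.
struct Node4 {
    uint8_t num_children = 0;
    uint8_t keys[4];      // sorted partial keys, one byte each
    void*   children[4];  // children[i] belongs to keys[i]

    // Linear scan is fine for at most four entries (and maps nicely to SIMD).
    void* findChild(uint8_t key_byte) const {
        for (int i = 0; i < num_children; ++i) {
            if (keys[i] == key_byte) return children[i];
        }
        return nullptr;  // this prefix does not exist on this level
    }
};
```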
So this is the smallest node.
Then we have the Node16. That's basically for larger nodes, and it has the
exact same structure as the Node4, it's just larger: it can store up to
16 elements. Again, this is because sometimes we have sparse
data, we have some branches in here, so the prefix will be used by
multiple children, but by very few. If it's very few, we'll use the Node4; if it's up to 16, we'll use the Node16.
So the idea is basically the same: we have an array of length 16 for the keys, and then we have 16
child pointers, and we still have to go through this array of keys to find our element.
In this case, because it's 16, it already makes sense to do binary search in here.
This is sorted.
Sorry, I'm going to stop my laser pointer here.
This is sorted in here.
So with 16 values, we can actually start jumping in the middle and do the binary search in here. We don't have to go through linearly. That doesn't
really make sense for four values. And then again, there's the direct mapping
from the 16 array elements here to the 16 pointers here.
So if we find our prefix in here, then we know directly, okay, we have to follow this
pointer.
Okay?
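Again as a hedged sketch (same simplified layout as before), the Node16 lookup with binary search could look like this:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of an ART Node16: same idea as the Node4, just with 16
// slots, so a binary search over the sorted key array pays off. (A real
// implementation could also compare all 16 keys at once with SIMD.)
struct Node16 {
    uint8_t num_children = 0;
    uint8_t keys[16];      // sorted partial keys
    void*   children[16];  // children[i] belongs to keys[i]

    void* findChild(uint8_t key_byte) const {
        const uint8_t* end = keys + num_children;
        const uint8_t* pos = std::lower_bound(keys, end, key_byte);
        if (pos != end && *pos == key_byte) return children[pos - keys];
        return nullptr;
    }
};
```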
It's quite easy so far.
I mean, the whole thing is easy, it's just
the multiple different representations that make it a bit more complex. Then, if
we have more than 16 children, but fewer than 49, we have the Node48. The Node48 is an additional representation where we can store
17 to 48 children. We could also store less, but again, we would only use this if we cannot use the Node4
or Node16. And here, rather than storing an array of 48 keys
and searching through it,
we directly store an array of 256 entries,
but we only add offsets into the child array at the positions
where we actually have children. So in here we can
say, okay, for each prefix where we actually have a child node,
for 0 we might have a child node, then we store the offset to the correct
slot. For 1 we don't have a child node, so for the prefix 1 there are no children,
and we're not going to store an offset there, we just
mark this as empty. Otherwise we have the offset into this array of 48 child pointers.
And that still gives us a quite efficient representation,
because we only need a few bits to represent each offset.
So we don't need full addresses or anything for these offsets,
and otherwise we're just not storing anything.
It's still a fixed-size array.
And in this case, we don't have to search for the key;
we know directly.
For each individual prefix, we have an entry:
either there is an offset or there is no offset. If there is no offset, then we know there is no child for this prefix.
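A possible shape for this node (again my own simplified sketch, with an arbitrary choice of empty marker):

```cpp
#include <cstdint>

// Hypothetical sketch of an ART Node48: a 256-entry child index (one byte per
// possible partial key) plus up to 48 child pointers. No search is needed, the
// partial key indexes the child_index array directly.
struct Node48 {
    static constexpr uint8_t EMPTY = 255;  // marker for "no child"

    uint8_t num_children = 0;
    uint8_t child_index[256];  // maps key byte -> slot in children[], or EMPTY
    void*   children[48];

    Node48() { for (int i = 0; i < 256; ++i) child_index[i] = EMPTY; }

    void* findChild(uint8_t key_byte) const {
        uint8_t slot = child_index[key_byte];
        return slot == EMPTY ? nullptr : children[slot];
    }
};
```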
Yes?
Is the only reason for this layout that we don't have to do a binary search
when we have that many values, instead of storing
keys like in the smaller nodes, where we have the 16 keys?
Yes.
So it's basically a faster lookup.
And at a certain point, it doesn't really
make that much of a difference anymore,
because we just need to, I mean, the overhead is not
that large, basically.
So we're faster.
We're also faster in inserting, because we can basically
directly insert the offset, et cetera.
And so that will make it more efficient.
So that's the reason, basically.
Okay?
Okay, and then if this is still too small,
if we have more than that,
then we say,
okay, now let's take a full node, where we can store from 49 to 256
individual children. In this case we will have a
complete array of 256 entries, where we directly have the pointers to the children,
or we have no pointer, which means
we don't have this prefix.
Okay, so quite easy.
I mean, this node will be much larger, but again,
it will only be used if we actually have this very high fanout already.
A fanout of 49 is already pretty large, and the node is automatically at least about one-fifth full,
because otherwise we would always use the other node types.
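And the corresponding sketch for the full node (again my own simplified version):

```cpp
#include <cstdint>

// Hypothetical sketch of an ART Node256, the "full" node used once there are
// more than 48 children: the partial key indexes the pointer array directly,
// and a null pointer means the prefix does not exist on this level.
struct Node256 {
    void* children[256] = {};  // zero-initialized, nullptr = no child

    void* findChild(uint8_t key_byte) const {
        return children[key_byte];
    }
};
```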
Clear so far?
Okay.
And of course, all nodes have a small header that contains the node type,
so we know which kind of node we have. You will see all the details,
to an even higher degree than what I'm presenting here, when you program this.
The header also contains the number of child nodes and
potentially the compressed path.
Because we said we can compress the path: if we only have one child below,
then we don't need an extra node for it.
We can just compress this into the next node, but only up to a certain length. We can also compress more, but
then we have to check further down whether the path we're following is actually still correct, because
at a certain point we cannot check our key byte by byte anymore; we're jumping
within the key, and what we're jumping over might actually be wrong. So there we need to do additional checks.
Okay.
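A possible shape for that header (the field names and sizes are my own guesses, not the ones from the paper or the task description):

```cpp
#include <cstdint>

// Every node starts with this header, so a lookup can dispatch on the node
// type and check the compressed-path prefix before descending further.
enum class NodeType : uint8_t { Node4, Node16, Node48, Node256 };

struct NodeHeader {
    NodeType type;
    uint16_t num_children;
    uint32_t prefix_len;   // full length of the compressed path (may exceed 8)
    uint8_t  prefix[8];    // at most 8 bytes of that path stored inline
};
```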
Now, the leaf nodes.
If the index is unique, the simple case works;
otherwise we need an additional indirection in the leaf nodes.
Obviously, just like in a B-tree or any other tree,
if keys are not unique,
then we need some additional handling in the leaf nodes, because somewhere down there
we need to be able to store multiple leaves per key. This means
we'll probably have some special type of leaf node where we have the pointers to
the additional duplicates. And we can have different kinds of leaf
nodes. The most generic version would be single-valued leaves that store one value,
and then this value can be whatever it wants. But of course there's an additional step that we need to take:
if we have one leaf per value, that means there's one extra hop from the last inner node
to this one leaf.
And if we need multiple leaves, or we do something like a range lookup, that will be less efficient. Because of that, we can also have multi-valued leaves,
which store multiple values.
Then they might have a different structure.
And this works well
if all of the keys have the same length,
and maybe also the values have the same length, so we have
a nice structure we can easily store. If not, then we have to do some additional work.
And finally, if our values are the same size as the pointers that we're using,
then we can actually use the same structure
that we have in the inner nodes.
We can store the leaf values in exactly the same way that we store
the pointers in the inner nodes.
And this will basically be the most efficient,
because then we again have these different kinds of node types,
we have nice cache alignment, and we have fast lookup speed.
Okay, so let's go through this in an example.
So this here is a tree of four levels, which is what we
would expect if we have this byte representation of 32-bit keys with
8 bits per level. In this case, our root node
is a Node4 with actually only three entries.
So we only have three children.
The second level that we're going to go to is a Node48,
so up to 48 different children, then a Node16, and then a Node256.
And again, this is a tree, right?
So we're just looking at one path
from the root to the leaf. There will of course be multiple nodes on each level,
except for the root; otherwise we would compress the path. So we're looking up our integer key that we had before, 218,237,439. We're splitting this up into its byte representation,
so our first byte would be 13, then 2, 9, and 255.
So first we're going to look in the root node.
In our example here, 13 will be the first prefix in the root node,
so at position 0.
We take the key array and start
scanning it, and the first entry is already the right one. So we know the first entry in the
child pointers is the one that we're following. Here we get the address of the next
node that we have to look at; this is a Node48. In the Node48, remember, we have 256
slots in our index array, where we look up the offset into the 48-entry child pointer array. In this case we go with the 2, so at positions 0, 1, 2 we look
for an offset, and luckily we actually have an offset there, meaning we can follow
the pointer. This will be position 1, so the second entry in the child pointers.
Somehow I don't know why the pointer is drawn here, it should probably be there,
because we're going from the 2. Well, actually, here it might not even be sorted, right?
In this case the child pointer array doesn't need
to be sorted, so we can go directly there. From there, in this example, we go to
the Node16. In the Node16, again, we don't have these direct offsets; we have an array of 16 keys
with a direct mapping
from this array to the child pointers.
So here we're going to look for the 9.
We find the 9 at position 2,
and then, because it's the third entry here, we directly know
where to go. On the last level we have a Node256, so there we can directly jump to the right
position to find our key. And in this case we have a tuple identifier stored directly there:
rather than storing a child pointer,
we have the tuple identifier, which points us directly to the
correct data item that we're looking for, which we could then use for
further lookups, et cetera.
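Putting the levels together, a point lookup is just a loop over the key bytes. This is a rough, simplified sketch in my own structure (virtual dispatch instead of the type tag a real implementation would use), with path compression left out:

```cpp
#include <cstdint>

// One key byte is consumed per level; each node type knows how to find the
// child for a given byte.
struct Node {
    virtual ~Node() = default;
    virtual Node* findChild(uint8_t key_byte) = 0;  // Node4/16/48/256 override this
    virtual bool  isLeaf() const { return false; }
};

struct Leaf : Node {
    uint64_t tid;  // e.g. the tuple identifier
    explicit Leaf(uint64_t t) : tid(t) {}
    Node* findChild(uint8_t) override { return nullptr; }
    bool  isLeaf() const override { return true; }
};

// Walks at most four levels for a 32-bit key.
Leaf* lookup(Node* root, uint32_t key) {
    Node* node = root;
    for (int level = 0; level < 4 && node != nullptr; ++level) {
        if (node->isLeaf()) break;
        uint8_t key_byte = (key >> (8 * (3 - level))) & 0xFF;  // 13, 2, 9, 255 for 218237439
        node = node->findChild(key_byte);
    }
    return (node != nullptr && node->isLeaf()) ? static_cast<Leaf*>(node) : nullptr;
}
```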
OK, so I already mentioned this a couple of times:
if we have very long keys,
the tree might be very high.
Say you have 128-bit keys; then with this byte representation, rather than four levels, we'll have 16 levels.
That's already quite a few lookups that will cost us something. So in this case we actually want to somehow reduce the height.
And there are basically two things we can do while adding data to the tree:
we only add additional nodes if we actually create
distinctions, if we actually have to distinguish keys. So there are two ways.
Lazy expansion means: if we're adding new
values to our tree, but we don't need to
distinguish between two nodes,
we can go directly to one node.
For example, if we have only a few children that
will all fit into a single leaf or something like that, then we will
not expand the path, but keep this stored within
the same child node,
and then do the distinction in the child node
rather than going through the path.
At the same time, if we have nodes within our path that
don't differentiate anything, where we only have one child
under the node, then we'll remove this node.
So we compress a node, or even a chain of nodes, where there is only a single
child. But of course, when doing so, we need
to somehow store the left-out part of the key. Here, for example, we see that while B actually
points us to something new, A doesn't give us anything new here.
All of the children of this subtree will have an A at this position.
So we can just remove this level,
but we need to store the A somewhere.
We need to store this compressed path in the node.
Otherwise, if we search for something like "bore"
or "boor" or whatever,
it would also be found here; we can't just drop that check
when we remove the level. So this A needs to be stored, and it will be stored in the node,
up to a certain prefix length.
And in ART, this is up to eight bytes.
But sometimes this might even be too little, right?
Up to eight bytes, we'll just store in here.
And then we can compress
up to eight levels,
because we have one byte per level. This means we'll be able to remove
a chain of up to eight nodes. But the common prefix might be even longer if you have very long keys.
In that case, we basically
have to compare on the tuple level.
So we might still compress, but then at this level
we have to check: whatever we left out in there,
was it actually correct?
Or did I find a match that somewhere in the middle is actually not correct?
I hope that's kind of clear.
Okay, so let's look at some performance numbers or why this is good.
First of all, there's the
space consumption. On this plot you can see, on a log scale, how much space
a tree needs, compared to the
tree height. I think this is 16 million values,
but I don't remember exactly.
You can find the details in the ART paper.
But you can see: if we're just using one bit,
then we use very little space, of course,
because we're wasting very little;
all of our nodes will be at least half full.
And it's still more or less the same with two bits,
because in many cases
the tree will not be that sparse.
But of course, two bits already means our tree is half as high as with one bit.
So this is how we have to read this plot:
one bit means a tree height of 32 levels.
And we know
it's a 32-bit key here; how many keys there are exactly I can't easily
calculate right now. But we can see: the more bits we use, the
lower the tree height, of course, because more bits means
we're cutting larger chunks off the key at a time, and we need fewer
chunks. So one byte per level gives us eight
times less height than one bit per level, because every time
we're cutting off eight bits of the key. And 32-bit keys then mean a tree height of four, which makes sense. That's
also the tree height that ART has, because we
use eight bits per level. Now, the generalized prefix tree, that's a paper from 2011 by Matthias Böhm at BTW.
There, at least in the standard setting, they used four bits.
So you can see this tree is, of course, twice as high, because it uses only half the number of bits per level.
And this here, for example, is the Linux kernel radix tree (LRT),
which uses 6 bits per level.
And we can see, well,
it's no surprise that ART is flatter here.
But the surprising part is that ART also uses less space.
Because for the generalized prefix tree and the LRT,
technically the nodes are smaller, right?
At least if we're talking about full nodes,
they use less space per node because they use fewer bits.
However, because the ART nodes are adaptive,
so we can also use smaller nodes,
ART actually uses much less space here,
almost as little as a tree with only two bits
or one bit, and basically the same as a tree that uses only three bits on every level.
We're still going to use more space than the ideal, because
not all of the nodes will be full all the time. But on average we're using a
lot less space than, say, any kind of tree that always uses 8
bits per level. A regular radix tree with 8 bits would always have
these 256-entry nodes rather than the smaller ones. So that's why having these different node types makes a lot of
sense: we're using a lot less space while still having this very flat tree. And this also shows up in performance.
So here is the lookup performance of ART
compared to the generalized prefix tree
and to a cache-sensitive B-tree,
which is something we talked about earlier,
then a k-ary search tree, FAST (the fast architecture-sensitive tree),
and a chained hash table.
So, as you remember, a chained hash table,
not open addressing.
And you can see that in a dense tree, the art lookup performance is actually much better.
And it's on par with a hash table, basically.
The difference is: dense tree means
that if we have, say, n values in our tree, then the values that we're inserting
are a permutation of 1 to n.
So basically every key in the range is set,
which means the nodes will be nicely filled on most levels.
In a sparse tree, everything is randomly
selected out of the key space. In this case we will
have some holes in the tree, some nodes that are mostly empty, some
nodes that are more filled, etc. And you can see that in a sparse tree
the performance is not as good,
but it's still better than all of the other trees out there,
and all of the other radix trees in particular.
And the insert performance is also good.
In the dense tree, again, it's nice.
It's better because we can nicely pack our individual
nodes, so we don't need all this path compression, etc.
In the sparse tree, this will be slower.
We can be even faster if we do a bulk insert:
if we know exactly what we're inserting, we can start from the leaves, pack them nicely, and create a dense or efficient tree representation.
And again, this will be faster than the other trees.
And even the sparse case, at least in their experiments, was faster than building up a hash table, etc.
Okay, and the coolest thing is you will implement this yourself. You can see all of this yourself.
So you will basically find out and enjoy the art yourself.
Okay, so with that, we're through with the data structures, at
least for today; actually, no more data structures today at all. Just as a reminder, we talked about
hash functions and hash table organizations, B-trees and the cache-sensitive B-tree, and then radix trees
and tries in general, and especially today we talked about the Adaptive Radix Tree,
And with that, we're going to switch to multi-core parallelism.
But before we do so, are there any questions?
Yes?
I think it was on slide 13 or
30, where we have the individual node types.
Yes?
And there it says that each pointer is 4 bytes wide, right? Sorry? Yes. Yes.
Sorry?
The pointers are... 4 bytes or 8 bytes?
So that's a good question.
I think here, in the paper at least, they're 4 bytes.
So that's why.
I mean, if you have 8-byte pointers, you will need 8 bytes.
Yes.
So in this case, all of the pointers are 4 bytes.
I think this is a bit older,
so that might be because of the older representation.
I don't remember what we have in our implementation,
whether you are going to use 8 bytes in there,
so you might need to adjust to some degree. Okay, more
questions? No more questions? Well, then we're going to take a four-minute break
and then switch over to multi-core parallelism. Okay, so let's get started
with multi-core parallelism.
And what I want to show you today is some basic concepts of parallelism.
We'll also need this later for NUMA and cross-rack scale parallelism.
And then we're going to go, as an example, to parallel joins. So how do you make a join parallel?
And like different kind of approaches.
But we'll probably not be able to cover this today.
So there's something where we will probably then end up talking about this next week.
So, but multi-core parallelism.
So why do we need this in the first place?
I showed you this before: what we see here is that at a certain point in the 2000s,
we kind of hit the energy wall, so manufacturers couldn't efficiently increase CPU performance anymore
just by scaling up the frequency.
So around 2004 or 2005, the frequency basically stopped increasing.
I mean, you can see some slightly higher frequencies, but there's nothing in the tens of
gigahertz right now. Before that, essentially every few years
we saw a doubling of the frequency that chips run at,
and that would just automatically give you better performance.
But this stopped, as we can see here, right?
The frequency is basically staying in the same ballpark.
And in order to get better performance, manufacturers just added more cores,
Right? So more functional units of the same type.
And this is kind of what we see in the title slide here, right?
I think this is a Xeon Phi,
where you can see it's one chip,
but there are lots and lots of individual cores
that all basically look the same.
So this is a Xeon Phi.
It's a bit older, but here you can see the structure nicely.
In any kind of chip that you will look at today,
the dies, you will see these kinds of substructures
where lots and lots of the same kind of processing unit
will be in there.
And even in there, you have structures that are replicated
as we had earlier, right?
So the functional units, you will
have multiple same type of functional units
in a single core in order to already get
some parallelism there.
And that's something we have to work with.
So sticking with single-threaded programs
means,
well, we'll get some performance benefits
because the single-thread performance
still is increasing, right?
So the chips are getting smarter.
We have multiple functional units,
and the compiler and the CPU are
trying to reorder instructions
to get some parallelism out of there.
But this is not really increasing a lot anymore, right?
So it's increasing, but not that fast.
But the number of transistors is increasing, so still continuing to increase.
And this depends on the number of logical cores, or number of logical or actually physical cores doesn't really matter that much and that's where we can still
get a lot of performance out and that's what we have to work against and this is
basically what today's chips can look like so an Intel Ice Lake which is kind
of current current architecture will have up to 40 cores, basically 80 threads
with hyperthreading, and up to eight of these on a single system. So up to eight sockets,
and then including up to six terabytes of RAM, including persistent memory. So this is quite a beefy machine, right?
And here I actually have a picture of a Skylake
rather than Ice Lake, but this also shows you
the general architecture here.
And we'll come to this if we talk about NUMA,
so non-uniform memory access.
You can see that these are all connected with each other,
but it's not fully connected.
So these nodes, for example, are not directly connected,
so you will have to go through somewhere else
in order to talk between these two.
And then each of the nodes will directly have,
will have some memory attached to it.
So in this case, in the sky lake, six memory modules.
In the ice lake, you would have eight memory modules
plus eight persistent memory modules maximum in here.
And that's what we have to do, right?
So we have to utilize these multiple cores.
And in order to do this in the database system,
we have to, well, somehow give them tasks.
And there's different ways of doing so.
Well, we need to give different tasks to the different cores; they need to be working on something independently. A simple way of getting
there is inter-query parallelism: rather than parallelizing a single query, we just allow
multiple queries to enter the system,
and each query will be handled by a separate thread, for example, or by a separate process.
And then this automatically gives us some performance improvement. If it's just queries, they're just reading stuff, there are not that many conflicts. We still have to somehow deal
with conflicts, because if everything reads the same kind of data in different cores and different caches,
then we might get into some problems there. But with multiple queries
being in the system at the same time, we can increase the throughput of our system.
So say we have a single processor with 40 cores.
Then we can easily have 40 queries in the system at the same time.
And ideally, if there's no contention anywhere, we'll get a 40-way or 40-times
performance improvement.
However, each individual query will still take the same amount of time
as if we run them individually or just independently,
because every query will be run by a single core, or a single thread, meaning they will take the same time
if we have one core, if we have 100 cores.
Doesn't really make a huge difference.
Of course, if there's contention,
we might actually have higher latency all of a sudden.
But we're not talking about this yet.
So if we're using this inter-query parallelism,
we'll increase the throughput of the query stream, and this is really important for
OLTP. If you have small queries and a high number of transactions, some kind of shopping
system for example, then this is what you want, right?
So each individual transaction is actually fast,
doesn't take much time.
The latency is okay for most applications.
Then you don't need to parallelize the individual queries,
but you need to be able to process many queries
at the same time.
If you have long-running queries, or increasing amounts of data that you
have to process, then all of a sudden you actually want to have parallelism within a query.
And then we're talking about intra-query parallelism. That means we have to split up the query into multiple tasks.
Either we split it up into single operators, and for example use pipeline parallelism as we already know it; but that will be limited to the length of our pipeline, as we know,
and it will not really help us with latency, only with throughput again.
So in the end, we want something where we actually also break up the individual operators.
So say our scan, our join, et cetera, they need to be run in parallel in order to get
better performance, in order to get better latency.
So we get a reduced response time for a single query.
And, well, what we do is we basically split queries into smaller tasks
and execute the tasks in parallel on multiple cores,
ideally enabling inter- and intra-query parallelism.
And, of course, we still want to be able to answer multiple queries at the same time, right?
So we don't want to have the system
always be exclusively blocked by a single query,
even though we want to be able to parallelize queries.
So there's, again, kind of trade-offs, right?
It's all about trade-offs all the time.
So there are a few basic concepts.
So one thing is we need to be able to split up our work into sub-parts.
And that's called work partitioning.
And these need to be able to be run in parallel.
And sometimes this is also known as domain decomposition.
But we're going to talk about work partitioning.
And often this for us even means data partitioning.
So we're splitting up the data into different parts.
And we can work on these parts separately.
And if we can do this neatly and there is no communication, or at least little communication, needed between processing these parts, at least for a certain number of steps, then we can get nice performance improvements.
These individual parts, these individual tasks, then somehow need to be given to some execution unit.
And that's what we call scheduling.
We've already heard this in the CPU.
The CPU schedules the individual micro-operations, does some reordering.
The same thing needs to happen at a higher level for our database tasks. Say we're splitting up our scan into sub-regions,
then some worker at a certain point needs to pick up this scan task,
or needs to be told to do this scan now.
And the worker is basically some execution context.
And the amount of work
per task
is the task granularity.
If it's
too little, if we have very
small tasks, then we have
a lot of overhead from the scheduling,
from the context switches,
etc. The tasks need to be fed all the time
and they only do a few operations each,
so it's not really efficient.
If the tasks are too large,
then we'll probably get large differences in runtime,
and we might again not be as efficient, because the load might be very different on different nodes.
We'll probably see an example next lecture.
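To make the partitioning, scheduling, and granularity points concrete, here is a small sketch (my own example, not from the lecture) of a parallel scan where worker threads dynamically grab fixed-size chunks:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Work partitioning plus dynamic scheduling in a nutshell: the table is split
// into fixed-size chunks and worker threads grab the next chunk via an atomic
// counter. The chunk size controls the task-granularity trade-off.
uint64_t parallelCountMatches(const std::vector<int>& table, int needle,
                              std::size_t chunk_size, unsigned num_threads) {
    std::atomic<std::size_t> next_chunk{0};
    std::atomic<uint64_t> total{0};

    auto worker = [&] {
        uint64_t local = 0;  // thread-local partial result, merged once at the end
        while (true) {
            std::size_t begin = next_chunk.fetch_add(chunk_size);
            if (begin >= table.size()) break;
            std::size_t end = std::min(begin + chunk_size, table.size());
            for (std::size_t i = begin; i < end; ++i) {
                if (table[i] == needle) ++local;
            }
        }
        total.fetch_add(local);
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return total.load();
}
```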
And of course, everything in the end
still needs to be correct.
And that's a critical one and a tricky one
because as soon as we're running stuff in parallel,
well, we might get into all kinds of race conditions.
While in a single-threaded program
you can very much depend on variables being written
in the order you wrote your code in,
in parallel execution, accessing global variables etc. can all of a sudden destroy
everything.
You can get into all kinds of inconsistencies.
So you need to be able to synchronize, etc., between multiple threads, multiple execution
units.
You have to deal with this.
And often you actually have to deal with this yourself.
So the environment doesn't necessarily give you any guarantees in there
unless you're careful.
So you need some kind of synchronization to enforce the order.
Think about outputs, for example, or hash tables.
So you want to build a hash table.
If you have multiple threads accessing the hash table,
they might overwrite your value.
So in the end, you end up with a broken hash table
rather than an actual hash table that you need.
Okay.
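As a tiny illustration of that point (my own example; the locking lectures will cover much better techniques), a coarse-grained way to make concurrent inserts safe is a single mutex around the table:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// std::unordered_map is not safe for concurrent writes: without the lock, two
// threads inserting at the same time can corrupt the table or lose updates.
// One global mutex is the bluntest possible fix.
class SynchronizedHashTable {
public:
    void insert(const std::string& key, int value) {
        std::lock_guard<std::mutex> guard(mutex_);  // one writer at a time
        map_[key] = value;
    }

    bool lookup(const std::string& key, int& value) const {
        std::lock_guard<std::mutex> guard(mutex_);  // also guard reads, for simplicity
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        value = it->second;
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, int> map_;
};
```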
So if we parallelize, we have to talk about how well does this work, right?
So do we actually get better performance?
And this is typically called scalability.
And this is also one of the things that database research looks into.
So how can we make stuff scalable?
But there are
multiple different things that we can mean when we're
talking about scalability.
In the end, it's about
how well the system parallelizes, I would say.
That's basically what we mean by scalability,
but there are different variants.
So on the one hand, we can talk about speed up.
It's basically if we have one node or one core
and a problem that we need to solve,
so our query TPC-H scaling factor one, query one,
for example.
What if we use all 40 cores instead?
So how much faster is this?
This is what we call speed up.
And this is also called strong scaling.
So if we have 40 cores:
we try this query on a single core,
at scaling factor 1, and then we use all 40 cores and we get a 40-times
speedup, so we're 40 times faster with the same query and the same dataset size. That's what we call
strong scaling, because the problem size is the same, we're just throwing more parallelism
at it. And if we then get better performance, that's actually good;
this is what we want, and it's also the trickier one to get. We
also have scale-up, which basically means: what if we have a larger load, a larger problem size,
and larger hardware?
So say, for example, we're using one thread for scaling factor
1, and we're using 40 threads or all 40 cores for scaling factor
40.
So that would be kind of the same, that would be this scale up.
So in this case we have more resources, but we also have a larger problem.
And then, ideally, we should be in the
same ballpark in terms of performance,
meaning 40 cores with 40 times as much data take the same amount of time
as one core with one-fortieth of the data.
This is called weak scaling,
typically because it's easier to achieve:
with a growing problem size, the parallelization is usually easier,
there's not as much synchronization needed anymore.
It really depends on the problem, but in general this holds.
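To pin the two notions down in formulas (my notation, not from the slides): write T_N(w) for the time N cores need for work w. Then:

```latex
\text{Speedup (strong scaling):}\quad S(N) = \frac{T_1(w)}{T_N(w)}, \qquad \text{ideal: } S(N) = N \\
\text{Scale-up (weak scaling):}\quad E(N) = \frac{T_1(w)}{T_N(N \cdot w)}, \qquad \text{ideal: } E(N) = 1
```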
And then finally, and maybe I've mixed something up here:
what if we have a larger load
on more machines?
Well, I've possibly mixed up the previous ones;
I'll have to check the slide later.
Anyway, scale-out means we have a larger load and more machines.
And again, this should basically be the same as scale up.
Well, okay, I had mixed up the two for a moment.
Scale-up is still the same;
what I said is correct.
Scale-out only means that, rather than having more cores, for example,
we have more machines.
So we have, rather than having one beefier machine, we have more machines.
So that's the difference here. And it's basically the same as scale up with the difference that we talk about individual machines
rather than talking about stronger machines, basically.
Or more resources in a single machine. Sorry about this. OK.
Then based on this, we can basically see, OK,
how good is our speedup?
We already talked about the reduction of the response time
with strong scaling.
The ideal case is a linear speedup:
based on the amount of resources that we add, we get the same factor of speedup.
The factor by which our system gets faster equals the number of resources that we add.
This is ideal, and it happens if we have a very easy-to-parallelize problem. Say, for
example, we have independent queries that we schedule on individual threads
and there's no data contention whatsoever; then we can achieve this ideal
speedup, because there's no synchronization needed. Adding more threads
just means we're able to process more queries, so we get better throughput, we get better speedup. Typically, we have
sub-linear speedup, because there is some contention somewhere, some overhead here
and there. We add more resources, we get faster, but not as fast as we would like. And sometimes, in some cases,
we actually get super linear speed up.
So this means we add more resources
and all of a sudden, like we add, say,
two processors or two additional cores,
so from one core to two cores,
and all of a sudden our system is not twice as fast,
but two and a half times as fast, or
something like this. And this happens if we have some kind of caching effects: all of
a sudden everything fits into a cache, for example, so we have better cache efficiency than we had before,
because we're basically just crossing a certain threshold.
Or the translation lookaside buffer:
before, we had a lot of misses in the translation lookaside buffer,
and if all of a sudden it works well
and we can always utilize it,
then this will give us a huge performance boost
and could lead to a super-linear speedup.
But that often means your initial program wasn't really that efficient after all.
And
sub-linear speedup happens in many cases.
Say, for example, your program is not parallel enough:
your problem is just something that is not easy to parallelize,
or at least not the way you structured it.
Then you're not going to get a speedup, of course,
if you have to serialize.
Everybody's touching the same data item,
you have a lock there, and all of the threads are just waiting for this lock.
Well, you're not going to get a speedup.
It's the same problem if you need a lot of synchronization:
then you need to communicate a lot
for synchronization, and that will
make your program slower.
or will make your program slower.
And then if your system is memory bound
in the first place, right?
Or disk bound, for example, right?
We had this, so there's some part of your system where you have contention.
Then parallelizing something else won't help you at all.
So if you get some more parallelism in your CPU,
but you're memory-bound anyway, well, it doesn't help you.
So then there's not going to be any speed-up for this,
or at least you're going to get sublinear speed-up.
And with this: who's heard of Amdahl's law? All of you? Okay. So, if you look at a certain program, there typically is a parallelizable portion and an unparallelizable, sequential portion.
At least the way you're parallelizing your code, you parallelize a certain part of the code,
and some part will not be parallelized.
Say some basic setup procedure, for example, will probably not be parallelized,
and that will be a certain part of your execution time. And then, based on this, you
can actually calculate the maximum speedup that you can achieve. That is what Amdahl's Law says. So if 90% of your code is parallelized
and 10% is sequential,
then the maximum speedup that you can get is 10.
You can throw as many cores at it as you want:
the parallelized 90% will become less and less significant to the overall execution time,
and the 10% that is left is what will dominate your execution time.
So in the end, that's the maximum speedup.
And this is a theoretical concept. Typically, if your problem size increases
and you take a lot of care, there is actually more parallelism to be gained. Of course,
if you stick to a certain algorithm, a certain implementation, then you can
check how much of that code you can actually parallelize, and then you can also see the maximum parallelism that you can get.
And then you can see what's actually efficient.
After a certain point, you will see,
well, I'm going to add more and more resources to this.
But here, well, I'm not going to get more performance out of it.
So this is basically here.
I'm still getting more speed up with say 32, 64, whatever processes,
but here from a thousand processes, it's negligible.
So I'm just wasting resources in the end.
And this typically holds for strong scaling in the end.
So if you have a certain problem size and you want to parallelize,
unless you're doing something fundamentally different, then you're going to be stuck here.
But in many cases of weak scaling, with larger problem sizes, the
sequential part of your problem is actually reduced, so there is
more parallel portion, typically, in real-world examples. Meaning, if all of a sudden you're
a huge internet company, most of your workloads will inherently have a lot of potential for parallelism. Say
you're doing business with two people; with two people doing business there's not
much parallelism that we can exploit. But if we have a gazillion people doing business with each
other, there are lots of individual interactions and lots of subsets that we can break this down into.
Okay, so there are a couple of pitfalls in parallelizing stuff,
and this is also where Amdahl's law hits you.
One is non-scalable algorithms.
If you're using something like a depth-first search in a tree, that's actually hard to parallelize,
while breadth-first search is something where you can easily think about
how to parallelize it.
So rethink your algorithm. It's similar to what we already did when thinking about SIMD code:
you need a different focus, you actually have to think parallel,
and then you will get new forms of parallelism. It might be a
little bit less efficient than doing it single-threaded or in a
scalar way, you might have to do more operations, but in the end it pays
off, because there are more options to parallelize. A big problem is load imbalance:
if you have large tasks that you need to deal with, break them down into smaller tasks and dynamically schedule them between processors and
threads, etc., in order to get better load balance, so that different units, cores,
etc. are not waiting.
And this is a typical problem.
You can see this in big data systems, like Hadoop, etc.; there it's a very typical problem.
You have one core in a cluster of 10 or 100 nodes that is actually doing something,
and all the other cores are waiting, because that's the one reducer where all the data ends up in the end. By dynamically scheduling and breaking things down,
you can get much better performance.
And of course, splitting things up into very small tasks means a lot of overhead.
So make sure that the task size is
not too small; again, it's a trade-off, not too small, not too large. Otherwise you will
either have lots of overhead or lots of imbalance. Okay,
let's stop here, I think. It's a good point to stop.
Do we have questions?
No questions so far?
Okay, then next time we're going to continue with parallelism and different kinds of algorithms.
As I said, this will be on Wednesday. On Tuesday I'm out doing other stuff.
So then you have one day off. Use
it for the task. If you have questions regarding the task we can also schedule
something there. So if there's some tricky stuff in the programming then
reach out to us and we will schedule something. Otherwise we'll also have additional opportunities later on.
Okay, well then, thank you and see you next Wednesday.