Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Multicore Parallelism

Episode Date: May 31, 2023

...

Transcript
Starting point is 00:00:00 Okay, so I guess we can get started. Welcome everybody to our next session. Today we're going to talk about multi-core parallelism once we finish up the ART. But before that I want to make two, three announcements. One important one: we have an anonymous feedback form in Moodle. You can just see it, it's right beneath the multi-core slides. It's very short, just five questions, like a short midterm eval. And the idea is that you give us feedback
Starting point is 00:00:46 on whether the course matches your expectations and everything is the way you expected, or if there's something where you think we can improve. Because if we get the feedback now, we can actually still do something, right? And this will also help us adjust to make sure that the expectations are clear for the course. So please fill this out and help us improve the course. It's anonymous, so we're not going to trace it back to you. And if you give us critique, try to make it actionable. So if you say, well, the course is stupid,
Starting point is 00:01:28 there's little we can do. If you tell us what exactly is stupid and maybe even what you think would be better, then we can actually improve ourselves and the course. And that's what we actually want to do, to make sure that the course is something that you enjoy and we enjoy. Okay, so today we're going to talk about multi-core parallelism, but before that we'll finish up
Starting point is 00:01:52 ART, because we kind of got stuck in there last week. Yesterday you already got the task description, so you maybe already have a bit of an idea. One additional thing regarding Tuesday's Q&A session: I'm not going to be here, so there will be no class at all. And then we're going to continue. I'm pretty sure I'm not going to finish multi-core today, so we're going to continue multi-core, then start locking, maybe finish up locking, and then continue as before, with something about NUMA, and so forth. But before we go there, we're going to go back to ART.
Starting point is 00:02:45 So, the adaptive radix tree. Before I do that, just briefly as a reminder: what is a trie, or radix tree, or prefix tree, or digital tree? This is a special type of search index or tree structure where we're not deciding on the full key at every tree node, but we're deciding on prefixes of the key. That's the important difference. If we have a constant key length, so all of our keys are the same length, then all of our leaf nodes, all of our values, will be on the same height. So we automatically get a balanced tree, just because we're only indexing the keys piece by piece. So we're splitting up the key into different parts and indexing that.
Starting point is 00:03:57 And in general, these parts could be, let's say, letters of the alphabet. But we already said this doesn't really work that well. So typically we're using a radix tree, meaning we're splitting up the key into bytes or into groups of bits, and at each node we're deciding which way to go based on these bit groups. And we don't have to store the keys in every node, because they're implicitly stored in the path from the root to the leaf. So we have this example here: we're starting with an A, for example, as the first digit. And since all of the values that we have start with an A, we don't have any additional children, just one.
Starting point is 00:04:53 So there's no search required here. We're just directly following the one path to the next node. And there we can already split up based on N and R. We typically have lexicographical order, but if we're using bits, we're just interpreting them as binary numbers. So then we have two children, and from there we have three children, D, T, Y. And these could already directly be pointers to the values, or there could be one additional indirection, so addresses that point to the values themselves. So additional pointers, basically.
Starting point is 00:05:39 And we already said in general how this works and that there is this trade-off between height and space. And this is kind of the problem. So we said in the radix tree, rather than looking at letters of the alphabet or something like that, we're actually looking at bits. Let's say six bits, for example, or two bits, and that directly decides how many children we will have for each node. And then we have this trade-off: if we're
Starting point is 00:06:18 saying we're only looking at one bit, this means we'll have a binary tree, so each node will only have two children. And that means we automatically get a higher tree, but we're using less space, because in a binary tree we waste at most half of the node, essentially. Because if we only have one child, we will basically have one pointer wasted. But if we have larger nodes,
Starting point is 00:06:51 meaning more bits that we're looking at, then the tree will be flatter, but in each individual node we will potentially waste more space. And so this is the efficiency trade-off in the end. And this is why it depends on the type of values that we want to store, whether they're dense or sparse. Sometimes larger nodes would be better, sometimes smaller nodes would be better. If you have very sparse values, you'll probably have many empty or almost empty nodes, and then very wide nodes will be inefficient. And in order to be more flexible there, the ART index was created. So as I said, this is the space problem, and an alternative to that is
Starting point is 00:07:59 having multiple different node sizes. That's what the ART, the Adaptive Radix Tree, does. So depending on the number of children that we have, so the number of different values we have for a certain prefix, we can decide what kind of node we're going to use here. And in ART we have a fixed span of the key that we're using: we're always going to use 1 byte, or 8 bits, which means we have 256 different values that we can potentially have.
Starting point is 00:08:41 And we'll split up any kind of key into exactly these subsets. So say, for example, you have this integer key, and this is, let me count, 218,237,439. This is our key. We're going to convert this into a bit representation and we'll split up this bit representation. So this is a 32-bit unsigned integer and we'll split it up into blocks of 8 bits. So here you can see these are basically our 8 bits,
Starting point is 00:09:23 and it's a bit small. I mean, you get the idea and you can see it in the slides if you check them later. We can see, okay, this is the complete bit representation of the integer key. And then we take the subsets of 8 bits, and we interpret those as the individual prefixes. And these we can think of as individual numbers in the end. So, say, for example, this is something that we can easily still see: this would be 2, for example. This is also still easy to see: it's 255, all ones basically.
Starting point is 00:10:01 And this is basically our byte representation of the key. And since we have this integer key, we have four bytes, meaning we'll have four prefixes. This also means, if we don't allow any longer keys, this is also the maximum tree height that we'll have. In this case the tree can be at most four levels deep. Okay, so far so good, this is still basic radix tree stuff. Any kind of radix tree would work like this if we're saying our node size is 8 bits. Now the Adaptive Radix Tree says, okay, for this quite high fanout we're going to have four physical representations of inner nodes.
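To make the decomposition concrete, here is a minimal C++ sketch, assuming 32-bit unsigned keys split most-significant byte first; the helper name key_bytes is made up for illustration:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Split a 32-bit key into four one-byte partial keys, most significant byte first,
// so that each byte selects the child at one tree level.
std::array<uint8_t, 4> key_bytes(uint32_t key) {
    return {static_cast<uint8_t>(key >> 24),
            static_cast<uint8_t>(key >> 16),
            static_cast<uint8_t>(key >> 8),
            static_cast<uint8_t>(key)};
}

int main() {
    // The example key from the lecture: 218,237,439 decomposes into 13, 2, 9, 255.
    for (uint8_t b : key_bytes(218237439u)) std::printf("%d ", b);
}
```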
Starting point is 00:11:00 So, fanout means how many children one node can have. And in this case, because we're using 8 bits, each node can have 256 different children, so from 0 to 255, all possible combinations. And in some cases, this might actually be justified. If you think about random numbers, or even not random numbers, say a consecutive range of numbers, then most likely the least significant bits, these ones here, will be dense,
Starting point is 00:11:46 meaning all of these values will exist somewhere. However, the most significant bits, if we're starting from zero, might not be. So in some cases we might have dense levels, meaning all of the values will be set, so we will have values for all bit representations. And in some cases we don't. Then we might have zero, or only a few, and then it makes sense to have smaller nodes that don't have to hold 256 different values. The most general case, a regular radix tree, would basically say: well, in every node I need to be able to store 256 values. But now we're going to have these different representations, where, depending on how many children we actually have,
Starting point is 00:12:55 we're going to use a different representation. And this actually makes sense. If you look at the previous slide right here, just briefly, we can see that in this tree there are these situations where, in certain nodes, I will not have all of the different values in there. Here, for example, I could have the whole alphabet, but I only have N and R as pointers. So that's the case that we're looking at. Additionally, what can happen is that we have these sub-paths in our tree where nothing changes. Like in the initial example, everything starts with an A, so we only have a straight path, we don't have additional children, and in this case there's no decision to be made. So in this case we basically would just need a chain of nodes that will
Starting point is 00:14:03 eventually point to the same thing. And for that there's an additional optimization that the adaptive radix tree has, which is path compression. So these kinds of paths can be collapsed. We just say, okay, all of the nodes below this will start with the same prefix, so we're just going to combine all these prefixes and only store the subchildren. All of a sudden, the tree is not balanced anymore, right? Because we're basically losing some entries
Starting point is 00:14:33 or we're losing some children in there. But that's fine. In the end, we still have this maximum depth that depends on the key length. It's only that, if we have no additional children here, we're not going to have this additional inefficiency of hopping through a couple of nodes that don't make any decision. And we're also using lazy expansion, meaning we're not going to expand the node unless
Starting point is 00:15:02 we actually need to, and we're not going to expand the path unless we actually need to. And the adaptive radix tree tries to align all of the nodes with cache lines, so that we have efficient lookups in cache and we're not going to hit memory all the time, and it's SIMD friendly, so we can use SIMD instructions for the lookups. Okay, so to have these four different representations, we basically have four different node types, which will be used based on how many values we actually have in there. So, how many children does a node have?
Starting point is 00:15:50 And the smallest one is node 4, which can store up to four child node pointers. So typically two to four, because if we have only one child, we'll do this path compression: if there's only one child, we'll just remove the whole node completely and go directly to the next node where we actually have to decide something. There we have to think: if we do this path compression, we actually need to somehow store the prefix again, because we're jumping. And if we're looking for something that actually doesn't match the path that we're jumping over,
Starting point is 00:16:30 we still have to look up the key, because it's not necessarily stored anymore, or it's not completely stored anymore in our path. Anyway, so here, the node 4, as we said, has from 1 to 4, or rather 2 to 4, children. And here we basically have two parts to these nodes, to most of the nodes: first we have the keys, and then we have the child pointers. And the key here is directly stored in an array of size 4, and it's a sorted array. So here we're basically saying, okay, the
Starting point is 00:17:14 first key, the first child pointer, is stored in the first array element, the second in the second, and so forth. And then this directly says: okay, the child that starts on this level with the prefix 0, we have the pointer that goes here, right? So these are just directly aligned to each other. If we have the prefix 2, this will go from here. If we have the prefix 255, this will go from here. And if we don't have a match in here, then this prefix doesn't exist on this level. And in order to find something,
Starting point is 00:17:54 we will have to go through this array until we find our value. And it's SIMD-friendly, right? It's four elements, four pointers, so this is something we could do in a single operation in a SIMD register, this would nicely fit, for example. But if we have a scalar implementation, this is something where we just linearly go through. But it's only four values, so it's not really that costly in the end.
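A hypothetical node 4 layout along these lines might look as follows; this is only a sketch of the idea, the field names and types are assumptions and not the reference implementation:

```cpp
#include <cstdint>

// Hypothetical Node4: a sorted array of up to four key bytes and a parallel array of
// four child pointers, where children[i] belongs to keys[i].
struct Node4 {
    uint8_t num_children = 0;
    uint8_t keys[4];       // partial keys (one byte each), kept sorted
    void*   children[4];   // child pointer for the key byte at the same position

    // Linear scan over at most four entries; cheap enough without SIMD or binary search.
    void* find_child(uint8_t key_byte) const {
        for (int i = 0; i < num_children; ++i)
            if (keys[i] == key_byte) return children[i];
        return nullptr;    // prefix not present at this level
    }
};
```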
Starting point is 00:18:19 So this is the smallest node. Then we have the node 16, that's basically just for larger nodes, and it has the exact same structure as the node 4, it's just larger: it can store up to 16 elements. And again, this is just because sometimes we have very sparse data, we have some branches in here, so the prefix will be used by multiple children, but by very few. If it's very few, we'll use the node 4; if it's up to 16, we will use the node 16. So again, the idea is basically the same: we have an array of length 16 for the keys, and then we have 16
Starting point is 00:19:18 child pointers, and still we basically have to go through this array of keys to find our element in here. In this case, because it's 16, it already makes sense to do binary search in here. This is sorted. Sorry, I'm going to stop my laser pointer here. This is sorted, so if we have 16 values, we can actually start jumping in the middle and then do the binary search in here. We don't have to linearly go through. That doesn't really make sense for four values. And then again, if we have the prefix in here, then we find it, there's again the direct mapping
Starting point is 00:20:10 from the 16 array elements here to the 16 pointers here. So if we find our prefix in here, then we know directly, okay, we have to go from this pointer. Okay? It's quite easy so far. I mean, the whole thing is easy, it's just the multiple different representations that make it a bit more complex. Then, if we have more than 16 children, but fewer than 49, we have the next node type, the node 48.
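A hypothetical node 16 could look like the node 4 above, with the lookup done in one SIMD comparison as hinted at; this is a sketch under the assumption of an x86 CPU with SSE2 and a GCC/Clang compiler (for __builtin_ctz):

```cpp
#include <cstdint>
#include <immintrin.h>  // SSE2 intrinsics

// Hypothetical Node16: same layout as Node4, just with 16 slots. The lookup compares
// all 16 stored key bytes against the searched byte in a single 128-bit operation.
struct Node16 {
    uint8_t num_children = 0;
    uint8_t keys[16];      // sorted partial keys
    void*   children[16];

    void* find_child(uint8_t key_byte) const {
        __m128i needle = _mm_set1_epi8(static_cast<char>(key_byte));
        __m128i stored = _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
        // Byte-wise equality gives 0xFF in matching lanes; movemask packs that into bits.
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(needle, stored))
                   & ((1 << num_children) - 1);          // ignore unused slots
        return mask ? children[__builtin_ctz(mask)] : nullptr;
    }
};
```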
Starting point is 00:20:47 The node 48 is an additional representation where we can store 17 to 48 children. I mean, we could also store fewer, but again, we would only use this if we cannot use the node 4 and node 16. And here, rather than storing an array of 48 keys and searching through it, we directly store an array of 256 entries, but we only add offsets into the child pointer array where we actually have children. So in here we can basically say, okay, for each prefix where we actually have a child node, so
Starting point is 00:21:41 for zero we might have a child node, then we have the offset to the correct slot. For one, we don't have a child node, so for the prefix one there are no children, so we're not going to store an offset here, we just mark this as empty, something like this. Otherwise, we have the offset into this array of 48 children. And that still gives us a quite efficient representation, because we only need a few bits here to represent this offset. So we don't need full addresses or something for these offsets. And otherwise, we're just not storing anything.
Starting point is 00:22:28 It's still a fixed array. And in this case, we don't have to search for the key, we know it directly. For each individual prefix, we have an entry: either there is an offset or there is no offset. If there is no offset, then we know there is no child for this. Yes? Is that the only reason, that we don't have to do a binary search when we have that many values, and don't store the keys
Starting point is 00:22:59 like in the smaller nodes where we have, like, 16 keys? Yes. So it's basically a faster lookup. And at a certain point, it doesn't make that much of a difference anymore; the overhead is not that large, basically. So we're faster.
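A hypothetical node 48 sketch, to illustrate why no search is needed: the key byte indexes a 256-entry index array that either marks "empty" or gives the slot in the 48-entry child pointer array (the sentinel value and field names are assumptions):

```cpp
#include <cstdint>

// Hypothetical Node48: a 256-entry index array maps a key byte directly to a slot in
// the 48-entry child pointer array, or marks it as empty.
struct Node48 {
    static constexpr uint8_t kEmpty = 0xFF;   // sentinel: no child for this key byte
    uint8_t child_index[256];                 // offset into children[], or kEmpty
    void*   children[48] = {};

    Node48() { for (auto& slot : child_index) slot = kEmpty; }

    // No search at all: the key byte indexes the 256-entry array directly.
    void* find_child(uint8_t key_byte) const {
        uint8_t slot = child_index[key_byte];
        return slot == kEmpty ? nullptr : children[slot];
    }
};
```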
Starting point is 00:23:18 We're also faster at inserting, because we can basically directly insert the offset, et cetera. And so that will make it more efficient. So that's the reason, basically. Okay? Okay, and then if this is still too small, if we have more children than that, then we're basically saying,
Starting point is 00:23:40 okay, now let's deal with a full node, the node 256, where we can store from 49 to 256 individual values. And in this case we will have a complete array of 256 entries, where we directly have the pointers to the children, or we have no pointer, which means we don't have this prefix. Okay, so quite easy. This node will be much larger, but again, it will only be used if we actually have this very high fanout already.
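And a hypothetical node 256, the simplest of the four: just one pointer slot per possible key byte (again only a sketch):

```cpp
#include <cstdint>

// Hypothetical Node256: the "classic" radix tree node. One pointer slot per possible
// key byte, nullptr meaning the prefix does not exist. Only used for fanouts above 48.
struct Node256 {
    void* children[256] = {};

    void* find_child(uint8_t key_byte) const { return children[key_byte]; }
};
```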
Starting point is 00:24:24 So a fanout of 49 is already pretty large, and such a node is automatically almost one-fifth full, because otherwise we would always use the other node types. Clear so far? Okay. And of course, all nodes have a small header that contains the node type, so we know which kind of node we have, and, I mean, you will see all the details, even to a higher degree than what I'm presenting here, when you program this. They also have the number of child nodes and
Starting point is 00:25:04 potentially the compressed path. Because we said we can compress the path: if we only have one child below, then we don't need an extra node for this, we can basically just compress this into the next node. But only up to a certain length. I mean, we can also do more, but then we have to check below whether the path we're following is actually still correct, because at a certain point we cannot go byte by byte checking against our key anymore, we're jumping in the key, and what we're jumping over might actually be wrong. So there we need to do additional checks. Okay.
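A possible shape of such a header, as a sketch; concrete field widths and the prefix buffer size differ between implementations:

```cpp
#include <cstdint>

// Hypothetical node header along the lines described above: node type, child count,
// and the compressed path (up to 8 prefix bytes plus its length).
enum class NodeType : uint8_t { Node4, Node16, Node48, Node256 };

struct NodeHeader {
    NodeType type;
    uint8_t  num_children;
    uint8_t  prefix_length;   // how many key bytes were collapsed by path compression
    uint8_t  prefix[8];       // the collapsed bytes themselves (up to 8 stored here)
};
```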
Starting point is 00:25:48 Now, the leaf nodes. This works directly only if it's a unique index, otherwise we need an additional indirection in the leaf nodes. Obviously, just like in a B-tree or any other tree: if we're not unique, then we need to do some additional handling in the leaf nodes, because then somewhere down there in the leaf nodes we need to be able to store multiple leaves. So this means
Starting point is 00:26:16 we'll probably have some special type of leaf node where we have the pointers to the additional duplicates. We can have different kinds of leaf nodes. The most generic version would be single-value leaves that store one value, and then this value can be whatever it wants. But of course there's an additional step that we need to take: if we have only one value per leaf, that basically means there's one additional indirection from the last pointer to this one leaf. And if we need multiple leaves, or we do something like a range lookup, that will be less efficient. Because of that, we can also have multi-value leaves, which basically store key-value pairs.
Starting point is 00:27:12 Then they might have a different structure. And this works well if all of the keys have the same length and maybe also the values have the same length, so we have a nice structure and can easily store this. If not, then we have to do some additional stuff. And finally, if our values are basically the same size as the pointers that we're using, then we can actually use the same structure
Starting point is 00:27:45 that we have in the inner nodes. We can use the exact same way that we store the pointers in the inner nodes in order to store the actual leaf values. And this will basically be most efficient, because then we can again have these different kinds of node types, we have nice cache alignment, and we have fast lookup speed. Okay, so let's go through this in an example.
Starting point is 00:28:20 So this here is basically one tree of four levels, which is what we would expect if we have this byte representation of 32-bit keys with 8 bits per level. And in this case, our first node, the root node, is a node 4 with actually only three entries, so we only have three children. The second level that we're going to go to is a node 48, so up to 48 different values, then a node 16, and then a node 256.
Starting point is 00:29:02 And again, this is a tree, right? We're just looking at one path from the root to the leaf, so there will of course be multiple nodes on each level, except for the root, otherwise we would compress the path. So we're looking up our integer key that we had before, 218,237,439. We're splitting this up into its bit representation, which again gives us the byte representation. So here our first byte would be 13, the second 2, then 9 and 255. And so first we're going to look in the root node. And in this case, in our example here, 13 will be the first prefix in the root node, so at position 0.
Starting point is 00:30:04 And we're going to take the whole array and start scanning it. The first one is already the right one, so we know the first array entry in the child pointers will be the one that we're following. So here we're going to get the address of the next node that we have to look at. This is a node 48, and in the node 48, remember, we basically have 256 slots in our index array where we're going to look up the offset into the 48-entry child pointer array. So in this case we're going to go with two, so 0, 1, 2, we look up the offset there, and luckily we actually have an offset in here, meaning we can actually follow the pointer. So this will be at position one, so the second entry in the child pointers. And then,
Starting point is 00:31:12 somehow, I don't know why the pointer is here, it should probably be here, because we're going from two. Well, actually, here it might not even be sorted, right? In this case the child pointer array, also for efficiency when adding, doesn't need to be sorted, so we can go directly here. That's why, in this example, we'll go to the node 16. In the node 16, again, we have, not direct offsets, but an array of 16 entries with the key bytes, with a direct mapping from this array to the child pointers.
Starting point is 00:31:58 So here we're going to look for the 9. We're going to find the 9 at position 2, and then we know, because 1, 2, 3, on the third entry here, we directly know where to go. And on the last level we have a node 256, so there we can directly jump to the right position to find our key. And in this case we have a tuple identifier stored directly there. So rather than storing a child pointer here, we have the tuple identifier, which points us directly to the correct data item that we're looking at, and which we could then use for further lookups, et cetera.
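To illustrate just the control flow of this walkthrough, here is a deliberately simplified toy in C++ that uses only full 256-way nodes (so no node-type dispatch and no path compression); it reproduces the 13 / 2 / 9 / 255 descent for the key 218,237,439:

```cpp
#include <cstdint>
#include <cstdio>

// Toy radix tree where every inner node is a full 256-way node: one key byte per level,
// four levels, a tuple identifier in the leaf. The real ART additionally dispatches on
// the node type (4/16/48/256) at every step and handles compressed prefixes; this is
// only the control-flow skeleton.
struct ToyNode {
    bool is_leaf = false;
    uint64_t tid = 0;              // tuple identifier, only meaningful in leaves
    ToyNode* children[256] = {};
};

const ToyNode* lookup(const ToyNode* node, uint32_t key) {
    for (int level = 0; level < 4 && node != nullptr && !node->is_leaf; ++level) {
        uint8_t key_byte = static_cast<uint8_t>(key >> (24 - 8 * level));
        node = node->children[key_byte];
    }
    return node;                   // leaf, or nullptr if the key is not present
}

int main() {
    ToyNode root, l1, l2, l3, leaf;
    leaf.is_leaf = true;
    leaf.tid = 42;
    root.children[13] = &l1;       // byte 13 chooses the child at level 0
    l1.children[2] = &l2;          // byte 2 at level 1
    l2.children[9] = &l3;          // byte 9 at level 2
    l3.children[255] = &leaf;      // byte 255 at level 3
    const ToyNode* hit = lookup(&root, 218237439u);
    std::printf("tid = %llu\n", hit ? static_cast<unsigned long long>(hit->tid) : 0ULL);
}
```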
Starting point is 00:32:43 OK, so I already mentioned this a couple of times, but if we have very long keys, the tree might be very high. Because, say you have 128 bits, then all of a sudden, since we still have this byte representation, rather than four levels we'll have 16 levels. That's already quite a few lookups that will cost us something. So in this case we actually want to somehow reduce the height, and there are basically two things we can do, which apply while adding data to and removing data from the tree. So if
Starting point is 00:33:50 we're adding stuff to the tree, we only add additional path nodes if we actually create distinctions in the nodes. So there are two ways. Lazy expansion means: if we're just adding new values to our tree, but we don't need to distinguish two nodes, we can directly go to one node. For example, if we have only a few children that will all fit into a single leaf or something like that, then we will
Starting point is 00:34:36 not expand the path, but we'll keep this stored within the same child node, and then do the distinction on the child node rather than going through the path. At the same time, if we have nodes within our path that don't differentiate, so where we only have one child under the node, then we'll remove this node. So we'll compress this node where there is only a single
Starting point is 00:35:05 child, or even a chain of nodes where we have only a single child. But of course, when doing so, we need to somehow store this left-out part of the key. So here, for example, we see that while B actually points us to something new, A doesn't give us anything new here. All of the children of this subtree will have an A at this position. So we can just remove this, but we need to store the A somewhere, so we need to store this compressed path in this node. Otherwise, if we have something like bore
Starting point is 00:35:44 or boor or something, whatever, then this would also be found here, and we could not correctly stop our search if we just removed this. So somehow this A needs to be stored, and it will be stored in this node, up to a certain prefix length. And in ART, this is up to eight bytes, so up to eight bytes we'll just store in here. But sometimes even this might be too little, right? And then we can basically compress
Starting point is 00:36:23 up to eight levels here, because we have one byte per level. This means we'll be able to just remove a chain of up to eight of these nodes. But the chain might be even longer if you have very long keys. And then, in this case, we basically have to compare on the tuple level. So we might still compress, but then on this level
Starting point is 00:36:59 we have to check: well, whatever we left out in there, was this actually correct? Or is this sub-part actually still correct? Or did I find a match which, somewhere in the middle, is actually not correct? I hope that's kind of clear. Okay, so let's look at some performance numbers, or why this is good. First of all, there's the space consumption. So here on this plot you can see how much space, on
Starting point is 00:37:33 a log scale, a tree needs for a certain size, in comparison to the tree height. And I think this is 16 million values, I don't remember correctly; you can see the details in the ART paper. But you can see, if we're just using one bit, then we use very little space, of course, because we're not wasting it, or we're only wasting very little, because all of our nodes will basically be at least half full.
Starting point is 00:38:01 And it's still going to be more or less the same with two bits, because in many cases the tree will not be that sparse. But, of course, two bits basically already means our tree is half as high as with one bit. So this is how we have to look at this here: one bit means basically a tree height of 32 levels,
Starting point is 00:38:43 because we have a 32-bit key here. And how many keys we have in total is something I cannot easily calculate right now. But we can see: the more bits we use, the lower the tree height, of course, because more bits means we're cutting larger chunks off the key at a time, and we need fewer chunks. So, one byte, that's eight times less height than with one bit, of course, because every time we're cutting eight bits off the key. And that would basically be this here: okay, 32 bits means we have a tree height of four. Makes sense. And that's
Starting point is 00:39:50 also what the ART has, right? So this is the tree height that the ART has, because we have eight bits. Now, the generalized prefix tree, that's a paper from 2011 by Matthias Böhm at BTW. There, at least in the standard setting, they used four bits. Then you can see, well, this is of course twice as high, because it uses only half the number of bits. All right. And this, for example, is the Linux kernel radix tree. The Linux kernel also uses a radix tree, with 6 bits, for example. And we can see the ART, well, I mean,
Starting point is 00:40:37 it is no surprise that the ART is flatter here. But the surprise is that the ART also uses less space. Because here, for the generalized prefix tree and the LRT, technically the nodes are smaller, right? At least if we're talking about full nodes, they use less space in these nodes because they use fewer bits. However, because the ART nodes are adaptive, we can also use smaller nodes.
Starting point is 00:41:11 Here, the ART actually uses much less space, almost as little as a tree with only two bits or one bit, and basically the same as a tree that only uses three bits on every level. We're still going to use more space than strictly necessary, because not all of the nodes will be full all the time, but on average we're using a lot less space than, say, any kind of tree that always uses 8 bits. A regular radix tree with 8 bits would basically always have these 256-entry nodes rather than the smaller ones. So that's why having these different node types actually makes a lot of sense: we're using a lot less space while still having this very flat tree. And this also pays off in terms of performance.
Starting point is 00:42:08 So here is the lookup performance of the ART compared to the generalized prefix tree, to a cache-sensitive B-tree, so this is something that we talked about earlier, then a k-ary search tree, FAST, the fast architecture-sensitive tree, and a chained hash table. So, you remember, a chained hash table, not open addressing.
Starting point is 00:42:39 And then, I don't remember, there's also the generalized prefix tree that we had. And you can see that on a dense tree, the ART lookup performance is actually much better, and it's basically on par with a hash table. The difference is: dense means that if we have, say, n values in our tree, then the values that we're inserting are a permutation of 1 to n. So basically everything is densely set,
Starting point is 00:43:18 which means the nodes will be nicely filled, essentially, on most levels. And in a sparse tree, everything is randomly selected out of the key space. So in this case we will have some holes in the tree, some nodes will be mostly empty, some nodes will be more filled, etc. And you can see that, well, in a sparse tree the performance will not be as good, but it's still better than all of the other trees out there, and all of the other radix trees in particular.
Starting point is 00:43:56 And the insert performance is also good. In the dense tree, again, it's better, because we can nicely pack our individual nodes, so we don't need all this path compression, etc. In the sparse tree, this will be slower. We can be even faster if we have a bulk insert: if we know exactly what we're inserting, we can basically start from the leaves, pack them nicely, and create a dense or efficient tree representation. And again, this will be faster than the other trees.
Starting point is 00:44:38 And even the sparse case, at least in their experiments, was faster than building up a hash table, etc. Okay, and the coolest thing is, you will implement this yourself. You can see all of this yourself, so you will basically find out and enjoy the ART yourself. Okay, so with that, we're basically through with the data structures, at least for today. Just as a reminder: we talked about hash functions and hash table organizations, B-trees and the cache-sensitive B+-tree, and then radix trees and tries in general, and especially today we talked about the adaptive radix tree, which you will be implementing.
Starting point is 00:45:29 And with that, we're going to switch to multi-core parallelism. But before we do so, are there any questions? Yes? I think it was on slide 13 or 30, where we have the individual node types. Yes? And one says that each pointer is 4 bytes wide, right? Sorry? Yes. Yes. Sorry?
Starting point is 00:46:03 The pointers are 4 bytes or 8 bytes? So that's a good question. I think here in the paper, at least, they're 4 bytes. That's why. I mean, if you have 8-byte pointers, you will need 8 bytes. Yes. So in this case, all of the pointers are 4 bytes. I think this is, I mean, this is a bit older.
Starting point is 00:46:30 So that might be because of the older representation. I don't remember what we have in our implementation, whether you are going to use 8 bytes in there. So you might need to adjust to some degree. Okay, more questions? No more questions? Well, then we're going to do a four-minute break and then switch over to multi-core parallelism. Okay, so let's get started with multi-core parallelism. What I want to show you today is some basic concepts of parallelism.
Starting point is 00:47:13 We'll also need this later for NUMA and cross-rack-scale parallelism. And then we're going to go, as an example, to parallel joins. So how do you make a join parallel? And different kinds of approaches. But we'll probably not be able to cover this today, so that's something we will probably end up talking about next week. So, multi-core parallelism. Why do we need this in the first place? I showed you this before, and basically what we see here is that at a certain point in the 2000s
Starting point is 00:47:54 we kind of hit the energy wall, so manufacturers couldn't efficiently increase the performance of CPUs anymore just by scaling up the frequency. So in 2004, 2005-ish, the frequency basically stopped increasing. I mean, you can see some slightly higher frequencies, but there's nothing in the tens of gigahertz right now. And before, essentially every few years we saw a doubling of the frequency that chips run at, and that would basically automatically give you better performance. But this stopped, as we can see here, right?
Starting point is 00:48:51 So the frequency is basically just hovering in the same kind of ballpark. And in order to get better performance, what manufacturers did is just add more cores, right? So more functional units of the same type. And this is what we see in the title slide here: I think this is a Xeon Phi, where you can see it's one chip, but there are lots and lots of individual cores that all basically look the same, right?
Starting point is 00:49:22 So this is a Xeon Phi. It's a bit older, but here you can see the structure nicely. In any kind of chip that you look at today, on the dies you will see these kinds of substructures, where lots and lots of the same kind of processing unit will be in there. And even within those, you have structures that are replicated, as we had earlier, right?
Starting point is 00:49:44 So the functional units: you will have multiple functional units of the same type in a single core, in order to already get some parallelism there. And that's something we have to work with. Going back to single-threaded programs means, well, we'll get some performance benefits,
Starting point is 00:50:08 because single-thread performance is still increasing, right? The chips are getting smarter, we have multiple functional units, and the compiler and the CPU are trying to reorder things to get some parallelism out of there. But this is not really increasing a lot anymore, right?
Starting point is 00:50:30 So it's increasing, but not that fast. But the number of transistors is still continuing to increase, and with it the number of logical cores, or actually physical cores, it doesn't really matter that much, and that's where we can still get a lot of performance out. And that's what we have to work with. And this is what today's chips can look like: an Intel Ice Lake, which is kind of a current architecture, will have up to 40 cores, so 80 threads with hyperthreading, and up to eight of these on a single system. So up to eight sockets, and then up to six terabytes of RAM, including persistent memory. So this is quite a beefy machine, right?
Starting point is 00:51:26 And here I actually have a picture of a Skylake rather than an Ice Lake, but this also shows you the general architecture. And we'll come back to this when we talk about NUMA, so non-uniform memory access. You can see that these are all connected with each other, but it's not fully connected. So these nodes, for example, are not directly connected,
Starting point is 00:51:52 so you will have to go through somewhere else in order to communicate between these two. And then each of the nodes will directly have some memory attached to it. So in this case, in the Skylake, six memory modules. In the Ice Lake, you would have eight memory modules plus, at maximum, eight persistent memory modules in here. And that's what we have to deal with, right?
Starting point is 00:52:23 So we have to utilize these multiple cores. And in order to do this in the database system, we have to somehow give them tasks. And there are different ways of doing so. We need to give different tasks to the different cores, right? They need to be working on something independently. And a simple way of getting there is inter-query parallelism. So rather than parallelizing a single query, we just allow multiple queries to enter the system,
Starting point is 00:53:10 and each query will be handled by a separate thread, for example, or by a separate process. And then this automatically gives us some performance improvement. If it's just queries, right, they're just reading stuff, there are not that many conflicts. We still have to deal with conflicts somehow, because if everything reads the same kind of data in different cores and different caches, then we might get into some problems there. So with multiple queries being in the system at the same time, we can increase the throughput of our system. Say we have a single processor with 40 cores: we can easily have 40 queries in the system at the same time, and ideally, if there's no contention somewhere, we'll get a 40 times
Starting point is 00:54:06 performance improvement. However, each individual query will still take the same amount of time as if we ran them individually, because every query will be run by a single core, or a single thread, meaning it will take the same time whether we have one core or 100 cores. Doesn't really make a huge difference. Of course, if there's contention, we might actually get higher latency all of a sudden.
Starting point is 00:54:41 But we're not talking about this yet. So here, if we're using this inter-query parallelism, we'll increase the throughput of the query stream, and this is really important for OLTP. So if you have small queries and a high number of transactions, some kind of shopping system for example, then you want to have this, right? Each individual transaction is actually fast, doesn't take much time, the latency is okay for most applications.
Starting point is 00:55:14 Then you don't need to parallelize the individual queries, but you need to be able to process many queries at the same time. If you have long-running queries and/or increasing amounts of data that you have to process, all of a sudden you actually want to have parallelism within a query. And then we're talking about intra-query parallelism. That means we have to split up the query into multiple tasks. Either we split it up into single operators and, say, use pipeline parallelism as we already know it, but that will be limited to the length of our pipeline, as we know, and it will not really help us with latency, only with throughput again.
Starting point is 00:56:06 So in the end, we want something where we also break up the individual operators. So say our scan, our join, et cetera, they need to be run in parallel in order to get better performance, in order to get better latency. So we get a reduced response time for a single query. And what we do is we basically split queries into smaller tasks and execute the tasks in parallel on multiple cores, ideally enabling inter- and intra-query parallelism. And, of course, we still want to be able to answer multiple queries at the same time, right?
Starting point is 00:56:48 So we don't want the system to always be exclusively blocked by a single query, even though we want to be able to parallelize queries. So there are, again, trade-offs; it's all about trade-offs all the time. So there are a few basic concepts.
Starting point is 00:57:20 One thing is we need to be able to split up our work into sub-parts, and that's called work partitioning. And these parts need to be able to run in parallel. Sometimes this is also known as domain decomposition, but we're going to talk about work partitioning. And for us this often even means data partitioning: we're splitting up the data into different parts, and we can work on these parts separately. And if we can do this neatly and there is no communication, or at least little communication, needed between processing these parts, at least for a certain number of steps, then we can get nice performance improvements. These individual parts, these individual tasks, then somehow need to be given to some execution unit.
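A minimal sketch of data partitioning in C++, assuming a simple scan-and-aggregate task: each thread works on its own contiguous chunk and only the per-thread results are combined at the end, so no synchronization is needed during the scan:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Each worker thread sums its own contiguous chunk; the per-thread results live in
// separate slots, so no locks are needed, and they are combined once at the end.
int main() {
    std::vector<uint64_t> data(1'000'000, 1);
    unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());

    std::vector<uint64_t> partial(num_workers, 0);
    std::vector<std::thread> workers;
    size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (unsigned w = 0; w < num_workers; ++w) {
        size_t begin = std::min(data.size(), w * chunk);
        size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin, data.begin() + end,
                                         uint64_t{0});
        });
    }
    for (auto& t : workers) t.join();

    uint64_t total = std::accumulate(partial.begin(), partial.end(), uint64_t{0});
    std::printf("total = %llu\n", static_cast<unsigned long long>(total));  // 1000000
}
```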
Starting point is 00:58:08 And that's what we call scheduling. We've already heard this for the CPU: the CPU schedules the individual micro-operations and does some reordering. The same thing, on a higher level, needs to happen for our database tasks. So say we're splitting up our scan into sub-regions that we're scanning, then at a certain point some worker needs to take the scan, or needs to be told to do this scan now. And this is basically some execution context. And the amount of work
Starting point is 00:58:46 that needs to be done per task is the task granularity. If it's too little, if we have very small tasks, then we have a lot of overhead from the scheduling, from the context switches, etc. So basically
Starting point is 00:59:02 the workers need to be fed with tasks all the time and each task only does a few operations. Then it's not really efficient. If the tasks are too large, then all of a sudden we'll probably get large differences in runtime, and we might again not be as efficient, because the load might be very different on different nodes. We'll see an example probably next lecture. And of course, everything in the end
Starting point is 00:59:34 still needs to be correct. And that's a critical one and a tricky one, because as soon as we're running stuff in parallel, we might get into all kinds of race conditions. So while in a single-threaded program you can very much depend on variables being written exactly in the way you wrote your code, in parallel execution, accessing global variables, etc., can all of a sudden basically destroy
Starting point is 01:00:09 everything. You can get into all kinds of inconsistencies. So you need to be able to synchronize, etc., between multiple threads, multiple execution units. You have to deal with this. And often you actually have to deal with this yourself; the environment doesn't necessarily give you any guarantees there unless you're careful.
Starting point is 01:00:33 So you need some kind of synchronization to enforce the order. So talking about outputs, for example, or hash tables, right? So you want to build a hash table. If you have multiple threads accessing the hash table, they might overwrite your value. So in the end, you end up with a broken hash table rather than an actual hash table that you need. Okay.
Starting point is 01:01:00 So if we parallelize, we have to talk about how well this works, right? Do we actually get better performance? And this is typically called scalability. And this is also one of the things that database research looks into: how can we make stuff scalable? But there are multiple different things that we mean when we're talking about scalability.
Starting point is 01:01:30 So in the end, it's about how well the system parallelizes, I would say. That's basically what we mean by scalability, but there are different things that we could mean. On the one hand, we can talk about speedup. That's basically: if we have one node or one core and a problem that we need to solve, say our query, TPC-H scaling factor one, query one,
Starting point is 01:02:04 for example. What if we use all 40 cores instead? How much faster is this? This is what we call speedup, and this is also called strong scaling. So we try this on a single core, executing this query, scaling factor 1, query 1, and then we use 40 cores, and we get a 40 times
Starting point is 01:02:28 speedup, so we're 40 times faster with the same query and the same data set size. That's what we call strong scaling, because the problem size is the same, we're just throwing more parallelism at it. And if we then get better performance, this is actually good, right? This is what we want, and this is also the more tricky one to get. We also have scale-up, which basically means: what if we have a larger load, a larger problem size, and larger hardware? So say, for example, we're using one thread for scaling factor 1, and we're using 40 threads, or all 40 cores, for scaling factor
Starting point is 01:03:23 40. So that would be this scale-up. In this case we have more resources, but we also have a larger problem. And here, ideally, we're in the same ballpark in terms of performance, meaning 40 cores with 40 times as much data will take the same amount of time as one core with one-fortieth of the data. And that's why it's called weak scaling.
Starting point is 01:03:57 Typically, this is called weak scaling because it's typically easier to achieve. With a larger problem size, your parallelization is typically easier: there's not as much synchronization needed anymore. It really depends on the problem, but in general this holds. And then finally, what if we, and maybe I've mixed up something here, larger load, adding resources, should be fine. And finally, what if we have a larger load
Starting point is 01:04:42 on larger machines? So we're basically, well, I've mixed up the previous ones. We have the same problem size, but larger resources. No, something is off there, I have to check later if I mixed up something. Anyway, scale-out is: we have a larger load and more machines. And again, this should basically be the same as scale-up. Well, okay. Scale-out means, basically, I've mixed up the two.
Starting point is 01:05:27 Scale-up is still the same, it's correct what I said. Scale-out only means: what if, rather than having more cores, we have more machines? So rather than having one beefier machine, we have more machines. That's the difference here. And it's basically the same as scale-up, with the difference that we talk about more individual machines rather than stronger machines, so more resources in a single machine. Sorry about this. OK.
Starting point is 01:06:05 Then, based on this, we can see: okay, how good is our speedup? So we already talked about the reduction of the response time in the strong scaling case. And the speedup will typically be one of the following. A linear speedup, the ideal speedup, means that based on the amount of resources that we add, we get the same factor of speedup: the factor by which our system gets faster is the number of resources that we add. This is ideal, and it happens if we have a very easy to parallelize problem. So say, for
Starting point is 01:06:55 example, we have independent queries that we schedule on individual threads and there's no data contention whatsoever, then we can achieve this ideal speedup, right? Because there's no synchronization needed, adding more threads just means we'll be able to process more of the queries, so we get better throughput, we have better speedup. Typically, we have sub-linear speedup, because there is some contention somewhere, there's some overhead here and there, and then, well, we add more resources, we get faster, but we don't get as fast as we want. And sometimes, in some cases, we actually get super-linear speedup. This means we add more resources,
Starting point is 01:07:50 like, say, we go from one core to two cores, and all of a sudden our system is not twice as fast, but two and a half times as fast, or something like this. And this happens if we have some kind of caching effects: all of a sudden everything fits into a cache, for example, so we have better cache efficiency than we had before, because we're basically just getting above a certain threshold,
Starting point is 01:08:26 or think of the translation lookaside buffer. Before, we had a lot of misses in the translation lookaside buffer; if all of a sudden we can always utilize it well, then this will give us a huge performance boost and could lead to a super-linear speedup. But that often means, well, your initial program wasn't really that efficient after all. And
Starting point is 01:08:58 sub-linear speedup happens in many cases. Say, for example, your program is not parallel enough: your problem, or at least the way you structured it, is just something that is not easy to parallelize. Then you're not going to get a speedup, of course, if you have to serialize. Like, everybody's touching the same data item, you have a lock there, everybody's just waiting,
Starting point is 01:09:27 all of the threads are just waiting for this lock. Well, you're not going to get a speedup. The same problem, or actually the same thing, if you need a lot of synchronization: then you need to communicate a lot for synchronization, and that will basically lead to a slowdown, or will make your program slower. And then, if your system is memory-bound
Starting point is 01:09:59 in the first place, or disk-bound, for example, we had this, so there's some part of your system where you have contention, then parallelizing something else won't help you at all. So if you get some more parallelism in your CPU, but you're memory-bound anyway, well, it doesn't help you. Then there's not going to be any speedup from this, or at least you're going to get a sub-linear speedup.
Starting point is 01:10:35 And with this: who's heard of Amdahl's law? All of you, yeah? Okay. So this is basically: if you look at a certain program, there typically is a parallelizable portion and an unparallelizable, or sequential, portion. At least the way you're parallelizing your code, you basically parallelize a certain part of the code, and some part will not be parallelized. Say some basic setup procedure, for example, will probably not be parallelized, and that will be a certain part of your execution time. And then based on this you can actually calculate the maximum speedup that you can achieve. This is what Amdahl's law says. So if 90% of your code is parallelized and 10% is sequential, then the maximum speedup that you can get is 10.
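As a quick sanity check of these numbers, Amdahl's law for a parallel fraction p and n cores bounds the speedup by 1 / ((1 - p) + p / n); a tiny sketch that evaluates it for p = 0.9:

```cpp
#include <cstdio>

// Amdahl's law: with parallel fraction p and n cores, the speedup is bounded by
//   S(n) = 1 / ((1 - p) + p / n)
// For p = 0.9 this approaches 10 no matter how many cores we add.
double amdahl_speedup(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    for (double n : {1.0, 10.0, 40.0, 1000.0, 1e9})
        std::printf("n = %10.0f  speedup = %.2f\n", n, amdahl_speedup(0.9, n));
    // prints roughly: 1.00, 5.26, 8.16, 9.91, 10.00
}
```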
Starting point is 01:11:35 So you can throw as many cores as you want at it, because these 90% will shrink the more you parallelize; this part will be less and less significant to the overall execution time. And the 10% that are still left, this is what will dominate your execution time. So in the end, that's the maximum speedup. And this is a theoretical concept. Typically, if your problem size increases and you take a lot of care, there is actually more parallelism to be gained, right? But of course, if you stick to a certain algorithm, to a certain implementation, then you can basically
Starting point is 01:12:19 check, okay, how much of this code can I actually parallelize. And then you can also see what's the maximum parallelism that I can get, and you can see what's actually efficient. After a certain point, you will see: well, I'm going to add more and more resources to this, but I'm not going to get more performance out of it. So this is basically here: I'm still getting more speedup with, say, 32, 64, whatever processors, but from a thousand processors on, it's negligible.
Starting point is 01:12:52 So I'm just wasting resources in the end. And this typically holds for strong scaling: if you have a certain problem size and you want to parallelize, unless you're doing something fundamentally different, you're going to be stuck here. But in many cases of weak scaling, so with larger problem sizes, the sequential part of your problem will actually be reduced, so there is more parallel portion. Typically, in real-world examples, this means: if all of a sudden you're a huge internet company, most of your workloads will inherently have a lot of potential for parallelism, rather than looking
Starting point is 01:13:47 at, say, two people doing business with each other; there's not much parallelism that we can achieve there. But if we have a gazillion people doing business with each other, there are lots of individual interactions and lots of subsets that we can break this down to. Okay, so there are a couple of pitfalls in parallelizing stuff, and this is also where Amdahl's law hits you. One is non-scalable algorithms: if you're using something like a depth-first search in a tree, that's actually hard to parallelize. If you have breadth-first search, this is something where you can more easily think about
Starting point is 01:14:34 how to parallelize this; it's easier to parallelize. So rethinking your algorithm is important, similar to what we already did when thinking about SIMD code: you basically need a different focus, you actually have to think parallel to do this, and then you will get new forms of parallelism. It might be a little bit less efficient than doing it on a single thread, or in a scalar way, you have to do more operations, but in the end it pays off because there's more opportunity to parallelize. A big problem is load imbalance. So if you have large tasks that you need to deal with, then break them down into smaller tasks and dynamically schedule between processors and
Starting point is 01:15:27 threads, etc., in order to get better load balance, so that different units, cores, etc., are not waiting. And this is a typical problem. We can see this in big data systems, like Hadoop, etc.: basically, you have one core in a cluster of 10 or 100 nodes that is actually doing something, and all the other cores are waiting, because this is just the one reducer where all the data ends up in the end. So by dynamically scheduling stuff and breaking it down, you can get much better performance. And of course, splitting stuff up into very small tasks means a lot of overhead.
Starting point is 01:16:22 So make sure that the minimum task size is not too small. Again, a trade-off: not too small, not too large, otherwise you will either have lots of overhead or lots of imbalance. Okay, so, well, let's stop here, I think it's a good point to stop. Do we have questions? No questions so far? Okay, then next time we're going to continue with parallelism and different kinds of algorithms. As I said, this will be on Wednesday. On Tuesday I'm out doing other stuff.
Starting point is 01:17:04 So then you have one day off. Use it for the task. If you have questions regarding the task we can also schedule something there. So if there's some tricky stuff in the programming then reach out to us. We will schedule something. Otherwise we'll also have additional opportunity later on. Okay, well then, thank you and see you next Wednesday.
