Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Data Structures

Episode Date: May 23, 2023

...

Transcript
Starting point is 00:00:00 Okay, welcome everybody. Today we're going to talk about data structures. Before that: we're in the final phase. There's one week left, right? Yeah, so we have one week left for the first task — give or take, we'll figure it out. If you haven't started, do so now, because it will take you some time to get it working, and we're not going to be able to help you last minute if there's some kind of problem. Well, probably not. But I assume everybody who's here already started, at least the way I hear you talk about the task. So that's good. And it's going to be more fun even with the next task.
Starting point is 00:00:50 So I hope you enjoy this one. And if you do this one, you're really going to enjoy the next one. So where are we? We're still in the single-core CPU package area, talking about data structures that fit this paradigm or this space well. So not so much going to storage, not so much going to network or distributed setups, but seeing how we can make data structures efficient
Starting point is 00:01:25 for these kinds of caching layers. Next time — tomorrow, actually — we're going to talk about profiling. So that's going to be fun. And then you will have the ART, the Adaptive Radix Tree, that you can implement; I'll introduce it today. One thing that I want to make you aware of
Starting point is 00:01:51 is that on the 14th of June, we're going to have to switch lecture halls. There's some conflicting event going on in this hall, so we're going to be in L10 or L2, so that's the first floor. We'll also have a dedicated announcement in Moodle at some point, just so you are already aware that we're going to have to move somewhere else. And with that — otherwise, not much new stuff here. Today, I want to talk about three things, or three kinds of index structures, the three most common index
Starting point is 00:02:32 structures that we will be using in database systems. One is hash tables and hashing in general. This is something which is always important and needs to be efficient. For that, we'll look at a couple of hash functions — so, what can you use as a hash function? — and then at hash table organization. This basically means: if we have a hash function, how do we use it? How do we map into the hash table? Then we're going to look at trees. Well, a very brief reminder about the B-tree, so you know where we are, and then some adaptations of the B-tree for cache
Starting point is 00:03:16 sensitivity. Meaning: if we're in memory, we're not going to disk in the B-tree. We said the B-tree is great if we have disk; it's not so great if we're just in memory. So how can we make it better for memory and for caches? That's the cache-sensitive B-tree, and we'll look at that one. And finally, we're going to look at tries, another type of tree structure — first tries in general,
Starting point is 00:03:50 then radix tries, which are the kind of tries that you typically want to use in a main-memory database, and then the one data structure that you will also implement, the adaptive radix tree or ART, which is a really nicely designed data structure that will give you some idea about how to make data structures efficient for caches in general. That's why we picked this one — and also why, because it's a nice data structure and it's not super complex,
Starting point is 00:04:26 we let you implement this one. Okay, so let's talk about in-memory indexes, or indexing paradigms. We have three different paradigms that we're going to talk about today, and they can generally be divided into two categories: comparison-based and hashing-based. With a hashing-based index, we directly know where we want to go in our index. So we're calculating the position in our index and we jump there
Starting point is 00:05:12 and more or less directly find what we are looking for. And then we have the comparison-based indexes, where we're basically asking: is it greater? Is it less? So somehow we compare the key that we're trying to find with something in our index. And there are tree structures, and then there are trie structures.
Starting point is 00:05:38 In the tree structures, as you already know, we're basically asking: is it greater or is it less? And then we traverse the tree from there on. The trie structure is different, because there we're looking at subparts of the key, and based on these subparts, we're going to continue traversing. So we're not comparing the complete key at every stage; we're comparing prefixes, typically. That's also why it's called a prefix tree, or trie. We're going to dissect the key into multiple parts and compare with those.
Starting point is 00:06:18 And if you're thinking about asymptotics, there are different ways you could model the asymptotics of these indexing paradigms. But if we say we have n keys of length k, and the keys are fixed-size, then typically we would say, well, hashing is constant time, right? We look at the key, we directly know where to go, and we have a constant-time, O(1) access. In a tree, it basically depends on the number of items in the tree.
Starting point is 00:07:04 So how many comparisons do we need to do? This is why we have logarithmic access, as we would always have: in a binary tree, we have a logarithm base 2; in a B-tree, we have a much larger logarithm base. I told you, right — in many cases this means something like three levels, which basically means, if we don't have a very large number of keys, that in practice we're also at constant access,
Starting point is 00:07:42 because we only have so few levels. And for the trie — the trie basically depends on the key size. Sorry about this, just a second. It's going to be better now. The trie height just depends on the key length. As I told you, we're splitting up the key into multiple subparts and we're finding our items based on these subparts. So the height of the trie only depends on the key length, and this means if we have a fixed-size key, we again have constant access, which is O(1). Right. And of course, in the hash table, we might have a smaller constant,
Starting point is 00:08:49 but asymptotically, it's the same. But if we have varying key lengths, then hashing will also cost us more for larger keys. The hash function, if you look at it, is actually a function of the length of the key: you have to do more work if you have longer keys. The same is true for the comparisons in the tree: if you have longer keys, then your comparisons will take longer, so this also goes in as a factor of the search. And for the trie, again, as we're just looking at the different parts of the key, we're in the same asymptotics as the hash function or hash table.
Starting point is 00:09:49 And I'm showing you this because typically, if you look at typical comparisons of these different kinds of index structures, in many cases the asymptotics that you would see would basically say: for the hash table, we're always constant; for the search tree, we have a logarithm based on the number of elements; and for the trie, the factor would be the key length. But this is not painting the whole picture.
Starting point is 00:10:25 This is why I'm using the same asymptotics that Viktor Leis uses for this. Because if you look at the calculations and the comparisons that we actually need to do, then this is what we have to do. And if we're looking at efficiency and the amount of work that we need to do, this is where we're going to go.
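As a rough summary of that point — a sketch of the key-length-aware complexities discussed here (the same style of analysis as in Leis et al.'s ART paper), for n keys of length k:

```latex
\begin{align*}
\text{hash table:} &\quad O(k) && \text{(hashing reads all $k$ bytes, then expected $O(1)$ probes)}\\
\text{search tree:} &\quad O(k \log n) && \text{(each of the $\log n$ comparisons may cost up to $O(k)$)}\\
\text{trie (radix tree):} &\quad O(k) && \text{(one node per key chunk, independent of $n$)}
\end{align*}
```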
Starting point is 00:10:46 Okay Let's go to hash tables right so hash tables basically consists of two main things that we need we on the one hand needle hash function that turns our key that we're looking for into a hash value and Then we have some kind of hash table organization. So given this hashed key So the hash value we somehow have to map this into our hash table And I mean if we have perfect hashing meaning our hash table, like each value that we're hashing will only map one item in our hash table or one entry in our hash table, then
Starting point is 00:11:32 everything's easy, right? So then it's just an array basically and we have a direct mapping. But that's not going to be the case and that's why we need different kind of hash table organizations. And we'll look at these second. First, we're going to look at different kind of hash functions. And, well, a hash function in general is basically just a function that maps arbitrary data to fixed size values. So this is kind of like a plain definition. And given these fixed size values, we can then basically say for these, we can have a hash table where we can look up where our variable size or variable arbitrary data
Starting point is 00:12:15 should be located in the hash table. So our hash function is really just a function from one domain to the hash value domain and typically the domain would be larger that we're going to like our original domain would be larger than the hash function otherwise our hash table is not going to be efficient at all or let's say also our hash function is not efficient at all, because we're not using the space of the hash function. And there's a lot of different examples, a lot of different things that people come up to. And it really depends on what you want to do with the hash function, what you're using. And one thing that's important to keep in mind, there's a huge difference in hash functions
Starting point is 00:13:06 for cryptographic approaches and techniques and stuff that we do. Because for us, there's always this trade-off in how good is the hash function and what's the performance. So we can basically also deal with poorer or actually poor hash functions if they're faster if it costs us less to deal with collisions etc in the hash function then uh if we would compute like a much better much more uniformly distributed hash function
Starting point is 00:13:42 so this kind of trade-off while in cryptography, if you have lots of collisions, well, if, and that means basically if your hash function doesn't really give you a good statistics and good randomness, then you basically can re-compute maybe the original value, or you can just use some other value that would give you the same kind of hash, and you can defer this from the hash function.
Starting point is 00:14:15 That doesn't really matter for us in many cases, right? So these kind of situations where things don't work out that well, we just need to make sure that the performance doesn't degrade too much. Everything else is kind of fine. And so, stuff that we're looking at, I mean, most simple would be division. So, this you should have seen by now already. So, this is just modulo, right? So, we're modulo function is something that gives us a very simple hash function. So, basically, we have our hash domain, so the space that we want to hash into.
Starting point is 00:14:58 And we can just use this in the modulo function, and we will automatically get, if we have a numerical representation of our value, we will automatically basically get a hash function, not necessarily a good hash function. We'll see this on the next slide. Then there's other stuff, and just so you've heard this, right? There's Fibonacci hashing,
Starting point is 00:15:20 which basically use a Fibonacci-like sequence to get an even spread, but it's more complex to calculate. Then what we actually use a lot is multiply at shift or XOR functions, so where you will do some multiplication, some addition with certain specific constants, and then do shift operations. So the idea is you do this a couple of rounds or even just a few rounds
Starting point is 00:15:50 and you try to somehow generate what you call avalanches in the bit representation. So meaning that your bit representation, like the original bit representation of your value will somehow be perturbated so that the values look most different and get like neighboring values basically get spread across the hash function as good as possible. And this is what most hash functions that we're looking at are actually using. So these kind of simple multiplications, additions, and shift operations. This is also called rotation. So we have our initial bit representation or numerical representation.
Starting point is 00:16:41 We're going to multiply this with something, so it's going to be larger. Then we're shifting it. We're just using some part, like the most significant bits. Then might multiply it again, might add something until our value basically gets mapped to something like a completely different domain, essentially. And we're trying, or people who design these hash functions basically try to find as good constants as possible for different kind of key lengths, such that you get a good distribution. It's not going to be perfect, but it's going to be quite good.
Starting point is 00:17:20 And one instance of this that's used a lot is murmur hashing. So we'll talk about this in a bit. Other stuff that's also interesting is, for example, tabulation hashing. So there the idea is that you, again, you split up your key into different parts. And for each of these different parts, these sub-representations, you create a table, a lookup table with random values. But basically you're saying, OK, I have the simplest case. Let's say we have four. So we're taking two bits. So we have four different values.
Starting point is 00:17:55 We're going to have a lookup with randomness in every of these four values. So we're just filling this up. And then using these random values, basically we're just filling this up. And then using these random values, we're going to add them together to get a quite uniform or universal hash function. Because using these tables, we can quite well create any kind of hashing function. But you basically need a special table to do this, and you need to store these tables.
Starting point is 00:18:33 So this is not something that we're going to look into in more detail, but something interesting to look up if you need a better type of hash function. OK, so let's look at hash function qualities and the kind of problems that we see. And so for a simple experiment, so it's not something I did, but I'm also showing to you for a simple experiment
Starting point is 00:19:00 that you would have in a database, think about a date field that's represented or that we want to represent as a 32-bit unsigned integer, where we're encoding the year into the most, 16 most significant bits, then the month in the next eight bits, and the day into the least eight bits, right? So this, for example, so I put an example here. So we're gonna encode all dates from the first of January in 1950 to the 31st in December in 1999.
Starting point is 00:19:38 So apparently it's a bit older example in 32 bits. So then say say for example our third the third of February in 51 would look like this and I've messed up here the 51 sorry about this just gonna quickly fix it No, I said 51. Actually, I just messed it up here. Okay. So if we basically compute this, so we have 1951 in binary representation, then this would be this nice binary number here.
Starting point is 00:20:34 Then the 2nd or February, that's still easy to read. And the 3rd of February is also still easy to read. So it's an easy kind of representation of a date into a 32-bit unsigned integer. And if we do this, basically we're translating all of these, and I've fixed my number, so 18,000 something something values. And here, if we fix this, if we encode this, and we're starting to hash this with a modulo function, then the following happens, right?
Starting point is 00:21:15 So let's look at, so think about a hash table of different sizes. So in powers of two, so say, for example, we're starting with two to the power of 8 and up to the 2 to the power of 19 different kind of slots in our hash table and we're hashing these values to these slots. Then what happens is basically mostly especially initially when we're starting with the first eight bits, basically, what will happen in the modulo function is we're just going to look at the last, the least significant bits in our representation and given that these are dates this means these are really just the last few bits here right so these will only touch basically these 31 days that you have in maximum per month right because we're never going to use the full 8-bit so 256 different dates that we could have because we just have like months only have days from 1 to 31 so this means in our hash
Starting point is 00:22:37 table actually only 31 of the or also in the hash domain so it doesn't really matter if it's a hash domain or not, or a hash table. In our hash domain, only 31 values will actually be touched or will be filled. And all of the other... All of the other values are basically empty. And, well, 225, I actually have it on the list here. So, this means only these few values are actually used, and we have lots and lots of conflicts. So, our hash function will actually basically only map to 31 buckets and everything else is empty. And
Starting point is 00:23:31 we have on average basically 589 values in a single bucket. And if we would have a uniform distribution or a universal hash table or universal hash function that gives us a perfect hashing, in this case, we would only have 71.33 for collisions or entries per individual bucket. So this is just to basically show you why a simple hash function like module or a simple modulo operation won't give you a good distribution in many cases.
Starting point is 00:24:15 And I mean this basically continues, right? So thinking about getting like nine bits, then we're basically touching the first bit of the month so we're going to have two two more or a factor of two more individual buckets but still most of the buckets because here a lot of bits are unused basically most of the buckets are still going to be empty so rather than having 31 different buckets, we're only going to have 62 different buckets that are used. And we still have 450 buckets which are empty. So, basically, our statistics are just as bad as before.
Starting point is 00:24:59 So, the function, and this continues, right? So, this goes all the way if even if we have 19 bit then we're still basically we have a lot of or most of the buckets are still going to be empty and even adding more and more bit won't help because we still like the in the first in the least significant bits back here i mean all the like the most, not the least significant bits, the first couple, these are gonna actually change, but then here we're gonna have a large space, which always is basically the same,
Starting point is 00:25:34 which always will give us a quite large range of buckets, which are unused. And so what you should take from that is basically if in many cases if we're using a modulo function we won't be able to fully utilize our hash table and we will have lots and lots of collisions. So in this case basically even here where in a uniform or universal hash table, so if we have a good hash table, right, where we're trying to map our 18,000 values into these 500,000 values, then usually we should have almost no collisions, right?
Starting point is 00:26:19 We will still have collisions in there in a uniform or universal table. This is just because of the Berkeley paradox. So you probably know this. If we basically touch enough values, then the likelihood of collisions at some point will just be very large. So we will have collisions, but we'll have very few collisions. In our case, here, we still have a lot of collisions.
Starting point is 00:26:49 So, we will still have, on average, six values per bucket. Rather than, on average, six. Let's say, for each bucket that is filled, we, on on average have still more than six values in there. So let's look at a more advanced hashing function. And this is murmur hashing. So that's the multiply and rotate function. So it's multiplication and rotate, so we're multiplying with a number, and then we're shifting the values,
Starting point is 00:27:30 and we do this in a loop, that's why it's called multiply rotate, mer, and we're looping, mer, mer, so this is kind of the thing, and this is kind of how it was implemented in the first place. And this is still an easy enough or a good enough hash function for many cases. However, the current version of Murmur, so that's version 3 or something,
Starting point is 00:27:54 that will actually use multiply, shift, and XOR functions. So this gives you an even better hash behavior, hashing behavior, than just doing multiply and rotate. But as I said, it's one example, and with the current kind of, or a good kind of murmur hashing, we get this kind of behavior here.
Starting point is 00:28:20 So here we can see if we're using eight bits for the same kind of setup, so again, we're using 8 bits for the same kind of setup. Again, we're using the state representation. Then we can see we're using all of the buckets. All of the buckets are actually filled. All of our hash values have original values in there. We have zero empty buckets. On average, we have 71 chain length, basically, which is perfectly the same as we would have in a universal hash table, which kind of makes sense, right?
Starting point is 00:29:01 Because, I mean, if we're using all of the buckets and we're averaging, then it should be the same. However, we can also see that if we look at the longest subsequence, right? So the longest chain, basically, meaning the most collisions that we have in here, we're still not too far from the average, which basically means we get a quite nice even spread out right and uh and this basically continues right so i mean this is already good for 256 uh so eight bits but for nine bits we see well it's still kind of the same um our longest sequence well it's not it's like it's not in decreasing as fast as the the average but it's still, I mean, we're close by. And then for 19 bits, we're also still, we're basically, I mean, there we can see we're close to the actual number of values that we have in our original, in our original, or in the original set.
Starting point is 00:30:02 So I said 18,000 something something so 250 something values that we're hashing and we see that we're close to 18,000. Again with this number of elements it's the likelihood of collisions is just very big and then our hash function is not perfect. So this means we will have some collisions in there, but we don't have that many collisions. And the maximum number of collisions that we have is three, right? So this means on average, the number of elements per bucket is actually very low. And of course, in a uniform, this is something that we won't be able to achieve, because we're only counting the values or the buckets that actually have something in there. Okay, so this basically already shows us
Starting point is 00:31:06 if we put a bit of effort, a bit more effort in our hash function, then we get much nicer behavior, we get much better distribution across the different kind of buckets. And, well, with that, basically, for sure, a much better organization of our table. And, well, so somehow I messed up my animations here. I'll have to fix this separately. So if we basically, if we have our hash function,
Starting point is 00:31:43 then what we actually want for the hash function... Let me just open this up, otherwise I'm getting confused. So, if we have a hash function, then we want to map it to a hash table. So, the hash table basically has a certain size. In this case, for example, m, our original data is n, and then ideally we want this kind of uniformity, so the length of the average collision chain should be n divided by m, integer division, of course we cannot basically have have a less than that. And the probability of a collision should be 1 divided by m, meaning hopefully we get a nice uniform distribution across the different kind of buckets,
Starting point is 00:32:37 which was not the case for a modulo. So there, basically, we mapped everything to a very small sub-part, so the probability of a collision was much higher than the actual size of the hash table. And of course, everything should be efficient. So we don't want to spend too many cycles. And this is important. So even there we have this big trade-off.
Starting point is 00:33:01 The better our hash function basically, the better the hash table organization. But if we spend too much time in the hash function, our hash function is super expensive, then we won't get this performance improvement back by the better hash table organization. Because, well, on average, the lookup, if we have a good enough hash table, hash function, on average, the lookup will even out, right? We might have some lookups that are bad, but most of them will work out nicely. If we spend too much on the hash function, this won't really pay off anymore. And so in practice we actually use something like multiply shift XOR combinations and you can basically look up there's a couple of different implementations out there.
Starting point is 00:33:53 Usually the trick is really finding good constants that you multiply with for different domains. So be it 32 bits, be it 64 bits, etc. You try to find good constants and then you get a good function. In practice, you actually need two steps that you have to take when you do your hash function. In the first step, we're basically mapping our value to the hash function. So in the first step we're basically mapping our value to the hash value so this is basically just applying the hash function and then most hash functions will give you some output as a power of two and well if if your hash table is not in a power of two or your hash table is a different power of two,
Starting point is 00:34:45 then you need an additional step where you basically do a mapping to the hash table size. And this is typically called the finalizer. If your hash function maps to the same power of two, for example, then we're good. But in many cases, we might actually start with a smaller hash 2, for example, then we're good. But in many cases, we might actually start with a smaller hash table, for example. Or our hash function will always give us a 32-bit hash value,
Starting point is 00:35:15 but we don't need a hash table that's large. So then we need this mapping. And there, again, one simple way would be modulo or you can do other more complex tricks especially if the modulo would again give you some kind of shifted representation. And as always it's kind of just straight off right. So now given that we have a good hash function, or a good enough hash function, so it's not about having a perfect hash function, we need to somehow map this to our hash table.
Starting point is 00:35:52 I mean, of course, first thing is, we have our finalizer, so we have our original hash function, and the finalizer, so the hash function, will be mapped to our hash table size. Still, we will have collisions. And now we have to somehow deal with these collisions. And there's two principal approaches. One's called chaining. And then some more complex stuff that I'll go to next.
Starting point is 00:36:19 But let's say using our single hash table, not any auxiliary structures, or not many or complex auxiliary structures. One way is using chaining. So we basically say we're hashing into our bucket. Our bucket might even be able to keep multiple values or addresses of the actual value, but at a certain point, it's kind of full. And then we're just chaining. We're adding more buckets where we add more values into the same bucket. And this means we get these long chains of values. Say, for example, if our hash table is actually much smaller than the original or the values that we hash in there, then say for example, in our example earlier with the dates,
Starting point is 00:37:17 256 buckets originally, and we want to map 18,000 values in there then for sure we need some extra space where we put these values and this would be then a list of additional values in there or a list of buckets if these lists get long we'll we'll basically pay by comparing right so then all of a sudden, we're not in linear or in constant access time anymore, but we're in the number of keys that we add to our table in asymptotics. So we basically will have to linearly go through this in the end.
Starting point is 00:38:04 Especially if there's no additional the end, especially if there's no additional structure. Basically, if there is no sorting, nothing, then we'll actually have to search through this in linear time. So that's fine if we have few collisions. It's bad if we have many collisions. So as soon as these things get long,
Starting point is 00:38:24 it's going to be expensive. Another way is called open addressing. So there the idea is rather than basically doing some extra structure in the hash table, we're just going to use our hash table. So ideally we still have space in our hash table. So typically we would try to have our hash table larger than the number of keys that we're actually using. And then the first, most simple approach is we're doing linear probing, meaning we have our, we are hashing our key into the position
Starting point is 00:39:01 in our hash table. So consider your hash table like an array, right? So we're hashing the value. The hash value is just the address in the array. If there's already a value there, well, let's just go to the next position. If there's a value there, let's go to the next position. So we're just going to continue until we find empty space and we're going to put our value in there or our address of the original value and
Starting point is 00:39:29 This is good because it preserves locality so well values Like our hash that are hashed into a certain space will basically be close by so we can just search them Linearly in the table if we're trying to search something in the hash table We're going to look up at the hash value position. If this is full but not our value, we'll have to go to the next one. If it's still not, we have to go to the next one until we basically find an empty slot
Starting point is 00:39:58 or we find the value that we're looking for. And well, it can, like in our example, again, where we have a lot of collisions, for example, this will create a cluster. So basically, everything will be in the same kind of space. And that means we'll, again, get to something like a linear search through our hash table in many cases. So while inserting and also while searching.
Starting point is 00:40:26 Because in these clusters, we basically have to, like in both steps, while inserting and while searching, we'll have to go through all of the values in there. So that's why people try different approaches there or come up with different approaches like quadratic probing. Rather than going to the next slot, we can also try to spread out across the table. Basically, well, if it's not the next slot, it's free.
Starting point is 00:40:53 We're going to go to the fourth slot, so 2 square, right? Then if that's not free, 3 square, 9. If that's not free, 4 square, 16. If that's not free, 4 square, 16. So we're going to jump ever further. And hopefully, with this, we're not going to create clusters anymore. So because the values will be spread out more across. If we have collisions, we're going to spread out
Starting point is 00:41:20 more across the hash table. So why might that not be great for our in-memory setup? We will jump over the cache boundaries. Exactly. So the cache misses are a problem, right? So we don't have locality. If we look up something and we're going like, I don't know, to some power that already crosses the cache line boundaries, then we'll have to read from memory again, and that's gonna be costly.
Starting point is 00:42:00 So there's, again, this trade-off. And for that, or as another idea, there's the so-called Robin Hood hashing, where we just use this, again, this linear probing approach, but because Robin Hood is kind of always helping the poor right so we're kind of being most what you say fair to everybody to all of the values so basically as soon as we're doing linear probing and we find a value that's actually closer to its original position than our current value then we're going to swap them we're going to say okay well i've i'm already too far from my current position, so you will have to go, you're going to switch with me and have to go further from your actual
Starting point is 00:42:50 hash position. So in this way, we're basically creating similar length probing sequences. Because in the linear probing, what could happen, right, is basically I'm probing for something and there's already some bit of a cluster and I'm the first value, so I'm adding something new to the first value in the cluster, then I have to go through the whole cluster, add it to the end, so I'll have a long probing sequence if I have to go to find something at this value. And in round robin hashing, I'm just going to switch with something that has a shorter probing sequence. So on average, the probing length will even out. Asymptotically, it doesn't make a difference.
Starting point is 00:43:39 But your average axis will be closer to the average uh typically because we'll i mean all of them will be more or less the same so again i mean in in linear probing you will have the same asymptotics because i mean you will have faster accesses and you have slower or less probing and longer probing, but in round robin, you kind of get a more robust behavior. Another approach, like more advanced approach that I want to show you is Cuckoo hashing. So in here, what we do is we're using two hash tables and it's called Cuckoo and two hash functions. Cuckoo hashing because basically it works like a cuckoo, right?
Starting point is 00:44:32 The bird basically that takes or puts its egg into some other bird's basket, and then the bird will basically, the cuckoo nestling will basically kick out the other bird. So the idea, that's kind of the idea. We're kicking, similar to the round robin, the robin hood hashing, where we're kicking out value. So the idea is we're trying, like we're first trying to insert in the first table, first hash table, and if it works, fine. If it doesn't work, the new value will kick out the old value. So the new value will kick out the old value and the old value will be hashed
Starting point is 00:45:13 to the second hash table. The second hash table has different hash functions, so in this case we're basically then going to have to find, or we hopefully find some space in there. If not, we're basically kicking out that value again and we're going to rehash this value to the first hash function. With the first hash function. And hopefully then there might be space there, but of course there might also be some recursion in there. If the recursion basically happens, usually at some point it should stop. And if for the lookup we're going to do the same,
Starting point is 00:45:56 we're just going to jump back and forth until we basically find our value or we find an empty slot, there might be a loop. So we might actually come to this place where we're kicking out a value. That value is kicking out a value, and it's going to go back to the original value. Then we basically cannot insert our hash value anymore,
Starting point is 00:46:17 because it's just going to create loops, going to continue forever. And in that case, we have to basically rehash everything. We need new hash functions. So I'm going to give you an example, just so you know how it works. So we're going to start with inserting these values with two hash functions, so not good hash functions.
Starting point is 00:46:39 I told you, right, we're just going to do modulo. And we're going to multiply with a too small number to actually get some kind of avalanches, some actual distributions just for showing you how it works. So our first hash function is modulo eight. Our hash tables are eight spaces, so that's going to be fine. And our second hash function,
Starting point is 00:47:03 we're multiplying the value by 3 and take modulo 8. And you can already see there's collisions in there. So we're going to have a loop quite quickly. But we're going to start with inserting 1. So 1 modulo 8 is going to be 1. So this is going to fit into the first hash table here. Nothing in there else. So we're actually good.
Starting point is 00:47:26 Then, 4 modulo 8 is 4, so we're in the fifth position in here, and we're good. Then we have 8 modulo 8. What's the result of that? Quick check. 8 modulo 8? Zero, perfect, thank you very much. 8 modulo 8? 0. Perfect. Thank you very much.
Starting point is 00:47:46 So this is going to be added in the first position, so position 0 in our 8. And now we want to add 0. 0 modulo 8 is 0 again. So here we have the first collision. So this basically means we want to add this here. So what happens now? Who can help me? Exactly and we're rehashing the 8, 8 times 3, 24, modulo 8.
Starting point is 00:48:27 Still 0. Still 0. Very good. So we're going to kick out the 8. And that's why I told you it's not really a good hash function, because that's always going to happen in this case, right? So we have a quite tight loop already there, which will mean we'll need new hash functions quickly. And in this case, but still works because we don't have more collisions right now.
Starting point is 00:48:53 So then we want to insert the 9. Where does 9 go? 9, modulo 8. 1. 1. What happens? We kick out the 1. We kick out? 1. What happens? We kick out the 1. Where does the 1 go?
Starting point is 00:49:11 3. Position 4. So this gets kicked out here. And this is our final hash table. And so this is how basically cuckoo hashing works. Basically, if we want to do a lookup now, say,
Starting point is 00:49:26 for example, we want to look up the 1, then we would first hash with our first hash function. We would go to 9, then find the 9, but not the 1. So we hash with the second hash function. We find the 1 where it is, and we basically have our lookup. If we would look up, let's say, 16, for example, then modulo 8 would be 0, right? So we're looking at a 0. Well, it doesn't match. We're going to the second hash function. We're looking again at position 0.
Starting point is 00:50:07 There's the 8. Nothing happened. So we're basically back in a loop. So we know it's not in there. OK. Yes? If we add another 0, for example, what happens? So another 0?
Starting point is 00:50:23 You mean modulo 0? Or if you want to mean modulo 0, or if you want to add like modulo 0. So basically, then we have a, if we add a 0 again, then we have two values with the same key. So that basically means then we would basically need something where we store for one key, we store multiple addresses. So this is basically.
Starting point is 00:50:44 16 then. If we have 16, then we have this ring, right? So then all of a sudden, we have zero in the first table, position zero in the first table, position zero in the second table. And we're basically trying to recurse. It doesn't work. We have to find a new hash function. So then basically, in this case, it doesn't work, we have to find a new hash function. So then basically,
Starting point is 00:51:08 in this case, it doesn't work, we have to fix this. Okay? Good. So with this, I would say let's do a break here and go into tree structures after a five-minute break. So, everybody fresh? So I forgot something, an announcement, which I actually really want to tell you. So there's going to be a presentation in our joint seminar with TU Darmstadt tomorrow by Anna Klimovic from ETH Zurich. It's going to like not 100% fit to our topic here, but still, Anna is in hardware and systems, so still there will be angles to what we're doing here. It's more on the ML side, her current work, but still systems and hardware focused. So it's resource efficient machine learning with scalable input data processing.
Starting point is 00:52:07 Tomorrow 4.15, you have the link, the Zoom link, in Moodle as well. So I actually can highly recommend that. So if you have time at that time, nothing else to do, then feel free to take a look and listen in, because that should be quite valuable. Let me see if, yes. Okay, so with that, let's switch gears, go to the tree structure. So we discussed caching in length, now we're going to talk about trees.
Starting point is 00:52:45 And as a quick recap, let's talk about the B-tree once more. So reminder, as always, B-tree is the universal data structure for this kind of search-based or for comparison-based search in a disk-based database system. And let's say disk-based or something where we're working with larger blocks, with larger grain, more coarse-grained granularity. And the nodes, it's a very wide data structure, so each node has many children, unlike a binary tree where we only have two children.
Starting point is 00:53:29 And we're trying to keep each of the nodes not too empty, so at least half full, except for the root. Meaning we have in each node basically at least k. If the node size is 2k, then we have at least k keys. And I've somehow mixed up. No, no. If you have d keys, then we have at most 2d keys. And then if you have k keys, we have k plus 1 children, basically always saying greater or less than and equal
Starting point is 00:54:11 or the other way around, like less than or greater and equal to the root. So this basically looks something like this, right? And the k would somehow fit the disk page space. And this still makes sense in in-memory databases because we can basically look at memory page sizes and do the prefetching there. So that still somehow works well, but not as well.
Starting point is 00:54:46 Because basically we could also try to keep this better structured in memory so we don't have these random jumps in memory to different positions in the memory. The keys are basically, or the children in a B-tree can be located somewhere completely differently. So basically, this means from one level to the next will jump far in memory. And that means we're not going to have any cache locality here. So this basically, in a general layout,
Starting point is 00:55:26 this would probably mean for every step in the tree, one basically page or one cache miss, and then basically going to memory, which is not that bad. But it's also not great. We can do better in memory. But still, I mean, we always have, like, we have a very wide tree, so we don't have that many steps, right? Unlike a binary tree, where we're going to go, like, the tree is going to be very wide, we're going to go very far, or many steps, many iterations through the tree. Here, we only have these few.
Starting point is 00:56:07 But as I said, we can do better. And for this, people came up with the cache-sensitive B plus tree. And here, the main idea is basically we're not using as many pointers. So in the B tree, basically every child node gets a separate pointer in our node. And this means basically every node, or looking up another node, basically means doing a jump in memory, so doing
Starting point is 00:56:41 this kind of additional cache lookup. And so rather than doing that, what we're trying to do is we're laying out our nodes, our children, in memory close by so that if we're loading something, we don't really need to do our pointer addressing. We don't have to go through the pointers. But we can just do a lookup directly, because we know the memory layout.
Starting point is 00:57:07 And that can be done by storing all of the children basically consecutively in memory. So rather than basically having every node somewhere different in memory, we have node somewhere different in memory, we have them close by in memory. And then that just basically means like an ARRI lookup in memory. And then basically our tree or our lookups will be faster because we directly know.
Starting point is 00:57:43 So we can basically have like a more cache-friendly lookup for the children. But we need to block already at preparing the nodes or at creating a node, we already need to block all of the memory for the children, even if they're not necessarily full. Still, in the nodes, we will typically have keys in the nodes, not the children, but we will have them in each key, in each node that is filled somehow, we will have between D and 2D keys. So we're not going to have just sparsely filled keys. But we might have space in memory that we're not using yet. So that's kind of the trade-off that we're doing. So that's the overall idea.
Starting point is 00:58:49 So rather than having the data being organized randomly in memory, which we would have in the B-tree, we're trying to have larger consecutive blocks in memory. And I mean, we could still have addresses here or pointers, but since they're located consecutively in memory, we don't need them, so we can directly look up. There can even be a more efficient lookup version or if we don't need to do many updates, right? Or if we don't need to do no updates basically, then we can build such a tree in a lookup friendly version or a read-only version which is the cache sensitive search tree. So
Starting point is 00:59:39 because we cannot add something, where then we make sure that each node basically fits into a cache line and we can fill all of the, because we cannot add something, where then we make sure that each node basically fits into a cache line. And we can fill all of the, because we know there's no updates, we can fill all of the nodes 100%. And then we don't need any child pointers. So we know everything is filled, basically. Then rather than using child pointers, we can just use offset computation.
Starting point is 01:00:06 So our complete tree is basically one large memory region. If we need to break it up into separate regions, then we might need pointers there. But using virtual memory, we can really just use one large area where we put all of the data in and then just compute the offsets rather than looking up pointers. And that would basically help us, but we don't have to look something up to find the next tree or to find the next node, but we can just compute the node.
Starting point is 01:00:43 So this basically will be faster for the lookup. So we're going to save on this additional indirection where we first have to read the address and then from the address basically load something where we can directly say, please give me this next address or this next position. So, and then we're basically in this trade-off here again, like we can make larger nodes, then we have a lower tree height. We can make smaller nodes. We basically fit them better into the cache lines, but then the trees might get a bit higher.
Starting point is 01:01:27 So there's kind of this usual trade-off. And this is also kind of a similar idea to have a full CSP plus tree. So this basically, we can also use this where we basically have all of the children in all of the child nodes in the same node in one big area. This we would then be calling a node group. And we can also have, or this is also how we would structure the CSB3, CSB plus tree in general. So, of course, at a certain level, we might want to, or, I mean, we always will have all of the children in one node,
Starting point is 01:02:18 but then we can also group them in larger areas. And then we pre-aggregate. As we already know, the space in the tree is completely full. We pre-aggregate the space for the tree. Or even if it's not 100% filled yet, then we still pre-aggregate the full tree. And then we get the same kind of search performance. I mean, we might have some empty space here,
Starting point is 01:02:50 but we know exactly where to look everything up. And we also get a good insert performance if we don't need to do a lot of splitting, right? So because we already have a lot of space pre-allocated. But again, we might pre-allocate a bit much space if our tree is not very, very full. So again, trade-offs between the two. The read-only version, this we can use if, I mean, like using the full tree, this we can only use if we actually have no or almost no updates, because otherwise we'll have to do
Starting point is 01:03:30 a lot of reorganization all the time. Then we have something called the SCSB or the segmented CSB plus tree, rather than basically saying we have all of the children of one node in one big area, we can also say, well, let's lose different kinds of subsegments, could be two or three, and then one child pointer per segment. So there we get better update performance because our splitting is cheaper. We don't have to split as much data. And then, however, we have to go through these pointers
Starting point is 01:04:19 if we want to search for something, right? So there's like additional lookups. So the search performance is somewhat worse in this case. And if we look at the performance in general, so if we have our cache sensitive search tree, so the CSS, that has best search performance, right? Because we directly can just compute all of the locations. We don't need any addresses. We don't need any lookups, basically, when we have to read the values. But we don't need to follow pointers. We're just basically using offset computation.
Starting point is 01:04:56 The CSB plus tree, so the full CSB plus tree, basically, in this case, means we have also still good because within each of the children or for each of the children we just have a single lookup or we have all of the values in the same area basically. Which kind of works well but is like a, or has like a bit less performance because we need this one indirection through the addressing, but still better performance than the segmented CSB plus tree where we have to go through multiple addresses in case. And then of course the B plus tree will be a bit worse or will be worse again because there we're going to random positions in the memory all the time.
Starting point is 01:05:50 So there we're not going to have as good memory layout. And that's also the same case for the S-CSB. So the memory layout probably will not be as good in the CSB. And again, the CSB will not be as good as the CSS, which has a perfect memory layout in the end. And for insertion, the costs of a B plus tree should be similar to the S-CSB plus tree, where basically, we're just doing local splits. So we don't need to do a lot of reorganization all the time.
Starting point is 01:06:28 In the CSB plus tree, we will have to work with these larger areas. So copying data back and forth in these areas will be more costly in there. In the CSS tree there, we have to reorganize everything. As soon as we insert a new value, the whole tree will change essentially. So that's why this is actually much more costly. And well, so basically CSB plus and SCSB are good if we have more insertions and are kind of good tree structures for cache-sensitive
Starting point is 01:07:08 access, and cache-sensitive layouts of a B plus tree in the end. And CSS would be best in the read-only environment. So if we have data that we're just going to read all the time, then this will give us a quite nice access performance. As one way to basically see how we can actually structure or map something like a B-tree into a more cache-friendly version where we can get better performance out of it. So this is a few examples of how we deal with this.
Starting point is 01:07:50 And the idea here is just make sure we have a better memory layout than we would have in the regular version, where then we basically go to different locations. We have to go through memory to look up different things in there all the time. Okay, good. Then, if my pointer lets me, yes. Let's go to the adaptive radix tree or to tries first in general. Somehow this is a bit tricky. So, what's a try? radix tree or to tris first in general.
Starting point is 01:08:27 Somehow this bit tricky. So what's a tri? So it's a so-called K-ary search tree. So it's again, something or like a B tree, not a binary tree, but a tree that has multiple children per node, or many children per note. The name originally comes from retrieval, so the original pronunciation would actually rather than try
Starting point is 01:08:54 would be tree, but somehow try stuck, because it's also easier to differentiate the two in language, right? So we're talking about tries today, even originally it would be retrieval, so tree. This is, it's also called a digital tree or prefix tree. Prefix tree because we're basically looking at prefixes of the key, the search key,
Starting point is 01:09:21 rather than looking at the whole search key all the time, as we would do in hashing, as we would do in a B-tree, etc. We're just looking at sub-parts of the key. And prefix, because initially people were thinking about text, or alphabetical values, for example, looking up those, you would start basically letter by letter. So starting, say, for example, we want to have and, and, any, are, art, stuff like this. Then the prefix tree or try would start with the A.
Starting point is 01:09:54 So that's our first letter. We only have letter or words with A, so nothing there to do. Then our second level, we're looking at the second value, so there we have n and r, because we have there this kind of difference, so we're gonna split up into two sub-children for n and for r, and then from there on, we have d, t, y for n and e and t for r and r.
Starting point is 01:10:23 So this is kind of this idea of the prefix. And the K, so that the number of different children is basically the domain of the key character. So if you're looking at alphabet, right? So it's 24 letters in English alphabet, something like this. If we're looking Unicode, it's 100,000 something, something different letters.
Starting point is 01:10:52 So that would not be really smart if we have Unicode in there. So in the end, our tree width, so the width of a single node, really depends on the domain. And the height of the tree only depends on the key length. So this is basically the real difference to a regular tree, to the B-tree. In the B-tree, the height depends on how many values do we enter. In the trie, the height depends on how long are our keys,
Starting point is 01:11:28 or in how many subparts do we divide our key, basically. And there is no rebalancing. If the keys are all the same length, they will all be on the same level of the tree. So we don't have to do anything there. We just basically need to organize the tree. The only difference is that some parts of the tree will be more heavily filled, other parts will be less filled if we're not using the whole domain of the key values.
Starting point is 01:11:58 We have a lexicographical order in the keys and the keys also are implicitly stored. We don't really need to store the keys at all because going from the root to the actual node will read through the key. So that basically all of the tree inherently already contains the whole key and we can reconstruct it from going through the path. Actually, we don't need to reconstruct it because we're gonna read through it and we already reconstruct it from going through the path. Actually, we don't need to reconstruct it
Starting point is 01:12:25 because we're going to read through it and we already have it there. This is all good and nice, but say, for example, as I said, Unicode, well, this won't work very well anymore because in general, we might have on each level or in each node 150,000 something or 40-something thousand different values. So the domain is way too large.
Starting point is 01:12:54 So with that, we need to do something else. And for this, we have the so-called Radix tree. And you will hear Radix again. So this is an easy way of dealing with any kind of data by just using the binary representation. So basically, as soon as we have a binary representation of our values, we can just split that up into different chunks. So we're just going to use basically
Starting point is 01:13:23 the binary representation, starting from the most significant bits, and split our data up from there. And the cool thing about this is we can also nicely configure the fanout, the fanout being the number of children per node. With the alphabet, we basically have to deal with whatever domain is given to us. If we're using a binary representation, we can just use powers of two: if we use one bit, we have two children; if we use two bits, we have four children; if we use three bits, we have eight children. So this way, we can very nicely say, this is the fan-out, or this is the width of each node on each level.
Starting point is 01:14:16 And then each inner node is basically just an array of two to the power of the number of bits pointers. So that's an easy way of splitting this up. Here, as I said, if we have one bit, we're going to have two children; if we have three bits, we're going to have eight children, for example. And the height, again, just depends on the length of the key. In a radix tree, we're splitting the key up
Starting point is 01:15:05 into chunks of a fixed number of bits, so we divide the key length by the number of bits per level. Say, for example, we have k-bit keys and we're using s bits per level; then we have k divided by s levels in our tree. So we can directly configure how deep our tree actually is. And the number of bits we use is then a critical parameter for a radix tree, in terms of both height and space consumption. Because on each level, each node stores two to the power of s slots. So say we have three bits per level; then we're going to store eight pointers per node.
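A rough sketch of this layout, assuming 32-bit keys and s = 4 bits per level (chosen here so that s divides the key length evenly; the names are illustrative, not from the lecture):

```cpp
#include <cstdint>

constexpr int S = 4;                       // bits consumed per level
constexpr int FANOUT = 1 << S;             // 2^4 = 16 child pointers per node
constexpr int LEVELS = 32 / S;             // 8 levels for a 32-bit key

struct RadixNode {
    RadixNode* children[FANOUT] = {};      // mostly null in sparse regions
};

// Chunk of the key used at tree level `level`, most significant bits first.
uint32_t chunk(uint32_t key, int level) {
    int shift = 32 - (level + 1) * S;      // where this level's bits sit
    return (key >> shift) & (FANOUT - 1);  // mask out the s-bit child index
}

// Descend one level per chunk; returns nullptr if the key's path is absent.
RadixNode* lookup(RadixNode* root, uint32_t key) {
    RadixNode* node = root;
    for (int level = 0; node && level < LEVELS; ++level)
        node = node->children[chunk(key, level)];
    return node;
}
```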
Starting point is 01:16:00 Not all of them will point to something, but we still need that much space per node. If we use only two bits, we're not going to use as much space on each node, but our tree will be much deeper, because we're only consuming two bits of the key per tree level. So the key will be split up into many more separate parts. So basically, the tree height depends on the number of bits per level, as we already said. And a perfect binary search tree, for example, will only be better
Starting point is 01:16:53 if we have a very small number of keys, because then the binary search tree will still be very flat, while our radix tree always has the same height. With a larger number of keys, the radix tree's height stays the same regardless; we're always going through the same number of levels. And the more bits we're consuming per level, the smaller the tree height.
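A quick back-of-the-envelope, assuming 32-bit keys, shows the trade-off between fan-out and height:

```cpp
#include <cstdio>

int main() {
    const int k = 32;                    // key length in bits
    for (int s : {1, 2, 4, 8}) {         // bits consumed per level
        int levels = (k + s - 1) / s;    // ceil(k / s) = tree height
        int slots  = 1 << s;             // 2^s child slots per inner node
        std::printf("s = %d -> %2d levels, %3d slots per node\n",
                    s, levels, slots);
    }
    return 0;
}
```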
Starting point is 01:17:27 So there are a few trade-offs that we have to make, right? One problem we have to deal with is the space consumption per node. If we have very wide nodes, then these very wide nodes will consume a lot of space on each level, and they might not be filled. So think about the trie earlier, where we had A and then N and R.
Starting point is 01:17:57 And so the A node, even though it just has one entry, will still reserve slots for all of the different locations, because we need a full node in there, right? The node has to be allocated, and it's going to be quite empty. We're not going to have completely empty nodes, but we will have very sparsely filled nodes. So having a smaller s basically means
Starting point is 01:18:24 we're not going to waste as much space. Having a larger s means we're wasting more space, but we're going to have better lookup performance because the tree is going to be shallower. So this is something we have to deal with. An alternative to this, and this is where ART comes in, is basically, well, let's have multiple node sizes. So rather than saying every node needs to be the same size,
Starting point is 01:18:53 with separate slots for all of the bit combinations, we can structure this in a way that if a node only has one key, or few keys, we just mark it as a node with few keys, and then we know, okay, we're going to use less space there. We still need to be able to represent all locations, of course. So, based on the filling degree, we can use different node sizes. That's the main idea of the adaptive radix tree, the thing that really gives us an advantage. So let's look at how the adaptive radix tree works.
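For orientation before the next session: the original ART paper (Leis, Kemper, Neumann, ICDE 2013) uses four inner-node sizes, which could be sketched roughly like this (a structural sketch only, not a full implementation):

```cpp
#include <cstdint>

struct Node { uint8_t type; };  // common header; type tags the layout below

struct Node4 : Node {           // up to 4 children: key bytes and pointers in
    uint8_t  key[4];            // parallel arrays, searched linearly
    Node*    child[4];
};

struct Node16 : Node {          // up to 16 children: same layout, but the
    uint8_t  key[16];           // key array can be searched with SIMD
    Node*    child[16];
};

struct Node48 : Node {          // up to 48 children: a 256-entry index maps
    uint8_t  child_index[256];  // the key byte to a slot in child[]
    Node*    child[48];
};

struct Node256 : Node {         // the classic radix node: one pointer slot
    Node*    child[256];        // per possible key byte
};
```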
Starting point is 01:19:44 And well, actually, we're already out of time, right? So let's look at this next time, rather than rushing through it too much, because you will have to implement this. I'm going to continue here next time, which is not really a problem, because you'll still have enough time to implement it. And I'll go through this in detail with you in the next session.
Starting point is 01:20:16 So far, basically, remember it's also a tree structure, and it really just depends on the key. We're splitting up the keys. And the one thing you can already remember is this radix idea: here, we're basically splitting up our binary representation into different subparts. It's hard to read, but say, for example, we have an integer. Maybe I should have made this larger.
Starting point is 01:20:42 So we have some kind of integer, we use its binary representation, and we're just splitting this binary representation into subgroups and using these subgroups, or subparts, as the index on the different levels. The size of these subgroups is one of the configuration parameters of the radix tree, and it, of course, needs to be fixed. We could somehow use different lengths for different levels,
Starting point is 01:21:16 but that's going to be a mess and a nightmare to maintain. So we want this to be easy to compute. And this also determines the number of children we will have. If we have 8 bits, for example, this means potentially 256 different values, so potentially 256 children.
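As a worked example of this splitting, assuming 8-bit chunks of a 32-bit key (the concrete key value here is made up for illustration):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t key = 0x12345678;  // hypothetical key, not from the lecture
    for (int level = 0; level < 4; ++level) {
        uint32_t byte = (key >> (24 - 8 * level)) & 0xFF;
        std::printf("level %d -> child index %u\n", level, byte);
        // level 0 -> 0x12 = 18, level 1 -> 0x34 = 52,
        // level 2 -> 0x56 = 86, level 3 -> 0x78 = 120
    }
    return 0;
}
```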
Starting point is 01:21:45 OK? Questions? No questions. Great. Well, then see you tomorrow. Tomorrow we're not going to continue with this; instead, we're going to have the profiling session. So that should also be fun.
Starting point is 01:22:01 And Lawrence will do the profiling session. If he's done very early, and since I'm also going to be here, I might jump in and present the rest of ART; otherwise we're going to do this next week. With that, thank you very much, and see you tomorrow.
