Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Structures

Episode Date: May 15, 2024

...

Transcript
Starting point is 00:00:00 Welcome back to Hardware-Conscious Data Processing. My name is still Martin. Tilman is still gone this week at a conference, so I'm going to talk about data structures today. Before we start with the actual topic, two announcements. There are only two weeks left for the exercise, so if you haven't started yet, please start. And second, there will be another announcement or reminder on Moodle, but we need to move the lecture in two weeks to another room.
Starting point is 00:00:38 It will be up on the first floor, because we can't be here on May 29th. Okay, so yesterday we talked about query execution models. Basically, after we learned how CPUs work internally, the question was how we can execute data processing efficiently on such CPUs. And today's topic is data structures. So now the question is: how can we construct data structures in a way that uses the capabilities of modern CPUs?
Starting point is 00:01:22 Okay, so what are we going to talk about in this lecture? The first thing is hashing, so hash maps and hash functions. Afterwards, we are going to talk about trees; we already talked a couple of weeks ago about B-trees and variants of B-trees, and today mainly about the CSB+-tree. And third will be tries, which are another tree-shaped data structure, and we're going to talk about two forms of tries.
Starting point is 00:01:55 Basically, there is one form, the radix tree, and then a specialization of the radix tree called the adaptive radix tree. So when we talk about in-memory indexing, there are basically these three forms that we just mentioned: hashing, comparison-based trees,
Starting point is 00:02:19 for example the B-tree, and tries. Those are the three main data structures we use when we want to index data in main memory. And you probably know, or remember, the asymptotics: we have n keys of length k. For a hash table we say lookup is O(1), right? You calculate your hash function, the hash result determines a certain bucket, you look into this bucket, and there you usually find your key.
Starting point is 00:02:57 For search trees, this is logarithmic, depending on how many nodes you have or how large your nodes are; you have some form of logarithm when you look for a certain key. And a trie is similar to a hash table in that sense. What tries do is split up the key: instead of comparing the whole value, at each level of the tree we only look at part of the key, a prefix. We will talk about that in more detail later. But what is actually interesting,
Starting point is 00:03:32 or what you should keep in mind, is that those asymptotics are not always that simple, depending on how large your key is. O(1) is totally fine if you just consider, for example, integers, which you usually do if you write a benchmark, most research papers do that, or if you do your textbook implementation of a hash table. But there might be situations where the key is quite large and quite long. Then your hash function is not that simple anymore; it might take some time. And once you have found your bucket in the hash table, comparing the actual values in this bucket to see whether this is the value you're looking for
Starting point is 00:04:12 might also be more expensive in case you have long keys. The same is true for comparison-based trees: at each level you have to compare, am I smaller or larger than this split value? Comparing a very long key can be quite expensive there; it can actually be the most expensive part. A typical example where this happens is a GROUP BY operation in the database, which we talked about briefly yesterday: if you group by multiple columns and use a hash structure for that, you have to hash all the keys you group by, and with six or seven group-by keys this can be quite expensive.
Starting point is 00:05:03 And those can also be strings, right? So longer keys can be quite expensive. Okay, when we talk about hash tables, there are certain degrees of freedom when we try to come up with a good, efficient hash table. The first is the hash function: which hash function do we use? The second is the hash table organization. We usually
Starting point is 00:05:29 use this hash function to get to the final position in our hash table, but those are two distinct things: for different hash table organizations we can use different hash functions, so they are not necessarily directly connected. And what are hash functions? Simply said, we map one domain into our hash domain, a smaller domain. Usually this hash domain is much smaller, because we want to use it to find our buckets in the
Starting point is 00:06:00 hash table. So the hash function is basically a mapping into the smaller domain, and there are many options that make sense in some cases and don't work that well in others. A simple example is division, for example using modulo; this is pretty much the simplest hash function. It can work quite well in some cases and it can be really bad, as we will see in a minute. Another one is Fibonacci hashing, which is also often used because it's relatively fast; it multiplies with a constant derived from the golden ratio, which is where the Fibonacci in the name comes from.
Starting point is 00:06:39 Then the probably most well-known, most often used ones are the multiply-add-shift forms of hash functions, where you simply multiply by a certain constant. It's really tricky to find those constants, so they are like magic numbers. You multiply, then maybe add something, and then you shift. This is most often the kind of hash function that is used. And then there is murmur hashing, which we will talk about, and tabulation. Tabulation is actually quite interesting; we are not going to talk about it in detail.
Starting point is 00:07:15 But it's a really good hash function with good properties. What you do is split your key into parts, and for each part you have a table filled with random values. For each part you look up a random number in its table, so you get multiple numbers from multiple tables, and then you XOR them together.
Starting point is 00:07:43 This is basically the idea of tabulation hashing, and it has some really nice properties. It's not used that often, but there are cases where it has advantages over other hash functions.
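To make these families concrete, here is a minimal C++ sketch for 64-bit keys. The constants and the fixed seed are illustrative assumptions, not values given in the lecture; the golden-ratio constant happens to be both the classic Fibonacci multiplier and a popular multiply-shift magic number.

```cpp
#include <cstdint>
#include <random>

// Fibonacci hashing: multiply by floor(2^64 / golden ratio); the high
// bits end up well mixed, so a finalization step would keep those.
uint64_t fibonacci_hash(uint64_t key) {
    return key * 11400714819323198485ull;  // ~ 2^64 / phi
}

// Multiply-shift: multiply by an odd "magic" constant, then shift the
// well-mixed high bits down to select one of 2^log2_buckets buckets.
uint64_t multiply_shift(uint64_t key, unsigned log2_buckets) {
    return (key * 0x9E3779B97F4A7C15ull) >> (64 - log2_buckets);
}

// Tabulation hashing: split the key into its 8 bytes, look each byte up
// in its own table of random 64-bit values, and XOR the results together.
struct TabulationHash {
    uint64_t table[8][256];
    explicit TabulationHash(uint64_t seed = 42) {
        std::mt19937_64 rng(seed);  // fixed seed only for reproducibility
        for (auto& row : table)
            for (auto& cell : row) cell = rng();
    }
    uint64_t operator()(uint64_t key) const {
        uint64_t h = 0;
        for (int i = 0; i < 8; ++i)
            h ^= table[i][(key >> (8 * i)) & 0xFF];
        return h;
    }
};
```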
Starting point is 00:08:12 Okay, so the first option we mentioned, and basically the simplest thing to do, is modulo hashing. To see where this might not be the best idea, we have a simple example: we take all dates from the 1st of January 1950 to the very last day of 1999 and encode them in a simple format in a 32-bit unsigned integer. We use the most significant 16 bits for the year and then 8 bits each for month and day. So for the 3rd of February 1951, the rightmost 8 bits would be the 3, the encoded day,
Starting point is 00:08:48 the next 8 bits are the 2 for the month, and the year is encoded in the 16 most significant bits. A relatively simple encoding. Now assume modulo hashing, say modulo 2^8, so a hash table with 256 slots. In the first line of the table on the slide you see the problem: if we take those lowest 8 bits,
Starting point is 00:09:54 which are the bits for the day, we will never see more than 31 distinct values. So the hash table has 256 slots, buckets where we can put our values, but we are never going to use more than 31 of those, because with modulo on this date representation, there will never be more than 31 different values in the lower 8 bits.
Starting point is 00:10:27 That leads to a bad average bucket size. For all these 50 years we have roughly 18,000 dates, so the buckets that do have values contain almost 600 values on average. With a really nice distribution over this relatively small table, we would only have about 70 values in each bucket. You see, we don't fully use our hash table, and we get much larger buckets: to find a value we have to check almost 600 values on average instead of only 70. And this continues for increasing table sizes: with a ninth bit you actually gain a little, because now you have one bit of the month.
Starting point is 00:11:28 That bit alternates between one and zero for even and odd months, so it helps a little, but then again we run into the next bits, the upper bits of the month, where we don't have any values at all. So if we use modulo here, we waste a lot of the potential of our hash table.
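A small C++ sketch of this encoding and its effect; the simplified 31-day months and the 256-bucket size (the first line of the slide's table) are assumptions matching the rough numbers in the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Year in the upper 16 bits, month and day in 8 bits each.
// February 3rd, 1951 becomes (1951 << 16) | (2 << 8) | 3.
uint32_t encode_date(uint32_t year, uint32_t month, uint32_t day) {
    return (year << 16) | (month << 8) | day;
}

int main() {
    std::set<uint32_t> used;  // which of the 256 buckets get any value
    int dates = 0;
    for (uint32_t y = 1950; y < 2000; ++y)
        for (uint32_t m = 1; m <= 12; ++m)
            for (uint32_t d = 1; d <= 31; ++d) {  // ignoring month lengths
                used.insert(encode_date(y, m, d) % 256);
                ++dates;
            }
    // Prints "31 of 256": the low 8 bits only ever hold the day (1..31),
    // so 225 buckets stay empty and ~600 dates pile up in each used one.
    std::printf("%zu of 256 buckets used for %d dates\n", used.size(), dates);
}
```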
Starting point is 00:11:59 And yeah. So which is obviously problematic. So if we talk about performance, ideally what the purpose of our hash table is, or what the goal of our hash table is, is that we want to have a hash function that gives us a position in the hash table, and then a bucket,
Starting point is 00:12:17 and in this bucket there's ideally just a single value. So in the end we'll never have a hash table, or probably never have a hash table without any collisions. So there might be a couple of values, but we don't want to end up with many, many values, especially if we go all the way up here to the bottom of this table. We have many different values now if we use model-only hashing, but we still have a long chain.
Starting point is 00:12:41 So while we actually have a table that is large enough to have enough values for each potential value, enough buckets for each potential value, in average, we still have chains of the length of seven. That's instead of one or something near one. So this is a simple example where you can see modular hashing might work in some cases, but you have to be really careful how your data looks if there's enough entrepreneur data. Something else that is widely used is
Starting point is 00:13:15 MOMO hashing, murmur hashing. Initially the first version of murmur hashing was using multiply and rotate. So this is where the name comes from. This is M, U, and R for rotate. And since this has been done in the loop, this is where the murmur, so this is murmur hashing. And the current version, which is version three, no longer, or is a little bit more complicated with shifting and XR, but in general,
Starting point is 00:13:44 it's just a quite good hash function. And I forgot to mention that, sorry, but when we talk about hash functions for database systems or data processing, this is different from hashing for cryptography and so on, right? So this should be quite clear that we don't care about certain constraints or certain guarantees that we would like to have if we want to, if we're coming from a security side of things, we are here just in performance side. We want to have good distribution of data
Starting point is 00:14:19 and many of the properties that cryptographically good hash functions have, we don't care about them here in this case. So, murmur is much more complex than the previous example of modal hashing, but gives us quite nice properties. So, if we can see here, so if we have now a hash table again, 256 buckets here,
Starting point is 00:14:41 and now if we distribute our dates over all the buckets, we can now see that all of a sudden, basically, we have an average. We have a length for the buckets that are filled. We have an average length of 70, which is exactly that what we are hoping for. So this is quite good, an average. Since we fill all the buckets, an average
Starting point is 00:15:04 needs to be the number actually, but it helps us a lot here to distribute the data as we want to. And all the way up here where we basically hope that there is no collision at all because we have much more values. Remember, 18,000 values are there, dates. And we have now a hash table of over 500,000 buckets. And so we are
Starting point is 00:15:27 almost there where we want to be that we have only a single value in each bucket so single value in each filled bucket and we are almost there so this is a quite good hash function for this example at least. Okay so the next thing, if we have no hash table let's say we have data size n so we want to hash n things and we have a hash domain m. So what we actually would like to see what are the properties of good that we want to be good at with our hash function. There are three properties. There's uniformity, and this is basically
Starting point is 00:16:12 the length of the collision chain. So if our hash table is too small, we have too few buckets, there will be collisions. There will be multiple items per bucket, but ideally this will be relatively evenly distributed over all the chains so they don't have one chain where all the buckets are in but we have with the equal uniform distribution of the change so we want to have uniformity we want to have universeality. That means that probability of a collision is one by m.
Starting point is 00:16:47 I think that should be quite clear. So if we, yeah, there will be cases where we can't avoid collisions, but that should be relatively low, this as low as possible, this probability. And of course we want to be efficient. And now when we do use the hash function for our hashing, for our hash table, the basic means, okay, first we have to map the hash value, map a value to a hash value.
Starting point is 00:17:13 Then we use this hash value h in this case to find the position in our hash table. And this mapping is often called a finalization. So often you have to, for example, your hash might be a 64-bit integer. So the hash table is usually smaller. So now the question is how can we determine the bucket, the actual bucket of our table with a much smaller size? And this can be some shifting. If it is a power of 2, it's probably the most efficient thing or something else, depending on how many buckets you have in your hash table.
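For reference, here is the 64-bit finalizer ("fmix64") from MurmurHash3, which is only murmur's final mixing step but is often used on its own for integer keys, together with a power-of-two finalization. The fmix64 constants are the published ones; `to_bucket` is a name made up for this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// fmix64, the finalizer of MurmurHash3: alternating shift-XORs and
// multiplications with fixed constants mix all input bits into all
// output bits.
uint64_t murmur3_fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xFF51AFD7ED558CCDull;
    k ^= k >> 33;
    k *= 0xC4CEB9FE1A85EC53ull;
    k ^= k >> 33;
    return k;
}

// Finalization in the sense above: map the 64-bit hash value to a bucket.
// With a power-of-two table size this is a single mask.
size_t to_bucket(uint64_t hash, size_t num_buckets_pow2) {
    return hash & (num_buckets_pow2 - 1);
}
```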
Starting point is 00:17:52 Typical trade-offs here are hash quality versus performance. There are very fast hash functions, and you usually have the typical trade-off: the faster a hash function is, the worse its properties are. In practice, most people use those multiply-shift style hash functions, because their properties are quite good while they are still fast and efficient on modern CPUs. Apart from the hash function, the next question is: given the hash table, how do we organize it? How do we store our data in the hash table? There are
Starting point is 00:18:36 two main schemes out there that are basically used by all the pretty much for all the data structures out there, hash tables out there. One thing is chaining. Sorry. So we have a list of buckets for each key. So you have, yeah, you have a list of buckets, and then let's say your bucket is basically full.
Starting point is 00:18:58 So what you would do is you would chain your buckets here. So you have, let's say, a bucket of eight slots. Now for some reason your bucket is full, you have more than eight items mapped to this bucket. So you would basically say, okay, there's another bucket, you would chain that, so like a linked list of multiple buckets, and then there would be another bucket somewhere in memory. So this would be like a linked list, not size one, but a little bit larger usually, and you would chain those buckets. Another idea is open addressing. And what we do here is that you have a large consecutive array, just yeah, a large array, and then you would just simply offset your bucket, basically
Starting point is 00:19:44 it's just the offset into this array, where you would try simply offset your bucket, basically it's just the offset into this array where you would try to insert your value. And one idea how to, one way to work with that would be, for example, linear probing. And this means you first go to, for example, if you want to insert that, you would go to this bucket,
Starting point is 00:20:02 to the offset of your bucket, and then you would check, okay, can I insert here? Is this slot free? If not, I will continue and continue and continue. So this might mean that you end up at a position that is actually within the bucket of another value, hash value. But this is basically how you would enter it.
Starting point is 00:20:20 So you might end up in another bucket. And same if you're now looking for your key, you would start at the position, at your initial offset that the hash function gives you, and then you would probe and continuously probe until you find either your value or you find an empty slot and then you know, okay, my value is not in the hash table right now
Starting point is 00:20:41 that I'm looking for. The good thing here is that, because it is the operation operation once you have your offset and you look for your value this is a sequential operation so we have all the prefetching we have cache locality this is quite efficient on modern CPUs but you also generate clusters so if you have a poor hash function you might have a few clusters that are really full. So we have to look quite a while. So we have to scan a lot. And you end up pretty far away because there's a large cluster when you're looking for your value. And for that reason, there's also another alternative that is called quadratic probing.
Starting point is 00:21:21 And there we would not simply, in case the first slot is filled, we will not check or linearly go to the next slots. We would increase our step width quadratically. So we basically check plus one, and then we check plus four, plus nine, plus 16, and we would try to distribute the values so we don't have this clustering effect in case we have pretty bad hash functions.
Starting point is 00:21:46 And another idea that often works together with linear probing is Robin Hood hashing. And the name comes from the idea that Robin Hood, as you probably know, is somebody who tries to be always fair, wants to ensure that all the people are equal, and obviously has an unfair advantage. And the idea here is, for example, let's say you end up in a situation where some values of some buckets are pretty distant, because right in between there are values of other buckets.
Starting point is 00:22:18 Because you have linear probing, you have clustering, so some value might not be in this actual bucket, but might be far off because other buckets are filled that are right in between. So now what Robin Hood hashing would do is to say, okay, I want to avoid that some searches for some hash values are much slower because they have to go search quite a while, quite some large distances, so they would swap. When they recognize such a situation,
Starting point is 00:22:48 they would say, okay, I would try to pull this swap, this value that is far off from its actual bucket, and change this with a later bucket, with the value of a later bucket, so that both are more, like the distances that they are usually probing are more equal, more uniform. Another hash table or often used hash table form is,
Starting point is 00:23:22 form of hash table is implemented is cuckoo hashing is implemented, is cuckoo hashing. And the name of cuckoo hashing comes from this bird, the cuckoo, and probably know that there's this, yeah. The thing is that they sometimes leg the eggs into a different nest, or yeah, not in its own nest, and kick out other eggs, right? So, and this is exactly what we do here in cuckoo hashing. So basically when we try, we have two hash tables
Starting point is 00:23:49 or multiple hash tables and each hash table has different hash function that you use to find your bucket. And now if you want to insert into our hash table, then what you should do is you go to hash table one and then check, okay, can I put it in here? If the slot is free, okay, just do it, fine. If the slot is not free,
Starting point is 00:24:10 you're kicking out the current value. This value goes to the other hash table and I put my value now into the first hash table. If the second value will be at a position that is also taken, you can come, like, this can be problematic, right? So you can have recursion in there, it can even have loops in there.
Starting point is 00:24:34 If at some point you come back to your first table and you are at the first table, so the kicked out value comes back into the first table and you're again kicking out your own value. So in some point you might have a recursion in there. And this is then the case where cuckoo hashing would just say, okay, I need to resize my table. I need to have other hash functions
Starting point is 00:24:54 and other sizes to continue. But in general, since we are, the chance that we have such recursions is pretty low. We have good hash functions,, we have good hash functions, assuming we have good hash functions, cuckoo hashing works quite well. But also means for lookups, you need to check both tables, unless you find your value in the very first beginning,
Starting point is 00:25:16 and delete the same. We can shortly go through an example here. So we have now T1 and T2 and for the first hash table we have a simple modulo 8 hash function. So we just take the value modulo 8 and then we know which bucket to use. And for the second table we use we first multiply by 3 and then take modulo 8. Now if we're going to insert the 1, we take the 1 modulo 8, which is 1. Both hash tables here are 0 indexed.
Starting point is 00:25:51 So we see that the first slot is not taken, or the slot with index 1 is not taken, so we can put our value 1 in there. Then we take the 4. Again, we take 4 modulo 8, which is 4. Take the fifth slot, index 4, and take 4 modulo 8, which is 4. Take the fifth slot, index 4, and put it in there because nothing is in there. We do the same with the 8. 8 modulo 8 is 0. So we put it in the first slot with index 0.
Starting point is 00:26:21 And now we want to insert the value 0. So in this case, we would now see, OK, I can't add it to the to the slot because there is already a value in there. And now this cuckoo thing happens that I basically say, okay, I'm going to kick out the other egg from the nest basically. So I'm going to kick out the 8. Where does the 8 go? The 8 is multiplied by 3, so it's 24. Modulo 8 is again 0, so not the best hash function here. We put the 8 to the first slot of the second hash table, and the 0 will take its place in the first hash table. For the 9, we do the same, and the one goes there.
Starting point is 00:27:08 Any questions so far for cuckoo hashing or hash tables, hash functions? Okay. I guess we're gonna do a break after the beat. Yeah, we're gonna do the break later on. So quick recap, we already talked about beat trees a couple of weeks ago. You probably know most of them,
Starting point is 00:27:37 but usually when you have a degree of D for this beat tree, we have at most two D keys. We try to be balanced, so all keys should have the same length, and so on and so forth. And you have, like if you have K keys in a node, you have K plus one pointers. So one pointer for the overlap, and the other pointers that give you
Starting point is 00:27:59 for all the values smaller than one value, the pointer. And B-trees just work quite well and still continue to work very well for disk-based database or for block-based databases. If you have a database that is block-based, p-trees are still the most widely used data structure, indexing structure there, but they are not necessarily the most efficient thing
Starting point is 00:28:24 to do in main memory. And for that reason there is a data structure on improvement called the cache sensitive B plus tree. So it's called the CSB plus tree. The main idea here is, or the main observation why the authors came up with this idea here is that in a B-tree, like if you have, imagine you have, let's say you have this simple node here and here you have four values, those might be 4K, four integers, for byte integers, so 16 bytes for those integers. And then you would have five pointers here,
Starting point is 00:29:10 which in most systems nowadays are 8 byte. So you have a lot of data is wasted, or not wasted, but you use a lot of space for just keeping around those pointers. And then pointer chasing also might incur some random access latencies because you probably won't or maybe you don't have the next node already in cache. So if this data structure is in my memory you don't utilize the cache as good as you might maybe could and you have a lot of random accesses to addresses that you might have to look up so there will be TLB misses and so on and so the idea here
Starting point is 00:29:50 for the CSB tree, the CSB plus tree is that we only store one pointer so you still might have like your node with the four split values, but you only store one pointer and offsets then to the other nodes. And the nodes, so basically the nodes here below, so on the third level here, so these nodes are no longer stored separately, like on different pages for example. The P-tree would usually try to have different pages, but here now for the main memory3 would usually try to have different pages. But here, now for the main memory case, we would try to have them allocated on a very dense array. So we put them together, and we have only one pointer here,
Starting point is 00:30:37 and then the rest would be just, basically would be internally stored here as just offset. So if we are, maybe here we think, okay, now we are smaller than the second value here in the middle node then we would have an offset into this group where we basically store multiple nodes there. And offsets usually those offsets can be much smaller so much less bytes or bits even a few bits might be required to store these offsets into those little node groups here. So we can store, we have less pointers to store, only offsets.
Starting point is 00:31:12 And we have here a dense array that is consecutively stored. So that might be better for caching and for reading the values in order sequentially. Okay, so then, as always, there are different trade-offs or different versions of that, and also there are different alternatives. Most of them all come from the same paper, but this was basically the CSB Trust was the successor to the CSS, the cache sensitive search tree. And this, but this has, this tree was mainly geared towards read-only databases. So they didn't really care about what happens with updates. And what they did was just, okay, this is a read-only version, and this was very dense.
Starting point is 00:32:04 So they tried to have a single array where they can just jump into they didn't need any pointers they could use only offsets and try to have all the nodes are sized so that each node when you jump you have only one cache line so we don't like good for cpu for cpu processing that all the nodes are cache line sized and this thing was really not working if you have any updates. But it was super efficient and cache efficient for read-only workloads. Then again you have to, you could basically say okay maybe you don't want only a single cache line but about multiple cache lines, should be would have been larger the key, the less levels you have, but you also, yeah, typical trade off between like how large
Starting point is 00:32:56 your larger node sizes might not be one or might be more cache lines to read, but I have fewer levels in my index that I need to traverse to find the actual values. And then where is the full csb tree? This is an alternative that the office of the csb plus tree proposed and here we have also one array but we don't fill up all the arrays. So we leave some space similar to the CSB tree. But so there's a little bit pre-over allocation so that we still have some space for inserts. And this is almost the same performance as the CSS plus tree that we talked about, but it still allows inserts, but only to a certain degree.
Starting point is 00:33:54 So if there are splits happening, the full CSV tree, there needs to be, since we have only this one allocation, we have to rewrite the entire tree. The CSB tree in this case, we also over-allocate, but we still have multiple leaves, so we don't have to reconstruct the entire tree. So depending on what you want to do, there might be different alternatives that are better. Next thing also by the same author, so they liked proposing alternatives. The CSB plus tree, there was another alternative
Starting point is 00:34:26 with segments. So now all the notes, the child notes of one note are basically put into one or three segments instead of having one large segment. Again, simple idea to somewhat account for changes. And now splits are cheaper, but also reading is less efficient, because you might have to read a couple of segments.
Starting point is 00:34:51 And performance-wise, obviously, or hopefully obviously, the one single array that is perfectly geared towards reading and doesn't account for any updates, so it just can't update it, has the best search performance, the CSS tree. Then we have the CS plus tree, then the segmented tree and the B plus tree.
Starting point is 00:35:13 But if we look at insertion, then it's the other way around, right? So we have the B plus tree, which is pretty much similar to the segmented CSB plus tree. And then we have the standard cache sensitive B plus tree, and then the worst insertion performance because we need to rewrite everything as the cache sensitive search tree. Yeah. So the conclusion should be clear, right? So depending on what your workload is,
Starting point is 00:35:48 how many inserts do you have, how many modifications, depending on that, you have to choose, you can choose different alternatives here. And the next part will be about tries, and I think we're gonna take a five-minute break and then we continue with the third topic which will be twice and the adaptive relics tree. The next part will be mainly so the adaptive radix tree is a radix tree, and the radix trees belong to the class of tris. What are tris?
Starting point is 00:36:32 So tris are K research trees, so we have still trees, and we might have K elements per node, so nodes can be also large. And the name initially comes from retrieval. For some reason it's called trice now, maybe to differentiate it from trees. But actually it comes also from the tree in retrieval. They're also known as prefixed trees and prefixed trees probably the best name because that already gives away
Starting point is 00:37:01 pretty good the idea of tr tries. And there's a simple example here of such a try for strings. In this case we have the words and, aren't, any, are and aren't. And now what we simply do here, for example, have a take each note, each note level or each tree level, we take now one letter of our four words that we want to index and then this will be used for the key. So you can see here for all the words we all share the word A. So now we have basically the first node, so all the words start with an A, so okay, sure, we only have As,
Starting point is 00:37:48 so we go to the next node, and then basically we say, okay, what is the next letter of our words? Let's say we are searching for the word any, then what means when we have an N? So we go to the node that contains all the words with A and N, and then on the very bottom here we have the third
Starting point is 00:38:07 letter and now we can say okay is it is it a D, T or Y. If we look for any we basically go to the follow the pointer to the Y. Now we have found our value any and that would also mean that you don't have to store in this case the actual key because this is implicitly you just have the key in because you when you traverse back to the top you have your entire key this is a very simple structure but can also be problematic in some cases so first of, now the tree height does no longer depend on n, right? So for for B trees and other trees that are comparison based, usually the tree size depends on large nodes are and how many values you have.
Starting point is 00:38:57 Here it solely depends on the length of the key. So in this case the length of the words that you have. We don't have to rebalance for that So in this case, the length of the words that you have. We don't have the rebalance for that. With this structure, we don't have the rebalance and we have a lexicographic order, which is quite nice if you traverse it. And again, so keys are stored implicitly and we can reconstruct them entirely from the path that we take from the root to the leaf node.
Starting point is 00:39:26 And, so first, one question here, or one problem here is, let's say we have the alphabet, so we would say the first node where we have the A in, and we want to also be able to store other words in here, or only uppercase, that would mean we would already have 26, I think, slots here in case we have other words that we want to insert into this tree. And for each node, we would have to have 26 buckets
Starting point is 00:39:58 to have all the keys in there and to be able to index them. For this example here, we would waste a lot of space. So either we can't have them all pre-allocated or we need to find some way to do that efficiently. Here for the second level you would only use n and r of all the 26 characters. And now if you imagine you would do that this way character based for Unicode this would simply not possible. So you need to come up with some other way there and one way to deal with that is to use a Radix tree. And here we do only take the binary representation of whatever your key is. So what you can for example do
Starting point is 00:40:48 here is take each bit. So each bit, it says one is a bit one or one or zero. So then this is a binary tree. The RedX tree would be like in a binary tree form. Or you can also have a larger amount of bits. The number of bits sometimes called a span. And here you can see for example larger amount of bits. The number of bits sometimes called a span. And here you can see, for example, let's say we take of our longer word, we cut it or we split it into parts of three bits each. That means that each node,
Starting point is 00:41:17 we can represent eight values, three bits. So we would have eight children per node. Four as one, again, right, this is a binary tree, so we only have two. This also determines how deep our tree is. For the binary key, it's basically the longest possible way, so one level per bit and the larger the span is the lower the height of our index. Yeah, so this the span size determines the height and also not only also again the key length but the key length is still the most crucial thing here for WIDEX trees because you store the entire key in the levels of the tree. And if you compare that to a
Starting point is 00:42:16 balanced binary tree, it's quite interesting to see the difference here. So if you have a number of keys is growing for the binary search tree, this starts being smaller, but then as this tree grows with the tree height grows with the number of elements in it, we see that it grows and grows and grows while this is static for the Radix tree depending on which depending on the K and on the span size again but you see here the number of elements the height does not the height doesn't change just because we have more elements but as you've already seen in the example with the art,
Starting point is 00:43:12 with this example with any art and end, is that you might have a rather large space consumption if you do not have a good fill level of your node. So even if you don't use characters, as in the previous example, and you use now the Radix tree with the bits, you might still have a lot of bit combinations that are not used.
Starting point is 00:43:37 But you would still, if you have your, in this span of three, you would have your eight possible variations, but only few might be used so this node is not splash efficient so a lot of a lot of values wouldn't be used and so you have to find an S that is basically balances this and the larger S the better as a lookup performance because you have less jumps you have a tree is not as high so you have only few interactions but with a smaller S you have a better space consumption, less wastage but you have a higher tree so
Starting point is 00:44:17 you usually get slower. And one idea or one alternative here and this is the alternative that we are going to talk about for the rest of the lecture, is that we have different node sizes. So we don't have to have the same node size for all the levels. And we have adaptive node sizes. And this is where the name comes from. This is then the adaptive radix tree also ART index and in the example below here you see that the middle level has a very small node so there are only few elements in there. Here you have only two children. Then on the bottom level we have larger nodes so in the right case there you
Starting point is 00:45:05 see there I think eight so children and so they have different node sizes so how does that work the adaptive Redux tree has a fixed span size and we split each value that we have each value that the index and the art, we split that into its bytes. So we have span size of 8 bit. So we have 256 values per node. Well, I probably don't have to go through this example, but I guess I can. So if you have this image of 218 million something,
Starting point is 00:45:47 something here, you would have this representation, bit representation, I hope you can read it. And now if we split it up, this number into a single bits, and only take, represent the bits alone, we would here have the value of 13, the value of 2 here for the second byte, so to say, 9, and here you can see it's all 1s, so we have 255. This will be basically if we split this 32-bit integer. And what the ART does is, we have a high fan out so we have
Starting point is 00:46:28 8-bit that means we can have a fan out for 256 but we have four physical different node types that we store or that we can store and the main purpose is to save space. So we have this high fan out in case it's fully used, then we have still a very flat tree, so a fast tree. But in case we have sparse values, we don't waste that much space. We can then use other node types. And what the YDX tree also does,
Starting point is 00:47:00 we'll later see it a bit more in an example, is that we have path compression. In the beginning shown, if there is something, if you have nodes that are all having the same letter, like for example, we would only have and and any, and so basically all have, the second node would always be the n. We don't have to store those nodes. We can basically compress them.
Starting point is 00:47:25 If there's only one for each, on one level there's only always one child, we can compress the levels. Furthermore, all the nodes are cache aligned, so they're more efficient on modern CPUs, and there's also the possibility to use Zimdi when processing the ART index. And here you see an overview of the four nodes that we have in an ART index. And here you see an overview of the four nodes that we have in an ART index. There's the node 4, the node 16, node 48, and the node 256. The first one is the smallest node. So this is a node that is used when you only have two to four child nodes. I say two because if you have one there comes this node compression will then be used or might
Starting point is 00:48:09 be used. But this is a node that you can use if you have only four child nodes. And how would we store that? In this case we have a fixed array of four values. Let me take the other pointer. Yeah. So we have those four values stored here. And this is not consecutive, right? So this can't be consecutive because you still have the entire domain of whatever this bit might store.
Starting point is 00:48:40 But if you have only four values, you can basically now do a linear... If it's for four, you probably wouldn't do linear search, you would just search, basically, is my value here? So if my byte value is three, then I know, okay, I find the three here, I iterate over the values, find the three here, and then I know
Starting point is 00:48:57 I have to use the pointer at the third position here. So the node four, basically, is four times times the byte and then 4 times the pointers to find the child node for this particular byte. And then if you have more values, the next node size would be node 16. Here you have now 16 potential values in there that you can use. And again you have 16 pointers. And for 16 there might be a binary search already better than linear scanning it. Depending on, you can also use Zimd here to find your value. So different ways, different things are possible. This is node 16.
Starting point is 00:49:44 And then in node 48, 48 is a little bit different. So what we do here is that we no longer have only 48 bytes. Because actually the difference to storing 256 bytes is not that large if it's only bytes, right? So we have, we already store 256 bytes, but only 48 pointers, because the pointers are much more expensive, right? So we compare eight byte versus one byte, so points are more expensive. So what we do here is we have already 256 slots for the values, so many of them might not be used, and I don't have to search, binary search or linear search, I can just say, okay, if my value is 17, then I go to the 17, if it's set, then I go to check where the pointer is to it, and then get the pointer. I don't have to search for the 17,
Starting point is 00:50:41 basically, I can directly jump to it, which is a little bit faster and also allows for easier inserts. And for that reason, this is no longer sorted. So the points are no longer sorted because maybe the value 17 came before the value or was inserted before the value 2. So that's not necessarily sorted. And the last node type is the largest node type. as there's no compression at all. And this is the node 256. And here we don't have to have those bits anymore stored because we can directly go with our bit representation of this, but we can directly go to the pointer
Starting point is 00:51:19 that we need to, and then have our value or the next children. Yeah, and per node, you always have to add up 16 bytes. And this basically tells you the marker, what kind of node I know expecting. So how large is this node that I have now at this point address? It says how many children are there right now. So if I have a node 16,
Starting point is 00:51:45 only 13 slots might be used up. And there's also a compressed path. You will see that later what that means. And this is always 16 bytes. The standard ART index is only a unique index, so it doesn't store multiple values, multiple, for example, positions per value. There are different, so this is the easiest thing, is just single value leaves. So if you have just one value, then a leaf is the value. And if you have now multiple values per leaf, you have to now, again, think of how could you store them, can you store them, you don't want to have a single allocation for every leaf value or multiple leaf values, so how can you do that and how can that be again then be efficient if you have updates
Starting point is 00:52:36 and inserts and so on. And another alternative that you also can do is, believe that you don't store the pointer in the very last level, if your value fits or is the same size or smaller than the pointer, so 8 bits, bytes, then you can also store directly the value in there. You don't have to store a pointer that directs somewhere else. Yeah. Here now, yeah, here we come back to the example. So we have this 218 million value number that we are now trying to find,
Starting point is 00:53:15 or we try to find the, this is the value that they're looking for. So the secret query says where ID column is 218 million something. And now we want to find the top, so which row in our database is this value? Right, so typical index lookup. So what we do know for the art is we take our value,
Starting point is 00:53:35 take the byte representation, so we split it into bytes. We have bytes 13, two, nine, 255. And now what we do here is in the first level we find a node 4 here. We have the 13 so we don't know how they are stored. We have to look here. We can't directly jump to a certain offset. We have to look for the 13. If it is there, okay, we find the 13 in the first slot. So we know we have to take the first pointer and check the children node, child node, that is found under this pointer.
Starting point is 00:54:14 So now here, looking at the header information for this node, we know now, okay, this is a node 48. So for 48, we know this is already, we no longer have to search. We can directly offset into it. For now our second byte which stores the value 2 we would thus go to the index 2. Here we have an offset again. So we have a small offset. I think it's one byte only. So we can offset directly into, in this example, this offset will be at position zero. And again here we find the pointer to our next node. And then for this
Starting point is 00:54:52 node the header already gives us a way, like this is node 16. We have now, our value now has the value nine. Node 16 is, here we have to search the values, potentially all the 16 values to look for the 9. Again we find the pointer. And then in the very last level here we could store directly our pointer or the tuple identifier, whatever our system needs now. But now we would be here with the 255. We know, okay, we have to go to the very last position into this in this node then here we have directly our t id the tuple identifier and we're done oh now we can fetch the page whatever we need to do to fetch the tuple at the id at this given id here
Starting point is 00:55:42 any questions so far to how the art works? Okay. Yeah, so long keys result in a large height, right? So this is a typical problem. So what can be done to reduce the height? And here you see those path compression tricks that we basically had, that we briefly talked about. For the left part of the tree you can see here that there is, there are only the words B-A-R and B-A-Z. So this node for A doesn't
Starting point is 00:56:22 do anything because there's only one child. So what is possible here is to compress this. So we don't have to jump to another node, fetch this node from my memory, only to see that there's not much in it. There's only the next pointer, which is always the same in this node. So we could basically compress this. And then this path compression is then stored into the child nodes they basically say okay my compressed path is this and that because in some way need to know that the a has basically been compressed away but if you look for a b o o like
Starting point is 00:56:59 you would still you could otherwise still end up at the wrong path. So you need to somewhere store what has been compressed. And this is part of the header information per node. And yeah, so the left-out part is stored in the nodes. And the other thing is lazy expansion. So we would try to expand the nodes. So if we have here this foo, this might be compressed. Now, if there is another value in there, we would try to expand and create to actually increase the height
Starting point is 00:57:34 by expanding it as late as possible. Only when we really need to expand it, then we would start doing that. So for the second O here, if the new word would be fob, then we don't have to expand. So for the second O here, if the new word would be a fob, then we don't have to expand, right? I do the latest possible, for example, if we have, yeah, as late as possible. I think I get the idea, right?
Starting point is 00:57:54 And this is somewhat fixed, so we can't have an arbitrary long path compression. This is fixed to eight bytes. And if this is exceeded, we need to store the complete key either in the leaf, or we just say, or you have to store some information that you need to check for the actual value in the TID. For example, you can still have the TID, but then you need to know, okay, I'm not really sure if that's the correct value because there was so much
Starting point is 00:58:23 compression, then you need to check for the TID and then check there, okay, is this the actual value that I was looking for? Here you see an overview. This is directly from the paper, which we have linked before the paper in case you want to look at it. And here you see basically why this is such an interesting idea, this art index, is that we compare, or this graph on the paper compares the space consumption and the tree height. So tree height is basically exactly, this is basically the performance part, right?
Starting point is 00:59:00 The lower the tree or the less high, the faster we are, the less jumps we have. And space consumption is always something we always want to optimize. The smaller the tree, like using less space is the one that's always good. On the other hand, it's always also good for performance because maybe the entire tree can be in main memory. And what you can see here is that the art for, so what data set is that we have 1 million uniformly distributed 32-bit integers. And so you can like, you have four million different values, right?
Starting point is 00:59:41 Four billion different values for 32-bit integers. So if we now have only 1 million, you see there's a lot, there's not all the nodes will be fully, fully used. So now depending on, so we have a lot of space wasted, so to say, for large spans. And this, this is what we see is shown here. If we have the bits, the size of one, the span size of 1, we have the binary tree, which has fully all the 32 levels that we have, because per bit we have no one level. So this has quite good space consumption, but also we have to traverse a lot in this tree if you want to find our value. But if we have larger spans, so we have very small trees or small height, those trees, since they are really sparse or the data set is really sparse, they will waste a lot.
Starting point is 01:00:38 This is logarithmic scale, they waste a lot of space. And the art index is, yeah, really excellent trade off here, here. So it's a little bit or has something like the height of a span size of eight, but only the space consumption of a span size of three in this case. So it's a good trade-off or nicely balances the height and the space consumption here. And height means basically the lookup performance. And in this graph here, the authors compared the art for two datasets,
Starting point is 01:01:17 for sparse and dense. So dense dataset basically means pretty much all the values exist. Like if I have a test set of one million values, then every value exists, like increasing key or something. And dense means that a lot of values don't exist. And you see here that for both data sets, the ARC performs really well.
Starting point is 01:01:37 And the only thing that comes near is a hash table, in this case a chained hash table. So the ARC performs quite well for lookup performance compared to other prefix trees. We also see here the CSB plus tree, so the cache-sensitive B plus tree, and other alternatives. And for inside performance, it's even better, so to say. So we are faster than a hash table. Hash tables are not the fastest thing to insert usually. And they are just outperforming.
Starting point is 01:02:14 Mostly for dense operations, dense inserts, or no, dense data sets. But even for sparse data sets, it's still the fastest data structure here. And if you have bulk insert, it's even better. So there's the chance to bulk insert also in the art. Okay, so that was it mainly for today. So in summary, oh no, before we talk about the summary, any questions so far about the adaptive relics tree?
Starting point is 01:02:50 Okay, so what did we talk about today? We talked about hashing, what hash functions, how you typically have this trade-off between quality and performance of the hash function. We talked about the hash table organization, so there are different strategies to implement the hash function. We talked about the hash table organization. So there are different strategies to implement the hash table. This might be open addressing or chaining. We talked about trees. We have the B-tree, which is optimized for block-based storage. So it's very
Starting point is 01:03:18 good for hard disks and solid state drives. And then there is the cache sensitive B plus tree, which is optimized for main memory computing and to have a better data layout that is more cache friendly and better for the CPU to process. And then last but not least, we talked about TRIs, so the Radix tree, and mostly the,
Starting point is 01:03:46 especially the adaptive Radix tree with differently sized node types for Radix trees. Okay, so one question. I hope somebody helps me to answer that today. So what we've seen, given the performance, so let's say we have a sparse data set, so performance to the hash table and the art is somewhat comparable, right?
Starting point is 01:04:18 Then lookup performance is also comparable to the hash table. Our hash table is actually really good, and this is a ch the hash table. Or the hash table is actually really good. And this is a chain hash table is not the thing you would usually do for high performance. So there's even better hash tables. Why, what would be a reason to use the art? Do you have an idea why I would maybe still use the art, even though a hash table is often simpler to implement?
Starting point is 01:04:48 Right, yeah, it's simpler to implement, and it's also performing quite well. So why would I use, what would be an idea why I would use the hard index? Because the runtime of a hash table is stochastic. Sorry? The runtime of a hash table is stochastic. I have a bad worst case. If my initial data has a bad distribution, then I might get a. Yeah. That's one really good point. So we have also depending on the data, how the data looks, also the hash function, right? So how good this hash function distributes the values. So it's sometimes hard to say how a hash table works there, how well.
Starting point is 01:05:33 This is true. Then one other point is insertion for hash table. Insertions are really fast until one point when your hash table gets too full. And this basically means then you're going to, usually what most hash tables do, they double their size. So this is extremely, so you have fast insertions until one point and then you rewrite the entire hash table and rehash it usually. So in average, it's still quite nice inside performance. And if you know the size up front, that's also still nice. But if you don't know the size upfront, that's also so nice. But if you don't know the size, and then at one point,
Starting point is 01:06:07 you're going to hit this point where you have to resize the hash table and other tree structures are often better here because you can have smaller updates or smaller changes and you don't have to write this entire data structure again. And then, but there's one more thing for querying, where the hash table is, does not allow you something to do when you search for a key.
Starting point is 01:06:36 Do you have an idea what we can do? Mm-hmm. I think some kind of range queries could be difficult with the cache table, but supported for ART. Exactly. So that's exactly the point. So if you, let's say, search for a key, all the keys smaller than one million, you would basically know, OK, I get the byte representation, right, one million.
Starting point is 01:07:04 So there were a lot of, or let's say something very small, like all the keys smaller than 17. So there are a lot of zero bytes in the beginning. So I directly know, okay, where to go in my art. And then you go to leaves and then you can basically, the art gives you the chance to do such range predicates. Right, then you can easily, once you find your first value, then you can traverse the values of the leaves quite easily. For the hash table, range predicates are not possible. So in case very few exceptions where you just can enumerate all the values, then this might be possible.
Starting point is 01:07:40 But there is not something with hashing where you can say, give me all the buckets that I need to look at with values smaller than 17. Not possible, right? So this is something and range predicates often happen in database systems. So this is something where hash tables don't work and the art with a nice performance does that. Okay, So far. If you don't have any questions,
Starting point is 01:08:08 we are going to continue next week with, I think Marcel is going to present, talk about profiling. And then the second talk next week will be about multi-core parallelism. In case you don't have any questions, yeah, have a nice day. Thank you. Bye-bye.
