Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Structures
Episode Date: May 15, 2024
Transcript
Welcome back to Hardware Conscious Data Processing.
My name is still Martin.
Tilman is still gone this week to a conference,
so I'm going to talk about data structures today.
Before we start with the actual topic, two announcements.
There are only two weeks left for the exercise, so if you haven't started yet, please start. And the second thing: there will be another announcement or reminder on Moodle, but we need to move the lecture in two weeks to another room.
It will be up on the first floor, because we can't be here on May 29th.
Okay, so yesterday we talked about query execution models. After we learned how CPUs work internally, the question was: how can we execute data processing efficiently on such CPUs?
And today's topic is data structures.
So now the question is how can we construct data structures
in a way that we use the capabilities of modern CPUs?
Okay, so what are we going to talk about in this lecture?
The first thing is hashing, so hash maps and hash functions.
Afterwards, we are going to talk about trees.
So we already talked a couple of weeks ago about B-trees and other forms of B-trees.
Today we are going to talk mainly about the CSB plus tree.
And third will be tries,
which is another tree-formed data structure.
And we're going to talk about two forms of tries.
Or basically, there is one form, there's a radix tree,
and then a specialization of a radix tree
called the adaptive radix tree.
Okay, so when we talk about in-memory indexing,
there are basically these three forms
that we just mentioned.
There's hashing, comparison-based trees,
for example, the B-tree, and tries.
Those are the three main data structures
that we always use if we want to index
data in main memory. And you probably remember the asymptotics. So we have n keys of length k. If you have a hash table, we say lookup is O(1), right? So you calculate your hash function.
This hash function gives you a certain bucket.
You can determine the bucket from that hash result, and then you look into this bucket,
and there you usually find your key.
For search trees, this is logarithmic,
depending on how many nodes you have,
or how large your nodes are. You have some
form of logarithm here if you look for a certain value or look for a certain key. And in a
trie, it's similar to a hash function. What tries usually do is work with prefixes: we split up the key, and instead of comparing the whole value, per level in the tree we only look at parts of the key. We will later talk about that in more detail. But what is actually interesting,
or what you should keep in mind, is that those asymptotics are not always that simple, depending on how large your key is. Like, O(1) is totally fine if you just consider, for example, integers, which you usually do if you write a benchmark, as most research papers do, or if you do your textbook implementation of a hash table.
But there might be situations where the key is actually quite large and quite long.
Then your hash function is not that simple anymore; it might take some time. And once you have found your bucket in your hash table,
comparing the actual values in this bucket
to see if this is the actual value that you're looking for,
might also be more expensive in case you have long keys.
The same is true for comparison-based trees.
So at each level you have to compare,
okay, am I smaller or larger than this split value?
So comparing for a very long key can be quite expensive there.
It can actually be the most expensive thing in this case.
Typical examples where this can happen are group-by operations in the database. We talked about this briefly yesterday: if you have such group-bys and you group by multiple columns and use a hash structure for that, that might mean you have to hash all the keys that you group by.
And if you have six or seven group-by keys, this can be quite expensive.
And those keys can also be strings, right?
So having longer keys can be quite expensive.
Okay, so if we're talking about hash tables, there are certain degrees of freedom when we try to come up with a good or efficient hash table.
The first thing is the hash function: which hash function do we use? And the second thing is the hash table organization. We usually use the hash function to get to the final position in our hash table, but those are two distinct things: for different hash tables we can use different hash functions, so they are not necessarily directly connected.
And what are hash functions?
Simply said, we map one domain into our hash domain, which is a smaller domain.
Usually this hash domain is much smaller because we want to use it to find our buckets in the hash table.
So the hash function is basically just a mapping into the smaller domain.
And there are many different things that might make sense in some cases and might not work
that well in other cases.
A simple example is division, for example using modulo.
This is pretty much the simplest kind of hash function.
It can work quite well for some cases and it can be really bad as we will see in a minute.
Another thing is Fibonacci hashing, which is also often used because it's relatively fast; it is based on the golden ratio, which is related to the Fibonacci sequence.
Then, probably the most well-known or most often used are multiply-add-shift forms of hash functions, where you simply multiply by a certain constant.
Usually it's really tricky to find those constants, so they are like magic numbers that are used there.
You multiply by the constant, then maybe add something, and then you shift.
This is most often the kind of hash function that is used.
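As a small hedged sketch of what such a multiply-shift function can look like (the multiplier below is the golden-ratio constant often used for Fibonacci hashing; it is an illustrative choice, not necessarily the constant from the slides):

```cpp
#include <cstdint>

// Minimal multiply-shift hash sketch: multiply by a "magic" odd constant
// and keep the top bits of the product, which mix the most. The table is
// assumed to have 2^bits buckets, with bits in [1, 63].
inline uint64_t multiply_shift(uint64_t key, unsigned bits) {
    const uint64_t magic = 0x9E3779B97F4A7C15ULL;  // ~ 2^64 / golden ratio
    return (key * magic) >> (64 - bits);
}

// Example: multiply_shift(date_key, 8) gives a bucket index in [0, 255].
```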
And then we have also murmur hashing, which we will talk about, or tabulation.
Tabulation is actually quite an interesting thing; we are not going to talk about it in detail.
But it's a really good hash function with good properties.
And what you do here is that you split your key into certain parts, and for each part you have a separate table filled with random values.
For each part you look up a random number in its table.
So you get multiple numbers from multiple tables, and then you XOR them.
This is basically the idea of tabulation hashing, and it has some really nice properties.
It's not that often used, but there are some cases
where it has some advantages over other functions.
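A minimal sketch of that idea, assuming 32-bit keys split byte-wise into four parts (the part size, seed, and table layout here are illustrative choices, not from the lecture):

```cpp
#include <cstdint>
#include <random>

// Simple tabulation hashing: one random table per key byte, XOR the lookups.
struct TabulationHash {
    uint64_t table[4][256];

    TabulationHash() {
        std::mt19937_64 rng(42);  // fixed seed only for reproducibility here
        for (auto& part : table)
            for (auto& entry : part) entry = rng();
    }

    uint64_t operator()(uint32_t key) const {
        uint64_t h = 0;
        for (int i = 0; i < 4; ++i)
            h ^= table[i][(key >> (8 * i)) & 0xFF];  // one lookup per byte
        return h;
    }
};
```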
Okay, so the first thing we mentioned, and basically the simplest thing to do, is modulo hashing.
And to see where this might not be the best idea, we have a simple example here.
We have dates, all the dates from the 1st of January 1950 to the very last day of 1999.
And we encode this in a simple format
in a 32-bit unsigned integer.
We use the most significant 16 bits for the year, and then eight bits each for month and day.
So here you see an example.
For the 3rd of February 1951, the rightmost eight bits would be the 3, that's the day; the next eight bits are the 2 for the month; and the year is encoded in the 16 most significant bits.
So it's a relatively simple encoding. Now, that's quite a huge table, but let's look at it, assuming modulo hashing.
On the first line of the table we take the key modulo 2^8, so we use the lowest eight bits and potentially have 256 buckets.
so the least significant bits,
which are the bits for the day,
the problem here is that we only have,
we will never see more than 31 values.
So the hash table that we might have now,
we have 256 slots, buckets, where we can put our values into, but we're
never going to use more than 31 of those because they are simply, our hash
function doesn't, if we use model on this date representation, there's
never going to be more than 31 values in the lower eight bits.
That leads to rather large buckets. For all these 50 years we have roughly 18,000 values, so the buckets that actually contain values hold on average almost 600 values.
If we had a really nice distribution over this relatively small table, we would only have about 70 values in each bucket.
You see, we don't really use all the space in our hash table, and we have much larger buckets, so to find a value we have to check on average almost 600 values instead of only 70. And this continues: even with an increasing number
of bits, with a ninth bit you actually gain a little, because now you include one bit of the month.
That bit is a one or a zero, because we have even and odd months.
So this helps a little, but then again we run into the next group of bits, the upper bits of the month, where again we never see any values.
So if we use modulo here, we waste a lot of potential in our hash table, which is obviously problematic.
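To make the example concrete, here is a tiny sketch of the date encoding and the modulo pitfall described above (the field layout follows the slide; the function name is mine):

```cpp
#include <cstdint>

// Year in the upper 16 bits, month and day in one byte each.
inline uint32_t encode_date(uint32_t year, uint32_t month, uint32_t day) {
    return (year << 16) | (month << 8) | day;
}

// With 256 buckets and modulo hashing, the bucket is just the day byte,
// so at most 31 of the 256 buckets can ever be used:
//   encode_date(1951, 2, 3) % 256 == 3   // the day
// A better hash mixes all bits of the key before the bucket is computed.
```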
So if we talk about performance: ideally, the goal of our hash table is that the hash function gives us a position in the hash table, a bucket, and in this bucket there is ideally just a single value.
In practice we will probably never have a hash table without any collisions.
So there might be a couple of values per bucket, but we don't want to end up with many, many values, especially if we go all the way down to the bottom of this table.
There, with modulo hashing, we hit many different buckets, but we still have long chains.
So while we actually have a table that is large enough to have a bucket for each potential value, on average we still have chains of length seven, instead of one or something near one.
So this is a simple example where you can see that modulo hashing might work in some cases, but you have to be really careful about how your data looks and whether there is enough entropy in the data.
Something else that is widely used is murmur hashing.
The first version of murmur hashing used multiply and rotate; this is where the name comes from: "mu" for multiply and "r" for rotate, and since this is done repeatedly in a loop, you get "murmur".
The current version, version three, is a little bit more complicated, with shifting and XOR, but in general it's just a quite good hash function.
And I forgot to mention, sorry: when we talk about hash functions for database systems or data processing, this is different from hashing for cryptography, right?
It should be quite clear that we don't care about certain constraints or guarantees that we would like to have from a security point of view; here we only care about performance.
We want a good distribution of the data, and many of the properties that cryptographically good hash functions have we don't care about here.
So, murmur is much more complex than the previous example of modulo hashing, but gives us quite nice properties.
So if we look at the same experiment again, a hash table with 256 buckets, and now distribute our dates over all the buckets with murmur, we can see that all of a sudden the filled buckets have an average chain length of 70, which is exactly what we were hoping for.
Since we fill all the buckets, the average has to be that number, but it shows that murmur distributes the data the way we want.
And all the way down the table, where we basically hope for no collisions at all because we have far more buckets than values — remember, there are about 18,000 dates, and the hash table now has over 500,000 buckets — we are almost at the point where there is only a single value in each filled bucket.
So this is quite a good hash function, for this example at least.
Okay, so the next thing: let's say we have a data set of size n, so we want to hash n things, and we have a hash domain of size m.
What are the properties that we want our hash function to be good at?
There are three properties.
There's uniformity, and this is basically about the length of the collision chains.
If our hash table is too small, if we have too few buckets, there will be collisions; there will be multiple items per bucket. But ideally the items are distributed relatively evenly, so we don't have one chain that contains everything; we want a uniform distribution over the chains.
Then we want universality. That means the probability of a collision is 1/m.
I think that should be quite clear.
Yeah, there will be cases where we can't avoid collisions, but that probability should be as low as possible.
And of course we want to be efficient.
And when we use the hash function for our hash table, that basically means: first we map a value to a hash value.
Then we use this hash value h to find the position in our hash table, and this mapping is often called finalization.
For example, your hash might be a 64-bit integer, but the hash table is usually much smaller.
So the question is how we determine the actual bucket in a table of much smaller size.
This can be some shifting — if the table size is a power of 2, that's probably the most efficient thing — or something else, depending on how many buckets you have in your hash table.
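A hedged sketch of such a finalization step, assuming a power-of-two table size (whether you mask the low bits or shift down the high bits is an implementation choice, not prescribed here):

```cpp
#include <cstdint>

// Map a 64-bit hash value to a bucket index in a table with 2^k buckets.
// Masking keeps the low bits; shifting keeps the high bits, which is
// preferable when the low bits of the hash mix poorly.
inline uint64_t bucket_mask(uint64_t hash, uint64_t num_buckets) {
    return hash & (num_buckets - 1);        // num_buckets must be a power of two
}
inline uint64_t bucket_shift(uint64_t hash, unsigned bucket_bits) {
    return hash >> (64 - bucket_bits);      // table has 2^bucket_bits buckets
}
```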
The typical trade-off here is hash quality versus performance.
There are very fast hash functions, but usually the faster a hash function is, the worse its properties are. In practice, most people use those multiply-add-shift style hash functions, because they have quite good properties but are still fast and efficient on modern CPUs.
So apart from the hash function, the next question is how we organize the hash table: how do we store our data in it? There are two main schemes that are used by pretty much all hash tables out there.
One scheme is chaining.
So you have a list of buckets for each hash value.
Let's say your bucket has eight slots, and for some reason it is full: more than eight items map to this bucket. Then you basically say, okay, there's another bucket, and you chain them, like a linked list of multiple buckets, where the next bucket sits somewhere else in memory. So it's like a linked list, not with nodes of size one but a bit larger, and you chain those buckets. Another idea is open
addressing. What we do here is that you have one large consecutive array, and your bucket is basically just an offset into this array
where you would try to insert your value.
One way to work with that is, for example, linear probing.
This means, if you want to insert a value, you go to the offset of its bucket and check: can I insert here, is this slot free? If not, you continue to the next slot, and the next, and the next.
This might mean that you end up at a position that actually belongs to the bucket of another hash value, but that is how insertion works: you might end up in another bucket.
The same when looking up a key: you start at the initial offset that the hash function gives you and keep probing until you either find your value or hit an empty slot, in which case you know the value you are looking for is not in the hash table right now.
The good thing here is that, once you have your offset and you look for your value, this is a sequential operation, so we get prefetching and cache locality; this is quite efficient on modern CPUs. But you also generate clusters: with a poor hash function you might get a few clusters that are really full, so you have to scan a lot and you can end up pretty far away from your original bucket when looking for your value.
For that reason there is also an alternative called quadratic probing.
There, in case the first slot is filled, we don't check the following slots linearly; we increase our step width quadratically.
So we check plus one, then plus four, plus nine, plus 16, and so on, to spread the values out so we don't get this clustering effect in case we have a pretty bad hash function.
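Here is a minimal open-addressing sketch with linear probing, just to make the probe loop explicit (fixed capacity, no deletion or resizing; quadratic probing would only change the step width):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct OpenAddressingTable {
    struct Slot { uint64_t key = 0; uint64_t value = 0; bool used = false; };
    std::vector<Slot> slots;

    explicit OpenAddressingTable(size_t capacity) : slots(capacity) {}

    bool insert(uint64_t hash, uint64_t key, uint64_t value) {
        for (size_t i = 0; i < slots.size(); ++i) {
            Slot& s = slots[(hash + i) % slots.size()];   // linear probe step
            if (!s.used || s.key == key) {                // free slot or same key
                s.key = key; s.value = value; s.used = true;
                return true;
            }
        }
        return false;  // table full
    }

    std::optional<uint64_t> lookup(uint64_t hash, uint64_t key) const {
        for (size_t i = 0; i < slots.size(); ++i) {
            const Slot& s = slots[(hash + i) % slots.size()];
            if (!s.used) return std::nullopt;             // empty slot: not present
            if (s.key == key) return s.value;
        }
        return std::nullopt;
    }
};
```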
Another idea that often works together with linear probing is Robin Hood hashing.
The name comes from the idea that Robin Hood, as you probably know, tries to be fair: he takes from those with an unfair advantage so that everybody ends up more equal.
The idea here is: say you end up in a situation where some values are stored pretty far away from their home bucket, because the slots right in between are occupied by values of other buckets. With linear probing and clustering, a value might not sit in its actual bucket but far off.
Robin Hood hashing wants to avoid that searches for some hash values are much slower than others because they have to probe over large distances, so it swaps values.
When it recognizes such a situation, it takes the value that is far from its actual bucket and swaps it with a value that sits close to its own bucket, so that the probing distances become more equal, more uniform.
Another often used form of hash table is cuckoo hashing.
The name comes from the cuckoo bird: as you probably know, cuckoos sometimes lay their eggs into a different nest, not their own, and kick out the other eggs.
And this is exactly what we do in cuckoo hashing.
Basically, we have two hash tables (or multiple hash tables), and each hash table has a different hash function that you use to find your bucket.
Now, if you want to insert into the hash table, you go to hash table one and check: can I put the value in here? If the slot is free, fine, just do it.
If the slot is not free, you kick out the current value: that value goes to the other hash table, and your value takes its place in the first hash table.
If the kicked-out value lands on a position in the second table that is also taken, this continues recursively, and it can even loop: at some point a kicked-out value may come back to the first table and kick out your own value again.
So you might run into a cycle, and this is the case where cuckoo hashing just says, okay, I need to resize my table and pick other hash functions to continue.
But in general, assuming we have good hash functions, the chance of such cycles is pretty low, and cuckoo hashing works quite well.
It also means that for lookups you need to check both tables, unless you find your value in the first one right away, and deletes work the same way.
We can shortly go through an example here.
So we have T1 and T2. For the first hash table we have a simple modulo-8 hash function: we just take the value modulo 8 and then we know which bucket to use. For the second table we first multiply by 3 and then take modulo 8. Now if we're going to
insert the 1, we take the 1 modulo 8, which is 1.
Both hash tables here are 0 indexed.
So we see that the first slot is not taken, or the slot with index 1 is not taken,
so we can put our value 1 in there.
Then we take the 4.
Again, we take 4 modulo 8, which is 4, and put it into the fifth slot, index 4, because nothing is in there yet.
We do the same with the 8. 8 modulo 8 is 0.
So we put it in the first slot with index 0.
And now we want to insert the value 0.
In this case we see: I can't add it to the slot, because there is already a value in there. And now the cuckoo thing happens: I kick the other egg out of the nest, basically.
So I kick out the 8. Where does the 8 go? The 8 is multiplied by 3, so it's 24.
Modulo 8 is again 0, so not the best hash function here.
We put the 8 to the first slot of the second hash table,
and the 0 will take its place in the first hash table.
For the 9, we do the same, and the one goes there.
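A small sketch of this insert logic, using the two toy hash functions from the example (a real implementation would rehash or resize on a suspected cycle instead of just giving up):

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <utility>

struct CuckooTable {
    std::array<std::optional<uint64_t>, 8> t1{}, t2{};

    static size_t h1(uint64_t k) { return k % 8; }        // table 1: k mod 8
    static size_t h2(uint64_t k) { return (3 * k) % 8; }  // table 2: 3k mod 8

    bool insert(uint64_t key) {
        for (int kicks = 0; kicks < 16; ++kicks) {
            auto& slot1 = t1[h1(key)];
            if (!slot1) { slot1 = key; return true; }
            std::swap(key, *slot1);          // kick the old value out of t1
            auto& slot2 = t2[h2(key)];
            if (!slot2) { slot2 = key; return true; }
            std::swap(key, *slot2);          // kick it out of t2, try t1 again
        }
        return false;                        // cycle suspected: caller must rehash
    }

    bool contains(uint64_t key) const {      // at most two probes
        return t1[h1(key)] == key || t2[h2(key)] == key;
    }
};

// Inserting 1, 4, 8, 0, 9 reproduces the example: inserting 0 kicks the 8
// into t2[0], and inserting 9 kicks the 1 into t2[3].
```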
Any questions so far for cuckoo hashing
or hash tables, hash functions?
Okay.
I guess we're going to do a break after the B-trees.
Yeah, we're going to do the break later on.
So, a quick recap: we already talked about B-trees a couple of weeks ago.
You probably know most of this, but if the B-tree has a degree of d, each node has at most 2d keys.
The tree is balanced, so all paths from the root to a leaf have the same length, and so on and so forth.
And if you have k keys in a node, you have k+1 pointers: one pointer for the values larger than the last key, and for each key a pointer to the subtree with the values smaller than that key.
B-trees work quite well and continue to work very well for disk-based or block-based databases.
If you have a database that is block-based, B-trees are still the most widely used indexing structure there, but they are not necessarily the most efficient thing to do in main memory.
And for that reason there is an improved data structure called the cache-sensitive B+ tree, the CSB+ tree.
The main observation why the authors came up with this idea is that in a B-tree node you might have, say, four split values; those might be four 4-byte integers, so 16 bytes for the keys.
And then you have five pointers, which in most systems nowadays are 8 bytes each.
So you use a lot of space just for keeping around those pointers.
Pointer chasing also incurs random-access latencies, because you probably don't have the next node in the cache already.
So if this data structure is in main memory, you don't utilize the caches as well as you maybe could, and you have a lot of random accesses to addresses that have to be translated, so there will be TLB misses and so on. So the idea
for the CSB+ tree, the cache-sensitive B+ tree, is that we only store one pointer per node. You still have your node with the four split values, but only one child pointer, plus offsets to the other nodes.
The child nodes below, on the next level, are no longer stored separately, for example on different pages, as a B-tree would usually do. For the main-memory case we allocate them together in one dense array.
So we put them together, we have only one pointer to this node group, and the rest is internally stored as offsets.
So if we decide, say, that we are smaller than the second split value in the middle node, we use an offset into this node group where multiple child nodes are stored together. Those offsets can be much smaller: much fewer bytes, even just a few bits, might be enough to address the nodes inside such a little node group.
So we have fewer pointers to store, only offsets, and we have a dense, consecutively stored array, which is better for caching and for reading the values sequentially in order.
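A rough sketch of what such a CSB+ inner node could look like (field sizes and names are illustrative, not taken from the paper):

```cpp
#include <cstdint>

// One pointer to a contiguous block of children instead of one pointer per
// child; child i is reached via a fixed offset inside that block.
struct CSBInnerNode {
    uint32_t keys[14];            // split values packed into the node
    uint16_t num_keys = 0;
    CSBInnerNode* first_child = nullptr;   // single pointer to the child group

    CSBInnerNode* child(unsigned i) const { return first_child + i; }
};

// Lookup finds the first key greater than the search key and follows the
// corresponding child offset; since the children of one node are adjacent
// in memory, this is friendly to caches and prefetching.
```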
Okay, so then, as always, there are different trade-offs, different versions, and different alternatives, and most of them come from the same paper.
The CSB+ tree was the successor of the CSS tree, the cache-sensitive search tree.
The CSS tree was mainly geared towards read-only databases, so the authors didn't really care about what happens with updates.
It is a read-only structure and very dense: a single array that you can just jump into, no pointers needed, only offsets, and all nodes are sized so that each node you jump to is exactly one cache line, which is good for CPU processing. But this thing really does not work if you have any updates; it is just super cache-efficient for read-only
workloads. Then again, you could say maybe you don't want only a single cache line per node but multiple cache lines. The larger the nodes, the fewer levels you have, but you get the typical trade-off: larger nodes mean more cache lines to read per node, but fewer levels to traverse in the index to find the actual values.
Then there is the full CSB+ tree. This is an alternative that the authors of the CSB+ tree proposed: here we also have one array, but we don't fill it up completely.
We leave some space, so there is a bit of over-allocation up front so that we still have room for inserts. This has almost the same search performance as the CSS tree we just talked about, but it still allows inserts, though only to a certain degree: if splits happen beyond that, since there is only this one allocation, the entire tree has to be rewritten. The standard CSB+ tree also over-allocates, but it still has multiple separately allocated node groups, so we don't have to reconstruct the entire tree.
The next thing, also by the same authors — they liked proposing alternatives — is the segmented CSB+ tree.
Here, the child nodes of one node are put into a few smaller segments instead of one large segment.
Again, simple idea to somewhat account for changes.
And now splits are cheaper,
but also reading is less efficient,
because you might have to read a couple of segments.
And performance-wise, hopefully obviously, the single array that is perfectly geared towards reading and doesn't account for any updates at all, the CSS tree, has the best search performance.
Then comes the CSB+ tree, then the segmented CSB+ tree, and then the B+ tree.
But if we look at insertion, it's the other way around: the B+ tree is best, pretty similar to the segmented CSB+ tree; then comes the standard cache-sensitive B+ tree; and the worst insertion performance has the cache-sensitive search tree, because there we need to rewrite everything.
Yeah.
So the conclusion should be clear, right? So depending on what your workload is,
how many inserts do you have, how many modifications,
depending on that, you can choose different alternatives here.
The next part will be about tries. I think we're going to take a five-minute break, and then we continue with the third topic, which will be tries and the adaptive radix tree.
So, the adaptive radix tree is a radix tree, and radix trees belong to the class of tries.
What are tris?
So tris are K research trees,
so we have still trees,
and we might have K elements per node,
so nodes can be also large.
And the name initially comes from retrieval.
For some reason it's called trice now, maybe to differentiate it from trees.
But actually it comes also from the tree in retrieval. They're also known as prefixed
trees and prefixed trees probably the best name because that already gives away
pretty good the idea of tr tries. And there's a simple
example here of such a trie for strings. In this case we have the four words "and", "ant", "any", and "are".
What we do is: per node level, per tree level, we take one letter of the words that we want to index, and that letter is used as the key.
You can see that all the words share the letter A. So we have the first node: all the words start with an A, so we go to the next node, and there we ask: what is the next letter of our word?
Let's say we are searching for the word "any". The next letter is an N, so we go to the node that contains all the words starting with A and N, and at the very bottom we have the third letter: is it a D, a T, or a Y?
If we look for "any", we follow the pointer to the Y, and now we have found our value "any".
That also means we don't have to store the actual key in this case, because it is stored implicitly: when you traverse back up to the root, you have your entire key.
This is a very simple structure, but it can also be problematic in some cases. First of all, the tree height no longer depends on n, right? So for
B-trees and other comparison-based trees, the tree height usually depends on how large the nodes are and how many values you have.
Here it depends solely on the length of the key, in this case the length of the words.
With this structure we don't have to rebalance, and we get a lexicographic order, which is quite nice if you traverse it.
And again, keys are stored implicitly, and we can reconstruct them entirely from the path that we take from the root to the leaf node.
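As a concrete illustration, here is a minimal character trie, assuming lowercase words only; note the fixed 26 child slots per node, which is exactly the space issue discussed next:

```cpp
#include <array>
#include <memory>
#include <string>

// The key is stored implicitly in the path from the root to the node.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children;  // one slot per letter
    bool is_word = false;
};

inline void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;
}

inline bool contains(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        node = node->children[c - 'a'].get();
        if (!node) return false;   // prefix not present
    }
    return node->is_word;
}
```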
One problem here is: let's say we use the alphabet, only uppercase letters. Then the first node, the one with the A, would already need 26 slots, in case other words are inserted into this tree later, and every node would need 26 slots to be able to index all possible keys.
For this example we would waste a lot of space: on the second level you would only use the N and the R out of all 26 characters.
So either we can't pre-allocate all of them, or we need to find some way to do that efficiently.
And if you imagine doing this character-based for Unicode, it would simply not be possible. So you need to come up with some other way, and one way to deal with that is the radix tree.
Here we only take the binary representation of whatever your key is. What you can do, for example, is take each single bit: each bit is either one or zero, so the radix tree would then be a binary tree.
Or you can take a larger number of bits at a time; this number of bits is sometimes called the span.
Here you can see, for example, that we take our longer key and split it into parts of three bits each. That means each node can represent eight values with its three bits, so we would have eight children per node. For a span of one, again, it's a binary tree, so we only have two children.
The span also determines how deep our tree is: for the binary case it's basically the longest possible path, one level per bit, and the larger the span, the lower the height of our index.
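A tiny sketch of how a radix tree with span s could pick the partial key for each level, assuming 32-bit keys processed from the most significant bits and a span that divides the key length (the helper name is mine):

```cpp
#include <cstdint>

// Level i of the tree indexes into its node with the i-th group of `span`
// bits of the key, starting from the most significant bits.
inline unsigned partial_key(uint32_t key, unsigned span, unsigned level) {
    unsigned shift = 32 - span * (level + 1);          // position of this chunk
    return (key >> shift) & ((1u << span) - 1u);       // extract `span` bits
}

// span = 1 gives a binary tree with up to 32 levels;
// span = 8 gives 256 children per node and at most 4 levels for 32-bit keys.
```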
So the span size determines the height, together with the key length, and the key length is still the most crucial factor for radix trees, because you store the entire key across the levels of the tree.
If you compare that to a balanced binary search tree, it's quite interesting to see the difference: as the number of keys grows, the binary search tree starts out smaller, but its height grows and grows with the number of elements, while the height of the radix tree is static, depending only on the key length and the span size. The height does not change just
because we have more elements. But as you've already seen in the example with "and", "ant", and "any", you might have a rather large space consumption if you do not have a good fill level in your nodes.
So even if you don't use characters, as in the previous example, but the radix tree with bits, you might still have a lot of bit combinations that are not used.
With a span of three you have eight possible children per node, but only a few might be used, so such a node is not space efficient; a lot of slots would stay unused.
So you have to find a span s that balances this: the larger s, the better the lookup performance, because you have fewer jumps, the tree is not as high, so you have only a few indirections. With a smaller s you have better space consumption, less waste, but a higher tree, so lookups usually get slower. And one alternative here,
this is the alternative that we are going to talk about for the rest of the lecture, is that we have different node sizes.
So we don't have to have the same node size for all the levels.
And we have adaptive node sizes. And this is where the name comes from. This is then the
adaptive radix tree, also called the ART index. In the example below you can see that the middle level has a very small node, with only a few elements; here it has only two children. On the bottom level we have larger nodes; in the rightmost case there are, I think, eight children. So the nodes have different sizes.
How does that work? The adaptive radix tree has a fixed span size, and we split each value that we index in the ART into its bytes.
So we have a span size of 8 bits, which means up to 256 values per node.
Well, I probably don't have to go through this example,
but I guess I can.
So if you have this number of 218 million something here, you have this bit representation, which I hope you can read.
If we split this 32-bit integer into single bytes and interpret each byte on its own, we get the value 13 for the most significant byte, 2 for the second byte, then 9, and the last byte is all 1s, so 255.
What the ART does is use a high fan-out: with 8 bits we can have a fan-out of 256, but we have four physically different node types that we can store, and the main purpose of that is to save space. We keep the high fan-out, so in case a node is fully used we still have a very flat, fast tree, but in case values are sparse, we don't waste that much space, because we can use the other node types.
What the ART also does — we'll see it in a bit more detail in an example later — is path compression.
As shown in the beginning: if you have levels where there is always only a single child, for example if we only had "and" and "any", the second node would always be the N. We don't have to store those nodes; we can compress those levels away.
Furthermore, all the nodes are cache-aligned, so they are more efficient on modern CPUs, and there is also the possibility to use SIMD when processing the ART index.
And here you see an overview of the four node types that we have in an ART index: the Node4, the Node16, the Node48, and the Node256.
The first one is the smallest node, used when you only have two to four child nodes. I say two because if you have only one child, the path compression mentioned before will (or might) be used instead.
So this is the node you use if you have at most four child nodes. And how would we store that? In this case we have a fixed array of
four key bytes, and those four values are stored here.
And they are not consecutive, right? They can't be consecutive, because the byte can still take any value from its entire domain.
But with only four values you can do a simple linear scan — for four entries you wouldn't bother with binary search, you just check: is my value here?
So if my byte value is 3, I iterate over the stored bytes, find the 3 at some position, and then I know I have to follow the pointer at that same position.
So the Node4 is basically four key bytes plus four pointers to find the child node for a particular byte.
If you have more children, the next node size is the Node16. Here you have 16 potential key bytes, and again 16 pointers.
For 16 entries, a binary search might already be better than scanning linearly, and you can also use SIMD here to find your value; different approaches are possible. That is the Node16.
Then the Node48 is a little bit different. Here we no longer store only 48 key bytes, because the difference between storing 48 and 256 bytes is not that large if it's only bytes; the pointers are the expensive part — eight bytes versus one byte.
So the Node48 has 256 slots for the key bytes, many of which might be unused, but only 48 pointers.
Now I don't have to search at all, neither binary nor linear: if my byte value is 17, I go directly to slot 17, check whether it is set and which pointer it refers to, and then follow that pointer. Not having to search for the 17 is a little bit faster and also allows for easier inserts. For that reason the pointers are no longer sorted, because maybe the value 17 was inserted before the value 2. And the last node
type is the largest node type, where there is no compression at all: the Node256.
Here we don't have to store the key bytes anymore, because we can use our byte value directly as the index and go straight to the pointer we need, which then leads to our value or the next child.
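Roughly, the four layouts look like this (simplified field names; the header with node type, child count, and compressed path is omitted):

```cpp
#include <cstdint>

struct ARTNode;  // common base holding the header information

struct Node4 {            // 2..4 children: parallel arrays, linear scan
    uint8_t  keys[4];
    ARTNode* children[4];
};
struct Node16 {           // 5..16 children: like Node4, binary search or SIMD
    uint8_t  keys[16];
    ARTNode* children[16];
};
struct Node48 {           // 17..48 children: 256-entry index into 48 pointers
    uint8_t  child_index[256];   // byte value -> slot in children (or "empty")
    ARTNode* children[48];
};
struct Node256 {          // 49..256 children: array indexed directly by the byte
    ARTNode* children[256];
};
```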
Yeah, and per node you always have to add a 16-byte header. It contains a marker that tells you what kind of node to expect at this pointer address, i.e., how large the node is, and how many children are currently in use — a Node16 might only have 13 slots filled. And there is also the compressed path; you will see later what that means. This header is always 16 bytes. The standard
ART index is a unique index, so it doesn't store multiple values, for example multiple positions, per key.
The easiest case is single-value leaves: if you have just one value, the leaf is the value.
If you want multiple values per leaf, you have to think about how to store them: you don't want a separate allocation for every leaf, and it should still be efficient under updates and inserts.
Another alternative is that you don't store a pointer in the very last level at all: if your value fits, i.e., is the same size as or smaller than a pointer, so 8 bytes, you can store the value directly there instead of a pointer that points somewhere else.
Yeah.
Here we come back to the example. We have this 218-million-something value that we are now trying to find.
The SQL query says WHERE the ID column equals 218 million something, and now we want to find the tuple: which row in our database has this value? A typical index lookup.
So what we do in the ART is take our value and its byte representation, so we split it into the bytes 13, 2, 9, 255.
On the first level we find a Node4. We have the 13, but we don't know how the entries are stored, so we can't jump directly to a certain offset; we have to look for the 13. We find the 13 in the first slot, so we know we have to take the first pointer and check the child node found under this pointer.
Looking at the header information of that node, we see it is a Node48. For a Node48 we no longer have to search; we can directly index into it.
Our second byte stores the value 2, so we go to index 2. There we find a small offset — I think it's one byte only — and in this example this offset points to position zero, where we find the pointer to our next node.
For that node, the header tells us it is a Node16. Our next byte has the value 9, and in a Node16 we have to search the stored bytes, potentially all 16 of them, to find the 9.
Again we find a pointer.
In the very last level we could store directly our pointer or the tuple identifier, whatever our system needs. Here we are looking for the 255, so we go to the very last position in this node, and there we directly have our TID, the tuple identifier, and we're done. Now we can fetch the page, or whatever we need to do, to fetch the tuple with this given ID.
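As a small sketch of the per-level key handling in this lookup (the traversal over the concrete node types is omitted; it dispatches on the header as described above, and the byte values match the example, which works out to the key 218,237,439):

```cpp
#include <cstdint>

// Depth 0 uses the most significant byte of a 32-bit key; depth must be 0..3.
inline uint8_t key_byte(uint32_t key, unsigned depth) {
    return static_cast<uint8_t>(key >> (8 * (3 - depth)));
}

// key_byte(218237439u, 0) == 13
// key_byte(218237439u, 1) == 2
// key_byte(218237439u, 2) == 9
// key_byte(218237439u, 3) == 255
```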
Any questions so far on how the ART works?
Okay.
Yeah, so long keys result in a large height; this is a typical problem. So what can be done to reduce the height?
Here you see the path compression tricks that we briefly talked about. In the left part of the tree there are only the words B-A-R and B-A-Z, so the node for the A doesn't do anything, because it has only one child.
What we can do is compress this: we don't have to jump to another node and fetch it from main memory only to see that there's not much in it — just one next pointer, which is always the same.
The compressed path is then stored in the child node, which basically says: my compressed path is this and that. You need to remember somewhere that the A has been compressed away, because if you look for "boo", for example, you could otherwise end up on the wrong path. So you need to store somewhere what has been compressed, and this is part of the per-node header information.
And yeah, so the left-out part is stored in the nodes.
The other thing is lazy expansion. If we have this "foo" here, it might be stored in compressed form, and we only expand and actually increase the height when a new value really requires it — as late as possible.
So for the second O here: if the new word were "fob", we don't have to expand. We expand at the latest possible point. I think you get the idea.
And this is somewhat fixed: we can't have arbitrarily long compressed paths; it is fixed to eight bytes.
If this is exceeded, we either need to store the complete key in the leaf, or we store just enough information and then have to verify the actual value behind the TID: you can still store the TID, but then you know you are not entirely sure it is the correct value, because so much was compressed away, so you follow the TID and check there whether this is really the value you were looking for.
Here you see an overview. This is directly from the paper, which we have linked in case you want to look at it.
And here you see why this ART index is such an interesting idea: the graph from the paper compares space consumption and tree height.
Tree height is basically the performance part: the lower the tree, the faster we are, the fewer jumps we have.
And space consumption is something we always want to optimize; using less space is always good, and it also helps performance, because maybe the entire tree fits in main memory. What you can see here is
the following. The data set here is 1 million uniformly distributed 32-bit integers.
There are four billion different possible values for 32-bit integers, so with only 1 million keys, not all the nodes will be fully used; a lot of space is wasted, so to say, for large spans. And this is what is shown here.
With a span size of 1 we have the binary tree, with all 32 levels, because per bit we have one level.
That has quite good space consumption, but we also have to traverse a lot in this tree to find our value.
If we have larger spans, we get very small tree heights, but since the data set is really sparse, those trees waste a lot of space — this is a logarithmic scale, so they really waste a lot.
And the ART index is a really excellent trade-off here: it has roughly the height of a span size of eight, but only the space consumption of something like a span size of three in this case.
So it nicely balances height and space consumption, and height basically means lookup performance.
In this graph, the authors compared the ART for two data sets, sparse and dense.
A dense data set basically means pretty much all the values exist: if I have a test set of one million values, then every value exists, like an increasing key. And sparse means that a lot of values don't exist.
You can see that for both data sets the ART performs really well, and the only thing that comes near is a hash table, in this case a chained hash table.
So the ART has quite good lookup performance compared to other prefix trees; we also see the CSB+ tree, the cache-sensitive B+ tree, and other alternatives here.
And for insert performance it's even better, so to say: we are faster than a hash table — hash tables are usually not the fastest thing to insert into — and the ART just outperforms the others, mostly for dense data sets, but even for sparse data sets it's still the fastest data structure here.
And with bulk inserts it's even better; there's the option to bulk-insert into the ART as well.
Okay, so that was it mainly for today.
Before we get to the summary: any questions so far about the adaptive radix tree?
Okay, so what did we talk about today?
We talked about hashing: hash functions, and how you typically have this trade-off between quality and performance of the hash function.
We talked about hash table organization, so the different strategies to implement the hash table: open addressing or chaining.
We talked about trees: the B-tree, which is optimized for block-based storage, so it's very good for hard disks and solid-state drives, and the cache-sensitive B+ tree, which is optimized for main-memory computing, with a data layout that is more cache-friendly and better for the CPU to process.
And last but not least, we talked about tries: the radix tree, and especially the adaptive radix tree with its differently sized node types.
Okay, so one question; I hope somebody helps me answer it today.
What we've seen is: for, let's say, a sparse data set, the performance of the hash table and the ART is somewhat comparable, right? And the lookup performance is also comparable to the hash table.
The hash table here is actually really good, and a chained hash table is not even what you would usually use for high performance; there are even better hash tables.
So what would be a reason to use the ART? Do you have an idea why I would maybe still use the ART, even though a hash table is often simpler to implement?
Right, yeah, it's simpler to implement, and it's also performing quite well. So what would be an idea why I would still use the ART index?
Because the runtime of a hash table is stochastic. Sorry? The runtime of a hash table is stochastic. I have a bad worst case. If my initial data has a bad distribution, then I might get a.
Yeah, that's one really good point.
It depends on the data, on how the data looks, and also on the hash function — how well this hash function distributes the values. So it's sometimes hard to say how well a hash table will work.
That's true. One other point is insertion: for a hash table, insertions are really fast until the point where the table gets too full. Then, usually, what most hash tables do is double their size.
So you have fast insertions up to one point, and then you rewrite the entire hash table and rehash it. On average it's still quite nice insert performance.
And if you know the size up front, that's also fine. But if you don't know the size, at some point you're going to hit this point where you have to resize the hash table, and tree structures are often better here, because you can make smaller updates, smaller changes, and you don't have to rewrite the entire data structure.
But there's one more thing, for querying, that the hash table does not allow you to do when you search for a key. Do you have an idea what that could be?
Mm-hmm.
I think some kind of range queries could be difficult with the hash table, but supported for ART.
Exactly.
So that's exactly the point. Let's say you search for all the keys smaller than one million; you basically know the byte representation of one million. Or let's take something very small, like all the keys smaller than 17: there are a lot of zero bytes in the beginning, so I directly know where to go in my ART.
Then you go to the leaves, and the ART gives you the chance to evaluate such range predicates: once you find your first value, you can traverse the values in the leaves quite easily.
For the hash table, range predicates are not possible — except for the very few cases where you can just enumerate all the values. There is nothing in hashing where you can say: give me all the buckets I need to look at for values smaller than 17. Not possible, right?
And range predicates happen often in database systems, so this is something where hash tables don't work and the ART, with its nice performance, does.
Okay, that's it so far.
If you don't have any questions,
we are going to continue next week. I think Marcel is going to present a talk about profiling.
And then the second talk next week will be about multi-core parallelism.
In case you don't have any questions,
yeah, have a nice day.
Thank you. Bye-bye.