Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Data Structures
Episode Date: May 23, 2023...
Transcript
Okay, welcome everybody. Today we're going to talk about data structures.
Before that, so there's... we're in the final phase. It's one week left, right?
Yeah, so we have one week left for the first task. Give or take, we'll figure it out.
If you haven't started, do so now, because it will take you some time to get it working, and we're not going to be able to help you last minute if there's some kind of problem. Well, probably not. But I assume everybody who's here has already started, at least the way I hear you talk about the task. So that's good.
And it's going to be more fun even with the next task.
So I hope you enjoy this one.
And if you do the next year, you're really going to enjoy the next one.
So where are we? We're still in the kind of single core CPU package area,
talking about data structures that fit this paradigm
or this space well.
So not so much going to storage,
not so much going to network or distributed,
but see how we can make data structures efficient
for this kind of caching layers.
Next time, we're going to talk about, or tomorrow,
we're going to talk about the profiling.
So that's going to be fun.
And then you will have the ART, so the Adaptive Radix Tree,
that you can implement,
or I'll introduce this today.
One thing that I want to make you aware of
is on the 14th of June, we're going
to have to switch lecture halls.
So there's some conflicting stuff going on in this hall.
So we're going to be in L10 or L2, so that's first floor.
We'll also post a dedicated announcement in Moodle, just so you're already aware of it: there we're going to have to move somewhere else. Otherwise, not much new here. Today I want to talk about three things, or rather the three most common kinds of index structures that we will be using in database systems.
One is hash tables and hashing in general.
So this is something which is always important,
needs to be efficient. And for that,
we'll look at a couple of hashing functions. So what can you use as a hash function? And then
the hash table organization. So this basically means if we have a hash function, how do we use
it? How do we map it to the hash table? Then we're going to look at trees. Well, a reminder, a very brief reminder about
the B-tree, so you know where we are. And then some adaptations of the B-tree for cache sensitivity, meaning if we're in memory and not going to disk with the B-tree. We said the B-tree is great if we have disk.
It's not so great if we're just in memory.
So how can we make it better and greater for memory and for caches?
So that's the cache-sensitive B-tree.
And we'll look at that one.
And finally, we're going to look at tries.
So another type of tree structure.
We'll cover tries in general first, then radix trees, which are the kind of tries you typically want to use in a main-memory database. And then the one data structure that you will also implement, the adaptive radix tree or ART, which
is a really nicely designed data structure that
will give you some idea about how
to make data structures efficient in general for caches.
And that's why we picked this one.
And because it's a nice data structure and not super complex, we also let you implement this one.
Okay, so let's talk about in-memory indexes or indexing paradigms.
And we kind of have three different paradigms that we're going to talk about today.
These can generally be divided into two categories: comparison-based and hashing-based.
With a hashing-based index, we directly know where we want to go in our index.
So basically we're calculating the position in our index
and we have to jump there or we jump there
and directly more or less directly find
what we are looking for.
And then we have all these comparison based indexes
where we're basically saying, okay, is it greater?
Is it less?
So somehow we compare our value that we try to find with,
or the key that we try to find with something in our index.
And there are tree structures, and then there are trie structures.
So in the tree structures, as you already know,
so we're basically looking, is it greater or is it less?
And then traverse the tree from there on.
And the trie structure is different because there we're basically looking at subparts of the key.
And based on these subparts of the key, we're going to continue traversing.
So we're not comparing the complete key at every stage, but typically comparing prefixes. That's also why it's called a prefix tree or trie. We're basically starting by dissecting the key into multiple parts and comparing with those.
And if you're thinking about asymptotics, so there's usually different ways how you can calculate
or how you could model the asymptotics in these indexing paradigms.
But if we say we have n keys of length k, if we have fixed size keys, then typically
we would say, well, hashing is like constant
time, right?
So we directly, we're looking at the key, we directly know where to go.
And then we have constant time, an O(1) access.
In a tree, it basically depends on the number of items in the tree.
So how many comparisons do we need to do?
And this is why we have a logarithmic access.
And that's basically as we would always have it.
Like in a binary tree, we basically have a logarithm base 2.
If we have a B-tree, we'll have a much larger logarithm base. I told you, right, in many cases this means something like three levels,
which basically, if the keys are not, or if we don't have a very large number of keys,
this means in practice we're also at a constant access rate,
because we only have so few levels.
And for the trie (sorry about this, just a second, it's going to be better now), the trie basically depends on the key length. As I told you, we're splitting up the key into multiple subparts and finding our items based on these subparts.
So the tree height of the trie only depends on the key length.
And this means if we have a fixed-size key, we also have constant access, which is O(1). And of course, in the hash table, we might have a smaller constant,
but asymptotically, it's the same. But if we have varying key lengths, then hashing will also cost us more for larger keys: the hash function is actually a function of the length of the key, so you have to do more work if you have longer keys.
The same is true for the comparisons in the tree: if you have longer keys, your comparisons will take longer, so the key length also becomes a factor of the search. And for the trie, since we're just looking at the different parts of the key, we're in the same asymptotics as the hash function or hash table again.
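To summarize this way of counting work (a summary I'm adding here; n keys, key length k), the rough per-lookup picture is:

    hash table:   O(k)        (hash the whole key, then ideally constant probing)
    search tree:  O(k log n)  (about log n comparisons, each over keys of length k)
    trie:         O(k)        (one node per key chunk, independent of n)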
And I'm showing you this because typically,
if you would look at typical comparisons of these different kind of tree structures
or index structures, then in many cases,
the asymptotics that you would see would basically say, well, for the hash table,
we're always constant.
For the search tree, we have a logarithm based on the number of elements.
And for the trie, the factor would basically be the key length.
But this is kind of not painting or showing the whole picture.
This is why I'm adopting the same asymptotics that Viktor Leis uses for this.
Because if you look at the calculations
that we need to do and the comparisons that we need to do,
then this is what we have to do.
And if we're looking at efficiency and the amount of work that we need to do, this is where we're going to go.
Okay
Let's go to hash tables. So hash tables basically consist of two main things: on the one hand, we need a hash function that turns the key we're looking for into a hash value, and then we need some kind of hash table organization. Given this hashed key, the hash value, we somehow have to map it into our hash table.
And if we have perfect hashing, meaning each value that we're hashing maps to exactly one entry in our hash table, then everything's easy, right? Then it's basically just an array and we have a direct mapping. But that's not going to be the case, and that's why we need different kinds of hash table organizations. We'll look at those second.
First, we're going to look at different kind of hash functions.
And, well, a hash function in general is basically just a function that maps arbitrary data to fixed size values.
So this is kind of like a plain definition. And given these fixed size values, we can then basically say for these,
we can have a hash table where we can look up
where our variable size or variable arbitrary data
should be located in the hash table.
So our hash function is really just a function
from one domain to the hash value domain, and typically our original domain will be larger than the hash value domain; otherwise our hash table, or rather our hash function, is not efficient at all, because we're not using the full space of the hash function.
And there are a lot of different examples, a lot of different things that people have come up with.
And it really depends on what you want to do with the hash function, what you're using. And one thing that's important to keep in mind, there's a huge difference in hash functions
for cryptographic approaches and techniques
and stuff that we do.
Because for us, there's always this trade-off
in how good is the hash function
and what's the performance.
So we can also deal with poorer, or actually poor, hash functions if they're faster, if it costs us less to deal with the collisions than it would to compute a much better, much more uniformly distributed hash function.
That's the trade-off. In cryptography, on the other hand, if your hash function has lots of collisions, meaning it doesn't give you good statistics and good randomness, then somebody can maybe reconstruct the original value, or find some other value that gives the same hash, and infer things from the hash function.
That doesn't really matter for us in many cases, right?
In these situations where things don't work out that well, we just need to make sure that the performance doesn't degrade too much. Everything else is kind of fine.
And so, looking at the stuff we use: the most simple would be division. You should have seen this by now already; this is just modulo. The modulo function gives us a very simple hash function.
Basically, we take our hash domain, the space that we want to hash into, and use it in the modulo function. If we have a numerical representation of our value, we automatically get a hash function, though not necessarily a good hash function.
We'll see this on the next slide.
Then there's other stuff, just so you've heard of it: there's Fibonacci hashing, which basically uses a Fibonacci-like sequence to get an even spread, but it's more complex to calculate.
Then, what we actually use a lot is multiply-shift or multiply-shift-XOR functions, where you do some multiplication, some addition with certain specific constants, and then do shift operations.
So the idea is you do this for a couple of rounds, or even just a few rounds, and you try to generate what are called avalanches in the bit representation, meaning the original bit representation of your value gets perturbed so that similar values end up looking very different, and neighboring values get spread across the hash domain as well as possible. And this is what most hash functions that we're looking at actually use.
So these kind of simple multiplications, additions, and shift operations.
This is also called rotation.
So we have our initial bit representation or numerical representation.
We're going to multiply this with something, so it's going to be larger.
Then we're shifting it. We're just using some part, like the most significant bits.
Then might multiply it again, might add something until our value basically gets
mapped to something like a completely different domain, essentially. And we're trying, or people who design these hash functions
basically try to find as good constants as possible
for different kind of key lengths,
such that you get a good distribution.
It's not going to be perfect, but it's going to be quite good.
And one instance of this that's used a lot is murmur hashing.
So we'll talk about this in a bit.
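As a small illustration of this multiply-shift-XOR style, here is a sketch of a 64-bit finalizer; the constants are the ones commonly used in the 64-bit MurmurHash3 finalizer, shown as an example rather than taken from the lecture slides:

    #include <cstdint>

    // Multiply-shift-XOR finalizer (MurmurHash3 fmix64-style constants):
    // each XOR-shift and multiply round spreads ("avalanches") the input bits.
    uint64_t mix64(uint64_t x) {
        x ^= x >> 33;
        x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;
        x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }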
Other stuff that's also interesting is, for example, tabulation hashing.
So there the idea is that you, again, you split up your key into different parts.
And for each of these different parts, these sub-representations, you create a lookup table with random values. Take the simplest case: say we're using two bits per part, so each part has four different values, and we keep a lookup table with a random entry for each of these four values. We just fill that up.
And then, using these random values, we combine them to get a quite uniform or universal hash function.
Because using these tables, we can quite well create any kind of hashing function.
But you basically need a special table to do this,
and you need to store these tables.
So this is not something that we're going to look into
in more detail, but something interesting to look up
if you need a better type of hash function.
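As a rough sketch of the idea (my own example, not from the slides): for 32-bit keys you could keep one lookup table of random 64-bit words per key byte and combine the per-byte entries; the classic variant combines them with XOR:

    #include <array>
    #include <cstdint>
    #include <random>

    // Simple tabulation hashing for 32-bit keys: one table of 256 random words
    // per key byte; the looked-up words are combined with XOR.
    struct TabulationHash {
        std::array<std::array<uint64_t, 256>, 4> tables;
        TabulationHash() {
            std::mt19937_64 rng(42);                    // fixed seed, just for the sketch
            for (auto& table : tables)
                for (auto& entry : table) entry = rng();
        }
        uint64_t operator()(uint32_t key) const {
            uint64_t h = 0;
            for (int i = 0; i < 4; ++i)
                h ^= tables[i][(key >> (8 * i)) & 0xFF];
            return h;
        }
    };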
OK, so let's look at hash function qualities
and the kind of problems that we see.
And here's a simple experiment, not something I did myself, but I'm showing it to you, with data you would have in a database: think about a date field that we want to represent as a 32-bit unsigned integer, where we encode the year into the 16 most significant bits, then the month into the next eight bits, and the day into the least significant eight bits. So I put an example here.
So we're gonna encode all dates
from the first of January in 1950
to the 31st in December in 1999.
So apparently it's a bit older example in 32 bits.
So then, say, for example, the 3rd of February 1951 would look like this. And I've messed up the 51 here, sorry about that, I'm just going to quickly fix it. No, I said 51. Actually, I just messed it up here. Okay.
So if we basically compute this, so we have 1951 in binary representation,
then this would be this nice binary number here.
Then the 2nd of February, that's still easy to read.
And the 3rd of February is also still easy to read. So it's an easy kind of representation of a date into a 32-bit unsigned integer.
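As a small sketch of this encoding (the helper function is mine, just to make the bit layout explicit):

    #include <cstdint>

    // Year in the 16 most significant bits, month in the next 8, day in the last 8.
    uint32_t encode_date(uint32_t year, uint32_t month, uint32_t day) {
        return (year << 16) | (month << 8) | day;
    }
    // encode_date(1951, 2, 3) == 0x079F0203: from one day to the next, only the
    // lowest bits change, which is exactly what hurts the modulo hash below.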
And if we do this, we're basically translating all of these (I've fixed my number), so 18,000-something values.
And if we encode all of them and start hashing them with a modulo function, then the following happens.
So think about hash tables of different sizes, in powers of two: say we start with 2 to the power of 8 slots and go up to 2 to the power of 19 slots, and we hash these values into those slots. Then what happens, especially initially when the table size only covers the first eight bits, is that the modulo function just looks at the least significant bits of our representation. And given that these are dates, those last few bits are the day, so they will only ever take the at most 31 values a month can have. We're never going to use the full 8 bits, the 256 different values we could have, because days only go from 1 to 31. So in our hash domain (and it doesn't really matter here whether we say hash domain or hash table) only 31 values will actually be touched or filled.
All of the other values are basically empty.
And, well, 225, I actually have it on the list here.
So, this means only these few values are actually used,
and we have lots and lots of conflicts.
So, our hash function will actually basically only map to 31 buckets and everything else is empty. And
we have on average basically 589 values in a single bucket. And if we would have a uniform distribution
or a universal hash table
or universal hash function
that gives us a perfect hashing,
in this case, we would only have 71.33
for collisions or entries per individual bucket.
So this is just to show you why a simple hash function like a plain modulo operation won't give you a good distribution in many cases.
And this basically continues, right? Taking nine bits, we're now touching the first bit of the month, so we get a factor of two more individual buckets. But because a lot of bits are still unused, most of the buckets are still going to be empty. So rather than 31 different buckets, we now only have 62 different buckets that are used, and we still have 450 buckets which are empty.
So basically, our statistics are just as bad as before.
And this continues all the way: even if we have 19 bits, most of the buckets are still going to be empty, and adding more and more bits won't help. The first couple of bits do actually change, but then there is a large part of the representation which is always basically the same, and that always leaves us a quite large range of buckets which are unused.
So what you should take from this is that, in many cases, if we're using a modulo function, we won't be able to fully utilize our hash table and we will have lots and lots of collisions.
So in this case basically even here where in a uniform or universal hash table,
so if we have a good hash table, right,
where we're trying to map our 18,000 values into these 500,000 values,
then usually we should have almost no collisions, right?
We will still have collisions in there in a uniform or universal table.
This is just because of the birthday paradox.
So you probably know this.
If we basically touch enough values,
then the likelihood of collisions at some point
will just be very large.
So we will have collisions, but we'll have very few collisions.
In our case here, we still have a lot of collisions. Let's say, for each bucket that is actually filled, we still have, on average, more than six values in there.
So let's look at a more advanced hashing function.
And this is murmur hashing.
So that's the multiply-and-rotate function: we're multiplying with a number and then rotating the values, and we do this in a loop. Multiply and rotate, "mur", repeated, "mur mur": that's where the name comes from, and that's how it was implemented in the first place.
And this is still an easy enough
or a good enough hash function for many cases.
However, the current version of Murmur, that's version 3 or so, actually uses multiply, shift, and XOR operations, which gives you even better hashing behavior than just doing multiply and rotate.
But as I said, it's one example, and with a current, good kind of murmur hashing, we get this kind of behavior here.
So here we can see what happens if we're using eight bits for the same kind of setup, again with the date representation.
Then we can see we're using all of the buckets.
All of the buckets are actually filled.
All of our hash values have original values in there.
We have zero empty buckets. On average, we have 71 chain length, basically, which is
perfectly the same as we would have in a universal hash table, which kind of makes sense, right?
Because, I mean, if we're using all of the buckets and we're averaging,
then it should be the same. However, we can also see that if we look at the longest chain, meaning the bucket with the most collisions, we're still not too far from the average, which means we get quite a nice, even spread. And this continues: this is already good for 256 slots, so eight bits, and for nine bits it's still roughly the same. The longest chain doesn't decrease quite as fast as the average, but it stays close by.
And then for 19 bits, we can see that the number of used buckets is close to the actual number of values that we have in the original set.
As I said, it's 18,250-something values that we're hashing, and we see that we're close to 18,000 used buckets. Again, with this number of elements the likelihood of collisions is just very high, and our hash function is not perfect, so we will have some collisions in there, but not that many.
And the maximum chain length that we have is three, so the average number of elements per filled bucket is actually very low.
And of course, the uniform model is something we won't quite match here, because we're only averaging over the buckets that actually have something in them.
Okay, so this basically already shows us
if we put a bit of effort, a bit more effort in our hash function,
then we get much nicer behavior,
we get much better distribution across the different kind of buckets.
And, well, with that, basically,
for sure, a much better organization of our table.
And, well, so somehow I messed up my animations here.
I'll have to fix this separately.
So, what do we actually want from the hash function? Let me just open this up, otherwise I'm getting confused.
If we have a hash function, then we want to map it to a hash table. The hash table has a certain size, say m, and our original data has n values. Then ideally we want uniformity: the length of the average collision chain should be n divided by m (integer division; of course we cannot have less than that), and the probability of a collision should be 1 divided by m, meaning hopefully we get a nice uniform distribution across the different buckets, which was not the case for the modulo.
So there, basically, we mapped everything to a very small sub-part,
so the probability of a collision was much higher
than the actual size of the hash table.
And of course, everything should be efficient.
So we don't want to spend too many cycles.
And this is important.
So even there we have this big trade-off.
The better our hash function basically, the better the hash table organization.
But if we spend too much time in the hash function, our hash function is super expensive,
then we won't get this performance improvement back by the better hash table organization.
Because, on average, if we have a good enough hash function, the lookups will even out: we might have some lookups that are bad, but most of them will work out nicely. If we spend too much on the hash function, this won't really pay off anymore.
So in practice we actually use something like multiply-shift-XOR combinations,
and you can basically look up there's a couple of different implementations out there.
Usually the trick is really finding good constants that you multiply with for different domains. So
be it 32 bits, be it 64 bits, etc. You try to find good constants and then you get a good function.
In practice, you actually need two steps that you have to take
when you do your hash function.
In the first step, we're mapping our value to the hash value; this is just applying the hash function. Most hash functions will give you output in some power-of-two domain, and if your hash table is not a power of two, or is a different power of two, then you need an additional step where you do a mapping to the hash table size. This is typically called the finalizer.
If your hash function maps to the same power of two, then we're good. But in many cases, we might actually start with a smaller hash table, or our hash function will always give us a 32-bit hash value but we don't need a hash table that large.
So then we need this mapping.
And there, again, one simple way would be modulo, or you can do other, more complex tricks, especially if the modulo would again give you some kind of skewed representation. And as always, it's a trade-off.
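Two common ways to do this final mapping step, as a sketch (the function names are mine): a bit mask when the table size is a power of two, and a multiply-shift for arbitrary table sizes:

    #include <cstdint>

    // If m is a power of two, masking keeps the low bits: cheap, but it only
    // works well if the hash function already spreads the low bits.
    uint32_t to_bucket_mask(uint32_t hash, uint32_t m_pow2) {
        return hash & (m_pow2 - 1);
    }

    // For arbitrary m, a multiply-shift maps the 32-bit hash proportionally
    // into [0, m) without a division.
    uint32_t to_bucket_fastrange(uint32_t hash, uint32_t m) {
        return static_cast<uint32_t>((static_cast<uint64_t>(hash) * m) >> 32);
    }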
So now, given that we have a good enough hash function (it's not about having a perfect hash function), we need to somehow map it to our hash table.
Of course, the first thing is that we have our finalizer: the original hash function plus the finalizer map our key to our hash table size.
Still, we will have collisions.
And now we have to somehow deal with these collisions.
And there are two principal approaches, chaining and then the open-addressing style approaches that I'll get to next, where we use our single hash table without any, or without many, complex auxiliary structures.
One way is using chaining.
So we basically say we're hashing into our bucket.
Our bucket might even be able to keep multiple values or addresses of the actual value, but at a certain point,
it's kind of full. And then we're just chaining. We're adding more buckets where we add more
values into the same bucket. And this means we get these long chains of values. Say, for example, our hash table is actually much smaller than the number of values that we hash into it, like in our example earlier with the dates, 256 buckets originally and 18,000 values to map in there. Then for sure we need some extra space where we put these values, and that would be a list of additional values, or a list of buckets. If these lists get long, we'll basically pay by comparing: then all of a sudden we're not in constant access time anymore, but asymptotically in the number of keys that we add to our table. So we will have to go through this linearly in the end, especially if there's no additional structure.
Basically, if there is no sorting, nothing,
then we'll actually have to search through this
in linear time.
So that's fine if we have few collisions.
It's bad if we have many collisions.
So as soon as these things get long,
it's going to be expensive.
Another way is called open addressing.
So there the idea is rather than basically doing some extra structure in the hash table,
we're just going to use our hash table.
So ideally we still have space in our hash table.
So typically we would try to have our hash table larger than the
number of keys that we're actually using. And then the first, most simple approach is
we're doing linear probing, meaning we have our, we are hashing our key into the position
in our hash table. So consider your hash table like an array, right? So we're hashing the value.
The hash value is just the address in the array.
If there's already a value there,
well, let's just go to the next position.
If there's a value there, let's go to the next position.
So we're just going to continue until we find empty space
and we're going to put our value in there
or the address of the original value there.
This is good because it preserves locality: values that are hashed into a certain region will be close by, so we can just search them linearly in the table. If we're trying to search for something in the hash table, we're going to look at the hash value position.
If this is full but not our value,
we'll have to go to the next one.
If it's still not, we have to go to the next one
until we basically find an empty slot
or we find the value that we're looking for.
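A minimal sketch of such an open-addressing table with linear probing (assuming the table never fills up completely, so the probe loops terminate; deletions are left out):

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct LinearProbingTable {
        struct Slot { uint64_t key = 0; uint64_t value = 0; bool used = false; };
        std::vector<Slot> slots;
        explicit LinearProbingTable(size_t capacity) : slots(capacity) {}

        void insert(uint64_t key, uint64_t value, uint64_t hash) {
            size_t i = hash % slots.size();
            while (slots[i].used) i = (i + 1) % slots.size();   // walk to the next free slot
            slots[i] = {key, value, true};
        }
        std::optional<uint64_t> find(uint64_t key, uint64_t hash) const {
            size_t i = hash % slots.size();
            while (slots[i].used) {                             // an empty slot ends the search
                if (slots[i].key == key) return slots[i].value;
                i = (i + 1) % slots.size();
            }
            return std::nullopt;
        }
    };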
And well, it can, like in our example,
again, where we have a lot of collisions, for example,
this will create a cluster.
So basically, everything will be in the same kind of space.
And that means we'll, again, get to something
like a linear search through our hash table in many cases.
So while inserting and also while searching.
Because in these clusters, we basically have to,
like in both steps, while inserting and while searching,
we'll have to go through all of the values in there.
So that's why people try different approaches there
or come up with different approaches like quadratic probing.
Rather than going to the next slot,
we can also try to spread out across the table.
Basically, if the next slot is not free, we're going to go to the fourth slot, so 2 squared. Then if that's not free, 3 squared, 9. If that's not free, 4 squared, 16.
So we're going to jump ever further.
And hopefully, with this, we're not
going to create clusters anymore.
So because the values will be spread out more across.
If we have collisions, we're going to spread out
more across the hash table. So why might that not be great for our in-memory setup?
We will jump over the cache boundaries.
Exactly.
So the cache misses are a problem, right?
So we don't have locality. If we look up something and we're going like, I don't know,
to some power that already crosses the cache line boundaries,
then we'll have to read from memory again,
and that's gonna be costly.
So there's, again, this trade-off.
And for that, or as another idea, there's the so-called Robin Hood hashing,
where we just use this, again, this linear probing approach,
but, because Robin Hood is always helping the poor, we're being, what you would say, fair to all of the values. So as we're doing linear probing, if we find a value that's actually closer to its original position than our current value, then we're going to swap them. We're saying: okay, I'm already too far from my home position, so you will switch with me and have to go further from your actual hash position. In this way, we're creating probing sequences of similar length.
Because in plain linear probing, what could happen is this: I'm inserting something, there's already a bit of a cluster, and my home position is the first slot of the cluster, so I have to go through the whole cluster and add my value to the end, and then I'll have a long probing sequence whenever I look for this value. In Robin Hood hashing, I'm just going to switch with something that has a shorter probing sequence, so on average the probing lengths will even out.
Asymptotically, it doesn't make a difference, but your accesses will typically be closer to the average, because all of them will be more or less the same. In linear probing you have the same asymptotics, with some faster accesses with less probing and some with longer probing, but with Robin Hood you get a more robust behavior.
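A rough sketch of just this insert rule (struct and names are mine; again assuming the table never becomes completely full):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct RHSlot { uint64_t key = 0; size_t home = 0; bool used = false; };

    // Probe linearly; whenever the resident entry is closer to its home bucket
    // than the entry we are carrying, swap them and keep probing with the
    // displaced entry.
    void robin_hood_insert(std::vector<RHSlot>& t, uint64_t key, size_t home) {
        RHSlot carry{key, home, true};
        size_t i = home;
        size_t dist = 0;                                   // carried entry's distance from its home
        while (t[i].used) {
            size_t resident_dist = (i + t.size() - t[i].home) % t.size();
            if (resident_dist < dist) {                    // resident is better off: take its slot
                std::swap(carry, t[i]);
                dist = resident_dist;
            }
            i = (i + 1) % t.size();
            ++dist;
        }
        t[i] = carry;
    }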
Another approach, like more advanced approach
that I want to show you is Cuckoo hashing.
So here, what we do is we're using two hash tables and two hash functions. It's called cuckoo hashing because it basically works like a cuckoo: the bird that puts its egg into some other bird's nest, and then the cuckoo nestling kicks out the other chicks. That's kind of the idea. Similar to Robin Hood hashing, we're kicking out values.
So the idea is we're trying, like we're first trying to insert in the first table,
first hash table, and if it works, fine.
If it doesn't work, the new value will kick out the old value, and the old value will be rehashed into the second hash table. The second hash table has a different hash function, so hopefully we find some space in there.
If not, we're kicking out that value again, and we're going to rehash it back into the first table with the first hash function.
And hopefully then there might be space there, but of course there might also be some recursion in there.
If the recursion basically happens, usually at some point
it should stop.
For the lookup, we're going to do the same: we just check both positions until we find our value or an empty slot.
For the insertion, though, there might be a loop.
So we might actually come to this place
where we're kicking out a value.
That value is kicking out a value,
and it's going to go back to the original value.
Then we basically cannot insert our hash value anymore,
because it's just going to create loops,
going to continue forever.
And in that case, we have to basically rehash everything.
We need new hash functions.
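A compact sketch of this insert-and-evict loop (my code; the two toy hash functions mirror the example that follows, and a real implementation would pick a different bound before giving up and rehashing):

    #include <cstdint>
    #include <optional>
    #include <stdexcept>
    #include <utility>
    #include <vector>

    struct CuckooTable {
        std::vector<std::optional<uint64_t>> t1, t2;
        explicit CuckooTable(size_t m) : t1(m), t2(m) {}
        size_t h1(uint64_t k) const { return k % t1.size(); }
        size_t h2(uint64_t k) const { return (3 * k) % t2.size(); }

        void insert(uint64_t key) {
            for (size_t kicks = 0; kicks < 2 * t1.size(); ++kicks) {
                auto& s1 = t1[h1(key)];
                if (!s1) { s1 = key; return; }             // free slot in table 1
                std::swap(key, *s1);                       // evict resident, carry it to table 2
                auto& s2 = t2[h2(key)];
                if (!s2) { s2 = key; return; }             // free slot in table 2
                std::swap(key, *s2);                       // evict again, back to table 1 next round
            }
            throw std::runtime_error("cuckoo cycle: pick new hash functions and rehash");
        }
        bool contains(uint64_t key) const {
            return (t1[h1(key)] && *t1[h1(key)] == key) ||
                   (t2[h2(key)] && *t2[h2(key)] == key);
        }
    };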
So I'm going to give you an example, just
so you know how it works.
So we're going to insert some values with two hash functions, and not good hash functions: as I told you, we're just going to do modulo, and we're going to multiply with a number that's too small to actually get any avalanches, any real distribution, just to show you how it works.
So our first hash function is modulo eight.
Our hash tables are eight spaces,
so that's going to be fine.
And our second hash function,
we're multiplying the value by 3 and take modulo 8.
And you can already see there's collisions in there.
So we're going to have a loop quite quickly.
But we're going to start with inserting 1.
So 1 modulo 8 is going to be 1.
So this is going to fit into the first hash table here.
Nothing in there else.
So we're actually good.
Then, 4 modulo 8 is 4,
so we're in the fifth position in here, and we're good.
Then we have 8 modulo 8.
What's the result of that?
Quick check.
8 modulo 8?
Zero. Perfect, thank you very much.
So the 8 is going to be added in the first position, position 0.
And now we want to add 0.
0 modulo 8 is 0 again.
So here we have the first collision.
So this basically means we want to add this here. So what happens now?
Who can help me?
Exactly, we're rehashing the 8: 8 times 3 is 24, modulo 8 is still 0. Very good.
So we're going to kick out the 8.
And that's why I told you it's not really a good hash function,
because that's always going to happen in this case, right?
So we have a quite tight loop already there,
which will mean we'll need new hash functions quickly. But in this case it still works, because we don't have more collisions right now.
So then we want to insert the 9.
Where does 9 go?
9, modulo 8.
1.
What happens? We kick out the 1.
Where does the 1 go? 1 times 3 is 3, so position 3, the fourth slot in the second table.
So this gets kicked out here.
And this is our final hash table.
And so this is how basically
cuckoo hashing works.
Basically,
if we want to do a lookup now, say,
for example, we want to look up the 1,
then we would first hash with our first hash function.
We would go to position 1, find the 9 there, but not the 1.
So we hash with the second hash function.
We find the 1 where it is, and we basically have our lookup. If we would look
up, let's say, 16, then 16 modulo 8 would be 0, so we're looking at position 0, where the 0 is, and it doesn't match. We go to the second hash function, look again at position 0, and there's the 8, also no match.
So we've checked both positions, and we know it's not in there.
OK.
Yes?
If we add another 0, for example, what happens?
Another 0? If we add a 0 again, then we have two values with the same key. That means we would need something where we store multiple addresses for one key.
16 then.
If we add 16, then we have this ring, right? All of a sudden we're fighting over position zero in the first table and position zero in the second table, we're trying to recurse, and it doesn't work. We have to find new hash functions. So in this case it doesn't work and we have to fix it. Okay? Good. So with this, I would say let's do a break here and go into tree structures after a five-minute break. So, everybody fresh? I forgot something, an announcement
which I actually really want to tell you. So there's going to be a presentation in
our joint seminar with TU Darmstadt tomorrow by Anna Klimovic from ETH Zurich.
It's not going to fit 100% to our topic here, but Anna works on hardware and systems, so there will be angles to what we're doing here. Her current work is more on the ML side, but still systems- and hardware-focused.
So it's resource efficient machine learning with scalable input data processing.
Tomorrow 4.15, you have the link, the Zoom link,
in Moodle as well.
So I actually can highly recommend that.
So if you have time at that time, nothing else to do,
then feel free to take a look and listen in, because that should be quite valuable.
Let me see if, yes.
Okay, so with that, let's switch gears, go to the tree structure.
So we discussed caching in length, now we're going to talk about trees.
And as a quick recap, let's talk about the B-tree once more.
So reminder, as always, B-tree is the universal data structure
for this kind of search-based or for comparison-based search
in a disk-based database system.
And let's say disk-based or something where we're working with larger blocks,
with larger grain, more coarse-grained granularity.
And the nodes, it's a very wide data structure, so each node has many children,
unlike a binary tree where we only have two children.
And we're trying to keep each of the nodes not too empty,
so at least half full, except for the root.
Meaning each node holds at least d keys and at most 2d keys, if the node capacity is 2d; I somehow mixed this up on the slide. And a node with k keys has k plus 1 children, with each separator saying greater-or-equal on one side and less-than on the other, or the other way around.
So this basically looks something like this, right?
And the k would somehow fit the disk page space.
And this still makes sense in in-memory databases
because we can basically look at memory page sizes
and do the prefetching there.
So that still somehow works well, but not as well.
Because basically we could also try to keep this better structured in memory
so we don't have these random jumps to different positions in memory. The children in a B-tree can be located somewhere completely different, so from one level to the next we will jump far in memory.
And that means we're not going to have any cache locality
here.
So this basically, in a general layout,
this would probably mean for every step in the tree,
one basically page or one cache miss,
and then basically going to memory, which is not that bad.
But it's also not great.
We can do better in memory. But still, we have a very wide tree, so we don't have that many steps, unlike a binary tree, which gets very deep, where we would go through many steps, many iterations through the tree. Here, we only have these few.
But as I said, we can do better.
And for this, people came up with the cache-sensitive B plus tree.
And here, the main idea is basically we're not using as many pointers.
So in the B tree, basically every child node
gets a separate pointer in our node.
And this means basically every node,
or looking up another node, basically
means doing a jump in memory, so doing
this kind of additional cache lookup.
And so rather than doing that, what we're trying to do
is we're laying out our nodes, our children,
in memory close by so that if we're loading something,
we don't really need to do our pointer addressing.
We don't have to go through the pointers.
But we can just do a lookup directly,
because we know the memory layout.
And that can be done by storing all of the children
basically consecutively in memory.
So rather than having every node somewhere different in memory, we have them close by in memory.
And then a child access is basically just an array lookup in memory.
And then basically our tree or our lookups
will be faster because we directly know.
So we can basically have like a more cache-friendly lookup for the children.
But we need to allocate in advance: already when creating a node, we need to reserve all of the memory for its children, even if they're not necessarily full yet. Still, in the nodes themselves we will typically have between d and 2d keys in each node that is filled, so we're not going to have sparsely filled nodes.
But we might have space in memory that we're not using yet.
So that's kind of the trade-off that we're doing.
So that's the overall idea.
So rather than having the data being organized randomly in memory, which we would have in
the B-tree, we're trying to have larger consecutive blocks in memory. And I mean, we could still have addresses here or pointers,
but since they're located consecutively in memory,
we don't need them, so we can directly look up.
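As a rough illustration of that layout (a sketch with a made-up key count, not the exact node format from the CSB+ tree paper):

    #include <cstdint>

    constexpr int kMaxKeys = 14;                 // illustrative only

    // All children of a node live back-to-back in one "node group", so a single
    // pointer to the first child plus an index replaces per-child pointers.
    struct CSBNode {
        uint16_t num_keys;
        uint64_t keys[kMaxKeys];                 // sorted separator keys
        CSBNode* first_child;                    // start of the contiguous child group
        CSBNode* child(int i) { return first_child + i; }
    };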
There can even be a more efficient, lookup-only version: if we don't need to do many updates, or basically no updates at all, then we can build such a tree in a lookup-friendly, read-only version, which is the cache-sensitive search tree (CSS tree). Because we cannot add anything, we make sure that each node fits into a cache line, and since we know there are no updates, we can fill all of the nodes 100%.
And then we don't need any child pointers.
So we know everything is filled, basically.
Then rather than using child pointers,
we can just use offset computation.
So our complete tree is basically one large memory
region.
If we need to break it up into separate regions,
then we might need pointers there.
But using virtual memory, we can really just
use one large area where we put all of the data in and then just compute the offsets rather than looking up pointers.
And that helps us because we don't have to look anything up to find the next node; we can just compute its position.
So this will be faster for the lookup.
So we're going to save on this additional indirection, where we first have to read the address and then load something from that address; instead we can directly compute the next position. And then we're in this usual trade-off again:
like we can make larger nodes, then we have a lower tree height.
We can make smaller nodes.
We basically fit them better into the cache lines,
but then the trees might get a bit higher.
So there's kind of this usual trade-off.
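For the offset computation mentioned above, one common way to lay such a read-only tree out is level by level in a single array, so child positions can be computed instead of stored (a sketch, with an illustrative fanout):

    #include <cstddef>

    constexpr size_t kFanout = 16;               // illustrative fanout per node

    // With nodes stored level by level in one array, the children of node i
    // occupy positions i * kFanout + 1 ... i * kFanout + kFanout.
    size_t child_index(size_t node, size_t slot) {   // slot in [0, kFanout)
        return node * kFanout + slot + 1;
    }
    size_t parent_index(size_t node) {
        return (node - 1) / kFanout;
    }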
And a similar idea gives us the full CSB+ tree. There, we have all of the child nodes of one node together in one big area; this is what we call a node group. And this is also how we would structure the CSB+ tree in general.
So, of course, at a certain level, we might want to,
or, I mean, we always will have all of the children in one node,
but then we can also group them in larger areas.
And then we pre-allocate: even if the node group is not 100% filled yet, we still pre-allocate the space for the full group.
And then we get the same kind of search performance.
I mean, we might have some empty space here,
but we know exactly where to look everything up.
And we also get a good insert performance
if we don't need to do a lot of splitting, right?
So because we already have a lot of space pre-allocated.
But again, we might pre-allocate a bit much space if our tree is not very, very full.
So again, trade-offs between the two. The read-only version, this we can use if, I mean,
like using the full tree, this we can only use if we actually have no
or almost no updates, because otherwise we'll have to do
a lot of reorganization all the time.
Then we have something called the S-CSB, or segmented CSB+ tree. Rather than having all of the children of one node in one big area, we can also say: let's use multiple sub-segments, could be two or three, with one child pointer per segment. There we get better
update performance because our splitting is cheaper.
We don't have to split as much data. And then, however, we have to go through these pointers
if we want to search for something, right? So there's like additional lookups. So the search performance is somewhat worse in this case.
And if we look at the performance in general,
so if we have our cache sensitive search tree,
so the CSS, that has best search performance, right? Because we directly can just compute all of the locations.
We don't need any addresses.
We don't need to follow pointers until we actually have to read the values; we're just using offset computation.
The full CSB+ tree is also still good, because for each node we just have a single pointer and all of the children are in the same area. That works well, but has a bit less performance because we need this one indirection through the addressing; still better than the segmented CSB+ tree, where we may have to go through multiple addresses.
And then, of course, the B+ tree will be worse again, because there we're going to random positions in memory all the time, so the memory layout is not as good. The same ordering holds down the line: the S-CSB's layout will probably not be as good as the CSB's, and the CSB's not as good as the CSS's, which has a perfect memory layout in the end.
And for insertion, the costs of a B plus tree
should be similar to the S-CSB plus tree, where basically,
we're just doing local splits.
So we don't need to do a lot of reorganization all the time.
In the CSB plus tree, we will have
to work with these larger areas.
So copying data back and forth in these areas
will be more costly in there.
In the CSS tree there, we have to reorganize everything.
As soon as we insert a new value, the whole tree
will change essentially. So that's why this is actually much more costly.
And well, so basically CSB plus and SCSB are good if we have more insertions and are kind of good tree structures for cache-sensitive
access, and cache-sensitive layouts of a B plus tree in the end.
And CSS would be best in the read-only environment.
So if we have data that we're just going to read all the time, then this will give us
a quite nice access performance.
As one way to basically see how we can actually structure
or map something like a B-tree into a more cache-friendly
version where we can get better performance out of it.
So this is a few examples of how we deal with this.
And the idea here is just make sure we have a better memory layout
than we would have in the regular version,
where then we basically go to different locations.
We have to go through memory to look up different things in there all the time.
Okay, good.
Then, if my pointer lets me, yes.
Let's go to the adaptive radix tree, or first to tries in general.
Somehow this pointer is a bit tricky. So, what's a trie?
So it's a so-called K-ary search tree.
So it's again something like a B-tree: not a binary tree, but a tree that has multiple, or many, children per node.
The name originally comes from retrieval, so the original pronunciation would actually be "tree" rather than "try", but "try" stuck, because it's also easier to differentiate the two in speech. So we're talking about tries today, even though originally, from retrieval, it would be "tree".
This is, it's also called a digital tree or prefix tree.
Prefix tree because we're basically looking at prefixes
of the key, the search key,
rather than looking at the whole search key all the time,
as we would do in hashing, as we would do in a B-tree, etc.
We're just looking at sub-parts of the key.
And prefix, because initially people were thinking about text,
or alphabetical values, for example, looking up those,
you would start basically letter by letter.
So say, for example, we want to store and, ant, any, are, art, stuff like this. Then the prefix tree, or trie, would start with the A, our first letter. We only have words starting with A, so nothing to split there. On the second level, we look at the second letter; there we have n and r, so we split up into two sub-children, one for n and one for r. And from there on, we have d, t, y under the n, and e and t under the r.
So this is kind of this idea of the prefix.
And the k, so the number of different children, is basically the domain of the key character. If you're looking at the alphabet, that's 26 letters in the English alphabet, something like that. If we're looking at Unicode, it's 100,000-something different characters. So that would not really be smart if we have Unicode in there.
So in the end, our tree width,
so the width of a single node, really depends on the domain.
And the height of the tree only depends on the key length.
So this is basically the real difference to a regular tree, to the B-tree.
In the B-tree, the height depends on how many values do we enter.
In the trie, the height depends on how long are our keys,
or in how many subparts do we divide our key, basically.
And there is no rebalancing.
If the keys are all the same length,
they will all be on the same level of the tree.
So we don't have to do anything there.
We just basically need to organize the tree.
The only difference is that some parts of the tree will be more heavily filled, other
parts will be less filled if we're not using the whole domain of the key values.
We have a lexicographical order in the keys and the keys also are implicitly stored.
We don't really need to store the keys at all
because going from the root to the actual node
will read through the key.
So the tree inherently already contains the whole key, and we could reconstruct it from going through the path. Actually, we don't even need to reconstruct it, because we read through it on the way down and already have it there.
This is all good and nice, but say, for example,
as I said, Unicode, well, this won't work very well anymore
because in general, we might have on each level
or in each node 150,000 something
or 40-something thousand different values.
So the domain is way too large.
So with that, we need to do something else.
And for this, we have the so-called Radix tree.
And you will hear Radix again.
So this is an easy way of dealing with any kind of data
by just using the binary representation.
So basically, as soon as we have a binary representation
of our values, we can just split that up into different chunks.
So we're just going to use basically
the binary representation starting from the
most significant bits on and split our data up from there. And the cool thing about this
is we can also nicely configure the fanout. So the fanout being the number of children per node.
Because with the alphabet, we just have to deal with whatever is given to us. If we're using a binary representation, we can use powers of two: use one bit and we have two children, use two bits and we have four children, use three bits and we have eight children. So this way, we can very nicely say,
this is the fan-out, or this is the width on each level for each sub-node.
And then each inner node is basically just an array
of two to the power of number of bits pointers.
So that's an easy way of splitting this up basically.
So here, as I said, if we have one bit, we're going to have two children; if we have three bits, we're going to have eight children, for example.
The height, again, depends only on the length of the key: we divide the key length by the number of bits per level. So say, for example, we have k-bit keys and we're using s bits per level,
then we have k divided by s levels in our tree. So we can directly configure
how deep is our tree actually. And the number of bits that we're using essentially then is a
critical parameter for a radix tree in terms of height and space consumption. Because technically, in each node we need a slot for every possible bit combination on that level. Say we have three bits on each level: then each node stores eight pointers. Not all of them will point to something, but we at least need that amount of space.
If we use only two bits, we're not going to use as much data or as much space on each node,
but our tree will be much deeper, because we're only consuming two bits per tree level of the key.
So the key will be split up in many more separate positions.
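A small sketch of how such a per-level key chunk can be extracted (my helper; assuming 64-bit keys and taking chunks from the most significant bits first):

    #include <cstdint>

    // With k-bit keys and s bits per level, level 0 uses the s most significant
    // bits, level 1 the next s bits, and so on; the result indexes the array of
    // 2^s child pointers in that level's node.
    uint32_t radix_chunk(uint64_t key, unsigned level, unsigned s, unsigned k = 64) {
        unsigned shift = k - (level + 1) * s;        // bits below the wanted chunk
        return static_cast<uint32_t>((key >> shift) & ((1u << s) - 1));
    }
    // Example: radix_chunk(key, 0, 8) is the most significant byte of the key,
    // radix_chunk(key, 1, 8) the next byte, and so on.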
So basically, the tree height depends on the key length and the bits per level, as we already said.
A perfect binary search tree, for example, will only be better if we have a very small number of keys, because then the binary search tree is still very flat, while our radix tree always has the same height. If we have a larger number of keys, the height of the radix tree still just stays the same: we always go through the same levels.
And the more bits we're consuming per level, the less the tree height.
So there's a few trade-offs that we have to do, right?
One thing, or problems that we have to deal with,
one thing is kind of space consumption per node, right?
So if we have very wide nodes, then these very wide nodes
will consume a lot of space on each level.
And they might not be filled.
So think about the trie from earlier, where we had A and then N and R, I think. The A node, even though it just has one entry, will still consume all of the different slot locations, because we need a full node there, and it's going to be quite empty. We're not going to have completely empty nodes, but we will have very sparsely filled ones.
So having smaller S basically means
we're not going to waste as much space.
Having larger S means we're wasting more space,
but we're going to have better lookup performance
because the tree is going to be smaller.
So, this is something that we have to deal with.
So, an alternative to this, and this is where the ART comes in, is to have multiple node sizes. Rather than saying every node needs to be the same size, with separate slots for all of the bit combinations, we still need to be able to represent all locations, but we can structure this so that if a node only has one key, or few keys, we just mark it as such and use less space for it. This way, based on the filling degree, we can have different kinds of node sizes.
So that's the idea of the adaptive radix tree or the main idea.
The thing that basically really gives us an advantage.
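As a preview (a simplified sketch of the four inner-node layouts the ART paper proposes; we'll go through them properly next time):

    #include <cstdint>

    // Each inner node consumes one key byte, but its physical layout adapts to
    // the number of children it currently has.
    struct Node4   { uint8_t keys[4];    void* children[4];  };  // up to 4 children, linear search
    struct Node16  { uint8_t keys[16];   void* children[16]; };  // up to 16, keys comparable with SIMD
    struct Node48  { uint8_t index[256]; void* children[48]; };  // byte-indexed indirection into 48 slots
    struct Node256 { void* children[256]; };                     // full array, direct indexing by key byte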
So let's look how the adaptive radix tree works.
And well, actually we're already out of time, right?
So maybe let's look at this next time, rather than rushing through this too much, because
you will have to implement this.
So rather than rushing, I'm going to continue here next time, which is not really a problem because you'll still have enough time to implement it.
And I'll go through this in detail
with you in the next session.
So so far, basically remember it's also tree structure.
And it really just depends on the key.
So we're splitting up the keys.
And the one thing you can also already remember is this radix idea.
So this here, we're basically splitting up our binary representation into different subparts.
So it's hard to read.
But say, for example, we have an integer (maybe I should make this larger). We use its binary representation, and we're
just splitting this binary representation into subgroups
and using these subgroups or these subparts
as index on the different level.
And this is one of the configuration choices for the radix tree: the size of the subgroups.
And this, of course, needs to be fixed.
So we could somehow create different kinds of lengths for different kinds of levels,
but it's going to be a mess and a nightmare to produce.
So we want this to be easy to compute.
But this also determines the number
of children we will have.
So if we have 8 bits, for example, this means potentially 256 different values, so potentially 256 children.
OK?
Questions?
No questions.
Great.
Well, then see you tomorrow.
Tomorrow we're not going to continue with this,
but we're going to continue with the profiling session.
So that should also be fun.
And Lawrence will do the profiling session; if he's done very early, and I'm also going to be here, then I might jump in and present the rest of the ART, or we'll do that next week. With that, thank you very much, and see you tomorrow.