Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Concurrency and Synchronization
Episode Date: June 4, 2024...
Transcript
Okay, so let's get started.
It's a bit loud today, right?
Let me see.
This is maybe better.
Okay, so welcome to today's session on hardware-conscious data processing.
We're going to talk about concurrency and synchronization,
mostly synchronization, actually.
And I have a couple of
announcements today. Again, we'll have Götz Käfer here on Friday talking about
modifying an existing sort order with offset value codes. Should be interesting.
So I invite you to come here on Friday at 1 p.m.
Tomorrow we'll have the joint seminar with TU Darmstadt, and Mubang, a former PhD student of Darmstadt who is now at UC Berkeley, will present on a fast and accurate object placement oracle. I actually don't know exactly what that is,
but it should be interesting. Very important, we have an intermediate feedback survey
in Moodle, which is maybe a bit late, but still,
I mean, it would be great if you could fill this out and basically give us some feedback on what you like about the course, etc.
So just five quick questions and we'll try to incorporate this into the remaining parts of the lecture and the tasks etc.
So this is really to help improve the course.
Of course we also use evap at the end, but that's too late for you.
That's going to be good for the next people.
But with this, we can maybe also improve the course for you.
So this is why it's really good if you can give us feedback here.
We're still in the CPU package, still single CPU,
but already looking at NUMA kind of style where we also have multiple
CPUs. And this will then really be the topic tomorrow. So what if you have multiple CPUs,
not just a single CPU with multiple cores? What does this tell us? And today we'll look at, well, what if we have data structures
that need to be accessed from multiple cores, multiple CPUs?
So not only data structures, generally stuff,
cache lines that need to be accessed from multiple parts.
How do we deal with this?
And first, what does the CPU do about it? And then, what do we do about it? So, how can we make this efficient? One other thing, so, I mean, this is the remaining timeline. And while we've seen this a lot of times, two things or two notes about this. So, we're
here on the top, right? So, tomorrow will be NUMA. Next week, I'm not here. So this will be done,
storage will be done by Martin and Compute Express Link will be done by Marcel.
And then also important, we'll have the task three session and that will not be here,
but it will be upstairs and this should be the seminar room. But I always confuse the two.
But I think this one is basically on this side that looks to the terrace.
So basically there's something else going on here. So we'll be upstairs, and we'll announce this again in Moodle, if we haven't already announced it. Actually, we already did announce it.
So when you come in here
and there's something else going on
and you didn't remember we're somewhere else,
then remember we're gonna be upstairs.
Okay.
With this, yeah, quick outlook, what's going to happen today?
We're going to talk about cache coherency.
So basically, as soon as multiple cores, multiple threads,
access the same kind of memory regions,
what does the CPU do for us?
How does this work?
Then, and there we will see the underlying protocol and maybe already get a bit of an understanding
what the CPU needs to do and how this works, and then also, with this information, later on see how we can be efficient about this, and why some things, or some architectures, are more efficient than others in how we do locking and latching in data structures, and with data in general.
And for this, we'll look at different latch variants.
So how can you implement something like this?
So how can you implement synchronization in data structures? And then, basically for a linked list, which is kind of one of the basic, very simple data structures, we'll look into this in more detail.
So what are the different kind of modes
in terms of latching or locking in these data structures?
And I've used a lot of my colleagues' work for this.
There's also lots of interesting papers on this.
And this is also something that you will have to do in the last task.
So looking at especially the lock-free data structures part, this will be for the skip list.
You will, on the one hand, learn how the skip list works,
and on the other hand, learn how to implement a lock-free data structure, which I think is quite interesting and useful,
although you might not necessarily need a linked list or a skip list all the time.
But still, it's a good thing to know how this works.
Okay, so with this, let's go into cache coherency, right?
So, well, motivation, why do we need this?
Databases are often faced
with highly concurrent workloads, right?
So we basically, we have multiple queries
And while reading is not necessarily a big problem, right? So we can basically have multiple versions or multiple copies of the same data and read concurrently, not a big issue. But as soon as we update something, then we need to synchronize, right? So if we have multiple queries,
especially let's say in an OLTP setup, where many queries
will go to the same kind of data structures, do some updates, we need to do some kind of
synchronization.
And some of it the CPU does for us.
So to some degree, basically CPU makes sure that caches are coherent. So whatever is accessed in different cores,
it will be coherent across the different kinds of caches.
So it will be the same kind of versions.
And if there are some changes in different caches,
then basically the CPU takes care of that: it basically updates the caches, or you will have to reread from another core or from the memory.
If we have concurrent workloads, that's good for us, right?
Because that means we have some parallelism that we can work with.
That's basically also easier than doing all this operator deconstruction, etc. But the bad news is, as soon as we have lots of concurrency and we're constantly accessing the same kind of data items or the same kind of data structures, then we need to synchronize, and this will be costly, especially if it's really the same cache lines across many cores, across many CPUs.
Then all of a sudden, this will really hit us,
because this basically means our caches won't work anymore,
because we always have to go back through memory.
You have to communicate the changes
across the different cores and different CPUs.
That in general is not bad news, really.
The bad news is this will make it slower.
So as soon as we have lots of synchronization,
this usually makes stuff slower, unless we do something about it.
But still, I mean, synchronization is basically where we have communication, where we basically don't have parallelism anymore.
And in a database, there's two levels of synchronization in general.
This is whenever we have a table, our student table where we write your new marks for this course, for example, and
for other courses.
This is where we typically say we're locking the tables.
So I mean, we can lock the whole table, we can lock individual tuples, we can lock columns,
et cetera.
There's all kinds of granularities.
We have hierarchies of locking.
We might even need to lock the whole database
for large kind of updates.
This is not what we're gonna talk about today.
This is super important.
This is basically, as I said,
so we want to do updates in the database.
So we're adding new grades, we wanna make sure
that if we add, well, we're changing one thing and add a grade here or something like this,
or we exchange something, that we don't have problems in the updates.
So you know this ATM example, right?
Two changes at the same time: say the Studienreferat and me, we're entering your grades at the same time in the end. Then either you have two grades for the same course, or you have zero grades for the same course, et cetera.
This is where we have locking, and this
is where we need to have these transaction guarantees.
But on top of that, also, we need synchronization
on these database internal data structures.
And this is what we typically call a latch instead of a lock.
In this lecture, we'll call everything a lock,
just because also the general terminology is a lock for this
if we're basically implementing data structures.
But for databases or database systems, we often just in order to differentiate the two,
we say the lock is whatever we put on the table or the tuple.
The latch is where we were synchronizing inside a data
structure, say a single node in the B-tree or in a skip list,
something like this.
So this would be the latch.
And as I said, we'll focus on the latches, even
if we refer to them as locks. But you know, in the future
if you see latch, that means data structure access.
If you see lock, it can be both.
That's it.
So latches are the ones that basically tell you, yes, this is about data structures. Locks can be both.
OK.
Before we go there, before we go to the latches,
first we have to talk about cache coherence.
So how do caches work, and what does cache coherence actually mean?
How do caches keep in sync with each other?
So you remember, right, CPU has multiple cores.
Each of these cores has its own private caches.
So at least L1, depends on the architecture,
but at least the L1 cache, typically instruction
and data cache, this will be private
for every individual core.
So this very small cache, and it might even be L2 cache
as well, and then if we talk about multiple CPUs,
then we're even at L3 cache, which is private or separate across
different parts of the CPU or different CPUs, right? And the CPU manages this, so it has this
large shared memory pool, and some cache lines will end up in the caches, and you do something
with it, right? So as soon as you're changing something, this will go to the cache rather than directly to the memory, unless you're doing a non-temporal write, which will go all the way down through the caches into the memory, basically telling the CPU: well, this one I'm not going to need in the caches, I actually want this to go right back to the memory. So that's a different story.
But if you don't do that, if you're not
doing anything special, everything
will be done within the caches.
And that means if you have data structures
that you're using frequently, say we're talking about a tree,
then the root node of the tree, most likely,
if you always go to the same tree,
like many different queries work on the same data set,
on the same table, we're indexing this
with a single tree, then the root node, for example,
will be in every cache eventually.
So because we have many queries that go to a lookup
through the same kind of tree.
So we have cache lines that appear everywhere.
Now, if we have an update, or say we're growing our tree,
so the root gets split, for example,
all of a sudden this is modified,
and we have to make sure that all of the other caches somehow get updated.
So they don't work on the old tree
or we were deleting something,
we're changing something in there.
So this, and this in general is handled by the CPU.
So the CPU makes sure that we have,
like that cache lines are coherent across different cores.
And this is done using a cache coherency protocol.
In general, this only works on a single node, right?
And meaning that we basically, we have one server,
might have one CPU, two CPUs, four CPUs, eight CPUs, whatever.
But within this, we have cache coherency. If we go out of this, we don't have cache coherency. For some hardware and for CXL, and we'll learn about this in three sessions, there we also have cache coherency across multiple nodes. So these are new kinds of standards. And you will reuse or rehear parts of it, not everything. So we're in sync here. But you will remember that this is the kind of protocol that you need there. Okay. So the cache coherency protocol, what it does is ensure the consistency of the data in the caches. And there are two basic fundamental operations.
That's the load and the store, right? So we're basically loading data into a cache and storing the data out of the cache
back into the memory.
And within the cache, we can basically change things.
And you remember there's basically write back
and write through. Write back is basically doing updates within the cache; write through means we're doing updates all the way down, back into the memory. If we need to go down into the memory, that's slow, so that's why modern CPUs don't do that for every individual update in the cache. So this means our updates are in the cache.
So whenever something changes in different parts,
or we have items cached in multiple caches,
we have a cache line in multiple caches basically,
then we need to ensure that this cache line is
in sync with others.
That basically means if it's updated in one node, it needs to be invalidated in the others.
The others basically need to update their value
if they want to touch this again.
And this is done with two different techniques,
either snooping or directory-based coherence.
And snooping is basically communicating about it, so the caches communicate with each other, saying: oh, I need this value, is it in a cache somewhere? Or: I'm changing this value, if you have it. And by value I mean cache line. So if you have this cache line, then please invalidate yours.
So you cannot use your previous version.
I'm working on this.
And this is the snooping.
Snooping in the sense of, oh, we have something like a bus within the CPU
where everybody's talking to each other.
And if I hear somebody's changing something, so one cache basically says, this is changed, and everybody else listens and says, oh, this is something that I actually also have, so I'm going to invalidate this. That's one way. Another way is using a directory, meaning that we have some kind of directory structure where a centralized directory holds all the information about where the data is.
And typically there is some kind of combination
in between the two.
Because the snooping,
and I mean, we'll go through this in a bit more detail,
but the snooping means communication,
means if we have many cores,
that's gonna be lots of communication.
Directory is one central point
where we can just ask directly.
You don't have to chatter around.
If you have many cores and multiple CPUs,
all of a sudden we'll actually end up with multiple directories
where, again, we'll have to communicate.
And this is on different kinds of levels. Okay, so the cache coherence protocols track the sharing state of the data. So how are our cache lines basically shared across the different cores and the different CPUs?
And the snooping is basically one approach to implement this.
And every cache with a copy basically just listens in what is going on.
So it's basically, is this cache line shared somewhere else?
And this is done via some broadcast medium. So, within the CPU package, within a single CPU, all of the
cores are connected with an on-die network.
And this is different for different architectures.
It used to be like a ring for some time.
Then more modern CPUs have something like a mesh setup.
So they're connected with a mesh with some routing infrastructure as well.
And then you have, we'll talk about this next time, I guess, UPI, for example, or Infinity Fabric or Crossbus across different kind of CPUs
and talking to each other.
So this is kind of the broadcast medium.
And on that medium, these caches will communicate to each other.
And so the cache basically, I mean, we know the cache is basically
a small directory with some information,
with different kind of updates, strategies, etc.
And the cache controller listens on the medium,
so on our on-die fabric, basically, if there's something going on.
If it hears a message that says this cache line that we're also caching or I also have in my cache,
if this is shared somewhere else, if this is invalidated, if this is needed exclusively, etc.
And typically the most common approach is write invalidate protocol.
So basically, as soon as we're changing something,
so any kind of cache line can be in as many cores as we have, right?
So the cache lines can be read,
and all of the cores can happily read this cache line
and work with the data in the cache line.
But as soon as we're updating,
then only one cache should actually do these updates.
Everybody else basically,
because otherwise we'll have diverging data in the cache line,
so everybody else is basically excluded.
So we need an exclusive access
before we're writing something into the cache line.
And this means we're sending a message,
our cache is sending a message,
I'll have exclusive access on this cache line,
and all of the other caches listen in on the fabric,
say, oh, I also have this cache line,
I'm invalidating this cache line,
so I cannot use it anymore.
If I want to use it, either I'm going to ask this cache or I'll read it back from the memory. So if it's still somewhere else, I need to read it from that cache rather than from memory, because we said we still have the data in the cache rather than in the memory. If it's written back to the memory, I'll have to read it from the memory. Right?
And typically this means there's something like a finite state controller per core.
This means that each of the cache lines has a certain state and can then only change to another state.
We'll go through this in an example in a bit.
Okay.
So this is an example of such a topology.
So here, this is actually one CPU package.
And you can see here we have 24 cores, I guess.
Yeah, 24 cores within two NUMA nodes.
So we call this a chiplet,
meaning that we have basically within the single CPU,
we have two regions that's already NUMA style,
so we'll talk about this next time.
So this is basically, there's even,
it's not completely homogeneous structure
within the single CPU.
So even within the single CPU, there's basically some memory
regions that are closer to a core and some memory regions
that are further away from a core.
And you might have some substructures.
Or let's say, not different, you might have,
you have some substructures in the CPU
if the CPU becomes large. So this is, I guess, the Skylake architecture. In this we still have a ring topology. So we basically have a ring communication across the cores here, and then we have some interconnects between the two rings, which connect 12 cores each. In a more modern or larger CPU, so Sapphire Rapids or Emerald Rapids for example, this would be a mesh structure. So it's not two rings, but a mesh structure.
But again, multiple chiplets interconnected.
So up to four, I guess, right now.
And then within these, so we have like the cores.
And the cores have their caches and their caching agents.
So this is basically the cache for the individual core
and the cache has the caching agent
and the caching agent listens to this internal fabric
and also communicates on this internal fabric.
So we'll send messages.
And there's different snooping modes.
So no snoop basically means, well, we're not snooping, or early snoop means we're doing lots of communication. Yeah, I actually don't remember the exact details between the two, but both basically say we're sending snooping messages from the individual caches, so from the individual cache managers on the core level, and that basically means we have lots of communication here, right? So if we have this tree idea again,
so we have one tree that every query will need to access.
We have the tree, the queries are running on all of the cores, so it's cached on all of the cores. Then these kinds of snooping messages mean: okay, some cache basically needs to update, then all of the others will have to invalidate. Or one cache basically asks, well, who's sharing this, or who has access to this, and then all of the other caches will basically answer this question. So that's a lot of chatting, and this is why there are different versions, also the home snoop, which basically means here we have a home agent, like a memory controller on the chiplet, which then would do these kinds of snooping requests, or answer these snooping requests, rather than having it everywhere. And then, on the home agent, we might have this directory structure, which keeps track of the caching.
So who's caching what, and who has access to what.
And then again, we'll do these answers
in order to reduce the amount of chit chat that's
going on on this network.
So then, well, we can use this as one big region,
or one big CPU.
So we can basically ignore that there are multiple regions, and do the snooping, etc., and the communication just as if it was one large structure.
Or we can do this on a sub-numa clustering.
So we can basically say, well, this is organized individually,
and this is organized individually.
And these are basically also these different kind
of snooping modes.
These are BIOS settings.
So you can basically actually tell your CPU,
I want to use this directory, or I want to use the regular snooping mode.
And then this will be set up differently in the BIOS.
Okay, so now we said there is a protocol for how this works, right? So basically for each of the cores, for each of the cache lines, we have different states that the cache line can be in. And in the classic cache coherency protocol, there are four states, and this is why it's called the MESI protocol. So we have modified, exclusive, shared and invalid.
And the order is just so it spells nicely
or you can easily pronounce it.
It's not necessarily the order
that you would have the states in.
So the initial state for all cache lines at boot time
is invalid.
So all of the cache lines are basically unused
or are not usable right now in the way they are.
Then as soon as we're reading something or we have a cache line read into our cache,
initially if our core is the first one that has read it, it's in exclusive state, meaning nobody else has it.
And we can use it as we want.
As soon as somebody else also uses the same cache line,
then we're in shared state. So multiple caches have the same cache line in shared state.
As soon as we're changing it, meaning the cache line is different than what it was in the memory,
then it's in modified state.
And of course, if it's in modified state,
it cannot be in shared state somewhere else.
So then this kind of cache line needs to be invalidated everywhere else.
Otherwise, we might have diverging versions of the cache line. As I said, initially, all of the cache lines are marked invalid.
And we said these kind of changes are communicated across this interconnect, this on-die interconnect
here.
So if a core changes something, I mean, this is something the home agent cannot know by itself. So if I want to change this here, the home agent has no idea. So this I need to basically communicate on this kind of network. If I want to find out, is this shared, for example, so I'm reading something,
then somebody will have to answer this question.
So this could be the home agent.
In a snooping mode, all of the cores could basically answer
if this cache line is shared or not.
And because of that, if you have 24 cores, for example,
in the worst case, you get 24 answers that your cache line is actually shared.
And because of that, there's like an extension.
So this is what Intel, for example, uses the MESIF protocol.
So there's an extra state that's called the forward state.
And that's basically a designated responder.
If we have a shared cache line to just say this cache line is actually shared.
Because in the snoop mode, again, going here, say our core 10, for example, here has the cache line shared.
Now let's say almost all of the cores have the same cache line
in a shared state.
Now our core 20, or what is this?
Nine, core nine wants to read this cache line.
Now it asks, is this cache line shared?
And then potentially all of the other cores would answer.
And rather than having all of them answer,
we're going to designate the core 10 as the forwarder
and then this core will answer, yes, this is shared.
And so basically one of the sharing caches will just have this mark: I'm the forwarder cache for this cache line.
And all others are basically just shared.
And that's just to basically reduce the amount of chitchat again on the internal fabric.
Okay.
Good.
So let's look at this again very briefly,
going through the protocol in an example.
So we have our different CPUs in here.
This works across CPUs, works across cores.
So you can think of the CPUs here as cores.
You can think of them as CPUs.
So for now, since we're not really NUMA yet, let's say the CPUs are actually cores, and each of the cores has their own cache.
So initially, all of the cache lines are invalidated.
Then our CPU1 or Core1 starts reading a cache line, so our cache line A.
So then it has exclusive access to this cache line.
So this is basically the first state.
Nobody else has ever used it.
It's not shared, it's not modified, so it's in exclusive state.
Everything else is invalidated.
Now CPU2 also reads the same block or same cache line.
Then, we're in shared state.
Both of them have read the same cache line.
Now, both of them are in shared state.
Then, if CPU2 writes the block, then it's modified in CPU2.
It's invalid in CPU1.
So it needs to be invalidated.
If CPU1 wants to use it again, it has to read it again.
And if CPU3 now reads the block,
then it will read the block, if it's in modified state,
it will read the block from the cache of CPU2,
and both of them are in shared state again.
And then if CPU2 writes the block again,
it will be modified in CPU2.
It will be invalidated in CPU3.
So if CPU3 wants to access this again,
it needs to read it again from the other cache, or, once it's written back to the memory, it needs to read it from memory.
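To make this walkthrough a bit more concrete, here is a minimal, purely illustrative C++ sketch of the MESI transitions as seen by one cache. This is not how the hardware is implemented; the event names and the others_have_copy flag are simplifications made up for illustration.

```cpp
#include <cstdio>

// MESI states of one cache line in one core's cache (illustrative only).
enum class State { Modified, Exclusive, Shared, Invalid };

// Events as seen by one cache: the local core accesses the line, or the
// coherence fabric delivers a snooped remote access to the same line.
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

State transition(State s, Event e, bool others_have_copy) {
    switch (e) {
        case Event::LocalRead:
            // A read miss brings the line in as Exclusive (nobody else has it)
            // or Shared (somebody else already caches it); a hit keeps the state.
            if (s == State::Invalid)
                return others_have_copy ? State::Shared : State::Exclusive;
            return s;
        case Event::LocalWrite:
            // Any local write ends in Modified; the other copies get
            // invalidated via the interconnect (not modeled here).
            return State::Modified;
        case Event::RemoteRead:
            // A remote reader pulls the line into Shared; a Modified copy is
            // supplied to the requester and becomes Shared as well.
            return (s == State::Invalid) ? State::Invalid : State::Shared;
        case Event::RemoteWrite:
            // A remote writer needs exclusive ownership: our copy is invalidated.
            return State::Invalid;
    }
    return s;
}

int main() {
    // Replaying the example: CPU1 reads A (Exclusive), CPU2 reads A (both Shared),
    // CPU2 writes A (Modified there, Invalid in CPU1).
    State cpu1 = State::Invalid, cpu2 = State::Invalid;
    cpu1 = transition(cpu1, Event::LocalRead, /*others_have_copy=*/false);  // CPU1: Exclusive
    cpu2 = transition(cpu2, Event::LocalRead, /*others_have_copy=*/true);   // CPU2: Shared
    cpu1 = transition(cpu1, Event::RemoteRead, true);                       // CPU1: Shared
    cpu2 = transition(cpu2, Event::LocalWrite, true);                       // CPU2: Modified
    cpu1 = transition(cpu1, Event::RemoteWrite, true);                      // CPU1: Invalid
    std::printf("cpu1=%d cpu2=%d (0=M, 1=E, 2=S, 3=I)\n",
                static_cast<int>(cpu1), static_cast<int>(cpu2));
}
```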
Okay.
Okay.
Good.
So, why is this important?
Because basically, as soon as we're accessing the same cache lines
across different CPUs, across different cores,
it means we need to communicate.
So we basically, and this needs a couple of cycles, right?
This basically is these cache invalidations.
And you remember, I don't remember, but it's basically almost like a memory request.
So it's a bit, it's not that many cycles, but it's basically multiple cycles going across the different kind of cores through the caches.
How many cycles would that be?
So it's probably level three cache number of cycles.
Yes?
How does CPU3 in the example know
whether it has to read the cache line from memory
or from the cache of CPU2?
So CPU3, if it wants to read this, its caching agent will basically ask; it will snoop on the bus. So either it asks on the bus, is this in a cache? Or it basically remembers, by seeing, oh, this is invalid, that this needs to be read.
But in this case, it would actually probably ask on the bus,
is this somewhere?
And if it's somewhere, it will get it from the other CPU.
And this happens very frequently, right? So whenever we're basically doing some accesses in the caches. This is also why we don't want to have too much communication in there; this needs to be fairly fast. But if all of the CPUs all of the time basically answer to everything, then the bus might actually get overloaded. It can actually get overloaded. This is also why, if you're not smart, or if you're not NUMA-aware across multiple nodes, this can basically bottleneck your memory access or your caching behavior.
Yes?
And will the data of the modified cache line be transferred over the communication bus, or over a separate one?
It's the same.
So the question online is basically,
where is the data communicated?
And this is basically the same infrastructure in here.
So this has different kind of messages in here
that we can send.
This will also be used to communicate the data
from the memory across the network.
So there's not many different kind of networks on the CPU.
But this is basically for cache invalidations
for memory transfers.
Yes?
If CPU2, in the example, would decide to write the data to memory, the cache line to memory,
what happens to the state of the cache line for all other CPUs in that case?
Will it still be invalid or?
So in the case where, I mean, as soon as it, the question is,
what if we have a write through by CPU2,
say, in this last version, right?
So as soon as we're changing this cache line
and it's written through the memory,
it's basically changed.
So it needs to be invalidated everywhere else, right?
So it's basically as soon as there is a modification.
Otherwise, I mean, we would have like older versions
in the other caches.
And so it needs to be re-read back into the cache somehow.
I'm not sure if we could directly, when we actually write it back,
if we could directly update it in the other caches.
It could be possible, but I'm not sure if there's a mechanic for this.
You probably also don't know, right?
Okay. Okay.
Further questions on this?
No? Okay, then let's start with the locking.
So, basically, while this on the cache-line level is kind of nice and fine, the CPU ensures that the cache lines are coherent across different cores.
We still want to make sure that another cache or another core
cannot change what we're working on right now.
So we basically want to ensure that the core that wants to do the changes on a certain part of the data structure has exclusive access to it, so that no other cores interfere. Because, I mean, just using the cache coherence protocol, whoever is first can basically change our cache line, and if we have multiple cache lines to change at a time, this will basically mess up our data structure anyway.
So this is kind of clear.
So as soon as we're not on the single cache line level
anymore, then this kind of coherency
doesn't help or doesn't save us from breaking our data
structures.
Because if our data structure consists of multiple cache lines,
if we're doing concurrent updates, everything kind of gets messy. So we want some way of basically
telling other cores, don't do anything here right now. So I need exclusive access to this. And this can be implemented on different kind of levels.
Right?
And so one way to do this is atomics. So in x86, there's a lock prefix for different kinds of
intrinsics or different kind of operations that basically tells the CPU
or the other cores, basically in the instruction,
it basically says nobody else should read or write this value
until I'm done with it.
This basically means nobody else can touch this value.
It cannot go to another cache.
Nobody else can get an exclusive
access or modify it while it's still in my cache. And this is not the default case because it's
slow. This basically means any kind of other access will be stopped and we'll have to wait
for this. And there's different kind of operations that have this. So some of them have it basically always,
and some of them you explicitly have to tell them,
oh, we need like a lock here.
And compare and swap with a lock, for example,
means, oh, we do a compare and swap atomically with a lock.
So it's going to be done, or I want to completely do this operation, before anybody else can touch this again.
So a compare and swap is basically: I'm comparing the value to another value, or I have my cache line or my value,
I compare it to something and based on the comparison, I either swap it or
I keep it. Then I have an exchange operation. So this automatically locks this operation there.
I don't need to say lock again. So lock itself is not an operation. It's just a prefix. And
the exchange basically just says, okay, exchange two values, right?
So we're just basically swapping.
Then we have a read, modify, write operation.
So something like an add instruction. So if we want a counter, for example,
we can lock this, we can have an atomic counter
that basically says, oh, I'm incrementing this atomically.
So I'm going there and I'm making sure
that this is incremented atomically.
And if, I mean, one thing that you need to ensure,
if the compiler or you emits non-temporal stores, you also
need some kind of fencing around your code
so it's not reordered.
So the non-temporal stores are the ones
that go right through.
So we're writing all the way down to the memory.
And in general, if you have this,
the CPU or the compiler
might think, well, this is actually slow, right?
So writing through all the way to the memory,
that takes some time.
So let's just reorder this, right?
So I can do something else in between.
And that might break your code if you
need some kind of locking around this, right?
So you want to lock something.
You want to write through your value.
All of a sudden, the CPU says,
well, this takes forever.
Let me do something else in between.
I don't want to wait for this.
So you can do some fencing and say,
within this code, please don't do any reordering here.
So this portion of the code should be executed
in the way that I wrote it
rather than having
any kind of reordering.
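As a small illustration of these operations: in C++ you normally don't write the lock prefix yourself; the compiler emits it (for example lock xadd or lock cmpxchg on x86) when you use std::atomic. A minimal sketch, assuming just a counter and a single slot value:

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> counter{0};  // read-modify-write target (lock xadd on x86)
std::atomic<int> slot{0};     // target for compare-and-swap / exchange

int main() {
    // Atomic increment: the whole read-modify-write happens before anybody
    // else can touch the value.
    counter.fetch_add(1);

    // Compare-and-swap: only install 42 if the slot still contains 0.
    int expected = 0;
    bool swapped = slot.compare_exchange_strong(expected, 42);

    // Exchange: unconditionally swap in a new value and get the old one back.
    int old = slot.exchange(7);

    // A fence keeps the compiler/CPU from reordering the surrounding stores,
    // which matters e.g. when non-temporal stores are mixed in.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    std::printf("swapped=%d old=%d counter=%d\n", swapped, old, counter.load());
}
```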
Okay, so these are just basically atomic operations that you can use to build locks. So I mean, if you only need to lock a single value, that's enough, right? So if your database is just a single value and you need many threads to update it, whatever, this is enough.
Then you can just work with these atomics.
It will be slow, though, because this means you will always have,
like, I mean, you're not going to get perfect speedup with many threads
because this is always cache invalidating, right?
So as soon as somebody else tries to access this,
like in somebody else being another core,
it needs to be invalidated in one cache,
needs to go all the way to the other cache.
So this will always take time.
Also just basically reading this
goes always through this network.
So that's kind of slowish,
so you don't want to just constantly work on the single value here.
And in general, you want to do more than that, right?
You want to update multiple values.
Say you will have your data structure, very simple, a linked list, right?
So your linked list doesn't just contain a single value; it contains at least two values: it contains the value that you're storing plus the pointer to the next list item. That's the least that you need. And so this means, if you want to change something in your linked list, you need to be able to work on both of these things at the same time, if you add a node, for example.
Not only that, you need to make sure that the whole consistency of the data structure is still intact,
so you want to lock more of this, actually.
You want to have a larger part.
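As a tiny illustration of why a single atomic is not enough here, a minimal linked-list node sketch (the names are made up for illustration): an insert already touches two things, the new node's next pointer and the predecessor's next pointer.

```cpp
#include <cstdint>

// Minimal singly linked list node: the stored value plus the pointer to the
// next node -- the least you need.
struct Node {
    uint64_t value;
    Node*    next;
};

// Inserting after `pred` touches two things: the new node's next pointer and
// the predecessor's next pointer. Without synchronization, a concurrent
// insert or delete at the same position can break the list.
inline void insert_after(Node* pred, Node* node) {
    node->next = pred->next;
    pred->next = node;
}
```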
And now there's two general, or there's
different general ways of locking such a data structure.
And one way is doing it pessimistic,
meaning we're always exclusively locking.
So whatever we're doing, we're exclusively
locking, making sure that nobody else can touch this.
And there's different kind of granularities as well.
But in general, if I'm reading, I'm locking.
If I'm writing, I'm locking.
I can have shared reader-writer locks, et cetera,
but like very pessimistic ways, just, okay,
if I do something, I'm blocking, just to make sure nothing happens.
That's safe, but often it's not needed.
So if we're just reading all of the threads,
like our data structure is all read, read, read,
then just doing pessimistic locking basically
is just an unnecessary overhead.
If nothing ever changes, well, we wouldn't have actually had to lock. And so for this,
there's optimistic locking. And there the idea is that we're reading, and then we're checking: did something actually change? If nothing changed, it's fine. So we can basically just continue with our lives without ever having taken any lock. So we basically just need to read the value again, and here the caches help us again. So if I have my nodes that I'm reading, I have them in the cache, I can check: nothing changed.
If the cache is still valid, right,
so nothing has changed, basically, I know all is fine,
I can continue.
If my cache line is invalidated,
if I have to basically go somewhere else, reread, whatever,
then, well, we were too optimistic.
So we basically have to maybe start again or do something
else.
And then we have the so-called lock-free,
which is basically rather than doing some kind of locking,
we're basically never blocking.
And so here, we're using the atomics for synchronization.
And this basically means we're more or less just checking certain values.
Is this good?
So can I access this now?
So I'm setting markers, basically.
There's a marker.
Is this free to access?
If yes, then I'm basically updating it, and I'm updating the marker atomically. And if somebody else wants to read it, it just tries to read it; well, it doesn't work, right? It's marked as in progress or updated, so I have to basically start again, or continue, or wait for something to happen. But there's no real lock. So the threads can just continue reading these things.
And this might be good for simple data structures,
might be inefficient because again, right?
If we're basically reading the same thing
and we're not holding locks on anything,
this might mean that we're constantly invalidating our caches.
So basically, we constantly have to check the different kind of caches
if we're using these kind of markers.
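As a rough sketch of the lock-free idea (this is the classic compare-and-swap retry loop for pushing at the head of a linked list, not the skip list from the task; the names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    uint64_t value;
    Node*    next;
};

std::atomic<Node*> head{nullptr};

// Lock-free push: no thread ever blocks. We point our node at the current
// head and try to swing the head pointer with a compare-and-swap. If another
// thread changed the head in the meantime, the CAS fails, old_head is
// refreshed, and we simply retry.
void push(Node* node) {
    Node* old_head = head.load();
    do {
        node->next = old_head;
    } while (!head.compare_exchange_weak(old_head, node));
}
```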
But we'll go through these separately.
So we'll look at the different steps or different ways of doing this
and also combinations of them, separately, after a quick break. Questions so far?
But as I said, I mean, it doesn't make that much sense to ask in detail because I'm going to go
through this in more detail after the break. So first, let's look at optimistic locking.
So what we do is basically we're validating that the data,
so we're reading the data,
we're just going through the data.
Then after we've read the data,
we're validating that the data in the critical section
has not changed.
That's basically, we're just making sure
whatever we've done so far is still accurate
or is still correct.
And then we can continue, right? So this is, I mean, in a linked list you can imagine it's fairly fast, right? So I have basically some kind of version counter or something like that, or I'm just looking at the value itself. I'm reading the value, doing whatever I want to do with the value, and I'm checking again: is the value still fine? Okay, good, let's go on. And this is good for frequently read data, because we don't have to do these atomic writes. The atomic writes are basically just for these locks, right? So we have something like a semaphore or whatever that blocks everything, and not doing this is basically much cheaper, because we also have everything in our cache, right? So only if somebody else writes this will our cache be invalidated. And we don't necessarily see the cache invalidation in our code, right? But if nothing has changed, everything will be in the cache, so everything will be fast. All these changes, or rather non-changes, we'll see fairly quickly, because everything's local.
If something has changed, well then we have to restart. We're basically starting again,
reading the value, doing whatever you want to do with the value and checking again that the value is still the same before continuing.
So, and there's a little problem in here because this part, this will be actually slower, right?
So as soon as something has changed, then our cache will be invalidated.
We have to read the new value.
And if something, if we're at a point in our data structure that is constantly changed,
then this means we could do this forever, right?
So our thread that just wants to read this,
if there's like lots of changes on this part,
then we're never gonna be done with this.
So especially if the reading takes some time, so a more complex data structure, et cetera, we're never going to be done with this.
And of course, this is only good if there's no side effects, right? So everything we can basically restart.
So whatever we do in our read operation,
it cannot affect anything else yet,
because everything we did there is basically invalidated already.
Meaning, we have some, I don't know, we want to build a hash table, for example.
We're inserting this data in our hash table, then we have to be able to revert this change if the value has changed. And as I said, if it's too expensive,
then to invalidate this or to restart this,
this might be not good.
And if we have too much contention,
we have to restart too frequently,
then we might starve.
So we also don't want that.
So we might fall back to pessimistic locking
at a certain point.
So this is how you would actually implement this, in one way or the other.
So we have some kind of read callback.
So this is what we do if our read is good,
or what we do in our read.
So for a certain number of attempts
that we are happy to wait for,
we're basically trying, we're getting the current version.
If it's locked, well, then we have to continue, we have to start again.
If not, then we're actually reading the value, we're doing whatever we want to do in our read, and we're checking the version again. If the version is the same, we're good. Otherwise, we basically have to start again. So we basically have to try again, we're reading.
We don't do anything except for reading the value, so this is optimistic. If it is locked forever,
for example, we're not going to get anywhere. If it's changed all the time, we're not going to get anywhere. So after a certain amount of attempts,
we'll fall back to
pessimistic locking. So meaning,
we're going to basically put our foot down and say, well, whenever the lock gets free,
so everybody else who's waiting for the lock has got their lock.
We're going to lock it as well.
And then we're just going to read as well
in an exclusive mode rather than in an optimistic mode.
So this is cool because this, of course,
multiple threads can do at the same time, right?
So this part, like this is all fine.
Everybody can do at the same time.
Everybody can read.
Only if it's locked, we're basically getting,
we're going to have to reiterate and redo,
or if something's changed.
Again, if we wait too much or have to do too many attempts,
then we're going to start exclusively locking in order not to starve.
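A minimal sketch of this pattern, assuming a version word whose lowest bit marks "locked" (the names OptLock, read, max_attempts are made up for illustration):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical optimistic lock: a version counter whose lowest bit means "locked".
struct OptLock {
    std::atomic<uint64_t> version{0};

    static bool is_locked(uint64_t v) { return v & 1; }

    // Optimistic read: run read_fn, then check that the version is unchanged
    // and was never locked. Fall back to exclusive locking after too many attempts.
    template <typename Fn>
    void read(Fn&& read_fn, int max_attempts = 10) {
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            uint64_t v_before = version.load();
            if (is_locked(v_before)) continue;       // a writer is active, try again
            read_fn();                                // read without taking any lock
            if (version.load() == v_before) return;   // nothing changed: we're done
        }
        lock();        // pessimistic fallback so we don't starve
        read_fn();
        unlock();
    }

    void lock() {
        uint64_t v = version.load();
        // Spin until we manage to set the lock bit on an unlocked version.
        while (is_locked(v) || !version.compare_exchange_weak(v, v + 1)) {
            v = version.load();
        }
    }

    void unlock() { version.fetch_add(1); }  // clears the lock bit, bumps the version
};
```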
Okay. So now the question is, I mean, this upper part is very simple, right? So we just
basically read the versions. Now we have to think about how to implement the lock down there.
And so this is basically also telling us, is this locked?
Or now we want to lock, right?
And there's a different kind of, or two major strategies,
how to implement pessimistic locking.
And for each of those, then there's many ways
how they can be implemented internally.
And there's basically spinning locks in user space.
And there's blocking OS service locks.
And a spinning lock in user space, or spin lock,
is basically some kind of data item
that a thread just frequently pulls and just checks,
is it free?
Is it free, basically basically so this is why
it's called a spin lock so the the thread basically continuously checks this is free
this is free it's a busy simple to implement and it's fairly
cost efficient. So if it's implemented well, all we need to basically have is the amount
of cash invalidations to do this is fairly low? We're just basically, and we'll go through this,
how this can be done is basically we're just checking
our cache all the time and once it's free,
well, the cache, like once our lock is updated,
the cache needs to be invalidated.
And if we're basically changing the lock to locked again,
the cache needs to be invalidated elsewhere.
So that basically, that would be a fairly efficient way
of implementing this. A blocking OS-serviced lock is something like a mutex, or a futex in user space. So there's, like, a mutex in the standard template library.
So there you get an actual thread context switch. So here, basically, our thread goes to sleep if it's locked, and then it will be woken up as soon as the lock is free again.
So this is, of course, much more costly than just a cache invalidation. So we've said that, right? I mean, it's still much cheaper than changing a process, but still, we have to do some updates in the OS, or the OS does some maintenance, and that's more costly than just keeping the threads running. But we're burning cycles, right? So, I mean, this basically, if we
have locks that are very long running, so I do whatever, right? Like very large changes in my
data structures, and I can do this with multiple threads, et cetera,
then it might actually make sense to put the threads that are just blocked for a long time to sleep, rather than have them burn cycles in a waiting fashion, right?
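For comparison, this is roughly what the blocking OS-backed variant looks like with the standard library: a thread that finds the mutex taken is eventually put to sleep and woken up by the OS (on Linux this is built on a futex) instead of burning cycles.

```cpp
#include <mutex>

std::mutex m;               // blocking OS-backed lock (futex-based on Linux)
long protected_value = 0;

void update() {
    // lock_guard acquires the mutex and releases it when leaving the scope.
    // A thread that finds the mutex taken is blocked (put to sleep) by the OS
    // and woken up once the holder unlocks -- a context switch, not busy polling.
    std::lock_guard<std::mutex> guard(m);
    ++protected_value;
}
```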
So there's always trade-offs.
That's, I mean, that's the general rule here, right?
So there's always trade-offs, and you have to think about what makes sense when.
But in the case where we have small changes in the data structure, a spinning lock might actually make sense.
And there's different variations, as we said, that we can use.
And one is a test-and-set spin lock. So here, basically, we're setting and we're returning the old value.
So essentially this could be just a bit
that tells us is this locked, right?
So then what I'm doing, I'm doing the test and set.
I'm saying, well, set it to lock
and tell me what it was before.
If it was locked before, then, well,
I didn't change anything, right?
So it was locked before, so I'm blocked, right?
So I basically have to wait.
If it was not locked, then I actually put the lock there,
so I'm the person who holds the lock right now.
Right, so essentially, I'm doing an update
and I get back the old value. So I'm definitely going to set it
to locked but the question was what was it before?
Was it unlocked? If it was unlocked before
then now I have the lock. If it was locked before, I don't have the lock. So then whoever had the lock before still keeps the lock.
Of course, this is something you can easily do wrong.
So if you're doing a bug here, you might still
update something else.
So you have to ensure that whatever you implement there
actually is like you know what the semantics are.
But if you're basically doing this, this works, right?
So if you're keeping this kind of semantics,
say I'm setting a bit to one if it's locked, and if I'm reading the bit and it was set to one, then I don't hold the lock, and I know then that it's basically fine. So this is good: it's efficient, it's small, right? But it's busy polling, and the problem is it doesn't scale, because we have contention on a single cache line, right? So this basically means everybody who wants this lock will work on the exact same cache line. So we basically always have to go to the same cache line; different threads,
different cores will need the same cache line for every individual access. Whenever I'm polling
this lock, whenever I'm checking this lock, this means it will be invalidated everywhere else, because I'm updating the lock, even though I'm just writing the old value back. So this is basically, yeah, it's slow because we have this contention on the single cache line.
It's not really a problem for the thread that's working, that holds the lock,
because the thread that holds the lock doesn't have to do anything with the lock while it holds
the lock. But everybody else will be fairly slow in updating this. And the more threads
are on the same cache line, the more they will just have to wait to be able to update this,
because everybody else is basically always invalidated.
We have to read it back.
So the problem is basically that we have this contention. But this works, right?
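A minimal test-and-set spin lock sketch; std::atomic_flag::test_and_set is exactly this "set it and tell me what it was before" operation:

```cpp
#include <atomic>

// Test-and-set spin lock: a single flag, and everybody spins on the same cache line.
struct TasLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock() {
        // Set the flag and get the old value back. If it was already set,
        // somebody else holds the lock and we keep spinning (busy polling).
        while (flag.test_and_set(std::memory_order_acquire)) {
            // Every iteration is a write, which invalidates the cache line in
            // all the other cores that are spinning on it as well.
        }
    }

    void unlock() { flag.clear(std::memory_order_release); }
};
```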
In order to make this more efficient, we can use a queue-based spin lock. And that's actually quite a neat idea.
So the idea here is that rather than having a single lock, a single bit, for example, that every thread polls on and does this busy wait on, we'll create a separate lock for every thread. And we'll put them in a queue. So what we'll do is basically: the first thread that comes along and needs a lock puts down a base latch, which is like the head of a queue. And the next thread that wants to access this will basically just append a new queue item and then poll on this new queue item. Right, so basically the first CPU or first core
in the queue holds the lock
and will only release this lock once it's basically done.
The next CPU basically has like a new queue item
and it will always only poll on, or test, this one.
So in here, we basically, what we need is here,
we need this bit.
Okay, this is locked.
And we need a next pointer that tells us
where is the next item.
So then we'll basically have CPU 2 just poll on this one. We'll check, is this still locked? If it's still locked, I'm just going to continue here. If the next CPU comes, it will just create a new queue item which that CPU will spin on. As soon as this CPU releases the lock, it will basically release the lock in here and release the lock in the next node, because then that node knows: oh, now the lock is free, right? So now I can basically continue working, or I can set this lock and can work on this.
And only this CPU 2 will ever poll on this lock. So the only time that somebody other than this CPU will touch this latch or this lock is when the previous CPU releases the lock.
The idea or the very big difference is that these parts, these latches, will be in the cache of the
individual cores. So if we have a single spin lock, then this will be always like in different caches, right?
So basically every access to this lock is an update, so every time it's invalidated, we need the cache invalidation, we need the communication across the network, et cetera, across the chip fabric.
Here, we only needed to set up the queue,
so in order to add a new queue item,
we need to basically check the previous one.
Oh, okay, this is basically the tail.
I mean, we might have to go through the whole queue,
find the tail, add our new latch,
and then we're just spinning on that latch
rather than spinning on each of the others.
So as soon as all of the other previous latches are gone, then I can continue.
Every new thread will just create a new latch and will spin on that.
And because there we don't have this cache coherency problem, it scales much better. I mean, we're still busy waiting. We're still continuously querying the lock,
but we're much faster here.
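A sketch of the queue idea, essentially an MCS-style lock (the struct and member names are illustrative): each thread appends its own node and spins only on that node, so the busy waiting stays in its own cache line.

```cpp
#include <atomic>

// Queue-based (MCS-style) spin lock sketch: each waiting thread appends its
// own node and spins only on that node, so the spinning stays local.
struct QueueLock {
    struct QueueNode {
        std::atomic<bool>       locked{false};
        std::atomic<QueueNode*> next{nullptr};
    };

    std::atomic<QueueNode*> tail{nullptr};

    void lock(QueueNode* me) {
        me->locked.store(true);
        me->next.store(nullptr);
        // Atomically append ourselves at the tail of the queue.
        QueueNode* prev = tail.exchange(me);
        if (prev != nullptr) {
            prev->next.store(me);
            // Spin on OUR OWN node only; the predecessor flips it on release.
            while (me->locked.load()) { /* spin locally */ }
        }
        // prev == nullptr: the queue was empty, we hold the lock right away.
    }

    void unlock(QueueNode* me) {
        QueueNode* succ = me->next.load();
        if (succ == nullptr) {
            // Nobody visibly queued behind us: try to reset the tail.
            QueueNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr)) return;
            // A successor is in the middle of linking itself in; wait for it.
            while ((succ = me->next.load()) == nullptr) { /* spin */ }
        }
        succ->locked.store(false);  // hand the lock to the next thread in the queue
    }
};
```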
And of course, rather than just having a single lock, we can also have reader and writer locks. So we can allow concurrent readers, and we can have exclusive locks just for writes, et cetera.
So we can use different kind of locks
for different kind of methods to ensure that we don't always
have these tests, et cetera, and we're not always
invalidating the caches.
So the idea here, or the problem that we're solving, is the same. Of course, you would say, or you probably already understood, but you could say: why would we have a separate lock for the reader? Right, so we want to be optimistic about this or something.
But as soon as we want to check for any kind of lock,
again, we have to read it into our cache.
Even if we don't change anything.
We just have to read it into the cache to check if there's a lock.
As soon as somebody
like a thread spins on the lock,
this will be an invalidation of the cache.
Again, this means there is some communication.
So having a separate lock that's just read, right? Where we don't do any updates, because the writer lock hasn't changed or anything, right? So if we have a separate lock for the reading, if this can be shared because we don't have any updates, we don't need to invalidate the caches.
It might be more efficient there.
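For completeness, a very simple reader-writer spin lock sketch (one counter: -1 means a writer holds it, otherwise it's the number of readers). Note that even the read lock writes to the shared counter, which is exactly the cache invalidation cost that the optimistic, version-based approach avoids.

```cpp
#include <atomic>

// Simple reader-writer spin lock sketch: state == -1 means a writer holds the
// lock, state == 0 means free, state > 0 is the number of active readers.
struct RwSpinLock {
    std::atomic<int> state{0};

    void read_lock() {
        for (;;) {
            int s = state.load();
            // Readers may enter as long as no writer holds the lock.
            if (s >= 0 && state.compare_exchange_weak(s, s + 1)) return;
        }
    }
    void read_unlock() { state.fetch_sub(1); }

    void write_lock() {
        for (;;) {
            int expected = 0;  // a writer needs the lock to be completely free
            if (state.compare_exchange_weak(expected, -1)) return;
        }
    }
    void write_unlock() { state.store(0); }

    // Note: even read_lock() performs an atomic write on the shared counter,
    // so concurrent readers still invalidate each other's cache line -- the
    // cost that optimistic, version-based reading avoids.
};
```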
Okay?
Good.
Okay, so what are the requirements for latches? So in general, most database accesses are read-mostly.
So we're usually most of the time what we're doing is read.
That's actually not true for cloud databases.
That's an interesting fact, right? So these days everybody stores everything, and a lot of the data that's stored is actually garbage. Nobody ever touches it again. So this kind of breaks this assumption. If you're just
dumping all your data, eventually you will end up with a lot of data that you're not
using. But assuming you have a database that's fairly efficient
and you have meaningful data in there, then most likely you'll mostly read.
So most of the access will basically be reading even in an OLTP workload.
So in general, I mean, we're doing updates, but in order to do the updates, we'll still
have to read the values before, right?
Only if it's insert-only or insert-mostly, that's when we have a write-mostly workload.
As soon as we're working with the data, then most of the accesses will be reading.
And that means reading should be fast and scalable. And this is also why we might consider
having a separate lock for reading, et cetera.
Also, if we have a tree-based data structure, which we often
have, or we have some kind of directory or something,
like the root nodes, whatever, the directory,
this needs to be traversed a lot.
And in a tree, the root node will always be traversed.
This means we might have high contention in such hotspots.
So this needs to be lockable with a very low overhead.
So this should not be expensive.
Locking these nodes, especially for reading,
should be fairly cheap.
Latency is critical often in database systems,
or at least we want to be efficient.
We're talking about hardware efficient work here,
so let's keep it fast.
So we don't want too much context switching.
So we don't really want to have OS locks most of the time. I told you, in some cases, if we know this is going to take a long time, something needs to be locked for a long time and all of the other threads will not be able to make progress anyway, then it might make sense to use OS locks, because then the other threads are sleeping rather than busy polling. And well, if we have fine-grained data structures, very small nodes, linked lists, ART, et cetera,
Then we also want fine grained locks,
and we want to have small locks.
So the locks shouldn't be much larger than the data structure
itself, or let's say equally large as the data structure,
because otherwise it's just not efficient again. If we have 40 to 80 bytes for a node, if we're doubling this size
just by adding a lock, like a standard OS lock or something like this, well, this is not efficient.
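If you want to check the size argument on your own machine, a quick sketch (exact numbers depend on the platform and standard library; on typical Linux/x86-64 builds std::mutex wraps a pthread_mutex_t of roughly 40 bytes, while a spin lock can be a single byte):

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>

// A one-byte test-and-set spin lock, just for the size comparison.
struct TinySpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

int main() {
    // On a typical Linux/x86-64 build, std::mutex wraps a pthread_mutex_t of
    // roughly 40 bytes, while the spin lock is a single byte -- which matters
    // if your node is only 40 to 80 bytes to begin with.
    std::printf("sizeof(std::mutex)   = %zu\n", sizeof(std::mutex));
    std::printf("sizeof(TinySpinLock) = %zu\n", sizeof(TinySpinLock));
}
```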
So let's see how we can be efficient there somehow. And of course we want to have efficient contention handling, and this is really where these queue-based locks, for example, come into play. If all of a sudden, like we said, 24 cores, right, in this older kind of CPU, and if you have a modern CPU you get into the 40s, and of course you can have up to eight CPUs on a single board, well, then we're in the hundreds of cores. Well, if they're all kind of constantly invalidating each other's caches, this will be very slow, right? All of a sudden we're not going to have throughput anymore, just because there's a lot of contention
and everything's bottlenecked by the synchronization.
So we want to be able to handle the contention gracefully
without sacrificing the uncontended path,
meaning without being slow when we're just reading. So ideally,
we still want to have optimistic reading or lock-free reading in any case.
So if there's no other stuff going on, we can always read the data; it makes sense to just
be able to go without locks. And for that, there's a nice paper
where Böttcher et al. looked
at how different kinds of locks behave in different kinds of setups,
or for different kinds
of workloads.
And meaning read-only setups, so if we're really just reading,
there are some data structures that you're practically
only reading.
So, you're very rarely updating metadata in the database.
So, of course, every now and then it might be updated, but that might be a huge maintenance
task.
So, that's a different story.
Otherwise we're just reading the data structure.
That's one case.
Then read mostly with cheap reads.
So it's basically something like a small data structure
where the read is simple, meaning we can directly
read the data items in our data structure.
A linked list is a very cheap read.
So I read the node: I read the value and I read the pointer to the next node.
Since the node consists only of the value and the pointer
to the next node, reading it is fairly cheap.
It's a different story if I have a compressed node
where I need some SIMD instructions, et cetera,
to decompress the node and get an understanding
of what's actually in there.
So all of a sudden my read might take many cycles, or I have a large node,
a B-tree node for example, that's not just a single cache line, that might be
a larger memory page, etc. So if I have a big read set, that's again a different
story, because the read will take a long time.
So this means if I'm in the node, reading the node will take some time.
Some other thread might come in and somehow annoy me with updates in the same node.
If it's a cheap read, well, there might be another thread coming,
but even if a thread comes, I'm most likely already done with my read. For an expensive read, that might not be the case.
Then I have write-heavy: lots of writes. And write-only is basically, say, a log. A log is
write-only. That's a different story. And here, they basically looked at these
different workload types and tried things out. You can read the paper.
DaMoN is a workshop at SIGMOD, and the SIGMOD conference is actually next week,
so I suggest checking out the program. You can basically see what
kind of papers are out there and what's current research. And DaMoN is the workshop there that's for modern hardware.
So there, people are trying to come up with new ideas
on how to use modern hardware.
So this is why we also have some of the pointers here.
And in a read-only use case,
it always makes sense to go for optimistic locking,
because there's no overhead.
And we still have this version checking,
but we're not waiting for anything, right?
So we don't have any cache invalidations or anything.
So that makes a lot of sense.
Exclusive is too restrictive,
just completely unnecessary overhead.
With shared locks, we basically get read-read contention,
right? We might still be updating and checking the same nodes,
checking the same lock words, which is still more costly than going optimistic.
If we have cheap reads and we read mostly,
then optimistic locking still makes a lot of sense,
because restarts are unlikely,
and since it's a cheap read,
even if there is a restart,
rereading is not really a big problem.
An exclusive lock, well, is again too restrictive, right?
With a shared lock, there might still be some contention on the lock, but it's okay,
right?
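As a sketch of what such an optimistic read with version checking could look like in code: this is my own seqlock-style illustration, reusing the hypothetical 8-byte version/lock word from the earlier node sketch rather than the lecture's implementation. Read the version, do the cheap read, re-check the version, and simply restart if a writer got in between.

```cpp
#include <atomic>
#include <cstdint>

// Optimistic (seqlock-style) read: the reader never writes the lock
// word, so it never invalidates it in other cores' caches. The low bit
// means "writer active"; writers bump the version on every update.
uint64_t optimistic_read(const std::atomic<uint64_t>& versionLock,
                         const std::atomic<uint64_t>& value) {
    while (true) {
        uint64_t before = versionLock.load(std::memory_order_acquire);
        if (before & 1) continue;  // a writer holds the latch, retry
        uint64_t result = value.load(std::memory_order_relaxed);  // the cheap read itself
        std::atomic_thread_fence(std::memory_order_acquire);
        uint64_t after = versionLock.load(std::memory_order_relaxed);
        if (before == after) return result;  // no writer interfered
        // Version changed: restart; for a cheap read this costs almost nothing.
    }
}
```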
In a read-mostly setup where we have big reads,
the shared lock actually all of a sudden makes sense.
I mean, the shared lock
still has some overhead, right?
We have to acquire this lock, but because the read is expensive,
the relative cost of the lock gets smaller and smaller.
And on the other hand,
even though restarts are unlikely,
if a restart happens with an optimistic lock,
then because the read is expensive, this might hurt us.
So it might actually make sense to have this shared lock, where
we're basically also taking a lock for reading,
because we avoid any kind of restarts during the reads.
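A sketch of the big-read case under a shared lock, using std::shared_mutex as a stand-in (a real system's latch would likely be smaller and custom-built): the acquisition costs something, but the expensive read then runs to completion without ever being restarted, and other readers can still proceed in parallel.

```cpp
#include <cstdint>
#include <numeric>
#include <shared_mutex>
#include <vector>

std::shared_mutex     nodeLatch;      // hypothetical per-node latch
std::vector<uint64_t> bigNode(4096);  // hypothetical large node spanning many cache lines

// Expensive read under a shared lock: many readers run in parallel,
// and no writer can interrupt the scan mid-way, so no restarts of the
// costly read are possible.
uint64_t scan_node() {
    std::shared_lock<std::shared_mutex> guard(nodeLatch);  // shared (read) mode
    return std::accumulate(bigNode.begin(), bigNode.end(),
                           uint64_t{0});                   // stands in for the big read
}
```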
And then in a write heavy setup, shared locks are good.
Exclusive locks are always quite restrictive
and an optimistic lock all of a sudden
will be kind of stupid because we have so many restarts.
Right, so even in a write-heavy setup, with optimistic locking
we often still need reads,
just to find the node that we actually want to update,
right, to find the data item.
So even if it's update-heavy, we will do some reads,
but we might not even get there, because everything's updated all the time.
So we get lots of aborts and we might get some starvation.
Because of that, of course, we additionally have these pessimistic locks
alongside our optimistic lock to be safe.
But a shared lock might make more sense. In a write-only setup, well,
then we will need exclusive locks eventually,
or it doesn't really make a difference,
because the writes are exclusive anyway, right?
So whether we use a shared lock
or just lock everything exclusively
doesn't really make a difference for us.
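For completeness, the writer side of the hypothetical version/lock word sketched earlier could look roughly like this (again my own illustration, not the lecture's code): the writer takes the word exclusively by setting the low bit, and bumps the version on unlock so optimistic readers notice the change.

```cpp
#include <atomic>
#include <cstdint>

// Exclusive write lock on the hypothetical 8-byte version/lock word.
// Even version = unlocked, odd = write-locked; unlocking bumps the
// version so concurrent optimistic readers restart their reads.
void write_lock(std::atomic<uint64_t>& versionLock) {
    uint64_t v = versionLock.load(std::memory_order_relaxed);
    while (true) {
        if (v & 1) {  // someone else holds it: reload and retry
            v = versionLock.load(std::memory_order_relaxed);
            continue;
        }
        if (versionLock.compare_exchange_weak(v, v | 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
            return;  // low bit set: we hold the exclusive lock
    }
}

void write_unlock(std::atomic<uint64_t>& versionLock) {
    // +1 clears the low bit and yields the next even version number.
    versionLock.fetch_add(1, std::memory_order_release);
}
```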
Okay.
So I have the data structures part now,
but we're so far in time already
that I would say I'll continue with this next time,
meaning tomorrow. Do we have questions up to here? As a reminder, we talked
about cache coherence, and this is really why we talked
about it: because how the caches work
and how the memory accesses work
is why different kinds of locks have different properties,
and why it makes sense to build these structures
and queues of locks, just to make sure
that we're not basically destroying our caches all the time.
So, I mean, the caches are good if we have our data local,
if we have shared access to the data, so that's all fine.
But as soon as we're locking, we're updating.
With these updates and spin locks, we get all these cache invalidations, and that's
slow.
That's basically why it pays to be smarter about it, building these kinds of queues
of locks that make it faster.
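As a reminder of how such a queue-based lock avoids the invalidation storm, here is one well-known variant (an MCS-style lock) as a rough sketch; whichever variant the lecture used, the key idea is the same: each waiter spins only on its own node, so a lock handover touches one waiter's cache line instead of invalidating a shared lock word on every core.

```cpp
#include <atomic>

// Minimal MCS-style queue lock sketch: waiters form a linked queue and
// each one spins on its *own* node's flag instead of on a shared word.
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

struct MCSLock {
    std::atomic<MCSNode*> tail{nullptr};

    void lock(MCSNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Append ourselves to the queue of waiters.
        MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Spin only on our own cache line until the predecessor hands over.
            while (me.locked.load(std::memory_order_acquire)) { /* spin */ }
        }
    }

    void unlock(MCSNode& me) {
        MCSNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No known successor: try to reset the tail to "unlocked".
            MCSNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A successor is enqueueing; wait until it links itself.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { /* spin */ }
        }
        succ->locked.store(false, std::memory_order_release);  // hand the lock over
    }
};
```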
OK.
Yes?
Actually, I have one question about the test-and-set spin lock.
Because you said this is inefficient because you always update the cache line so it's invalidated
in all other cores.
Yes.
What if that operation just sets the value if the lock is not set?
Is it difficult to make that atomic? Or why do we always update the cache line, or always update the lock, if it's not actually changed?
So you can also implement it that way.
The lock itself will then just be more work.
So there are just a couple more cycles that you need for this.
The test-and-set is an atomic operation that's intrinsic in the CPU, really
more or less a single-cycle operation. So that's very fast. If you start branching,
well, then you have branch prediction and all the additional stuff. You can still lock that way,
but it's just going to be more cycles. There might also be atomics for that, but this is the way you implement it most efficiently.
I mean, you might be able to do it with compare-and-swap, et cetera, but usually,
I mean, this is a single operation that you do. Otherwise you have multiple operations,
meaning multiple cycles, because you basically need to check.
You need to branch to see what's going on.
This is basically branch-free, right?
So think about it similar to predication, et cetera.
It's a single operation.
I have no idea if it would make sense to actually implement it in a different way and then maybe be faster.
Try it.
Well, it might be possible, right?
So I mean, we had this here somewhere, right? There are these different kinds of ways
you can implement this.
So there might be a different way of implementing it.
I've actually not really questioned it so far.
So it might make sense to think about it.
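The direction of the question, only writing once the lock looks free, is essentially the test-and-test-and-set pattern: spin on plain loads, which leave the cache line in the Shared state everywhere, and only issue the invalidating atomic exchange when the lock appears free. A rough sketch of my own, not from the lecture:

```cpp
#include <atomic>

// Test-and-test-and-set spin lock sketch: waiters spin on a read-only
// load, so the cache line can stay Shared across cores while the lock
// is held; the invalidating atomic exchange is only attempted once the
// lock looks free.
struct TTASLock {
    std::atomic<bool> locked{false};

    void lock() {
        while (true) {
            // Cheap local spinning: plain loads, no cache-line invalidation.
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            // Lock looks free: now try the (invalidating) atomic test-and-set.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;  // we got it
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

Whether the extra check pays off depends on how contended the lock actually is, which is exactly the trade-off discussed here.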
Maybe as a follow-up: are there attempts to make the cache smarter about changing its state when we write a value,
so that the copies in other caches are not invalidated
if the same value is written again,
so the cache line doesn't actually change in any way?
OK, so the question is, what if I'm updating the cache line
and it's not changed?
So why would I invalidate the cache line?
Good question.
I don't know if the CPU checks this.
It might be possible.
I would have to check, actually.
Okay, further questions? No further questions? Okay, well, then, thank you very much.
And see you tomorrow.