Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Concurrency and Synchronization

Episode Date: June 4, 2024

...

Transcript
Starting point is 00:00:00 Okay, so let's get started. It's a bit loud today, right? Let me see. This is maybe better. Okay, so welcome to today's session on hardware-conscious data processing. We're going to talk about concurrency and synchronization, mostly synchronization, actually. And I have a couple of
Starting point is 00:00:27 announcements today. Again, we'll have Götz Graefe here on Friday talking about modifying an existing sort order with offset-value codes. Should be interesting. So I invite you to come here on Friday at 1 p.m. Tomorrow we'll have the joint seminar with TU Darmstadt, and Tiemo Bang, a former PhD student from Darmstadt who is now at UC Berkeley, will present on a fast and accurate object placement oracle. I actually don't know exactly what that is, but it should be interesting. Very important, we have an intermediate feedback survey
Starting point is 00:01:09 in Moodle, which is maybe a bit late, but still, it would be great if you could fill it out and basically give us some feedback on what you like about the course, etc. So just five quick questions, and we'll try to incorporate this into the remaining parts of the lecture and the tasks, etc. So this is really to help improve the course. Of course we also use EvaP at the end, but that's too late for you. That's going to be good for the next people.
Starting point is 00:01:48 But with this, we can maybe also improve the course for you. So this is why it's really good if you can give us feedback here. We're still in the CPU package, still single CPU, but already looking at NUMA kind of style where we also have multiple CPUs. And this will then really be the topic tomorrow. So what if you have multiple CPUs, not just a single CPU with multiple cores? What does this tell us? And today we'll look at, well, what if we have data structures that need to be accessed from multiple cores, multiple CPUs? So not only data structures, generally stuff,
Starting point is 00:02:36 cache lines that need to be accessed from multiple parts. How do we deal with this? And first, what does the CPU do about it? And then, what do we do about it? So, how can we make this efficient? One other thing: this is the remaining timeline. And while we've seen this a lot of times, two notes about this. So, we're here on the top, right? So, tomorrow will be NUMA. Next week, I'm not here. So storage will be done by Martin and Compute Express Link will be done by Marcel.
Starting point is 00:03:13 And then also important, we'll have the task three session, and that will not be here, but upstairs, and this should be the seminar room. But I always confuse the two. I think this one is basically on the side that looks to the terrace. So basically there's something else going on here. So we'll be upstairs, and we'll announce this again in Moodle, if we haven't already announced it. Actually, we did already announce it. So when you come in here and there's something else going on and you didn't remember we're somewhere else,
Starting point is 00:03:55 then remember we're gonna be upstairs. Okay. With this, yeah, quick outlook, what's going to happen today? We're going to talk about cache coherency. So basically, as soon as multiple cores, multiple threads, access the same kind of memory regions, what does the CPU do for us? How does this work?
Starting point is 00:04:20 Then, and there we will see the underlying protocol and maybe already get a bit of an understanding what the CPU needs to do, how this works, and then also with this information later on see how we can be efficient about this and why some things are more efficient than others or some architectures in how we do locking and latching in data structures with data in general. And for this, we'll look at different latch variants. So how can you implement something like this? So how can you implement synchronization in data structures. And then basically for a linked list, which is kind of one of the basic, very simple data
Starting point is 00:05:10 structures, we'll look into this in more detail. So what are the different kinds of modes in terms of latching or locking in these data structures? And I've used a lot of my colleagues' work for this. There are also lots of interesting papers on this. And this is also something that you will have to do in the last task. So looking especially at the lock-free data structures part, this will be for the skip list. You will, on the one hand, learn how the skip list works,
Starting point is 00:05:45 and on the other hand, learn how to implement a lock-free data structure, which I think is quite interesting and useful, although you might not necessarily need a linked list or a skip list all the time. But still, it's a good thing to know how this works. Okay, so with this, let's go into cache coherency, right? So, well, motivation, why do we need this? Databases are often faced with highly concurrent workloads, right? So we basically have multiple queries
Starting point is 00:06:21 and while reading is not necessarily a big problem, right? So we can basically have multiple versions or multiple copies of the same data and read concurrently, that's not a big issue. But as soon as we update something, then we need to synchronize. So if we have multiple queries, especially let's say in an OLTP setup, where many queries will go to the same kind of data structures and do some updates, we need to do some kind of synchronization. And some of it the CPU does for us.
Starting point is 00:06:57 So to some degree, basically CPU makes sure that caches are coherent. So whatever is accessed in different cores, it will be coherent across the different kinds of caches. So it will be the same kind of versions. And if there are some changes in different caches, then basically the CPU takes care of that, basically updates the caches. You will have to reread from another core or from the memory. If we have concurrent workloads, that's good for us, right?
Starting point is 00:07:37 Because that means we have some parallelism that we can work with. That's basically also easier than doing all this operator deconstruction, etc. But the bad news is, as soon as we have lots of concurrency and we're constantly accessing the same kind of data items or the same kind of data structures, then we need to synchronize, and this will be costly, especially if it's really the same cache lines across many cores, across many CPUs. Then all of a sudden, this will really hit us,
Starting point is 00:08:12 because this basically means our caches won't work anymore, because we always have to go back through memory. You have to communicate the changes across the different cores and different CPUs. That in general is not bad news, really. The bad news is this will make it slower. So as soon as we have lots of synchronization, this usually makes stuff slower, unless we do something about it.
Starting point is 00:08:38 But still, I mean, synchronization is basically where we have communication, where we basically don't have parallelism anymore. And in a database, there are two levels of synchronization in general. One is whenever we have a table, our student table where we write your new marks for this course, for example, and for other courses. This is where we typically say we're locking the tables. So I mean, we can lock the whole table, we can lock individual tuples, we can lock columns,
Starting point is 00:09:18 et cetera. There's all kinds of granularities. We have hierarchies of locking. We might even need to lock the whole database for large kind of updates. This is not what we're gonna talk about today. This is super important. This is basically, as I said,
Starting point is 00:09:36 so we want to do updates in the database. So we're adding new grades, and we want to make sure that if we're changing one thing and adding a grade here, or exchanging something, we don't have problems in the updates. So you know this ATM example, right? Two changes at the same time, say the Studienreferat and me entering your grades at the same time; in the end either you have two grades for the same course or zero grades for the same course, et cetera.
Starting point is 00:10:09 This is where we have locking, and this is where we need to have these transaction guarantees. But on top of that, also, we need synchronization on these database internal data structures. And this is what we typically call a latch instead of a lock. In this lecture, we'll call everything a lock, just because also the general terminology is a lock for this if we're basically implementing data structures.
Starting point is 00:10:42 But for databases or database systems, we often just in order to differentiate the two, we say the lock is whatever we put on the table or the tuple. The latch is where we were synchronizing inside a data structure, say a single node in the B-tree or in a skip list, something like this. So this would be the latch. And as I said, we'll focus on the latches, even if we refer to them as locks. But you know, in the future
Starting point is 00:11:16 if you see latch, that means data structure access. If you see lock, it can be both. That's it. So latch is the one that basically tells you, yes, this is about data structures. Locks can be both. OK. Before we go there, before we go to the latches,
Starting point is 00:11:40 first we have to talk about cache coherence. So how do caches work, or what cache coherence actually means? How do caches keep in sync with each other? So you remember, right, CPU has multiple cores. Each of these cores have their private caches. So at least L1, depends on the architecture, but at least the L1 cache, typically instruction and data cache, this will be private
Starting point is 00:12:12 for every individual core. So this very small cache, and it might even be the L2 cache as well, and then if we talk about multiple CPUs, then we're even at the L3 cache, which is private or separate across different parts of the CPU or different CPUs, right? And the CPU manages this, so it has this large shared memory pool, and some cache lines will end up in the caches, and you do something with it, right? So as soon as you're changing something, this will go to the cache rather than directly to the memory, unless you're doing a non-temporal write, which will go all the way down through the caches into the memory, basically
Starting point is 00:12:56 telling the CPU: well, this one I'm not going to need in the caches, I actually want this to go right back to the memory. So that's a different story. But if you don't do that, if you're not doing anything special, everything will be done within the caches. And that means if you have data structures that you're using frequently, say we're talking about a tree, then the root node of the tree, most likely,
Starting point is 00:13:21 if you always go to the same tree, like many different queries work on the same data set, on the same table, we're indexing this with a single tree, then the root node, for example, will be in every cache eventually. So because we have many queries that go to a lookup through the same kind of tree. So we have cache lines that appear everywhere.
Starting point is 00:13:49 Now, if we have an update, or say we're growing our tree, so the root gets split, for example, all of a sudden this is modified, and we have to make sure that all of the other caches somehow get updated. So they don't work on the old tree or we were deleting something, we're changing something in there. So this, and this in general is handled by the CPU.
Starting point is 00:14:13 So the CPU makes sure that cache lines are coherent across different cores. And this is done using a cache coherency protocol. In general, this only works on a single node, right? Meaning that we have one server, which might have one CPU, two CPUs, four CPUs, eight CPUs, whatever. But within this, we have cache coherency. If we go out of this, we don't have cache coherency. For some hardware and for CXL, and we'll learn about this in two to three sessions,
Starting point is 00:14:55 in three sessions, there we also have cache coherency across multiple nodes. So these are new kinds of standards. And you will hear parts of it again, not everything. So we're in sync here. But you will remember that this is the kind of protocol that you need there. Okay. So the cache coherency protocol, what it does is ensure the consistency of the data in the caches. And there are two basic fundamental operations. That's the load and the store, right? So we're basically loading data into a cache and storing the data out of the cache
Starting point is 00:15:36 back into the memory. And within the cache, we can basically change things. And you remember, there's basically write-back and write-through. Write-back means we're doing updates within the cache; write-through means we're doing updates all the way down back into the memory. If we need to go down into the memory, that's slow, so that's why modern CPUs don't do that for every individual update in the cache. So this means our updates are in the cache. So whenever something changes in different parts, or we have items cached in multiple caches,
Starting point is 00:16:13 we have a cache line in multiple caches basically, then we need to ensure that this cache line is in sync with the others. That basically means if it's updated in one node, it needs to be invalidated in the others. The others basically need to update their value if they want to touch this again. And this is done with two different techniques,
Starting point is 00:16:37 either snooping or directory-based coherence. And snooping is basically communicating about it, so the caches communicate with each other, saying: oh, I need this value, is it in a cache somewhere? Or: I'm changing this value, and if you have it (and by value I mean cache line), then please invalidate yours. So you cannot use your previous version. I'm working on this. And this is the snooping.
Starting point is 00:17:13 Snooping in the sense of: we have something like a bus within the CPU where everybody's talking to each other. And if I hear somebody's changing something, so one cache basically says this is changed, and everybody else listens and says, oh, this is something that I actually also have, so I'm going to invalidate this. That's one way. Another way is using a directory, meaning that we have some kind of directory structure where a centralized directory holds all the information about where the data is.
Starting point is 00:17:46 And typically there is some kind of combination in between the two. Because the snooping, and I mean, we'll go through this in a bit more detail, but the snooping means communication, means if we have many cores, that's gonna be lots of communication. Directory is one central point
Starting point is 00:18:02 where we can just ask directly. You don't have to chatter around. If you have large or many cores and multiple cores, multiple CPUs, all of a sudden we'll actually end up with multiple directories where, again, we'll have to communicate. And this is on different kinds of levels. Okay, so the cache coherence protocols track the sharing state of the data. So how are our cache lines shared across different kinds of cores and different kinds of CPUs?
Starting point is 00:18:45 And the snooping is basically one approach to implement this. And every cache with a copy basically just listens in what is going on. So it's basically, is this cache line shared somewhere else? And this is done via some broadcast medium. So typically or not typically, within the CPU package, within a single CPU, all of the cores are connected with an on-die network. And this is different for different architectures. It used to be like a ring for some time. Then more modern CPUs have something like a mesh setup.
Starting point is 00:19:27 So they're connected with a mesh with some routing infrastructure as well. And then you have, we'll talk about this next time, I guess, UPI, for example, or Infinity Fabric or Crossbus across different kind of CPUs and talking to each other. So this is kind of the broadcast medium. And on that medium, these caches will communicate to each other. And so the cache basically, I mean, we know the cache is basically a small directory with some information, with different kind of updates, strategies, etc.
Starting point is 00:20:09 And the cache controller listens on the medium, so on our on-die fabric, basically, if there's something going on. If it hears a message that says this cache line that we're also caching or I also have in my cache, if this is shared somewhere else, if this is invalidated, if this is needed exclusively, etc. And typically the most common approach is write invalidate protocol. So basically, as soon as we're changing something, so any kind of cache line can be in as many cores as we have, right? So the cache lines can be read,
Starting point is 00:20:55 and all of the cores can happily read this cache line and work with the data in the cache line. But as soon as we're updating, then only one cache should actually do these updates. Everybody else basically, because otherwise we'll have diverging data in the cache line, so everybody else is basically excluded. So we need an exclusive access
Starting point is 00:21:19 before we're writing something into the cache line. And this means we're sending a message, our cache is sending a message, I'll have exclusive access on this cache line, and all of the other caches listen in on the fabric, say, oh, I also have this cache line, I'm invalidating this cache line, so I cannot use it anymore.
Starting point is 00:21:42 If I want to use it, either I'm going to ask this cache or I'll read it back from the memory. So if it's still somewhere else, I need to read it from that cache rather than from memory, because we said we still have the data in the cache rather than in the memory. If it's written back to the memory, I'll have to read it from the memory. Right? And typically this means there's something like a finite state controller per core. This means that we have like each of the cache lines has a certain state and can then only change to another state. We'll go through this in an example in a bit. Okay. So this is an example of such a topology.
Starting point is 00:22:27 So here, this is actually one CPU package. And you can see here we have 24 cores, I guess. Yeah, 24 cores within two NUMA nodes. So we call this a chiplet, meaning that we have basically within the single CPU, we have two regions that's already NUMA style, so we'll talk about this next time. So this is basically, there's even,
Starting point is 00:22:56 it's not a completely homogeneous structure within the single CPU. So even within the single CPU, there are basically some memory regions that are closer to a core and some memory regions that are further away from a core. And you might have some substructures. Or let's say, you have some substructures in the CPU
Starting point is 00:23:22 if the CPU becomes large. So this is, I guess, the Skylake architecture. In this we still have this ring, not a ring buffer, a ring topology. So we basically have a ring communication across the cores here, and then we have some interconnects between the two rings that connect 12 cores each. In a more modern or larger CPU, say Sapphire Rapids or Emerald Rapids for example, this would be a mesh structure. So this is not two rings, but a mesh structure. But again, multiple chiplets interconnected. So up to four, I guess, right now.
Starting point is 00:24:15 And then within these, so we have like the cores. And the cores have their caches and their caching agents. So this is basically the cache for the individual core and the cache has the caching agent and the caching agent listens to this internal fabric and also communicates on this internal fabric. So we'll send messages. And there's different snooping modes.
Starting point is 00:24:43 So no snoop basically means, well, we're not snooping. Early snoop means, yeah, so I actually don't remember the exact details between the two, but both basically say we're sending snooping messages from the individual caches, so from the individual cache managers on the core level, and that basically means we have lots of communication here, right? So if we have this tree idea again, so we have one tree that every query will need to access.
Starting point is 00:25:25 We have the tree, the query is running on all of the cores, or the tree is cached on all of the cores. Then these kinds of snooping messages mean: okay, some cache needs to update, then all of the others will have to invalidate. Or one cache asks, well, who's sharing this, or who has access to this, and then all of the other caches will answer this question. So that's a lot of chatting, and this is why there are different versions, also the home snoop, which basically means here we have a home agent, like a memory controller on the chiplet, which then would do these snooping requests
Starting point is 00:26:10 or answer these snooping requests rather than having it everywhere and then we might have a on the home agent we might have this directory structure which keeps track of the caching. So who's caching what, and who has access to what. And then again, we'll do these answers in order to reduce the amount of chit chat that's going on on this network. So then, well, we can use this as one big region, or one big CPU.
Starting point is 00:26:48 So we can basically ignore that there are multiple regions and handle the snooping and the communication just as if it were one large structure. Or we can do this with sub-NUMA clustering. So we can basically say, well, this part is organized individually, and this part is organized individually. And these are basically also these different kinds of snooping modes. These are BIOS settings.
Starting point is 00:27:17 So you can basically tell your CPU, I want to use this directory, or I want to use the regular snooping mode. And then this will be set up differently in the BIOS. Okay, so now we said there is a protocol for how this works, right? So basically for each of the cores, for each of the cache lines, we have different states that the cache line can be in. And in the classic cache coherency protocol, there are four states, and this is why it's called the MESI protocol. So we have modified, exclusive, shared and invalid.
Starting point is 00:28:05 And the order is just so it spells nicely or you can easily pronounce it. It's not necessarily the order that you would have the states in. So the initial state for all cache lines at boot time is invalid. So all of the cache lines are basically unused or are not usable right now in the way they are.
Starting point is 00:28:26 Then as soon as we're reading something, or we have a cache line read into our cache, initially, if our core is the first one that has read it, it's in exclusive state, meaning nobody else has it. And we can use it as we want. As soon as somebody else also uses the same cache line, then we're in shared state. So multiple caches have the same cache line in the same state. As soon as we're changing it, meaning the cache line is different from what it was in memory, then it's in modified state.
Starting point is 00:29:08 And of course, if it's in modified state, it cannot be in shared state somewhere else. So then this cache line needs to be invalidated everywhere else. Otherwise, we might have diverging versions of the cache line. As I said, initially, all of the cache lines are marked invalid.
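To make the four states a bit more concrete, here is a minimal sketch of the per-cache-line state transitions in plain C++ (a simplification for illustration only; real protocols such as MESIF add more states and also handle who supplies the data):

```cpp
// Simplified MESI transitions for one cache line, as seen by one core.
// "Remote" events are what the cache controller snoops on the interconnect.
enum class State { Modified, Exclusive, Shared, Invalid };

enum class Event {
    LocalRead, LocalWrite,    // our own core touches the line
    RemoteRead, RemoteWrite   // another core announces a read / a write (read-for-ownership)
};

State transition(State s, Event e, bool otherSharersExist) {
    switch (e) {
    case Event::LocalRead:
        // Miss on an Invalid line: load it, Exclusive if we're the only
        // holder, Shared otherwise. A hit keeps its current state.
        if (s == State::Invalid)
            return otherSharersExist ? State::Shared : State::Exclusive;
        return s;
    case Event::LocalWrite:
        // Writing requires ownership: the other copies get invalidated via
        // snoop messages, and our copy becomes Modified.
        return State::Modified;
    case Event::RemoteRead:
        // Someone else reads the line: Modified/Exclusive copies drop to
        // Shared (a Modified copy also has to supply the data / write back).
        return s == State::Invalid ? State::Invalid : State::Shared;
    case Event::RemoteWrite:
        // Someone else wants to write: our copy is invalidated.
        return State::Invalid;
    }
    return s;
}
```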
Starting point is 00:29:44 And we said these kinds of changes are communicated across this interconnect, this on-die interconnect here. So if a core changes something, that is something the home agent cannot know by itself. So if I want to change this here, the home agent has no idea. So this I need to basically communicate on this kind of network. If I want to find out, is this shared, for example, so I'm reading something, then somebody will have to answer this question. So this could be the home agent. In a snooping mode, all of the cores could basically answer
Starting point is 00:30:14 if this cache line is shared or not. And because of that, if you have 24 cores, for example, in the worst case, you get 24 answers that your cache line is actually shared. And because of that, there's an extension. This is what Intel uses, for example: the MESIF protocol. So there's an extra state that's called the forward state. And that's basically a designated responder which, if we have a shared cache line, just says: this cache line is actually shared.
Starting point is 00:30:49 Because in the snoop mode, again, going here, say our core 10, for example, here has the cache line shared. Now some other or like, let's say all of the core, almost all of the cores have the same kind of cache line in a shared state. Now our core 20, or what is this? Nine, core nine wants to read this cache line. Now it asks, is this cache line shared? And then potentially all of the other cores would answer. And rather than having all of them answer,
Starting point is 00:31:26 we're going to designate core 10 as the forwarder, and then this core will answer: yes, this is shared. So basically one of the sharing caches will just have this mark: I'm the forwarder cache for this cache line. And all others are basically just shared. And that's just to reduce the amount of chitchat again on the internal fabric. Okay.
Starting point is 00:32:03 Good. So let's look at this again very briefly, going through the protocol in an example. So we have our different caches; in here they are labeled CPUs. This works across CPUs and across cores. So you can think of the CPUs here as cores, or you can think of them as CPUs.
Starting point is 00:32:31 So for now, since we're not really NUMA yet, let's say the CPUs are actually cores, and each of the cores has their own cache. So initially, all of the cache lines are invalidated. Then our CPU1 or Core1 starts reading a cache line, so our cache line A. So then it has exclusive access to this cache line. So this is basically the first state. Nobody else has ever used it. It's not shared, it's not modified, so it's in exclusive state. Everything else is invalidated.
Starting point is 00:33:03 Now CPU2 also reads the same block or same cache line. Then, we're in shared state. Both of them have read the same cache line. Now, both of them are in shared state. Then, if CPU2 writes the block, then it's modified in CPU2. It's invalid in CPU1. So it needs to be invalidated. If CPU1 wants to use it again, it has to read it again.
Starting point is 00:33:33 And if CPU3 now reads the block, then it will read the block, if it's in modified state, it will read the block from the cache of CPU2, and both of them are in shared state again. And then if CPU2 writes the block again, it will be modified in CPU2. It will be invalidated in CPU3. So if CPU3 wants to access this again,
Starting point is 00:34:00 it needs to read it again from the other cache. Or once it's written back to the memory, it needs to read it again from the other cache, or once it's written back to the memory, it needs to read it from memory. Okay. Okay. Good. So, why is this important? Because basically, as soon as we're accessing the same cache lines
Starting point is 00:34:20 across different CPUs, across different cores, it means we need to communicate. So we basically, and this needs a couple of cycles, right? This basically is these cache invalidations. And you remember, I don't remember, but it's basically almost like a memory request. So it's a bit, it's not that many cycles, but it's basically multiple cycles going across the different kind of cores through the caches. How many cycles would that be? So it's probably level three cache number of cycles.
Starting point is 00:34:57 Yes? How does CPU3 in the example know whether it has to read the cache line from memory or from the cache of CPU2? So CPU3, if it wants to read this, it will basically the caching agent will ask, is there, it will snoop on the bus. So either it asks on the bus, is this in a cache?
Starting point is 00:35:23 Or it basically remembers by seeing, oh, this is invalid. It sees that this basically needs to be read. But in this case, it would actually probably ask on the bus, is this somewhere? And if it's somewhere, it will get it from the other CPU. And this is also because this happens so frequently right so whenever we're basically doing some accesses in the caches this is also why we don't want to have too much communication in there this needs to be fairly fast but if like all of the cpus all of the time
Starting point is 00:36:00 basically answer to everything, then the bus might actually get overloaded. It can actually get overloaded. This is also why, if you're not smart or if you're not NUMA-aware across multiple nodes, this can basically bottleneck your memory access or your caching behavior. Yes? And will the data of the modified cache line
Starting point is 00:36:27 be transferred over the communication bus or over separate? It's the same. So the question online is basically, where is the data communicated? And this is basically the same infrastructure in here. So this has different kind of messages in here that we can send.
Starting point is 00:36:50 This will also be used to communicate the data from the memory across the network. So there's not many different kind of networks on the CPU. But this is basically for cache invalidations for memory transfers. Yes? If CPU2, in the example, would decide to write the data to memory, the cache line to memory, what happens to the state of the cache line for all other CPUs in that case?
Starting point is 00:37:24 Will it still be invalid or? So in the case where, I mean, as soon as it, the question is, what if we have a write through by CPU2, say, in this last version, right? So as soon as we're changing this cache line and it's written through the memory, it's basically changed. So it needs to be invalidated everywhere else, right?
Starting point is 00:37:55 So it's basically as soon as there is a modification. Otherwise, I mean, we would have like older versions in the other caches. And so it needs to be re-read back into the cache somehow. I'm not sure if we could directly, when we actually write it back, if we could directly update it in the other caches. It could be possible, but I'm not sure if there's a mechanic for this. You probably also don't know, right?
Starting point is 00:38:26 Okay. Okay. Further questions on this? No? Okay, then let's start with the locking. So, basically, while this on the cache lines is kind of nice and fine, so we're basically having a different kind of... The CPU ensures that the cache lines are coherent across different nodes. We still want to make sure that another cache or another core cannot change what we're working on right now. So we basically want to ensure that this core that wants to do the changes on the
Starting point is 00:39:08 certain part of the data structure has exclusive access to it, so that no other core does. Because just using the cache coherence protocol, whoever is first can basically change our cache line, and if we have multiple cache lines to change at a time, this will basically mess up our data structure anyway. So this is kind of clear. So as soon as we're not on the single cache line level anymore, then this kind of coherency doesn't help or doesn't save us from breaking our data structures.
Starting point is 00:39:44 Because if our data structure consists of multiple cache lines, if we're doing concurrent updates, everything kind of gets messy. So we want some way of basically telling other cores: don't do anything here right now, I need exclusive access to this. And this can be implemented on different kinds of levels, right? And so one way to do this is atomics. So in x86, there's a lock prefix for different kinds of intrinsics or different kinds of operations that basically tells the CPU or the other cores, basically in the instruction,
Starting point is 00:40:33 it basically says nobody else should read or write this value until I'm done with it. This basically means nobody else can touch this value. It cannot go to another cache. Nobody else can get an exclusive access or modify it while it's still in my cache. And this is not the default case because it's slow. This basically means any kind of other access will be stopped and we'll have to wait for this. And there's different kind of operations that have this. So some of them have it basically always,
Starting point is 00:41:08 and some of them you explicitly have to tell, oh, we need a lock here. And compare-and-swap with a lock, for example, means we do a compare-and-swap atomically with a lock. So I want to completely do this operation before anybody else can touch this again.
Starting point is 00:41:31 So a compare-and-swap is basically: I'm comparing the value to another value, so I have my cache line or my value, I compare it to something, and based on the comparison, I either swap it or I keep it. Then I have an exchange operation. This automatically locks the operation; I don't need to say lock again. So lock itself is not an operation, it's just a prefix. And the exchange basically just says, okay, exchange two values, right? So we're just basically swapping.
Starting point is 00:42:10 Then we have a read-modify-write operation, so something like an add, an addition. So if we want a counter, for example, we can lock this, we can have an atomic counter that basically says, oh, I'm incrementing this atomically. So I'm going there and I'm making sure that this is incremented atomically.
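In C++ these operations are available through std::atomic; a small sketch (on x86 these typically compile down to lock-prefixed instructions such as lock cmpxchg, xchg, and lock xadd, matching what is described here):

```cpp
#include <atomic>

std::atomic<int> value{0};
std::atomic<long> counter{0};

void atomic_examples() {
    // Compare-and-swap: if value still holds 0, replace it with 42;
    // otherwise 'expected' is updated to the current value and false is returned.
    int expected = 0;
    bool swapped = value.compare_exchange_strong(expected, 42);

    // Exchange: unconditionally swap in a new value and get the old one back
    // (on x86 this is the implicitly locked xchg instruction).
    int old = value.exchange(7);

    // Read-modify-write: an atomic counter increment.
    counter.fetch_add(1);

    (void)swapped; (void)old;  // silence unused-variable warnings in this sketch
}
```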
Starting point is 00:42:41 One thing that you need to ensure: if the compiler or you emit non-temporal stores, you also need some kind of fencing around your code so it's not reordered. So the non-temporal stores are the ones that go right through. So we're writing all the way down to the memory. And in general, if you have this, the CPU or the compiler
Starting point is 00:43:07 might think, well, this is actually slow, right? So writing through all the way to the memory, that takes some time. So let's just reorder this, right? So I can do something else in between. And that might break your code if you need some kind of locking around this, right? So you want to lock something.
Starting point is 00:43:24 You want to write through your value. All of a sudden, the CPU says, well, this takes forever, let me do something else in between, I don't want to wait for this. So you can do some fencing and say: within this code, please don't do any reordering here. So this portion of the code should be executed in the way that I wrote it, rather than having any kind of reordering.
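A hedged sketch of that point with x86 intrinsics (assuming SSE2 is available; whether the fence is actually needed depends on what ordering the surrounding code relies on):

```cpp
#include <atomic>
#include <immintrin.h>  // _mm_stream_si32, _mm_sfence

void publish(int* slot, std::atomic<bool>& ready) {
    // Non-temporal store: bypass the caches and write straight to memory.
    _mm_stream_si32(slot, 42);

    // Non-temporal stores are weakly ordered, so without a store fence the
    // flag below could become globally visible before the data itself.
    _mm_sfence();

    ready.store(true, std::memory_order_release);
}
```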
Starting point is 00:43:42 Okay, so these are just basically atomic operations that you can use to build locks. So if you only need to lock a single value, that's enough, right? If your database is just a single value and you need many threads to update it, then you can just work with these atomics.
Starting point is 00:44:15 It will be slow, though, because this means you will always have, like, I mean, you're not going to get perfect speedup with many threads because this is always cache invalidating, right? So as soon as somebody else tries to access this, like in somebody else being another core, it needs to be invalidated in one cache, needs to go all the way to the other cache. So this will always take time.
Starting point is 00:44:39 Also just basically reading this goes always through this network. So that's kind of slowish, so you don't want to just constantly work on the single value here. And in general, you want to do more than that, right? You want to update multiple values. Say you will have your data structure, very simple, a linked list, right? So your linked list doesn't just contain
Starting point is 00:45:06 a single value, but it contains at least two values: it contains the value that you're storing plus the pointer to the next list item. That's the least that you need. And so this means if you want to change something in your linked list, you need to be able to work on both of these things at the same time, if you add a node, for example. Not only that, you need to make sure that the whole consistency of the data structure is still intact, so you actually want to lock more of this. You want to lock a larger part.
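Just to make that concrete, a minimal node type (nothing more than an illustration):

```cpp
// A minimal singly linked list node: even the smallest structural change
// (inserting a node) touches the new node *and* the predecessor's next
// pointer, so atomically updating one single word is not enough to keep
// the whole list consistent.
template <typename T>
struct ListNode {
    T value;
    ListNode* next;
};
```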
Starting point is 00:45:50 And now there are different general ways of locking such a data structure. One way is doing it pessimistically, meaning we're always exclusively locking. So whatever we're doing, we're exclusively locking, making sure that nobody else can touch this. And there are different kinds of granularities as well. But in general, if I'm reading, I'm locking. If I'm writing, I'm locking.
Starting point is 00:46:17 I can have shared reader-writer locks, et cetera, but like very pessimistic ways, just, okay, if I do something, I'm blocking, just to make sure nothing happens. That's safe, but often it's not needed. So if we're just reading all of the threads, like our data structure is all read, read, read, then just doing pessimistic locking basically is just an unnecessary overhead.
Starting point is 00:46:46 If nothing ever changes, well, we wouldn't have actually had to lock. And so for this, there's optimistic locking. And there the idea is that we're reading, and then we're checking: did something actually change? If nothing changed, it's fine. So we can basically just continue with our lives without ever having taken any lock. So we basically just need to read the value again, and here the caches help us again. So if I have my nodes that I'm reading, I have them in the cache, I can check: nothing changed. If the cache is still valid, right, so nothing has changed, I know all is fine, I can continue.
Starting point is 00:47:35 If my cache line is invalidated, if I have to basically go somewhere else, reread, whatever, then, well, we were too optimistic. So we basically have to maybe start again or do something else. And then we have the so-called lock-free, which is basically rather than doing some kind of locking, we're basically never blocking.
Starting point is 00:48:02 And so here, we're using the atomics for synchronization. And this basically means we're more or less just checking certain values. Is this good? So can I access this now? So I'm setting markers, basically. There's a marker. Is this free to access? If yes, then I'm
Starting point is 00:48:25 basically updating it, and I'm updating the marker atomically. And if somebody else wants to read it, or just tries to read it, well, it doesn't work, right? It's marked as in progress or updated, so I have to start again, or continue and wait for something to happen. But there's no real lock. So the threads can just continue reading these things. And this might be good for simple data structures, but might be inefficient because, again, if we're basically reading the same thing and we're not holding locks on anything,
Starting point is 00:49:03 this might mean that we're constantly invalidating our caches. So basically, we constantly have to check the different caches if we're using these kinds of markers. But we'll go through these separately. So we'll look at the different steps or different ways of doing this, and also combinations of them, separately after a quick break. Questions so far? But as I said, it doesn't make that much sense to ask in detail because I'm going to go through this in more detail after the break. So first, let's look at optimistic locking.
Starting point is 00:49:42 So what we do is basically we're validating that the data, so we're reading the data, we're just going through the data. Then after we've read the data, we're validating that the data in the critical section has not changed. That's basically, we're just making sure whatever we've done so far is still accurate
Starting point is 00:50:03 or is still correct. And then we can continue, right? In a linked list, you can imagine, this is fairly fast. So I basically have some kind of version counter or something like that, or I'm just looking at the value itself: I'm reading the value, doing whatever I want to do with the value, and I'm checking again, is the value still fine? Okay, good, let's go on. And this is good for frequently read data, because we don't have to do these atomic writes. The atomic writes are basically just for these locks, right? So we have something like a semaphore or whatever that blocks everything, and not doing this is basically much cheaper, because we also have everything in our cache, right?
Starting point is 00:50:56 So only if somebody else writes this will our cache be invalidated. And we know that not only by the cache invalidation (I mean, we don't necessarily see the cache invalidation in our code, right?), but if nothing has changed, everything will be in the cache, so everything will be fast. All these changes, or rather non-changes, we will see fairly quickly because everything's local. If something has changed, well, then we have to restart. We're basically starting again: reading the value, doing whatever we want to do with the value, and checking again that the value is still the same before continuing. And there's a little problem in here, because this part will actually be slower, right? As soon as something has changed, then our cache will be invalidated.
Starting point is 00:52:01 We have to read the new value. And if we're at a point in our data structure that is constantly changed, then this means we could do this forever, right? So our thread that just wants to read this, if there are lots of changes on this part, then we're never going to be done with this, especially if the reading takes some time,
Starting point is 00:52:26 so a more complex data structure, etc. And of course, this is only good if there are no side effects, right? So everything we do we can basically restart. So whatever we do in our read operation, it cannot affect anything else already, because everything we did there is basically invalidated already. Meaning, say we want to build a hash table, for example. If we're inserting this data in our hash table, then we have to be able to revert this change if the value has changed. And as I said, if it's too expensive to invalidate this or to restart this,
Starting point is 00:53:09 this might be not good. And if we have too much contention, we have to restart too frequently, then we might starve. So we also don't want that. So we might fall back to pessimistic locking at a certain point. So this is something that you would,
Starting point is 00:53:26 how you would actually implement this in one way or the other. So we have some kind of read callback. So this is what we do if our read is good, or what we do in our read. So for a certain number of attempts that we are happy to wait for, we're basically trying, we're getting the current version. If it's locked, well, then we have to continue, we have to start again.
Starting point is 00:53:58 If not, then we're actually reading the value, we're doing whatever we want to do in our read. We're checking the version again. If the version is the same, we're actually reading the value, we're doing whatever we want to do in our read, we're checking the version again. If the version is the same, we're good. Otherwise, we basically have to start again. So, we basically have to try again, we're reading. We don't do anything except for reading the value, so this is optimistic. If it is locked forever, for example, we're not going to get anywhere. If it's changed all the time, we're not going to get anywhere. So after a certain amount of attempts, we'll fall back to pessimistic locking. So meaning,
Starting point is 00:54:36 we're going to basically put our foot down and say, well, whenever the lock gets free, so everybody else who's waiting for the lock has got their lock. We're going to lock it as well. And then we're just going to read as well in an exclusive mode rather than in an optimistic mode. So this is cool because this, of course, multiple threads can do at the same time, right? So this part, like this is all fine.
Starting point is 00:55:02 Everybody can do at the same time. Everybody can read. Only if it's locked, we're basically getting, we're going to have to reiterate and redo, or if something's changed. Again, if we wait too much or have to do too many attempts, then we're going to start exclusively locking in order not to starve. Okay. So now the question is, I mean, this upper part is very simple, right? So we just
Starting point is 00:55:37 So now the question is, this upper part is very simple, right? We just basically read the versions. Now we have to think about how to implement the lock down there. And so this is basically also telling us: is this locked? Or now we want to lock, right? And there are two major strategies for how to implement pessimistic locking. And for each of those, there are many ways they can be implemented internally. And there are basically spinning locks in user space.
Starting point is 00:56:07 And there are blocking OS locks. And a spinning lock in user space, or spin lock, is basically some kind of data item that a thread just frequently polls and checks: is it free? Is it free? This is why it's called a spin lock: the thread basically continuously checks, is this free, is this free. It's busy waiting, simple to implement, and it's fairly
Starting point is 00:56:52 cost efficient. If it's implemented well, the number of cache invalidations needed to do this is fairly low. We're basically, and we'll go through how this can be done, just checking our cache all the time, and once our lock is updated, the cache needs to be invalidated. And if we're changing the lock to locked again, the cache needs to be invalidated elsewhere.
Starting point is 00:57:31 So that would be a fairly efficient way of implementing this, while a blocking OS lock is something like a mutex or futex in user space. So there's a mutex in the standard template library, for example. There you get an actual thread context switch. So here our thread basically goes to sleep if it's locked, and then it will be woken up as soon as the lock is free again.
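For comparison, the blocking variant is just the standard mutex; a minimal usage sketch (on Linux this is typically backed by a futex, so a blocked thread really sleeps):

```cpp
#include <mutex>

std::mutex m;
long balance = 0;

void deposit(long amount) {
    std::lock_guard<std::mutex> guard(m);  // may put the thread to sleep if contended
    balance += amount;
}                                           // unlocking wakes up a waiting thread
```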
Starting point is 00:58:03 So this is, of course, much more costly than just a cache invalidation. We've said that, right? I mean, it's still much cheaper than switching a process, but still, we have to do some updates in the OS, or the OS does some maintenance, and that's more costly than just keeping the threads running. But then we're burning cycles, right? So if we have locks that are very long running, like very large changes in my data structures, and I can do this with multiple threads, et cetera, then it might actually make sense to take the threads that are just blocked for a long time
Starting point is 00:58:52 and put them to sleep rather than have them burn cycles in a waiting fashion, right? So there are always trade-offs. That's the general rule here, right? There are always trade-offs, and you have to think about what makes sense when. But in the case where we have small changes in the data structure, a spin lock might actually make sense. And there are different variations, as we said, that we can use. And one is a test-and-set spin lock.
Starting point is 00:59:23 So here, basically, we're setting and we're returning the old value. So essentially this could be just a bit that tells us: is this locked? So then what I'm doing is the test-and-set. I'm saying, well, set it to locked and tell me what it was before. If it was locked before, then, well,
Starting point is 00:59:49 I didn't change anything, right? So it was locked before, so I'm blocked, right? So I basically have to wait. If it was not locked, then I actually put the lock there, so I'm the person who holds the lock right now. Right, so essentially, I'm doing an update and I get back the old value. So I'm definitely going to set it to locked but the question was what was it before?
Starting point is 01:00:16 Was it unlocked? If it was unlocked before, then now I have the lock. If it was locked before, I don't have the lock, and whoever had the lock before still keeps the lock. Of course, this is something you can easily get wrong. If you have a bug here, you might still update something else. So you have to ensure that whatever you implement there
Starting point is 01:00:43 actually has the semantics you intend. But if you're doing this, it works, right? So if you're keeping this kind of semantics, say I'm setting a bit to one if it's locked, and if I'm reading the bit and it was set to one, then I don't hold the lock, and then it's basically fine. So this is good: it's efficient, it's small. But it's busy polling, and the problem is it doesn't scale, because we have contention on a single cache line.
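A minimal sketch of such a test-and-set spinlock (std::atomic_flag::test_and_set is exactly this set-and-return-the-old-value operation):

```cpp
#include <atomic>

class TasSpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set sets the flag and returns the *old* value:
        // if it was already set, somebody else holds the lock, so keep spinning.
        while (locked.test_and_set(std::memory_order_acquire)) {
            // busy-wait; every iteration fights over the same cache line
        }
    }
    void unlock() {
        locked.clear(std::memory_order_release);
    }
};
```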
Starting point is 01:01:21 So this basically means everybody who wants this lock will work on the exact same cache line. So we basically always have to go to the same cache line; different threads, different cores will need the same cache line for every individual access. Whenever I'm polling this lock, whenever I'm checking this lock, this means it will be invalidated everywhere else, because I'm updating the lock, even though I'm just writing the old value back. So this is basically, yeah, it's slow because we have this contention on the single cache line. It's not really a problem for the thread that's working, that holds the lock, because the thread that holds the lock doesn't have to do anything with the lock while it holds
Starting point is 01:02:10 the lock. But everybody else will be fairly slow in updating this. And the more threads are on the same cache line, the more they will just have to wait to be able to update this, because everybody else is basically always invalidated and has to read it back. So the issue is that we have this contention, but this works, right? In order to make this more efficient, we can use a queue-based spin lock. And that's actually quite a neat idea.
Starting point is 01:02:40 So the idea here is that rather than having a single lock, a single bit, for example, that every thread polls on and busy-waits on, we'll create a separate lock for every thread itself. And we'll put them in a queue. So what we'll do is basically: the first thread that comes along and needs a lock puts down a base latch, and that's like the head of a queue. And the next thread that wants to access this will basically just append a new queue item and then poll on this next queue item. Right, so basically the first CPU or first core in the queue holds the lock and will only release this lock once it's done.
Starting point is 01:03:33 The next CPU basically has like a new queue item and it will always only point to like test this one. So in here, we basically, what we need is here, we need this bit. Okay, this is locked. And we need a next pointer that tells us where is the next item. So this is then we'll basically have the CPU2
Starting point is 01:03:56 just poll on this one. We'll check, is this still locked? If it's still locked, I'm just going to continue here. If the next CPU comes, it will just create a new queue item that this CPU will spin on. As soon as this CPU releases the lock, it will basically release the lock in here and release the lock in the next node, because then this node knows: oh, now the lock is free, right? So now I can basically continue working, or I can set this lock and work on this. And only this CPU 2 will ever poll on this lock.
Starting point is 01:04:36 So the only time that somebody other than this CPU will touch this latch or this lock is when the previous CPU releases the lock. The big difference is that these latches will be in the cache of the individual cores. If we have a single spin lock, then this will always be in different caches, right? Every access to this lock is an update, so every time it's invalidated we need the cache invalidation, we need the communication across the network, et cetera, the chip fabric.
Starting point is 01:05:19 Here, we only needed to set up the queue, so in order to add a new queue item, we need to basically check the previous one. Oh, okay, this is basically the tail. I mean, we might have to go through the whole queue, find the tail, add our new latch, and then we're just spinning on that latch rather than spinning on each of the others.
Starting point is 01:05:43 So as soon as all of the previous latches are gone, then I can continue. Every new thread will just create a new latch and will spin on that. And because we don't have this cache coherency problem there, it scales much better. I mean, we're still busy waiting, we're still continuously polling the lock, but we're much faster here.
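A hedged sketch of one well-known variant of this idea, an MCS-style queue lock (simplified; production versions are more careful about memory ordering and node lifetime). Each thread brings its own queue node and spins only on its own flag, which stays in its own cache; here appending uses a single atomic exchange on a tail pointer rather than walking the queue:

```cpp
#include <atomic>

struct QueueNode {
    std::atomic<QueueNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class QueueLock {  // MCS-style queue-based spinlock
    std::atomic<QueueNode*> tail{nullptr};
public:
    void lock(QueueNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Append ourselves to the queue with one atomic exchange on the tail.
        QueueNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev == nullptr) return;               // queue was empty: we own the lock
        prev->next.store(&me, std::memory_order_release);
        // Spin only on our own node; it stays in our core's cache until the
        // predecessor hands the lock over by clearing our flag.
        while (me.locked.load(std::memory_order_acquire)) { /* spin */ }
    }

    void unlock(QueueNode& me) {
        QueueNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // Nobody visible behind us: try to swing the tail back to empty.
            QueueNode* expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A successor is in the middle of enqueueing; wait until it links itself.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { }
        }
        succ->locked.store(false, std::memory_order_release);  // hand the lock over
    }
};
```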
Starting point is 01:06:19 And of course, rather than just having a single lock, we can also have reader and writer locks. So we can allow concurrent readers, and we can have exclusive locks just for writes, et cetera. So we can use different kinds of locks for different kinds of methods to ensure that we don't always have these test-and-sets, et cetera, and we're not always invalidating the caches. So the idea here is the same, right?
Starting point is 01:06:47 Or the problem that we're solving is the same. Of course, you would say, or you probably already understood, but you could say: why would we have a separate lock for the reader? Right, we could just be optimistic about this or something. But as soon as we want to check for any kind of lock, again, we have to read it into our cache, even if we don't change anything. We just have to read it into the cache to check if there's a lock. And as soon as a thread spins on the lock,
Starting point is 01:07:25 this will be an invalidation of the cache. Again, this means there is some communication. So having a separate lock that's just read, where we don't do any updates because the writer lock hasn't changed or anything, right? If we have the separate lock for the reading, and if this can be shared because we don't have any updates, we don't need to invalidate the caches. It might be more efficient there.
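For reader-writer locking, the C++ standard library already ships a shared mutex; a small usage sketch (note that even acquiring the shared lock writes to the lock word internally, so the contention argument from above still applies, which is why read-heavy paths often prefer optimistic or versioned reads):

```cpp
#include <shared_mutex>

std::shared_mutex rw_latch;
int shared_value = 0;

int read_value() {
    std::shared_lock<std::shared_mutex> guard(rw_latch);  // many readers in parallel
    return shared_value;
}

void write_value(int v) {
    std::unique_lock<std::shared_mutex> guard(rw_latch);  // exclusive writer
    shared_value = v;
}
```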
Starting point is 01:07:50 Okay? Good. So what are the requirements for latches? So in general, most database accesses are read-mostly. So usually, most of the time, what we're doing is reading. That's actually not true for cloud databases.
Starting point is 01:08:23 That's an interesting fact, right? So these days everybody stores everything, and a lot of the data that gets stored is actually garbage. Nobody ever touches it again. So this kind of breaks this assumption. If you're just dumping all your data, eventually you will end up with a lot of data that you're not using. But assuming you have a database that's fairly efficient and you have meaningful data in there, then most likely you'll mostly read. So most of the accesses will basically be reads, even in an OLTP workload. So in general, I mean, we're doing updates, but in order to do the updates, we'll still
Starting point is 01:09:01 have to read the values first, right? Only if we're insert-only, or insert-mostly, do we really have a write-mostly workload. As soon as we're working with the data, most of the accesses will be reads. And that means reading should be fast and scalable. This is also why we might consider having a separate lock for reading, et cetera. Also, if we have a tree-based data structure, which we often have, or we have some kind of directory or something,
Starting point is 01:09:41 like the root nodes, whatever, the directory, this needs to be traversed a lot. And in a tree, the root node will always be traversed. This means we might have high contention in such hotspots. So this needs to be lockable with a very low overhead. So this should not be expensive. Locking these nodes, especially for reading, should be fairly cheap.
Starting point is 01:10:12 Latency is often critical in database systems, or at least we want to be efficient. We're talking about hardware-efficient work here, so let's keep it fast. So we don't want too much context switching, and we don't really want to use OS locks most of the time. I told you, in some cases, if we know this is going to take a long time, so something needs to be locked for a long time
Starting point is 01:10:40 and all of the other threads will not be able to make progress anyway, then it might make sense to use OS locks, because then the other threads are sleeping rather than busy polling. And well, if we have fine-grained data structures, very small nodes, linked lists, ART, et cetera, then we also want fine-grained locks, and we want to have small locks. So the locks shouldn't be much larger than the data structure itself, or let's say about as large as the data structure, because otherwise it's just not efficient again. If we have 40 to 80 bytes for a node and we're doubling this size just by adding a lock, like a standard OS lock or something like this, well, this is not efficient.
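As a rough illustration of this size argument (not from the lecture; the struct layouts and names are invented), the following sketch compares a small list node that embeds a full OS-backed `std::mutex` with one that embeds a 4-byte atomic word usable as a lock bit plus version counter. The exact sizes are implementation-specific, so the sketch prints them instead of assuming particular values.

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <mutex>

// A small linked-list node with a heavyweight OS-backed lock embedded in it.
struct NodeWithMutex {
    std::uint64_t  key;
    std::uint64_t  value;
    NodeWithMutex* next;
    std::mutex     latch;    // often tens of bytes on common implementations
};

// The same node with a 4-byte atomic word that can hold a lock bit and a
// version counter, keeping the node close to its payload size.
struct NodeWithLockWord {
    std::uint64_t              key;
    std::uint64_t              value;
    NodeWithLockWord*          next;
    std::atomic<std::uint32_t> latch{0};
};

int main() {
    std::cout << "sizeof(std::mutex)       = " << sizeof(std::mutex) << " bytes\n";
    std::cout << "sizeof(NodeWithMutex)    = " << sizeof(NodeWithMutex) << " bytes\n";
    std::cout << "sizeof(NodeWithLockWord) = " << sizeof(NodeWithLockWord) << " bytes\n";
}
```

On a typical 64-bit Linux build the mutex alone is on the order of 40 bytes, so the latch roughly doubles a 24-byte payload, while the 4-byte word adds very little; but again, the concrete numbers depend on the platform.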
Starting point is 01:11:37 So let's see how we can be efficient there somehow. And of course we want to have efficient contention handling, and this is really where these queue-based locks, for example, come into play. Like we said, 24 cores in this older kind of CPU; with a modern CPU you get into the 40s, and of course you can have up to eight CPUs on a single board, so then we're in the hundreds of cores. Well, if they're all constantly invalidating each other's caches, this will be very slow, right? All of a sudden we're not going to have throughput anymore, just because there's a lot of contention and everything is bottlenecked by the synchronization. So we want to be able to handle the contention gracefully
Starting point is 01:12:36 without sacrificing the uncontended path, meaning without being slow when we're just reading, right? So ideally we still want to have optimistic reading in any case, or lock-free reading, so that we can basically always read the data if there's no other stuff going on. It makes sense to just be able to go without locks in any way. And so for that, there's a nice paper where Böttcher et al. looked at how different kinds of locks work in different kinds of setups, or for different kinds of workloads,
Starting point is 01:13:17 meaning read-only setups, where we're really just reading. There are some data structures that you're practically only reading; you're very rarely updating metadata in the database. Of course, every now and then it might be updated, but that might be a huge maintenance task, so that's a different story.
Starting point is 01:13:39 Otherwise we're just reading the data structure. That's one case. Then read-mostly with cheap reads. That's basically something like a small data structure where the read is simple, meaning we can directly read the data items in our data structure. A linked list is a very cheap read: I read the node, I read the value, I read the next pointer.
Starting point is 01:14:08 The node consists of the value and the pointer to the next node; reading this is fairly cheap. It's a different story if I have a compressed node where I need some SIMD instructions, et cetera, to decompress the node and get an understanding of what's actually in there. So all of a sudden my reading might take many cycles. Or I have a large node, a B-tree node for example, that's not just a single cache line; that might be a
Starting point is 01:14:37 larger memory page, et cetera. So if I have a big read set, that's again a different story, because the read will take a long time. This means that while I'm in the node, reading it will take some time, and some other thread might come in and somehow annoy me with updates in the same node. If it's a cheap read, there might be another thread coming, but even if it is, I'm already done with my read. For an expensive read, that might not be the case. Then I have write-heavy: lots of writes. And write-only is basically, say, a log; a log is write-only. That's a different story. And here, they basically looked at these
Starting point is 01:15:24 different types of workloads and tried it out. You can read the paper; it's from DaMoN. DaMoN is a workshop at SIGMOD, and the SIGMOD conference is actually next week, so I suggest checking out the program. You can basically see what kind of papers are out there, what's current research. And DaMoN is the workshop there that's about modern hardware. So there, people are trying to come up with new ideas on how to use modern hardware. This is why we also have some of the pointers here. And in a read-only use case,
Starting point is 01:16:01 we'll always go for optimistic locking, or it makes sense to always go for optimistic locking, because there's no overhead. We still have this version checking, but we're not waiting for anything, right? So we don't have any cache invalidations or anything. So that makes a lot of sense. Exclusive is too restrictive,
Starting point is 01:16:19 just completely unnecessary overhead. With shared locks we basically get read-read contention, right? We might still be updating and checking the same kind of nodes, checking the same kind of locks, which is still more costly than going for optimistic. If we have cheap reads and we read mostly, then optimistic locking still makes a lot of sense, because restarts are unlikely and essentially it's a cheap read.
Starting point is 01:16:59 If there's a restart, rereading is not really a big problem. An exclusive lock, well, is again too restrictive, right? With a shared lock, well, there might still be some contention on the lock, but it's okay, right? In a read-mostly setup where we have big reads, the shared lock all of a sudden actually makes sense, because the shared lock, I mean,
Starting point is 01:17:30 it is still some overhead to have this lock, right, to create this lock; but because the read is expensive, the relative cost of the lock gets less and less. And on the other hand, if we have restarts, even though restarts are unlikely, if a restart happens with an optimistic lock, because the reading is expensive, this might annoy us. So it might actually make sense to have this shared lock, where
Starting point is 01:17:59 we're basically also setting a lock for reading, because we avoid any kind of restarts during the reads. And then in a write-heavy setup, shared locks are good. Exclusive locks are always quite restrictive, and an optimistic lock all of a sudden would be kind of stupid, because we would have so many restarts. Right, because even with optimistic locking we still need reads,
Starting point is 01:18:34 even in a write-heavy setup, just to find the node that we actually want to update, right, to find the data item. So even if it's update-heavy, we will do some reads, but we might not even get there, because everything is being updated all the time. So we have lots of aborts and we might have some starvation. Because of that, of course, we additionally have these pessimistic locks alongside our optimistic lock to be safe.
Starting point is 01:19:07 But a shared lock might make more sense. In a write-only setup, well, then we will need exclusive locks eventually, or it doesn't really make a difference, because the writes are exclusive anyway, right? So the shared lock doesn't really make a difference for us compared to just locking everything exclusively. Okay.
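To make the "optimistic locking with version checking" option in this discussion concrete, here is a minimal sketch of a version-based latch with an optimistic read path that restarts on conflict. This is only an illustration under simplified assumptions (the names are invented, and a real implementation needs more care, for example readers must be able to tolerate data that a writer is concurrently changing).

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <iostream>

// One 64-bit word: bit 0 is the "exclusively locked" flag, the remaining bits
// act as a version counter that is bumped whenever a writer unlocks.
class OptimisticLatch {
    std::atomic<std::uint64_t> word_{0};

public:
    // Optimistic read: remember the version, do the read without taking any
    // lock, then verify that no writer intervened; otherwise restart.
    std::uint64_t read_optimistically(const std::function<std::uint64_t()>& read) {
        while (true) {
            std::uint64_t pre = word_.load(std::memory_order_acquire);
            if (pre & 1) continue;                 // writer active, retry
            std::uint64_t result = read();         // speculative read, no writes
            std::uint64_t post = word_.load(std::memory_order_acquire);
            if (pre == post) return result;        // version unchanged: valid
            // Otherwise a writer interfered: restart (cheap if reads are cheap).
        }
    }

    void lock_exclusive() {
        while (true) {
            std::uint64_t cur = word_.load(std::memory_order_relaxed);
            if ((cur & 1) == 0 &&
                word_.compare_exchange_weak(cur, cur | 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
                return;
            }
            // Lock bit set or the CAS lost a race: spin and try again.
        }
    }

    void unlock_exclusive() {
        // Adding 1 to the odd (locked) value clears the lock bit and bumps the
        // version, so concurrent optimistic readers will notice and restart.
        word_.fetch_add(1, std::memory_order_release);
    }
};

int main() {
    OptimisticLatch latch;
    std::uint64_t value = 41;

    latch.lock_exclusive();     // writer path: exclusive lock, modify, unlock
    value = 42;
    latch.unlock_exclusive();   // bumps the version

    // Reader path: would restart automatically if a writer intervened.
    std::uint64_t seen = latch.read_optimistically([&] { return value; });
    std::cout << "seen = " << seen << "\n";
}
```

This matches the trade-off above: with cheap reads the restart path is rarely taken and costs little, while for big, expensive reads a shared (pessimistic read) lock can be the better choice, because a restart would throw away a lot of work.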
Starting point is 01:19:35 Okay. So I have the data structures part now, but we're so far along in time already that I would say I'll continue with this next time, meaning tomorrow. Do we have questions up to here? As a reminder, we talked about cache coherence, and this is really why we talked about it: because cache coherence, how the caches work, how the memory accesses work, et cetera, is why different kinds of locks have different properties
Starting point is 01:20:10 and why it makes sense to kind of build these structures and queues of locks just to make sure that we're not basically destroying our caches all the time. So, I mean, the caches are good if we have our data local, if we have shared access to the data, so that's all fine. But as soon as we're locking, we're updating. With these updates and spin locks, we get all these cache invalidations, and that's slow.
Starting point is 01:20:41 That's basically why we can be smarter about it, building these kinds of queues of locks, et cetera, that make it faster. OK. Yes? Actually, I have one question about the test-and-set spin lock. You said this is inefficient because you always update the cache line, so it's invalidated in all other cores.
Starting point is 01:21:08 Yes. What if that operation only sets the value if the lock is not set? Is it difficult to make it atomic then? Or why do we always update the cache line, always update the lock, if it's not changed? So, you can also implement it that way. The lock itself will then be more work; there are just a couple more cycles that you need for this set. This test-and-set is an atomic operation that's intrinsic in the CPU, really more or less a single-cycle operation. So that's very fast. If you start branching,
Starting point is 01:22:13 well, then you have branch prediction, all the additional stuff. You can implement it like that, but it's just going to be more cycles. There might also be atomics for that, but this is the way you implement it most efficiently. I mean, you might be able to do it with compare-and-swap, et cetera, but usually this is a single operation that you do. Otherwise you have multiple operations, meaning multiple cycles, where you basically need to check, and you need to branch to see what's going on. This is basically branch-free, right?
Starting point is 01:22:55 So think about it similarly to the predication, et cetera: it's a single operation. I have no idea if it would make sense to actually implement it in a different way and then maybe be faster. Try it. Well, it might be possible, right? I mean, there are these different kinds of ways you can implement this, we had this here somewhere, so there might be a different way of implementing it. I've actually not really questioned it so far, so it might make sense to think about it.
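For reference, the variant the question describes (read first, and only do the atomic write when the lock looks free) is commonly called test-and-test-and-set. Below is a minimal sketch contrasting it with the plain test-and-set spin lock. This is an illustration, not the lecture's code, and as discussed above, whether it is actually faster on a given machine has to be measured.

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Plain test-and-set: every acquisition attempt is an atomic write, so every
// spin iteration invalidates the lock's cache line in all other cores.
class TasSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Test-and-test-and-set: spin on a plain load first (the cache line can stay
// in shared state), and only pay for the atomic write when the lock looks free.
class TtasSpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        while (true) {
            while (locked_.load(std::memory_order_relaxed)) { /* read-only spin */ }
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;          // we got the lock
            }
            // Someone else won the race; go back to read-only spinning.
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

int main() {
    TtasSpinLock lock;           // swap in TasSpinLock to compare behavior
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < 100000; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(work);
    for (auto& th : threads) th.join();
    std::cout << "counter = " << counter << "\n";  // expect 400000
}
```

The extra branch in the test-and-test-and-set version is exactly the additional work mentioned in the answer; under low contention it buys little, while under high contention it can reduce coherence traffic.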
Starting point is 01:23:32 I've actually not really questioned it so far. So it might make sense to think about it. Maybe as a follow up, are there some where there's some tries to make cache so the state of the cache changing the state of the cache when we write a value are there some motivations to make it more smarter that that the other cache lines are not invalidated if the same value is written again. So the cache line doesn't change in any way? OK, so the question is, what if I'm updating the cache line and it's not changed?
Starting point is 01:24:16 So why would I invalidate the cache line? Good question. I don't know if the CPU checks this; it might be possible. I would have to check, actually. Okay, further questions? No further questions? Okay, well, then, thank you very much. And see you tomorrow.
