Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Concurrency and Synchronization

Episode Date: June 13, 2023

...

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome. So, next session of Hardware-Conscious Data Processing. Today we're going to talk about concurrency and synchronization once we've finished up the sort-merge join. But first, a quick announcement. Tomorrow we're actually going to be upstairs. I said this last time already, but now again: don't walk in here, there's going to be something else in here. Go upstairs. It's also a nice small classroom, so it should be fun.
Starting point is 00:00:39 Otherwise we're still in the CPU package. Soon done with the CPU package. So today, in the second part, I will talk about, or main part, actually, we'll talk about locking. So if we have some parallelism, how do we synchronize between different cores? Make sure that during the parallelism, if we update our data structures, that's mostly what we're going to talk about. We're not going to break them. And then, next time, we're going to talk about what happens if we have multiple CPUs and non-uniform memory access
Starting point is 00:01:23 before finally looking at PMEM. And then we're out of the CPU package and going to look at the peripherals. So we're in the second part, with only the second half of the timeline left. Today locking, next time NUMA, and then you will see task three, which is about locking. We'll also have task four sort of about locking. We said hash table, but most likely
Starting point is 00:01:49 it's not gonna be a hash table. We're still working on this one. So we'll have some kind of lock-free data structure for you here. Still designing around it to make sure that we can test it well and you can easily understand that we have all of the information that you need. Okay, so going back to the other slides, let me find them. So this is where we left off last time in the sort merge join. So important to remember is that it's two phases, right? So we're doing
Starting point is 00:02:28 the sorting and then in the second phase, we're doing merging. And also the sorting basically works in a parallel fashion. The sorting works in the same way, right? So we're also doing small sorting, it's like small pieces of the data, and then merging again. And so smaller runs, and then we're merging these runs. And since we're on hardware, so we want to be efficient on parallel hardware,
Starting point is 00:03:01 on the one hand, we have to parallelize this, and on the other hand, we have to look at the cache hierarchies. And for this is basically the idea that we always stay in, or as long as we can, we stay within the caches to make the merging efficient.
Starting point is 00:03:19 And the sorting in general also efficient, right? As soon as we kind of have to go randomly out of the caches, we'll have lots of cache misses, we'll always be memory bound basically in every each operation, so the CPU will wait a lot. If we can do something in the caches, it's going to be faster. And so we said we're starting with a sorting network,
Starting point is 00:03:44 and that's basically like I explained this last time, right? So a simple implementation, if we're not doing it in hardware, would be using min, max. And this can be done with a difference or zero. This is where I got stuck last time, right? So basically difference or zero gives you either zero back or the difference between the two, which would be a positive or a negative number, which then means if we're doing a max, for example, and we're doing difference or 0 of x and y.
Starting point is 00:04:17 So in this example here, right? So then the difference, if y is larger, then the difference would be 0. Or if y is smaller, then the difference would basically be x minus y. So in the end, we're going to get the addition, which means we're going to add up as much such that we have Y again, that we have X again, because X is larger, we said. And so in this way, with difference or zero, we can do basically a branch-free min and max implementation. And then we can combine these branch-free min and max to this And then we can combine these brand free min and max
Starting point is 00:05:05 And then we can combine these branch-free min and max operations into a sorting network, where in several steps we can do comparisons in parallel. Say our input is A, B, C, D: then we can do these four min/max comparisons in parallel, then we have E, F, G, H, then we can do these three in parallel, if I see correctly, and finally we can do these last ones in parallel.
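To make this concrete, here is a small scalar sketch of a four-element sorting network in C++ (my own illustration; this is one common five-comparator network, the exact layout on the slide may differ):

#include <array>
#include <utility>

// One comparator: after the call, a holds the smaller and b the larger value.
inline void compare_exchange(int& a, int& b) {
    if (a > b) std::swap(a, b);   // could also use the branchless min/max from above
}

// Five comparators sort any four values; the pairs within each step are
// independent and could run in parallel (or as SIMD lanes / FPGA comparators).
void sort4(std::array<int, 4>& v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);   // independent of the previous one
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);   // independent of the previous one
    compare_exchange(v[1], v[2]);   // final comparator
}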
Starting point is 00:05:38 We have to check again. And that means we have more parallelism. And we can also do this in, as I said, we can also do this in hardware. We'll see this later. Maybe, I don't remember fully. But say something like an FPGA, this is how you can have very efficient sorting in an FPGA
Starting point is 00:05:59 by building exactly these networks where you do these comparisons quite quickly and efficiently. And this is branchless. And because it's branchless, we can also do this in SIMD. But again, we have to think in SIMD. So in SIMD, we cannot do something or SIMD operations do something in parallel on multiple values in the registers So it's not we're not comparing within a single register. So we have to compare across registers
Starting point is 00:06:35 So if we're doing a min/max, this basically means, say, for example, if we're using SSE, we're going to do this for four integers in parallel. So rather than doing one min/max here, we're doing four comparisons in parallel across the registers, and we're going to get the output in the other register.
Starting point is 00:07:08 So that means in a single step, we can do multiple comparisons in AVX 512. We can do 16, basically, comparisons in parallel. But then we only have min max like between these pairs we don't have them like across the whole register so then uh the next in the next step basically we have to shuffle again right so first we do the the comparison across the registers. And then we can do a shuffle in order to shuffle them back and generate runs. So in the initial step, we might have something like this.
Starting point is 00:07:55 So we have four registers. So now we're going to do multiple comparisons, not just between two, but between four, in order to get across all four in every column basically min-max. So that's basically the sorting across using min-max, using this kind of min-max network. And then in the second step we need to shuffle.
Starting point is 00:08:25 So we basically need to say, well, this value will be fine here, but this value needs to go here. This value is fine. This value basically needs to go here. This value needs to go here. So we're basically switching the order in order to then for each, or we're basically shuffling in order for each of the registers then to contain one sorted run. And then in an output, we basically have a sorted or we have multiple sorted outputs. So far, so good, right? So we basically come up with in this case four
Starting point is 00:09:09 or runs in the length of four but with AVX 512 it would be runs in the length of 16. And then these we have to merge again, right? So that would be like in this sort merge fashion or like in a sort-merge fashion or like in a multi-way merge sort. Basically, we have to merge those again. And this we can do in a bitonic merge network, which is basically very similar to the sorting network, but it does a merging. So it doesn't do a sorting, it just does a merging of multiple inputs. And say this, for example, would be a merge network
Starting point is 00:09:53 and a bitonic merge network means basically we're merging two sorted inputs. And here the comparison, let's say, the A's are sorted in this way and B's are sorted in this way. Then with this kind of network, we will get a complete sorted output. So basically always comparing these pairs.
Starting point is 00:10:15 In the end, we basically get all of the comparisons we need in order to have a sorted output. And unlike the sorting network, this network really uses the information that these runs are already sorted. If they're not, then we need a sorting network, which does more comparisons, right?
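Here is a small scalar C++ sketch of a bitonic merge of two sorted runs of length four (my own illustration; in SIMD this maps to a fixed sequence of min/max and shuffle instructions):

#include <array>
#include <utility>

// Merge two ascending runs of length four: reversing the second run gives a
// bitonic sequence, and three compare-exchange stages (distance 4, 2, 1)
// turn it into one fully sorted run of length eight.
std::array<int, 8> bitonic_merge4(const std::array<int, 4>& a,
                                  const std::array<int, 4>& b) {
    std::array<int, 8> v = {a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0]};
    for (int dist = 4; dist >= 1; dist /= 2) {
        for (int i = 0; i < 8; ++i) {
            // compare only elements in the lower half of each block of size 2*dist
            if ((i & dist) == 0 && v[i] > v[i + dist]) std::swap(v[i], v[i + dist]);
        }
    }
    return v;
}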
Starting point is 00:10:39 So here, yeah. With this, we can again create larger outputs, so then we have even larger runs that are sorted. And with those we can do the final step, or in this step we would again do a multi-way merge sort. So then we have these sorted runs that come out of the bitonic merge. So first we do the sorting network, then we do the bitonic merge,
Starting point is 00:11:15 and then we can basically merge again. And here, again, we want a cache-sized queue. So we don't want to basically have the data or load so much data that it's not in the queue anymore or not in the cache anymore but we want to make sure that we really utilize the caches and that we also the our output somehow fits in the caches such that we're not basically always need to write to memory and read from memory. But we always try to keep the data in memory-sized chunks. And then we can do this in multiple phases, basically always making sure that the intermediates fit into cache, just like a regular multi-way merge sort would do. With the difference that if you have a multi-way merge sort
Starting point is 00:12:12 or multi-phase multi-way merge sort, as we, for example, have it in big data systems, we discussed it, there we always make sure that everything fits into memory. And here we want to make sure that everything fits into cache, like the intermediates, such that we are most efficient when accessing the data. And that then in the end, like after the merging, of course, we're basically done with the sorting. Then we have a sorted output.
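To make the multi-way merge step concrete, here is a small C++ sketch (my own illustration, not code from the lecture): a k-way merge over sorted runs using a min-heap. The fan-in, i.e. how many runs you feed into one merge step, is what you would limit so that the working set stays cache-resident; the code itself does not measure cache sizes.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted runs with a min-heap. The caller chooses how many
// runs go into one merge step; keeping that fan-in low enough is what keeps
// the working set (one block per run plus the output buffer) in cache.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, size_t>;   // (value, index of the run it came from)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    std::vector<int> out;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    while (!heap.empty()) {
        auto [val, r] = heap.top();
        heap.pop();
        out.push_back(val);                 // in practice, flushed in cache-sized chunks
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}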
Starting point is 00:12:43 And what do we need to do then in order to get our full join? Remember from last time: this is basically one table, or one input, that comes in. We need to do the same on the other side, and then we're basically here, in this stage, where we do the merging. And the merging, how would we do this, presumably? Yes, and how much will we load every time? Exactly. So we're going to make sure everything fits into cache, so our caches will be efficiently used,
Starting point is 00:13:42 and then we just have to basically merge. So in this case, we have two inputs. So it's going to be, yeah. We just need to make sure that we're not loading too little. We're not working with too much data at the time. Otherwise, we'll basically not be efficient with the caches. OK, so this is everything I have for the multi-core parallelism. So we talked about this basics of parallelism
Starting point is 00:14:09 and inter- and intra-query parallelism and inter- and intra-operator parallelism. And we actually want intra-operator parallelism if we're talking about large amounts of data, or if we're talking about long-running queries. In short-running queries, and we have many queries in parallel, then inter-query parallelism might be enough, because we can basically utilize our system just
Starting point is 00:14:42 by having enough requests, similar to a web server, as we said. As soon as we have long-running queries, or we don't have that many queries in parallel, then in order to fully utilize our system we really need to break everything down into small tasks, small enough but not too small, such that we can actually utilize the hardware. We talked about the radix join and the
Starting point is 00:15:11 sort-merge join as examples for this. Questions so far? OK, then let's switch. Yes? I have one more question about the sorting: when we re-merge the sorted lists, how do they all fit in the cache? Because at the end, the list must be the full relation, right? Yes, the idea is that the chunks that we're working on are smaller parts. Yes, the runs will get longer and longer, but what we're loading basically needs
Starting point is 00:15:58 to fit into the caches. So we want to make sure that we don't always have to go to memory for every next value, and initially our runs can fit into the caches. But from the bottom part of the tree, we must wait for the entire top half of the tree, right? Say again? So, the tree. I mean, the question is also, basically, how many queues do we merge here, right?
Starting point is 00:16:40 If we have many queues, then our intermediates might not fit, or we might have to read through too many intermediates to fit this into the caches. That's kind of the idea. In a multi-way merge sort, one thing that you can decide is how many ways you're merging. Two-way means we're looking at two runs in parallel, but we can also look at 10 runs in parallel and merge from those. But at a certain point, we'll have too many runs so that they don't fit into the caches anymore, and then the intermediate results won't fit, meaning
Starting point is 00:17:33 we basically cannot load enough at a time. Or, I mean, we could set up so many runs in parallel that we always have to go to memory for each individual run just to find the next value. And this is what we're making sure of: that we don't have too many merging steps, or rather not too many ways of merging. Here we always only merge two queues, but we can merge multiple in parallel. So this would be only a two-way merge sort, but we can do a multi-way merge sort, say five queues in parallel, 10 queues or 10 runs in parallel.
Starting point is 00:18:22 And then we always have to check: OK, which one is next, in each of those? But we don't have as many steps. So this will lower the height of this merge tree, or the number of merge steps or merge phases. So there's less work to do here, but at a certain point it won't fit into cache anymore, and then we're basically screwed.
Starting point is 00:18:53 Then everything will be much slower. Makes sense? Yeah. That's all. Yeah, I think I have to... Well, if you think about it, right? Say we can fit four elements into cache. Actually, four is a bit little, so say we can fit 16 elements into cache, and
Starting point is 00:19:28 each of our runs is four elements long. Then we could hypothetically load four runs, but then we cannot output anything. If we instead load three runs, we have 12 items in cache, and we can check, across the three runs, in which queue we have the smallest item. We can basically get three or four outputs in total, write a new output run which will be four in length, flush it, basically write it out of the cache, do another one, et cetera, and we'll still be in the caches; basically all the
Starting point is 00:20:14 intermediates will be in caches. If we try to merge four ways, then we cannot write anything out, so then all of a sudden we'll always have to go to memory, or to the next-level cache. Does it make sense? But whatever we wrote, the next operator still has to load it from memory again? Yes, yes, yeah. So eventually it will have to go.
Starting point is 00:20:38 Yeah, we cannot have all of the data in the caches. The question is whether the working set fits into the cache or not: in between the steps, do we already have to go to memory or not? All of this data will not fit into the caches.
Starting point is 00:21:01 So basically, somewhere in here things fit into the caches, and in between these steps, stuff needs to go to memory. But the question is: all these queues that need to be worked on in a single merge step, do all of those fit into the caches? OK? Still not so sure?
Starting point is 00:21:34 No, I think it makes sense. OK. So maybe I'll add some animation to this, that would make sense; I'll put a note. And the other thing is, basically, as soon as we're not in the caches anymore, we have to go to memory.
Starting point is 00:22:05 And I don't want to try to connect this too much now. So let's take a step back. So as soon as we're working in parallel, as in the sorting or in the data structures, we somehow need to make sure that we're synchronizing things. So we need to make sure that these parallel threads or parallel cores or parallel processes, whatever, that they don't interfere with each other. So they basically, I mean, of course, they will interfere with each other,
Starting point is 00:22:40 but they don't destroy each other's data. They somehow have to work on the same data structure, look up stuff, update stuff, but do it in a consistent way. And today we're going to look into how this actually works: how can we make sure that multiple threads do not destroy each other's data, how do we synchronize between them? We'll first look into the cache coherency protocol, the MESI protocol, which is a baseline; variants of it will always be used
Starting point is 00:23:17 in modern CPUs to synchronize, or to create cache coherency, between the different caches of the different cores. And then we'll look at different synchronization strategies, called latches, basically: locks for data structures. And then, as an example, I'll walk you through different types of latches, and with them different kinds of locking structures, for a linked-list example. So let's dive right into this and start with cache coherency. We know that databases are faced with highly concurrent workloads, especially in OLTP settings,
Starting point is 00:24:18 where we have many concurrent accesses. And in OLTP we don't just have accesses that are reading, we also have updates, so our data structures, say our ART for example, will have many small local updates somewhere. Say you have your customers, and the customers will update their data, their usernames for example; that would be a nice fit for OLTP. This would be a local change somewhere, but it's a change in one big data structure.
Starting point is 00:24:55 So many threads will do this in parallel, and we need to make sure that they don't interfere with each other. If we have many parallel tasks, the good thing is we can actually use the parallelism in the hardware. Especially in an OLTP setup this is kind of nice: typically there are many accesses that don't all work on the same cache line or something, so we're not super contended, and we will actually be able to exploit the hardware. If we have OLAP kinds of workloads,
Starting point is 00:25:32 where we have these large queries, we can basically break up our operators in a way that we have many small tasks. And we want to break them up in a smart way, such that they don't always conflict with each other, so that they work on independent data parts. So typically, we do this by partitioning the data. We talked about this last time.
Starting point is 00:25:56 However, still, there will be some conflicts. I mean, even if multiple people do their banking, for example, most of them will just work with their own bank accounts. So this is all parallel, no problem. But all of a sudden they do exchange money, two people work on the same account, et cetera. So we need to have some kind of synchronization mechanism. And of course, not only that, because we are working on the same kind of data items,
Starting point is 00:26:27 but we also have data structures which are global data structures. So our B-tree, our ART, et cetera, these data structures, so the kind of indexes, et cetera, we need to make sure that these are handled consistently while multiple threads are working on those. So, we need synchronization mechanisms. And in databases, there are two levels of synchronization. On the one hand, we have
Starting point is 00:26:59 the transactional guarantees. So, banking: we have multiple customers going to the ATM, and we need to ensure that whatever they're doing is consistent with the whole world inside the database, right? Our database always needs to be in a consistent state after a transaction. And in database terminology these are called locks, right? A customer updates their information in the database, or we're updating student information in our student database, and in order to make sure that you're not getting wrong marks or grades in different courses, we need to make sure, while we're updating something in the grading sheet, that we have some kind of lock on your grades, or on the average if we're computing one. So this needs to be locked, and this is what in database terminology we're calling the lock.
Starting point is 00:28:06 We're basically locking the table, or locking a sub-part of the table, or something like that. However, at the same time, if we're working inside the database, which is what we're talking about here, we're working on the data structures. Say we need to lock the table and update your student ID, because for whatever reason you get a new student ID; then this also affects our data structures in the database. And this means multiple threads that work on the data structure
Starting point is 00:28:48 also need to adhere to some kind of synchronization. And the locks that we're using there are typically called latches. These are low-level, database-internal data structure locks. They're also called locks, but if we talk about both, then the table locks would be the locks and the data structure locks would be the latches. I'm not super consistent in this lecture regarding locks and latches, but it's simple, because we're only going to talk about latches here. We're not going to talk about database locks at all: no table locking, no locking for transactional guarantees, just how we lock data structures and make them work across different threads. So only latches, basically.
Starting point is 00:29:41 So again, looking at the caches, right? We know we want to be cache efficient, and all of our cores have individual caches. Whenever we're working on a data item, a part of our data structure, it will be in the caches. So this is how it works: we have data in memory, and in order to do something with it, it needs to go to the cache
Starting point is 00:30:07 first, then we can work on this and then it typically goes back to the cache. Eventually, it will go back to memory. And this is basically what the CPU does for us. And in order for this to be efficient, each of the cores typically has their own cache. So these are private. And that basically means if multiple cores work on the same cache lines, well then we need some kind of synchronization. And this is what the CPU does.
Starting point is 00:30:40 So the CPU, as we know, will see all of the memory. For convenience, we're not going to look at the CPU as it actually is. Typically, we know there are multiple cores with their individual caches, and multiple CPUs have kind of their own memory, et cetera; we've seen some of this already. But from a programming abstraction, or at the most general level, this is basically more or less transparent. We don't
Starting point is 00:31:12 know exactly where the data is residing; it looks like one large address space. We're assuming everything is in memory, it might even be paged out to disk, but we're hoping, or trying, to keep everything in cache most of the time. If the data is in cache, in the first-level or second-level cache, again depending on the CPU, it will be private to one core. And now the CPU takes care of managing this between the different cores. This is called the cache coherency protocol.
Starting point is 00:31:51 And this will get even larger: cache coherency is extended across multiple CPUs today, so it's not just one CPU, and also across peripherals. With modern protocols, and we'll hear this in the last or second-to-last lecture, I think, by Marcel, with new technologies and new protocols this cache coherency also works across multiple nodes,
Starting point is 00:32:23 across CPU and GPU, etc. And this basically means, well, the cache basically says, well, we're basically loading the data from memory into our cache, so we want to work on this, and later on we're going to store this again. And for this, we need to make sure what's the status of this data, actually. Is somebody else using this data or not? And in order to do so, the cache or the different caches,
Starting point is 00:32:56 the different kind of cores, need to communicate to each other. They need to coordinate. And there's different ways how this is done. And on the one hand, there's a so-called snooping-based coherence, and on the other hand, there's a directory-based coherence. And snooping-based coherence basically
Starting point is 00:33:15 means we're communicating with each other to figure out if data is in another cache or not. And directory-based coherence means there is some kind of directory, central, or at least central to some memory region, and the directory holds all the state, or the information about where data currently is and in what state it is. And now we need some kind of protocol for how to know what the state of the data is and where the data is right now. And the most widely used protocol, or let's say the baseline protocol, would be the MESI protocol.
Starting point is 00:34:08 And typically, so this is kind of the baseline, and then there are variants that are actually used by contemporary processors. The MESI protocol basically has four different states for each cache line. Either the cache line is modified, and if it's modified it's only in one cache, so in one core; that's the M, and that's why it's called MESI: modified, exclusive, shared, invalid. It's exclusive, meaning it's in one cache
Starting point is 00:34:48 and it's not been modified. It's shared, that means it's in multiple caches or it's invalid, which means it's unused right now. And so this basically means we can have a cache line in multiple caches. So it doesn't have to be in all caches. So it just means, OK, this cache line
Starting point is 00:35:14 is basically present in multiple cores. And multiple cores can read from it. As soon as it's basically modified, then it should be only in a single cache, or it will only be in a single cache, and only this basically cache, only this core can actually work on it. It can be exclusive, meaning it's only in one core, so it's basically more or less the same as modified, with the exception that it's not modified
Starting point is 00:35:45 it's it's the same as it used to be in memory like we have the same in memory and in the cache if it's shared it can be in multiple caches it can be modified but all of the basically it can be different from what is actually in memory so it doesn't have to be stored into memory yet. But basically all caches see the same cache line. They all see the same version of the cache line. There's no diff, like we don't have different versions in different caches. And then finally, there's invalid, which means this cache line currently is not used. This is basically the baseline case, so we haven't loaded anything or the case, well, I'm changing this now, right?
Starting point is 00:36:35 So I'm working on this cache line, then all of the other caches basically have to invalidate the cache line because they are not supposed to work on it. And in a snooping fashion basically or when we're basically asking the other cores, the other caches, are you in this cache or not? Then in the shared setup, all of the caches who have the shared cache line will answer to this. So basically whenever I'm doing a cache lookup in a snooping protocol I will basically ask does anybody have this cache line? My core will ask this and all in if it's shared then all of the caches who have this cache line will answer
Starting point is 00:37:18 say "well, I have it", or at least try to answer. If it's invalid, we're not going to get an answer; if it's exclusive, one cache will answer; if it's modified, the same, one cache will answer. And because of the shared state, Intel uses the MESIF protocol with an additional forward state, which basically says: in the shared state, we know this cache line is shared, and if it is shared, there's a single responder for it. So there's one cache that has this forward state and that will answer any cache snooping. Say we have 40 cores and the cache line is in all 40 cores,
Starting point is 00:38:16 which could happen if we have some kind of tree lookup, right? Say we have a B-tree or an ART: the root of the tree is something that we're always looking at, so if we're doing lots of lookups, it's highly likely that this node is in many caches. And then whenever I'm asking,
Starting point is 00:38:39 "do you have this cache line?", or one core is asking "do you have this cache line?", then all of the caches in the shared state would answer. And for this, we're saying: in this case, there's one core which is supposed to answer, and that one would then have the forward state. OK, so additionally there are different kinds of snooping modes: there can be no snoop, early snoop, home snoop. I don't want to go into too much detail; it's basically about when and where I snoop. If I only have a few cores, then I can basically ask everybody,
Starting point is 00:39:29 And all of the caches can basically answer. If I have many cores, then I want to somehow centralize this. And then it's basically, HomeSnoop means basically the memory controller. Rather than the individual cache controller, the memory controller will basically remember where the data is and will answer to this. And then we basically start using these directories. So then rather than having individual caches checking which kind of data is in the cache and answering to snoop requests,
Starting point is 00:40:09 the memory controller would answer the snoop requests. And the way this actually works is: whenever I'm trying to do a memory lookup, so I'm loading data from memory, this will automatically execute a snoop request, or a kind of directory lookup, in order to check if somebody else has this cache line, in order to keep this cache coherency. And in larger processors, not larger clusters, larger processors, it could even be
Starting point is 00:40:50 or then even the core can be split up. The cores can be split up in multiple regions, which then have their own directories, and which then also act as individual NUMA regions. We'll talk about NUMA, so non-uniform memory access, next time. So this basically means then logically we'll see two processors in a single processor. So we have one processor that's split up into basically two processors, and these have their own memory regions.
Starting point is 00:41:22 And the snooping and directory, so the cache coherency, will be done for the sub-packages separately. And these regions, or these two sets, will of course act as two separate processors, more or less. And while there are all these different snooping modes, and you will still find them in many CPUs, most current processors, especially the larger ones, are using directory-based cache coherency, basically saying the memory controller will remember where the cache lines are and will answer these requests.
Starting point is 00:42:11 So say, okay, there's basically a central place where I can check where, and in which cores, the data is cached, rather than the cores doing all this communication on the bus or on the mesh. Here we see this kind of Skylake, I think it's a Skylake, I don't fully remember, setup, where we see it's kind of a ring topology. And today, the modern CPUs have a mesh topology, so the CPUs, or the cores, are connected in a mesh.
Starting point is 00:42:49 And then the cache coherency will be done by the memory controller, which is also connected to this mesh. OK, so this is just important to know; this is basically what we're going to use when multiple cores are working on the same data. And it also means that as soon as one core is writing the data, the cache lines are invalidated on the other cores, and they will have to reload them once they want to work with the data again. So multiple caches, or multiple cores, can read the
Starting point is 00:43:33 same data, but as soon as we're writing the data, the data needs to be synchronized again. The one that's changing it will invalidate the cache line, and then they can share it again through the cache coherency protocol. It doesn't have to go to memory, but it's still communication that needs to be done in this internal network in the CPU, which could be a ring, but today would most likely be some kind of mesh. Okay, so let's get started again and talk about locking, by which I mean latching, actually. First we're going to talk about the atomics, so what you can use to implement this, and then we'll see how we actually use this in a data structure. So in x86, there's this lock prefix, which tells the hardware: do not let anyone read
Starting point is 00:44:36 until I'm done with it. So there's this lock prefix that you can use. And this is not done in general: if you're writing normal code, the individual cache lines et cetera won't be locked, because this would be slow. And there are different operations, different things, that you can use. Say, for example, there's a compare-and-swap operation. Compare-and-swap basically means
Starting point is 00:45:11 you're checking the value of a bit or a variable, and if it's the certain value that you're expecting, then it will be changed, or else it fails. Basically you get a zero back, or, I don't remember exactly what you get back, it basically just tells you it failed.
Starting point is 00:45:41 So this is something where you can say: is this free? For example, you have a locked bit or something, and if it's free then you set it to locked; if it's not free, then you fail and have to somehow retry, for example. Then there's an exchange, so you can atomically exchange something. And you can do read-modify-write, where you're basically doing a small transaction: you will lock this value, and no other thread can work with it until you're done, and you cannot work on it while another thread is working on it.
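As a small illustration of the compare-and-swap idea in C++ (my own sketch using std::atomic; the 0/1 lock-bit encoding is an assumption for illustration):

#include <atomic>

std::atomic<int> lock_bit{0};   // 0 = free, 1 = locked (illustrative encoding)

// Try to take the lock: only succeeds if the current value is the expected 0.
bool try_acquire() {
    int expected = 0;
    // On failure, compare_exchange_strong returns false and loads the
    // current value into 'expected', so the caller has to retry.
    return lock_bit.compare_exchange_strong(expected, 1);
}

void release() { lock_bit.store(0); }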
Starting point is 00:46:27 And, I mean, just in general, this is tricky stuff. If you're trying to implement locks, et cetera, yourself and you're not using mutexes, then you really have to be careful, because then, all of a sudden, we're also talking again about the compiler.
Starting point is 00:46:47 The compiler can reorder your code to some degree, and if the compiler does this with your locks, well, then you're screwed. Or you could be screwed, because your locking might end up before or after the actual change to the data structure, in the wrong place, if the compiler thinks this is better or faster, which it might think, because the lock might be slow.
Starting point is 00:47:17 And then you need to do some fencing. So we'll not go into the details of this, but basically there, with fencing, you can ensure that the order of operations is the way you write it in your code. And this is especially in non-temporal stores important. So if you want to be fast with persistent memory, for example, this is something that might happen where you're
Starting point is 00:47:40 not writing into caches with non-temporal stores, but you're directly trying to write into memory. And that is something that will be slow. And then the compiler will think, well, maybe let's reorder this somehow. So then you need to be extra careful. And yeah, this is tricky stuff. And there's lots of information online.
Starting point is 00:48:05 And even if you read a lot about it, it's still tricky to do. So given that you have some kind of lock, so this can be like a kernel thing, like, say, a mutex or something like this, or you implement it yourself, then there's still different ways how we lock our data structure. And we can do this in a pessimistic way, basically saying we're always going to lock when we try to access the data structure.
Starting point is 00:48:40 And then basically we're sure that nobody else will have a problem. Or basically we're the only ones working on the data. And once we're done, we can continue. This basically means only, at least in the subpart that we're working on, only one thread can work with this at a time. And even all of the readers, et cetera, in an exclusive lock, they cannot do anything. We can also have kind of a shared lock,
Starting point is 00:49:17 which means we're allowing multiple readers, but as soon as we have some thread that wants to write, we're going to go back to an exclusive lock, and nobody else can work with the data. That usually means that if we have many readers, the writers will basically block everything. And another way that we can do this, though it really depends on the case, on how the data structure is accessed, how many writes versus reads, et cetera,
Starting point is 00:49:54 which might be still more efficient, is optimistic locking. And in contrast, so in the pessimistic locking, we always take the lock. In optimistic locking, we're basically trying to not take a lock and then see if everything went well afterwards. And if everything went well, we don't have to do anything. If we have some kind of synchronization issue, we have to retry. So then we're basically restarting from scratch. And we need, of course, to be sure that we somehow can retry, that we can validate, and that not everybody else is basically, we're not breaking everything in between.
Starting point is 00:50:40 We don't have an invalid state somewhere in between where readers, while we're modifying, for example, will read wrong data. And so then we'll still need some kind of locking. But say, for example, for the reading, we can just try to read and then check, was this actually correct what I've been reading? And if not, then I have to retry. But we'll go through this again. And finally, there is also the option to go completely lock-free. So, meaning we're not using any real locks,
Starting point is 00:51:18 but we're basically just trying to continue and using these kind of atomics rather than having something like a mutex or something like that. And say we're using this compare and swap operation, for example. And so this means the threads don't have to wait. They always can do compare and swap, right? So they can always execute this operation, but it might just not work because somebody else
Starting point is 00:51:52 is concurrently also changing something. So then they always have to just retry. And this means, I mean, this can be fast because we're not using any locks. So the threads never go to sleep. The threads will continuously basically work. But they might, like if we have lots of contention, they might just do busy work, not getting any real progress.
Starting point is 00:52:17 And this is kind of hard to implement. And doesn't really, often is not really efficient for very complex data structures. But we'll see there's some kind of trade-offs also. So let's look at optimistic locking in a bit more detail. So what we're doing is basically while reading, typically this is for reading, we're validating or after reading, we're validating that what we've read
Starting point is 00:52:48 still is correct. So we basically want to read our data structure; for example, we're traversing it, and while we're traversing to the next node, if somebody else is concurrently changing it, we might traverse into something that's actually completely outdated. We might go into a deleted node and then end up somewhere in Nirvana in memory. And in order to make sure that this doesn't happen, we're typically checking versions. We're reading a version when we're starting our read. Say, for example, we're in our ART: so we're reading the version of this node. We're
Starting point is 00:53:35 starting finding this stuff in the node that we're actually looking for. And then later on, we're checking again, is this still valid, this node, what we're looking at? And if it's still valid, then we know, okay, this is still a good node. We can continue where we left off. And so there we don't need to lock. We don't need to basically say, okay, here I'm holding a read,
Starting point is 00:54:02 at least a read lock, for this node. I'm just checking, I'm just reading the lock. And the difference is really, again, in the cache coherency, right? If we're holding a lock, we have to write something, because we have to somehow mark this lock. The lock is just a variable, basically, something that will be in a cache line somewhere,
Starting point is 00:54:27 and if somebody else wants to access this, they also have to read it. So the different cores need to communicate; the different caches need to see the same cache line if they want to access the same data. And if this is a node high up in a tree, for example, this lock will be a lock that many threads will look at, and that many threads will also update. And this means it will frequently be changed and frequently be invalidated.
Starting point is 00:55:02 So this is always costly because it's a cache invalidation. If we're avoiding this, we don't need to invalidate, right? Then we're in this shared state. All of the nodes, basically all of the threads, can have the same cache line and can work concurrently on this. Unless, of course, somebody changes it, then we need to basically update this. But the change, as we saw in
Starting point is 00:55:25 the MESIF protocol, this is something where the cache that actually does the change will inform the other caches that there is something happening. So then we have the snooping request, then we know okay something's happening, so then there will be a cache invalidation. Otherwise, in an optimistic read, this won't happen. And of course, we can combine this with some pessimistic locking if we have high contention, if we're failing too often. So we're going to try.
Starting point is 00:56:01 OK, I'm reading the version. Does it work? If it's locked, I cannot really work on it, I know it's going to be changed, so I'm going to read again. If it's not locked, I try to read, do my stuff in the node, and check the version again. If it's fine, I'm basically good; if it's not fine, I'm going to try again, up to a maximum number of attempts, because I don't want to starve.
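A minimal C++ sketch of this optimistic read loop (my own illustration; the version encoding, the node layout, and the maximum attempt count are assumptions, not the lecture's code):

#include <atomic>
#include <cstdint>

// Node protected by a version counter: odd while a writer holds it locked,
// bumped to the next even value when the writer is done.
struct Node {
    std::atomic<uint64_t> version{0};
    int payload = 0;
};

constexpr int kMaxOptimisticAttempts = 8;   // made-up bound for illustration

bool optimistic_read(Node& n, int& out) {
    for (int attempt = 0; attempt < kMaxOptimisticAttempts; ++attempt) {
        uint64_t v1 = n.version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                        // write-locked right now, retry
        int value = n.payload;                       // read without taking a lock
        uint64_t v2 = n.version.load(std::memory_order_acquire);
        if (v1 == v2) { out = value; return true; }  // version unchanged: read was consistent
    }
    return false;   // too many failed attempts: caller falls back to pessimistic locking
}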
Starting point is 00:56:39 Now, if I feel I'm going to starve, I'm going to start locking this, and then at some point I will for sure get the node, I will get the lock. So this is how you can also avoid starvation here. Now, if we run into this locking mode, if we go down here, then we have to use a lock. Or if we want to write something: even in optimistic locking, we need to have a lock for that.
Starting point is 00:57:08 There are basically two strategies. Either we're doing a spinlock in user space, which is this one, where the threads continuously wait and poll the lock until it becomes free. Or we have an OS service like a mutex, or a futex, or something like that, where the OS deschedules the threads until the lock becomes free. And the difference is: either we're paying with the cycles that the threads burn while they're checking, and with the cache misses when we're writing
Starting point is 00:57:55 and reading this lock. As soon as we're writing the lock, the cache line needs to be invalidated for the other threads, or for the other cores. And as soon as we're acquiring it, again, we need a cache invalidation; we need to have this lock in our cache. And with blocking, with the OS service, we have the context switch: the thread needs to go to sleep.
Starting point is 00:58:26 It needs to wake up again. So the context switch, of course, could be more expensive depending on how much we're spinning around, basically. So this is basically how this would look like with a mutex, right? So we're basically saying, locking, then we're doing our thing, our update to the node, for example, and then we're unlocking again.
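In C++ this is essentially the following (a minimal sketch; node_mutex and node_value are made-up names):

#include <mutex>

std::mutex node_mutex;
int node_value = 0;

// Blocking variant: any other thread calling this while the mutex is held
// is put to sleep by the OS until the lock becomes free again.
void update_node(int new_value) {
    std::lock_guard<std::mutex> guard(node_mutex);  // lock
    node_value = new_value;                         // do our update to the node
}                                                   // unlock on scope exit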
Starting point is 00:58:53 And anybody else who tries to get this mutex will go to sleep while this is running, and the threads will be queued and scheduled in the right way to then access the lock. There are many ways you can implement a spinlock. We talked about compare-and-swap; there's another one called test-and-set, which just says we're setting the value and we're
Starting point is 00:59:29 returning the old value. So if we just have a single bit saying "this is locked", for example, then we can just use test-and-set, and we get back either that it had already been set, or that it had not been set and now it is set. If it had been set before, then we know it was already locked; although we basically wrote "locked" again, it was already locked, so there's no change, and we cannot proceed. If we test and we get back that it hadn't been locked, then we know: now it's ours, we can work.
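A minimal test-and-set spinlock sketch in C++ (my own illustration; std::atomic's exchange plays the role of the test-and-set instruction):

#include <atomic>

struct TasSpinlock {
    std::atomic<bool> locked{false};

    void lock() {
        // exchange() writes "locked" and returns the previous value, so a
        // returned false means we just acquired the lock. Every attempt is a
        // write, which is exactly the cache-invalidation problem discussed next.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy polling
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};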
Starting point is 01:00:10 And this is efficient, but we have this busy polling, right? And I mean, the problem can be quickly that we might have, if we're using this for highly accessed or highly frequently accessed cache lines, then we basically might have this many cache invalidations, right? So then all of the threads will basically fight or all of the cores will fight for this single cache line, which will cost a lot basically
Starting point is 01:00:39 because we have to do all the internal communication for cache coherency. And this is why this won't necessarily scale. So in order to make this more scalable, we can do a kind of queue-based lock: we have a base lock, or base latch, and on it we build a queue with a separate lock for each thread. And then each thread can basically check
Starting point is 01:01:07 its own lock, whether it's free or not, depending on the other latches. This means we'll have to check more, but for the busy polling, we first only check our own latch. And with this, we can also implement concurrent readers: we can say multiple threads can read there, or can set up the queue
Starting point is 01:01:45 where they have their own lock for reading. And as soon as we're writing, we're going to go back to the single lock. So, in a database, what do we need for the latches in general? We assume that it's mostly reads. Yes? So I don't understand why the extra per-thread flag is useful.
Starting point is 01:02:18 We always, like, I mean, if I'm already locked, then I already know that I'm in my critical section. So why would I ever check my own latch again? So what you're basically doing is: you're creating not a single cache line that everybody checks, but, let's say, multiple states that you can check. So let me think again. I mean, you always need to check, basically,
Starting point is 01:02:52 can I already read or can I not read? And if you're working on your own cache line, if you're just polling on your own single cache line, then you're not going to use another thread's cache line. Of course, as soon as the latch becomes free, we have to invalidate the cache lines, but we don't need to communicate,
Starting point is 01:03:18 like we don't need to constantly have everybody polling on the same cache line. We can say: as soon as it becomes free, then we have to invalidate again, but in the meantime, while we're polling, we're just reading, and every thread reads its own cache line. So basically, if I also want to take the lock, I'm going to set my latch. If it's going to be free, or so, I basically
Starting point is 01:03:54 have the base latch, which basically tells this one is actually blocked right now. And then CPU has their latch. CPU 2 might have their latch. And as soon as it becomes free, I'm gonna unset them, for example. And of course there's many different implementations I could do this, but I will only,
Starting point is 01:04:12 the CPU2 will only queue query here. And as soon as CPU1 would be ready, for example, it could unlock all of those, like the whole chain basically. And then the next one could go in. Or it unlocks these two and only then these two are un-unlocked and the next one can go there. So we're basically creating a chain of locks and while spinning, everybody only reads their own log rather than this one single cache line that everybody needs to read. Does this make sense? So if we're doing test and set, right?
Starting point is 01:04:53 So if we're using this test-and-set spinlock, then in every single test-and-set we're actually writing the lock. Every time we write, this will change the cache line, or at least touch the cache line, every time. So rather than doing this, I'm going to get my own lock with a connection to the next lock. Then I know, okay, there's a base latch that I have to go from; this CPU 1 latch is locked now, which means CPU 1 is there,
Starting point is 01:05:36 and cpu2 will only spin say for example on this cpu1 will unlock this and CPU2 can... So this would be one implementation. CPU2 can then have their own lock and other CPUs basically... Getting out of here. So other locks will then basically be added to this. So we're creating a queue of locks, just like a queue of threads being scheduled. And everybody only spins on their own log, rather than on one single log. Oh, so every thread, once they're done, they are like just notifying the one next? Exactly.
Starting point is 01:06:18 OK. Yeah. Makes sense. So then basically, we go to the next one. The next one will be open. The ones queued next basically will have to continue spin on their log. So we get some invalidations, but not these constant invalidations. We're not constantly just spinning on a single cache line that everybody tries to access.
Starting point is 01:06:43 With plain test-and-set, everybody also always sets, so it is always a write operation. This queueing technique we can also use for reader-writer locks. We can say: we have these many readers, and now we have a writer; as soon as the writer is there, all of the readers cannot work, but all of the readers can read in parallel.
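For the reader-writer case, the standard library already offers a latch with exactly this behavior; a minimal sketch (my own example, assuming C++17's std::shared_mutex):

#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_latch;
int shared_value = 0;

// Many readers may hold the latch in shared mode at the same time.
int read_value() {
    std::shared_lock lock(rw_latch);
    return shared_value;
}

// A writer takes the latch exclusively and blocks both readers and writers.
void write_value(int v) {
    std::unique_lock lock(rw_latch);
    shared_value = v;
}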
Starting point is 01:07:25 So in a database, we basically want the latches to be fast, and the reading needs to be fast and scalable. I've reiterated this already: as soon as we have a tree-based data structure, the root nodes will always be high-contention nodes, so these should be lockable with minimal overhead, and of course we don't want a lot of latency there.
Starting point is 01:07:53 So using kernel or OS context switching, or OS-based locks, is kind of detrimental for latency, because every time another thread tries to access these tree nodes, we get into this descheduling and scheduling. So, especially if we have small nodes and small data, we also want some kind of fine-grained locks. A mutex can be 40 to 80 bytes, so having something like a test-and-set bit will save you a lot of space, also in the cache; if your nodes are small, the mutex might already be much larger than your individual nodes, say for the small ART nodes, for example. So then you actually want user-space locks as well,
Starting point is 01:09:03 and of course you also want efficient contention handling, meaning that as soon as you have high contention on a single cache line, this will of course degrade, but it shouldn't degrade everything else. Of course, we cannot get indefinitely fast on a single cache line, because this always means communication, but at least threads that access other parts of the data structure should not be bound by this. And there are different kinds of locks and locking types
Starting point is 01:09:52 that we saw already, and which one is best really depends on the type of workload. If you have read-only, of course, you don't really need any locking. If you have read-mostly, then you most likely want something optimistic.
Starting point is 01:10:15 If you're only writing, then everything kind of doesn't really make much of a difference, because you're going to be in conflicts all the time anyway. Or as soon as you hit conflicts, you need to deal with them anyway. So then exclusive or shared locking will be equally good. And optimistic locking doesn't really
Starting point is 01:10:39 apply, because we're not reading anyway. In a read-only case, for example, optimistic locking would of course be most efficient. So it really depends, and it also depends on what you're using your data structure for, what kind of workload you're assuming. If you're assuming a write-heavy workload, then optimistic locking might not be ideal, because there will be many aborts, or starvation for the readers. Right, so then
Starting point is 01:11:23 this also means in the end you probably want to have something that's in a hybrid setup. We saw this in this example for the optimistic locking that eventually, if we're aborting too frequently, well then let's use a pessimistic strategy. Let's look at this briefly. We're going to start with this at least, how this looks in the data structure. And for this, we're going to look at a very simple data structure. And most likely, you will implement something similar, a bit more elaborate than this in the last task. So we're going to look at the list-based set.
Starting point is 01:12:03 So we have insert, remove, and contains. The elements are stored in a sorted way in the linked list. And we always have head and tail, so as an invariant, we know where the list starts and where the list ends. Then there are different ways in terms of granularity of how we lock this, and different strategies in terms of optimistic, pessimistic, etc. And, well, the most simple way of locking would be just a coarse-grained lock, a single lock for the whole data structure. And this could be, for example, a mutex.
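One possible shape of such a coarse-grained list-based set, as a sketch (the task's actual interface will likely differ; keys are assumed to stay strictly between the INT_MIN/INT_MAX sentinel keys so every element always has a predecessor and a successor):

```cpp
#include <climits>
#include <mutex>

struct Node {
  int key;
  Node* next;
};

// Coarse-grained list-based set: a single mutex guards the whole list.
class CoarseSet {
  Node head{INT_MIN, nullptr};   // sentinel: smaller than every real key
  Node tail{INT_MAX, nullptr};   // sentinel: larger than every real key
  std::mutex m;

public:
  CoarseSet() { head.next = &tail; }

  bool contains(int key) {
    std::lock_guard<std::mutex> g(m);
    Node* cur = head.next;
    while (cur->key < key) cur = cur->next;   // tail sentinel stops the loop
    return cur->key == key;
  }

  bool insert(int key) {
    std::lock_guard<std::mutex> g(m);
    Node* pred = &head;
    Node* cur = pred->next;
    while (cur->key < key) { pred = cur; cur = cur->next; }
    if (cur->key == key) return false;        // already in the set
    pred->next = new Node{key, cur};          // splice in between pred and cur
    return true;
  }

  bool remove(int key) {
    std::lock_guard<std::mutex> g(m);
    Node* pred = &head;
    Node* cur = pred->next;
    while (cur->key < key) { pred = cur; cur = cur->next; }
    if (cur->key != key) return false;
    pred->next = cur->next;                   // unlink and free
    delete cur;
    return true;
  }
};
```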
Starting point is 01:12:43 That's of course easy to implement, but it's highly inefficient. So we cannot really use this in a scalable way. And one thing that you always need to keep in mind: the more threads we add, the more performance we want, so the more parallelism we want. With this, we're not going to get that at all. Having more accesses or more threads should always give us more parallelism, and for this, we need some kind of fine-grained locking. So in order to do this scalably, we want fine-grained locking. We need to split the object into small synchronized components. This means, in our case, basically into individual blocks
Starting point is 01:13:27 or individual nodes, right? So in a linked list, the individual node would be the logical element for locking. And then we can do optimistic synchronization. So we can search, for example, without locking, because, I mean, this is basically just reading. So for traversing the list, we don't need to lock. If we find something, then we need to lock, and we can check. If it's okay, then we're done. So if we can grab the lock, then we're done. If not, we can start over. So that's an optimistic way of doing this. And this can be expensive in a list-based set, for example, because we're going to have to start over from scratch,
Starting point is 01:14:14 from the beginning of the list, for example. We can do lazy synchronization. That basically means we're going to postpone whatever we're doing in the list and just mark stuff. Say, for example, if we want to remove, we mark the node as removed and then later on, or slightly later, remove the node. This is so that any kind of reader that's still stuck in this node that we want to remove
Starting point is 01:14:45 sees, okay, we've changed this in an atomic way to being deleted, but we're not just dropping the node completely, because a reader might then follow a wrong path that we've not completely established yet, right? So the data structure is larger, has different kinds of pointers, and these pointers might then point to something that we don't actually want to have yet. Or we might have a half-established node or something like that. So having this logical removal gives us some extra time
Starting point is 01:15:21 to do the physical removal on the side and make sure that anybody who's still reading this can finish. And then we can also do lock-free. And this is, as we said, using this compare-and-swap. So just basically trying it over and over again: we're checking if the versions are right, and if it didn't work out, we have to start over and try again. And we're going to look at each of these ways; I'm not sure, I'm probably not going to finish all of them.
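The compare-and-swap retry pattern just mentioned, as a generic sketch: pushing a node at the head of a lock-free list. (A full lock-free set additionally needs marked pointers for removal; this only shows the retry loop itself, with hypothetical names.)

```cpp
#include <atomic>

struct LFNode {
  int key;
  LFNode* next;
};

// Read the current head, prepare the change, and publish it only if nobody
// else changed the head in the meantime; otherwise retry with the new value.
void push_front(std::atomic<LFNode*>& head, LFNode* n) {
  LFNode* old_head = head.load(std::memory_order_relaxed);
  do {
    n->next = old_head;   // link against the version we have seen
  } while (!head.compare_exchange_weak(old_head, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed));
  // on failure, compare_exchange_weak refreshes old_head and we try again
}
```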
Starting point is 01:16:01 But this lock coupling I wanted to show you first. Because in a linked list, for example, and similarly in a tree or any of these index structures, also if we have a hash table with chaining, for example, we basically cannot just lock a single node, right? Or well, if we're only changing a single node, of course,
Starting point is 01:16:30 we can just update the value in that single node, and then locking the single node would be enough. But if we want to add a node, or if we want to delete a node, locking a single node won't be enough. We somehow need to also lock the neighbors. And for a linked list, actually, locking two nodes is enough. For removal, for example, the predecessor and the node that's going to be removed will be enough. For adding, the predecessor and the successor will
Starting point is 01:17:02 be enough, because the new node, of course, doesn't have to be locked yet. And the same thing we will also do for reading, right? So we'll also have a reader lock, so somebody else will not change the data structure while we're reading. And lock coupling basically says, well, I somehow need to get to the right point and be able to find this. So what we're doing is we're going in a lock-step fashion through the data structure: we start at the root node, or at the head, lock the next node, and then continue from there, and we're always going to hold at most two locks, as in the sketch below.
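A minimal sketch of this hand-over-hand traversal for remove, assuming a per-node lock (for example the one-byte lock from above, here written with std::mutex for brevity) and the usual head/tail sentinels:

```cpp
#include <mutex>

struct LNode {
  int key;
  LNode* next;
  std::mutex lock;   // per-node lock; a real system would use something smaller
};

// Lock coupling: the lock on the current node is kept until its successor is
// locked too, so at most two locks are held and at least one lock always
// protects our position in the list.
bool remove(LNode* head, int key) {
  head->lock.lock();
  LNode* pred = head;
  LNode* cur = pred->next;
  cur->lock.lock();
  while (cur->key < key) {            // assumes a tail sentinel with maximal key
    pred->lock.unlock();              // drop the older of the two locks
    pred = cur;                       // cur stays locked and becomes the new pred
    cur = cur->next;
    cur->lock.lock();
  }
  bool found = (cur->key == key);
  if (found) pred->next = cur->next;  // both neighbours locked: safe to unlink
  // (freeing the unlinked node safely is a separate reclamation problem)
  cur->lock.unlock();
  pred->lock.unlock();
  return found;
}
```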
Starting point is 01:17:47 And as soon as we have this, for example for the removal, as soon as we have a lock on the node that we want to remove and on the predecessor, then we can actually do our operation. And this is easy to implement. It's better than the coarse-grained lock, but it's still inefficient, because we also need a read lock. We can have reader and writer locks, but we still need to somehow lock for reading. And we already said we can do optimistic locking, which means that just for the traversal, we're not going to lock. But then, of course, we need to take extra care when we're doing some kind of changes.
Starting point is 01:18:33 So in optimistic locking, we're basically traversing without locking; for reading, we will still, or might still, lock. Then for changing, for sure, we're going to lock the predecessor and the successor for an addition. For a delete, we're going to lock the predecessor and the node. After each change, basically, we have to validate that the nodes are still accessible. If they're not, then we have to restart from scratch. It might happen, for example, that we're traversing here, we're acquiring the lock,
Starting point is 01:19:17 and while we've acquired this, somebody else has already deleted this node. Then we have to start again: we're going to traverse again until here, or until the node where we wanted to add, and try again, right? So in this optimistic fashion, we don't know if something like up here has already happened. So this might have been deleted somehow. And this is kind of the problem: we're adding this node here, but this one is deleted. So then we need to redo the change, we need to basically fix this again.
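A sketch of how such lock-then-validate-then-maybe-restart could look for the optimistic list (hypothetical names like ONode and optimistic_remove; assumes the same head/tail sentinels as before, and glosses over that the unlocked traversal would need atomic pointer reads and safe memory reclamation in real C++):

```cpp
#include <mutex>

struct ONode {
  int key;
  ONode* next;          // a production version would make this std::atomic
  std::mutex lock;
};

// Validation: after locking pred and cur, re-traverse from the head and check
// that pred is still reachable and still points to cur.
bool validate(ONode* head, ONode* pred, ONode* cur) {
  ONode* n = head;
  while (n->key <= pred->key) {
    if (n == pred) return pred->next == cur;
    n = n->next;
  }
  return false;   // pred itself was unlinked in the meantime
}

bool optimistic_remove(ONode* head, int key) {
  for (;;) {                                  // restart whenever validation fails
    ONode* pred = head;                       // 1) traverse without any locks
    ONode* cur = pred->next;
    while (cur->key < key) { pred = cur; cur = cur->next; }
    pred->lock.lock();                        // 2) lock, then check nothing
    cur->lock.lock();                         //    changed underneath us
    if (validate(head, pred, cur)) {
      bool found = (cur->key == key);
      if (found) pred->next = cur->next;      // unlink; node is reclaimed later
      cur->lock.unlock();
      pred->lock.unlock();
      return found;
    }
    cur->lock.unlock();                       // interference: release and retry
    pred->lock.unlock();
  }
}
```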
Starting point is 01:20:03 And there, it's unlikely that we have lock contention, because we only lock where we really need it. But we need to validate, and for the validation we actually need to go from the start. So in the validation phase, we're not going to just go here or something. I mean, we basically start from the beginning again,
Starting point is 01:20:28 go all the way through the list and check if what we've been doing actually was correct, meaning that no other operation somehow interfered with our operation. And we also still need to lock for the reading. I mean, we're still in the case where we basically want to make sure that our read properly went through. Let me see, yeah. So with the lazy synchronization,
Starting point is 01:21:03 we basically, rather than doing this locking, only traverse once. And rather than deleting directly, which would lead to this case where we're losing the node where we changed something and we have to validate again, we're deleting logically first and then physically. So there we're still taking the lock, but first we're marking the node atomically as deleted with some kind of marker bit, making sure that anybody who's still in these nodes here has enough time and can see that we've been deleting this. So if they pass the deleted node, they will see, oh, this is deleted, so I can just jump across, but the list is still intact, so we're not ending up somewhere else. In this case, we can go wait-free through the list, and also just checking if something is
Starting point is 01:22:06 there is wait-free, so we don't need a lock for this. But if we're adding something or removing something and a thread is really slow, then we might have to re-traverse. Now we can combine all this with optimistic lock coupling, but we will do this next time, because we're already close to the end. So optimistic lock coupling is basically using lazy locking, lock coupling, and optimistic reading in a quite nice fashion.
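For the lazy variant just described, a minimal sketch with a per-node marked flag (hypothetical names; the same sentinel assumptions as before; physical reclamation of unlinked nodes is again left out):

```cpp
#include <atomic>
#include <mutex>

struct LazyNode {
  int key;
  std::atomic<bool> marked{false};      // logically deleted?
  std::atomic<LazyNode*> next{nullptr};
  std::mutex lock;
};

// With the mark bit, validation no longer re-traverses: pred and cur are fine
// if neither is marked and pred still points to cur.
bool lazy_validate(LazyNode* pred, LazyNode* cur) {
  return !pred->marked.load() && !cur->marked.load() &&
         pred->next.load() == cur;
}

bool lazy_remove(LazyNode* head, int key) {
  for (;;) {
    LazyNode* pred = head;                      // traverse without locking
    LazyNode* cur = pred->next.load();
    while (cur->key < key) { pred = cur; cur = cur->next.load(); }
    std::scoped_lock guard(pred->lock, cur->lock);
    if (!lazy_validate(pred, cur)) continue;    // interference: retry
    if (cur->key != key) return false;
    cur->marked.store(true);                    // 1) logical removal
    pred->next.store(cur->next.load());         // 2) physical unlink
    return true;                                // the node itself is freed later
  }
}

// contains() needs no lock at all: a marked node counts as already removed.
bool lazy_contains(LazyNode* head, int key) {
  LazyNode* cur = head;
  while (cur->key < key) cur = cur->next.load();
  return cur->key == key && !cur->marked.load();
}
```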
Starting point is 01:22:48 So, questions so far up to here? Yes? In the optimistic locking, I don't understand why, if we have a lock on 42, it is possible that it can be removed somewhere else? So the case is basically: I'm stuck in 42. Basically, let me see. We're not taking a lock for the traversal. So I'm going all the way to 42, and I'm taking a lock on 42.
Starting point is 01:23:24 So nobody else at this point has a lock on 42. But I think... So it's... What was the case? So somehow we can basically get to a point where two threads are trying to delete two nodes that are adjacent to each other. So I'm basically taking, while I'm going in this lock step, right? So I'm taking this lock.
Starting point is 01:24:07 No, I'm not locking at all. I'm going here. I'm taking these two locks. So what can happen is basically: I'm reaching the lock, the node is locked, so I'm waiting for the lock; I get the lock, but in this state the node has actually been deleted. So I have the node, but it's deleted.
Starting point is 01:24:39 So I somehow need to make sure... So I get the node: while traversing, the thread was still able to see the node, and it's trying to acquire the lock, so it's basically in line to get the lock, but then, before it gets the
Starting point is 01:25:00 lock, the node is deleted. Then I get the lock, and I see that I still have the node, basically, because during the traversal I could still access it. And then I somehow need to deal with this thread that saw the node but is not aware that it's actually been deleted.
Starting point is 01:25:20 And for this, if I want to make this work, I somehow need to make sure that the thread that still has access to this node becomes aware of the deletion, say, for example, either by marking it or by re-traversing. Basically, then: I could still see it, I'm changing it, I'm going back there... oh, it's not there anymore.
Starting point is 01:25:44 So there is something wrong here. Makes sense? Okay, good question. More questions? So it's a bit... I mean, you can see, it seems simple, but there are just so many ways how this can go wrong under parallel access. Because if multiple threads see the data structure at the same time, they basically have a pointer to this. They can basically still load the node, which is still there but not used in the data structure anymore. This is where we somehow need to either mark it or make sure it's not accessible anymore, etc.
Starting point is 01:26:33 Okay, good. With this, if there are no other questions, we'll go through lock coupling, or optimistic lock coupling, which is actually a quite nice technique, next time. Next time meaning tomorrow, upstairs. Thanks.
