Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Non-uniform Memory Access

Episode Date: June 5, 2024

...

Transcript
Discussion (0)
Starting point is 00:00:00 So welcome everybody to our next session. Today it's finally about non-uniform memory access, somewhat later than originally planned. As you noticed, last time I took a bit long with the data structures, or rather with the locking, so we'll talk about the data structures today. But first I'm going to make two announcements, once I have focus back here. Three announcements, actually. First of all, I told you I'm going to repeat this until it happens: we'll have the presentation by Götz on Friday. Please come. Unfortunately, we won't have the presentation by Timo. He had an urgent matter and cannot make it.
Starting point is 00:00:45 And we still have the survey open. I've seen there are already a couple of answers and some good comments and suggestions. We'll try to implement those. One thing I saw multiple times is you asking for the top solutions, so what's the difference between those and your code. We'll try to come up with some way to present this. Right now my idea is to have a special session after the semester, when the last task is due, where the people with the top
Starting point is 00:01:21 solutions can present their code to everybody. We're going to do this here, not online, not recorded. So that's just an idea for now; I'll have to talk to my folks here and see how we implement this, but I think that might be useful for you. So that's something. If you have further ideas, suggestions, etc., please fill in the survey. We'll try to make something out of this. There are still not that many submissions.
Starting point is 00:01:53 I think there were four, even though there are quite a few more of you. So do this, it helps us. Or rather, it hopefully helps you. Okay, so before I go into NUMA, which we're going to do today, we're going to go back and talk about concurrent data structures. And I've tried to figure out the question
Starting point is 00:02:15 that we had yesterday regarding the sharing and invalidation, and I couldn't. Just by googling this a bit, I didn't find anything that would really tell me, OK, this works or it doesn't. Sorry about this. I would have to do more googling, et cetera, but feel free to do it yourself.
Starting point is 00:02:34 But I came up with a quote, which I actually really liked, so I'm going to tell you this. It's regarding cache invalidation: there are two famously hard problems in computer science, cache invalidation, naming things, and off-by-one errors. So that's at least one of the things that I found out. So let's go to concurrent data structures. So basically, what we talked about last time, the locks:
Starting point is 00:03:02 how do we apply this to data structures? We'll talk about a linked list, a very simple data structure. You have an interface with insert, remove, and contains; that's all that we need. All the elements are stored in a linked list, sorted by key. That could be one way of implementing this, right?
Starting point is 00:03:30 And we have a head and a tail that always exist. And now the question is, how can we make this concurrent? This is actually something that you already did as a warmup task, and you will do it again in more detail and in a fancier way in the last task, where you're going to do this without locks, so lock-free.
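To have something concrete in mind for the rest of the discussion, here is a minimal, unsynchronized sketch of what such a sorted linked list could look like in C++ (illustrative names, not the course's reference code; keys are assumed to lie strictly between the sentinel values):

    #include <climits>

    struct Node {
        long  key;
        Node* next;
    };

    class SortedList {
    public:
        SortedList() {
            head_ = new Node{LONG_MIN, nullptr};   // head sentinel: smaller than every real key
            tail_ = new Node{LONG_MAX, nullptr};   // tail sentinel: larger than every real key
            head_->next = tail_;
        }
        bool contains(long key) const {
            Node* cur = head_->next;
            while (cur->key < key) cur = cur->next;   // walk until the key position
            return cur->key == key;
        }
        bool insert(long key) {
            Node* pred = head_;
            Node* cur  = head_->next;
            while (cur->key < key) { pred = cur; cur = cur->next; }
            if (cur->key == key) return false;        // already present
            pred->next = new Node{key, cur};          // link the new node in
            return true;
        }
        bool remove(long key) {
            Node* pred = head_;
            Node* cur  = head_->next;
            while (cur->key < key) { pred = cur; cur = cur->next; }
            if (cur->key != key) return false;        // not found
            pred->next = cur->next;                   // unlink
            delete cur;
            return true;
        }
    private:
        Node* head_;
        Node* tail_;
    };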
Starting point is 00:03:54 And yeah, so one way is doing very coarse-grained locking. I mean, this data structure everybody understands, right? In order to find something, we start from the head and go through the list until we find our element. If we want to add something, we go to the right position in the list, create a new node, plug the new node in, and so on; remove works the same way. And the very simple way of making this concurrent,
Starting point is 00:04:33 so letting multiple threads use this data structure in a way that we don't destroy it, is just a single lock for the whole data structure. And this can be a reader and writer lock. We talked about this: if we're just reading, we can still have concurrent access; as soon as we're writing, we need exclusive access. In any case, we're just going to have this one lock.
Starting point is 00:04:58 And then with this one lock, only one thread at a time can basically access the data structure. Again, the type of lock can be different, right? So what were the types of locks that we talked about that we could use here? We had, overall, two broad categories of locks. Yes? Locks and latches? Locks and latches, that's a different, I mean, a different distinction. Locks and latches basically differentiate between where we apply our locks.
Starting point is 00:05:39 So if we differentiate between locks and latches, then everything we're talking about here is latches. The latches are the locks for data structures. Locks, in that distinction, are the locks for tables, so for locking a table, a column, a row, et cetera, in the database. So this is very good, it's an even higher-level differentiation. But let's go one level down.
Starting point is 00:06:14 So we said there are two ways of implementing locks. Yes? You could use a spin lock or an OS lock. Exactly. So we have the two options of using OS locks or using spin locks, i.e., user-space locks. And the difference is: an OS lock is something where the thread basically goes to sleep
Starting point is 00:06:40 and, once the lock is free, will be woken up again. And the user-space lock, the spin lock, is something where we have some kind of data item that the user thread spins on, doing busy polling, checking whether the lock is free or not. And we said for efficiency reasons, if we want to have low latencies, then a user-space lock, so a spin lock, is faster.
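As a small illustration of the spin-lock idea, here is a minimal test-and-test-and-set spin lock in C++; this is a generic sketch, not the exact lock from the slides:

    #include <atomic>

    class SpinLock {
    public:
        void lock() {
            // Try to set the flag; on failure, spin on plain loads first
            // (test-and-test-and-set) so we mostly read from our own cache.
            while (flag_.exchange(true, std::memory_order_acquire)) {
                while (flag_.load(std::memory_order_relaxed)) { /* busy wait */ }
            }
        }
        void unlock() {
            flag_.store(false, std::memory_order_release);
        }
    private:
        std::atomic<bool> flag_{false};
    };

All threads still spin on the same cache line here, which is exactly the invalidation problem that the queue-based locks mentioned next avoid by giving each thread its own queue element to spin on.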
Starting point is 00:07:14 The spin lock is faster because we don't have these constant thread switches, but it burns CPU cycles all the time. The thread doesn't go to sleep; it just continuously tries to acquire the lock. Within this, we again had different implementations, right? We can just have a single data item, the cache line that we're locking on, and that's basically inefficient, we said,
Starting point is 00:07:42 because we're going to have this invalidation of the cache line every time we access the lock. So what we want is a queue-based lock, where each of the threads gets their own queue element that they can spin on, and so we don't have this cache invalidation. Basically, everything is always in the caches except for when the lock gets free.
Starting point is 00:08:06 Then we only invalidate the cache of the thread that is next in line: the thread that currently has the lock gives it up and opens up the lock for the next thread. So that's more efficient and it doesn't lead to this contention. Okay. So now let's say we have this lock. You could use any of those implementations for these locks, but we're going to think about the spin lock, or the queue-based spin lock, if we want to have a fast data structure. Again, we said there are trade-offs. And now we have this single lock. We can, again, have locks for reading and for writing.
Starting point is 00:08:57 So, as soon as we only have read-only access, we know this can be done concurrently. If we only take the read lock, then multiple threads can lock at the same time. And there was yet another thing: we have optimistic locking and we have pessimistic locking, or a combination of the two. So just to remind you. And all of those will work here, right? All of these different lock variants will work with the data structure. Which one you use really just depends on how frequently you need to release and acquire the lock, and whether you have more read or more write accesses.
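A minimal sketch of the coarse-grained variant, assuming the SortedList skeleton from above: every method simply goes through one std::shared_mutex, shared for reads, exclusive for writes:

    #include <shared_mutex>

    class CoarseGrainedList {
    public:
        bool contains(long key) const {
            std::shared_lock<std::shared_mutex> guard(mutex_);  // shared: readers run concurrently
            return list_.contains(key);
        }
        bool insert(long key) {
            std::unique_lock<std::shared_mutex> guard(mutex_);  // exclusive: one writer at a time
            return list_.insert(key);
        }
        bool remove(long key) {
            std::unique_lock<std::shared_mutex> guard(mutex_);
            return list_.remove(key);
        }
    private:
        mutable std::shared_mutex mutex_;  // one lock for the whole data structure
        SortedList list_;                  // the unsynchronized list from the sketch above
    };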
Starting point is 00:09:41 OK, but having one single lock, whichever variant it is, is easy to implement, but it does not scale, right? This is fairly easy to understand: multiple threads trying to access the same lock, or rather the same data structure. Let's say it's a very large data structure
Starting point is 00:10:08 with many elements. This means that every single access, wherever it goes in the data structure, is blocked by just a single other access. That's not what we want. So in order to make this locking more scalable, we can do fine-grained locking: let's split up the object into independently synchronized components. Then we only have a conflict whenever the same component in the data structure is accessed.
Starting point is 00:10:42 So it's basically only if we're accessing the same element in our list that we have this problem. We can also do optimistic synchronization, which means we search without locking. If we want to insert something, for example, we don't need to lock right away. First we need to find where to insert, and once we're there, we can start to lock, because up to there it's only read access.
Starting point is 00:11:15 Then we're going to lock and try to do our operation. If it worked out, good. But maybe something else changed in between, right? Because we only read until that point. If it didn't work out, since we're doing this optimistically, we have to start over, just like with optimistic locking in general.
Starting point is 00:11:36 So we just go there and read our stuff. The node might be changed while we're reading; this can especially happen with very slow reads. If we have big reads or complex data structures, this is more likely than in this small linked list, where a read will most likely be very efficient. There it's unlikely that something else changes in between, but it might still happen.
Starting point is 00:12:03 We might still have this race condition, so we always have to check. We can also do lazy synchronization. If we update the data structure, we can just put a marker. We're saying, OK, this needs to be updated, but we're not updating it right away;
Starting point is 00:12:25 we just mark it as to-be-updated in one way or the other, and then we change it later. This is especially useful for removing, right? For removing, we can just put a marker on the node and say this one is gone now. We're not physically removing it right away, but only once things have settled or we have some more time, or we just asynchronously rewire it later.
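A rough sketch of that lazy idea (hypothetical names, heavily simplified): every node carries a marked flag, removal only sets the flag, and the physical unlinking happens later:

    #include <atomic>

    struct LazyNode {
        long                   key;
        std::atomic<LazyNode*> next;
        std::atomic<bool>      marked{false};   // true = logically deleted
    };

    // Logical removal: just set the marker; the node stays linked for now.
    // The physical unlink can happen later, e.g. under the predecessor's lock
    // or in a background cleanup pass.
    void mark_deleted(LazyNode* node) {
        node->marked.store(true, std::memory_order_release);
    }

    // Readers treat marked nodes as if they were already gone.
    bool lazy_contains(LazyNode* head, long key) {
        LazyNode* cur = head->next.load(std::memory_order_acquire);
        while (cur->key < key) cur = cur->next.load(std::memory_order_acquire);
        return cur->key == key && !cur->marked.load(std::memory_order_acquire);
    }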
Starting point is 00:12:52 And of course, we can also do all of this lock-free, so we don't use any locks at all. We just use these test-and-set or compare-and-swap operations, checking: is this still the value I expect or not? And well, this is good, right? It's fast, we're not locking, but it's actually complex to implement, and for complex data structures the overhead might actually be quite high. So let's take a look at how we can implement these different approaches on this linked list data structure.
Starting point is 00:13:38 Let's start with going through the list in a locking way, with read locks. Of course, we could say we don't want read locks, but if we have read locks, then we don't want any writing to happen while we're reading. So this is not optimistic, but
Starting point is 00:14:00 a kind of pessimistic locking, yet still fine-grained. And in order to ensure this, we have to make sure that we're not somehow overtaken: we're hopping from node to node, and somebody else could be locking and changing things around us, so we would lose our guarantees, essentially. To prevent this, what we do is lock coupling. We're not locking just a single element; we always hold at least one lock
Starting point is 00:14:29 while walking through the list one by one. We start with the head, we lock the head, and before giving up the lock on the head, we lock the next node. Only once we have that lock do we give up the lock on the head. That's why this is also called crabbing: we're basically hopping from one node to the next, holding at most two locks at a time and at least one lock at a time. We never hold no lock. So that's the idea.
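A sketch of this hand-over-hand traversal for remove, with one mutex per node (simplified: it assumes head and tail sentinels as before, and it does not free the unlinked node, since safe reclamation is a separate problem):

    #include <mutex>

    struct LockedNode {
        long        key;
        LockedNode* next;
        std::mutex  lock;   // per-node latch
    };

    // Lock coupling: always hold the lock on the current node before taking
    // the lock on its successor, so nobody can slip in between.
    bool coupled_remove(LockedNode* head, long key) {
        LockedNode* pred = head;
        pred->lock.lock();                   // start by locking the head sentinel
        LockedNode* cur = pred->next;
        cur->lock.lock();                    // now we hold pred and cur
        while (cur->key < key) {
            pred->lock.unlock();             // release the older lock only after
            pred = cur;                      // the next one is already held (crabbing)
            cur  = cur->next;
            cur->lock.lock();
        }
        bool found = (cur->key == key);
        if (found) pred->next = cur->next;   // unlink while holding both locks
        cur->lock.unlock();
        pred->lock.unlock();
        return found;
    }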
Starting point is 00:15:03 And we then also release the locks pairwise. And of course, we can use read and write locks if we want to have concurrent readers: we could have completely exclusive access, or we can have different kinds of locks for reading and writing in order to allow concurrent readers. Okay, so this is basically what this will look like. If I want to lock, say, the number 42, because I want to update it or remove
Starting point is 00:15:36 it, something like this, I'm going to start with the head, then lock the 7, then I can give up the head, then lock the 42, and this is then where I am. So again, this is easy to implement and it's much better than the coarse-grained lock, but it's still inefficient for read access, because we have to slowly walk through the list. If we want to add something, we lock the predecessor and the successor, and then we can just add in between.
Starting point is 00:16:19 So everything is safe. If we want to remove something, we lock the predecessor and the node that we will remove. And for a contains, we would lock the predecessor and the current node, just walking along this way. Of course, while just reading, we would only really need the lock on the current node, but since we're always doing this crabbing, in many cases we will hold these two locks. In order to make the reading more efficient,
Starting point is 00:16:55 we can add optimistic locking here. We basically say that for reading, we don't do any locking at all: we can just read through the list without any locks. And then, once we want to change something, we lock the predecessor and the current node, or the future successor of the new node, and do our changes. This means every time we want to change something, we only lock exactly where we
Starting point is 00:17:35 are. We don't need any other locks. We still take these two locks to make sure we're not losing our position in between. But since we're moving through the list without locks and only lock at the point in the list where we're actually doing our business, where we're changing something, somebody else might also try to acquire these locks. And if our thread is somewhat slow or the other thread is waiting for the lock... Say, for example, we've moved all the way to the 42 and we already have the lock for the 42,
Starting point is 00:18:23 and another thread also tries to grab the lock for the 42 while it already has the lock for the 11. We're adding the 46, and once we're done, the other thread gets its lock and deletes the 42. Then, well, we have inserted our 46 correctly, but the 42 is gone, so basically our node is gone again: the thread that locked the 11 and the 42 and removed the 42 will just reroute
Starting point is 00:18:52 the pointer from the 11 to the 47, so our node 46 is now in nirvana, hanging somewhere, not reachable anymore. It has a correct pointer to the 47, but the 42 it hangs off is gone. In order to avoid this, the next step is that we need to validate our changes again. The principled approach would be to start from the beginning again, walk through the whole list, and ensure that we can still reach our node. If we can still reach it, then our validation has worked out and all is good. So here the lock contention is unlikely, but the bad thing is that we need to traverse the list twice in order to figure out that our access actually worked out.
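A rough sketch of this optimistic variant, reusing the LockedNode from the lock-coupling sketch: search without locks, lock the two nodes of interest, then re-traverse from the head to validate that the predecessor is still reachable and still points to the current node, and retry otherwise (it assumes that removed nodes are not immediately freed):

    // Validation: pred and cur were found without locks, so after locking them
    // we re-walk from the head and check that pred is still reachable and
    // still points to cur. If not, the caller starts over.
    bool validate(LockedNode* head, LockedNode* pred, LockedNode* cur) {
        LockedNode* node = head;
        while (node->key <= pred->key) {
            if (node == pred) return pred->next == cur;   // reachable and still adjacent
            node = node->next;
        }
        return false;                                     // pred was unlinked in the meantime
    }

    bool optimistic_insert(LockedNode* head, long key) {
        for (;;) {                                        // retry loop
            LockedNode* pred = head;                      // 1. search without any locks
            LockedNode* cur  = pred->next;
            while (cur->key < key) { pred = cur; cur = cur->next; }
            pred->lock.lock();                            // 2. lock only where we work
            cur->lock.lock();
            if (validate(head, pred, cur)) {              // 3. second traversal to validate
                bool ok = (cur->key != key);
                if (ok) pred->next = new LockedNode{key, cur};
                cur->lock.unlock();
                pred->lock.unlock();
                return ok;
            }
            cur->lock.unlock();                           // validation failed: start over
            pred->lock.unlock();
        }
    }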
Starting point is 00:19:54 So far, so good? OK. Now, another thing we said we can do is lazy synchronization. Rather than doing our changes right away, we can delete logically first and then do the physical removal later. And this basically means that the problem
Starting point is 00:20:21 we had earlier does not necessarily happen. Because if we work lazily, if we have a marker in our node, then we know, OK, this node will be deleted; some other thread put a marker here. Then we can validate this earlier. And marker bits are useful for other things as well. And here we can traverse just as before.
Starting point is 00:20:55 So we just walk through the list. Read access doesn't have to do anything special, but add and remove may have to re-traverse if there are still some slow threads accessing these nodes. So we also might use these markers for that. We basically have to ensure that there is no other thread that still has something to do here. If there might be another thread, then we may have to re-traverse to make sure
Starting point is 00:21:26 our changes are safe: say we are adding a node and some slow thread put a marker, then we need to re-traverse. If we can ensure that nobody else acquired a lock here or made any changes, then we don't have to re-traverse. Now, to combine these ideas: so far, if we don't know who did an update and where an update happened, we can add something more. We can do optimistic lock coupling. Because so far, we don't know if something has really changed in here.
Starting point is 00:22:18 We just traverse and see whether the node is still accessible, and in order to really figure that out, we have to go through the list again and see that no changes have been made. To avoid this, what we can do is add an update counter per node. And the update counter will tell us,
Starting point is 00:22:39 OK, something has changed here. If something has changed, then I have to restart. If the update counter is still the same, then I know my nodes are still safe, I can do everything locally, I don't have to restart. On the write side, I acquire a lock, so I'm excluding any other writes, and I increment the counter when I'm unlocking. Think of the problem we might have: another thread read the node during its traversal, then tried to acquire the lock and was waiting for the lock to get access to the node.
Starting point is 00:23:27 I'm traversing to the node, then I'm acquiring the lock. While I'm trying to acquire the lock, another thread skips by and changes the node, deletes the node, something like this. This is something I cannot see on the node if there's not some kind of marker in there. So now, in this case, I have a counter, and while I'm traversing, I'm looking at these counters.
Starting point is 00:23:52 Every time a thread updates something here with this optimistic lock coupling, it bumps the counter, so when I see a change in the counter, I know something has changed and I have to start again. Of course, for the writes, I also have to update these counters. So: I go to the right position without acquiring any locks so far. Once I'm at the right position, where I want to do my changes, I acquire the locks. I will have read the node already,
Starting point is 00:24:30 and the counter for the node. Then, once I'm acquiring the lock, I of course have to read the counter again, because it might have changed in the meantime. Assuming nothing has changed, I do my work and then update the counter. And once I see that something has changed,
Starting point is 00:24:55 I go through the list to the right position again. When I'm at the right position, I remember the update counter. Once I have acquired the lock for the node, I check the counter again. If something has changed, I don't keep the lock; I basically have to start again and find the node again, in order to be safe that
Starting point is 00:25:19 I'm still at a point in the list that still makes sense. In the read path, I basically don't need to acquire any locks. I just proceed optimistically, and I can detect any kind of changes through these update counters. Once I'm at the node that I need, I read the node and do my work. Of course, this might be more complex in general; with this list, a read will be fairly fast,
Starting point is 00:25:51 I'm just getting the value. Then I check the update counter again. If the counter is still the same, I have actually read the current version of the list. If the update counter is different, then something has changed and I have to read again. And I can also interleave the version validations across multiple nodes, just like in lock coupling, making sure that I walk through the list in a similar way.
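The core mechanism can be sketched as a per-node sequence lock, which is roughly how optimistic lock coupling is implemented for ART, where the version word also carries the lock bit. This is a simplified illustration; note that the unsynchronized read of the payload is, strictly speaking, a data race that real implementations handle explicitly (for example by reading such fields atomically):

    #include <atomic>
    #include <cstdint>

    struct OlcNode {
        std::atomic<uint64_t> version{0};   // even = unlocked, odd = writer active
        long                  key;
        OlcNode*              next;
    };

    // Writer: "lock" by making the version odd, change the node, make it even again.
    void olc_write(OlcNode* node, long new_key) {
        uint64_t v = node->version.load(std::memory_order_relaxed);
        for (;;) {
            while (v & 1) v = node->version.load(std::memory_order_relaxed);  // wait for other writer
            if (node->version.compare_exchange_weak(v, v + 1, std::memory_order_acquire)) break;
        }
        node->key = new_key;                                     // exclusive access now
        node->version.store(v + 2, std::memory_order_release);   // unlock and signal "changed"
    }

    // Reader: no lock at all; check the version before and after, retry on change.
    long olc_read_key(OlcNode* node) {
        for (;;) {
            uint64_t v1 = node->version.load(std::memory_order_acquire);
            if (v1 & 1) continue;                 // a writer is active, try again
            long key = node->key;                 // optimistic read, may be retried
            uint64_t v2 = node->version.load(std::memory_order_acquire);
            if (v1 == v2) return key;             // version unchanged: consistent read
        }
    }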
Starting point is 00:26:18 Again, this is easy to implement. It's scalable, but it needs restarts. As soon as we hit an update counter that doesn't match, we need to start from scratch again, from the beginning of the list. And we've looked at other data structures already; Martin did this with you.
Starting point is 00:26:48 And here, this is for ART, for example, the adaptive radix tree. Viktor Leis and colleagues implemented the different locking techniques for ART and tried them out for lookups, inserts, and removes, also with no synchronization at all. No synchronization at all means, well, for lookup this works if you don't have any updates.
Starting point is 00:27:18 For inserts and removes it doesn't work, because your data structure breaks very quickly. But for the lookup, if we don't have any synchronization whatsoever, we can see that the performance is fairly good. And then we have things like hardware transactional memory, I'll come to this in a bit. We have ROWEX, which is kind of a combination. We have lock coupling, which does not scale so well.
Starting point is 00:27:51 And we have optimistic lock coupling, which is actually much faster, right here. And we see the same basically for inserts and for removes. The optimistic lock coupling is actually always more or less the best, together with or similar to ROWEX, which I'll explain in a second. And you can see the summary in terms of scalability here,
Starting point is 00:28:20 this is also out of the paper: lock-free is more scalable, but it's actually quite hard to implement. Fine-grained locking is easier, but it's not really scalable. Optimistic lock coupling is fairly easy, not quite as scalable as lock-free, but almost. And then hardware transactional memory is also easy to use and scalable, but it has
Starting point is 00:28:48 certain limitations. We'll come to this in a second as well. Yet another idea is lock-free data structures. Rather than using locks, we basically always check whether something has changed, using these compare-and-swap operations. We check the value: if the value is still what we expect, then we update; if it's not, then we report failure and start again. So we're basically using marker bits and trying to change things, always checking: can I do an update here?
Starting point is 00:29:32 So I'm basically going through, I'm preparing an update, I'm checking again whether the version is fine, and if this worked out, I'm good. If it didn't work out, then I have to try again, essentially. For deletion, for example, I would use two markers. First I go through the list to the node that I want to delete, then I put a marker on that node.
Starting point is 00:29:58 I basically mark it as deleted, and from that point on it is logically deleted. Then I also put a marker on the pointer in the previous node, the predecessor, to make it clear that this pointer to the node will need to be updated. As soon as I mark such a pointer, the pointer cannot be changed anymore. This is just to ensure that nobody else rewires this somehow, say, for example, by inserting something
Starting point is 00:30:43 that would then hang off an already deleted node, or a node deleted later on, so that our list falls apart. What I do is: before I physically change the list, I put markers everywhere, making sure that no other thread can change these pointers, and then I start to update. And this I can do atomically: these markers I can set atomically, so I can ensure that nobody else will interfere. I just have to ensure that, across these different kinds of operations, writes and reads and deletes and so on, everything is in sync, so that if I'm doing a delete, a concurrent write cannot somehow crash or break my delete.
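To give a flavor of what these atomic marker operations look like, here is a compressed sketch in the Harris style, where the mark bit is stolen from the lowest bit of the next pointer, so that both the logical delete and the physical unlink are single compare-and-swap operations. The helper functions are my own simplification; the referenced paper and task four spell out the full algorithm:

    #include <atomic>
    #include <cstdint>

    struct LfNode {
        long                 key;
        std::atomic<LfNode*> next;   // lowest bit doubles as the "logically deleted" mark
    };

    inline LfNode* strip(LfNode* p)     { return reinterpret_cast<LfNode*>(reinterpret_cast<uintptr_t>(p) & ~uintptr_t(1)); }
    inline bool    is_marked(LfNode* p) { return reinterpret_cast<uintptr_t>(p) & 1; }
    inline LfNode* marked(LfNode* p)    { return reinterpret_cast<LfNode*>(reinterpret_cast<uintptr_t>(p) | 1); }

    // Step 1 of a lock-free delete: atomically set the mark bit on cur's next pointer.
    // Once this CAS succeeds, nobody can insert directly behind cur anymore.
    bool mark_for_deletion(LfNode* cur) {
        LfNode* succ = cur->next.load(std::memory_order_acquire);
        while (!is_marked(succ)) {
            if (cur->next.compare_exchange_weak(succ, marked(succ), std::memory_order_acq_rel))
                return true;                 // we logically deleted the node
        }
        return false;                        // someone else marked it first
    }

    // Step 2: physically unlink by swinging the predecessor's pointer with a CAS.
    // If this fails, the list changed; the caller re-traverses and retries.
    bool unlink(LfNode* pred, LfNode* cur) {
        LfNode* succ = strip(cur->next.load(std::memory_order_acquire));
        LfNode* expected = cur;              // pred->next must still be cur, unmarked
        return pred->next.compare_exchange_strong(expected, succ, std::memory_order_acq_rel);
    }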
Starting point is 00:31:35 While this might still sound a bit high level, you will see all the details in task four. There is a nice paper that we're referencing where the individual steps are explained explicitly, so you will actually see this in your own implementation. Okay, so locking and lock-free are actually not exclusive, right? It doesn't mean you either implement a lock-free data structure or a locking data structure; it's more a property of a method. So we could have a lock-free read.
Starting point is 00:32:19 We could have a lock-free delete while still having a lock for writes, for example, or for inserts and things like this. So we can do different combinations of lock-free and locking, just to make it easy for us or to make it scalable. Wherever it makes sense to have a lock-free implementation, because, say, this is something that we're doing very frequently, like inserting,
Starting point is 00:32:52 and we don't want to have high contention on the node or contention on the lock, we implement that part lock-free. For other things, it might make more sense to actually lock. We can, again, do it differently per operation. Sometimes it even makes sense to just lock the whole data structure for certain kinds of methods; think of a complete reorganization, right?
Starting point is 00:33:15 Say, for example, you have some kind of index and every now and then you want to reorganize it. Then it makes sense to actually lock the whole thing, do your reorganization, and do any other changes later on, just because the fine-grained locking would only be overhead and everything else would just interfere with your work. And one example here, this is what we saw earlier, is read-optimized write exclusion (ROWEX). This is, for example, implemented in ART.
Starting point is 00:33:46 There we have a lock-free contains, and we have locks for adds and deletes. This basically means we need some kind of atomic access to the concurrently read fields. So if we have multiple fields that readers access without locks, we need to implement the writes in such a way that we still have this atomic access.
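The general ROWEX principle can be illustrated on the list like this (a simplified sketch of the idea, not the actual ART code): readers only do atomic loads and never lock, writers serialize among themselves with a lock and publish fully initialized nodes with a single release store:

    #include <atomic>
    #include <mutex>

    struct RowexNode {
        long                    key;    // immutable once the node is published
        std::atomic<RowexNode*> next;   // read concurrently, therefore atomic
    };

    struct RowexList {
        RowexNode* head;          // sentinel
        std::mutex writer_lock;   // writers exclude each other; readers ignore it

        // Lock-free contains: only atomic loads.
        bool contains(long key) const {
            RowexNode* cur = head->next.load(std::memory_order_acquire);
            while (cur->key < key) cur = cur->next.load(std::memory_order_acquire);
            return cur->key == key;
        }

        // Writers take the lock and publish the new node with a release store,
        // so concurrent readers see either the old or the new, consistent state.
        bool insert(long key) {
            std::lock_guard<std::mutex> guard(writer_lock);
            RowexNode* pred = head;
            RowexNode* cur  = head->next.load(std::memory_order_relaxed);
            while (cur->key < key) { pred = cur; cur = cur->next.load(std::memory_order_relaxed); }
            if (cur->key == key) return false;
            RowexNode* node = new RowexNode{key, {cur}};        // fully set up before publishing
            pred->next.store(node, std::memory_order_release);  // single atomic publish point
            return true;
        }
    };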
Starting point is 00:34:12 Yet another approach to synchronization is hardware transactional memory. Basically, the hardware supports transactions, or transactional guarantees, based on the cache coherency protocol. This means that, as long as we stay within a few cache lines, we can use specific operations that protect these individual cache lines. So this only works for very small footprints, on cache-line granularity, right?
Starting point is 00:34:58 And this is neat because, since we have this hardware support, it basically means we're not taking a lock unless another thread, or another core, actually tries to access the same data. So this is a bit more efficient than if we implement it with a lock, because if we have a lock,
Starting point is 00:35:22 we'll pay for the lock all the time, right? If we implement a lock in our code, we'll always write this lock word in order to hold the lock and to ensure that another thread will not access the data. With hardware transactional memory, the hardware only falls back to conflict handling if another thread actually tries to access the data; otherwise, no lock is ever created. This is nice, but we might get false conflicts because of the cache-line granularity. So essentially what happens is,
Starting point is 00:36:09 once we're below cache-line granularity, that's what's called false sharing: we have multiple data items that fit into a single cache line. The hardware transactions always work on cache-line granularity, so if two threads are changing different items in the same cache line, then we'll have false sharing, a false conflict in
Starting point is 00:36:38 the hardware transaction, even though the threads don't actually touch the same item. Yeah, so this is just another visualization: as long as multiple threads are working on different cache lines, this will not create any conflicts. If they're working on the same cache line, then we get a conflict and the other thread will basically be blocked; the access to the cache line will be denied until the cache line is written back by the thread currently working on it.
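On Intel this is exposed as TSX/RTM. Below is a hedged sketch of such an elided critical section; it assumes the CPU actually supports RTM (real code checks CPUID first, and TSX is disabled or fused off on many recent CPUs), it has to be compiled with -mrtm, and it always needs a fallback lock for aborted transactions:

    #include <immintrin.h>
    #include <atomic>

    std::atomic<bool> fallback_lock{false};   // only taken when the transaction aborts

    void transactional_increment(long& counter) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Transactional path: no lock is taken. The hardware tracks the touched
            // cache lines and aborts if another core writes to one of them.
            if (fallback_lock.load(std::memory_order_relaxed)) _xabort(0xff);
            counter += 1;
            _xend();                          // commit
            return;
        }
        // Fallback path: the transaction aborted (conflict, capacity, ...) -> take a real lock.
        while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
        counter += 1;
        fallback_lock.store(false, std::memory_order_release);
    }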
Starting point is 00:37:06 Okay, so that was a very quick run through the linked list with different lock granularities and lock types. And remember, we can still implement the locks themselves in different ways, and each function we can lock or make lock-free in different ways. You just have to make sure that, in the end, everything still works together, right? So if we have a lock-free contains, or a read that is lock-free
Starting point is 00:37:53 and completely non-synchronized, then we need some extra work for the writes, because somebody else might be walking onto our current node at any time and we cannot see what else is going on. Questions up to here? Nope. Very good.
Starting point is 00:38:20 Then let's switch to NUMA. Okay, so we're at the last part of the CPU package. You can see I've drawn this nicely, right? It took a bit of effort. Here we have the one CPU and we have another CPU next to it. Last time this was still grayed out; now this is also blue, meaning we have multiple CPUs. And this is the case in many servers these days.
Starting point is 00:38:53 Laptops typically don't have multiple CPUs, but servers often do. Typically two; there are servers with four and eight CPUs, but those are few, and most servers today have two CPUs. You still get servers with one CPU, but I would say the default configuration typically is two CPUs.
Starting point is 00:39:21 And now we have to see what this means for data access. We'll look at how the memory is attached to these CPUs and what this means for latencies, then at the different kinds of accesses that we can do, what the hardware looks like, and how we can use this in the database management system and in the OS. And we'll revisit the multi-way sort-merge join and how we can optimize it for NUMA. I actually kicked out the massively parallel sort-merge join because it's not really tied to NUMA that much. Okay, so non-uniform memory access. I'll start this before we do the break.
Starting point is 00:40:19 Okay, so in general, in order to be efficient in our database management system, or in data processing in general, we always want to have the data local, right? Of course, local means we want to have it in the caches, ideally most of the time, and we want to have the data close to the CPU somehow.
Starting point is 00:40:46 And the database management system ideally knows about the hardware layout underneath, so how the memory is laid out, and places the data of the database somewhere close to the workers that later need to work on it. And this can happen on many different levels, right? We can have this on storage; if we have a network, we want to make sure that most of our accesses don't go across the network, even though networks are fast these days. The closer the data is to our CPU, the higher the bandwidth, the lower the latency,
Starting point is 00:41:28 and the faster the processing. Since our CPUs keep scaling, we get more and more cores and more and more CPUs, and essentially the latencies between different kinds of data accesses change, even within memory on a single server. We could just ignore this.
Starting point is 00:41:56 In general, since we have all this cache coherence and everything, it essentially doesn't matter where on our server our data is residing. Each CPU can have eight DIMMs or something like that; if we have two CPUs, we have 16 DIMMs. In general, it doesn't really matter: I can access data on each of the DIMMs the same way. The access latency differs, but the access itself is the same. So from a program perspective, except that it's a bit slower, I don't notice. And this is what we would call uniform memory access, or UMA: if I completely ignore what's going on underneath, then I'm treating it as uniform memory access.
Starting point is 00:42:41 I can also explicitly enforce this by interleaving all of the memory, basically how the memory DIMMs are used: I split up my data such that a portion of the data is on each of the DIMMs. That increases my bandwidth, because I'm using more memory controllers, but the latency is worse. And typically, from a single core, I'm not going to be able to utilize all of the DIMMs anyway; I can only saturate one or two of them. So it actually makes sense to think about where data is residing, and maybe, in certain situations, to be aware of where the DIMMs are, which DIMMs are closer, and where I have less latency.
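On Linux, this placement can be controlled explicitly, for example with libnuma (a sketch under the assumption that libnuma is installed; link with -lnuma): allocating on one node gives the low local latency, interleaving trades latency for aggregate bandwidth:

    #include <numa.h>      // libnuma, link with -lnuma
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }
        size_t bytes = size_t(1) << 30;   // 1 GiB

        // Node-local placement: all pages on NUMA node 0, fast for threads running there.
        void* local = numa_alloc_onnode(bytes, 0);

        // Interleaved placement: pages spread round-robin over all nodes,
        // higher aggregate bandwidth but partly remote (i.e., higher) latency.
        void* spread = numa_alloc_interleaved(bytes);

        std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

        numa_free(local, bytes);
        numa_free(spread, bytes);
        return 0;
    }

The same policies can also be set from outside the program with numactl, for example numactl --cpunodebind=0 --membind=0 ./program or numactl --interleave=all ./program.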
Starting point is 00:43:39 And today, as I already said, practically all server hardware has some kind of NUMA. And what this looks like is something like this. This would be like a uniform memory access. So we have all of the individual CPUs, and we have one big bus. And this is what's classically called the front-side bus. And then we have a memory controller. And the memory controller would have the DRAMs all connected to it. And this we can still simulate today.
Starting point is 00:44:17 So we can basically just ignore that there is a different setup and say, OK, we're still in the early 2000s or the 90s, where we had a north bridge and a south bridge. The north bridge would coordinate all the memory access; it would basically be the memory controller, a separate chipset that's not on the CPU. And we might have multiple CPUs that use the same
Starting point is 00:44:45 chipset there, which does all of the memory transactions for us. The memory is far away anyway, so a bit more wiring doesn't really make a huge difference. If we have this uniform access, then all of the memory accesses need to go through this bus, at least conceptually. And with more CPUs, the demand on this bus increases, and because of that, the central memory doesn't really scale that well with the number of CPUs. The more CPUs we have, the more pressure we get on this front-side bus and on the memory controller, so we need more and more hardware here.
Starting point is 00:45:33 So it actually makes sense to somehow distribute this, right? And so because of that, chip designers and hardware designers started building a non-uniform memory access. And this means we have an integrated memory controller with each of the CPUs. I mean, it could be integrated, it could be separate. But with each CPU, there's a memory controller. And each CPU basically has their own local memory access.
Starting point is 00:46:10 Because each CPU has its own memory access, we attach separate memory to each of the CPUs. And we still want to have one large memory address space. We don't really want to deal with this from a programming perspective, because otherwise it would mean MPI-style parallel programming, lots of explicit communication, and you don't want to program separately for every hardware configuration: one server, or the previous generation, has six DIMMs, the next generation has eight DIMMs, one server has two CPUs, another has one CPU.
Starting point is 00:46:48 Ideally, we still want to use all of them in the same way. The memory is distributed, and each CPU can read the local and the remote memory, but within a single address space. And this is what it typically looks like: we have two to eight sockets in modern servers, and it's been like this for a long time now. Eight sockets is something fancy;
Starting point is 00:47:21 I don't think we even have a single system with eight sockets here. But this is what it would look like: we have the CPUs, and the memory is connected to one of the CPUs. Each of the CPUs contains between four and 40 cores, or up to 128 on modern AMD CPUs, and then there are a couple of DIMMs
Starting point is 00:47:47 attached to it, six or eight typically, and then some kind of interconnect network between the sockets so they can access each other and communicate with each other. Now, this is basically on top of the network that we have within the chip. Remember, last time I showed you this map of the chip itself; there is this kind of fabric inside the chip.
Starting point is 00:48:12 And then this basically extends to the other nodes, so we can also communicate with the other nodes in a similar way. If we look at this schematically, in my representation, this is what it could look like. This really depends on the vendor and on the motherboard, basically how these are connected. But this would be, as we can see here, for example, one of the setups.
Starting point is 00:48:43 We could have just two sockets connected to each other, or we have this kind of fully connected mesh here, where each node is connected to each other node. That's basically good. Each node, each CPU, each socket has RAM; here I just drew three DIMMs, but as I said, today this would be six or eight DIMMs. And then we have this interconnect. On Intel it's called UltraPath Interconnect (UPI), on AMD it's called Infinity Fabric, and on IBM POWER it's called cross bus or X-Bus. Just so you've heard the names; this is basically how the sockets communicate.
Starting point is 00:49:31 And as I said in two sessions from now, I think we'll talk about CXL and then this further gets extended basically also to communicate outside to other kind of memory. Now local memory access means in here, so in the CPU, we have the internal memory controller. This means like, sorry, this core wants to read something, right? So then of course there's some kind of load instruction. It's not in the L1 cache, not in the L2 cache, not in the L3 cache. What a pity, right?
Starting point is 00:50:09 So we actually have to go all the way down to the memory. The internal memory controller will access the DRAM and load the data into the caches so we can work on it. That's local memory access. If we have remote memory access, we're accessing a virtual memory address whose data is not attached to this CPU. On this core, for example, we want
Starting point is 00:50:33 to access something that's unfortunately not in one of these local DIMMs, and it's not in the caches. Then we have to go through the Infinity Fabric or the UPI, whatever the interconnect is, into the other CPU. The access actually has to go through this UPI connection into the other socket's internal memory controller. This happens without the CPUs doing anything; it's really just these controllers
Starting point is 00:51:05 doing the work, reading this memory block and sending it to the other CPU, which then reads the data. Here I have some examples for old memory architectures. I actually don't know how many generations old this is, I just took it from somewhere else, and today these numbers would be far higher. Here we have memory bandwidth per CPU of something like 25 or 52 gigabytes per second, and then bidirectionally we would have 12 or 16 gigabytes per second over the UPI connections. And this is still true today: the direct memory connections have higher bandwidth than the UPI connections. So basically I cannot get the same kind of throughput
Starting point is 00:52:03 for remote memory; I cannot fully exploit far memory if I have to go through this interconnect fabric. And we see there are different setups. In the Nehalem, this would look like this, so we have this fully connected mesh; in the Sandy Bridge, we only have these four connections, so there I actually might have further hops. If it's not fully connected, this means that if this CPU wants to read something from that memory,
Starting point is 00:52:40 I cannot directly hop to this CPU; I have to hop here, then here, and then go here. And of course, the more CPUs, the more often these additional hops happen. This here is the architecture of Skylake, but it's still true for Sapphire Rapids. I don't know for Emerald Rapids, but it could be the same; Martin confirmed that for Sapphire Rapids this is still the setup that you have.
Starting point is 00:53:12 So here you can see, I mean, again, typically you would have this setup, like two. I have to go here so my pointer actually works. So we have these two nodes. These are directly connected. With four nodes or four CPUs, we have all connections. We are, again, fully connected.
Starting point is 00:53:33 With eight nodes, we're not fully connected anymore. With eight nodes, if we want to go, say, from this node to this node, then we have one, two hops. So we have an additional hop for some of the connections, which makes it slower. And you can see here, this is the difference.
Starting point is 00:54:00 In the Skylake, we would still have six DIMMs per node; in the Sapphire Rapids, we can have up to eight DIMMs per node. And for the interconnect in the Skylake, we have something like 40 gigabytes per second bidirectionally, while the DIMMs should give us roughly 40 to 50 gigabytes per second each. So this means, again, that if this node needs to fully read out that remote memory, it will definitely be bottlenecked by the interconnect.
Starting point is 00:54:46 It can read much faster from its own memory than from somewhere else. So this is kind of the important part here. Before we do a quick break, just one further note. This is basically how this developed: if we have older systems, fairly old systems, as I said, from the 90s and early 2000s, then the memory controller and the PCI chip,
Starting point is 00:55:20 this would actually be outside of the CPU, right? So this is where we have the north bridge. So separate, what's called chipset, basically, north bridge and south bridge. And the north bridge would be the one that's close to the memory. It's also closer to the CPU, just because of latency. So we don't have that long.
Starting point is 00:55:41 Like, the wiring is not that long. And the memory controller would be completely outside of the CPU itself. The CPU has caches inside, but the memory controller is really going outside. It's communication to the outside of the CPU to the motherboard. This eventually was changed in, I don't even know when exactly.
Starting point is 00:56:06 But eventually, the memory controller was moved inside the CPU package, in order to have lower latencies and higher throughput: we don't have to go out, we can do everything locally. The DIMMs are still outside, you know this, right? The DIMMs are separate chips
Starting point is 00:56:27 that you plug in on your motherboard. But now, in order to scale this further, because the core counts keep growing, we have a chiplet design. Within the CPU, within one single socket, we have multiple memory controllers and multiple PCI Express controllers that are interconnected with an internal fabric. So within the CPU we have so-called chiplets, which are basically a replicated design of something like an earlier single CPU socket. And this means we have a, what's it called, a sub-die NUMA design: within the die, we basically already have NUMA effects.
Starting point is 00:57:24 We could see NUMA effects, although there's a paper on this that I've linked here which basically says it's not that much; you don't see much. The design is fairly good: there are enough buffers, or enough headroom, within the chip fabric that you don't really see the NUMA effects within a single chip. But the larger they get, the more pronounced this will be. Okay, with this, I would say let's do a quick break here. But first, what are your questions?
Starting point is 00:58:04 I'm asking in an open way. Yes? Yes. Yes. So... is this happening here too? For example, the eighth core accesses the remote memory of the first one. Does it then copy the stuff from that DRAM to its own? MARTIN SPLITTMANN- So the question is,
Starting point is 00:58:34 do the caches still work the same way, basically, as if we were working on a single CPU? So what if the CPU down here requests something from up here? It still works the same way, so it is still cached. Otherwise, every individual access would need to go through this internal fabric again. I think that was clear. I was wondering if this also means that the remote memory is put into the local memory? So the way this works is that it should just go into the... Let's go back here.
Starting point is 00:59:16 So basically, it's not put... So the access goes through UPI, through the memory controller. It's not put in the memory here. It's basically put in the caches here. That's your question? Yeah, because I was wondering, at some point, that would change, right? Because at some point, the locality would be so beneficial that we would copy it over to the
Starting point is 00:59:37 MARTIN SPLITTMANN- So would the data be moved to another node? Well, the OS, basically the virtual address management and the paging, might do this. But if you're pinning the memory, if you're placing your data into that memory and it's actually mapped to that memory, nothing will change, and all of the accesses will always go through the internal network. More questions? No questions?
Starting point is 01:00:16 Then let's do a quick break here. Four minutes. So, since there's still a bit of stuff to cover, let's get started again. Here I basically have some numbers, again old ones; these are 10-year-old numbers. But some people actually tried this out, and while the numbers might shift a little bit,
Starting point is 01:00:45 this is still essentially true in terms of how these accesses work and how far the connections are. Here we have a Nehalem setup with four CPUs and eight cores each. And you can see that this is not fully connected; essentially,
Starting point is 01:01:12 we don't have these connections here, so we can have two hops for some of these paths. At first you might think it's not really different from a fully connected setup, but no, that's not true: here we actually can have two hops that we would not have if it were fully
Starting point is 01:01:39 connected. And what they did is they measured the performance for reading local versus remote memory. Since it's the Nehalem, we basically already had these numbers earlier: we can have 25 gigabytes per second to local memory and 12 gigabytes per second bidirectionally through the UPI connection. This means remote access will probably be a bit slower in terms of bandwidth and will have a higher latency as well, because we have these additional hops.
Starting point is 01:02:13 The direct connection gives us, more or less, exactly the numbers from above minus a bit of overhead: if we directly access the memory locally, we get up to 24.7 gigabytes per second with 12 threads. So we cannot get the full memory bandwidth with just a single thread; we need multiple threads to actually utilize all of the DIMMs in parallel. And we get a latency of 150 nanoseconds on this setup, or 340 CPU cycles. For the second flow, we're reading from socket 0, or rather socket 0 reads from socket 3. In this case, because it's older, it's not UPI but QPI, the QuickPath Interconnect.
Starting point is 01:03:13 So we're reading from there, and we can see we actually get a much lower bandwidth: we get 10.9 gigabytes per second, which is basically limited by this interconnect here. And our latency is increased. It's not that badly increased, right? It's not doubled or anything, but we get a latency higher by about 30 nanoseconds, so 185 instead of 150 nanoseconds.
Starting point is 01:03:43 Now, the worst case in this setup is basically two hops: we're reading memory from socket 1 into socket 0, and here we don't have a direct connection to socket 1, so we need to go through socket 2 or through socket 3. Which route is taken is actually up to the chip, or the motherboard, the internal setup. And here we can see the maximum aggregate bandwidth.
Starting point is 01:04:20 It's not lower; we're still at the same throughput. Even though we do multiple hops, the bandwidth is the same, but we have a considerably increased latency through this additional hop, because we need to do additional processing to get through the sockets. It's not just QuickPath here and QuickPath there, but QuickPath here, here, here, and here, and this is what adds on top. Our latency becomes 230 nanoseconds, or 520 CPU cycles, and this is already almost twice as much
Starting point is 01:05:06 as the local memory latency, right? So not only bandwidth-wise but also latency-wise we will definitely feel this in our algorithms. And if we add cross-traffic on top, in the last flow, we're doing the same two-hop read again while at the same time there is some other cross traffic. We see it doesn't change that much in terms of latency, but the additional cross traffic will take a toll
Starting point is 01:05:47 on the throughput, right? If the QuickPath interconnect here is the limit for the bandwidth and somebody else sends on top of it, then we'll just have to share the bandwidth. Say there's another flow that reads: we're reading from socket 1 to socket 0 and socket 2 reads from socket 3.
Starting point is 01:06:15 Then we use the same connections here, so we have some cross traffic somewhere on the path, and our bandwidth is basically half of the bandwidth that we had initially, because we're sharing the throughput. OK, so what are the effects of the NUMA hardware? Well, we can increase the total memory bandwidth of the system, while we cannot really increase the individual per-thread bandwidth, right?
Starting point is 01:06:54 Just by adding NUMA, per thread, we're not going to get more bandwidth simply by adding more memory across the whole motherboard, through multiple sockets, et cetera, because a single thread will usually not even be able to saturate everything on a single CPU. But all in all, if we're using the whole package, all of the CPUs, all of the sockets, et cetera, then
Starting point is 01:07:26 we can increase the bandwidth: we can utilize more DIMMs in total, and we can do this in parallel, so we don't have a central communication bottleneck, as we would have if we communicated over the Infinity Fabric or the UPI all the time. The problem is that the access time now depends on where the data is stored.
Starting point is 01:07:52 If we're unlucky and our data is always completely mixed up across the memory banks, then everything will be limited by these bottlenecks again: while we have more memory bandwidth in total, we cannot fully utilize it and we'll have higher latency. So we really need to make our software NUMA-aware to get good performance. And with a very large number of NUMA regions, all of a sudden the topology is not fully connected anymore; we had this example here with four CPUs where it's not fully connected.
Starting point is 01:08:31 Four CPUs we can actually fully connect, but eight CPUs typically, for example, are not fully connected anymore. And well, this means we have higher latency if they're not directly connected, and we might have cross traffic, et cetera. And modern CPUs, they actually, they have meshes again. So we basically have some kind of mesh set up and we're doing some routing and switching in between. So we want to differentiate like for our data access
Starting point is 01:09:03 or algorithms and data structures, we want to differentiate between local and remote memory. And local memory is faster and has higher bandwidth than remote memory. So we try to keep everything local. When we do synchronization, synchronization obviously is much faster in a local setup than if we have to do it across the UPI or infinity fabric again.
Starting point is 01:09:30 So again, synchronization means cache invalidation, which means accessing the other node, which means this communication. So in order to make this scale, we need to make sure that most of the synchronization stays local. We probably cannot guarantee that there is no communication whatsoever, but most of the communication should stay local.
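As a toy illustration of keeping synchronization local, here is a sketch of a counter that is sharded per NUMA node: every increment touches only a cache line that lives with the incrementing thread's node, and only the occasional read of the total pulls lines across the interconnect. This is my own example, not something from the lecture; the node lookup via sched_getcpu and the 64-byte shard padding are assumptions.

```cpp
// Sketch: NUMA-sharded counter. Increments stay node-local; only the (rare)
// total() has to pull cache lines across the interconnect. Requires C++17.
// Build with: g++ -O2 -pthread sharded_counter.cpp -lnuma
#include <numa.h>
#include <sched.h>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) NodeCounter {        // one shard per node, padded to a cache line (64 B assumed)
    std::atomic<long> value{0};
};

class ShardedCounter {
    std::vector<NodeCounter> shards_;
public:
    ShardedCounter() : shards_(numa_num_configured_nodes()) {}

    void increment() {
        // Which node is this thread on right now? (It may migrate; that is fine
        // for a statistics counter, the increment just lands on another shard.)
        int node = numa_node_of_cpu(sched_getcpu());
        shards_[node].value.fetch_add(1, std::memory_order_relaxed);
    }
    long total() const {                // touches every shard -> remote traffic, keep this rare
        long sum = 0;
        for (auto& s : shards_) sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
};

int main() {
    if (numa_available() < 0) return 1;
    ShardedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t)
        workers.emplace_back([&] { for (int i = 0; i < 1'000'000; ++i) counter.increment(); });
    for (auto& w : workers) w.join();
    std::printf("total = %ld\n", counter.total());
}
```

The trade-off is the usual one: increments scale because they stay local, while an exact total is expensive, which is acceptable for things like statistics counters.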
Starting point is 01:09:55 And well, we have the NUMA effects everywhere. And we cannot only do this on the data structure level or the algorithm level, we can also do this on the database level, right? So we can also partition the database in order to make sure that this is faster. For example, if we have an OLTP kind of setup, we can split up different warehouses or something like that, or different clients, and put them on different NUMA nodes. This will be faster than if we don't do this,
Starting point is 01:10:29 if we have cross-traffic everywhere. We can split up tables horizontally or even vertically, but horizontally often makes more sense, and then locally process the horizontal partitions. So remember this morsel kind of execution: if we have morsels, we can easily place them in different NUMA regions, work on them independently, and this will scale nicely. And if we're smart about it, if we know where the different partitions are, then we can also schedule our operators and our query execution in a way that it's local.
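A minimal sketch of that morsel idea could look like the following, assuming libnuma and a toy sum aggregation. It is not a real morsel-driven scheduler, just the pattern: place each node's morsels in that node's memory and have a worker pinned to that node consume only its local morsels.

```cpp
// Sketch: morsel-style, NUMA-local processing. Each node gets its own morsels
// (allocated on that node), and a worker pinned to that node scans only them.
// Build with: g++ -O2 -pthread morsels.cpp -lnuma
#include <numa.h>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct Morsel {                 // a fixed-size chunk of a column
    uint64_t* data;
    size_t    size;
};

int main() {
    if (numa_available() < 0) return 1;
    const int nodes = numa_num_configured_nodes();
    const size_t morsel_values   = 100'000;     // toy morsel size (illustrative tuning knob)
    const int    morsels_per_node = 16;

    // Partition the "table": place each node's morsels in that node's memory.
    std::vector<std::vector<Morsel>> partitions(nodes);
    for (int n = 0; n < nodes; ++n)
        for (int m = 0; m < morsels_per_node; ++m) {
            auto* buf = (uint64_t*)numa_alloc_onnode(morsel_values * sizeof(uint64_t), n);
            if (!buf) return 1;
            for (size_t i = 0; i < morsel_values; ++i) buf[i] = 1;   // touch and fill
            partitions[n].push_back({buf, morsel_values});
        }

    // One worker per node, pinned to that node, consuming only local morsels.
    std::atomic<uint64_t> global_sum{0};
    std::vector<std::thread> workers;
    for (int n = 0; n < nodes; ++n)
        workers.emplace_back([&, n] {
            numa_run_on_node(n);                // schedule this worker on CPUs of node n
            uint64_t local_sum = 0;
            for (const Morsel& m : partitions[n])
                for (size_t i = 0; i < m.size; ++i) local_sum += m.data[i];
            global_sum.fetch_add(local_sum);    // one remote write at the very end
        });
    for (auto& w : workers) w.join();
    std::printf("sum = %llu\n", (unsigned long long)global_sum.load());

    for (auto& part : partitions)
        for (auto& m : part) numa_free(m.data, m.size * sizeof(uint64_t));
}
```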
Starting point is 01:11:14 The database system typically tries to do all the management itself, or as much management as it can, so it will not rely on the OS for many things. We still have virtual memory, et cetera, but we don't really want to rely on paging and have the memory placed randomly. Instead we'll have our own buffer management and our own memory management, and with this we can also make sure that operations are local to the NUMA node that we're working on.
Starting point is 01:11:57 And with this, we come to partitioning and placement. If we're looking at data partitioning, so partitioning a relation, there are different ways we can partition it. We can do round-robin or random partitioning, so just splitting up the table into chunks and placing them randomly or one by one
Starting point is 01:12:28 across the different NUMA regions. We can do range partitioning, so splitting up the table into different key ranges and placing them in different parts. Or we can do hash partitioning, which again is more like a random partitioning where our hash function decides where the data is going. And the granularity defines the number of partitions that are created: with a coarser partitioning we have lower scheduling overhead, with a fine-grained partitioning we can do better load balancing. This, again, is a trade-off.
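As a small sketch, the three assignment functions could look like this; the key type, the partition count, and the hash function are illustrative assumptions, not a specific system's implementation.

```cpp
// Sketch: three ways to map a tuple to one of P partitions (and hence, via a
// placement scheme, to a NUMA region). Key type and hash are illustrative.
#include <cstdint>
#include <functional>

// Round-robin: ignore the key, cycle over partitions -> even load, not key-local.
inline uint32_t round_robin_partition(uint64_t row_id, uint32_t num_partitions) {
    return static_cast<uint32_t>(row_id % num_partitions);
}

// Range: each partition owns a contiguous key range -> good for range scans.
inline uint32_t range_partition(uint64_t key, uint64_t min_key, uint64_t max_key,
                                uint32_t num_partitions) {
    uint64_t width = (max_key - min_key) / num_partitions + 1;   // assumes max_key >= min_key
    return static_cast<uint32_t>((key - min_key) / width);
}

// Hash: deterministic but pseudo-random spread -> good for joins/group-by on the key.
inline uint32_t hash_partition(uint64_t key, uint32_t num_partitions) {
    return static_cast<uint32_t>(std::hash<uint64_t>{}(key) % num_partitions);
}
```

The granularity trade-off from above shows up in num_partitions: few large partitions mean little scheduling overhead, many small ones make load balancing easier.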
Starting point is 01:13:13 We talked about this earlier already. And we usually have some kind of placement scheme that tells the database management system where to put those partitions. This is true for a distributed system, and it is also true for a NUMA kind of system, where we say: OK, where are our different memory regions, and how do I distribute this? If I want a very even load, so we're utilizing all of the memory in the same way, I could use something like round-
Starting point is 01:13:52 robin partitioning, where I just place partitions alternating across all of the memory. Then all in all I will get the same kind of memory access speeds, because some of them are local and some of them are remote, so on average it will be roughly the same, but I'm not NUMA-aware. If I have some kind of range partitioning,
Starting point is 01:14:17 I might have larger blocks that are close together, and then I want some kind of scheduling that makes sure the operations run on the NUMA nodes where their data actually is. The operating system gives us some support for this. Linux can partition the memory into NUMA zones; each socket basically has its own NUMA zone. For more modern CPUs with a chiplet design, we can also expose this, so we have different NUMA zones within a single socket.
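A small sketch of how to ask libnuma what the machine looks like, roughly the information that numactl --hardware prints; the output format here is my own.

```cpp
// Sketch: inspect the NUMA topology that Linux exposes.
// Build with: g++ -O2 numa_topology.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) { std::puts("NUMA not available"); return 1; }

    std::printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    for (int node = 0; node <= numa_max_node(); ++node) {
        long free_bytes = 0;
        long size_bytes = numa_node_size(node, &free_bytes);   // total / free memory per node
        std::printf("node %d: %ld MiB total, %ld MiB free\n",
                    node, size_bytes >> 20, free_bytes >> 20);
    }

    // Distance matrix: relative access cost between nodes (10 means local).
    std::printf("distance matrix:\n");
    for (int a = 0; a <= numa_max_node(); ++a) {
        for (int b = 0; b <= numa_max_node(); ++b)
            std::printf("%3d ", numa_distance(a, b));
        std::printf("\n");
    }
}
```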
Starting point is 01:15:02 And then there are separate management data structures for each of the NUMA zones. We can say: OK, for this NUMA zone, please allocate the data here or there, even within a single CPU. And by default, Linux already tries to be smart about it. If I have a thread that's running in a certain NUMA zone and that thread wants to allocate some memory, Linux will try to allocate it within the same
Starting point is 01:15:33 NUMA region. So if this is configured, this already should typically give us local access, as long as memory is available there. Unless we're specifically binding the memory to a certain socket: there's an mbind system call where we can say, well, this memory I definitely want to have in this memory region. And this might be useful if I have some kind of memory allocation thread that's just running on one
Starting point is 01:16:09 of the cores and is just doing some memory management work. Then I don't really want to have the data necessarily all close to that thread; I want to have it allocated where I will actually use it later, or I have some other ideas about how I will use the data later on.
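A sketch of such explicit placement, assuming libnuma on Linux: once with the numa_alloc_onnode convenience allocator and once with mbind on an mmap'ed region, as a custom buffer manager might do it. The target node number is an assumption.

```cpp
// Sketch: explicitly place memory on a chosen NUMA node, independent of which
// thread allocates or first touches it.
// Build with: g++ -O2 explicit_placement.cpp -lnuma
#include <numa.h>
#include <numaif.h>      // mbind
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) return 1;
    const int target_node = 1;                     // assumption: we want the data on node 1
    const size_t size = 64ul << 20;                // 64 MiB

    // Variant 1: libnuma convenience allocator.
    void* a = numa_alloc_onnode(size, target_node);
    if (!a) return 1;
    std::memset(a, 0, size);                       // touching backs it with pages on node 1

    // Variant 2: mmap + mbind, e.g. inside a custom buffer manager.
    void* b = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED) return 1;
    unsigned long nodemask = 1ul << target_node;   // one bit per node
    if (mbind(b, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        std::perror("mbind");
    std::memset(b, 0, size);                       // pages now come from node 1

    numa_free(a, size);
    munmap(b, size);
}
```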
Starting point is 01:16:43 So what happens if a thread calls malloc? If the allocator doesn't already have a suitable chunk of memory, then basically nothing will happen, or almost nothing will happen: the process's data segment will just be extended. But that's only virtual memory.
Starting point is 01:17:14 So this is not immediately backed by physical memory. Again, this is the idea of virtual memory: a process can get way more memory than is actually available in the system, and the sum across all processes can also be way more than is actually available. Some of this might just be backed by disk, or, if I'm allocating a lot of memory and never using it, it's just there but not backed at all. So sometimes you can see that:
Starting point is 01:17:49 if you look at your processes, you will see lots and lots of memory allocated, which is much more than the physical memory you actually have in your system. Essentially, the OS will only back this memory with physical pages as soon as we're trying to use it, so when there's a page fault: as soon as we touch a page, the physical memory will be allocated. And in a NUMA system, if we have NUMA set up, the OS
Starting point is 01:18:28 will try to place it physically close to the faulting thread, on the same node. But we could also say we don't want that: we could have an interleaving setup instead. So we can have a UMA-like setup, where the allocation, so each of the pages, is interleaved across the different NUMA zones, NUMA regions. And again, this is good if we want
Starting point is 01:18:57 to have the same kind of access from all of the nodes. So if our software is not NUMA-aware and we just want roughly equal access for all of the processes or all of the threads, then we're going to use this interleaved setup: the data is interleaved, we get good aggregate bandwidth because we're using all of the DIMMs, but we have these higher latencies. Then there's also the approach that's called first touch: whenever we have a page fault, we allocate the page in memory that's close to the faulting thread, in the same NUMA region.
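A small sketch contrasting the default first-touch behavior with explicit interleaving, assuming libnuma on Linux; the move_pages query at the end just reports which node a page actually ended up on. Pinning to node 0 is an assumption made so that first touch lands predictably.

```cpp
// Sketch: default first-touch allocation vs. explicit interleaving.
// move_pages with nodes == nullptr reports where each page actually resides.
// Build with: g++ -O2 first_touch.cpp -lnuma
#include <numa.h>
#include <numaif.h>      // move_pages
#include <cstdio>
#include <cstdlib>
#include <cstring>

static void report_node_of_first_page(const char* label, void* buf) {
    void* pages[1]  = {buf};
    int   status[1] = {-1};
    if (move_pages(0, 1, pages, nullptr, status, 0) == 0)   // query-only mode
        std::printf("%s: first page is on node %d\n", label, status[0]);
}

int main() {
    if (numa_available() < 0) return 1;
    numa_run_on_node(0);                           // assumption: node 0 exists; pin there
    const size_t size = 16ul << 20;                // 16 MiB

    // First touch: malloc gives only virtual memory; the page fault during memset
    // backs it with physical pages, preferably on the node of the touching thread.
    void* ft = std::aligned_alloc(4096, size);
    std::memset(ft, 0, size);
    report_node_of_first_page("first-touch ", ft);

    // Interleaved: pages are spread round-robin over all nodes, UMA-style.
    void* il = numa_alloc_interleaved(size);
    std::memset(il, 0, size);
    report_node_of_first_page("interleaved ", il);

    std::free(ft);
    numa_free(il, size);
}
```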
Starting point is 01:20:07 And otherwise, if we don't want the OS to decide this, but want to make sure that the memory is allocated in the NUMA regions that fit how our database management system is set up, then we can use numa_alloc and control the NUMA regions ourselves. And just as an example, this is again an older paper, about 10 years old, using an SGI machine. We actually still have one of these nodes here. And, as Martin just corrected me, this is a machine that has 8 CPUs in this kind of hypercube setup, or blades, in a NUMA setup. These machines also have a lot of RAM. If we put all of the data into a single node's memory here, then we can get close to 200 gigabytes per second
Starting point is 01:21:29 in this setup. If we are interleaving this across all these nodes, and here this means within a single blade, so you can see this is the setup they were using: there are basically four blades, and each blade has eight nodes. These are connected with an interconnect called HARP, which is again similar to Infinity Fabric or UPI, basically a crossbar, but across more nodes here. So with just local RAM, they get about 200 gigabytes per second. If they do interleaved access across the whole system, they get 270, which is not
Starting point is 01:22:17 that much more, right? But if they do it NUMA-optimized, so basically partition the data in a way that everything is accessed locally, they can get 2000 gigabytes per second, so 10 times the throughput that they would have with the local setup, as an example of NUMA-aware scheduling. And with that, I actually wanted to briefly talk about the join, but there's not that much to it: the sort-merge join you already know how it works, the only idea here is that we're making it more NUMA-aware. So you remember this multi-way sort-merge join where we have kind of a local sort and then a merge.
Starting point is 01:23:11 Because if we're doing this merge, then we're out of our NUMA region, right? Locally we have these accesses, but the merge accesses will basically go to different nodes. So in order to make this more NUMA-aware, what we can do is add an extra step, so we have multiple merge steps, reading in the same kind of buffered setup that we had in the SIMD merge. Then we can be more NUMA-aware, can be faster, and can better utilize the throughput.
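The following is only a sketch of the underlying pattern, not the multi-way sort-merge join from the paper: phase one sorts one run per node with a thread pinned to that node, so the sort is purely local, and phase two merges the sorted runs, which is where the remote reads happen. The real algorithm replaces the single merge pass with multiple buffered, SIMD-friendly merge steps; run size and key generation here are toy assumptions.

```cpp
// Sketch of the NUMA-aware sort-merge pattern: local per-node sort, then merge.
// Build with: g++ -O2 -pthread numa_sort_merge.cpp -lnuma
#include <numa.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

int main() {
    if (numa_available() < 0) return 1;
    const int nodes = numa_num_configured_nodes();
    const size_t run_size = 1 << 20;               // 1M keys per node (toy size)

    // Phase 0: one run per node, placed in that node's memory.
    std::vector<uint64_t*> runs(nodes);
    for (int n = 0; n < nodes; ++n) {
        runs[n] = (uint64_t*)numa_alloc_onnode(run_size * sizeof(uint64_t), n);
        if (!runs[n]) return 1;
    }

    // Phase 1: fill and sort each run with a thread pinned to the run's node.
    std::vector<std::thread> sorters;
    for (int n = 0; n < nodes; ++n)
        sorters.emplace_back([&, n] {
            numa_run_on_node(n);
            std::mt19937_64 rng(n);
            for (size_t i = 0; i < run_size; ++i) runs[n][i] = rng();
            std::sort(runs[n], runs[n] + run_size);        // purely local work
        });
    for (auto& t : sorters) t.join();

    // Phase 2: merge the sorted runs one by one (this is where remote reads happen).
    std::vector<uint64_t> merged;
    for (int n = 0; n < nodes; ++n) {
        std::vector<uint64_t> next(merged.size() + run_size);
        std::merge(merged.begin(), merged.end(), runs[n], runs[n] + run_size, next.begin());
        merged.swap(next);
    }
    std::printf("merged %zu sorted keys\n", merged.size());

    for (int n = 0; n < nodes; ++n) numa_free(runs[n], run_size * sizeof(uint64_t));
}
```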
Starting point is 01:23:57 And I would say, check the rest out yourself; I'm not going to cover it. Next week, Martin will cover storage with you. This part is not super complex; you can basically see it in the slides or read the paper. So with this, that was the quick overview of non-uniform memory access, more on a high level, so that you get an understanding of how it works, where data is placed, how you can actually place it, and that you should be aware of how the interconnects work in order to be efficient again. Do we have questions?
Starting point is 01:24:42 What are your questions? No questions. If we have no questions, then next time I'm not going to be here; Martin will tell you about storage, and that is going to go until Tuesday-ish, maybe even longer, we'll see, you will figure it out. Then Marcel will talk about CXL, then we'll have the next talk, and then I'm going to be back and talk about networking. And then we're almost done already. OK, with that, thank you very much.
