Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Non-uniform Memory Access
Episode Date: June 14, 2023...
Transcript
I guess we can get started.
So, welcome everybody to our next session.
I actually like this picture. It's also quite personal.
Okay, today, once we've finished off the locking part, we're going to talk about non-uniform memory access.
And this concludes the part about the CPU, basically. We'll then continue
with persistent memory and then finally go into the last part which is going to be about
the periphery. So I'm pretty sure I will finish this off today. Next, you're going to have task 3 on the buffer and locking, then we continue
as planned, basically.
So before we go there, we said let's finish off locking.
And just as a reminder, there are multiple different ways of doing locking.
We can do coarse-grained locking, but that's not for us, or maybe also not for you here in this course, because that's too coarse.
And we talked about the locking in terms of data structures.
So how can I implement locking for data structures?
There's different ways to implement locks, and then there's different strategies in how we use those locks.
And of course we want this fine grain, so we want to basically lock individual small parts of the data structures,
and small enough parts. This also means, as in the paper example, we cannot just go and lock a single node in the linked list, because that won't work.
Because changes might affect more than just a single node.
In the linked list it's kind of clear: we have the predecessor and the successor.
In a tree this might even be multiple nodes.
So again, we have to think about how to do this in a smart way to not lock too much, but also not to lock too little.
And then there's different strategies. We can do pessimistic locking, meaning we're always taking locks when we're accessing the nodes, which basically makes sure that nothing can happen concurrently.
Or we can be more optimistic about it and say, okay, we're just going to try working on this, and this is mainly for reading.
So we just try to read without really locking, but then, if something went wrong, we have to restart.
And often we want to combine this in one way or the other, meaning that if we have too many restarts, we fall back to regular locking.
And then we also saw another approach, lazy synchronization, meaning that rather than doing the actual operation right away, we can start by just marking the node.
And we know this from other data structures, so it's basically something like a tombstone. We can say, well, this node is deleted, but it's still there, so we can still traverse it.
And this is necessary as soon as we're in a concurrent setup where multiple threads might still reach the deleted node after we've already deleted it.
Because the node is marked, the other threads that reach it will see this.
And finally, and we'll come to this also in a bit, there's the option to try to do everything lock-free, meaning rather than holding actual locks, we're using these compare-and-swap operations.
So unlike with locks, the threads never really get blocked or descheduled.
We're not using regular locks, we're just using these compare-and-swap operations, so we're not really blocking any operations, but if we see it didn't really work, the versions don't check out, then we have to redo stuff.
And this often means we have to re-implement everything for this to work.
So it's an option, but it's not the best option in all cases.
And with this, I basically showed you lock coupling, I showed you optimistic locking,
I showed you lazy synchronization, and now we're going to go to optimistic lock coupling,
which kind of combines the previous ones.
I have another question for the optimistic locking. Is this a single linked list?
This is a single linked list.
If it's singly linked, why are we locking the successor? Because we're not changing anything in the successor, right?
You mean why we're also locking the successor for the insertion?
Yeah, for the 47 we're not really touching any properties of 47 if we're just setting it as the successor, right?
So...
For 42 we say the next node is 46, and then in 46 we say the next node is 47, but for the 47 we don't really change anything...
I think the problem might be... I mean, otherwise the 47 could be deleted concurrently, but maybe that shouldn't be a problem.
If you don't have a lock on...
I would have to check, to be honest.
Potentially it could work without locking the successor, but I'd have to check.
It could be deleted, but then it's the same thing as if we have...
Yeah, so then we need, if we want to delete it, then we also need to lock for the 42.
If we want to add something else, we would have the same problem.
If we traverse, if we read...
I'll have to check. I didn't check that, so that's a good question.
Okay, I'll have to check. There might be a combination of adding and removing that causes problems.
I mean, for the removal it's clear, for the addition I'm not sure. So, optimistic lock coupling: we saw for optimistic locking and other cases that if we restart, we re-read all the way from the beginning, because we don't know if something happened.
Also, in the validation steps we go back to the beginning and check if the nodes that we changed are actually still valid.
And with optimistic lock coupling, the idea is that we try to avoid this, right?
Because this is really expensive, going through the list again.
And we can avoid this by having an update counter.
So basically, you get sort of a version ID for the nodes. Rather than associating the lock with just a single bit, we're associating it with a counter.
And with that we can then check: is the version still the same as what we looked at?
So for the validation we don't really have to check the whole list again, we can just check if something changed right in front of us.
So if we're writing with optimistic lock coupling, we're basically acquiring the lock.
Again, we're acquiring two locks, and whenever we're changing something, we're incrementing the counter when unlocking.
And if we are not changing anything, like while traversing the linked list, we don't need to acquire anything.
The same thing basically also works for trees.
Is it an atomic increment? Because otherwise, with two threads, we could have concurrent unlocking issues.
Sorry, yes, it's an atomic operation, so the counter basically needs to be incremented atomically.
So then, if we read, we again don't acquire locks, we just proceed optimistically, unlike regular lock coupling where we would basically traverse using locks.
And then we can detect any kind of modifications through the counters.
So basically while we're reading, we're starting with reading the counter,
then we're doing our read operation, we're basically checking the whole node,
and then we're going to read the counter again, and if something changed,
then we know that something happened, right? So our node is not good anymore.
Then we have to basically restart.
And in order to make sure that we don't need these locks like we do in lock coupling — in lock coupling we always hold two locks so that there cannot be any modification while we're traversing — in order to get the same kind of guarantee, we interleave the version validations across multiple nodes.
So basically saying, okay, I validate the current node, then I'm going to the next node.
And whenever we see that the versions changed, then we know, okay, something happened, right?
So there's a concurrent operation.
So then we have to go back.
So if somebody changed the linked list at the point where we're currently reading, then we have to go back and traverse again.
And the nice thing, if you implement this, is that it's scalable, because you're not acquiring locks while you're reading.
The bad thing is that you need restarts for reading, basically.
If you're reading somewhere and there's a concurrent modification, then the read needs to be restarted.
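To make the read protocol concrete, here is a minimal C++ sketch of a version-validated lookup over a singly linked list. This is not the lecture's code: the node layout, the use of the version word's low bit as the write lock, and all names are assumptions for illustration, and the plain field reads would need more care in a production implementation.

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    int key;
    int value;
    std::atomic<uint64_t> version{0};   // low bit = "write locked", remaining bits = update counter
    std::atomic<Node*>    next{nullptr};
};

// Wait until no writer holds the node, then return the current (even) version.
static uint64_t stable_version(const Node* n) {
    uint64_t v;
    do { v = n->version.load(std::memory_order_acquire); } while (v & 1);
    return v;
}

// True if the node was not modified since we read version v.
static bool validate(const Node* n, uint64_t v) {
    return n->version.load(std::memory_order_acquire) == v;
}

enum class Read { Found, NotFound, Restart };

// Lock-free on the read path: we only read versions, never acquire locks.
// Validation of the current node is interleaved with stepping to the next one.
Read try_lookup(Node* head, int key, int& out) {
    Node*    node = head;                 // assumed non-null sentinel head
    uint64_t v    = stable_version(node);
    while (node != nullptr) {
        int   k    = node->key;           // optimistic reads; a writer may race with us
        int   val  = node->value;
        Node* next = node->next.load(std::memory_order_acquire);
        if (!validate(node, v)) return Read::Restart;   // concurrent writer: restart
        if (k == key) { out = val; return Read::Found; }
        if (next == nullptr) return Read::NotFound;
        uint64_t vn = stable_version(next);
        if (!validate(node, v)) return Read::Restart;   // re-check before handing over
        node = next;
        v    = vn;
    }
    return Read::NotFound;
}

// Caller restarts on conflicts (optionally falling back to pessimistic
// locking after too many restarts, as mentioned above).
int lookup(Node* head, int key, int not_found = -1) {
    int out;
    for (;;) {
        switch (try_lookup(head, key, out)) {
            case Read::Found:    return out;
            case Read::NotFound: return not_found;
            case Read::Restart:  continue;
        }
    }
}
```

A writer in this sketch would set the low bit to lock the node, apply its change, and then store the old version plus two, which unlocks and bumps the counter in one step.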
Okay, so let's look at what this looks like in performance.
This is not for a linked list; it's for the ART, the adaptive radix tree, the data structure that we know by now.
And here, we have no synchronization, meaning that the software will just do what it does; it's not necessarily consistent anymore.
We have lock coupling, the pessimistic lock coupling as I initially explained it, the optimistic lock coupling, and ROWEX, read-optimized write exclusion, which is a combination of optimistic locking and lock-free techniques.
Then hardware transactional memory, I'll briefly touch on this later, and then another data structure that's built around, I think, a latch-free implementation as well.
And what you can see: lock coupling basically doesn't really scale well. So we see, over the number of threads, for lookup, insert, and remove operations, that lock coupling basically doesn't really scale.
The threads will contend in all of the different operations in the same way, because in all of the different operations we do the same kind of locking.
So this is as we would expect. Then we have optimistic lock coupling.
You can see for lookup it scales quite nicely. For insert it scales well, and for removal, well, there we have even a bit more work, so it doesn't scale as well, but it still scales.
And it scales similarly to ROWEX; ROWEX is another approach, I'll give a few more details in a bit.
But you can see in general it scales very well; it's always close to the other scalable ways of locking.
And you can see a kind of qualitative comparison in terms of scalability and ease of use: lock-free, or latch-free, implementations typically are very scalable, but they're really hard to implement.
Fine-grained locking is very simple to implement, but it's not scalable.
And optimistic lock coupling is sort of in the middle: it's quite easy to use and it's quite scalable.
Another option is hardware transactional memory; that's basically hardware support. It's also easy to use, but it doesn't exist everywhere, in all systems.
And we'll briefly touch on this in a few minutes.
Okay, going from this to lock-free data structures.
And for lock-free data structures — and this is something that you will implement in the last task — you're basically just using compare-and-swap, or compare-and-set, operations.
The idea is that you do your operation, say a read, and then you check the value: if it's still the value you expect, you update it; if it's not, you report a failure.
So in a compare-and-swap, you compare against the expected value; if it matches, the swap happens; if not, the operation reports the failure.
And the compare-and-swap is an atomic operation. That basically means you can change something in your data structure atomically, so nobody else can do this at the same time.
And this means you can use it for locks, but you can also use it directly in your data structure.
Say, for example, for a counter: incrementing a counter with that.
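As a small illustration (assuming C++ std::atomic, not any particular codebase), incrementing a shared counter with compare-and-swap looks like this:

```cpp
#include <atomic>
#include <cstdint>

// Increment a shared counter without a lock: read the current value,
// then try to swap it for value + 1; if another thread got in between,
// the CAS fails, `expected` is refreshed, and we simply retry.
void increment(std::atomic<uint64_t>& counter) {
    uint64_t expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // `expected` now holds the value another thread wrote; retry with it.
    }
}
```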
And say, for example, for deletion in the linked list, as we did it before — so the same kind of setup — rather than having a single marker, you would add two markers in this case.
If you don't want to lock the nodes, you need to make sure that the node is marked as deleted, but also that the pointer is marked as invalid. And this is kind of the tricky part, right?
Because this pointer might still point somewhere, or other threads might still traverse through it. Or, let's put it differently: if this is not marked, another thread might change something on this node, and then we're going to get an inconsistent list again.
So what we need to do is create something that gives us the same safe environment as lock coupling, but just with markers that we can change with compare-and-swap.
So we need to have this delete marker in here. As soon as we have the delete marker, we can basically do a lazy setup, right? We can do this lazy synchronization.
We can just say, okay, this is deleted. Fine. Then we have some time to do the rest.
But then, in order to actually update everything, we need to ensure that nobody else will change this node at the same time.
More precisely, at least the pointer here cannot change. The value potentially could change, but this pointer cannot change.
And for that we need the additional marker. The further details you will actually get with the task.
It's not going to be a very simple linked list, but probably a skip list.
There's a paper for this, and there are many different ways how you can implement it:
whether you put the marker first, what kind of markers and what kind of operations you use, whether you use test-and-set or compare-and-swap, etc.
We'll give you one way.
We're still debating which one is kind of the easiest and at the same time scalable.
And then it also matters how exactly you place the markers and how you interleave everything.
There are lots of different ways of doing this.
Some are more scalable, some are less scalable, depending on how many operations, how many steps, you need in order to actually perform the update.
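For a rough idea of what such a marker-based deletion can look like, here is a sketch in the style of Harris's lock-free list, where the logical-delete marker is folded into the low bit of the next pointer. The variant you will get in the task may use separate markers and a different step order; all names here are illustrative, and memory reclamation is omitted.

```cpp
#include <atomic>
#include <cstdint>

// Node whose next pointer carries a "deleted" mark in its low bit.
struct LFNode {
    int key;
    std::atomic<uintptr_t> next{0};
};

static LFNode*   ptr_of(uintptr_t p)    { return reinterpret_cast<LFNode*>(p & ~uintptr_t(1)); }
static bool      is_marked(uintptr_t p) { return (p & 1) != 0; }
static uintptr_t as_word(LFNode* n)     { return reinterpret_cast<uintptr_t>(n); }

// Delete `victim`, whose current predecessor is `pred`, without locks.
// Step 1 (logical delete): CAS the mark bit into victim->next, so its successor
//         can no longer change and concurrent updates behind it will fail.
// Step 2 (physical unlink): CAS pred->next from victim to victim's successor.
bool remove_after(LFNode* pred, LFNode* victim) {
    uintptr_t succ = victim->next.load(std::memory_order_acquire);
    do {
        if (is_marked(succ)) return false;          // someone else already deleted it
    } while (!victim->next.compare_exchange_weak(succ, succ | 1,
                                                 std::memory_order_acq_rel));
    // `succ` now holds the unmarked old successor word.
    uintptr_t expected = as_word(victim);
    pred->next.compare_exchange_strong(expected, succ,
                                       std::memory_order_acq_rel);
    // If this CAS fails, pred itself was changed or marked concurrently;
    // the node stays logically deleted and a later traversal unlinks it.
    // (Safe memory reclamation, e.g. hazard pointers or epochs, is omitted.)
    return true;
}
```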
Okay.
So the final note on this: locking, or latching, and latch-free are not really exclusive, right?
I mean, as you already saw with the optimistic lock coupling.
So we don't need to say, oh, we're just doing optimistic locking or we're just doing latch-free or something like that. We can combine this.
Actually, using locks or being lock-free is a property of a method rather than of a data structure.
So I can do reading lock-free and updating with locks; there are different ways of implementing it.
I just need to make sure that everything fits together. This means there is an opportunity to use different methods, different kinds of locking schemes, in the same data structure.
And one example would be this ROWEX that I had on the previous slide.
That's Read-Optimized Write EXclusion. There we basically have lock-free reading for the contains method, and we're using locks for add and delete.
The idea is: well, if you have lots and lots of reads, you don't want the reads to be blocked by anything. So you always want the reads to go through.
And here you even optimize this in a way that there's also no restart for the reads.
That means, well, then for add and delete you need to do more work.
Because if somebody can read the nodes at any time, you need some kind of atomic way of changing them, and also an atomic way of reading the fields.
So that again means you have to adapt the data structures for this to work. Unlike some of the other approaches where you can say, well, for the optimistic read operation, for example, we can basically fall back to pessimistic locking by just acquiring the lock in a different way.
And another way where we can basically fall back is hardware transactional memory. Current CPUs have this hardware transactional memory idea, which is hardware support for transactions, based on the cache coherency protocol.
This basically uses L1 as a buffer, and that also means L1 is kind of the maximum transaction size.
And the idea is: you basically implement everything as you would implement it with a lock.
So you say, oh, I have my lock here, and I'm locking something and do my operation.
And the hardware takes care of this lock for you.
The hardware basically checks if there actually is a conflict, and if there is no conflict, the hardware doesn't use a lock.
This is done using the cache coherency protocol: if you can see that this cache line is only in one core, then there cannot be any kind of conflict.
So if only one thread is working on a cache line, on a certain data item, no other thread accesses it, so we don't have a problem.
And we always know this, at least to some degree, because of the cache coherency protocol.
The caches constantly communicate with each other, checking if something is in another cache or not.
And if something is in multiple caches, then we know there might be a conflict.
Then we take the lock. If there is no conflict, the lock is elided, so we're not going to use it.
We basically just ignore the lock and continue without it.
Because of the cache-line granularity, there can be false conflicts: conflicts that are not real conflicts but come from false sharing.
So we can have a conflict although the threads are actually accessing different data items that just happen to be on the same cache line.
So the way it works: we're always trying to optimistically execute things.
We just touch whatever we have in our transaction.
Here it's the classic ATM example, but it can also be: I'm changing a node in my index, right?
It needs to fit in L1. If it doesn't fit in L1, it won't work, because L1 is what we're working with.
If the transaction is too large, it's very likely that we'll have conflicts because of the cache granularity.
If it's small, and we're not touching the same cache line with different threads, then the hardware will basically say: oh, I know this cache line is not used anywhere else, and let all of the threads execute in parallel, or at least the ones that don't conflict.
If the validation in the hardware doesn't work out, we're basically going to take a real lock here.
And then we of course pay extra; we basically pay whatever a lock would cost us here.
And this is basically something that current hardware supports; current Intel CPUs have hardware transactional memory, but the tricky part is that the transactions need to be small.
Meaning also the nodes, for example, what you would touch in the data structure, need to be small.
There cannot be too much data in L1, because otherwise, as soon as we're touching too much data, we will always get false conflicts and then always fall back to the actual lock.
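As a hedged illustration of the programming model (assuming an Intel CPU with TSX/RTM and the immintrin.h intrinsics; the retry count and names are arbitrary), lock elision with a fallback lock could look like this:

```cpp
#include <immintrin.h>   // Intel RTM intrinsics; compile with -mrtm on a TSX-capable CPU
#include <atomic>

std::atomic<bool> fallback_locked{false};   // simple spinlock used as the real fallback lock
long balance = 0;                           // shared state, e.g. the account balance

static void lock_fallback()   { while (fallback_locked.exchange(true, std::memory_order_acquire)) {} }
static void unlock_fallback() { fallback_locked.store(false, std::memory_order_release); }

// Lock elision: run the critical section as a hardware transaction; only if it
// keeps aborting (real conflict, false sharing, working set larger than L1, ...)
// do we take the real lock.
void deposit(long amount) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Reading the lock flag puts it into our read set: if another thread
            // takes the fallback lock, our transaction aborts automatically.
            if (fallback_locked.load(std::memory_order_relaxed)) _xabort(0x01);
            balance += amount;
            _xend();                        // commit; no lock was ever taken
            return;
        }
        // Transaction aborted: fall through and retry, or eventually take the lock.
    }
    lock_fallback();                        // pessimistic path: pay the full lock cost
    balance += amount;
    unlock_fallback();
}
```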
Okay, so something to keep in mind, basically.
Okay, so that's what we talked about in this part: yesterday we talked about many different kinds of locks, then lock coupling and lock-free data structures, and today optimistic lock coupling. And I'll still look into the question for optimistic locking, why we need to lock two nodes there.
And now we're going to switch, unless there are questions so far.
We're going to switch to NUMA.
No questions?
Okay, good
Okay, so we are at NUMA, non-uniform memory access.
Well, let me go back here: so far we've been looking at this setup here.
Even for one CPU, I've touched a little bit yesterday already on the concept that within one CPU we might have different areas, where some chips or cores are closer to one part of memory than to another.
As soon as we have multiple CPUs, so multiple sockets, in modern architectures this will always be the case.
So essentially, each CPU is connected to some part of the RAM.
And the other CPU again will be connected to its RAM, and we can still have this big address space where we can address all of the memory in a single process, but it will be in different regions.
We don't necessarily see this — we can influence it — but in general, if we don't take care of it, some of the memory will be faster than other memory, because it's close to the core that we're currently running on, or far from it.
And this is what we're going to discuss here today.
In a database in general, if we have a distributed system, we want to make sure that everything is close to where it's processed, close to the node that has the data on its local drive or something like that, right?
And the same is true if we have multiple CPUs: we want to make sure that the data that we're working on is in memory close to the CPU that we're currently running on.
And a database system will always try to be aware of this,
and will try to schedule the tasks in a way
such that the data is local to the current thread.
And then there's different ways of how we can do this.
Or, I mean, we can also be agnostic about this.
So if we have a classical system, or if we're not thinking about this, then we have uniform memory access.
We're basically treating all memory as the same.
But if we want to be faster, we need to differentiate.
And today, practically all the servers have NUMA.
I mean, even if you have a single socket, if you have enough cores, they will be split
up into different parts and basically have their own memory channels.
So practically this would look like this, right? If you have an old server, probably 15, 20 years old, something like this with multiple sockets, then this would look like this.
We have multiple CPUs. We have a front-side bus, which is connected to a memory controller, and the DRAM again is connected to that memory controller.
And this is nice because we have one large memory region, we have one large bus, and all of the CPUs basically have the same access to all of the memory.
So it's quite easy.
But you can already see there's a bottleneck here, right?
So this basically means all of the communication needs to go through here.
So it doesn't really scale that well, because everything is connected to the same bus and all of the communication is on the same bus. So, especially if we have more CPUs, we want to split this up.
Although one has to say, four CPUs is already a lot; even today hardly any servers have more than eight CPUs.
So that's kind of what you get. But we want to somehow split this up.
If we have this kind of setup, this uniform memory access, everything needs to go through the bus, the bandwidth demand on the bus increases, and at a certain point it doesn't really scale.
Because of this, today we have non-uniform memory access.
Because the shared bus didn't scale anymore, each CPU has its own memory controller, and through this memory controller each CPU gets its own local memory.
And then all of the memory is, ideally, evenly distributed across the different CPUs, but we still have one single virtual address space.
So we can still touch all of the memory in a single address space and just use it as one big memory.
This means your application, even if you wrote it for the old kind of machine, will still run in the same way on a NUMA machine.
However, it won't be as fast as if you know that this is NUMA hardware.
As I said, most servers today have multiple CPUs, multiple sockets, and typically the setup would look something like this.
I mean, there aren't that many with a lot of sockets; I don't know how many servers we have with more than two CPUs — not that many, actually — but many of them have two CPUs.
And this would usually look something like this: each CPU has its own memory channels, its own memory controller, and then there's an interconnect network between the different CPUs.
And then we can have different numbers of CPUs. If we have more than two, say four sockets for example, then again it depends on how they are actually connected with each other.
So we could have just direct connections between neighbors, so just four connections between four CPUs, or we can have all connections between the four CPUs, so six connections.
Then every CPU can communicate with every other CPU directly, which would mean in any case I only ever have one hop. In the other case, if I want to go from here to that memory, I basically have to do two hops in order to get to the memory.
Each of the CPUs in such a socket has between 4 and 40 cores for Intel; AMD actually has up to 128 cores, and these 128 cores in AMD are then again split up into multiple separate tiles.
And these again will have non-uniform memory access internally.
So you're basically looking at hierarchies of networks.
And each socket will then have a few DIMM modules, currently up to eight DIMM modules per socket, and then all of these sockets are connected with an interconnect network.
That's the CPU-to-CPU interconnect; this would be Ultra Path Interconnect for Intel and Infinity Fabric for AMD, for example. It's kind of a proprietary internal network that lets the CPUs communicate with each other.
And you can see that a real socket again looks different.
Of course, it's not 3D; it's laid out flat on the motherboard.
Okay, so if we have this kind of setup — this would be such a four-socket non-uniform memory access setup with three RAM DIMMs per CPU. Each socket basically has multiple cores, for us it's two cores here, with private L1 and private L2 and a shared L3 cache. We have the integrated memory controller, and we have the UPI, so this is an Intel processor here.
And we can have a local access. A local access means: if this core, for example, wants to access some memory here, we go to our own integrated memory controller and directly access the data.
But we can also have a non-uniform memory access, or remote access.
Remote meaning just another socket, right?
So then this means we actually go through here:
we go through the UPI network, we take the shortest connection, meaning one hop,
and we go through UPI to the other integrated memory controller and access the memory there.
And you can already see, just visually, that this is a longer distance than directly going there.
And what is practically more costly is that we have this UPI in between.
And this UPI interface, this inter-socket interface, is again bandwidth limited, and it adds additional latency as well.
So what we have here would be something like up to 64 GB per second per DIMM on the memory side, and the interconnect will be, I don't know the exact numbers, 200 gigabytes per second, something like that.
With UPI 2 it's about 20 gigabytes per second per direction per lane, but there are multiple lanes between the sockets.
It's also vendor and model specific, I don't know exactly.
I would say you can get something between 100 and 200 gigabytes per second here, depending on the configuration; the cross-socket numbers are something we could measure, maybe.
And the newer ones should be even higher. Let me see if I have some concrete numbers here.
So this is older stuff, right? This is not what we would see today.
This is a Nehalem processor, this is Sandy Bridge. Today we're at Sapphire Rapids.
So there we would not have just 51 gigabytes per second; we would have up to 8 DIMM channels, each of them 40 up to 48 GB per second currently, so here, per CPU, up to 384 GB per second. And the interconnect will have less.
I don't know the exact number, but it will be less here.
And then it really depends on how these are connected with each other.
And again, in this example here you can see Skylake.
Skylake has, let me see, per link it says 41 GB per second bidirectional, meaning about 20 GB per second per direction, per link.
And I don't know how many links there are right now, to be honest.
So this is Skylake; Skylake was before Cascade Lake, then Ice Lake, and now Sapphire Rapids. So basically three generations back.
And here you can see how they're connected; the connections will still look the same today, the numbers are just slightly different.
So here you can see we have up to six memory modules per socket, we have UPI in between, and if we have two nodes, there will be multiple connections between them.
If we have four nodes, they're basically fully connected. If we have eight nodes, they're not fully connected anymore, right?
So there are multiple connections per node, three per node basically, but, say, this node and this node are not directly connected here, while these other nodes are directly connected.
But you can see there are paths where you have a single hop: going from here to here, for example, would be a single hop, and from here to here would basically be two hops, and I think that's actually the maximum that we would get.
Let me see, say for example I want to go from here to here, I would have to go one, two, three — no, one, two, actually. Two hops. Excellent.
In this case, I have an example.
And this basically means that the number of hops that you need will increase your latency, and as soon as you have to go out of your own socket, you will also have less bandwidth.
The bandwidth per link should be the same, so unless there is lots of other traffic going on, you will have the same bandwidth here that you would have here, but the latency will be higher.
So actually, before I go into the details here, maybe let's take a break.
Now we know what this looks like; next we're going to talk about what the speeds actually are, or the differences — again on older systems, but you will get a feeling for the relative speeds.
Any questions so far?
Okay, then let's do a three-minute break.
Okay, so let's continue.
So, I checked regarding the question about adding, but actually I couldn't find it yet.
So I'm going to check again.
It just says you need to lock both, but it doesn't say why.
So I have to look it up again.
Okay, so then let's look.
So we saw what NUMA architecture looks like, so how these can be connected.
And current Sapphire Rapids systems also come with this kind of configuration, up to eight sockets maybe, and instead of 6 memory modules up to 8 memory modules per node, but otherwise more or less the same.
I mean, the interconnects will be a bit faster, but all in all it's the same setup.
And here we basically look at an experiment, where people measured this kind of NUMA architecture, four 8-core Nehalem processors — you can see this is a bit older already, some 10 years old.
And they basically checked what the difference is between the different kinds of connections.
So here you can see this setup doesn't have direct connections between every pair of sockets.
Socket 0 is connected to socket 2, for example, but to go from socket 0 to socket 1 we first have to go to socket 2 and then to socket 1.
So we have up to two hops in here, in the worst case.
We can have direct access, which is the first flow.
We can have remote access over one hop, from socket 3, for example.
We can have two-hop access, from socket 1, for example.
And the same with cross traffic, from socket 3 to socket 2, for example.
So, first we are reading locally, where in this case we get a maximum aggregated bandwidth of 24.7 gigabytes per second with 12 threads.
This means that we can max out whatever the memory offers, and the latency for reading locally is about 150 nanoseconds, because of course we have to go through the caches and the on-chip network.
So this is exactly this first flow here.
For the local read we're reading in parallel, so we're reading across all of the memory DIMMs that we have in here.
A single data item won't achieve this, right? Getting the maximum bandwidth means we really have to use all of the memory channels that we have in here.
And this we can only do by using multiple threads; we cannot do this with a single core.
Then the second flow would be: we go from socket 3 over 1 QPI link — QPI is QuickPath Interconnect, the predecessor of UPI — so from socket 3 to the memory of socket 1, and there we can see the aggregated bandwidth is 10.9 gigabytes per second, so this is basically what the QuickPath link gives us.
So here we're maxing out this link bandwidth, and we're getting 420 CPU cycles, so 185 nanoseconds.
You can see the overhead is not that much, actually.
I mean, it's still decently fast in terms of how much latency we have if we go through this network.
At least I find it decently fast, because going to another node somewhere else in a cluster would be much more expensive.
But it is there, there is a difference.
And then additionally we are reading remotely over two hops, meaning we are going from socket 1 over 2 QPI links, through socket 2, into socket 0, and again we get the same bandwidth.
So, as I said earlier, we're not doing anything else in the system here; we're just reading with as many threads as makes sense through this QPI connection, or actually through these two QPI connections here, meaning we have additional hops. There's nothing else going on, so we get the same bandwidth.
However, we have an additional hop, and as we can see, the first hop was about 35-ish nanoseconds extra, and with the additional hop it's even more than that: another close to 15 nanoseconds on top, in order to do these two hops here.
And then finally we have the case with cross traffic; this is the last test here, which would be flow number 4. We still do the same kind of read as before, and at the same time we're doing a read to a different socket, meaning we're going to socket 2 from socket 3, so again over this link, and with this cross traffic we're basically sharing the interconnect here.
And of course what happens is that we have to share: if there are two flows going on at the same time, we get half of the bandwidth, and this is exactly what we see. We get close to half of the bandwidth, 5.3 GB per second, with the same kind of latency.
Maybe a slight overhead because of the additional work that needs to be done for scheduling between the two flows, but more or less the same throughput.
And what you can see here is basically:
well, if we're just using the system as this, right,
so if we have multiple threads, we're using the whole memory,
reading everything from everywhere,
and we're doing a lot of reading, I mean, we're a database, right,
so we do a lot of data accesses,
then this interconnect will be our bottleneck.
So if we scale this up, eventually we're bottlenecked by the interconnect bandwidth, and our latency is also higher than for local accesses.
If we want to be really efficient, we somehow have to ensure that if we're on this socket, we're mostly working on this local memory.
And likewise with the others.
And also, think about it: if you read locally, you get 24.7 GB per second on a single socket.
If you do this across all four sockets in parallel, this means you get close to 100 GB per second of memory bandwidth, just because we are maxing out all the memory channels.
If instead we are reading across everything, we just have this one QPI link, so we basically split up these 10.9 gigabytes per second, four ways in the end.
And of course we'll have some local reading too, but this link will be our bottleneck.
So in the end this basically means we're going to be limited by QPI: it's going to be closer to the 10.9 gigabytes per second than to the 100 gigabytes per second that we would have if we just read locally.
Good so far?
Okay, so this is basically a summary of what I just said, right?
So actually the NUMA hardware is cool, and it's good because it increases the total bandwidth of the system.
It's basically like buying more servers: we get more bandwidth because each of them can work separately.
We don't have this kind of central communication bottleneck, so this is how you can think about it.
You can say that each socket can work independently on independent data.
However, as soon as we have some communication, this somewhat breaks down.
As soon as we're not aware of where the data lies, we end up working with remote memory — and remote meaning another NUMA region, right?
We're not talking about different servers, different nodes in the rack, whatever.
We're just talking about a memory DIMM module that's placed next to the other CPU rather than next to this CPU, on the same motherboard.
So this means the access time depends on the NUMA region where the data is stored, and also the throughput, right?
The bandwidth really depends on whether we're using local memory or not.
And if we have a large number of NUMA regions, then the problem is we cannot create a fully
connected system.
So we saw this, I mean, with four nodes it's still possible, but eight nodes is already
tricky.
Like basically putting all the connections is also not cost efficient at a certain point.
So this means if sockets are not directly connected, then we'll have higher latency, because we have to go through multiple hops.
Although we saw that this doesn't really make that much of a difference.
But we're also going to be dependent on the bandwidth.
Of course, that's something that I didn't really explain to you.
These connections have a certain bandwidth, right?
So if I, if I maybe have communication here and communication here,
they don't interfere with each other, or don't necessarily interfere with each other.
So this connection, like, going across, this probably would interfere to some extent.
In the end, today it's a mesh — that's also what I have to say — today it's a mesh, so there will be some cross communication somehow.
Actually, I have it here: modern CPUs have meshes rather than just direct connections, so this means if there's a lot of cross-communication, any kind of cross-communication, you will see this.
And the other thing that I want to mention: even in a single CPU, if you have many cores, like these large AMD EPYC CPUs or the very large Intel CPUs, these can be split up into different regions, or tiles, or whatever you want to call them, which also have their own kind of NUMA setup.
So they will also have different access speeds.
Okay, so this means for us,
well, if we have NUMA — and it's a safe assumption that we have NUMA — it means we have to design the system and the data structures for it.
At least we need to differentiate between local and remote memory if we want to be efficient.
The local memory is faster and has higher bandwidth, and we can be more concurrent, right?
Within the socket, the synchronization will be much faster, because we don't need to do this snooping, etc., across multiple sockets.
So if we are in the same socket, using just the local data, this will be faster.
And of course, if we really want to scale our tasks, so we want to have very large transactions or very large queries, very large data structures, then we somehow need to scale across the NUMA nodes.
So this means, well, for the database system there are different ways how we can approach this. A simple, or not so much simple as clear, way that's useful for this kind of setup is just partitioning the data.
We talked about this earlier: even if you have a large system, you often want to partition the data.
You don't want to have just one large file for the data; you can split up each table into multiple files.
And while you can do this in a column-oriented fashion, which would be more orthogonal to what we're talking about here, you also want to do this horizontally, so you're splitting up the sets of rows.
Just like we did with the morsel-driven scheduling that we talked about earlier, we can partition the data across the different NUMA regions and assign them to the different sockets, which then will have local access.
And then we can also structure the data structures for this.
And if we're controlling and tracking the locations of the partitions, then we can schedule operators, etc., close to the data, and with this get very high bandwidth.
And for this we basically need some kind of partitioning scheme. There are different ways of doing partitioning.
We can do it round-robin or random, which is usually good because it gives us nice load balancing.
But we don't have any locality.
Round-robin basically means that consecutive rows end up on separate sockets, so even if we're doing something like a range scan, this will already touch everything, and that's not super efficient.
Because of that,
we have range partitioning.
Range partitioning means we're
splitting up the table into key ranges.
So students from
A to B go to the first
socket, C to D go to the second socket, and so on.
And finally, similar to random partitioning, we can use hash partitioning.
There we could, for example, do this radix partitioning as we had earlier, or some kind of modulo partitioning.
With the same problem as in round-robin or random partitioning, meaning that if we're doing a range scan or something, we'll quickly touch all of the sockets.
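To make the difference concrete, here is a toy sketch (not from the lecture) of mapping a row's key to one of the NUMA partitions by range versus by hash; the function names and the mixing constants are just for illustration:

```cpp
#include <cstdint>
#include <cstddef>

// Assign a row to one of `num_numa_nodes` partitions.

// Range partitioning: contiguous key ranges stay on one socket,
// so a range scan touches only one (or a few) NUMA regions.
size_t range_partition(uint64_t key, uint64_t min_key, uint64_t max_key,
                       size_t num_numa_nodes) {
    uint64_t width = (max_key - min_key) / num_numa_nodes + 1;
    return static_cast<size_t>((key - min_key) / width);
}

// Hash partitioning: good load balancing for point accesses,
// but a range scan immediately touches all sockets.
size_t hash_partition(uint64_t key, size_t num_numa_nodes) {
    key ^= key >> 33;                     // cheap murmur-style mixing
    key *= 0xff51afd7ed558ccdULL;
    key ^= key >> 33;
    return static_cast<size_t>(key % num_numa_nodes);
}
```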
So there are different ways how to do this, and, well, if the data is not that large, or we're not reading it all the time, we might not want to partition too much, because it's also overhead, especially this kind of hash partitioning.
However, if we have individual data items, or data items that are close to each other, that we update a lot, so we have many conflicts, then it actually makes sense to distribute this, because otherwise we get a lot of contention on a single CPU.
Yeah, and of course the partition granularity defines how many partitions are created.
Again, this can be a two-step thing: on the one hand, where do we put the data initially, so in which memory regions; and then there's the partitioning with regard to the task sizes.
Say, for example, this is more like the execution strategy we discussed.
If you have coarse-grained partitions, we have lower scheduling overhead.
With fine-grained partitions, we see a better load balancing.
But we have more scheduling overhead, of course.
And yeah, well, then there's different ways of doing this.
Again, we can assign this round-robin.
We can assign this based on load, et cetera.
So that's the other part, the allocation or placement scheme.
So on the one hand, we're basically figuring out how to split up the data:
first, how to split it up into ranges, so by key, by hashing, by randomness, etc.
Then we say how large should these partitions be? Do we want to have one partition per node?
Do we want to have multiple partitions per node?
Smaller means better distribution, larger means less overhead,
and then we have to figure out where do we actually put it.
So this is again, can be round-robin, can be optimized by node balancing, etc.
There can also be some dynamics in there.
So, the OS actually gives you some support for NUMA.
So, modern operating systems are aware of NUMA regions,
so they know this memory is kind of here,
or close to this socket, etc.
And for this, Linux actually partitions memory into NUMA zones.
So each socket will get its own NUMA zone.
If you're splitting up your CPU, as I said yesterday, into multiple NUMA zones, then they will look like multiple sockets, multiple CPUs basically, each with their own NUMA zone.
And then there are basically separate data structures for each NUMA zone.
This basically makes it possible that if you assign memory to a thread that runs on a certain core, Linux will try to give you memory on the local NUMA node, which basically makes sense, right?
So by default, you will actually get the good setup.
So if you're not fully utilizing your memory — and that's the stuff that we did so far; if I'm doing small experiments here, testing some memory bandwidth stuff — then automatically, just by not doing anything, I get local memory access.
Because my thread will be on the same node as the memory, or the memory will be allocated on the same socket, so I don't really need to do anything.
However, I can also specifically say I want to be in a certain NUMA zone.
With this I can also test things cross-node, so I can allocate in different NUMA zones and basically measure the interconnect; or of course, if I'm running out of memory locally, then the kernel will also give you memory somewhere else.
That basically happens anyway, but we can also specifically say that we want this.
So what happens if the database management system calls a memory allocation? Practically, what the system, or the kernel, will do first is nothing.
Initially it will just say: well, unless you touch the memory, there is no memory for you.
So you will only get memory once you're actually doing something; we will just extend the data segment.
It's virtual memory, so nothing happens until we're actually trying to access the memory.
Then there will be a page fault, and then there will be an actual allocation of the page. And by default, this would be local, meaning that we get it on the same socket.
Now, if you have a thread that just allocates and touches memory, and then you use it in another thread that's on a different socket, then all of a sudden you will have this kind of remote memory access all the time.
But you can use a bind operation, for example, to say: oh, this one I actually want to have over there.
Right. So after the page fault, this is when the OS will actually allocate the physical memory, and by default, it will do it on the node local to the thread.
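A small, assumption-laden sketch of exploiting this first-touch behavior: allocate without initializing, then let each (socket-pinned) worker touch its own chunk, so the pages end up local to it. Thread pinning itself is omitted here, and all names are made up.

```cpp
#include <cstring>
#include <memory>
#include <thread>
#include <vector>

// First-touch in practice: allocating only reserves virtual memory; the
// physical page is placed on the NUMA node of the thread that first writes it.
// So each worker should initialize (touch) the part of the buffer it will later
// use. (Assumes the workers are pinned to different sockets, e.g. via numactl
// or pthread_setaffinity_np.)
std::unique_ptr<char[]> numa_friendly_alloc(size_t bytes, unsigned num_workers) {
    std::unique_ptr<char[]> buffer(new char[bytes]);   // no value-init: pages not touched yet
    size_t chunk = bytes / num_workers;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            size_t begin = w * chunk;
            size_t end   = (w == num_workers - 1) ? bytes : begin + chunk;
            std::memset(buffer.get() + begin, 0, end - begin);  // first touch -> local pages
        });
    }
    for (auto& t : workers) t.join();
    return buffer;
}
```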
However, you can also say: we want to have interleaving.
This means, rather than having everything local — and this is for the case where I have one large data structure, I want to use the whole memory, and I don't really care about the distribution, because I know I'm not just going to use the local memory here, I really want to use everything — I will allocate the memory interleaved across all of the CPUs.
So then I get a uniform access.
We're basically back to uniform memory access, but I also know that I'm going to be bound by the interconnect at some point.
So this gives me uniform access, but uniformly slower access.
Otherwise, it's basically first touch, right?
I can allocate, but whoever first touches the page, or wherever the page fault is initiated, there the memory will be allocated.
And of course we can also do this kind of NUMA-aware programmatic allocation.
So this is when the database management system says: well, I know what I'm doing, so please give me memory here.
I want this part of the memory to be on this socket, and the other part of the memory to be on the other socket.
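With libnuma on Linux, such explicit placement might look like the following sketch (standard libnuma calls, but error handling and thread pinning are reduced to a minimum):

```cpp
// Link with -lnuma.
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    const size_t bytes = 1ull << 30;   // 1 GiB

    // "I know what I'm doing": place this region explicitly on node 0 ...
    void* local = numa_alloc_onnode(bytes, 0);
    // ... and run the thread that works on it on node 0 as well.
    numa_run_on_node(0);

    // Alternative: interleave pages round-robin across all nodes, trading
    // locality for uniform (but interconnect-bound) bandwidth from everywhere.
    void* spread = numa_alloc_interleaved(bytes);

    // ... use the memory ...

    numa_free(local, bytes);
    numa_free(spread, bytes);
    return 0;
}
```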
And this actually makes a difference.
So this is again a slightly older experiment, where people first use the local memory of a single node — not a single DIMM, so it would be something like maybe 4 DIMMs with 50 gigabytes per second, for example.
And this is kind of a huge system with multiple blades: you have eight blades that are interconnected, and in each blade you have multiple individual processors.
So this is the rack.
In the rack, you have these IRUs, which are interconnected.
In the IRUs, you have individual blades.
And each of the blades has two nodes with individual RAM, which are then connected with QPI.
And on top of this there is this NUMAlink interconnect of this system; SGI built these huge NUMA systems.
And here they basically tested the same thing.
Of course you can see that more hops will cost you more, and here it's significantly more.
Even here you lose some of the bandwidth.
So if you are on the same blade, you basically just pay the QPI cost, which would be the same as we saw earlier; we have the same throughput here.
But then if we go out through the NUMAlink to a different blade, the throughput drops, to 7.5 here, and if we go even out of the IRU into another one, across the rack, it will be a bit slower than that.
Now of course, what we can do rather than going through everything: if we do this interleaving across everything, then we're bound by these interconnects here.
However, if we're using all of the local memories in parallel, then, well, we're of course much, much faster, right?
Because we can really use these individual memory regions.
Okay, so with this, real quick, we'll go through one final idea. Or are there any questions so far?
It's not that special; it's just something you need to know in order to be efficient here.
Now we're going to look at how we can use this for sorting, for sort-merge joins.
We talked about sort-merge joins yesterday already.
And basically we can extend the same ideas across multiple NUMA nodes, so across multiple sockets.
Before we said: okay, we have multiple cores, so we want to be cache efficient. Now, if there are multiple sockets, we also want to use local memory as much as possible.
We will have to go through the interconnects eventually, because we always need to also compare data that is on another node. However, we want to limit this.
And there are different kinds of solutions. I'll just briefly show you the massively parallel sort merge, where you basically start by sorting one input locally and the other locally, and then a multi-way sort merge, which is basically similar to the multi-way sort merge that we saw for the single-socket case.
So that single-socket sort merge is kind of a building block of this one here.
There we basically sort both inputs globally, but with local sorting first and then merging, in order to get a globally sorted input.
And because it's not super hard, or not super different from what we did yesterday, I'm going to be quick here.
Unless you feel it's too quick, right?
Unless you feel there is something unclear.
So, this massively parallel sort merge has two different steps: we start with one input relation and range partition it across the different NUMA nodes, and then we do a local sort on each of those partitions.
And then we basically have to compare with each of those, right?
So we'll have partitions, each node has its individual data, and they are sorted.
So we need to find out which part of the other input relation needs to go to which node.
And in the massively parallel sort-merge join, what they do is they also do a local sort on the other input, but they don't partition it in advance.
Which means each of the local sorted runs will, at least statistically speaking, have inputs that belong to each of the other nodes.
So then basically what we're doing is scanning the smaller sorted runs and joining them with the larger partitions.
So only these smaller ones need to be communicated; we're not going to communicate the larger ones, and we join them locally.
So that's basically one way of doing a parallel sort-merge join.
And then all the communication we need is first the partitioning; the sorting is all local.
Here the sorting is all local, but then the scanning basically needs some communication.
And that's basically the idea in general: if we want to do something in parallel, we have to do this in a NUMA-aware fashion.
We want to make sure that the local portion of the computation is as large as possible, without creating too much overhead.
So then we join against the first run, the next one, and so on, and then we will have the complete result.
And we're doing this for all of them in parallel.
But this step, this communication, will most likely be the bottleneck.
So here we're basically maxing out the interconnect.
And because of that, there is also a different way of doing this, which is called the multi-way sort merge, which basically does the sorting in the way we discussed yesterday.
So rather than doing a global partitioning and then separate sorting, we're doing local sorting and then a merging.
In this case the sorting is completely parallel, so we're doing this completely locally, and then only the merging involves communication.
We do the same on the other input, and then the only communication is in the merging phase here.
So this is basically just the sorting side: the local sorting will be highly parallel and not limited by the interconnect.
Only the merging is where we're limited by the interconnect.
So this kind of redistribution is where we need to go across the different sockets.
And for this, what we can do is, again, rather than doing binary merges of individual partitions, we can do a multi-way merge.
It's kind of the same question that we had yesterday: how do we max out the caches, etc., by not just merging two runs or two cache lines at a time, but doing a multi-way merge, merging many at the same time, so we have more computation that we need to do, but less copying back and forth in memory.
So the idea here is: if we do out-of-cache merging — this is basically what we said yesterday — then we're going to be bound by memory bandwidth.
So we want to merge in the caches, but in order not to be slowed down by memory, we want to load many cache lines at the same time, from many runs, and merge them at once.
Because if we load more stuff in parallel, we have fewer memory round trips, right?
So then we're saving memory bandwidth,
because we're not repeatedly reading and writing,
but reading more at a time.
And we can do this somewhat asynchronously.
And then we can also use the kind of SIMD merging that we had yesterday, so this bitonic merge, in the local merge.
So, rather than doing a single merge, we're going to do this kind of a merge tree, basically.
And there, of course, we want to balance memory bandwidth and compute, this kind of trade-off that we're always discussing.
We want to have memory bandwidth balanced with the computation.
So if we're just merging two runs, then we're going to be limited by memory bandwidth, so the CPU will stall.
If we have a higher fan-in, meaning more runs that we're merging at the same time, at a certain point we're going to be compute-bound; then we're not bound by memory anymore, there's less pressure on the memory, so we're going to be faster.
Of course, if we're too CPU-bound, then at a certain point our performance will degrade again.
So this is something that we need to figure out experimentally: what's a good setup in terms of memory versus compute, how many runs do we merge at the same time.
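As a simple reference point (not the tuned implementation discussed here), a k-way merge with a priority queue shows where the fan-in enters: all k runs are consumed in one pass over the data, at the cost of a log k comparison structure per output element. A real implementation would use a loser tree, SIMD/bitonic merge kernels, and block-wise streaming.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge k sorted runs into one sorted output in a single pass.
// The fan-in k is simply runs.size(): larger k means fewer merge passes
// (less memory traffic) but more comparison work per element.
std::vector<uint64_t> kway_merge(const std::vector<std::vector<uint64_t>>& runs) {
    using Head = std::pair<uint64_t, size_t>;            // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    size_t total = 0;
    for (size_t r = 0; r < runs.size(); ++r) {
        total += runs[r].size();
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    }
    std::vector<uint64_t> out;
    out.reserve(total);
    while (!heap.empty()) {
        auto [val, r] = heap.top();                       // smallest head among all runs
        heap.pop();
        out.push_back(val);
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```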
And all in all, this then looks somewhat like this. We have the multi-way sort-merge join versus the massively parallel sort-merge join.
And we have a non-partitioning hash join and a radix partitioning hash join.
Then we have basically two setups, with smaller and larger tables.
And we can see for the multi-way sort-merge join that the partitioning takes some time, the sorting is much faster than in the massively parallel sort-merge join, and the merging again doesn't really take that much time.
And it's even faster than a non-partitioned hash join.
A non-partitioned hash join is a hash join with a global hash table.
This basically means we have one big hash table, which again will be bound by the interconnect, because all of the threads will have to write to this hash table across the interconnect.
And of course we have lots of synchronization issues.
I mean, if it's large enough, we're not going to have synchronization conflicts all the time, but they're more likely because it's a global data structure.
If we're using the radix partitioning join, then we see that this is actually still much faster, almost twice as fast, or more than twice as fast, as the multi-way sort-merge join.
But the multi-way sort-merge join has the benefit that our output is going to be sorted.
So if we need a sorted output afterwards, we're going to save on this, because this sorting step would otherwise go on top here again, and in some cases some of the partitioning as well.
So we would pay this on top here again just to get a sorted output.
And if we're looking at larger data, then the probing and the building actually become expensive; I don't know exactly where the partitioning goes in here.
It's somewhere in there, so you can't really see it, but I would assume it's also part of this.
And you can see this gets more costly. The radix join is still faster, but if you need a sorted output afterwards, then the multi-way sort-merge version would actually be the better option in this case.
Okay, so with this, thank you very much. That was non-uniform memory access.
It's a bit high level today, but you need to know how this works in order to use it efficiently.
So, the important thing is: you have local data and remote data, or local memory and remote memory.
You can get very high bandwidth if you're using all of the local memory, but you're not going to get very high bandwidth if you access data randomly across everything, because then you're going through the interconnect and you have lots of communication.
If you really want to use this efficiently, you have to be aware of it and lay out your data structures accordingly.
Next time we're going to talk about persistent memory.
Actually, if I say we, I mean Lawrence is going to talk about persistent memory.
So this is his topic. We're going to be back in the old lecture hall.
But there's one question for you: how do you like this room? Because the group's not that large.
We're probably not going to change it this semester, but for the future, would we prefer this kind of smallish setup?
Would you actually prefer to switch it now, for good? I mean, I can't promise this is possible.
Who prefers this lecture room?
Everybody.
I would have preferred on.
Okay.
Well, okay.
We're going to change it.
This is possible because I also think it's a bit, for the smaller group, it's a bit nicer.
Okay.
But then, with that, thank you very much, and
I will be there next week.
Actually, next topic is not persistent memory.
Tuesday is going to be the task session.
This is going to be myself, so...
I don't think I'm going to...
We'll be here. Thank you.
Thank you very much.