Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Processing on GPUs II
Episode Date: June 26, 2024...
Transcript
I guess we can start. So, hi everyone. Today we'll continue on the topic of
data processing on GPUs. So yesterday we gave an
introduction to GPUs: we introduced the GPU architecture,
and we discussed the main benefits and the things
to consider when programming for GPUs.
We went through the physical and logical execution
of the programming model.
And today, we'll be a bit more on the database side
and more on the implementation of actual database operators.
So in the beginning of the lecture,
I'll talk about the implementation of selections
and joins on GPUs.
Then I'll talk about multi-GPU data processing,
basically how the architectures differ,
what kind of system architectures are available,
what kind of topologies are there,
and how we can develop algorithms that
can recognize and utilize the benefits of these topologies.
And that will be the part regarding multi-GPU sorting.
And there we'll talk about peer-to-peer sorting,
heterogeneous sorting, where we also
utilize the CPU in the process, and also
a radix-based multi-GPU sort that
was developed in our group.
So we'll start with the GPU accelerated database operators.
And we'll start with the selection,
which is a fundamental database operation.
And there, the goal is quite simple.
Given a predicate p, we need to choose the subset of tuples
from our relation R that satisfy p and remove the rest.
And this is an example.
We have our predicate: x needs to be larger than 4.
We have our input relation, which is x.
And we have the output relation, y.
And this would be the corresponding condition code,
where we check the predicate, and if the predicate is satisfied,
we perform an addition; otherwise, we just move on.
So how can we utilize the GPU to parallelize this?
The big question is where to write the results in parallel.
And basically, now we'll talk about how we can address this.
The idea or an initial solution would
be to pre-compute the locations by utilizing prefix sums.
And there, basically, we can rely on prefix scans;
the prefix sum is the prefix scan operation
that we want to use.
And there, the idea is to apply a binary operator
to an array of n elements.
And basically, for our input relation of x,
we would have this x prime output.
So now the question is how we can do this on the GPU
in a somewhat efficient way.
So as I mentioned, the idea is to pre-compute the locations
by utilizing these prefix sums.
And we can do this in three steps.
So given the input relation x,
we first build a flag array.
This is just a binary array that marks whether the values from x satisfy our predicate p.
In our case, we have three values that do,
and our flag array will have ones at those places.
Based on the flag array, we then compute the prefix sum
in the following array.
And in this way, we pre-compute the locations at which to write the elements from X into the array Y,
which would be our output.
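To make this concrete, here is a minimal CUDA sketch of the idea, assuming the example predicate x > 4 and assuming the exclusive prefix sum over the flag array has already been computed (for instance with the scan we discuss later); the kernel and variable names are illustrative, not from the slides.

__global__ void build_flags(const int *x, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (x[i] > 4) ? 1 : 0;      // predicate p: x > 4
}

__global__ void scatter_selected(const int *x, const int *flags,
                                 const int *prefix,   // exclusive prefix sum of flags
                                 int *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        y[prefix[i]] = x[i];                // pre-computed write location
}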
So we can parallelize the selection by following these three steps.
And initially, what we want to do is apply a reduction here.
So basically, we want to reduce a set of values
to a single value using the binary operator.
And basically, from the prefix scan slide,
we just utilize the input A, and we basically
apply the operators on the subsequent elements,
and we have the output in A prime.
As a building block, we can use the reductions
for this prefix sum operation, and we
can calculate this conveniently in the shared memory
in the GPU.
So basically, we would pay no overhead
in the sense of additional data transfers.
And let's see how we can do this and how we can basically
utilize the GPU shared memory for this.
So we would use binary reduction trees for this.
And in general, this is how a binary reduction
tree would look like.
We would have three steps to generate the sum on the top
in the root, and we'll have complexity of log n
to achieve this.
However, on the GPU, we would do something different.
Yesterday, we talked about that we have
many threads at our disposal.
We can group these threads.
We can manage the memory accesses in these threads.
So let's see how, based on what we learned yesterday, how we can use this concept to
implement this in a naive manner.
So let's say we initially just have this input array on the bottom and for every two elements
we activate a thread pair. I don't know how this looks to you, but
this is way too much work to be done by the threads. But if we go about this in a
naive fashion,
this is what we get.
So in theory, we would just parallelize every sum operator,
and we propagate these changes upwards.
However, in fact, with this implementation,
we're performing way too many add operations.
And we are actually increasing the complexity compared
to a sequential implementation.
So the question is, can we be more efficient about this
and utilize the threads and the memory accesses
in a smarter way.
So we talked about thread coarsening yesterday as well.
We can just utilize fewer threads.
We say that for each two elements, we activate one thread,
and after every step, we shut down half of these threads.
Now a single traversal of this tree
performs O(n) additions,
so we are back to the work of the sequential version.
And this is how an example implementation in CUDA
would look like.
So you're familiar with how we define a kernel.
And here, we are basically using strided memory access
to shift the memory access in an increasing manner.
So basically, after every iteration,
we double the stride, and we also
shut down half of the threads.
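Since the slide's code is not part of the transcript, here is a minimal sketch of such a kernel, assuming the input is small enough to be handled by a single thread block (so __syncthreads is a valid barrier) and resides in global memory; all names are illustrative.

// Naive reduction: the stride doubles each iteration and half the active threads drop out.
__global__ void reduce_interleaved(int *data, int n) {
    unsigned int tid = threadIdx.x;
    for (unsigned int stride = 1; stride < n; stride *= 2) {
        unsigned int idx = 2 * stride * tid;     // element owned by this thread
        if (idx + stride < n)
            data[idx] += data[idx + stride];     // add the element 'stride' positions away
        __syncthreads();                         // wait for the current tree level to finish
    }
    // data[0] now holds the reduction sum.
}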
Is this the most efficient that we can get?
Yes?
No?
OK. Do you have an intuition what you would do?
Okay.
So let's see on the next slide.
So yeah.
Maybe the question should have been what's the problem with the previous slide, but we'll
get there.
By halving the number of active threads in each iteration
and increasing the stride, we are also
increasing the physical distance of the memory
accessed by each thread.
And initially, for example, in the first step,
this is perfectly fine.
But after x amount of iterations,
we have underutilized warps.
As an example, if our array had 256 elements instead of eight,
and we activate, as we said,
one thread for each two elements,
that would be 128 threads in this example, which would
be the equivalent of four warps.
And for this, we would need eight iterations
to calculate the reduction sum, since we
have log n complexity.
However, if we have warps 0, 1, 2, and 3, then from iteration 6 onwards,
only warps 0 and 2 will be active, with one thread each.
So we have allocated threads that
are busy looking for their next memory access,
but we're not using them.
Even after we stop half the threads
after each step, the issue or the efficiency problem here
is that these two warps are too far from each other.
And in yesterday's terms, we're not
coalescing the memory access.
We are not using the cache lines efficiently.
So we're trying to access too distant of memory locations.
So let's see what we can do to fix this.
So instead of doubling the stride,
we can actually decrease the stride.
So at every subsequent iteration,
the active threads and the memory locations they access
stay co-located.
So instead of moving the warps and the threads inside them
further apart, we actually bring them closer
so we can utilize more adjacent memory locations
and utilize the cache lines better.
And this is how this would look like in our implementation.
So in the previous example, we started combining the adjacent elements.
Here we are actually splitting the array in half,
and we are basically narrowing down the memory access.
So basically, at all times, we are keeping all warps busy.
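As a contrast to the previous sketch, this is roughly what the improved version looks like: the stride starts at half the array size and is halved each iteration, so the active threads and their memory accesses stay adjacent and coalesced. Same single-block, power-of-two-size assumptions as before.

__global__ void reduce_sequential(int *data, int n) {
    unsigned int tid = threadIdx.x;
    for (unsigned int stride = n / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            data[tid] += data[tid + stride];   // fold the upper half onto the lower half
        __syncthreads();
    }
    // data[0] holds the sum; the active threads remain packed into few warps.
}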
Another thing to consider, and this is actually the challenge here, is that the whole input array
is stored in global memory, and the partial results are also written to global memory.
And in our example with 256 elements,
this would mean that we have around 36 global memory
accesses and writes.
Can we be more efficient about this?
And let's see how.
So basically, we can load the input
from global memory, which we have to do in the beginning.
But after every computation of intermediate result,
we can write and read it from shared memory.
And if you remember, if we store something in the shared memory,
we're reducing the access time by orders of magnitude
compared to accessing it from global memory.
And basically, using the shared memory,
we can be as fast as a register if we're
lucky that there are no bank conflicts.
But we're still at least 100x faster
than accessing global memory.
So in our example of 256 elements,
if we start only with the initial input,
reading it from global memory and writing the final output
to global memory, we reduce the total global memory accesses
from 36 to 9 or by 4x in our example.
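A hedged sketch of this shared-memory variant: each block stages its elements in shared memory, performs the whole reduction on-chip, and writes a single partial sum back to global memory. It assumes blockDim.x is a power of two; the shared-memory size is passed at launch, and the names are illustrative.

__global__ void reduce_shared(const int *in, int *out, int n) {
    extern __shared__ int sdata[];                    // blockDim.x ints, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;

    // Load two elements per thread from global memory and add them on the way in.
    int sum = (i < n) ? in[i] : 0;
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // All intermediate partial sums stay in shared memory.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];                   // one global write per block
}

// Launch example: reduce_shared<<<blocks, threads, threads * sizeof(int)>>>(in, out, n);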
So we mentioned this example.
However, with the current implementation,
we don't have the partial sums, so this does not yet give us our final prefix sum.
So we would add one more phase to this, and this would be the sweep down phase
where we'd add partial sums to the successor elements.
And basically after this phase, we would have the final prefix sum.
So now we start at the top,
with what we computed in this moving-up phase.
And in the sweep-down phase, we'll retrieve the partial sums.
And with blue arrows, we depict the elements, or in this case,
the 0 that we're pushing down.
And with the red arrows, we define the partial sums that
are computed.
And in the bottom, in the end, we have the full prefix sum.
Right.
So the complexity of this is 2n.
And now, basically, we have the same performance considerations
as with the reduction kernel phase.
And basically, we're applying the same logic:
we need to manage memory and control divergence,
and we need to reduce the global memory accesses.
So basically, going step by step here,
we would start in the same fashion
as we did when moving up.
So basically, we start at the half of our array,
and we combine it with the final prefix sum.
And to retrieve the partial sums,
we replace that value with a 0, because that result we
already have.
By moving down, we split the subsequent sub-arrays in half
and push down the 0 element that replaced the
initially computed total sum,
and hence it ends up in the first position.
By combining the pushed-down elements
with the precomputed values from the previous steps,
we are able to retrieve the partial sums.
And that's how we end up with the full prefix sum
at the bottom.
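For reference, a compact sketch of this up-sweep/down-sweep (exclusive) scan for a single block in shared memory, following the standard work-efficient scheme; it assumes n is a power of two with n <= 2 * blockDim.x, and leaves out bank-conflict avoidance. Names are illustrative.

__global__ void exclusive_scan(int *data, int n) {
    extern __shared__ int temp[];                 // n ints, set at launch
    int tid = threadIdx.x;

    if (2 * tid < n)     temp[2 * tid]     = data[2 * tid];
    if (2 * tid + 1 < n) temp[2 * tid + 1] = data[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums towards the root.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Down-sweep phase: push a 0 down from the root and distribute partial sums.
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t    = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    if (2 * tid < n)     data[2 * tid]     = temp[2 * tid];
    if (2 * tid + 1 < n) data[2 * tid + 1] = temp[2 * tid + 1];
}

Note that this produces the exclusive prefix sum: each output position holds the sum of all elements before it.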
Are there questions?
Yeah?
Why is the final result on the bottom right
35 instead of 36 in the beginning?
Shouldn't it be the same?
Right.
So here, the final element there is basically
the pre-computed value minus whatever
there was in the initial array.
In the initial array, there we have 1.
So basically, by having 35 there in our prefix sum
combined with the input array with the value 1, then basically we
know that until this point, our prefix sum is 35.
And then the actual value in the array there would be 1.
Hence, we get the 36.
Other questions?
Okay.
Let's move to joins.
So with the selection and the implementation that we've shown, we had two convenient assumptions,
or constraints.
The tuples either match or not.
So this is our initial constraint.
But we also know the maximum result size from the input relation.
Basically, in our selection, we know that the output cannot be
larger than the input relation.
In joins, this is not the case.
Tuples can match multiple times.
And also, the maximum output size, first of all, we cannot define it
before computing it.
But also, it can be as large as the Cartesian product
from both input relations.
So in this case, we need to think
about how to calculate the actual size in advance
and whether this would fit in GPU memory or not.
Since the only constraint that we have with GPU joins
is we need either one of the input relations
to fit into GPU memory, since the result can
be as large as the Cartesian product,
it may not fit at all.
So how would we exploit the parallelism of the GPU
in order to make joins more efficient? Basically,
we would like to perform lock-free processing in order
to utilize the parallelism.
But also, we would like to pre-compute the right locations
for each thread.
And that's how we know whether our result would
fit into the memory.
So in terms of execution, joins generally
use this three-step execution scheme,
where basically each thread counts
the number of join partners for the inputs.
And then with the result size of each thread,
we compute the prefix sum to get the right location
for each thread.
Then basically in the third stage,
we allocate the memory on the host side
for the size of the join results.
And then basically we write with all threads
to the global device memory.
And the locations are pre-computed.
So basically, each thread has its own designated memory
locations where the results will be written.
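A minimal sketch of this three-phase scheme, using a naive nested-loop probe purely to illustrate the count-then-write logic (the lecture's variant instead uses a sorted index with binary search); the prefix sum over the counts, the host-side allocation, and all names here are assumptions.

// Step 1: every thread handles one element of S and counts its join partners in R.
__global__ void count_matches(const int *r, int nr, const int *s, int ns, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ns) return;
    int c = 0;
    for (int j = 0; j < nr; ++j)
        if (s[i] == r[j]) ++c;
    counts[i] = c;                                 // input to the prefix sum (step 2)
}

// Step 3: after the prefix sum over counts and the host-side allocation,
// every thread writes its matches to its own designated locations.
__global__ void write_matches(const int *r, int nr, const int *s, int ns,
                              const int *offsets,  // exclusive prefix sum of counts
                              int *out_r_pos, int *out_s_pos) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ns) return;
    int pos = offsets[i];                          // pre-computed write location
    for (int j = 0; j < nr; ++j)
        if (s[i] == r[j]) {
            out_r_pos[pos] = j;                    // position list for R
            out_s_pos[pos] = i;                    // position list for S
            ++pos;
        }
}

The total result size, needed for the host-side allocation, would be the last prefix-sum entry plus the last count.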
So how would we implement this when, say, for example,
we're using an indexed nested loop join?
We would follow this three-step execution scheme.
We would start with building the index on the input relation,
or one of the input relations, in this case, r.
And basically, for the actual join,
we would look up each element of s in a sorted array
by doing binary search.
Final step, we transform the result position list
to the original unsorted index of the input relation.
In case r is already sorted, we skip the index building and the sorting step.
So in our example, to compute the result of the join columns,
we would need to store two position lists.
So given input relations R and S,
we would need to store a position list for each of them.
And if we see here which positions contain a match,
we have the numbers 5 and 3 that are in both areas.
And we're basically joining those.
In our position list in R, we see
that the 5 is in position or index 2.
And 3 is also conveniently on position 3. In the position list of S, we have the 5 at position 0,
but also we have the duplicates of 3 at positions 1 and 4.
Both position lists always need to have the same number
of elements, since each match produces one entry in each list.
So no matter in which relation
the element with multiple matches is,
we store its position repeatedly
in that relation's position list.
So to build the index, we start with keeping the position list
with the original order.
And basically, we would need this for the following step. So if we sort R, we would get the array
on the right-hand side.
So basically, we have the values in ascending order.
But we also keep the positions in the position list
from the original unsorted R. And now, basically, in step 3,
we would transform the result position list
with the original index.
And if we use three threads for this,
we would start in the first phase where basically each thread would
access one element and perform a lookup in our sorted index. Then the threads would access
elements from S with a stride, which we've basically seen already. So initially, the first three threads
would access the first three elements, perform the lookups,
and basically get the positions of the matches.
In this first phase, each thread collects the number of matches.
So thread 0 will find its match with the value 5,
and thread 1 will see that the value 3 has two matches.
So basically, this is after the completion of the lookup in S.
And this would be the counter array.
We know the memory locations
that each thread accessed,
and we also have the values
at those memory locations,
and each thread collects its matches in its counter.
The more threads we activate, the larger this counter array
will be, and the more memory we need to allocate for it.
We then compute the prefix sum over the counter array
to get the right write positions for each thread
and the overall result size.
And we get to the final number of three matches.
So in the second phase, using now the prefix sum
that we generated from the counter,
each thread looks up the assigned elements
and basically uses the prefix sum to compute the result.
In our example, we would have basically
the following position list.
So the value that thread 0 accesses is 5.
And from the position list, we know that the 5
has a match on index 3.
Similarly, thread 1 performs a lookup
and sees that the value 3 in this case
has two matches at positions 1 and 3.
And in the case where we sorted R in the beginning,
that is, if we actually sorted the input relation,
we now need to reverse the sorting.
In our input relation, the elements
were not in this order,
so we need to utilize the position list that we kept
to reverse the sorting.
To actually revert the sorting,
we perform a gather operation on the position list.
And then, as an output, we have the position list
referring to the original unsorted array.
So basically, these are the two steps.
How we started from an unsorted array,
we collected the position list.
Then we have the sorted array as an input.
And then we're basically reverting this
based on the position list for the sorted one.
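A short sketch of this final gather step, assuming orig_pos is the position list kept while sorting R and r_pos_sorted holds the match positions referring to the sorted R; the names are illustrative.

__global__ void revert_sorting(const int *r_pos_sorted,  // match positions in sorted R
                               const int *orig_pos,      // original index kept during sorting
                               int *r_pos_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r_pos_out[i] = orig_pos[r_pos_sorted[i]];         // gather back to unsorted positions
}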
Are there any questions till this point?
Yes?
In step two, are the threads assigned the values or the indices in the first phase, for the
prefix sum?
This one?
Yeah.
So does thread zero look for all the fives, or is there like a global counter that each thread increases
if it finds a 5, for example?
So thread 0 can also find other values than the 5.
So it basically just counts the matches.
It's not assigned for a single value.
Ah. It's not assigned to a single value?
Why?
I mean, what's confusing you?
Sure. I think I'm going to say that again.
With the notice that you just put in,
are you telling me?
No.
Okay.
Other questions?
Okay.
We're a bit ahead of time,
but I suggest
we take a short break now and wrap up with the operator
implementations.
And after the break, we'll move on to multi-GPU architectures
and multi-GPU algorithms.
So five minutes.
We'll see each other back at 40.
All right, so we'll continue with multi-GPU data processing.
And for this, we'll do a short recap of the memory hierarchy
that we saw yesterday.
So based on the architecture and the numbers for compute
bandwidth and our memory access latency,
we can see that the faster memory or the quickest memory
has very limited capacity.
So say for the fastest one in our L1 caches,
we are in the order of kilobytes.
And the further down we go in the memory hierarchy,
the larger the memory, but also the more expensive
the memory accesses become.
So we see around an order of magnitude decrease in bandwidth as we go down the hierarchy.
So that is why it is important to make sure that we are allocating memory at the exact place where it is
needed and not default to reading data
from GPU global memory.
To achieve this, we need to be careful about managing
the lifetime of the data structures
that we generate in our program.
And basically, based on what we discussed yesterday,
we can shift this between the kernel lifetime
and the complete application lifetime.
So the GPU global memory, compared to DRAM,
it has quite limited capacity.
The largest GPU that can be found today
has a capacity of 192 gigabytes, where with DRAM,
we can go into the order of terabytes in our servers.
So basically, whatever we need to process in our GPU
needs to be transferred from this main memory.
And basically, anything that's larger than our GPU memory,
by default, it means that the data needs to be swapped,
and we need to access main memory.
Now, the bandwidth between the host memory, or the main memory,
and the GPU global memory can vary by quite a bit.
And this is pretty much dependent on the interconnect
between the GPU and the DRAM.
So we are in the order of 16 gigabytes per second for PCIe
interconnects; we'll discuss them later.
And state-of-the-art proprietary interconnects allow throughput of up to 900 gigabytes per second.
And this is all determined by the physical hardware underneath.
And there are multiple interconnect interfaces that are available.
Some are general across all vendors.
Some are vendor specific.
And we'll go through each of them.
So PCI Express, this is a general interconnect
that connects the CPU to the storage, the memory,
and to any available coprocessors.
It is the most commonly used CPU-GPU interconnect.
And this is important in our scenario.
Current systems, both the ones we have available here and the ones currently in production, use versions 3, 4, and 5; the slides in Moodle will be updated accordingly. When using
all 16 lanes, the bandwidth for PCIe version 3 is 16 GB per second per direction,
and every following version doubles this.
So PCIe 4 is 32 GB per second, and PCIe 5 is 64 GB per second per direction.
There are systems that are currently announced, not yet
available on the market, that have PCIe 6.
And this is, again, double of PCIe 5:
the bandwidth is 128 gigabytes per second.
PCIe is quite ubiquitous in the sense
that in a single machine,
we would have multiple sets of lanes
to connect it to the different components.
When we're talking specifically about connecting the CPU
to the GPU, or connecting GPUs among themselves,
there are proprietary interconnects,
usually manufactured by the GPU vendors themselves.
So there is the NVIDIA NVLink and NVIDIA NVSwitch
interconnects, which are produced for the NVIDIA GPUs.
And the NVLink is mainly used as an interconnect
between two GPUs.
However, in certain machines, it can
also be found as the bridge connector
between the CPU and the GPU, replacing PCI Express.
The NVLink supports high bandwidth peer-to-peer data
transfers.
And currently, the current systems
are mostly using NVLink 3.0 and NVLink 4.0.
2.0 is already considered obsolete.
3.0 allows data transfers up to 300 gigabytes per second
per direction.
4.0 moves this to 450 gigabytes per second per direction.
The difference between NVLink 3.0 and 4.0
is not in the actual per-link hardware.
It's just in the number of lanes that are available at the GPU.
The per-link capacity is actually the same:
it's 25 gigabytes per second.
The only difference is that hardware architectures
using NVLink 3.0 have 12 lanes, amounting
to 300 gigabytes per second,
and the systems using NVLink 4.0 have more lanes with the same per-lane capacity
of 25 gigabytes per second.
NVLink 5.0 is announced. The systems are still not commercially available.
But this basically doubles NVLink 4.0's bandwidth up
to 900 gigabytes per second per direction.
And this is actually based
on the hardware characteristics of the individual links: they now
support 50 gigabytes per second with the same 18 lanes that are
available. So basically, these are awesome numbers.
This is very, very high bandwidth
compared to the compute available,
and especially compared to PCIe interconnects.
So the limitation here is that NVLink
can be used as an interconnect between only two elements,
or two GPUs, or one CPU and one GPU.
So for this purpose, NVIDIA developed NVSwitch,
which works as a switching element instead of a direct peer-to-peer interconnect,
and it can connect up to 16 GPUs at once.
So this allows all-to-all communication.
And basically, the data transfers are non-blocking.
So we can issue transfers.
So if you see the figure here, we
have six switching elements.
So at any point in time, all six switching elements
can be busy, and data transfers
can be issued through them.
Later on, I'll show you some numbers
to see how this scales and what the actual performance is.
For the NVSwitch 2.0,
if I'm not mistaken, the bandwidth of all-to-all communication goes up to 900 gigabytes per second.
Some announced systems move this into the orders of 18 terabytes, but they're just announced.
They're not available.
The details on the architecture are not clear, so I'll not go into too many details.
AMD is another major GPU vendor, and they have their own proprietary interconnect.
It's called Infinity Fabric, and under the Infinity Fabric umbrella they
refer to inter-CPU interconnects and inter-GPU interconnects. In
their EPYC CPUs, which are their server-grade CPUs, the bidirectional bandwidth amounts to around 200 gigabytes
per second.
The GPU interconnects can bridge up to four GPUs.
So for example, as shown in this figure,
so basically we have a single bridge for all four GPUs.
And this allows all to all communication between the GPUs,
similarly to the NVSwitch that we saw in the previous slide.
And basically, the bidirectional bandwidth
is 340 gigabytes per second per individual GPU.
Now, I'll basically connect this to the compute capability of the GPUs and see what real-life systems we have.
Some of them are available in our group.
Some of them are available as part of the data center, but we'll see how these interconnect capabilities
and GPU capabilities are intertwined
and what systems come out as a result.
The simplest setup
that you can have with multiple GPUs
is connecting them to a single CPU.
There you have a single-socket CPU system
with direct main memory access.
There are no NUMA effects from accessing multiple memory
sockets and multiple CPUs.
And for that, basically the only necessary interconnects
are the one to the main memory and then one PCIe
interconnect for connecting the CPU to the GPUs.
A real life system of this would be our server,
the small server that we use for development in our group.
There we have one AMD CPU with 64 cores.
We have PCIe 4.0 interconnects between the CPU and both GPUs.
Basically, we have 32 gigabytes per second of bandwidth.
And we have basically consumer-grade NVLink 3.0,
where we have basically just two lanes for a total of 56
gigabytes per second of bandwidth.
Since we're in the single-CPU domain, connecting one CPU to a GPU, I'll show one example system
that is more complex, and this is also available in our data center. Here, the usual assumption
that the interconnect between the CPU and the GPU
is the bottleneck no longer holds,
mainly because there is an NVLink connecting
the CPU and the GPU.
This is both an NVIDIA CPU and an NVIDIA GPU.
So it's called the Grace Hopper.
And basically, it allows the full NVLink bandwidth
of 900 gigabytes per second.
Sorry, this is bidirectional.
This is an NVLink 4, so it's 450 gigabytes per second,
as I mentioned on the slides earlier.
And keeping in mind that the main memory access
has a bandwidth of 256 gigabytes per second,
because, again, this goes over PCIe 5 with four lanes, while we
have 450 gigabytes per second of bandwidth between the CPU and the GPU,
this was basically the first chip that bridged this gap, where accessing main memory
from the GPU is faster than going explicitly through the CPU.
And at some point, we were fortunate enough
to be able to test this.
And yes, the numbers are really impressive, and the architecture stands out
compared to what the state of the art at that time offered.
However, this example is important
because using the NVLink
between the CPU and the GPU can really
change the way we design our systems.
The assumptions about the latency of individual memory
accesses, and the algorithms that we designed around these,
can shift by quite a bit.
And I'll show an example of a real-life system
that we have in our data center where
this comes into play, to a somewhat lesser extent.
In the meantime, I'll just move on to dual socket systems
with two CPUs and at least two GPUs per CPU.
So in this case, we can combine two CPUs with four GPUs and interconnect
them with a generic interconnect like PCIe.
The CPUs themselves are interconnected with, usually, the CPU
vendors' proprietary interconnects.
And each CPU socket has its own dedicated memory.
One example of such a system is the Delta system
that we have at HPI. It contains two Intel CPUs and four NVIDIA Volta GPUs.
The GPUs, interestingly enough, are
connected with asymmetric NVLinks.
So you can see that between GPU 1 and GPU 3,
we have a single NVLink lane, compared to GPU 0 and 1 and
the rest of the pairs.
So we have this ring architecture of all GPUs
being interconnected.
However, this is done in an asymmetrical fashion.
I mentioned that I'll present a system with NVLink serving as the CPU to GPU interconnect.
And this is the example of it.
So this is basically a system that allows cache-coherent access between the CPUs and the GPUs,
combining a single CPU with two GPUs in a ring
that has the same bandwidth and memory access latency throughout.
The example of this system that we have in our data lab is the IBM AC922.
So it uses the Power9 CPUs that allow NVLink to be connected to them.
And it combines them with four GPUs that are using, again, NVLink 2.0,
similarly to the Delta system that we saw before,
but now they have three lanes each.
Basically, we have the same bandwidth
between the CPU and the GPU and between each of the GPUs.
The bottleneck here in this system
would be the interconnect between both CPUs, which
has a slightly lower bandwidth compared
to the CPU-to-GPU interconnects. Then we move on to adding more GPUs
to a single CPU.
And one example would be the DGX A100 and H100 machines
that we have in our data lab.
So the idea here is that out of each CPU, there are two PCIe lanes coming out,
and each of these PCIe lanes is shared between two GPUs.
So the architecture is the same, or the layout of both systems
is the same.
However, the underlying GPUs and interconnects are different.
So basically, both have two CPUs.
The A100 has the AMD CPUs.
The H100, interestingly enough, has weaker CPUs in comparison,
even though it's a newer system.
However, the GPUs are significantly more powerful
and provide larger memory.
So the A100s, initially in the architecture that we have here,
they're available with 40 gigabytes and NVLink 3.0,
which basically amounts to 300 gigabytes per second
of peer-to-peer data transfers.
The H100 basically has double the memory
and also NVLink 4.0,
which basically increases the bandwidth to 450 gigabytes per second.
Similarly, the PCIe lanes are again shared.
However, the A100 system has PCIe 4.0, and the H100 has PCIe 5.0. The CPU-to-CPU interconnects are the
default proprietary AMD and Intel inter-CPU fabrics. So, I presented the
architectures, and now I'll talk about the implications
and the actual performance that we can get out
of these systems.
So we go back to the single CPU and two GPU system that
has basically PCIe 4.0 as the CPU-GPU interconnect
and NVLink 3.0 between both GPUs.
So we have single PCIe channels,
so they are not shared like in the architecture I showed before.
We'll see how close to these theoretical numbers
we can actually get.
For transferring data from the CPU to the GPU,
we managed to get to 26 out of the promised 32 gigabytes
per second per direction.
And if we execute bidirectional transfers,
we get to 42 out of the potential 64 gigabytes
per second.
This is for the serial transfers, where the CPU accesses GPU 0 or GPU 1 one at a time.
If the CPU accesses both GPUs at once,
we basically double the bandwidth of the unidirectional transfers from the CPU to the GPUs or the other way around.
With the parallel transfers, where we utilize both lanes in both directions,
we get less than double.
So in theory, this should be 128 gigabytes per second.
However, we only manage to get to 75 gigabytes per second.
In terms of the peer-to-peer interconnects, whether we transfer data from GPU 0 or GPU
1, in both cases we measure a bit less than the 56 gigabytes that are theoretically supported.
And when we issue a parallel or bidirectional transfer, we're getting to a bit more
than 100 gigabytes per second out of the 112 theoretically
possible.
So this is basically a very straightforward measurement
where we're just trying to do a sanity check whether these numbers
basically support the claims by the manufacturer.
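A sanity check of this kind can be done with a few lines of CUDA; the sketch below times repeated host-to-device copies from pinned memory using CUDA events. The transfer size and repetition count are arbitrary choices, not the exact methodology behind the numbers above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;          // 1 GiB per copy
    void *h, *d;
    cudaMallocHost(&h, bytes);                     // pinned host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (10.0 * bytes / 1e9) / (ms / 1e3);
    printf("H2D bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}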
What will be interesting in the following systems is to see how these design decisions
impact the data transfer rates and, consequently, the algorithms
that we would design. So in our Delta system, which was the one that has
asymmetrical GPU interconnects, we see quite a
large discrepancy in performance. When we're moving data from the CPU to the GPUs,
we measure quite low bandwidths,
but this is, again, in accordance with what's available as the interconnect.
So out of the 16 gigabytes per second, we measure around 12 and 13
for the host-to-device transfers.
The bidirectional transfers, interestingly,
are capped at 20,
so a bit less than double of the bandwidth
that was measured in one direction.
When we issue parallel transfers to different GPUs, this actually
scales quite nicely:
we see a 2x improvement across the different GPUs.
What will be interesting are
the peer-to-peer transfers, because there now we
have different capabilities, but also we
have different transfer logic
between the different GPUs.
So if we start from GPU 0 and we just issue a data transfer
to GPU 1, which is directly connected via a double NVLink,
we get close to the theoretical maximum:
we get 48 out of 50.
The same happens when we transfer to GPU 2.
However, when we try to move data from GPU0 to GPU3,
we get only 9 gigabytes per second of bandwidth. Any intuition, any idea why this is the case?
Yeah.
We have to do at least one hop across another GPU or CPU, so I guess there might be some
processing bottleneck inside the device.
Yes.
Of the device? Yes.
Yeah, exactly.
But I mean, we should see at least 25 or something
close to 25, right?
Because we have 50, and we have 25.
In theory, if we go through GPU 1,
we have 50 gigabytes per second of bandwidth until there,
and then we move to GPU 3 with 25 gigabytes per second
of bandwidth.
Same case if we go through GPU 2.
Well, what this system does is basically
it goes through the CPUs.
So to get from GPU 0 to GPU 3, since there's no direct connection between the two, the
default is to go to CPU 0, CPU 1 and then GPU 3.
Yeah?
Why is it not 16 gigabytes?
Well, 16 we cannot get because we don't even get 16 up here.
And I guess there would be additional OS checks.
And this brings the bandwidth down.
And we see a similar pattern when
we issue transfers in parallel, either bidirectionally
between directly interconnected GPUs
or when we try to connect the non-connected ones.
So what's interesting there: when
we issue a parallel transfer between GPU 0 and 3 and between 1
and 2, we roughly get 3x the bandwidth.
These measurements were repeated,
and we consulted with the vendors that provided this machine.
We know what the transfer flow is, but there is no exact explanation why this goes from 9 to 30 instead of going
at least to 36.
Presumably there is additional overhead involved.
So we'll move on to another peculiar system.
I'm saying peculiar because it's not very often
that we see this ring architecture
between two GPUs and a CPU.
So let's see what performance implications
there are from this.
So the CPU to GPU, the left-hand side,
these are not very interesting numbers.
We just have close to the maximum of 75 gigabytes
per second.
So this is as expected.
The bidirectional transfer has a slightly lower average bandwidth.
But yeah, that's transferring from CPU 0
to GPUs 0 and 1.
This all works as expected.
When we transfer from CPU 0 to GPUs 2 and 3,
we see a drop in performance.
And actually, it's quite a significant drop,
since the CPU interconnect is actually quite powerful
compared to the NVLink.
However, in any case, whether we ship data to GPU 2 or 3, we see these significantly lower numbers of around 40 gigabytes per second.
And again, we inspected this. Out of the 64 gigabytes per second of bandwidth between
both CPUs, only 32 are actually available for pure data transfers. The
other 32 gigabytes of bandwidth are reserved for OS calls and
kernel calls. So only half of the bandwidth is available for actual data
transfers, which was quite interesting to see.
And we had to dig quite deep.
We had to reach to people in IBM to find out about this.
When issuing parallel transfers, in the case of accessing GPUs 0 and 1,
we see close to double the bandwidth that we measured for the serial transfers. However, what was also difficult to explain
was the sudden drop in performance
when we moved data back from the device
to the host in parallel fashion.
Our assumption was that the CPU cannot handle receiving data
or placing data in main memory when accessed by both GPUs simultaneously.
But this was an educated guess at that point.
For the peer-to-peer transfers, we
measure quite close to the NVLink limits when we transfer data to directly interconnected GPUs.
When we go through the CPUs, and in this case we have to go through the CPUs since there's no ring topology between the GPUs, we see a significant drop.
And also, this was one of the indicators
that actually the bandwidth of the interconnect
between both CPUs is capped at a lower speed compared
to the 64 gigabytes that were advertised.
The interesting system is the one
where we have many GPUs that are connected
in a different fashion to the CPUs,
but are all interconnected
with each other through an all-to-all NVSwitch.
These are the numbers for the DGX A100 machine,
so basically the less powerful of the two.
We have yet to run a similar analysis for the H100.
So there, what we can see is that we have consistent bandwidth between all GPUs.
So we measure 279 out of 300 possible gigabytes per second possible bandwidth.
And this is consistent across any two GPUs that are in the system. Since they're all interconnected through the NV switch,
the bandwidth is equal to all of them.
The main memory transfers, on the other hand,
are actually quite interesting to see.
They're limited by the PCIe interconnect.
And there, what's interesting to see
is that if we issue a single transfer from CPU 0
to any of the GPUs from 0 to 3, we get 24 out of 32 gigabytes
per second.
However, if we issue transfers to GPU 0
and 1 at the same time, which is the case shown in the round brackets,
we get the same bandwidth.
And this is because this interconnect is shared.
And basically, this presents a significant bottleneck
when taking into account that the bandwidth
between the individual GPUs is more than an order of magnitude
higher than this.
What is interesting to see is that,
when trying to issue parallel transfers between all eight
GPUs, we get more than 2 terabytes per second of bandwidth, which is, I would say,
fairly close to the theoretical 2.4 terabytes per second.
And what was interesting to see here, in this red rectangle, is that despite having the NVSwitch,
when accessing co-located GPUs,
in the sense of GPUs belonging to the same CPU,
we measure higher bandwidth between them compared to accessing the GPUs that
are connected to the other CPU.
This was an interesting observation to see.
So we went through a lot of different numbers and system architectures, but where does this
bring us?
Basically, the numbers that I've shown give us the insight that being aware of the system topology is very important when we design our algorithms.
So if we just take a comparison between these two systems,
they both have two CPUs.
They both have four GPUs, right?
However, even though the GPU interconnects
are somewhat similar,
we need to be aware of how these components
are connected between each other.
Do we have high bandwidth interconnects
between the CPU and the GPU?
Do we have an obvious bottleneck in the system?
In our cases, we need to know if all GPUs are
interconnected between each other.
If they're not directly interconnected, can we still access them?
Do we need to go through a CPU, et cetera?
So to this end, we need to be aware of the systems
that we are designing our algorithms for.
And we can do this basically by deciding
where to pre-allocate the memory.
We can reduce data transfers, or we can optimize data transfers
by utilizing the stronger interconnects in our system.
Similarly, basically moving from main memory to the GPU memory,
we can organize the data transfers
between the GPUs themselves.
So we can manage this peer-to-peer communication
between individual GPUs.
And basically, this is all in the direction
of maximizing the bandwidth accordingly.
As a general rule of thumb, and we saw it
in the AC922 system but also in the other systems: when we
have multiple sockets, we try to co-locate the data on only one socket,
and we keep the computation on the same NUMA node as much as possible.
If the data is larger than the single socket memory,
then we need to think in another direction.
But as long as this is possible, we
try to keep the computation on the same NUMA nodes.
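On NVIDIA systems, part of this topology awareness can be queried at runtime. A minimal sketch, checking which GPU pairs support direct peer-to-peer access and enabling it where possible; this is only a starting point, not the full topology handling discussed above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int devices = 0;
    cudaGetDeviceCount(&devices);
    for (int a = 0; a < devices; ++a)
        for (int b = 0; b < devices; ++b) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);
            printf("GPU %d -> GPU %d peer access: %s\n", a, b, can ? "yes" : "no");
            if (can) {
                cudaSetDevice(a);                    // enable access from GPU a to GPU b
                cudaDeviceEnablePeerAccess(b, 0);    // flag must be 0
            }
        }
    return 0;
}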
And I'll quickly go through some examples
of a fundamental operator, sorting,
and how we can efficiently utilize multiple GPUs
for it.
So there are three different approaches that are currently developed.
One is the peer-to-peer based sorting where the actual sorting is performed in place and only in the GPU memory.
So once we allocate the data on the GPUs
and we provide or we transfer the data to the GPUs,
the sorting is performed in place.
And basically, this gives us the trade-off
of executing quick data transfers between the GPUs.
But also, since we're doing this in place,
this severely limits the amount of data that we can sort.
Because such algorithms are limited by the total GPU
memory.
And basically, the maximum amount of data
that we can sort is only half of the total GPU memory.
For such algorithms, there's one implementation
that we used in one of our papers.
So basically, we split the data across the GPUs.
On each GPU, we find the pivot element.
So we can basically sort this in place
and then exchange the data between GPUs
where we have then sorted sub-arrays.
And then there's another stage of sorting the sub-array.
So the goal here is not to go through every algorithm step,
just to see how some of the multi-GPU peculiarities apply
into the algorithms themselves.
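The GPU-to-GPU exchange step in such peer-to-peer sorts boils down to asynchronous peer copies. A hedged sketch, assuming peer access is already enabled (see the earlier snippet) and that the partitioning step has determined the bucket sizes; all names and parameters are placeholders.

#include <cuda_runtime.h>

void exchange_buckets(const int *bucket_on_gpu0, int *recv_on_gpu1, size_t bytes0to1,
                      const int *bucket_on_gpu1, int *recv_on_gpu0, size_t bytes1to0,
                      cudaStream_t stream0, cudaStream_t stream1) {
    // Both directions can be in flight at the same time over NVLink.
    cudaMemcpyPeerAsync(recv_on_gpu1, 1, bucket_on_gpu0, 0, bytes0to1, stream0);
    cudaMemcpyPeerAsync(recv_on_gpu0, 0, bucket_on_gpu1, 1, bytes1to0, stream1);
    cudaStreamSynchronize(stream0);
    cudaStreamSynchronize(stream1);
}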
The second idea is to utilize the CPU for this or to have the CPU help in this process.
And there the intuition is a bit different.
So initially we fill a single GPU with data.
So we do an initial data transfer there.
We perform an in-place sorting.
Then, in stage three,
we have this bidirectional data transfer
between the CPU and the GPU.
So the GPU sends the sorted data back while receiving a new batch of data for sorting.
Then in stage four, while the GPU is sorting the new batch,
the CPU is merging the data that was just received.
And in the final stage, once the final batch of data
is sorted on the GPUs, there's the final merge stage
on the CPU.
So what this allows us to do is sort data
that's larger than the combined GPU capacity.
We're moving from less than half of the combined GPU
capacity, with the peer-to-peer sorting, to an algorithm that's
bound by the size of the main memory, which,
as we've seen before, can be an order of magnitude larger.
And the third idea is, again, in the direction
of in-place sorting.
However, we perform radix partitioning instead
of the pivot element computation in the beginning.
So there, we partition the keys
based on the radix value of the most significant bits.
Then we exchange these buckets between the GPUs via peer-to-peer transfers, and
on each of the GPUs the individual buckets are sorted using a single-GPU sorting primitive.
This is, again, as I mentioned, an in-place algorithm,
and it has the same limitations as the peer-to-peer sort.
This was developed as a master thesis in our chair.
So let's see how the three of them compare,
and not only how they compare,
but also where the time goes.
So in the peer-to-peer case, we have basically the data transfers, which are the green and
the blue parts, and we can see that they're dominating the runtime.
And this is the ongoing story across all the algorithms,
with the exception of the heterogeneous one,
where the merging on the CPU is the clear bottleneck there.
So we can see that the actual sorting, which
is marked in pink in the case of the peer-to-peer sort, is intertwined with the data transfer copies
from the CPU and back in the case of the RMG sort.
With the heterogeneous sort, actually, the merging stage
on the CPU proved to be the most costly stage.
And in terms of results, we can see that the in-place sorting
algorithms, while being constrained
by the amount of data they can sort,
they're significantly faster compared
to the heterogeneous sort.
Right, so this was on the DGX A100.
What's important to notice here
is that there is almost no difference between the
performance on four GPUs and eight GPUs across all algorithms, and this is a
direct consequence of sharing the PCIe lane between two GPUs. On the AC922, where we have the NVLink between the CPU and the GPU, we have another
set of interesting results.
So for the peer-to-peer and RMG sort, actually, we see a significant performance decrease
once we move past two GPUs. And this is basically exactly the point
where we stop utilizing the ring architecture,
or these efficient data transfers between the CPU and the GPUs,
and need to move to the other socket.
And there, we are actually constrained
by the bandwidth of the CPU interconnects.
And we can see consistently that our in-place algorithms
are faster on two GPUs compared to four GPUs.
With this in mind,
I want to summarize the main takeaways:
scaling to multiple GPUs needs to be done with awareness of the system.
You will not benefit infinitely just by distributing the compute across multiple
GPUs. We saw that the data transfers are by far the costliest
operations, and knowing the topology of your system is essential to get
the best performance out of it. One example was the 2-GPU mode on the AC922 providing the fastest solution for that system, but also
knowing that utilizing 4 GPUs or 8 GPUs doesn't make a difference on the DGX A100.
And here I would add that it's important to allocate two GPUs per socket
rather than four GPUs
on one socket, because otherwise, again, we are
limited by the shared PCIe bandwidth.
Right.
And basically, again, as I said, we
can see that this order of magnitude difference in bandwidth shows in the actual time it takes to move data or actually sort data.
Right.
This is.
Yeah.
So these are some of the systems that we already
have in our data lab.
Not all are shown here, because we're constantly
adding new systems.
But if you're interested in working on any projects
regarding GPU development or multi-GPU development,
feel free to reach out to us.
This includes research projects, seminars,
but also master theses.
If you're at that point in time,
just reach out to any of us in our group.
This would be mainly Florian and myself.
So we'll be happy to work with you on interesting projects like this.
These are some of the papers and master
theses that came out as a result of this work.
And with that, I would like to thank you.
And yeah, I'm open for questions if there are any.
OK, thank you. Next week you'll talk about FPGAs.
So more coprocessing and acceleration.