Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Data Processing on GPUs II
Episode Date: July 5, 2023...
Transcript
Welcome everybody. Today is the second part of the lecture on data processing on GPUs.
And yeah, this is the last lecture on GPUs and then afterwards you'll have
data processing on FPGAs, your new programming task and then you'll get introduced to CXL.
What we are going to talk about in this lecture
is how to implement some database management
operators on GPUs.
We'll talk about selection, how to do parallel prefix sums,
and how these building blocks can be used to implement joins.
Then we'll talk about what happens when a single GPU
or processing on a single GPU is not enough,
so we need to distribute.
We need multi-GPU setups.
And then we'll take a look into a use case of multi-GPU sorting
and how this was implemented in our group.
Again, the lecture is based mainly on the two books,
mostly for the first part. The multi-GPU part
is based on our research in the group,
and also on some of the lectures that are shown here.
Right, so let's start with the database operators.
We'll start with a simple selection.
For a simple selection, we
have a selection predicate that
chooses the subset of tuples from the relation that
satisfy the predicate and removes the rest.
If we write simple pseudocode for this, it would look something like
this.
For the given predicate, we just need to remove the 2 from relation X.
So basically, the question that we need to answer
is, how can we parallelize this on a GPU?
So we know that we can access these memory locations
in parallel.
But then the question comes: where
should we write the results, and how can we
do this efficiently so there
are no overlapping writes?
And one solution is to pre-compute the locations
by using prefix sums.
In general,
a prefix sum is a prefix scan operation:
we apply a binary operator
cumulatively to an array of elements.
And we use this prefix sum, or the prefix scan
operation in general, to pre-compute
the locations of the write operation
after evaluating the selection predicate.
To do this, we need three steps.
First, we build a flag array, which
notes which of the elements in our initial relation
satisfy the predicate.
Then from this flag array, we compute the prefix sum.
And then, again, we scan the flag array
and write the elements to their respective positions
in a new array.
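As a rough sketch of these three steps in CUDA (illustrative names; the prefix sum itself is treated as a black box here and could be the scan discussed later or a library routine), the selection could look like this:

```cuda
// Hedged sketch of selection via a flag array and a prefix sum; all names are
// illustrative, and the exclusive prefix sum (step 2) is not shown here.

// Step 1: mark every tuple that satisfies the predicate (here: value != 2).
__global__ void build_flags(const int *relation, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (relation[i] != 2) ? 1 : 0;
}

// Step 2 (not shown): exclusive prefix sum over flags -> offsets,
//   e.g. flags   = 1 0 1 1
//        offsets = 0 1 1 2   (each qualifying tuple's write position)

// Step 3: scatter the qualifying tuples to their pre-computed positions.
__global__ void scatter_selected(const int *relation, const int *flags,
                                 const int *offsets, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = relation[i];
}
```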
And these three steps can be easily parallelized, mostly
by using reductions in GPUs.
Again, we need to apply a single binary operator to an
array, and we know that by using multiple threads on a
GPU, we can easily parallelize this.
These reductions, applying the
binary operator, serve as a building block for
the prefix sum operation.
And what's good is that prefix sums
are calculated in GPU shared memory in general.
So basically, we have little to no overhead based on this.
And now the question is, how can we parallelize them
and make them efficient?
So one option is to use binary reduction trees.
And if you're familiar with binary reduction trees,
to basically compute the sum, we need log n operations.
So basically, in our input array of eight elements,
we'll need three iterations over the array
to get to our sum at the end.
And if we implement this naively in parallel on the GPU,
we get a bad solution, as we saw yesterday.
Just doing naive parallelization
doesn't work that well in this case either:
we would actually need more operations
than a sequential implementation. So now the question is, how can we improve?
A simple solution is to split the workload
and assign two elements to each thread.
After each step, we
ensure that only half of the threads remain active,
and we achieve log n complexity.
The additional benefit is that we
can execute this in parallel.
Is this an efficient implementation?
Somewhat, right?
It's not extremely bad, right?
It's OK.
So this would be a GPU kernel that would execute this, right?
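The slide code itself is not reproduced here; as a hedged reconstruction, an interleaved reduction of this kind (one block, two elements per thread, stride doubling each iteration) typically looks like the following sketch, and the placement of __syncthreads() is exactly what the question below is about:

```cuda
// Hedged reconstruction of the interleaved reduction (one block handling
// 2 * blockDim.x elements); names are illustrative, not the exact slide code.
__global__ void reduce_interleaved(int *data) {
    unsigned int tid = threadIdx.x;
    unsigned int i = 2 * tid;                 // each thread owns two adjacent elements

    // Stride doubles after every iteration: 1, 2, 4, ...
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (tid % stride == 0) {
            data[i] += data[i + stride];
        }
        __syncthreads();  // all partial sums of this step are visible before the next
    }
    // data[0] now holds the total sum of the block's elements
}
```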
So from last lecture, we know that we
would like to access the memory in a stride manner.
So we basically co-locate the threads.
They access adjacent locations.
However, that happens only in the first step, right?
So that's the only part where threads are accessing actually
adjacent locations.
If you look at iterations two and three,
there the threads are quite widely apart.
Or they're accessing elements that are quite widely apart
in memory.
Now, for eight elements, this is not a problem, right?
Because everything is still in a single cache line.
All is good.
However, if we have a wider array or a larger array,
what happens is that after a few iterations,
two or three iterations, threads are accessing locations
that are in different cache lines
and can eventually also be outside of a single thread block.
The consequence of this is that we have a doubling stride after every iteration,
which increases the distance,
meaning that we are not utilizing the cache lines
in GPU memory efficiently.
And I guess what I'm hinting at
is that we can do better,
mainly through the way that we're accessing the data.
Yes.
I have a question.
For this reduction, we need those three different steps,
right, they're after, one after another.
Yes.
How is this guaranteed in this code?
Like that we have the first stage run,
and then we have like the second on top?
So what do you mean, how is it guaranteed?
Because the bottom row,
first we were doing the addition
on all of those, right?
Yeah.
But isn't?
Maybe I'm missing something here.
OK, so I think I understood the question now.
So what this for loop does, so in this case,
we have four threads, right? In the very first.
So we have four threads.
So this means that this for loop is executed four times, right?
In parallel.
One by each thread.
And what each thread does is start from stride one.
The block dimension in this case is four;
we have 4 threads.
So we start with stride 1, meaning we
are in iteration number 1.
So we are accessing the adjacent location, right?
And then we do the addition.
So we get the 10, 12, 5, and 9.
So we get those results.
And then in the next iteration, the stride doubles.
And to answer the question about the guarantee,
basically, only after the summation,
we go to a double stride and to access the new elements.
So basically, this is the guarantee.
The tricky part is that what you see here is a single-thread
implementation,
and this is being executed four times in parallel.
A thread won't go on to the second iteration early,
or even if it does, it doesn't matter.
Helpful, or not?
But is it possible that one thread is already
further than the other?
Or no, that's not possible, right?
Well, if they're executing the same computation,
this doesn't make sense.
But I mean, this goes back to the execution part, right,
that we talked about yesterday.
At some point, they are reconverging.
And if you're not sure, if you have a varying workload where
you do some arbitrary computation that's not an addition, right,
then you can explicitly ensure the reconvergence,
which we talked about yesterday.
So is that helpful?
OK.
Yeah?
If I were to program this,
and I'm not really sure about the internals,
so I'm not 100% sure that the threads will be synced:
if I were to also call syncthreads in the for loop
to ensure that every stage is executed one after the other,
would the compiler be able to optimize it away or not?
That I'm not sure.
I haven't looked into, basically,
if the sync threads command can be pushed upwards when it's called explicitly.
So say, for example, you don't call sync threads.
What the compiler would do, it would assume the conservative case.
The conservative case meaning that you need inter-thread communication,
so it will only sync after the for loop.
So basically, again, what we have here, the syncthreads,
the synchronization at this part of the code,
is something that the compiler would do implicitly.
It's put here just so everybody's
on the same page about where the synchronization happens.
You can do it in the for loop.
But since the for loop is executed on a thread level,
this doesn't really make sense.
Because you're executing a single iteration,
and then you're synchronizing with what, basically?
So it means that the thread will synchronize with itself,
which is kind of not necessary.
Yeah, so basically, we have this as a starting point. Just to give some insight into the steps,
the output is still readable.
So we need the reduction, the simple one, and the verbose one.
So basically, I'm going to make this smaller
so you can see the output.
Basically, what you have here is exactly that, right?
So if you start from here,
you need to look at the second iteration.
We have thread 0 writing the 9, thread 1 writing the 5,
and so on and so on.
So basically, we have the order of writes.
And basically, we also have the locations.
And we can see that these are not adjacent locations.
And the threads are accessing far away locations.
And again, just to clarify for eight elements,
this is no issue.
But for larger data, this would very quickly become a problem,
because we might end up with inactive threads inside a warp.
And even worse, we can end up with inactive warps just
being blocked by basically a single iteration.
Or, in this case, a single summation.
So how can we be more efficient about this? OK, this is a
short explanation of what I just talked about, and an example of when exactly this will happen.
So I took the example of having 256 elements, which is, again, not a large number.
And for this, if we take the same assumption, we need half as many threads as elements to start with.
So we'll activate four warps.
So basically, we'll need eight iterations
to calculate the reduction sum.
And basically, as of iteration six,
only two of the four warps would be active,
and only one thread will be active in each.
So basically, we are wasting 128 threads
for the sake of two threads doing the computation.
So one way to be more efficient about this
is very simply to decrease the stride.
So instead of just splitting naively the workload
and having the initial threads accessing adjacent locations, we decrease the stride
with every iteration.
So we basically make sure that we narrow the workload.
And basically, as the workload progresses,
we are making sure that we are converging
into a single warp or a single thread in the end.
And basically, the rest of the warps
can be freed up to execute different operations.
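A hedged sketch of this convergent variant (again one block with 2 * blockDim.x elements, illustrative names): the stride now starts large and shrinks, so the active threads stay contiguous and whole warps can retire cleanly:

```cuda
// Hedged sketch of a convergent reduction with a decreasing stride (assumes
// one block and an input of 2 * blockDim.x elements); names are illustrative.
__global__ void reduce_convergent(int *data) {
    unsigned int tid = threadIdx.x;

    // Stride starts at blockDim.x and halves each iteration, so the active
    // threads always stay contiguous at the front of the block.
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (tid < stride) {
            data[tid] += data[tid + stride];
        }
        __syncthreads();  // all writes of this step finish before the next one
    }
    // data[0] now holds the total sum
}
```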
Right.
Is there a question?
Why do we have the syncthreads now inside the loop?
That's a good question.
So I can very quickly check, but this might as well be a copy-paste error on my
end.
Yes?
I don't think it's a copy-paste error.
Why?
If we parallelly compute the sum of two numbers for the first stage, then we would
not want to overwrite the numbers because we are using the same memory to read the numbers
in the for loop in the other example, right?
Yeah, so.
The other example would work because of the stride, like it was.
Ah, yeah.
Because we're not rewriting the memory from other places.
Yeah, because of the stride... so, say again, what was the difference in this example?
Yeah, the memory that it itself used: the 9 overwriting the 1.
And in the other example, the first thread summing 1 and 8
is also potentially overwritten by another thread, the one summing to 15.
So that's the second thread, summing 7 and 8.
OK.
So basically what you're saying is
that we don't want a faster thread in the second iteration
to overwrite a memory location that is still in use.
It's still the first iteration?
So we don't want the thread summing 8 and 7
to overwrite the 8 with 15 before the first thread summing
8 and 1 can sum them together to 9?
Or 8 and 1?
8 and 7, yeah, I understand.
OK, yeah, I mean, I understand what you mean. But again, the question is, why would you
need the syncthreads in the loop then?
So how does this help?
I understand your point, and it's correct.
But how does it help to have the sync threads in the loop?
Good question.
But yeah, I mean, that's true.
I'll need to check this. And basically, the uploaded version of the slides on Moodle
will have the correct one.
It might be that this was an error on my end.
Right.
So yeah, basically what we want to do here
is to minimize the divergence of the threads
and their subsequent accesses.
And just to compare how this would affect performance,
even on our small, where's my mouse?
OK.
Let me move this here so I can be a bit quicker.
So we want to first execute the first one, and then we need the minimized-control-divergence, convergent version.
And basically, what you can see here: these are different versions where I removed the prints, so they won't slow down the kernel.
So basically, what happens for these eight elements that we saw:
we have four milliseconds for the simple, naive
implementation,
and then we have 100 microseconds
for the convergent implementation of the kernel.
And this is the cost that we pay for threads having to access locations that are far apart. OK, I need
to move this.
All right.
Yeah.
So we see that easily this can mean a large improvement.
And again, this is a very small example. So milliseconds to microseconds, still a large improvement.
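As a side note, timings like these are commonly collected with CUDA events around the kernel launch; a minimal sketch (reduce_convergent is just the illustrative kernel from the sketch above, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduce_convergent(int *data);  // illustrative kernel sketched earlier

// Hypothetical timing harness around a single kernel launch.
void time_kernel(int *d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduce_convergent<<<1, n / 2>>>(d_data);   // one block, two elements per thread
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```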
However, we don't need that at this point.
In this case, there is also a small problem.
Even though we decrease the stride and converge the execution to eventually a single
warp and a single thread, we still do the expensive global memory accesses. And for the previous example
that I mentioned, with the somewhat larger data of 256 elements,
in this loop we'll need to execute around 36 global memory
accesses, reads and writes.
Are there any suggestions how we can reduce this?
Yeah?
Using shared memory?
Exactly.
Yes.
So basically, what we need is to basically move
this write operation into the shared memory.
And basically, this is what we want.
We load the input from global memory initially,
and then write and read every intermediate result
into shared memory.
And basically, the reason why we want to do this
is that the access is up to 150x faster than global memory.
And shared memory access, basically, if we have no bank conflicts for the threads, this
can be as fast as a register access.
Since if you remember from yesterday, shared memory and L1 cache are co-located now
basically in the same memory.
And this is basically the fastest access
that we can achieve.
And for the larger input of 256 elements,
only the initial input
and the final outputs are read from and written
to global memory.
And basically, since we need eight iterations to do this,
we'll have nine total global memory accesses,
reads and writes combined. So we decrease the global memory reads and writes
by a factor of four.
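A hedged sketch of the shared-memory variant (one block, 2 * BLOCK_DIM input elements; the names and the fixed size are illustrative assumptions):

```cuda
// Hedged sketch: only the initial load and the final result touch global memory.
#define BLOCK_DIM 128

__global__ void reduce_shared(const int *input, int *output) {
    __shared__ int partial[2 * BLOCK_DIM];
    unsigned int tid = threadIdx.x;

    // Single read of the input from global memory into shared memory.
    partial[tid] = input[tid];
    partial[tid + BLOCK_DIM] = input[tid + BLOCK_DIM];
    __syncthreads();

    // All intermediate sums stay in shared memory.
    for (unsigned int stride = BLOCK_DIM; stride >= 1; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Single write of the final result back to global memory.
    if (tid == 0) {
        *output = partial[0];
    }
}
```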
How do you think that this will affect performance
in our dummy example with eight elements?
OK, let's see.
Say again?
Yeah, let's see.
We need the demo.
OK.
So, we're going to do a demo. Yeah, let's see.
We need demo shared.
Almost nothing. And basically, the reason for this
is that the compiler already optimizes this.
For small data, this is basically
done at compilation time when the input is small.
And basically, it reserves shared memory by default.
What happened when I tried this on larger examples,
for example a million elements like the addition
that we did yesterday?
Let me go back to the slides.
If we designate this flag to use shared memory for the
input, we get a compiler warning that we are
trying to access too large a chunk of shared memory; it's
simply not possible to put one gigabyte of data into
shared memory.
So it just gives a warning
and disregards this.
It still goes to global memory by default.
However, vice versa applies.
If we don't designate the shared flag
and the input is sufficiently small,
the compiler will notice this and move the data
to shared memory anyway.
And hence, we don't see any improvement
since our example is too small.
Right, so with the current implementations,
we have the final sum, right?
We have that 36.
We achieved this in three different ways so far.
But we don't have the final prefix sum.
So the solution to this is to add one more
phase to the algorithm.
Basically, we have the sweep down phase
where we need to add the partial sums to the successor element.
So basically, in the end, we have the final prefix sums.
And basically, this is the goal.
So, starting from the final sum, from the 36 up there,
we want to recursively come back down
and combine the partial sums into the final prefix
sums.
And basically, not to go step by step, but basically you can see that the accesses in the sweepdown
phase are mirroring this pattern here.
And this was our starting example.
So basically, not to go through all three alterations
that we made to the example, basically the same performance considerations
for the sweep down phase will apply in this case as well.
So again, the strided access here will need to be adjusted.
And if the memory allows, we need
to reduce the global memory accesses as well.
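Putting the up-sweep and the sweep-down phase together gives the well-known work-efficient scan. A hedged single-block sketch is shown below; this is the standard Blelloch formulation, which produces an exclusive prefix sum (the slides instead add partial sums to the successor elements to arrive at the final prefix sums), and the size N is purely illustrative:

```cuda
// Hedged single-block sketch of a work-efficient scan in shared memory
// (Blelloch-style up-sweep plus down-sweep). Launch with N / 2 threads.
#define N 8

__global__ void prescan(const int *input, int *output) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    int offset = 1;

    // Load two elements per thread into shared memory.
    temp[2 * tid]     = input[2 * tid];
    temp[2 * tid + 1] = input[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums in place.
    for (int d = N >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Clear the last element, then sweep down, distributing partial sums.
    if (tid == 0) temp[N - 1] = 0;
    for (int d = 1; d < N; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Write the exclusive prefix sums back to global memory.
    output[2 * tid]     = temp[2 * tid];
    output[2 * tid + 1] = temp[2 * tid + 1];
}
```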
Right, so as I already said, we need to keep the control and memory
divergence of the thread and memory accesses in check,
and reduce the global memory accesses.
And so far, we were dealing with selections.
And there, the tuples either match or not. The maximum result size that we can get
is as large as the input relation.
With joins, what we have is something else.
This is not the case there: the tuples can
match multiple times, and the maximum output can be as large as the Cartesian product.
So we need to make more adjustments and address several challenges
that arise. One I already mentioned: we do not know the exact result
size in advance.
The result may fit or may not fit in GPU memory,
regardless of whether the relations do.
And now the question is how we can use the GPU to address
these challenges, but also to exploit the parallelism
that they offer.
And two things that we need to consider:
we need lock-free processing,
and we want to pre-compute the write locations for each thread, similar to what we
did with the prefix sums in the selection part.
So how are joins executed?
I guess you're familiar with the execution scheme already.
Basically, we have this three-step execution for most joins,
except for primary-key/foreign-key joins.
There we do not need a position list, because all
foreign keys have exactly one primary key partner.
But in general we use these three steps.
So we need each thread
to count the number of join partners for its input.
Each thread needs to compute the prefix sum for its result
to get its write location.
And then, on the host side,
we need to allocate the memory so that all threads can
write their results in a parallel fashion.
And we'll take a look at an example of indexed nested loop
join.
And there, basically, we have, again, this three-step
execution scheme.
So we need to build the index on the R relation.
We need to sort that.
Then we need to do a lookup for each element of S
into the sorted array by performing binary search.
And then we transform the result position list back to the original indexes of R.
Basically, we perform the second step two times,
once for the position list and a second time
for the actual join operation.
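A hedged sketch of this scheme for the indexed nested-loop join could look as follows; all names are illustrative, and the prefix sum between the two kernels is assumed to come from the scan discussed earlier or from a library:

```cuda
// Hedged sketch: kernel 1 counts join partners per probe element, an exclusive
// prefix sum over the counts gives each thread its write offset, and kernel 2
// writes the matches without conflicts.

// First position in sorted_r[0..n) whose value is >= key (binary search).
__device__ int lower_bound(const int *sorted_r, int n, int key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (sorted_r[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Step 1: each thread probes one element of S and counts its join partners.
__global__ void count_matches(const int *sorted_r, int n_r,
                              const int *s, int n_s, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_s) return;
    int lo = lower_bound(sorted_r, n_r, s[i]);
    int cnt = 0;
    while (lo + cnt < n_r && sorted_r[lo + cnt] == s[i]) cnt++;  // duplicates in R
    counts[i] = cnt;
}

// Step 2 (not shown): exclusive prefix sum over counts -> offsets.

// Step 3: each thread writes its matches starting at its pre-computed offset.
__global__ void write_matches(const int *sorted_r, int n_r,
                              const int *s, int n_s, const int *offsets,
                              int *out_r_pos, int *out_s_pos) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_s) return;
    int out = offsets[i];
    for (int k = lower_bound(sorted_r, n_r, s[i]);
         k < n_r && sorted_r[k] == s[i]; k++) {
        out_r_pos[out] = k;  // position in sorted R; a later gather maps back to the original index
        out_s_pos[out] = i;  // position in S
        out++;
    }
}
```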
Right, so if we want to join, say, these two relations, again, quite small:
we have three matches in total.
We have the five matching in R and S, and we have two instances of the three matching. So we have the number five at position three,
and we have that written in our position list for R.
And then we have the three at position three, at index 3.
And the end result of this is that we
have the match partner for the 5 at position 1,
and then for the 3s at positions 1 and 4.
So we already have a duplicate here.
So we need to make sure that in step three, where we
map back to the original input, the result ends up
in the same order as the input list.
So what we do here is we build the index of R;
there we have the positions.
And then we need to sort this,
so we sort by the value of the elements.
And then we need to perform the lookups from S in the sorted array,
which we'll do in the next step.
So how we can parallelize this, or how
we can parallelize the first phase,
is basically each thread will access one element, exactly one
element, and perform a lookup on the index.
So we are going through S. And so basically now what
we want to do is basically we can
use three threads in this example,
and each of them reads one element.
And again, we do this in a strided fashion,
as we know from earlier examples.
So what happens is that now we have the counters:
for the elements that each thread processes,
we know how many matches they have
when looked up in the sorted R relation.
So we know now that thread one read the five, so we have a single match there, and
then the thread that reads the three has
two matches there.
The more threads we use, the more we increase the memory consumption.
So, as we spoke about yesterday, we want to make sure that we don't over-parallelize this
and don't assign too many threads for too simple
a compute operation.
Once we have the counter array,
we need to compute the prefix sum
to get the write positions for each of the threads;
we saw how to compute the prefix sum earlier.
Then we have the lookups of the threads and their assigned elements,
and we use the prefix sum to write the final result.
So basically what happens here is
that we have the position lists for both relations.
So we're still working on the sorted R relation.
We know that number 5 is on position 3.
And also, we need the position of the 3,
because that's also one of the matches.
And there, we write the position of the 3, which is index 1,
in this case.
And in the position list for s, we do the same.
The positions that we're looking for are 0, 1, and 4.
So since we sorted r in the beginning,
now we need just to reverse the sort.
And to reverse the sort, basically, we
need to perform a gather operation on the position list
that we generated initially from step one.
We give the input array and the position list
as the input,
so we take the sorted array
and the position list and reverse the sorting.
And what happens in the final step
is that we end up with the positions
in R of the join partners in S.
So we basically end up with the correct position list.
And then as output, we have the array of R,
which has as many elements as its own position list,
since we write all the duplicates in the position list
explicitly.
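The gather itself parallelizes trivially; a hedged one-element-per-thread sketch (illustrative names):

```cuda
// Hedged sketch of the gather step that reverses the sort: each thread copies
// one element from the position given in the position list.
__global__ void gather(const int *input, const int *positions, int *output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[positions[i]];  // conflict-free writes, one per thread
    }
}
```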
Right, are there questions until here?
Right. So we continue with multi-GPU data processing.
So far, we've learned about the principles of extracting maximum
performance from a single GPU and how to transfer the sequential
CPU programming model to a parallel GPU programming model.
However, we've seen this graphic in the previous lecture,
and we know that the faster the memory,
the lower the capacity of this memory is.
And for the fastest memory, we are quite limited.
So we are in the order of kilobytes.
GPUs currently have plenty of global memory.
However, the access there, as we saw, is up to two orders
of magnitude slower,
and this trend continues the lower we go in the memory hierarchy.
This makes us aware of where we should allocate memory for
which part of our workload, and we manage the lifetime
of the allocated memory accordingly, again
to optimize the memory accesses.
In general, the GPU global memory
is much more limited compared to DRAM.
Currently, GPUs offer up to 188 gigabytes of global memory.
However, this is still an order of magnitude below the maximum DRAM capacity,
where we can get into the order of terabytes.
So that's one aspect.
And then the second aspect is that whatever we process in the GPU, we need to transfer it from the main memory.
So by default, GPU memory is not meant as storage of any kind.
And basically, whenever we need to access data
outside of the GPU memory, it means
that we go to the main memory.
And here we are dependent.
So we saw this picture, right?
So we need to go from GPU global memory to host memory.
So basically, we're dependent on the interconnect there.
And currently, there's different types of interconnects
that are put into GPU systems
by different vendors.
So as an example, PCIe is the vendor-independent,
de facto standard interconnect between host
memory and GPUs.
NVIDIA has their own NVLink as their GPU interconnect.
However, we'll see that there are systems
that use this as the interconnect
between the CPU and the GPU.
And AMD has the Infinity Fabric
as their global interconnect.
So they use it for interconnecting GPUs,
but also as an interconnect
in their accelerated processing units.
So let's take a look at basically what
are the properties and the performance
that each of these interconnects offer.
So PCI Express, as I mentioned, is the general CPU
to GPU interconnect.
It connects the CPU to the storage, to the memory,
and to any accelerator that we have.
So it doesn't need to be GPU exclusively.
This can be also any FPGA or any custom coprocessor
that's connected to the CPU.
And depending on the version and the end device
that we are connecting to the CPU,
we can use one, four, eight, or 16 lanes of PCIe interconnect.
Not necessarily all lanes are used.
So as you can see, different devices
require different connectivity.
By default, the GPUs are connected through all 16 lanes
with PCIe.
Now what's interesting in this slide is these two numbers
here.
So we have PCIe 3.0.
This was, until recently, the standard.
And this allowed for 16 gigabytes per second of bandwidth
across all 16 lanes in total. So given that GPUs are accelerators with more than 1
terabyte per second of internal bandwidth,
this obviously becomes a severe bottleneck
when transferring data from main memory to GPU memory.
And this is getting better with time and with the new PCIe standards.
PCIe 4.0 basically doubles this to 32 gigabytes per second.
And there have already been announced devices with PCIe
5.0 as the CPU-to-GPU interconnect, which doubles the bandwidth of PCIe 4.0 again,
to up to 64 gigabytes per second.
However, this is still far below the terabytes per second of bandwidth that the GPU has internally.
So some of the vendors were very aware of this.
And they designed their own interconnects,
proprietary interconnects for their accelerators.
And one example that I already mentioned
is NVLink, which is built by NVIDIA and is used mainly as an interconnect between two GPUs.
It allows high-bandwidth transfers of 250 gigabytes per second of peer-to-peer bandwidth,
and the current version, NVLink 3.0,
allows up to 300 gigabytes per second
of interconnect bandwidth.
And this is just peer-to-peer, just one GPU to one GPU.
What happens if we have a device with multiple GPUs
on which we want to perform
an all-to-all memory transfer? Then we'll
have to perform individual transfers
from each GPU to every other GPU,
and again, this is suboptimal in terms of data transfer
performance.
So there has been a solution to that as well.
That's the NVSwitch.
Basically, this is a switching element
which basically allows us to interconnect all to all up
to 16 GPUs.
And this allows concurrent memory transfers between all 16 GPUs at the maximum bandwidth
of 300 gigabytes per second that NVLink already allowed for peer-to-peer.
Now this carries over to all-to-all data transfers among up to 16 GPUs, which
effectively increases the global memory that each GPU can access.
Because if we have a system with, say, eight or more GPUs
that have more than 40 gigabytes each,
we know that the interconnect between them
is way faster than either the interconnect from host memory
to the CPU or from the CPU to the GPU.
Basically, once we transfer that data,
we can basically assume that we have much larger GPU memory
than we actually do.
Because the bottleneck won't be the peer-to-peer transfer,
but the transfer from main memory.
So something similar that AMD offers, basically,
is the Infinity Fabric.
As I mentioned, this is their general interconnect. They use it for interconnecting CPUs, GPUs, and vice versa.
They achieve up to 204 gigabytes per second
for their CPU interconnect, or when used as a CPU
interconnect.
And when used as a GPU interconnect bridge,
the bidirectional bandwidth is up to 340 gigabytes per second.
So for interconnecting GPUs, state of the art,
we are in the order of around 300 gigabytes per second,
independent of the vendor.
So now let's see how this information about the
interconnects and the GPU
bandwidth can be used to build multi-GPU systems.
One option, the most basic configuration of such a system,
is when we have a single CPU and two GPUs.
This means that we have a single socket and a single main
memory access.
We have no NUMA effects,
and all of the data transfer goes through the CPU
interconnect and its connection to main memory.
And basically, what we need to be aware of is now the interconnect between the GPU and the CPU.
If there is an interconnect between both GPUs, then we need to be aware of that bandwidth as well.
We have an implementation of this system in our group.
So we have a small server.
So basically all of the demos that I've shown
were executed on this server
that has quite a powerful CPU
and two high-range consumer GPUs.
So these are not data center GPUs,
but quite strong consumer GPUs.
So there in this system, basically what we have
is PCIe 4.0 as the CPU-to-GPU interconnect, with 32 gigabytes per second.
Not yet 300.
And between the GPUs, we have up to 56 gigabytes per second.
This is still NVLink 3.0;
however, the bridge doesn't provide all lanes,
so the maximum NVLink bandwidth of 300 gigabytes per second
is not reached in this implementation.
So then we can build a more complicated system than this.
Basically, when we need two CPUs for increased computational
bandwidth on the CPU side, then basically this
is what we end up with.
So we have a dual socket system that's
basically where both CPUs are interconnected
via a proprietary interconnect.
And then each of these CPUs have their own dedicated memory
access. And they're either connected to a single GPU
or to multiple GPUs.
So one example that we have here is dual CPU and four GPUs
in one system.
And there, again, we need to be aware of three types
of interconnect.
Basically, how is the CPU connected to the GPUs
and to how many of the GPUs?
And how are the GPUs connected between each other?
And what we can see in this schema,
in this implementation, the Delta system that's also available at HPI, is that we have a PCIe
3.0 CPU-to-GPU interconnect, and each CPU is connected to two GPUs. This is all fairly simple. However, on the GPU side we have,
I would say, unorthodox interconnects between them. As you can see, GPU 0 and 1 are
interconnected with two NVLink lanes, as are GPU 0 and 2, and
GPU 2 and 3 as well.
However, from GPU 1 to GPU 3, we have half the bandwidth,
or we have a single NVLink 2.0 lane.
So basically, what's important here
is if we develop an algorithm or we run a workload that's
running on the system, we need to be aware of the data
placement on each of the GPUs.
Because if we require the inter-GPU communication,
then we can see already clearly that there
can be better or worse ways to exchange the data.
If we exchange the data only from GPU 1 to GPU 3,
we'll get just half the bandwidth compared
to the other interconnects.
And basically, to mitigate this, one straightforward solution
is we make all GPU interconnects equal.
And not only that, we can also make the CPU and GPU
interconnect equal. So, jumping from the 16 gigabytes per second PCIe 3.0 interconnect from CPU
to GPU, we can move to a CPU-to-GPU interconnect
that's equivalent to the interconnect
between the GPUs themselves.
And now we basically don't lose any bandwidth
when transferring data.
And this is quite a unicorn system, I would say.
So it was developed by IBM.
We have two of them in our data center.
And what's interesting here is that the CPU-GPU interconnect
is using NVLink instead of PCIe.
So this means that we have the same NVLink 2.0 interconnect
with 75 gigabytes per second of bandwidth between the CPU and the GPU
and between the GPUs themselves.
However, what's also interesting in this system
is that we have a proprietary interconnect
between the two sockets.
And there, we achieve 64 gigabytes
per second per direction.
So if we want to transfer something from CPU
or we want to transfer data from CPU 0 to GPUs 2 or 3,
we need to go through this proprietary interconnect.
What's also interesting in this system
is that this bandwidth is split.
Half of the bandwidth is dedicated to data transfers.
The other half is dedicated to basically sharing compute instructions and data
requests between the CPUs.
So what we effectively end up with for the CPU-to-CPU
interconnect is 32 gigabytes per second
of bandwidth in a single direction. So again, when we develop workloads
that run on the system, we need to be aware from which socket
do we read the data, how is this data transferred,
and again, not necessarily distribute to all four GPUs.
But if we see that the computation required
is not as intensive, we might be better off just transferring
the data just to the two GPUs that are directly
interconnected to our memory socket.
And then if we need more GPU memory and more GPU compute,
then we can go up to eight GPUs in current systems.
And there, since we move past the double interconnect
between the GPUs, now we need to make sure that these GPUs can communicate
between each other in an efficient manner.
So one implementation of such a system
is the DGXA100 that's developed by NVIDIA.
And there, we basically see the use of the all-to-all data
transfer, all-to-all interconnect that
allows quick data transfers between all GPUs concurrently.
And there, in this system, what's interesting to see
is that even though we have 300 gigabytes per second bandwidth
between each GPU,
we are pretty much bound by the CPU-to-GPU bandwidth that's
provided by the PCIe 4.0 interconnect.
And what's even more interesting is
that the interconnect is split between the GPUs,
meaning that, for example, if we want to transfer data
from main memory to GPU 0 and the GPU 1,
both of them will use the same PCIe lane.
So we will never reach 32 gigabytes per second for each of the GPUs;
instead, this will be split into up to 16 gigabytes per second per GPU.
So there we need to be aware that if we only need to use two GPUs, or we need to transfer data
from main memory to only two of the GPUs on this side of the system, or from this socket,
we are much better off using, for example, GPU 0
and GPU 2, or any combination that doesn't
share the same PCIe lane.
So what I'm aiming for with explaining all the systems
and these considerations is that we
need to be aware of the topology of our system.
Because the more components or the more processing chips
that we have, the more interconnects
that we need to pass through.
And they differ in their bandwidth,
and they differ in their connectivity. So to get a closer look at the behavior of these systems when we
transfer data, we'll look at some benchmarks that we ran on them and at how
they behave when we actually transfer data.
And are these numbers, basically the advertised numbers
on the right-hand side, are they achievable
if we try to transfer data?
So let's see.
In the first system that I showed,
this is a small GPU server that has only one CPU and two GPUs.
We want to see whether we can achieve
the advertised numbers for the CPU-to-GPU
interconnect and also for the interconnect
between the GPUs.
So if you take a look at the graph A, what we're doing here
is we are just transferring serially data from the CPU
to GPU 0 and GPU 1.
And basically, what we achieve is
once we transfer data from the host, from the CPU to GPU 0,
we achieve 26 out of the 32 gigabytes per second.
The same bandwidth we achieve when
going in the reverse direction, when
we're transferring the data back from the GPU to the CPU.
Regarding the peer-to-peer transfers between the GPUs,
we achieve 42 and 43 gigabytes per second out of the 56
gigabytes that are advertised.
So these are not bad numbers.
They're just not quite as advertised
because other than the pure data that we are sending,
there are also data requests and some memory checks that
need to be executed that hinder the bandwidth,
hinder the physical capabilities of the interconnects.
What's also important to see is that we have almost 2x discrepancy between the bandwidth
of the CPU to GPU interconnect and the peer-to-peer interconnect.
Right, so then if we take a look at the just pure peer-to-peer transfers,
again, we might want to basically transfer data in parallel,
meaning that we want to either transfer data in parallel
from the CPU to the GPU and the other direction,
or we might want to concurrently transfer data
between both GPUs.
And what we achieve here for the CPU-to-GPU parallel transfers
is basically good scaling:
we scale linearly
and achieve twice the bandwidth.
In the case of the parallel transfer from device up,
so we achieve double the bandwidth when we move the data concurrently from both GPUs
to the CPU or from the CPU to both GPUs.
And there, we basically see twice the bandwidth
of the CPU-GPU interconnect.
What happens if we move the data concurrently
or in parallel between the CPU and the GPU?
Then we basically achieve less than twice
the bandwidth of 42 and 43 gigabytes
that we achieved when performing only the serial transfers.
Similar holds true when doing transfers between the GPUs. So there we achieve close to twice the bandwidth
between the GPUs when transferring in parallel.
So this was a very simple single CPU to GPU system.
If we take a look at the Delta system,
which has PCIe 3.0 between the CPU and the GPUs
and also has asymmetrical GPU interconnects,
There we see varying performance, basically.
I would start with the GPU transfers,
because there we see a very sharp decline
in the performance.
When transferring data between
GPU 0 and either GPU 1 or GPU 2, we achieve very close to the advertised bandwidth of 50 gigabytes per second;
we measured 48 gigabytes per second, which is very nice. However, when we transferred data from GPU 0 to GPU 3,
we achieved only 9 gigabytes per second of bandwidth
between these two GPUs.
Somebody wants to guess what happened, why?
Yeah.
If it has to go over another one, it's probably going to take some bandwidth off?
What are the hops exactly?
Like, to get from 0 to 3, it needs to go either through 2 or through 1.
There's no connection between 0 and 3.
Yeah, exactly, there's no connection between 0 and 3. That's true.
But, for example, if we go either
through GPU 2 or GPU 1 to GPU 3, there we have 50 gigabytes
per second, and then we have 25 gigabytes per second.
Yet we achieve nine.
It's a sharp decline. And the reason is that, by default,
when we want to transfer data between GPUs that are not
directly interconnected, we always go through the CPU.
So we are not utilizing the 50 and 25 gigabytes per second; we go through the PCIe 3.0 interconnect.
Then we move through the CPUs.
And then we go to GPU 3.
And unless explicitly stated otherwise,
this is the default data transfer behavior.
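To make this explicit: in CUDA, direct peer-to-peer copies over the GPU interconnect generally have to be enabled by the application; a hedged sketch (illustrative names, error checking omitted) might look like this:

```cuda
#include <cuda_runtime.h>

// Hedged sketch: explicitly enabling peer access so that directly connected
// GPUs can copy over their interconnect instead of bouncing through the CPU.
void copy_peer_to_peer(int dst_dev, int src_dev,
                       void *dst, const void *src, size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (can_access) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);   // flags must be 0
    }
    // With peer access enabled, this copy can use the direct GPU interconnect;
    // otherwise the runtime stages it through host memory.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```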
And with this I just want
to emphasize the point that we need
to be aware of the topology.
Because this is a decline of almost 6x.
And that's quite significant in such a system.
Something very similar happens when we do parallel transfers.
If we do 0 to 1 or 2 to 3, all is good,
because they're interconnected
through the double NVLink lanes, and we
achieve close to the theoretical maximum of 100 gigabytes per second.
We measure 97.
We measure 97.
What happens here when we try to do parallel transfers
between 1 and 2 and 0 and 3?
For 0 and 3, we achieve the very bad 9 gigabytes per second.
And then this averages out to 30 gigabytes per second
when we consider the transfer to 1 and 2, GPUs 1 and 2.
Right.
So regarding the transfer from CPUs to GPUs,
this is, I would say, quite intuitive.
We just measure a bit below the theoretical allowed bandwidth
from the PCIe 3.0 interconnect.
So let's see how things changed when
we have symmetrical interconnects,
and not only symmetrical interconnects between the GPUs,
but also between the CPU and the GPU.
There we saw some interesting consequences.
So again, I'll start with the peer-to-peer transfers
between GPUs.
So then there, if we transfer data from GPU 0 to GPU 1,
we see, again, very close to the theoretical maximum, which
is 75 gigabytes per second.
What happens when we transfer data from GPU 0 to GPU 2
or to GPU 3 is that we reach less than half
of this bandwidth.
Again, there we don't have any direct physical
interconnect between those GPUs, so we need to go through the CPU: we
go through our interconnect to reach the CPU, then over the CPU-to-CPU link with the halved bandwidth that I mentioned earlier, and then we reach GPU
2 or 3.
We see the same effects when transferring data in parallel.
So the interconnected GPUs are all fine,
very close to the maximum of 150 gigabytes per second.
And we see up to a 3x decrease when
we try to do parallel transfers between the GPUs that are not interconnected
with each other. Are there any questions up to this point? Yeah.
I have a question. How do you connect this GPU to the main board, not with PCIe but with NVLink?
Is it like, I'm just curious if you ever see that.
Yeah.
So this is, as I said, a unicorn system.
This is like the only system that allows this.
And basically, the reason is that the POWER CPU allows this.
IBM developed a CPU that
doesn't require a PCIe interconnect
to other accelerators, but allows an NVLink interconnect instead.
It has like a connection to the plug-in.
Exactly.
Yeah.
And just an important note, that's only the CPU-GPU
interconnect.
The interconnect to main memory is still standard.
Right.
Yeah.
However, what's interesting in this system
is the CPU-to-CPU interconnect.
If you look at the CPU-to-GPU transfers,
that's where we saw all the bottlenecks,
and the part of the system most responsible for them
was the CPU-to-CPU interconnect.
So there, now we go to all-to-all communication territory,
and there we also saw some interesting behavior of the system.
So I already spoiled a bit when I mentioned the split PCIe interconnect between two GPUs,
the performance of which we'll see on the plot above.
So if we transfer data from the CPU to GPUs 0, 1, 2, and 3,
we achieve a maximum of 24 gigabytes per second.
Which, yeah, in theory, we should achieve something close to 64.
But since we have a split PCIe lane, that's not the case, right?
So we have a maximum theoretical bandwidth of 16 gigabytes per second to each of the GPUs,
and the measured one is quite a bit lower than that.
But it's in line with what we measured
with the single PCIe lanes, where we achieved 12 out
of the 16 gigabytes per second that are theoretically possible.
So that's on the CPU side. Then basically if we look at the
all-to-all transfers which are basically what's interesting about the system, we
achieve consistent bandwidth of 279 gigabytes per second for any combination
of GPUs or data transfer between any of the GPUs.
And there, what's interesting is basically what happens
when we try to do parallel transfers between the GPUs.
So there we see some discrepancy between co-located and not co-located GPUs.
And there we see that, despite having
NVSwitch, which in theory allows
consistent performance for interconnecting all GPUs,
the peer-to-peer transfers
are faster between co-located GPUs,
or between the GPUs that share the same PCIe lane.
So for example, we show one example here.
The transfer between GPU 0 and GPU 1
is faster than the transfer between GPU 0 and GPU 2.
And basically, this corresponds to the collocation part.
The reason for this is that NVSwitch
puts the interconnect between co-located GPUs
on the same physical bank. I won't go into details,
but the co-located GPUs
share the same bank for handling the data requests
in the NVSwitch,
which is what lets this performance gap appear.
Yeah, but I mean, the drop is up to 10% per GPU.
These are parallel transfers.
So right.
What's also interesting to see is that the performance discrepancy in this system is up
to an order of magnitude between the data transfers from the CPU to the GPUs and the transfers between
the GPUs themselves. So when developing algorithms or workloads for the system, one can ask: should we just do a single allocation and a single transfer
in the beginning, and then basically not move the data out
of the GPUs at all?
And in such systems that basically have, in this
instance, 40 gigabytes per GPU, we have 320 gigabytes of GPU
memory.
This is a significant amount of memory that we can work with
once we transfer the data there.
And with current implementations where there
are GPUs that have more than 180 gigabytes per GPU memory,
we can reach up to a terabyte of GPU memory.
That's very, very high bandwidth memory.
So the notion of in-memory processing
is changing with these developments.
So I mentioned some drawbacks or features
of each of the systems.
And basically, the main conclusion
is that we need to be aware of the topology of the systems
to develop algorithms that will efficiently use this.
A few of the tips that I implicitly mentioned:
we need to pre-allocate memory
and reduce data transfers as much as possible.
This is in line with what we've learned for single-GPU
processing so far.
However, now we need to organize the peer-to-peer communication
and basically maximize the bandwidth
according to the topology.
We need to be aware of the peer-to-peer interconnects.
However, we also need to be aware of the access to the CPU
and to the main memory.
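To make the pre-allocation and transfer-reduction tips concrete, one common pattern is to pin the host buffer and overlap chunked copies with compute on streams; this is a hedged sketch with illustrative names and a placeholder kernel, not the code used in our experiments:

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Placeholder compute kernel; stands in for the actual per-chunk work.
__global__ void process_chunk(int *chunk, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1;
}

// Hedged sketch: pinned host memory plus two streams so that the copy of the
// next chunk overlaps with the kernel running on the other stream.
void transfer_and_process(const int *host_src, size_t n_total, size_t chunk) {
    int *pinned, *dev[2];
    cudaMallocHost(&pinned, n_total * sizeof(int));   // pinned staging buffer
    memcpy(pinned, host_src, n_total * sizeof(int));
    cudaMalloc(&dev[0], chunk * sizeof(int));
    cudaMalloc(&dev[1], chunk * sizeof(int));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (size_t off = 0, i = 0; off < n_total; off += chunk, i ^= 1) {
        size_t len = (off + chunk <= n_total) ? chunk : n_total - off;
        cudaMemcpyAsync(dev[i], pinned + off, len * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        process_chunk<<<(len + 255) / 256, 256, 0, streams[i]>>>(dev[i], len);
    }
    cudaDeviceSynchronize();
    // ... free buffers and destroy streams in real code
}
```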
Right, and how we can use this to accelerate sorting.
I'll quickly go through three types of algorithms
that accelerate sorting on multiple GPUs.
So one, the first one is peer-to-peer based GPU sorting.
I won't go through the individual stages of the algorithm;
I'm just going to focus on the fact that this is an in-place sorting
algorithm.
So there is no communication between the CPUs and the GPUs
once the data is transferred to the GPUs.
So the sorting is done in place.
And just because of this, basically, the algorithm
is limited by the total GPU memory.
So the maximum amount of data that we can sort
is basically a bit less than 50% of the total memory.
And here, you can already see the trade-off
that we do with peer-to-peer algorithms.
We do single memory allocation, single memory transfer
between the CPU and the GPUs.
And there, we basically utilize the computational bandwidth
of the GPU interconnects.
And the trade-off of this is we can process less data.
In this case, half of it.
So another option, another alternative,
is to do heterogeneous implementation of a sorting
algorithm.
And here, sorting is just a use case, right?
This can be applied to any operator, right?
So if you want to do in-place joins, you can do so.
You can do heterogeneous joins.
And this goes on for every operator.
So what happens here is that now we are not
limited by the combined GPU memory.
However, we are limited by the access to the main memory.
Because once we sort everything on the GPU side,
then we need to merge the sorted chunks on the CPU side.
And the merging and the transfer to the CPU
are the bottleneck of this
approach. The third option is an algorithm that
was developed in our chair as a student's master's thesis.
There we do radix-partitioning-based sorting,
where we utilize the peer-to-peer interconnects
of the GPUs.
And again, we do in-place sorting.
However, we accelerate the sorting and the allocation
phase by doing a bucket exchange between the GPUs
and also an initial scattering phase of the data.
So just to compare how all of these perform, so on the DGXA100, as a reminder, this was
the 8 GPU system with NVSwitch interconnect.
There we basically see that the heterogeneous approach, where
we merge everything on the CPU, performs worse
compared to the other two.
Again, as I mentioned, this is mainly due to the merge stage.
And here you can see basically how big of a part
the merging phase plays into the whole algorithm.
There we can compare the peer-to-peer approaches,
the radix-based partitioning and the peer-to-peer sort.
And there we see similar performance.
The radix-based one is around 10% to 15% better, depending on the number of GPUs.
And what I would also like to point out: just take a look at how long the sort phase
actually lasts in all of the algorithms.
So it's the pink one.
And for example, in the Radix one, it's overlapped
with the data transfer, because we do parallel transfers
once every sorted chunk is completed.
But it's basically along the lines of the other two.
So more than 80% is spent on data transfers.
And the actual compute is almost for free in this case.
The trends are the same across both systems.
So this is the AC922. So this is the IBM machine with NVLink used as the CPU
to GPU interconnect.
And there, basically, what we see
is, again, very similar trend.
And again, similarly, the sorting
is taking the least time in the whole workload compared to the data transfers
and to the merging in the heterogeneous approach.
And again, this is a system that has symmetrical interconnects.
So basically, this is as good as it gets in terms of pure interconnect bandwidth
between GPUs and also between the CPU and the GPU.
Again, still the compute is not the problem.
The bottleneck is in the data transfers and processing
performed on the CPU in the heterogeneous approach.
So these are some of the takeaways
that basically summarize what I've been talking about so far.
So scaling on multiple GPUs has benefits,
but they are limited.
Basically, the main limitation is
that if we develop an algorithm that does not consider
the topology of the system, basically the benefits will be non-existent.
In order to actually exploit the benefits of multi-GPU systems, we need to be very aware of the topology
and utilize this to the benefit of our workload
and our approach.
Just as a side note, the overall fastest solution
was the 2GPU mode on the AC922.
Basically, we are sticking to the symmetrical interconnect
as the overall best solution.
Basically, the most powerful, the most computationally
powerful system is heavily bound by the CPU to GPU interconnects.
As I mentioned earlier, we
have more than a 10x performance discrepancy in this case.
And these are the systems; I already explained some of the ones
that we have in our data lab.
One system that I didn't cover
is the Dell PowerEdge,
which is a smaller version of the DGX-A100.
Basically, it has four GPUs and a single CPU.
Right.
So we've done some work on these topics in our group.
And yeah, basically, almost all of the findings
on multi-GPU data processing that I presented today
are part of the works listed here.
If you want to work in this area, you can contact our group.
And we can find an interesting project that we can work together on.
And with that, I would like to thank you for the attention.
And that wraps up the data processing on GPU part
in this lecture.