Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Processing on GPUs II
Episode Date: June 26, 2024...
Transcript
I guess we can start. So, hi everyone. Today we'll continue on the topic of
data processing on GPUs. So yesterday we gave an
introduction to GPUs: we introduced the GPU architecture,
and we discussed the main benefits and the things
to consider when programming for GPUs.
We went through the physical and logical execution
of the programming model.
And today, we'll be a bit more on the database side
and more on the implementation of actual database operators.
So in the beginning of the lecture,
I'll talk about the implementation of selections
and joins on GPUs.
Then I'll talk about multi-GPU data processing,
basically how the architectures differ,
what kind of system architectures are available,
what kind of topologies are there,
and how we can develop algorithms that
can recognize and utilize the benefits of these topologies.
And that will be the part regarding multi-GPU sorting.
And there we'll talk about peer-to-peer sorting,
heterogeneous sorting, where we also
utilize the CPU in the process, and also
a radix-based multi-GPU sort that
was developed in our group.
So we'll start with the GPU accelerated database operators.
And we'll start with the selection,
which is a fundamental database operation.
And there, the goal is quite simple.
Given a predicate p, we need to choose the subset of tuples
from our relation R that satisfy p and remove the rest.
And this is an example.
We have our predicate: x needs to be larger than 4.
We have our input relation, which is x.
And we have the output relation, y.
And this would be the corresponding condition code,
where we check the predicate, and if the predicate is satisfied,
we perform an addition; otherwise, we just move on.
So how can we utilize the GPU to parallelize this?
The big question is where to write the results in parallel.
And basically, now we'll talk about how we can address this.
The idea or an initial solution would
be to pre-compute the locations by utilizing prefix sums.
And there, basically, we can rely on prefix scans;
the prefix sum is the prefix scan operation
that we want to use.
And there, the idea is to apply a binary operator
to an array of n elements.
And basically, for our input relation of x,
we would have this x prime output.
So now the question is how we can do this on the GPU
in a somewhat efficient way.
So as I mentioned, the idea is to pre-compute the locations
by utilizing these prefix sums.
And we can do this in three steps.
So given the input relation x,
we first build a flag array.
This is just a binary array that marks whether the values from x satisfy our predicate p.
In our case, we have three values that do,
and our flag array will have ones at those places.
Based on the flag array, we then compute the prefix sum
in the following array.
And in this way, we pre-compute the locations at which to write the elements from X into the array Y,
which would be our output.
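To make this concrete, here is a minimal CUDA sketch of the idea, assuming the example predicate x > 4 and assuming the exclusive prefix sum over the flag array has already been computed (for instance with the scan we discuss later); the kernel and variable names are illustrative, not from the slides.

__global__ void build_flags(const int *x, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (x[i] > 4) ? 1 : 0;      // predicate p: x > 4
}

__global__ void scatter_selected(const int *x, const int *flags,
                                 const int *prefix,   // exclusive prefix sum of flags
                                 int *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        y[prefix[i]] = x[i];                // pre-computed write location
}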
So we can parallelize the selection by following these three steps.
And initially, what we want to do is apply a reduction here.
So basically, we want to reduce a set of values
to a single value using the binary operator.
And basically, from the prefix scan slide,
we just utilize the input A, and we basically
apply the operators on the subsequent elements,
and we have the output in A prime.
As a building block, we can use the reductions
for this prefix sum operation, and we
can calculate this conveniently in the shared memory
in the GPU.
So basically, we would pay no overhead
in the sense of additional data transfers.
And let's see how we can do this and how we can basically
utilize the GPU shared memory for this.
So we would use binary reduction trees for this.
And in general, this is how a binary reduction
tree would look like.
We would have three steps to generate the sum on the top
in the root, and we'll have complexity of log n
to achieve this.
However, on the GPU, we would do something different.
Yesterday, we talked about that we have
many threads at our disposal.
We can group these threads.
We can manage the memory accesses in these threads.
So let's see how, based on what we learned yesterday, how we can use this concept to
implement this in a naive manner.
So let's say we initially just have this input array on the bottom and for every two elements
we activate a thread pair. I don't know how this looks to you, but
this is way too much work to be done by the threads. But if we go about this in a
naive fashion,
this is what we get.
So in theory, we would just parallelize every sum operator,
and we propagate these changes upwards.
However, in fact, with this implementation,
we're performing way too many add operations.
And we are actually increasing the complexity compared
to a sequential implementation.
So the question is, can we be more efficient about this
and utilize the threads and the memory accesses
in a smarter way.
So we talked about thread coarsening yesterday as well.
We can just utilize fewer threads.
We say that for each two elements, we activate one thread,
and after every step, we shut down half of these threads.
Now a single traversal of this tree
performs O(n) additions,
so we are back to the work of the sequential version.
And this is how an example implementation in CUDA
would look like.
So you're familiar with how we define a kernel.
And here, we are basically using strided memory access
to shift the memory access in an increasing manner.
So basically, after every iteration,
we double the stride, and we also
shut down half of the threads.
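Since the slide's code is not part of the transcript, here is a minimal sketch of such a kernel, assuming the input is small enough to be handled by a single thread block (so __syncthreads is a valid barrier) and resides in global memory; all names are illustrative.

// Naive reduction: the stride doubles each iteration and half the active threads drop out.
__global__ void reduce_interleaved(int *data, int n) {
    unsigned int tid = threadIdx.x;
    for (unsigned int stride = 1; stride < n; stride *= 2) {
        unsigned int idx = 2 * stride * tid;     // element owned by this thread
        if (idx + stride < n)
            data[idx] += data[idx + stride];     // add the element 'stride' positions away
        __syncthreads();                         // wait for the current tree level to finish
    }
    // data[0] now holds the reduction sum.
}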
Is this the most efficient that we can get?
Yes?
No?
OK. Do you have an intuition what you would do?
Okay.
So let's see on the next slide.
So yeah.
Maybe the question should have been what's the problem with the previous slide, but we'll
get there.
By halving the number of active threads in each iteration
and increasing the stride, we are also
increasing the physical distance of the memory
accessed by each thread.
And initially, for example, in the first step,
this is perfectly fine.
But after x amount of iterations,
we have underutilized warps.
As an example, if our array had 256 elements instead of eight,
and we activate, as we said,
one thread for each two elements,
that would be 128 threads in this example, which would
be the equivalent of four warps.
And for this, we would need eight iterations
to calculate the reduction sum, since we
have log n complexity.
However, if we have warps 0, 1, 2, and 3, then from iteration 6 onwards,
only warps 0 and 2 will be active, with one thread each.
So we have allocated threads that
are busy looking for their next memory access,
but we're not using them.
Even after we stop half the threads
after each step, the issue or the efficiency problem here
is that these two warps are too far from each other.
And in yesterday's terms, we're not
coalescing the memory access.
We are not using the cache lines efficiently.
So we're trying to access too distant of memory locations.
So let's see what we can do to fix this.
So instead of doubling the stride,
we can actually decrease the stride.
So at every subsequent iteration,
the active threads and the memory locations they access
stay co-located.
So instead of moving the warps and the threads inside them
further apart, we actually bring them closer
so we can utilize more adjacent memory locations
and utilize the cache lines better.
And this is how this would look like in our implementation.
So in the previous example, we started combining the adjacent elements.
Here we are actually splitting the array in half,
and we are basically narrowing down the memory access.
So basically, at all times, we are keeping all warps busy.
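As a contrast to the previous sketch, this is roughly what the improved version looks like: the stride starts at half the array size and is halved each iteration, so the active threads and their memory accesses stay adjacent and coalesced. Same single-block, power-of-two-size assumptions as before.

__global__ void reduce_sequential(int *data, int n) {
    unsigned int tid = threadIdx.x;
    for (unsigned int stride = n / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            data[tid] += data[tid + stride];   // fold the upper half onto the lower half
        __syncthreads();
    }
    // data[0] holds the sum; the active threads remain packed into few warps.
}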
Another thing to consider, and this is actually the challenge here, is that the whole input array
is stored in global memory, and the partial results are also written to global memory.
And in our example with 256 elements,
this would mean that we have around 36 global memory
accesses and writes.
Can we be more efficient about this?
And let's see how.
So basically, we can load the input
from global memory, which we have to do in the beginning.
But after every computation of intermediate result,
we can write and read it from shared memory.
And if you remember, if we store something in the shared memory,
we're reducing the access time by orders of magnitude
compared to accessing it from global memory.
And basically, using the shared memory,
we can be as fast as a register if we're
lucky that there are no bank conflicts.
But we're still at least 100x faster
than accessing global memory.
So in our example of 256 elements,
if we start only with the initial input,
reading it from global memory and writing the final output
to global memory, we reduce the total global memory accesses
from 36 to 9 or by 4x in our example.
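A hedged sketch of this shared-memory variant: each block stages its elements in shared memory, performs the whole reduction on-chip, and writes a single partial sum back to global memory. It assumes blockDim.x is a power of two; the shared-memory size is passed at launch, and the names are illustrative.

__global__ void reduce_shared(const int *in, int *out, int n) {
    extern __shared__ int sdata[];                    // blockDim.x ints, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + tid;

    // Load two elements per thread from global memory and add them on the way in.
    int sum = (i < n) ? in[i] : 0;
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // All intermediate partial sums stay in shared memory.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];                   // one global write per block
}

// Launch example: reduce_shared<<<blocks, threads, threads * sizeof(int)>>>(in, out, n);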
So we mentioned this example.
However, with the current implementation,
we don't have the partial sums, so this does not yet give us our final prefix sum.
So we would add one more phase to this, and this would be the sweep down phase
where we'd add partial sums to the successor elements.
And basically after this phase, we would have the final prefix sum.
So now we start at the top,
with what we computed in this moving-up phase.
And in the sweep-down phase, we'll retrieve the partial sums.
And with blue arrows, we depict the elements, or in this case,
the 0 that we're pushing down.
And with the red arrows, we define the partial sums that
are computed.
And in the bottom, in the end, we have the full prefix sum.
Right.
So the complexity of this is 2n.
And now, basically, we have the same performance considerations
as with the reduction kernel phase.
And basically, we're applying the same logic:
we need to manage memory and control divergence,
and we need to reduce the global memory accesses.
So basically, going step by step here,
we would start in the same fashion
as we did when moving up.
So basically, we start at the half of our array,
and we combine it with the final prefix sum.
And to retrieve the partial sums,
we replace that value with a 0, because that result we
already have.
By moving down, we split the subsequent sub-arrays in half
and push down the 0 element that replaced the
initially computed total sum,
and hence it ends up in the first position.
By combining the pushed-down elements
with the precomputed values from the previous steps,
we are able to retrieve the partial sums.
And that's how we end up with the full prefix sum
at the bottom.
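For reference, a compact sketch of this up-sweep/down-sweep (exclusive) scan for a single block in shared memory, following the standard work-efficient scheme; it assumes n is a power of two with n <= 2 * blockDim.x, and leaves out bank-conflict avoidance. Names are illustrative.

__global__ void exclusive_scan(int *data, int n) {
    extern __shared__ int temp[];                 // n ints, set at launch
    int tid = threadIdx.x;

    if (2 * tid < n)     temp[2 * tid]     = data[2 * tid];
    if (2 * tid + 1 < n) temp[2 * tid + 1] = data[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums towards the root.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Down-sweep phase: push a 0 down from the root and distribute partial sums.
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t    = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    if (2 * tid < n)     data[2 * tid]     = temp[2 * tid];
    if (2 * tid + 1 < n) data[2 * tid + 1] = temp[2 * tid + 1];
}

Note that this produces the exclusive prefix sum: each output position holds the sum of all elements before it.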
Are there questions?
Yeah?
Why is the final result on the bottom right
35 instead of 36 in the beginning?
Shouldn't it be the same?
Right.
So here, the final element there is basically
the pre-computed value minus whatever
there was in the initial array.
In the initial array, there we have 1.
So basically, by having 35 there in our prefix sum
combined with the input array with the value 1, then basically we
know that until this point, our prefix sum is 35.
And then the actual value in the array there would be 1.
Hence, we get the 36.
Other questions?
Okay.
Let's move to joins.
So with the selection and the implementation that we've shown, we had two convenient assumptions,
or constraints.
The tuples either match or not.
So this is our initial constraint.
But we also know the maximum result size from the input relation.
Basically, in our selection, we know that the output cannot be
larger than the input relation.
In joins, this is not the case.
Tuples can match multiple times.
And also, the maximum output size, first of all, we cannot define it
before computing it.
But also, it can be as large as the Cartesian product
from both input relations.
So in this case, we need to think
about how to calculate the actual size in advance
and whether this would fit in GPU memory or not.
Since the only constraint that we have with GPU joins
is we need either one of the input relations
to fit into GPU memory, since the result can
be as large as the Cartesian product,
it may not fit at all.
So how would we exploit the parallelism of the GPU
in order to make joins more efficient? Basically,
we would like to perform lock-free processing in order
to utilize the parallelism.
But also, we would like to pre-compute the right locations
for each thread.
And that's how we know whether our result would
fit into the memory.
So in terms of execution, joins generally
use this three-step execution scheme,
where basically each thread counts
the number of join partners for the inputs.
And then with the result size of each thread,
we compute the prefix sum to get the right location
for each thread.
Then basically in the third stage,
we allocate the memory on the host side
for the size of the join results.
And then basically we write with all threads
to the global device memory.
And the locations are pre-computed.
So basically, each thread has its own designated memory
locations where the results will be written.
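A minimal sketch of this three-phase scheme, using a naive nested-loop probe purely to illustrate the count-then-write logic (the lecture's variant instead uses a sorted index with binary search); the prefix sum over the counts, the host-side allocation, and all names here are assumptions.

// Step 1: every thread handles one element of S and counts its join partners in R.
__global__ void count_matches(const int *r, int nr, const int *s, int ns, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ns) return;
    int c = 0;
    for (int j = 0; j < nr; ++j)
        if (s[i] == r[j]) ++c;
    counts[i] = c;                                 // input to the prefix sum (step 2)
}

// Step 3: after the prefix sum over counts and the host-side allocation,
// every thread writes its matches to its own designated locations.
__global__ void write_matches(const int *r, int nr, const int *s, int ns,
                              const int *offsets,  // exclusive prefix sum of counts
                              int *out_r_pos, int *out_s_pos) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ns) return;
    int pos = offsets[i];                          // pre-computed write location
    for (int j = 0; j < nr; ++j)
        if (s[i] == r[j]) {
            out_r_pos[pos] = j;                    // position list for R
            out_s_pos[pos] = i;                    // position list for S
            ++pos;
        }
}

The total result size, needed for the host-side allocation, would be the last prefix-sum entry plus the last count.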
So how would we implement this when, say, for example,
we're using an indexed nested loop join?
We would follow this three-step execution scheme.
We would start with building the index on the input relation,
or one of the input relations, in this case, r.
And basically, for the actual join,
we would look up each element of s in a sorted array
by doing binary search.
Final step, we transform the result position list
to the original unsorted index of the input relation.
In case r is already sorted, we skip the index building and the sorting step.
So in our example, to compute the result of the join columns,
we would need to store two position lists.
So given input relations R and S,
we would need to store a position list for each of them.
And if we see here which positions contain a match,
we have the numbers 5 and 3 that are in both areas.
And we're basically joining those.
In our position list in R, we see
that the 5 is in position or index 2.
And 3 is also conveniently on position 3. In the position list of S, we have the 5 at position 0,
but also we have the duplicates of 3 at positions 1 and 4.
Both position lists always need to have the same number
of elements, since each match produces one entry in each list.
So no matter in which relation
the element with multiple matches is,
we store its position repeatedly
in that relation's position list.
So to build the index, we start with keeping the position list
with the original order.
And basically, we would need this for the following step. So if we sort R, we would get the array
on the right-hand side.
So basically, we have the values in ascending order.
But we also keep the positions in the position list
from the original unsorted R. And now, basically, in step 3,
we would transform the result position list
with the original index.
And if we use three threads for this,
we would start in the first phase where basically each thread would
access one element and perform a lookup in our sorted index. Then the threads would access
elements from S with a stride, which we've basically seen already. So initially, the first three threads
would access the first three elements, perform the lookups,
and basically get the positions of the matches.
In this first phase, each thread collects the number of matches.
So thread 0 will find its match with the value 5,
and thread 1 will see that the value 3 has two matches.
So basically, this is after the completion of the lookup in S.
And this would be the counter array.
We know the memory locations
that each thread accessed,
and we also have the values
at those memory locations,
and each thread collects its matches in its counter.
The more threads we activate, the larger this counter array
will be, and the more memory we need to allocate for it.
We then compute the prefix sum over the counter array
to get the right write positions for each thread
and the overall result size.
And we get to the final number of three matches.
So in the second phase, using now the prefix sum
that we generated from the counter,
each thread looks up the assigned elements
and basically uses the prefix sum to compute the result.
In our example, we would have basically
the following position list.
So the value that thread 0 accesses is 5.
And from the position list, we know that the 5
has a match on index 3.
Similarly, thread 1 performs a lookup
and sees that the value 3 in this case
has two matches at positions 1 and 3.
And in the case where we sorted R in the beginning,
that is, if we actually sorted the input relation,
we now need to reverse the sorting.
In our input relation, the elements
were not in this order,
so we need to utilize the position list that we kept
to reverse the sorting.
To actually revert the sorting,
we perform a gather operation on the position list.
And then, as an output, we have the position list
referring to the original unsorted array.
So basically, these are the two steps.
How we started from an unsorted array,
we collected the position list.
Then we have the sorted array as an input.
And then we're basically reverting this
based on the position list for the sorted one.
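A short sketch of this final gather step, assuming orig_pos is the position list kept while sorting R and r_pos_sorted holds the match positions referring to the sorted R; the names are illustrative.

__global__ void revert_sorting(const int *r_pos_sorted,  // match positions in sorted R
                               const int *orig_pos,      // original index kept during sorting
                               int *r_pos_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r_pos_out[i] = orig_pos[r_pos_sorted[i]];         // gather back to unsorted positions
}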
Are there any questions till this point?
Yes?
In step two, are the threads assigned the values or the indices in the first phase, for the
prefix sum?
This one?
Yeah.
So does thread zero look for all the fives, or is there like a global counter that each thread increases
if it finds a 5, for example?
So thread 0 can also find other values than the 5.
So it basically just counts the matches.
It's not assigned for a single value.
Ah. It's not assigned to a single value?
Why?
I mean, what's confusing you?
Sure. I think I'm going to say that again.
With the notice that you just put in,
are you telling me?
No.
Okay.
Other questions?
Okay.
We're a bit ahead of time,
but I suggest
we take a short break now and wrap up with the operator
implementations.
And after the break, we'll move on to multi-GPU architectures
and multi-GPU algorithms.
So five minutes.
We'll see each other back at 40.
All right, so we'll continue with multi-GPU data processing.
And for this, we'll do a short recap of the memory hierarchy
that we saw yesterday.
So based on the architecture and the numbers for compute
bandwidth and our memory access latency,
we can see that the faster memory or the quickest memory
has very limited capacity.
So say for the fastest one in our L1 caches,
we are in the order of kilobytes.
And the further down we go in the memory hierarchy,
the larger the memory, but also the more expensive
the memory accesses become.
So we see around an order of magnitude decrease in bandwidth as we go down the hierarchy.
So that is why it is important to make sure that we are allocating memory at the exact place where it is
needed and not default to reading data
from GPU global memory.
To achieve this, we need to be careful about managing
the lifetime of the data structures
that we generate in our program.
And basically, based on what we discussed yesterday,
we can shift this between the kernel lifetime
and the complete application lifetime.
So the GPU global memory, compared to DRAM,
it has quite limited capacity.
The largest GPU that can be found today
has a capacity of 192 gigabytes, where with DRAM,
we can go into the order of terabytes in our servers.
So basically, whatever we need to process in our GPU
needs to be transferred from this main memory.
And basically, anything that's larger than our GPU memory,
by default, it means that the data needs to be swapped,
and we need to access main memory.
Now, the bandwidth between the host memory, or the main memory,
and the GPU global memory can vary by quite a bit.
And this is pretty much dependent on the interconnect
between the GPU and the DRAM.
So we are in the order of 16 gigabytes per second for PCIe
interconnects; we'll discuss them later.
And state-of-the-art proprietary interconnects allow throughput of up to 900 gigabytes per second.
And this is all determined by the physical hardware underneath.
And there are multiple interconnect interfaces that are available.
Some are general across all vendors.
Some are vendor specific.
And we'll go through each of them.
So PCI Express, this is a general interconnect
that connects the CPU to the storage, the memory,
and to any available coprocessors.
It is the most commonly used CPU-GPU interconnect.
And this is important in our scenario.
Current systems, both the ones we have available here and the ones currently in production, use versions 3, 4, and 5; the slides in Moodle will be updated accordingly. When using
all 16 lanes, the bandwidth for PCIe version 3 is 16 GB per second per direction,
and every following version doubles this.
So PCIe 4 is 32 GB per second, and PCIe 5 is 64 GB per second per direction.
There are systems that are currently announced, not yet
available on the market, that have PCIe 6.
And this is, again, double of PCIe 5:
the bandwidth is 128 gigabytes per second.
PCIe is quite ubiquitous in the sense
that in a single machine,
we would have multiple sets of lanes
to connect it to the different components.
When we're talking specifically about connecting the CPU
to the GPU, or connecting GPUs among themselves,
there are proprietary interconnects,
usually manufactured by the GPU vendors themselves.
So there is the NVIDIA NVLink and NVIDIA NVSwitch
interconnects, which are produced for the NVIDIA GPUs.
And the NVLink is mainly used as an interconnect
between two GPUs.
However, in certain machines, it can
also be found as the bridge connector
between the CPU and the GPU, replacing PCI Express.
The NVLink supports high bandwidth peer-to-peer data
transfers.
And currently, the current systems
are mostly using NVLink 3.0 and NVLink 4.0.
2.0 is already considered obsolete.
3.0 allows data transfers up to 300 gigabytes per second
per direction.
4.0 moves this to 450 gigabytes per second per direction.
The difference between NVLink 3.0 and 4.0
is not in the actual per-link hardware.
It's just in the number of lanes that are available at the GPU.
The per-link capacity is actually the same:
it's 25 gigabytes per second.
The only difference is that hardware architectures
using NVLink 3.0 have 12 lanes, amounting
to 300 gigabytes per second,
and the systems using NVLink 4.0 have more lanes with the same per-lane capacity
of 25 gigabytes per second.
NVLink 5.0 is announced. The systems are still not commercially available.
But this basically doubles NVLink 4.0's bandwidth up
to 900 gigabytes per second per direction.
And this is actually based
on the hardware characteristics of the individual links: they now
support 50 gigabytes per second with the same 18 lanes that are
available. So basically, these are awesome numbers.
This is very, very high bandwidth
compared to the compute available,
and especially compared to PCIe interconnects.
So the limitation here is that NVLink
can be used as an interconnect between only two elements,
or two GPUs, or one CPU and one GPU.
So for this purpose, NVIDIA developed NVSwitch,
which works as a switching element instead of a direct peer-to-peer interconnect,
and it can connect up to 16 GPUs at once.
So this allows all-to-all communication.
And basically, the data transfers are non-blocking.
So we can issue transfers.
So if you see the figure here, we
have six switching elements.
So at any point in time, all six switching elements
can be busy, and data transfers
can be issued through them.
Later on, I'll show you some numbers
to see how this scales and what the actual performance is.
For the NVSwitch 2.0,
if I'm not mistaken, the bandwidth of all-to-all communication goes up to 900 gigabytes per second.
Some announced systems move this into the orders of 18 terabytes, but they're just announced.
They're not available.
The details on the architecture are not clear, so I'll not go into too many details.
AMD is another major GPU vendor, and they have their own proprietary interconnect.
It's called Infinity Fabric, and under the Infinity Fabric umbrella they
refer to inter-CPU interconnects and inter-GPU interconnects. In
their EPYC CPUs, which are their server-grade CPUs, the bidirectional bandwidth amounts to around 200 gigabytes
per second.
The GPU interconnects can bridge up to four GPUs.
So for example, as shown in this figure,
so basically we have a single bridge for all four GPUs.
And this allows all to all communication between the GPUs,
similarly to the NVSwitch that we saw in the previous slide.
And basically, the bidirectional bandwidth
is 340 gigabytes per second per individual GPU.
Now, I'll basically connect this to the compute capability of the GPUs and see what real-life systems we have.
Some of them are available in our group.
Some of them are available as part of the data center, but we'll see how these interconnect capabilities
and GPU capabilities are intertwined
and what systems come out as a result.
The simplest setup
that you can have with multiple GPUs
is connecting them to a single CPU.
There you have a single-socket CPU system
with direct main memory access.
There are no NUMA effects from accessing multiple memory
sockets and multiple CPUs.
And for that, basically the only necessary interconnects
are the one to the main memory and then one PCIe
interconnect for connecting the CPU to the GPUs.
A real life system of this would be our server,
the small server that we use for development in our group.
There we have one AMD CPU with 64 cores.
We have PCIe 4.0 interconnects between the CPU and both GPUs.
Basically, we have 32 gigabytes per second of bandwidth.
And we have basically consumer-grade NVLink 3.0,
where we have basically just two lanes for a total of 56
gigabytes per second of bandwidth.
Since we're in the single-CPU domain, connecting one CPU to a GPU, I'll show one example system
that is more complex, and this is also available in our data center. Here, the usual assumption
that the interconnect between the CPU and the GPU
is the bottleneck no longer holds,
mainly because there is an NVLink connecting
the CPU and the GPU.
This is both an NVIDIA CPU and an NVIDIA GPU.
So it's called the Grace Hopper.
And basically, it allows the full NVLink bandwidth
of 900 gigabytes per second.
Sorry, this is bidirectional.
This is an NVLink 4, so it's 450 gigabytes per second,
as I mentioned on the slides earlier.
And keeping in mind that the main memory access
has a bandwidth of 256 gigabytes per second,
because, again, this goes over PCIe 5 with four lanes, while we
have 450 gigabytes per second of bandwidth between the CPU and the GPU,
this was basically the first chip that bridged this gap, where accessing main memory
from the GPU is faster than going explicitly through the CPU.
And at some point, we were fortunate enough
to be able to test this.
And yes, the numbers are really impressive, and the architecture stands out
compared to what the state of the art at that time offered.
However, this example is important
because using the NVLink
between the CPU and the GPU can really
change the way we design our systems.
The assumptions about the latency of individual memory
accesses, and the algorithms that we designed around these,
can shift by quite a bit.
And I'll show an example of a real-life system
that we have in our data center where
this comes into play, to a somewhat lesser extent.
In the meantime, I'll just move on to dual socket systems
with two CPUs and at least two GPUs per CPU.
So in this case, we can combine two CPUs with four GPUs and interconnect
them with a generic interconnect like PCIe.
The CPUs themselves are interconnected with, usually, the CPU
vendors' proprietary interconnects.
And each CPU socket has its own dedicated memory.
One example of such a system is the Delta system
that we have at HPI. It contains two Intel CPUs and four NVIDIA Volta GPUs.
The GPUs, interestingly enough, are
connected with asymmetric NVLinks.
So you can see that between GPU 1 and GPU 3,
we have a single NVLink lane, compared to GPU 0 and 1 and
the rest of the pairs.
So we have this ring architecture of all GPUs
being interconnected.
However, this is done in an asymmetrical fashion.
I mentioned that I'll present a system with NVLink serving as the CPU to GPU interconnect.
And this is the example of it.
So this is basically a system that allows cache-coherent access between the CPUs and the GPUs,
combining a single CPU with two GPUs in a ring
that has the same bandwidth and memory access latency throughout.
The example of this system that we have in our data lab is the IBM AC922.
So it uses the Power9 CPUs that allow NVLink to be connected to them.
And it combines them with four GPUs that are using, again, NVLink 2.0,
similarly to the Delta system that we saw before,
but now they have three lanes each.
Basically, we have the same bandwidth
between the CPU and the GPU and between each of the GPUs.
The bottleneck here in this system
would be the interconnect between both CPUs, which
has a slightly lower bandwidth compared
to the CPU-to-GPU interconnects. Then we move on to adding more GPUs
to a single CPU.
And one example would be the DGX A100 and H100 machines
that we have in our data lab.
So the idea here is that out of each CPU, there are two PCIe lanes coming out,
and each of these PCIe lanes is shared between two GPUs.
So the architecture is the same, or the layout of both systems
is the same.
However, the underlying GPUs and interconnects are different.
So basically, both have two CPUs.
The A100 has the AMD CPUs.
The H100, interestingly enough, has weaker CPUs in comparison,
even though it's a newer system.
However, the GPUs are significantly more powerful
and provide larger memory.
So the A100s, initially in the architecture that we have here,
they're available with 40 gigabytes and NVLink 3.0,
which basically amounts to 300 gigabytes per second
of peer-to-peer data transfers.
The H100 basically has double the memory
and also NVLink 4.0,
which basically increases the bandwidth to 450 gigabytes per second.
Similarly, the PCIe lanes are again shared.
However, the A100 system has PCIe 4.0, and the H100 has PCIe 5.0. The CPU-to-CPU interconnects are the
default proprietary AMD and Intel inter-CPU fabrics. So, I presented the
architectures, and now I'll talk about the implications
and the actual performance that we can get out
of these systems.
So we go back to the single CPU and two GPU system that
has basically PCIe 4.0 as the CPU-GPU interconnect
and NVLink 3.0 between both GPUs.
So we have single PCIe channels,
so they are not shared like in the architecture I showed before.
We'll see how close to these theoretical numbers
we can actually get.
For transferring data from the CPU to the GPU,
we managed to get to 26 out of the promised 32 gigabytes
per second per direction.
And if we execute bidirectional transfers,
we get to 42 out of the potential 64 gigabytes
per second.
This is for the serial transfers, where the CPU accesses GPU 0 or GPU 1 one at a time.
If the CPU accesses both GPUs at once,
we basically double the bandwidth of the unidirectional transfers from the CPU to the GPUs or the other way around.
With the parallel transfers, where we utilize both lanes in both directions,
we get less than double.
So in theory, this should be 128 gigabytes per second.
However, we only manage to get to 75 gigabytes per second.
In terms of the peer-to-peer interconnects, whether we transfer data from GPU 0 or GPU
1, in both cases we measure a bit less than the 56 gigabytes that are theoretically supported.
And when we issue a parallel or bidirectional transfer, we're getting to a bit more
than 100 gigabytes per second out of the 112 theoretically
possible.
So this is basically a very straightforward measurement
where we're just trying to do a sanity check whether these numbers
basically support the claims by the manufacturer.
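A sanity check of this kind can be done with a few lines of CUDA; the sketch below times repeated host-to-device copies from pinned memory using CUDA events. The transfer size and repetition count are arbitrary choices, not the exact methodology behind the numbers above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;          // 1 GiB per copy
    void *h, *d;
    cudaMallocHost(&h, bytes);                     // pinned host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (10.0 * bytes / 1e9) / (ms / 1e3);
    printf("H2D bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}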
What will be interesting in the following systems is to see how these design decisions
impact the data transfer rates and, consequently, the algorithms
that we would design. So in our Delta system, which was the one that has
asymmetrical GPU interconnects, we see quite a
large discrepancy in performance. When we're moving data from the CPU to the GPUs,
we measure quite low bandwidths,
but this is, again, in accordance with what's available as the interconnect.
So out of the 16 gigabytes per second, we measure around 12 and 13
for the host-to-device transfers.
The bidirectional transfers, interestingly,
are capped at 20,
so a bit less than double of the bandwidth
that was measured in one direction.
When we issue parallel transfers to different GPUs, this actually
scales quite nicely:
we see a 2x improvement across the different GPUs.
What will be interesting are
the peer-to-peer transfers, because there now we
have different capabilities, but also we
have different transfer logic
between the different GPUs.
So if we start from GPU 0 and we just issue a data transfer
to GPU 1, which is directly connected via a double NVLink,
we get close to the theoretical maximum:
we get 48 out of 50.
The same happens when we transfer to GPU 2.
However, when we try to move data from GPU0 to GPU3,
we get only 9 gigabytes per second of bandwidth. Any intuition, any idea why this is the case?
Yeah.
We have to do at least one hop across another GPU or CPU, so I guess there might be some
processing bottleneck inside the device.
Yes.
Of the device? Yes.
Yeah, exactly.
But I mean, we should see at least 25 or something
close to 25, right?
Because we have 50, and we have 25.
In theory, if we go through GPU 1,
we have 50 gigabytes per second of bandwidth until there,
and then we move to GPU 3 with 25 gigabytes per second
of bandwidth.
Same case if we go through GPU 2.
Well, what this system does is basically
it goes through the CPUs.
So to get from GPU 0 to GPU 3, since there's no direct connection between the two, the
default is to go to CPU 0, CPU 1 and then GPU 3.
Yeah?
Why is it not 16 gigabytes?
Well, 16 we cannot get because we don't even get 16 up here.
And I guess there would be additional OS checks.
And this brings the bandwidth down.
And we see a similar pattern when
we issue transfers in parallel, either bidirectionally
between directly interconnected GPUs
or when we try to connect the non-connected ones.
So what's interesting there: when
we issue a parallel transfer between GPU 0 and 3 and between 1
and 2, we roughly get 3x the bandwidth.
These measurements were repeated,
and we consulted with the vendors that provided this machine.
We know what the transfer flow is, but there is no exact explanation why this goes from 9 to 30 instead of going
at least to 36.
Presumably there is additional overhead involved.
So we'll move on to another peculiar system.
I'm saying peculiar because it's not very often
that we see this ring architecture
between two GPUs and a CPU.
So let's see what performance implications
there are from this.
So the CPU to GPU, the left-hand side,
these are not very interesting numbers.
We just have close to the maximum of 75 gigabytes
per second.
So this is as expected.
The bidirectional transfer has a slightly lower average bandwidth.
But yeah, that's transferring from CPU 0
to GPUs 0 and 1.
This all works as expected.
When we transfer from CPU 0 to GPUs 2 and 3,
we see a drop in performance.
And actually, it's quite a significant drop,
since the CPU interconnect is actually quite powerful
compared to the NVLink.
However, in any case, whether we ship data to GPU 2 or 3, we see these significantly lower numbers of around 40 gigabytes per second.
And again, we inspected this. Out of the 64 gigabytes per second of bandwidth between
both CPUs, only 32 are actually available for pure data transfers. The
other 32 gigabytes of bandwidth are reserved for OS calls and
kernel calls. So only half of the bandwidth is available for actual data
transfers, which was quite interesting to see.
And we had to dig quite deep.
We had to reach to people in IBM to find out about this.
When issuing parallel transfers, in the case of accessing GPUs 0 and 1,
we see close to double the bandwidth that we measured for the serial transfers. However, what was also difficult to explain
was the sudden drop in performance
when we moved data back from the device
to the host in parallel fashion.
Our assumption was that the CPU cannot handle receiving data
or placing data in main memory when accessed by both GPUs simultaneously.
But this was an educated guess at that point.
For the peer-to-peer transfers, we
measure quite close to the NVLink limits when we transfer data to directly interconnected GPUs.
When we go through the CPUs, and in this case we have to go through the CPUs since there's no ring topology between the GPUs, we see a significant drop.
And also, this was one of the indicators
that actually the bandwidth of the interconnect
between both CPUs is capped at a lower speed compared
to the 64 gigabytes that were advertised.
The interesting system is the one
where we have many GPUs that are connected
in a different fashion to the CPUs,
but are all interconnected
with each other through an all-to-all NVSwitch.
These are the numbers for the DGX A100 machine,
so basically the less powerful of the two.
We have yet to run a similar analysis for the H100.
So there, what we can see is that we have consistent bandwidth between all GPUs.
So we measure 279 out of 300 possible gigabytes per second possible bandwidth.
And this is consistent across any two GPUs that are in the system. Since they're all interconnected through the NV switch,
the bandwidth is equal to all of them.
The main memory transfers, on the other hand,
are actually quite interesting to see.
They're limited by the PCIe interconnect.
And there, what's interesting to see
is that if we issue a single transfer from CPU 0
to any of the GPUs from 0 to 3, we get 24 out of 32 gigabytes
per second.
However, if we issue transfers to GPU 0
and 1 at the same time, which is the case shown in the round brackets,
we get the same bandwidth.
And this is because this interconnect is shared.
And basically, this presents a significant bottleneck
when taking into account that the bandwidth
between the individual GPUs is more than an order of magnitude
higher than this.
What is interesting to see is that,
when trying to issue parallel transfers between all eight
GPUs, we get more than 2 terabytes per second of bandwidth, which is, I would say,
fairly close to the theoretical 2.4 terabytes per second.
And what was interesting to see here, in this red rectangle, is that despite having the NVSwitch,
when accessing co-located GPUs,
in the sense of GPUs belonging to the same CPU,
we measure higher bandwidth between them compared to accessing the GPUs that
are connected to the other CPU.
This was an interesting observation to see.
So we went through a lot of different numbers and system architectures, but where does this
bring us?
Basically, the numbers that I've shown give us the insight that being aware of the system topology is very important when we design our algorithms.
So if we just take a comparison between these two systems,
they both have two CPUs.
They both have four GPUs, right?
However, even though the GPU interconnects
are somewhat similar,
we need to be aware of how these components
are connected between each other.
Do we have high bandwidth interconnects
between the CPU and the GPU?
Do we have an obvious bottleneck in the system?
In our cases, we need to know if all GPUs are
interconnected between each other.
If they're not directly interconnected, can we still access them?
Do we need to go through a CPU, et cetera?
So to this end, we need to be aware of the systems
that we are designing our algorithms for.
And we can do this basically by deciding
where to pre-allocate the memory.
We can reduce data transfers, or we can optimize data transfers
by utilizing the stronger interconnects in our system.
Similarly, basically moving from main memory to the GPU memory,
we can organize the data transfers
between the GPUs themselves.
So we can manage this peer-to-peer communication
between individual GPUs.
And basically, this is all in the direction
of maximizing the bandwidth accordingly.
As a general rule of thumb, and we saw it
in the AC922 system but also in the other systems: when we
have multiple sockets, we try to co-locate the data on only one socket,
and we keep the computation on the same NUMA node as much as possible.
If the data is larger than the single socket memory,
then we need to think in another direction.
But as long as this is possible, we
try to keep the computation on the same NUMA nodes.
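On NVIDIA systems, part of this topology awareness can be queried at runtime. A minimal sketch, checking which GPU pairs support direct peer-to-peer access and enabling it where possible; this is only a starting point, not the full topology handling discussed above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int devices = 0;
    cudaGetDeviceCount(&devices);
    for (int a = 0; a < devices; ++a)
        for (int b = 0; b < devices; ++b) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);
            printf("GPU %d -> GPU %d peer access: %s\n", a, b, can ? "yes" : "no");
            if (can) {
                cudaSetDevice(a);                    // enable access from GPU a to GPU b
                cudaDeviceEnablePeerAccess(b, 0);    // flag must be 0
            }
        }
    return 0;
}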
And I'll quickly go through some examples
of a fundamental operator, sorting,
and how we can efficiently utilize multiple GPUs
for it.
So there are three different approaches that are currently developed.
One is the peer-to-peer based sorting where the actual sorting is performed in place and only in the GPU memory.
So once we allocate the data on the GPUs
and we provide or we transfer the data to the GPUs,
the sorting is performed in place.
And basically, this gives us the trade-off
of executing quick data transfers between the GPUs.
But also, since we're doing this in place,
this severely limits the amount of data that we can sort.
Because such algorithms are limited by the total GPU
memory.
And basically, the maximum amount of data
that we can sort is only half of the total GPU memory.
For such algorithms, there's one implementation
that we used in one of our papers.
So basically, we split the data across the GPUs.
On each GPU, we find the pivot element.
So we can basically sort this in place
and then exchange the data between GPUs
where we have then sorted sub-arrays.
And then there's another stage of sorting the sub-array.
So the goal here is not to go through every algorithm step,
just to see how some of the multi-GPU peculiarities apply
into the algorithms themselves.
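The GPU-to-GPU exchange step in such peer-to-peer sorts boils down to asynchronous peer copies. A hedged sketch, assuming peer access is already enabled (see the earlier snippet) and that the partitioning step has determined the bucket sizes; all names and parameters are placeholders.

#include <cuda_runtime.h>

void exchange_buckets(const int *bucket_on_gpu0, int *recv_on_gpu1, size_t bytes0to1,
                      const int *bucket_on_gpu1, int *recv_on_gpu0, size_t bytes1to0,
                      cudaStream_t stream0, cudaStream_t stream1) {
    // Both directions can be in flight at the same time over NVLink.
    cudaMemcpyPeerAsync(recv_on_gpu1, 1, bucket_on_gpu0, 0, bytes0to1, stream0);
    cudaMemcpyPeerAsync(recv_on_gpu0, 0, bucket_on_gpu1, 1, bytes1to0, stream1);
    cudaStreamSynchronize(stream0);
    cudaStreamSynchronize(stream1);
}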
The second idea is to utilize the CPU for this or to have the CPU help in this process.
And there the intuition is a bit different.
So initially we fill a single GPU with data.
So we do an initial data transfer there.
We perform an in-place sorting.
Then, in stage three,
we have this bidirectional data transfer
between the CPU and the GPU.
So the GPU sends the sorted data back while receiving a new batch of data for sorting.
Then in stage four, while the GPU is sorting the new batch,
the CPU is merging the data that was just received.
And in the final stage, once the final batch of data
is sorted on the GPUs, there's the final merge stage
on the CPU.
So what this allows us to do is sort data
that's larger than the combined GPU capacity.
We're moving from less than half of the combined GPU
capacity, with the peer-to-peer sorting, to an algorithm that's
bound by the size of the main memory, which,
as we've seen before, can be an order of magnitude larger.
And the third idea is, again, in the direction
of in-place sorting.
However, we perform radix partitioning instead
of the pivot element computation in the beginning.
So there, we partition the keys
based on the radix value of the most significant bits.
Then we exchange these buckets between the GPUs via peer-to-peer transfers, and
on each of the GPUs the individual buckets are sorted using a single-GPU sorting primitive.
This is, again, as I mentioned, an in-place algorithm,
and it has the same limitations as the peer-to-peer sort.
This was developed as a master thesis in our chair.
So let's see how the three of them compare,
and not only how they compare,
but also where the time goes.
So in the peer-to-peer case, we have basically the data transfers, which are the green and
the blue parts, and we can see that they're dominating the runtime.
And this is the ongoing story across all the algorithms,
with the exception of the heterogeneous one,
where the merging on the CPU is the clear bottleneck there.
So we can see that the actual sorting, which
is marked in pink in the case of the peer-to-peer sort, is intertwined with the data transfer copies
from the CPU and back in the case of the RMG sort.
With the heterogeneous sort, actually, the merging stage
on the CPU proved to be the most costly stage.
And in terms of results, we can see that the in-place sorting
algorithms, while being constrained
by the amount of data they can sort,
they're significantly faster compared
to the heterogeneous sort.
Right, so this was on the DGX A100.
What's important to notice here
is that there is almost no difference between the
performance on four GPUs and eight GPUs across all algorithms, and this is a
direct consequence of sharing the PCIe lane between two GPUs. On the AC922, where we have the NVLink between the CPU and the GPU, we have another
set of interesting results.
So for the peer-to-peer and RMG sort, actually, we see a significant performance decrease
once we move past two GPUs. And this is basically exactly the point
where we stop utilizing the ring architecture,
or these efficient data transfers between the CPU and the GPUs,
and need to move to the other socket.
And there, we are actually constrained
by the bandwidth of the CPU interconnects.
And we can see consistently that our in-place algorithms
are faster on two GPUs compared to four GPUs.
With this in mind,
I want to summarize the main takeaways:
scaling to multiple GPUs needs to be done with awareness of the system.
You will not benefit infinitely just by distributing the compute across multiple
GPUs. We saw that the data transfers are by far the costliest
operations, and knowing the topology of your system is essential to get
the best performance out of it. One example was the 2-GPU mode on the AC922 providing the fastest solution for that system, but also
knowing that utilizing 4 GPUs or 8 GPUs doesn't make a difference on the DGX A100.
And here I would add that it's important to allocate two GPUs per socket
rather than four GPUs
on one socket, because otherwise, again, we are
limited by the shared PCIe bandwidth.
Right.
And basically, again, as I said, we
can see that this order of magnitude difference in bandwidth shows in the actual time it takes to move data or actually sort data.
Right.
This is.
Yeah.
So these are some of the systems that we already
have in our data lab.
Not all are shown here, because we're constantly
adding new systems.
But if you're interested in working on any projects
regarding GPU development or multi-GPU development,
feel free to reach out to us.
This includes research projects, seminars,
but also master theses.
If you're at that point in time,
just reach out to any of us in our group.
This would be mainly Florian and myself.
So we'll be happy to work with you on interesting projects like this.
These are some of the papers and master
theses that came out as a result of this work.
And with that, I would like to thank you.
And yeah, I'm open for questions if there are any.
OK, thank you. Next week you'll talk about FPGAs.
So more coprocessing and acceleration.