Hardware-Conscious Data Processing (ST 2023) - tele-TASK - Data Processing on GPUs II
Episode Date: July 5, 2023...
Transcript
Welcome everybody. Today is the second part of the lecture on data processing on GPUs.
And yeah, this is the last lecture on GPUs and then afterwards you'll have
data processing on FPGAs, your new programming task and then you'll get introduced to CXL.
What we are going to talk about in this lecture
is how to implement some database management
operators on GPUs.
We'll talk about selection, how to do parallel prefix sums,
and how these building blocks can be used to implement joins.
Then we'll talk about what happens when a single GPU
or processing on a single GPU is not enough,
so we need to distribute.
We need multi-GPU setups.
And then we'll take a look into a use case of multi-GPU sorting
and how this was implemented in our group.
Again, the lecture is based mainly on the two books,
mostly for the first part. The multi-GPU part
is based on our research in the group,
and also on some of the lectures that are shown here.
Right, so let's start with the database operators.
We'll start with a simple selection.
For a simple selection, we
have a selection predicate that
chooses the subset of tuples from the relation that
satisfy the predicate and removes the rest.
If we write simple pseudocode for this, it would look something like
this.
For the given predicate, we just need to remove the 2 from relation X.
So basically, the question that we need to answer
is, how can we parallelize this on a GPU?
So we know that we can access these memory locations
in parallel.
But then the question comes: where
should we write the results, and how can we
do this efficiently so there
are no overlapping writes?
And one solution is to pre-compute the locations
by using prefix sums.
In general,
a prefix sum is a prefix scan operation:
we apply a binary operator
cumulatively to an array of elements.
And we use this prefix sum, or the prefix scan
operation in general, to pre-compute
the locations of the write operation
after evaluating the selection predicate.
To do this, we need three steps.
First, we build a flag array, which
notes which of the elements in our initial relation
satisfy the predicate.
Then from this flag array, we compute the prefix sum.
And then, again, we scan the flag array
and write the elements to their respective positions
in a new array.
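As a rough sketch of these three steps in CUDA (illustrative names; the prefix sum itself is treated as a black box here and could be the scan discussed later or a library routine), the selection could look like this:

```cuda
// Hedged sketch of selection via a flag array and a prefix sum; all names are
// illustrative, and the exclusive prefix sum (step 2) is not shown here.

// Step 1: mark every tuple that satisfies the predicate (here: value != 2).
__global__ void build_flags(const int *relation, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (relation[i] != 2) ? 1 : 0;
}

// Step 2 (not shown): exclusive prefix sum over flags -> offsets,
//   e.g. flags   = 1 0 1 1
//        offsets = 0 1 1 2   (each qualifying tuple's write position)

// Step 3: scatter the qualifying tuples to their pre-computed positions.
__global__ void scatter_selected(const int *relation, const int *flags,
                                 const int *offsets, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = relation[i];
}
```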
And these three steps can be easily parallelized, mostly
by using reductions in GPUs.
Again, we need to apply a single binary operator to an
array, and we know that by using multiple threads on a
GPU, we can easily parallelize this.
These reductions, applying the
binary operator, serve as a building block for
the prefix sum operation.
And what's good is that prefix sums
are calculated in GPU shared memory in general.
So basically, we have little to no overhead based on this.
And now the question is, how can we parallelize them
and make them efficient?
So one option is to use binary reduction trees.
And if you're familiar with binary reduction trees,
to basically compute the sum, we need log n operations.
So basically, in our input array of eight elements,
we'll need three iterations over the array
to get to our sum at the end.
And if we implement this naively in parallel on the GPU,
we get a bad solution, as we saw yesterday.
Just doing naive parallelization
doesn't work that well in this case either:
we would actually need more operations
than a sequential implementation. So now the question is, how can we improve?
A simple solution is to split the workload
and assign two elements to each thread.
After each step, we
ensure that only half of the threads remain active,
and we achieve log n complexity.
The additional benefit is that we
can execute this in parallel.
Is this an efficient implementation?
Somewhat, right?
It's not extremely bad, right?
It's OK.
So this would be a GPU kernel that would execute this, right?
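The slide code itself is not reproduced here; as a hedged reconstruction, an interleaved reduction of this kind (one block, two elements per thread, stride doubling each iteration) typically looks like the following sketch, and the placement of __syncthreads() is exactly what the question below is about:

```cuda
// Hedged reconstruction of the interleaved reduction (one block handling
// 2 * blockDim.x elements); names are illustrative, not the exact slide code.
__global__ void reduce_interleaved(int *data) {
    unsigned int tid = threadIdx.x;
    unsigned int i = 2 * tid;                 // each thread owns two adjacent elements

    // Stride doubles after every iteration: 1, 2, 4, ...
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (tid % stride == 0) {
            data[i] += data[i + stride];
        }
        __syncthreads();  // all partial sums of this step are visible before the next
    }
    // data[0] now holds the total sum of the block's elements
}
```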
So from last lecture, we know that we
would like to access the memory in a stride manner.
So we basically co-locate the threads.
They access adjacent locations.
However, that happens only in the first step, right?
So that's the only part where threads are accessing actually
adjacent locations.
If you look at iterations two and three,
there the threads are quite widely apart.
Or they're accessing elements that are quite widely apart
in memory.
Now, for eight elements, this is not a problem, right?
Because everything is still in a single cache line.
All is good.
However, if we have a wider array or a larger array,
what happens is that after a few iterations,
two or three iterations, threads are accessing locations
that are in different cache lines
and can eventually also be outside of a single thread block.
The consequence of this is that we have a doubling stride after every iteration,
which increases the distance,
meaning that we are not utilizing the cache lines
in GPU memory efficiently.
And I guess what I'm hinting at
is that we can do better,
mainly through the way that we're accessing the data.
Yes.
I have a question.
For this reduction, we need those three different steps,
right, they're after, one after another.
Yes.
How is this guaranteed in this code?
Like that we have the first stage run,
and then we have like the second on top?
So what do you mean, how is it guaranteed?
Because the bottom row,
first we were doing the addition
on all of those, right?
Yeah.
But isn't?
Maybe I'm missing something here.
OK, so I think I understood the question now.
So what this for loop does, so in this case,
we have four threads, right? In the very first.
So we have four threads.
So this means that this for loop is executed four times, right?
In parallel.
One by each thread.
And what each thread does is start from stride one.
The block dimension in this case is four;
we have 4 threads.
So we start with stride 1, meaning we
are in iteration number 1.
So we are accessing the adjacent location, right?
And then we do the addition.
So we get the 10, 12, 5, and 9.
So we get those results.
And then in the next iteration, the stride doubles.
And to answer the question about the guarantee,
basically, only after the summation,
we go to a double stride and to access the new elements.
So basically, this is the guarantee.
The tricky part is that what you see here is a single-thread
implementation,
and this is being executed four times in parallel.
A thread won't go on to the second iteration early,
or even if it does, it doesn't matter.
Helpful, or not?
But is it possible that one thread is already
further than the other?
Or no, that's not possible, right?
Well, if they're executing the same computation,
this doesn't make sense.
But I mean, this goes back to the execution part, right,
that we talked about yesterday.
At some point, they are reconverging.
And if you're not sure, if you have a varying workload where
you do some arbitrary computation that's not an addition, right,
then you can explicitly ensure the reconvergence,
which we talked about yesterday.
So is that helpful?
OK.
Yeah?
If I were to program this,
and I'm not really sure about the internals,
so I'm not 100% sure that the threads will be synced:
if I were to also call syncthreads in the for loop
to ensure that every stage is executed one after the other,
would the compiler be able to optimize it away or not?
That I'm not sure.
I haven't looked into, basically,
if the sync threads command can be pushed upwards when it's called explicitly.
So say, for example, you don't call sync threads.
What the compiler would do, it would assume the conservative case.
The conservative case meaning that you need inter-thread communication,
so it will only sync after the for loop.
So basically, again, what we have here, the syncthreads,
the synchronization at this part of the code,
is something that the compiler would do implicitly.
It's put here just so everybody's
on the same page about where the synchronization happens.
You can do it in the for loop.
But since the for loop is executed on a thread level,
this doesn't really make sense.
Because you're executing a single iteration,
and then you're synchronizing with what, basically?
So it means that the thread will synchronize with itself,
which is kind of not necessary.
Yeah, so basically, we have this as a starting point. Just to give some insight into the steps,
the output is still readable.
So we need the reduction, the simple one, and the verbose one.
So basically, I'm going to make this smaller
so you can see the output.
Basically, what you have here is exactly that, right?
So if you start from here,
you need to look at the second iteration.
We have thread 0 writing the 9, thread 1 writing the 5,
and so on and so on.
So basically, we have the order of writes.
And basically, we also have the locations.
And we can see that these are not adjacent locations.
And the threads are accessing far away locations.
And again, just to clarify for eight elements,
this is no issue.
But for larger data, this would very quickly become a problem,
because we might end up with inactive threads inside a warp.
And even worse, we can end up with inactive warps just
being blocked by basically a single iteration.
Or, in this case, a single summation.
So how can we be more efficient about this? OK, this is a
short explanation of what I just talked about, and an example of when exactly this will happen.
So I took the example of having 256 elements, which is, again, not a large number.
And for this, if we take the same assumption, we need half as many threads as elements to start with.
So we'll activate four warps.
So basically, we'll need eight iterations
to calculate the reduction sum.
And basically, as of iteration six,
only two of the four warps would be active,
and only one thread will be active in each.
So basically, we are wasting 128 threads
for the sake of two threads doing the computation.
So one way to be more efficient about this
is very simply to decrease the stride.
So instead of just splitting naively the workload
and having the initial threads accessing adjacent locations, we decrease the stride
with every iteration.
So we basically make sure that we narrow the workload.
And basically, as the workload progresses,
we are making sure that we are converging
into a single warp or a single thread in the end.
And basically, the rest of the warps
can be freed up to execute different operations.
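A hedged sketch of this convergent variant (again one block with 2 * blockDim.x elements, illustrative names): the stride now starts large and shrinks, so the active threads stay contiguous and whole warps can retire cleanly:

```cuda
// Hedged sketch of a convergent reduction with a decreasing stride (assumes
// one block and an input of 2 * blockDim.x elements); names are illustrative.
__global__ void reduce_convergent(int *data) {
    unsigned int tid = threadIdx.x;

    // Stride starts at blockDim.x and halves each iteration, so the active
    // threads always stay contiguous at the front of the block.
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {
        if (tid < stride) {
            data[tid] += data[tid + stride];
        }
        __syncthreads();  // all writes of this step finish before the next one
    }
    // data[0] now holds the total sum
}
```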
Right.
Is there a question?
Why do we have the syncthreads now inside the loop?
That's a good question.
So I can very quickly check, but this might as well be a copy-paste error on my
end.
Yes?
I don't think it's a copy-paste error.
Why?
If we parallelly compute the sum of two numbers for the first stage, then we would
not want to overwrite the numbers because we are using the same memory to read the numbers
in the for loop in the other example, right?
Yeah, so.
The other example would work because of the stride, like it was.
Ah, yeah.
Because we're not rewriting the memory from other places.
Yeah, because of the stride... so, say again, what was the difference in this example?
Yeah, the memory that it itself used: the 9 overwriting the 1.
And in the other example, the first thread summing 1 and 8
is also potentially overwritten by another thread, the one summing to 15.
So that's the second thread, summing 7 and 8.
OK.
So basically what you're saying is
that we don't want a faster thread in the second iteration
to overwrite a memory location that is still in use.
It's still the first iteration?
So we don't want the thread summing 8 and 7
to overwrite the 8 with 15 before the first thread summing
8 and 1 can sum them together to 9?
Or 8 and 1?
8 and 7, yeah, I understand.
OK, yeah, I mean, I understand what you mean. But again, the question is, why would you
need the syncthreads in the loop then?
So how does this help?
I understand your point, and it's correct.
But how does it help to have the sync threads in the loop?
Good question.
But yeah, I mean, that's true.
I'll need to check this. And basically, the uploaded version of the slides on Moodle
will have the correct one.
It might be that this was an error on my end.
Right.
So yeah, basically what we want to do here
is to minimize the divergence of the threads
and their subsequent accesses.
And just to compare how this would affect performance,
even on our small, where's my mouse?
OK.
Let me move this here so I can be a bit quicker.
So we want to first execute the first one, and then we need the minimized-control-divergence, convergent version.
And basically, what you can see here: these are different versions where I removed the prints, so they won't slow down the kernel.
So basically, what happens for these eight elements that we saw:
we have four milliseconds for the simple, naive
implementation,
and then we have 100 microseconds
for the convergent implementation of the kernel.
And this is the cost that we pay for threads having to access locations that are far apart. OK, I need
to move this.
All right.
Yeah.
So we see that easily this can mean a large improvement.
And again, this is a very small example. So milliseconds to microseconds, still a large improvement.
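As a side note, timings like these are commonly collected with CUDA events around the kernel launch; a minimal sketch (reduce_convergent is just the illustrative kernel from the sketch above, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduce_convergent(int *data);  // illustrative kernel sketched earlier

// Hypothetical timing harness around a single kernel launch.
void time_kernel(int *d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduce_convergent<<<1, n / 2>>>(d_data);   // one block, two elements per thread
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```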
However, we don't need that at this point.
In this case, there is also a small problem.
Even though we decrease the stride and converge the execution to eventually a single
warp and a single thread, we still do the expensive global memory accesses. And for the previous example
that I mentioned, with the somewhat larger data of 256 elements,
in this loop we'll need to execute around 36 global memory
accesses, reads and writes.
Are there any suggestions how we can reduce this?
Yeah?
Using shared memory?
Exactly.
Yes.
So basically, what we need is to basically move
this write operation into the shared memory.
And basically, this is what we want.
We load the input from global memory initially,
and then write and read every intermediate result
into shared memory.
And basically, the reason why we want to do this
is that the access is up to 150x faster than global memory.
And shared memory access, basically, if we have no bank conflicts for the threads, this
can be as fast as a register access.
Since if you remember from yesterday, shared memory and L1 cache are co-located now
basically in the same memory.
And this is basically the fastest access
that we can achieve.
And for the larger input of 256 elements,
only the initial input
and the final outputs are read from and written
to global memory.
And basically, since we need eight iterations to do this,
we'll have nine total global memory accesses,
reads and writes combined. So we decrease the global memory reads and writes
by a factor of four.
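A hedged sketch of the shared-memory variant (one block, 2 * BLOCK_DIM input elements; the names and the fixed size are illustrative assumptions):

```cuda
// Hedged sketch: only the initial load and the final result touch global memory.
#define BLOCK_DIM 128

__global__ void reduce_shared(const int *input, int *output) {
    __shared__ int partial[2 * BLOCK_DIM];
    unsigned int tid = threadIdx.x;

    // Single read of the input from global memory into shared memory.
    partial[tid] = input[tid];
    partial[tid + BLOCK_DIM] = input[tid + BLOCK_DIM];
    __syncthreads();

    // All intermediate sums stay in shared memory.
    for (unsigned int stride = BLOCK_DIM; stride >= 1; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();
    }

    // Single write of the final result back to global memory.
    if (tid == 0) {
        *output = partial[0];
    }
}
```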
How do you think that this will affect performance
in our dummy example with eight elements?
OK, let's see.
Say again?
Yeah, let's see.
We need the demo.
OK.
So, we're going to do a demo. Yeah, let's see.
We need demo shared.
Almost nothing. And basically, the reason for this
is that the compiler already optimizes this.
For small data, this is basically
done at compilation time when the input is small.
And basically, it reserves shared memory by default.
What happened when I tried this on larger examples,
for example a million elements like the addition
that we did yesterday?
Let me go back to the slides.
If we designate this flag to use shared memory for the
input, we get a compiler warning that we are
trying to access too large a chunk of shared memory; it's
simply not possible to put one gigabyte of data into
shared memory.
So it just gives a warning
and disregards this.
It still goes to global memory by default.
However, vice versa applies.
If we don't designate the shared flag
and the input is sufficiently small,
the compiler will notice this and move the data
to shared memory anyway.
And hence, we don't see any improvement
since our example is too small.
Right, so with the current implementations,
we have the final sum, right?
We have that 36.
We achieved this in three different ways so far.
But we don't have the final prefix sum.
So the solution to this is to add one more
phase to the algorithm.
Basically, we have the sweep down phase
where we need to add the partial sums to the successor element.
So basically, in the end, we have the final prefix sums.
And basically, this is the goal.
So, starting from the final sum, from the 36 up there,
we want to recursively come back down
and combine the partial sums into the final prefix
sums.
And basically, not to go step by step, but basically you can see that the accesses in the sweepdown
phase are mirroring this pattern here.
And this was our starting example.
So basically, not to go through all three alterations
that we made to the example, basically the same performance considerations
for the sweep down phase will apply in this case as well.
So again, the strided access here will need to be adjusted.
And if the memory allows, we need
to reduce the global memory accesses as well.
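Putting the up-sweep and the sweep-down phase together gives the well-known work-efficient scan. A hedged single-block sketch is shown below; this is the standard Blelloch formulation, which produces an exclusive prefix sum (the slides instead add partial sums to the successor elements to arrive at the final prefix sums), and the size N is purely illustrative:

```cuda
// Hedged single-block sketch of a work-efficient scan in shared memory
// (Blelloch-style up-sweep plus down-sweep). Launch with N / 2 threads.
#define N 8

__global__ void prescan(const int *input, int *output) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    int offset = 1;

    // Load two elements per thread into shared memory.
    temp[2 * tid]     = input[2 * tid];
    temp[2 * tid + 1] = input[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums in place.
    for (int d = N >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    // Clear the last element, then sweep down, distributing partial sums.
    if (tid == 0) temp[N - 1] = 0;
    for (int d = 1; d < N; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    // Write the exclusive prefix sums back to global memory.
    output[2 * tid]     = temp[2 * tid];
    output[2 * tid + 1] = temp[2 * tid + 1];
}
```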
Right, so as I already said, we need to keep the control and memory
divergence of the thread and memory accesses in check,
and reduce the global memory accesses.
And so far, we were dealing with selections.
And there, the tuples either match or not. The maximum result size that we can get
is as large as the input relation.
With joins, what we have is something else.
This is not the case there: the tuples can
match multiple times, and the maximum output can be as large as the Cartesian product.
So we need to make more adjustments and address several challenges
that arise. One I already mentioned: we do not know the exact result
size in advance.
The result may fit or may not fit in GPU memory,
regardless of whether the relations do.
And now the question is how we can use the GPU to address
these challenges, but also to exploit the parallelism
that they offer.
And two things that we need to consider:
we need lock-free processing,
and we want to pre-compute the write locations for each thread, similar to what we
did with the prefix sums in the selection part.
So how are joins executed?
I guess you're familiar with the execution scheme already.
Basically, we have this three-step execution for most joins,
except for primary-key/foreign-key joins.
There we do not need a position list, because all
foreign keys have exactly one primary key partner.
But in general we use these three steps.
So we need each thread
to count the number of join partners for its input.
Each thread needs to compute the prefix sum for its result
to get its write location.
And then, on the host side,
we need to allocate the memory so that all threads can
write their results in a parallel fashion.
And we'll take a look at an example of indexed nested loop
join.
And there, basically, we have, again, this three-step
execution scheme.
So we need to build the index on the R relation.
We need to sort that.
Then we need to do a lookup for each element of S
into the sorted array by performing binary search.
And then we transform the result position list back to the original indexes of R.
Basically, we perform the second step two times,
once for the position list and a second time
for the actual join operation.
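A hedged sketch of this scheme for the indexed nested-loop join could look as follows; all names are illustrative, and the prefix sum between the two kernels is assumed to come from the scan discussed earlier or from a library:

```cuda
// Hedged sketch: kernel 1 counts join partners per probe element, an exclusive
// prefix sum over the counts gives each thread its write offset, and kernel 2
// writes the matches without conflicts.

// First position in sorted_r[0..n) whose value is >= key (binary search).
__device__ int lower_bound(const int *sorted_r, int n, int key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (sorted_r[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// Step 1: each thread probes one element of S and counts its join partners.
__global__ void count_matches(const int *sorted_r, int n_r,
                              const int *s, int n_s, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_s) return;
    int lo = lower_bound(sorted_r, n_r, s[i]);
    int cnt = 0;
    while (lo + cnt < n_r && sorted_r[lo + cnt] == s[i]) cnt++;  // duplicates in R
    counts[i] = cnt;
}

// Step 2 (not shown): exclusive prefix sum over counts -> offsets.

// Step 3: each thread writes its matches starting at its pre-computed offset.
__global__ void write_matches(const int *sorted_r, int n_r,
                              const int *s, int n_s, const int *offsets,
                              int *out_r_pos, int *out_s_pos) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_s) return;
    int out = offsets[i];
    for (int k = lower_bound(sorted_r, n_r, s[i]);
         k < n_r && sorted_r[k] == s[i]; k++) {
        out_r_pos[out] = k;  // position in sorted R; a later gather maps back to the original index
        out_s_pos[out] = i;  // position in S
        out++;
    }
}
```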
Right, so if we want to join, say, these two relations, again, quite small:
we have three matches in total.
We have the five matching in R and S, and we have two instances of the three matching. So we have the number five at position three,
and we have that written in our position list for R.
And then we have the three at position three, at index 3.
And the end result of this is that we
have the match partner for the 5 at position 1,
and then for the 3s at positions 1 and 4.
So we already have a duplicate here.
So we need to make sure that in step three, where we
map back to the original input, the result ends up
in the same order as the input list.
So what we do here is we build the index of R;
there we have the positions.
And then we need to sort this,
so we sort by the value of the elements.
And then we need to perform the lookups from S in the sorted array,
which we'll do in the next step.
So how we can parallelize this, or how
we can parallelize the first phase,
is basically each thread will access one element, exactly one
element, and perform a lookup on the index.
So we are going through S. And so basically now what
we want to do is basically we can
use three threads in this example,
and each of them reads one element.
And again, we do this in a strided fashion,
as we know from earlier examples.
So what happens is that now we have the counters:
for the elements that each thread processes,
we know how many matches they have
when looked up in the sorted R relation.
So we know now that thread one read the five, so we have a single match there, and
then the thread that reads the three has
two matches there.
The more threads we use, the more we increase the memory consumption.
So, as we spoke about yesterday, we want to make sure that we don't over-parallelize this
and don't assign too many threads for too simple
a compute operation.
Once we have the counter array,
we need to compute the prefix sum
to get the write positions for each of the threads;
we saw how to compute the prefix sum earlier.
Then we have the lookups of the threads and their assigned elements,
and we use the prefix sum to write the final result.
So basically what happens here is
that we have the position lists for both relations.
So we're still working on the sorted R relation.
We know that number 5 is on position 3.
And also, we need the position of the 3,
because that's also one of the matches.
And there, we write the position of the 3, which is index 1,
in this case.
And in the position list for s, we do the same.
The positions that we're looking for are 0, 1, and 4.
So since we sorted r in the beginning,
now we need just to reverse the sort.
And to reverse the sort, basically, we
need to perform a gather operation on the position list
that we generated initially from step one.
We give the input array and the position list
as the input,
so we take the sorted array
and the position list and reverse the sorting.
And what happens in the final step
is that we end up with the positions
in R of the join partners in S.
So we basically end up with the correct position list.
And then as output, we have the array of R,
which has as many elements as its own position list,
since we write all the duplicates in the position list
explicitly.
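The gather itself parallelizes trivially; a hedged one-element-per-thread sketch (illustrative names):

```cuda
// Hedged sketch of the gather step that reverses the sort: each thread copies
// one element from the position given in the position list.
__global__ void gather(const int *input, const int *positions, int *output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[positions[i]];  // conflict-free writes, one per thread
    }
}
```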
Right, are there questions until here?
Right. So we continue with multi-GPU data processing.
So far, we've learned about the principles of extracting maximum
performance from a single GPU and how to transfer the sequential
CPU programming model to a parallel GPU programming model.
However, we've seen this graphic in the previous lecture,
and we know that the faster the memory,
the lower the capacity of this memory is.
And for the fastest memory, we are quite limited.
So we are in the order of kilobytes.
GPUs currently have plenty of global memory.
However, the access there, as we saw, is up to two orders
of magnitude slower,
and this trend continues the lower we go in the memory hierarchy.
This makes us aware of where we should allocate memory for
which part of our workload, and we manage the lifetime
of the allocated memory accordingly, again
to optimize the memory accesses.
In general, the GPU global memory
is much more limited compared to DRAM.
Currently, GPUs offer up to 188 gigabytes of global memory.
However, this is still an order of magnitude below the maximum DRAM capacity,
where we can get into the order of terabytes.
So that's one aspect.
And then the second aspect is that whatever we process in the GPU, we need to transfer it from the main memory.
So by default, GPU memory is not meant as storage of any kind.
And basically, whenever we need to access data
outside of the GPU memory, it means
that we go to the main memory.
And here we are dependent.
So we saw this picture, right?
So we need to go from GPU global memory to host memory.
So basically, we're dependent on the interconnect there.
And currently, there's different types of interconnects
that are put into GPU systems
by different vendors.
So as an example, PCIe is the vendor-independent,
de facto standard interconnect between host
memory and GPUs.
NVIDIA has their own NVLink as their GPU interconnect.
However, we'll see that there are systems
that use this as the interconnect
between the CPU and the GPU.
And AMD has the Infinity Fabric
as their global interconnect.
So they use it for interconnecting GPUs,
but also as an interconnect
in their accelerated processing units.
So let's take a look at basically what
are the properties and the performance
that each of these interconnects offer.
So PCI Express, as I mentioned, is the general CPU
to GPU interconnect.
It connects the CPU to the storage, to the memory,
and to any accelerator that we have.
So it doesn't need to be GPU exclusively.
This can be also any FPGA or any custom coprocessor
that's connected to the CPU.
And depending on the version and the end device
that we are connecting to the CPU,
we can use one, four, eight, or 16 lanes of PCIe interconnect.
Not necessarily all lanes are used.
So as you can see, different devices
require different connectivity.
By default, the GPUs are connected through all 16 lanes
with PCIe.
Now what's interesting in this slide is these two numbers
here.
So we have PCIe 3.0.
This was, until recently, the standard.
And this allowed for 16 gigabytes per second of bandwidth
across all 16 lanes in total. So given that GPUs are accelerators with more than 1
terabyte per second of internal bandwidth,
this obviously becomes a severe bottleneck
when transferring data from main memory to GPU memory.
And this is getting better with time and with the new PCIe standards.
PCIe 4.0 basically doubles this to 32 gigabytes per second.
And there have already been announced devices with PCIe
5.0 as the CPU-to-GPU interconnect, which doubles the bandwidth of PCIe 4.0 again,
to up to 64 gigabytes per second.
However, this is still far below the terabytes per second of bandwidth that the GPU has internally.
So some of the vendors were very aware of this.
And they designed their own interconnects,
proprietary interconnects for their accelerators.
And one example that I already mentioned
is NVLink, which is built by NVIDIA and is used mainly as an interconnect between two GPUs.
It allows high-bandwidth transfers of 250 gigabytes per second of peer-to-peer bandwidth,
and the current version, NVLink 3.0,
allows up to 300 gigabytes per second
of interconnect bandwidth.
And this is just peer-to-peer, just one GPU to one GPU.
What happens if we have a device with multiple GPUs
on which we want to perform
an all-to-all memory transfer? Then we'll
have to perform individual transfers
from each GPU to every other GPU,
and again, this is suboptimal in terms of data transfer
performance.
So there has been a solution to that as well.
That's the NVSwitch.
Basically, this is a switching element
which basically allows us to interconnect all to all up
to 16 GPUs.
And this allows concurrent memory transfers between all 16 GPUs at the maximum bandwidth
of 300 gigabytes per second that NVLink already allowed for peer-to-peer.
Now this carries over to all-to-all data transfers among up to 16 GPUs, which
effectively increases the global memory that each GPU can access.
Because if we have a system with, say, eight or more GPUs
that have more than 40 gigabytes each,
we know that the interconnect between them
is way faster than either the interconnect from host memory
to the CPU or from the CPU to the GPU.
Basically, once we transfer that data,
we can basically assume that we have much larger GPU memory
than we actually do.
Because the bottleneck won't be the peer-to-peer transfer,
but the transfer from main memory.
So something similar that AMD offers, basically,
is the Infinity Fabric.
As I mentioned, this is their general interconnect. They use it for interconnecting CPUs, GPUs, and vice versa.
They achieve up to 204 gigabytes per second
for their CPU interconnect, or when used as a CPU
interconnect.
And when used as a GPU interconnect bridge,
the bidirectional bandwidth is up to 340 gigabytes per second.
So for interconnecting GPUs, state of the art,
we are in the order of around 300 gigabytes per second,
independent of the vendor.
So now let's see how this information about the
interconnects and the GPU
bandwidth can be used to build multi-GPU systems.
One option, the most basic configuration of such a system,
is when we have a single CPU and two GPUs.
This means that we have a single socket and a single main
memory access.
We have no NUMA effects,
and all of the data transfer goes through the CPU
interconnect and its connection to main memory.
And basically, what we need to be aware of is now the interconnect between the GPU and the CPU.
If there is an interconnect between both GPUs, then we need to be aware of that bandwidth as well.
We have an implementation of this system in our group.
So we have a small server.
So basically all of the demos that I've shown
were executed on this server
that has quite a powerful CPU
and two high-range consumer GPUs.
So these are not data center GPUs,
but quite strong consumer GPUs.
So there in this system, basically what we have
is PCIe 4.0 as the CPU-to-GPU interconnect, with 32 gigabytes per second.
Not yet 300.
And between the GPUs, we have up to 56 gigabytes per second.
This is still NVLink 3.0;
however, the bridge doesn't provide all lanes,
so the maximum NVLink bandwidth of 300 gigabytes per second
is not reached in this implementation.
So then we can build a more complicated system than this.
Basically, when we need two CPUs for increased computational
bandwidth on the CPU side, then basically this
is what we end up with.
So we have a dual socket system that's
basically where both CPUs are interconnected
via a proprietary interconnect.
And then each of these CPUs have their own dedicated memory
access. And they're either connected to a single GPU
or to multiple GPUs.
So one example that we have here is dual CPU and four GPUs
in one system.
And there, again, we need to be aware of three types
of interconnect.
Basically, how is the CPU connected to the GPUs
and to how many of the GPUs?
And how are the GPUs connected between each other?
And what we can see in this schema,
in this implementation, the Delta system that's also available at HPI, is that we have a PCIe
3.0 CPU-to-GPU interconnect, and each CPU is connected to two GPUs. This is all fairly simple. However, on the GPU side we have,
I would say, unorthodox interconnects between them. As you can see, GPU 0 and 1 are
interconnected with two NVLink lanes, as are GPU 0 and 2, and
GPU 2 and 3 as well.
However, from GPU 1 to GPU 3, we have half the bandwidth,
or we have a single NVLink 2.0 lane.
So basically, what's important here
is if we develop an algorithm or we run a workload that's
running on the system, we need to be aware of the data
placement on each of the GPUs.
Because if we require the inter-GPU communication,
then we can see already clearly that there
can be better or worse ways to exchange the data.
If we exchange the data only from GPU 1 to GPU 3,
we'll get just half the bandwidth compared
to the other interconnects.
And basically, to mitigate this, one straightforward solution
is we make all GPU interconnects equal.
And not only that, we can also make the CPU and GPU
interconnect equal. So, jumping from the 16 gigabytes per second PCIe 3.0 interconnect from CPU
to GPU, we can move to a CPU-to-GPU interconnect
that's equivalent to the interconnect
between the GPUs themselves.
And now we basically don't lose any bandwidth
when transferring data.
And this is quite a unicorn system, I would say.
So it was developed by IBM.
We have two of them in our data center.
And what's interesting here is that the CPU-GPU interconnect
is using NVLink instead of PCIe.
So this means that we have the same NVLink 2.0 interconnect
with 75 gigabytes per second of bandwidth between the CPU and the GPU
and between the GPUs themselves.
However, what's also interesting in this system
is that we have a proprietary interconnect
between the two sockets.
And there, we achieve 64 gigabytes
per second per direction.
So if we want to transfer something from CPU
or we want to transfer data from CPU 0 to GPUs 2 or 3,
we need to go through this proprietary interconnect.
What's also interesting in this system
is that this bandwidth is split.
Half of the bandwidth is dedicated to data transfers.
The other half is dedicated to basically sharing compute instructions and data
requests between the CPUs.
So what we effectively end up with for the CPU-to-CPU
interconnect is 32 gigabytes per second
of bandwidth in a single direction. So again, when we develop workloads
that run on the system, we need to be aware from which socket
do we read the data, how is this data transferred,
and again, not necessarily distribute to all four GPUs.
But if we see that the computation required
is not as intensive, we might be better off just transferring
the data just to the two GPUs that are directly
interconnected to our memory socket.
And then if we need more GPU memory and more GPU compute,
then we can go up to eight GPUs in current systems.
And there, since we move past the double interconnect
between the GPUs, now we need to make sure that these GPUs can communicate
between each other in an efficient manner.
So one implementation of such a system
is the DGXA100 that's developed by NVIDIA.
And there, we basically see the use of the all-to-all data
transfer, all-to-all interconnect that
allows quick data transfers between all GPUs concurrently.
And there, in this system, what's interesting to see
is that even though we have 300 gigabytes per second bandwidth
between each GPU,
we are pretty much bound by the CPU-to-GPU bandwidth that's
provided by the PCIe 4.0 interconnect.
And what's even more interesting is
that the interconnect is split between the GPUs,
meaning that, for example, if we want to transfer data
from main memory to GPU 0 and the GPU 1,
both of them will use the same PCIe lane.
So we will never reach 32 gigabytes per second for each of the GPUs;
instead, this will be split into up to 16 gigabytes per second per GPU.
So there we need to be aware that if we only need to use two GPUs, or we need to transfer data
from main memory to only two of the GPUs on this side of the system, or from this socket,
we are much better off using, for example, GPU 0
and GPU 2, or any combination that doesn't
share the same PCIe lane.
So what I'm aiming for with explaining all the systems
and these considerations is that we
need to be aware of the topology of our system.
Because the more components or the more processing chips
that we have, the more interconnects
that we need to pass through.
And they differ in their bandwidth,
and they differ in their connectivity. So to get a closer look at the behavior of these systems when we
transfer data, we'll look at some benchmarks that we ran on them and at how
they behave when we actually transfer data.
And are these numbers, basically the advertised numbers
on the right-hand side, are they achievable
if we try to transfer data?
So let's see.
In the first system that I showed,
this is a small GPU server that has only one CPU and two GPUs.
We want to see whether we can achieve
the advertised numbers for the CPU-to-GPU
interconnect and also for the interconnect
between the GPUs.
So if you take a look at the graph A, what we're doing here
is we are just transferring serially data from the CPU
to GPU 0 and GPU 1.
And basically, what we achieve is
once we transfer data from the host, from the CPU to GPU 0,
we achieve 26 out of the 32 gigabytes per second.
The same bandwidth we achieve when
going in the reverse direction, when
we're transferring the data back from the GPU to the CPU.
Regarding the peer-to-peer transfers between the GPUs,
we achieve 42 and 43 gigabytes per second out of the 56
gigabytes that are advertised.
So these are not bad numbers.
They're just not quite as advertised
because other than the pure data that we are sending,
there are also data requests and some memory checks that
need to be executed that hinder the bandwidth,
hinder the physical capabilities of the interconnects.
What's also important to see is that we have almost 2x discrepancy between the bandwidth
of the CPU to GPU interconnect and the peer-to-peer interconnect.
Right, so then if we take a look at the just pure peer-to-peer transfers,
again, we might want to basically transfer data in parallel,
meaning that we want to either transfer data in parallel
from the CPU to the GPU and the other direction,
or we might want to concurrently transfer data
between both GPUs.
And what we achieve here for the CPU-to-GPU parallel transfers
is basically good scaling:
we scale linearly
and achieve twice the bandwidth.
In the case of the parallel transfer from device up,
so we achieve double the bandwidth when we move the data concurrently from both GPUs
to the CPU or from the CPU to both GPUs.
And there, we basically see twice the bandwidth
of the CPU-GPU interconnect.
What happens if we move the data concurrently
or in parallel between the CPU and the GPU?
Then we basically achieve less than twice
the bandwidth of 42 and 43 gigabytes
that we achieved when performing only the serial transfers.
Similar holds true when doing transfers between the GPUs. So there we achieve close to twice the bandwidth
between the GPUs when transferring in parallel.
So this was a very simple single CPU to GPU system.
If we take a look at the Delta system,
which has PCIe 3.0 between the CPU and the GPUs
and also has asymmetrical GPU interconnects,
There we see varying performance, basically.
I would start with the GPU transfers,
because there we see a very sharp decline
in the performance.
When transferring data between
GPU 0 and either GPU 1 or GPU 2, we achieve very close to the advertised bandwidth of 50 gigabytes per second;
we measured 48 gigabytes per second, which is very nice. However, when we transferred data from GPU 0 to GPU 3,
we achieved only 9 gigabytes per second of bandwidth
between these two GPUs.
Somebody wants to guess what happened, why?
Yeah.
If it has to go over another one, it's probably going to take some bandwidth off?
What are the hops exactly?
Like, to get from 0 to 3, it needs to go either through 2 or through 1.
There's no connection between 0 and 3.
Yeah, exactly, there's no connection between 0 and 3. That's true.
But, for example, if we go either
through GPU 2 or GPU 1 to GPU 3, there we have 50 gigabytes
per second, and then we have 25 gigabytes per second.
Yet we achieve nine.
It's a sharp decline. And the reason is that, by default,
when we want to transfer data between GPUs that are not
directly interconnected, we always go through the CPU.
So we are not utilizing the 50 and 25 gigabytes per second; we go through the PCIe 3.0 interconnect.
Then we move through the CPUs.
And then we go to GPU 3.
And unless explicitly stated otherwise,
this is the default data transfer behavior.
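To make this explicit: in CUDA, direct peer-to-peer copies over the GPU interconnect generally have to be enabled by the application; a hedged sketch (illustrative names, error checking omitted) might look like this:

```cuda
#include <cuda_runtime.h>

// Hedged sketch: explicitly enabling peer access so that directly connected
// GPUs can copy over their interconnect instead of bouncing through the CPU.
void copy_peer_to_peer(int dst_dev, int src_dev,
                       void *dst, const void *src, size_t bytes) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (can_access) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);   // flags must be 0
    }
    // With peer access enabled, this copy can use the direct GPU interconnect;
    // otherwise the runtime stages it through host memory.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```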
And with this I just want
to emphasize the point that we need
to be aware of the topology.
Because this is a decline of almost 6x.
And that's quite significant in such a system.
Something very similar happens when we do parallel transfers.
If we do 0 to 1 or 2 to 3, all is good,
because they're interconnected
through the double NVLink lanes, and we
achieve close to the theoretical maximum of 100 gigabytes per second.
We measure 97.
We measure 97.
What happens here when we try to do parallel transfers
between 1 and 2 and 0 and 3?
For 0 and 3, we achieve the very bad 9 gigabytes per second.
And then this averages out to 30 gigabytes per second
when we consider the transfer to 1 and 2, GPUs 1 and 2.
Right.
So regarding the transfer from CPUs to GPUs,
this is, I would say, quite intuitive.
We just measure a bit below the theoretical allowed bandwidth
from the PCIe 3.0 interconnect.
So let's see how things changed when
we have symmetrical interconnects,
and not only symmetrical interconnects between the GPUs,
but also between the CPU and the GPU.
There we saw some interesting consequences.
So again, I'll start with the peer-to-peer transfers
between GPUs.
So then there, if we transfer data from GPU 0 to GPU 1,
we see, again, very close to the theoretical maximum, which
is 75 gigabytes per second.
What happens when we transfer data from GPU 0 to GPU 2
or to GPU 3 is that we reach less than half
of this bandwidth.
Again, there we don't have any direct physical
interconnect between those GPUs, so we need to go through the CPU: we
go through our interconnect to reach the CPU, then over the CPU-to-CPU link with the halved bandwidth that I mentioned earlier, and then we reach GPU
2 or 3.
We see the same effects when transferring data in parallel.
So the interconnected GPUs are all fine,
very close to the maximum of 150 gigabytes per second.
And we see up to a 3x decrease when
we try to do parallel transfers between the GPUs that are not interconnected
with each other. Are there any questions up to this point? Yeah.
I have a question. How do you connect this GPU to the main board, not with PCIe but with NVLink?
Is it like, I'm just curious if you ever see that.
Yeah.
So this is, as I said, a unicorn system.
This is like the only system that allows this.
And basically, the reason is that the POWER CPU allows this.
IBM developed a CPU that
doesn't require a PCIe interconnect
to other accelerators, but allows an NVLink interconnect instead.
It has like a connection to the plug-in.
Exactly.
Yeah.
And just an important note, that's only the CPU-GPU
interconnect.
The interconnect to main memory is still standard.
Right.
Yeah.
However, what's interesting in this system
is the CPU-to-CPU interconnect.
If you look at the CPU-to-GPU transfers,
that's where we saw all the bottlenecks,
and the part of the system most responsible for them
was the CPU-to-CPU interconnect.
So there, now we go to all-to-all communication territory,
and there we also saw some interesting behavior of the system.
So I already spoiled a bit when I mentioned the split PCIe interconnect between two GPUs,
the performance of which we'll see on the plot above.
So if we transfer data from the CPU to GPUs 0, 1, 2, and 3,
we achieve a maximum of 24 gigabytes per second.
Which, yeah, in theory, we should achieve something close to 64.
But since we have a split PCIe lane, that's not the case, right?
So we have a maximum theoretical bandwidth of 16 gigabytes per second to each of the GPUs,
and the measured one is quite a bit lower than that.
But it's in line with what we measured
with the single PCIe lanes, where we achieved 12 out
of the 16 gigabytes per second that are theoretically possible.
So that's on the CPU side. Then basically if we look at the
all-to-all transfers which are basically what's interesting about the system, we
achieve consistent bandwidth of 279 gigabytes per second for any combination
of GPUs or data transfer between any of the GPUs.
And there, what's interesting is basically what happens
when we try to do parallel transfers between the GPUs.
So there we see some discrepancy between co-located and not co-located GPUs.
And there we see that, despite having
NVSwitch, which in theory allows
consistent performance for interconnecting all GPUs,
the peer-to-peer transfers
are faster between co-located GPUs,
or between the GPUs that share the same PCIe lane.
So for example, we show one example here.
The transfer between GPU 0 and GPU 1
is faster than the transfer between GPU 0 and GPU 2.
And basically, this corresponds to the collocation part.
The reason for this is that NVSwitch
puts the interconnect between co-located GPUs
on the same physical bank. I won't go into details,
but the co-located GPUs
share the same bank for handling the data requests
in the NVSwitch,
which is what lets this performance gap appear.
Yeah, but I mean, the drop is up to 10% per GPU.
These are parallel transfers.
So right.
What's also interesting to see is that the performance discrepancy in this system is up
to an order of magnitude between the data transfers from the CPU to the GPUs and the transfers between
the GPUs themselves. So when developing algorithms or workloads for the system, one can ask: should we just do a single allocation and a single transfer
in the beginning, and then basically not move the data out
of the GPUs at all?
And in such systems that basically have, in this
instance, 40 gigabytes per GPU, we have 320 gigabytes of GPU
memory.
This is a significant amount of memory that we can work with
once we transfer the data there.
And with current implementations where there
are GPUs that have more than 180 gigabytes per GPU memory,
we can reach up to a terabyte of GPU memory.
That's very, very high bandwidth memory.
So the notion of in-memory processing
is changing with these developments.
So I mentioned some drawbacks or features
of each of the systems.
And basically, the main conclusion
is that we need to be aware of the topology of the systems
to develop algorithms that will efficiently use this.
A few of the tips that I implicitly mentioned:
we need to pre-allocate memory
and reduce data transfers as much as possible.
This is in line with what we've learned for single-GPU
processing so far.
However, now we need to organize the peer-to-peer communication
and basically maximize the bandwidth
according to the topology.
We need to be aware of the peer-to-peer interconnects.
However, we also need to be aware of the access to the CPU
and to the main memory.
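To make the pre-allocation and transfer-reduction tips concrete, one common pattern is to pin the host buffer and overlap chunked copies with compute on streams; this is a hedged sketch with illustrative names and a placeholder kernel, not the code used in our experiments:

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Placeholder compute kernel; stands in for the actual per-chunk work.
__global__ void process_chunk(int *chunk, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] += 1;
}

// Hedged sketch: pinned host memory plus two streams so that the copy of the
// next chunk overlaps with the kernel running on the other stream.
void transfer_and_process(const int *host_src, size_t n_total, size_t chunk) {
    int *pinned, *dev[2];
    cudaMallocHost(&pinned, n_total * sizeof(int));   // pinned staging buffer
    memcpy(pinned, host_src, n_total * sizeof(int));
    cudaMalloc(&dev[0], chunk * sizeof(int));
    cudaMalloc(&dev[1], chunk * sizeof(int));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (size_t off = 0, i = 0; off < n_total; off += chunk, i ^= 1) {
        size_t len = (off + chunk <= n_total) ? chunk : n_total - off;
        cudaMemcpyAsync(dev[i], pinned + off, len * sizeof(int),
                        cudaMemcpyHostToDevice, streams[i]);
        process_chunk<<<(len + 255) / 256, 256, 0, streams[i]>>>(dev[i], len);
    }
    cudaDeviceSynchronize();
    // ... free buffers and destroy streams in real code
}
```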
Right, and how we can use this to accelerate sorting.
I'll quickly go through three types of algorithms
that accelerate sorting on multiple GPUs.
So one, the first one is peer-to-peer based GPU sorting.
I won't go through the individual stages of the algorithm;
I'm just going to focus on the fact that this is an in-place sorting
algorithm.
So there is no communication between the CPUs and the GPUs
once the data is transferred to the GPUs.
So the sorting is done in place.
And just because of this, basically, the algorithm
is limited by the total GPU memory.
So the maximum amount of data that we can sort
is basically a bit less than 50% of the total memory.
And here, you can already see the trade-off
that we do with peer-to-peer algorithms.
We do single memory allocation, single memory transfer
between the CPU and the GPUs.
And there, we basically utilize the computational bandwidth
of the GPU interconnects.
And the trade-off of this is we can process less data.
In this case, half of it.
So another option, another alternative,
is to do heterogeneous implementation of a sorting
algorithm.
And here, sorting is just a use case, right?
This can be applied to any operator, right?
So if you want to do in-place joins, you can do so.
You can do heterogeneous joins.
And this goes on for every operator.
So what happens here is that now we are not
limited by the combined GPU memory.
However, we are limited by the access to the main memory.
Because once we sort everything on the GPU side,
then we need to merge the sorted chunks on the CPU side.
And the merging and the transfer to the CPU
are the bottleneck of this
approach. The third option is an algorithm that
was developed in our chair as a student's master's thesis.
There we do radix-partitioning-based sorting,
where we utilize the peer-to-peer interconnects
of the GPUs.
And again, we do in-place sorting.
However, we accelerate the sorting and the allocation
phase by doing a bucket exchange between the GPUs
and also an initial scattering phase of the data.
So just to compare how all of these perform, so on the DGXA100, as a reminder, this was
the 8 GPU system with NVSwitch interconnect.
There we basically see that the heterogeneous approach, where
we merge everything on the CPU, performs worse
compared to the other two.
Again, as I mentioned, this is mainly due to the merge stage.
And here you can see basically how big of a part
the merging phase plays into the whole algorithm.
There we can compare the peer-to-peer approaches,
the radix-based partitioning and the peer-to-peer sort.
And there we see similar performance.
The radix-based one is around 10% to 15% better, depending on the number of GPUs.
And what I would also like to point out: just take a look at how long the sort phase
actually lasts in all of the algorithms.
So it's the pink one.
And for example, in the Radix one, it's overlapped
with the data transfer, because we do parallel transfers
once every sorted chunk is completed.
But it's basically along the lines of the other two.
So more than 80% is spent on data transfers.
And the actual compute is almost for free in this case.
The trends are the same across both systems.
So this is the AC922. So this is the IBM machine with NVLink used as the CPU
to GPU interconnect.
And there, basically, what we see
is, again, very similar trend.
And again, similarly, the sorting
is taking the least time in the whole workload compared to the data transfers
and to the merging in the heterogeneous approach.
And again, this is a system that has symmetrical interconnects.
So basically, this is as good as it gets in terms of pure interconnect bandwidth
between GPUs and also between the CPU and the GPU.
Again, still the compute is not the problem.
The bottleneck is in the data transfers and processing
performed on the CPU in the heterogeneous approach.
So these are some of the takeaways
that basically summarize what I've been talking about so far.
So scaling on multiple GPUs has benefits,
but they are limited.
Basically, the main limitation is
that if we develop an algorithm that does not consider
the topology of the system, basically the benefits will be non-existent.
In order to actually exploit the benefits of multi-GPU systems, we need to be very aware of the topology
and utilize this to the benefit of our workload
and our approach.
Just as a side note, the overall fastest solution
was the 2GPU mode on the AC922.
Basically, we are sticking to the symmetrical interconnect
as the overall best solution.
Basically, the most powerful, the most computationally
powerful system is heavily bound by the CPU to GPU interconnects.
As I mentioned earlier, we
have more than a 10x performance discrepancy in this case.
And these are the systems; I already explained some of the ones
that we have in our data lab.
One system that I didn't cover
is the Dell PowerEdge,
which is a smaller version of the DGX-A100.
Basically, it has four GPUs and a single CPU.
Right.
So we've done some work on these topics in our group.
And yeah, basically, almost all of the findings
on multi-GPU data processing that I presented today
are part of the works listed here.
If you want to work in this area, you can contact our group.
And we can find an interesting project that we can work together on.
And with that, I would like to thank you for the attention.
And that wraps up the data processing on GPU part
in this lecture.