Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Processing on GPUs I
Episode Date: June 25, 2024...
Transcript
Welcome everyone. Today I'll give a lecture on data processing on GPUs. Some of you I know already, some of you I haven't met yet. I'm Ilin, I'm a PhD student in Professor Rabl's group for data engineering systems, and I've done some research on data processing on GPUs. So today we'll discuss the introductory part,
or basically what are the main principles
behind GPU data processing, its programming model,
GPU architectures, et cetera.
So this is a short announcement.
We're always on the lookout for student assistants.
So if you're interested in any of the topics presented
in this lecture, you can chat with Professor Abel or myself.
And we can find a nice project topic or maybe a master thesis
topic when the time comes for you to register.
OK, so where are we in the lecture?
Last time, you talked about networking.
And from this lecture on, we start with the accelerators.
So we'll start with two lectures on data processing on GPUs.
And then you'll talk about FPGAs and CXL.
Right, exactly.
With task four coming in the next few weeks.
So in this lecture, we'll talk about what coprocessing is, why we need accelerators, how general-purpose GPUs can be used to accelerate data processing, and what the programming model for data processing on GPUs looks like, both in terms of the physical execution model and from the logical perspective.
Well, finally, we'll talk about some performance considerations
that apply generally to massively
parallel processors.
And since we start now with parallel accelerators, we'll cover some of them as well.
The lecture is based on these two books, mainly the orange one, Programming Massively Parallel
Processors.
It's a very nice introduction. So if somebody is interested in the topic,
I can recommend the book.
It's based on the CUDA programming model,
so mostly NVIDIA GPUs and principles implemented there.
However, the introductory part has enough depth
to basically introduce all the general principles for data processing on GPUs.
And I've also used additional sources like Carsten Binnig's lecture on advanced data management systems and Sebastian Breß's lecture on in-memory databases on modern hardware. Right. So coprocessing, and why do we need it?
So I think you've seen this plot in this lecture.
So we'll just do a quick recap on Moore's law,
and basically, where does it stop?
The main reason for this is that the increase in transistors in the CPU doesn't yield higher single-core performance anymore. And this holds true from the mid-2000s. The main reasons are that performance is capped due to either economic feasibility, power and safety constraints, or all of the above.
So one natural solution is to add more cores.
You've talked in the multi-core CPU lecture.
However, this comes at a cost as well.
So by Dennard scaling, the idea is that the transistor density doubles, but the power consumption stays the same. However, again, since the mid-2000s, as you can see on the plot in the right corner there, this doesn't apply anymore. Because we see that as we scale the transistors, we don't have the same power consumption. Rather, the ratio goes below one, so it doesn't scale linearly. And this is mainly because smaller surfaces increase the risk of current leakage. We need more control units and more space between the cores. If we don't do that, we have either heating problems or power leakage. And this caps the frequency that a processor can achieve.
And again, one temporary solution
is just add more physical cores.
Do not extend the core itself.
However, while this helps to keep the energy consumption per core low, we cannot run all cores simultaneously.
And parts of the physical cores are unused because of that.
And what we end up with is dark silicon,
which means underutilized CPU resources, which were already
expensive to develop and design.
So in this part, we introduce accelerators
because we have them.
Yes?
Why can't they run all at the same time?
So they run, you run them in parallel, right? But physically, they're exchanging turns. You cannot run them all at the same frequency.
OK. So there's like a physical limit to how much electricity we can put in the CPU in total?
Exactly.
And basically, the idea is that we move from the flexible programming models of the
CPU to multi-core CPUs, but then there is hardware that can basically utilize parallelism
even more.
And we call these devices accelerators, and you can see we're talking about GPUs.
This is right in the middle in terms of programming flexibility
and compute efficiency.
And basically, the idea is that we utilize this new hardware.
We apply a different programming model that basically
is focused on the throughput rather than low latency.
And we reach higher throughput levels by utilizing parallelism.
And in the beginning, this was aimed solely towards scientific workloads.
And the main reason was that basically GPUs were highly, highly specialized co-processing chips that were aimed only towards image
rendering, which meant parallel floating point computations.
And some computer scientists noticed
that the performance of GPUs increased 2.5x yearly, while during peak Moore's law, CPU performance increased 1.8x yearly.
So they wanted to try out some alternative data processing
and basically try to program and utilize
the parallelism that the GPU offered.
However, the problem was that the programming model
was based on DirectX,
which is a graphics interface library, and major adaptations were needed to execute the scientific workloads, which came mainly from the biology domain. After seeing the adaptations that were required, they said, okay, let's design a programming framework and design the chips in such a way that we can use them for general-purpose programming.
And this is basically where general-purpose GPUs come into play.
And let's take a look at their design.
So on the left-hand side here, we
have a classical CPU core, or classical CPU with four cores.
And basically, we can see that for each core,
we have a control unit.
We have L1 caches.
We have L2 shared cache, L3 cache, and then we have the main memory.
You can see that the same components are more or less in the GPU cores. However,
there are some fundamental design differences that we need to note.
So CPU cores are aimed towards minimizing the latency of arithmetic operations. Their design is sequential and latency oriented, and for this they require a lot of control units, power management, and of course their designated cache memory, so they can be efficient and achieve this low latency.
GPU cores, on the other hand, need
to process large numbers of floating point operations.
And also, they need concurrent memory access.
So basically, they have a different design. The goal is to achieve higher throughput. And today's GPUs can achieve around 1.5 to 3 terabytes per second of memory throughput.
And basically, this is the result
of having thousands of these cores.
However, the big difference is that the cores require much less power because they have far fewer control units.
And their functionality is way more limited
without any control unit per core,
but rather this is organized in a more coarse way that
we'll discuss in a bit when I go through the architecture.
And basically, the result of this is massive parallelism.
While the CPU peak bandwidth can be around 100 gigabytes per second in today's CPUs, current GPUs, for example the A100, currently one of the top GPUs that NVIDIA offers on the market, offer 1.5 terabytes per second. And the upcoming H100 will offer up to 3 terabytes per second of memory bandwidth.
Now, let's see how this GPU architecture is organized.
Basically, how do the compute units differ from the CPU ones, and how can we access this high bandwidth that I mentioned?
So the main computational units are, in general,
called compute units.
And similarly to the CPU cores, they
contain multiple computational cores, register files, L1 cache
that's tied to the computational core,
and shared memory that can be accessed
by the whole compute unit.
Now, one difference that you can see here
is that each compute unit itself contains multiple cores.
So in general, this is around either 32 or 64,
depending on the architecture.
So basically, 32 cores or up to 64 cores
are managed under a single compute unit, which
means that basically all cores can access the same cache,
which is very low latency and high bandwidth memory.
And also, they can communicate with other compute units
through the shared L2 cache.
And basically, the lower we go in the memory hierarchy,
we increase the latency to access.
And what we can see here, or here rather,
is that to access GPU global memory,
we steadily increase the latency. And while this is high bandwidth memory, like we said, up to three terabytes per second of bandwidth, the cost to access the global memory is quite high.
We'll go into more detail a bit later.
So the memory hierarchy is basically split into two.
We have the on-chip memory that consists of the L1 cache and the shared memory on the compute level,
on the compute unit level, and then L2 cache that's shared across all compute units.
And the sizes differ by quite a lot.
So the L1 cache is around 200 kilobytes.
So I think Intel's new data center GPU just increased this
to 256 kilobytes of L1 cache.
However, the L2 cache can go up to 400 megabytes.
And this is also from Intel's new data center GPU.
And it's an 8x improvement over what was already
state of the art, which was around 50 megabytes
for Nvidia's latest GPUs.
So we have a large L2 cache.
And then the GPU global memory can go up to 188 gigabytes.
However, as high bandwidth as this memory is, we have high access latency. So moving from the L2 cache to the GPU global memory, the access latency increases by up to two orders of magnitude. And yes, the physical distribution of the memory is that we have so-called on-chip memory
and on-device memory.
Regarding terminology, I mentioned compute units as the main computational centers of
the GPU.
And different vendors use different terms for this. I'll mainly focus on AMD's and NVIDIA's terminology or nomenclature because these are the biggest
vendors as of today and their GPUs are most widely spread.
In this lecture, we'll discuss specifically the architecture of NVIDIA GPUs, as these are the de facto standard.
And also, for the programming model that we'll discuss,
this is based on NVIDIA GPUs.
We use them here at HPI in our data lab.
And also, most of the cloud providers
are exclusively offering NVIDIA GPUs.
So we'll focus on that terminology.
And basically, the idea is that you
can map the terms that you already know from the CPU
lectures to the computational parts or the computational units
of the NVIDIA GPUs.
So for example, one multiprocessor or a multicore
CPU, you can look at it as a streaming multiprocessor.
One core, one physical core is one computational core.
In the NVIDIA GPUs, a thread is a thread.
And a vector unit is a SIMT warp in an NVIDIA GPU.
What's a SIMT warp?
And what does a computational core consist of?
And what does a streaming multiprocessor consist of?
I'll talk about in a second.
So independent of the vendors, one compute unit basically consists of four main parts.
We have the computational core.
As I said, we can have at least 32 or up to 64 in current GPU
models.
We have the registers, which are the private memory
for each computational core.
Then we have the local memory that is shared by all cores in the compute unit.
And we have a thread scheduler that basically
assigns the threads to the thread groups
and then to the cores.
What's a thread group?
We'll discuss in a bit.
Just I'm trying to not go too quickly with the terminology.
But what's important to know from this slide
is basically the four different parts
that each computational unit from the GPU has.
And basically, how does this map into a real GPU?
So like I said, we'll take as an example one of the GPUs
that we have at HPI.
One of them is the NVIDIA A100 GPU.
And basically there we have the computational cores.
Again, we have the L1 cache and the shared memory.
We have a warp scheduler.
We have the register files, and we have the dispatch unit.
So we start with the computational cores.
There are two types of computational cores in NVIDIA's streaming multiprocessor.
So basically we have the CUDA cores, which process integers and floating point numbers.
And we also have the tensor core.
The tensor core is a specialized type of core that operates solely on tensors.
I won't go into detail on either of those.
It's just important to remember that there is a distinction between the inputs that they're processing.
And computational cores in a GPU are stream processors.
And you can imagine or you can see it
as an equivalent to a vector lane in a CPU.
And then basically the streaming multiprocessor,
you can see it as the equivalent of a full CPU core.
Since we have multiple computational cores in one streaming multiprocessor, in this GPU specifically we have 64 CUDA cores per streaming multiprocessor and 108 of these streaming multiprocessors, this results in around 7,000 computational cores in this GPU. So you can imagine a 7,000-core CPU and the parallelism that it can achieve. Now, what type of memory does the streaming multiprocessor access?
access?
Basically, we have the L1 cache and the shared memory.
And these were separate until around 2018, which meant that the L1 cache was managed automatically by the hardware, and the shared memory was managed explicitly by the code that we provided. Now they're combined, which means that there are no cache misses.
They can be configured for each streaming
multiprocessor separately.
And they share the same address space, which basically
allows us to have aligned memory accesses,
efficiently use the cache lines, and basically lower
the latency for memory accesses.
And here, it says that they're combined, so we have 192 kilobytes combined L1 cache and
shared memory, and this can be programmed.
So the developer can decide how much of this memory will be an L1 cache and how much of
this will be shared memory.
There is a limit, though: the shared memory can go up to 164 kilobytes, and the remaining 28 kilobytes are exclusively reserved for the L1 cache.
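Just as a rough sketch of that configurability (hypothetical kernel name; the split is a per-kernel preference hint, not a hard guarantee), the CUDA runtime lets you request a shared-memory carveout like this:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    // ... kernel body ...
}

int main() {
    // Ask for roughly 50% of the combined on-chip space as shared memory;
    // the remainder is then available as L1 cache for this kernel.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    myKernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```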
What's the main difference here is basically
the lifetime of the variables that we define
in each of these memories, which we'll
cover when we talk about the programming model a few slides
down the line.
So then outside of L1 caches and shared memory,
we also have the registers, which basically,
they send data directly to the computational core.
And for this, they don't require a load instruction
to fetch the data.
And basically, what this means is that each of the threads
has basically exclusive access to the register files, which are organized in 32 banks.
Each bank is basically the private memory to a single thread.
And basically, these 32 banks correspond to the 32 threads that can be executed at once
or in a single call in the GPU.
So basically, each thread accesses a separate register bank. Does this remind you of something, these banks that the threads access? You should have seen this before in the lectures. So this would be a vector lane, one vector lane in the register file. We'll get to that when we're discussing the programming
model.
And basically, how we manage the threads
and how do these threads access memory and execute compute,
basically, is through the warp scheduler.
And what the warp scheduler does is assigns groups of threads
in an execution queue.
And the naming here, it's called warp scheduler, not thread scheduler, because a group of threads is called a warp. A group of 32 threads is a single warp, which then corresponds to these 32 banks. Executing one warp per cycle allows concurrent execution of 32 threads.
And basically, depending on the design of the GPU,
they allow executing the whole warp in a single cycle,
which means that we have 32
threads per group.
For example, on my work laptop, I have 14 streaming processors
on my GPU.
The core count is around 900.
So basically, I have 64 cores per streaming multiprocessor.
So I can execute two thread groups or two warps
in a single cycle per streaming multiprocessor.
And basically, I can do this for the whole GPU.
I can do 28 warps.
On the A100, this is a bit of a different story.
This is a data center GPU.
It has 108 streaming multiprocessors.
The CUDA core count is close to 7,000.
And we set, again, 64 cores per streaming multiprocessor.
And if we execute two warps in a single cycle
per streaming multiprocessor, we have 216 warps
executed on the whole GPU. Multiply that by 32, and you get 6,912 threads executed in a single cycle.
Now, basically, how is this physically implemented,
and what's the GPU programming model on the physical level?
So let's discuss a bit about that.
So it's, like I said, in one streaming multiprocessor,
threads are scheduled into warps.
Also, I already gave a small spoiler.
Each warp has 32 threads.
And basically, one warp is executed in a single streaming multiprocessor. So these 32 threads that are grouped in the warp cannot be shared between streaming multiprocessors. They all correspond to the same thread block,
which means they access a particular part of the memory that's basically sequential,
and they execute the same instruction in parallel.
And on the warp level, the warps share the control unit in the streaming multiprocessor. By knowing the number of warps per streaming multiprocessor and the number of streaming multiprocessors, we can calculate that around 220,000 active threads can be spawned on the GPU (on the A100: 108 streaming multiprocessors times 64 resident warps times 32 threads per warp, roughly 221,000).
How are these threads scheduled?
Basically, there are two ways that this can be done.
Like we said already,
all threads within one warp
execute the exact same instruction.
However, we can, so this is what physically happens.
What we can define in our programs
can be something completely different.
So basically, we can say, I want the first two threads
to execute some instructions, and the rest something completely different.
And then after this branch is finished,
I want them to execute another instruction, again,
all of them together.
And basically, to execute this, there are two ways.
We can have a lockstep execution, which
happened in the past, or we can have an interleaved execution
model that basically happens or is implemented
in current generations.
So let's start with the lockstep execution model.
Basically, what happens here is that for each warp we have a program
counter and a stack for the whole warp. So we have reduced resources to track the thread
state which means that we don't require additional control units for this. However, this also means that we lose granularity
when processing the data on the threads.
So we don't allow inter-thread communication.
So by reducing the resources to track the thread state, we also limit the communication between threads. And what happens as a result is that we have serialized divergent branches. So all statements within one branch are executed until they're completed,
which means that, for example, in this small code snippet,
for the purposes of executing A and B,
all 32 threads will be active.
However, only four of them will actually execute the code.
The other 28 will wait for the completion
of instructions A and B.
And then, again, all 32 threads will be activated, but only 28 will execute X and Y. The remaining four threads will be idle.
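The snippet on the slide presumably looks roughly like the following (a hypothetical reconstruction; A, B, X, Y, Z stand in for arbitrary per-thread operations):

```cuda
__device__ void A(int *d) { d[threadIdx.x] += 1; }
__device__ void B(int *d) { d[threadIdx.x] += 2; }
__device__ void X(int *d) { d[threadIdx.x] *= 2; }
__device__ void Y(int *d) { d[threadIdx.x] *= 3; }
__device__ void Z(int *d) { d[threadIdx.x] -= 1; }

__global__ void divergentBranches(int *data) {
    if (threadIdx.x < 4) {
        A(data);   // only threads 0-3 execute A and B; under lockstep
        B(data);   // execution, the other 28 threads of the warp idle here
    } else {
        X(data);   // then threads 4-31 execute X and Y while threads 0-3 idle
        Y(data);
    }
    Z(data);       // after reconvergence, all 32 threads execute Z together
}
```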
Yes?
Does this happen at compile time, or does the GPU detect at runtime that there's a branch?
So this is explicitly defined in the CUDA code. We can define this explicitly. And at runtime, the thread scheduler assigns the threads to the memory locations and determines the execution order.
And what happens basically is, again, like all threads are spawned.
Part of them are just idling because of this, until we reach an instruction that is shared by all threads.
And then the state reconverges, and the execution continues.
And so basically the question
is, can we do better than this?
Well, yes, to a certain extent.
Yeah, thread concurrency, as I already mentioned, is not possible in this execution model. Trying to do so explicitly by using locks or mutexes just hurts performance and leads to deadlocks.
The alternative is the interleaved execution model,
where we allow interleaved execution of the instructions
in a thread.
So basically, what happens in contrast to what we've seen
here, in the previous slide, we have now
control units per thread.
So we have the program counter and the call stack
are tracking the state of each thread, not of each warp.
So now we have increased granularity.
This means a slightly higher overhead
because instead of one program counter
and a single call stack, now we have 32 of these.
What this allows us is to have interleaved execution
across code branches and also it
allows the threads to communicate within a warp while being executed.
And basically what happens is if we see the top chart, basically, we have the instructions A and X being interleaved.
Again, all threads are active while executing either of those.
However, we don't wait for the full branch to finish. Rather, we can update the state
by accessing both branches in an interleaved manner rather than wait for one of them to complete. Now, the issue here is that the threads need
to converge at some point.
And basically, the question is, when we get to instruction Z, does the reconvergence happen automatically?
So the answer is somewhat, yes.
So basically what happens is there's auto-reconvergence,
which will happen by default. It happens after the execution
of the first converging statement.
So basically, it will happen after Z is executed,
which basically is what's shown in the code there.
So basically, this __syncwarp synchronizes the state and allows all 32 threads to then execute computations
in parallel.
Why does this happen after the execution of Z?
It's because the interleaved execution model
allows communication between threads.
And if there is no explicit synchronization, the model assumes that this property of thread communication within a warp is being utilized and that there needs to be an update between the threads.
And basically what's shown in the second code block
is what happens implicitly anyway.
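A minimal sketch of that implicit behavior made explicit (placeholder operations, assuming a single warp of 32 threads):

```cuda
__global__ void divergeThenSync(int *data) {
    int i = threadIdx.x;
    if (i < 4) {
        data[i] += 1;      // divergent work for threads 0-3
    } else {
        data[i] *= 2;      // divergent work for threads 4-31
    }
    __syncwarp();          // reconverge the warp right after the divergent block ...
    data[i] -= 1;          // ... so this statement runs with all 32 lanes together
}
```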
Now, my question is, what happens
if we call sync warp before z is executed?
Yeah?
I would assume that the execution of Z is then synchronized across all cores?
Exactly.
And basically, that's the best practice.
If we know that we have a divergent block,
then we need to make sure that immediately after the block,
we do a synchronization so we can utilize the parallelism
to the fullest.
What happens here in the second code block shown,
again, is what happens anyway, implicitly.
And this basically comes at a slight cost of performance,
just because the model assumes that the threads need
to access each other's state within the single warp.
And basically, how are threads executed in a single warp in terms of concurrency?
We can see on this chart here, and basically we can compare this to a CPU core.
And we can see the comparison between what
a high throughput processor is designed to achieve
and what does a low latency processor do.
So the immediate difference is that we
have a significant difference in the memory access time.
So the time that a single GPU thread waits for data to
access the global memory is significantly higher than what
the CPU does.
And basically, what needs to happen, we need to overlap
this memory access time.
And we can kind of calculate what's
the cost of the increased memory access
and how much parallelism we actually need to mask these memory accesses.
So here are some numbers.
So the L1 cache latency in GPU processors is 28 cycles.
In the case of the CPU, this is four.
So basically, to hide L1 cache latency,
we need to spawn at least seven warps to compensate for this.
For GPU global memory, the access latency is around 350 cycles. To hide this, we need at least 88 warps, so that the memory accesses are overlapped with computation and those cycles are effectively free.
And when we issue these memory accesses, we can do this in two ways. We can do this in an aligned way, which means that we coalesce the memory accesses. When accessing addresses in global memory, we want to efficiently use the cache line. And the cache line is conveniently 128 bytes, which corresponds to a single warp with 32 threads accessing four bytes per thread in a single call.
However, if we don't utilize the cache line in an efficient manner, then to access the same 128 bytes we'll need two calls, which means another cycle of thread scheduling and overhead for the two actual cache accesses.
Are there any questions until this point?
Yeah.
In the execution, you showed this one.
You can only execute one branch at one time
on an arbitrary amount of threads, right?
Yeah.
OK.
I was wondering why you can't just
compress this graph and execute x and a.
Yeah, so basically whenever a single instruction is called,
either of a, b, x, or y, all 32 threads in a warp are called.
So basically, now here we are talking about a single warp.
So we are looking at the granularity of 32 threads.
And say we have instruction A that needs to be executed.
All 32 threads in the warp are activated.
But here, we explicitly say that we need only four threads
to do this.
I'll get later in the lecture why
this might be necessary to basically limit or reduce
the number of threads that execute a single instruction.
But yeah, until now, basically, we
say we need the four threads to execute A and B, which
means that 32 threads will be activated anyway.
However, only four of these will process A and B.
The benefit of having the interleaved model
is that we can use the rest of the 28 threads that
will be activated to execute x in the next cycle,
rather than waiting for the whole conditional block
to be finished.
So instead of waiting for a and b,
we can do this in interleaved fashion.
Other questions?
OK, so then let's make a short break.
And then afterwards, we'll discuss the logical execution, how this is implemented on the software level, and how we can develop for this.
Right.
So, so far we talked about the physical execution, what happens on the hardware level, which is a bit abstract
at this point.
However, in this part, we'll talk
about what happens on the logical level,
how we can develop programs that can efficiently utilize
the thousands of cores, the thousands of threads
to achieve high throughput and good levels
of parallelism.
So currently, you can develop GPU programs
in one of three frameworks.
First one that's vendor independent
and the recognized industry standard
across different manufacturers is OpenCL.
If you programmed so far in OpenCL, basically you know that this supports different hardware,
so you can program not only GPUs, but you can do CPU programming, you can program FPGAs,
and it's basically a layer over C and C++.
The proprietary framework is CUDA, developed by NVIDIA for NVIDIA GPUs.
It's based on C++ and Fortran.
And there are wrappers for other programming languages and APIs.
But if you develop pure CUDA, this pretty much looks like C++ with significantly different syntax for particular operators.
And then AMD's response is an open-source framework called ROCm.
So basically, they claim it's vendor independent.
I haven't personally tried it.
I haven't had access to an AMD GPU.
And AMD also develops CPUs and APUs, so basically hybrid CPU-GPUs.
So basically, the claim is that you can develop software
for their different hardware all in the same framework.
So how is a GPU program executed?
Basically, it's a heterogeneous model.
We operate on the CPU and the GPU, and most often it is a sequential execution. We have part of the code that's executed on the CPU; we call it the serial part or the serial code, and it's optimized for sequential execution. Then we have the kernel methods that are executed completely on the GPU; there we want to achieve this high level of parallelism that we've talked about so far.
And then again, we need some synchronization step in the end that's executed again on the CPU.
Mostly this is transferring back to main memory or storage, et cetera. And basically to explain the logical execution on a GPU,
we'll go to a very simple program,
which will basically be adding two one-gigabyte vectors. We'll see how this is executed on the CPU, what the performance is there, and how this is executed on the GPU, and basically how we can optimize this to be more efficient
than the CPU implementation, which is initially the goal of using a GPU to program, right?
So regarding the execution flow, the parallel part
or the kernel code that I mentioned
that's executed exclusively on the GPU
is executed by many threads, or by all of the threads.
So you can imagine a kernel call to be the GPU equivalent
of the execution
of a single CPU thread.
So basically, this kernel executes
the sequence of instructions assigned to the core.
And basically, the way to achieve the massive parallelism
that the GPU architecture offers,
the idea is to spawn a high number of threads at the same time and basically following the
single instruction multiple threads execution model, achieve high throughput.
You've talked about SIMD before.
Now this SIMT versus SIMD is a bit different.
So in SIMD, you execute a single instruction on all data.
What differs in the GPU model is that a single instruction is executed only on the active threads, or as we saw before, only on the designated threads.
Because by default, warps of 32 threads are spawned.
We can manage basically the execution or the instructions
on the thread level.
And if you want to compare this to SIMD, the SIMT execution model is similar to masked SIMD instructions, where you execute the instructions on only part of the vector lanes, not on all of them.
And basically, how can you program for this?
How do you define a kernel?
So a kernel is defined by using the __global__ qualifier in the signature of the method. And then everything that's inside this method will be executed solely on the GPU, by the number of threads that we give it.
And basically, how do we call the kernel? We do this with the second instruction, right? So here
we have the add kernel and then we call the add method. I'm showing this on this screen,
but this doesn't work anyway. So we have this add, then 1 and 256, and then the input.
That's a call to a GPU kernel.
And basically, we'll get through that, but a small spoiler for now.
Basically, 1 means we have one thread block, and we have 256 threads in this single block.
What does a thread block mean?
We'll discuss, or we'll talk about this very soon.
So this is how we define and call a kernel.
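A minimal sketch of what such a kernel definition and launch look like in CUDA (hypothetical names, not the exact demo code). With <<<1, 256>>> all 256 threads here would redundantly compute the same sums; splitting the work by thread index comes a bit later:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel: everything in this method runs on the GPU, once per launched thread.
__global__ void add(int n, const float *x, const float *y, float *out) {
    for (int i = 0; i < n; i++)       // naive version: every thread redundantly does all the work
        out[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y, *out;
    // Unified memory keeps this sketch short; the demo may allocate differently.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<1, 256>>>(n, x, y, out);    // 1 thread block, 256 threads in that block
    cudaDeviceSynchronize();          // wait for the kernel to finish

    printf("out[0] = %f\n", out[0]);
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```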
However, to execute something on the GPU,
we need to send data there.
And we need to, of course, allocate memory for this.
We can do this in two ways.
We can do static allocation,
which is done by using the __device__ prefix. Basically, we say, okay, this integer x needs to be allocated on the GPU.
This is very static, so we define single variables.
However, often we need a chunk of memory, right?
So the dynamic allocation is basically executed
through the cudaMalloc method. There are different ways of calling cudaMalloc.
We can do pinned memory.
So we can do, in separate cases, we
can define the memory location where
we want a particular part of the data to reside.
And then basically, this can be achieved through pinned memory
because then we reserve this location for this for this for
this part of the code or for this chunk of the data and then means that this this will be continuously
loaded into into this part of the memory then we have the memory transfers from host to device and from device to host, host being the CPU and the main memory.
So basically, we have two directions to execute this.
And yeah, basically, that's what happens.
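As a rough sketch of that flow with explicit allocations and transfers (hypothetical sizes; error handling omitted):

```cuda
#include <cuda_runtime.h>

__device__ int d_flag;                     // static allocation: lives in GPU global memory

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in = nullptr;                 // host-side buffer
    cudaMallocHost((void **)&h_in, bytes); // pinned (page-locked) memory speeds up transfers
    for (size_t i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in = nullptr;                 // device-side buffer
    cudaMalloc((void **)&d_in, bytes);     // dynamic allocation in GPU global memory

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that work on d_in ...
    cudaMemcpy(h_in, d_in, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
```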
And then we can control the memory placement on the GPU
on a bit lower granularity.
So based on how we define a variable,
they can have different lifetimes.
So we mentioned the static allocation with __device__.
Basically, this places the variable explicitly on the GPU.
However, basically, we can define a variable
to be accessible by threads in a block or for the whole kernel.
And basically, if we define a shared memory variable,
it means that this will be accessible to all threads in one block and
will be stored during the execution of a single kernel.
Kernel variables are private for each thread.
So basically these are the variables that are placed in register memory. And basically, other threads cannot access these variables.
And this memory is freed by default
after the kernel completes.
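An illustrative sketch of those lifetimes (a hypothetical kernel that assumes it is launched with 256 threads per block):

```cuda
__device__ float d_scale = 2.0f;          // application lifetime, GPU global memory

__global__ void scaleAndSum(const float *in, float *blockSums, int n) {
    __shared__ float partial[256];        // one copy per thread block,
                                          // lives for the duration of the kernel

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread-private, kept in registers
    float v = (i < n) ? in[i] * d_scale : 0.0f;      // also thread-private

    partial[threadIdx.x] = v;
    __syncthreads();                      // make all shared-memory writes visible block-wide

    if (threadIdx.x == 0) {               // thread 0 reduces the block's values
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; t++)
            s += partial[t];
        blockSums[blockIdx.x] = s;
    }
}
```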
Just for completeness, here is the part of the code that OpenCL would require.
So basically, we want to select a device.
We need to create OpenCL context.
We define the kernel, again, the part where the GPU will process the data.
We create buffers for the vectors,
and we put the device commands or the GPU commands in a queue,
and finally execute the kernel.
This is just for completeness to see how
this is executed in OpenCL.
As I mentioned earlier, we'll stick with the CUDA
implementation.
And now I'll do a short demo so we can basically
see the performance differences.
Yes?
I hope that usually the code for the kernel is loaded from a file and not written as a string, like on the slide before.
Yeah, exactly.
I mean, yes.
That's kind of scary.
Yeah, if I put a kernel.txt there or something, it would be a bit unreadable.
I mean, this is also unreadable in my opinion.
But I put it there just so you know how it's invoked.
And of course, you can do this from a separate file, which
basically then contains more or less C. This
can be called from a C file or C++ file.
So then, as I said, we'll do a short demo.
Again, this is example code.
The example is very basic.
We're just adding two vectors.
And we are going to discuss some principles of how
to basically optimize for the parallelism in CUDA or on the GPU.
So just as a reference, we have a single-threaded addition of two arrays on the CPU. If we just compile this with no optimization
flags and run the code, we have around 700 milliseconds
of execution time.
If we compile this with an O3 flag,
then this drops to 155 milliseconds.
The complete code, you can find it in the slides.
And basically, you can compare the performance if you're interested. Now, the equivalent to vector addition on the GPU would be to define the kernel.
Like we said already, we need to invoke the kernel.
So, as we did a single thread implementation of the CPU code, we'll start with a single thread on the GPU as well. Now, I mentioned that the numbers after the kernel call,
they have a meaning.
And basically, the sequence is the following.
So we define the number of thread blocks.
And we define the number of threads
for executing this kernel.
Then we have memory allocation, right? So we need to allocate the memory for the two vectors, which will be placed in GPU memory and then sent back to CPU memory after the kernel has completed. Some useful code for kernel debugging is the following.
So basically, once you define something in a kernel,
you cannot easily debug this.
And basically, there are defined error checks in CUDA.
Just a second. And so basically you can perform
kernel calls wrapped up in these error checks.
So basically if there is something wrong
or there's any error going on in the kernel,
this will be reported back.
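A commonly used pattern for such checks looks roughly like this (a sketch, not the exact macro from the slides):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap CUDA runtime calls so errors are reported instead of failing silently.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Typical usage around a kernel launch:
//   add<<<1, 256>>>(n, x, y, out);
//   CUDA_CHECK(cudaGetLastError());        // did the launch itself succeed?
//   CUDA_CHECK(cudaDeviceSynchronize());   // did the kernel run without errors?
```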
Yes?
You can absolutely debug the kernel.
So in the... Through nsys, right? Yeah.
So, just at run time, this is not possible, right?
You need to have the nsys suite installed, right?
And there you're more or less doing profiling and debugging. This is a bit more advanced.
This is like the very basic, I want to do like Hello World.
This is what happens.
Yeah?
And printf debugging works as well, so.
Since when?
At least it worked last year.
OK.
So when I tried printf debugging,
this was very unreliable.
So basically, if you execute this on multiple threads,
the problem is that you don't know
which output you're seeing.
Yeah, that, obviously.
Yeah.
Well.
And the buffer for the strings is pretty small, so you can't debug large time intervals, but still.
Yeah.
Again, this is just a safety check.
Also, it's also useful for the memory allocation part,
because sometimes memory allocation errors are just segfaults.
And basically, we want to deal with segfaults as little as we can.
So I found it useful when dealing with some weird memory
allocations or some corner cases.
This helped.
But definitely, if you want to do proper debugging,
you use NSYS.
And then you see what's going on.
You profile the code.
And you can see the state of your variables. So here we have the single thread performance
of this vector addition or array addition
that happened on my laptop.
And if we run this on single thread,
we have an execution time of 19 seconds, which
is a lot slower than the CPU run.
It's two orders of magnitude slower.
And it's not very nice, right?
So let's try to see.
I accessed.
I cannot find.
OK, great.
So I connected to one of the servers.
There we have a bit more advanced GPU,
and let's see, oops.
All right.
Let's see what happens.
And what was the actual performance
in a, say, mid-sized server GPU.
We have the...
So we allocate the memory in the beginning,
and then let's see the results.
For the purposes of the demo,
I'm not running any profiling tool
just because of brevity.
The output is quite large, so the number that we are interested in is this kernel duration. I have it in microseconds; you'll see in a few demos why that is the case. But this is again 15 seconds, which compared to the CPU, 120 milliseconds, is way, way slower.
And this is when we use single thread, right?
When we perform this on a single thread.
But the goal is not to do this on a single thread, right?
So, okay, this should work.
So the question is, how do we improve?
So we need to use, of course, multiple threads.
We've mentioned this so far multiple times.
And to do this, we need to be aware of how the threads are organized and how they are called. To execute a multi-threaded program, we're executing thread blocks. And
each thread block is executed in one streaming multiprocessor. So these threads can communicate now.
And let's not confuse also thread blocks with warps.
So warps are the scheduling units.
So this means that this goes into multiples of 32.
However, a single thread block can have multiple warps inside.
And now basically what we want to have is,
let's say we do a single block with a large number of threads.
And what we want to achieve is high throughput, as we mentioned, but we want to do this by invoking multiple threads. To do this, we need to manage the way in which the threads will access the memory and the data. So there are two ways, or at least two ways, to access the data. One is that thread one is executed on sequential parts of the data.
Then thread two is executed on the next block,
and then thread three and onwards.
The alternative is to have strided access. So basically, one thread doesn't access sequential memory
locations.
However, it jumps ahead: the stride, the number of memory locations that it skips, is the number of threads. So which one do you think would perform better? Which memory access
pattern should achieve higher performance.
Yeah? The strided one is better, because we access a cache line at a time?
Mhm, okay.
I thought it's the opposite, because we don't have to switch cache lines as often. We don't have to switch to the first cache line, then the second, then the third, and so on.
Your reasoning is correct, but basically, the effects
of that reasoning are the opposite.
Basically, to access the cache efficiently,
we need the strided access, because these threads
are executed at once.
So this means that what you were saying,
which is exactly right, you want to utilize the cache.
Now, these three threads will access three memory locations.
And then they'll access basically what's shown here.
Then they'll access the next ones and the following ones.
And what happens in the example above,
thread one, thread two, and thread three
are executed at the same time,
but they're accessing widely separated locations.
And now with three threads, this won't be a problem.
But if we want to execute this on more than 32 threads, which
by default we want to, this will already
mean that we access at least three cache lines, where,
for example, a single warp will access a single cache line here.
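A sketch of the two per-thread access patterns for the single-block vector addition (hypothetical kernel names; the demo code may differ in detail):

```cuda
// Each thread works on a contiguous chunk: thread 0 gets elements [0, chunk),
// thread 1 gets [chunk, 2*chunk), and so on. Neighboring threads in a warp
// touch addresses that are far apart, so their accesses are not coalesced.
__global__ void addSequentialChunks(int n, const float *x, const float *y, float *out) {
    int nThreads = blockDim.x;
    int chunk = (n + nThreads - 1) / nThreads;
    int start = threadIdx.x * chunk;
    int end = min(start + chunk, n);
    for (int i = start; i < end; i++)
        out[i] = x[i] + y[i];
}

// Strided access: thread t handles elements t, t + nThreads, t + 2*nThreads, ...
// In each iteration the 32 threads of a warp touch 32 consecutive elements,
// so the accesses coalesce into full cache lines.
__global__ void addStrided(int n, const float *x, const float *y, float *out) {
    int nThreads = blockDim.x;
    for (int i = threadIdx.x; i < n; i += nThreads)
        out[i] = x[i] + y[i];
}
```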
And let's see what does this mean in terms of performance.
So I have some code from my laptop, but let's see what happens on the server.
So we'll try multi-threaded sequential first. Okay.
So we are already down from 15 seconds to 519 milliseconds, which is quite nice.
So we're executing this on eight warps, meaning 256
threads.
Now let's see what happens if we do.
So this is the sequential access.
And again, with the equals, now we
want to do the strided access.
Are those 12 seconds the copying of the data from memory to the GPU?
No, that's actually initializing one gigabyte
of uniformly distributed integers.
That's the bottleneck in this code, basically.
OK, it's already finished.
Your question, the memory allocation,
is the 80 milliseconds here.
So to compare the performance of the kernel,
we went from 520 milliseconds to 333. So basically, this is the impact of having stride memory access
versus sequential memory access.
And again, the reasoning is the same.
Not the same, but the reasoning is
we want to access one full cache line at a time.
So let's see if my laptop performance numbers agree.
Yes, so here we had around 2x improvement.
Yeah.
And again, the short explanation underneath each.
So basically, we are a bit better than a single thread compiler optimized CPU execution.
So now the question is, is this worth the extra money that you would spend on a GPU
and the overhead of programming in different framework, etc.?
The short answer is no.
So we want to be better than this.
We want to beat the CPU execution time by several multiples.
Again, code references.
So to do that, we go on basically one level
of abstraction higher.
So we know that threads are organized in a single block.
And then thread blocks are organized in a grid.
Basically, what does this mean? It means that we now have a higher abstraction where thread blocks are independent, but they can share results. But more importantly, they are all called in a single kernel call and in a single execution.
And now, basically, what we can do
is we can modify the kernels to access blocks and threads.
So basically, now we want to spread out
the memory access of a single execution unit.
And when we move this to a kernel block,
this allows us to exploit more of the parallelism
from both the computational side,
but also in the memory access part.
So with the block size, we define the number of threads in a block.
Current GPUs allow up to 1,024 threads in a single block.
And basically, the grid size limit for how many blocks
can be in a single grid is basically around a billion.
And now the question is, how can this impact performance?
And here, this is just a small adjustment
of how we access the different blocks in the grid.
So basically, they can be executed in parallel.
So just a quick question.
What amount of speedup do you expect from this?
So we were at around 300 milliseconds with stride memory access and 256 threads.
OK, let's see.
So just before we run the demo, a quick mention of what we changed compared to the previous version. So basically, when we were multi-threading, we were defining the parts of the memory that each thread accesses and how the memory accesses are executed.
Now, if we want to do this concurrently
for multiple blocks, we need to do this also,
like small change here.
So we need to define the stride and the index
for each thread separately.
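The multi-block version essentially becomes a grid-stride loop, roughly like this (same assumptions as the earlier sketches):

```cuda
// Every thread starts at its global index and strides by the total number of
// threads in the grid, so all blocks share the work and the accesses within a
// warp stay coalesced.
__global__ void addGridStride(int n, const float *x, const float *y, float *out) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                 // total threads in the grid
    for (int i = index; i < n; i += stride)
        out[i] = x[i] + y[i];
}

// A common launch: enough blocks to cover n elements, 256 threads per block.
//   int blockSize = 256;
//   int numBlocks = (n + blockSize - 1) / blockSize;
//   addGridStride<<<numBlocks, blockSize>>>(n, x, y, out);
```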
And now let's check what will be the performance application of this.
Okay, so now we want multi-thread, but we also want multi-block, right?
Again, we'll wait for those 10 seconds for the integers to be generated.
Again, this is far from efficient uniform distribution generation.
The goal of the program is a bit different.
So what happened is that we moved from 333 milliseconds to 4.
So basically now we are comparing hundreds.
So on the GPU level, we compare a speed up of,
we started with 15 seconds.
Now we are at 4 milliseconds.
So we have three orders of magnitude improvement here.
Of course, the initial example was extremely bad.
However, it is with just basic utilization of the parallelism
that the GPU offers, so we didn't do any data optimizations.
We just handled the memory access and the thread allocation.
And this is the speedup that we have.
And again, we're not even utilizing the full GPU resources, right?
We access only 256 threads.
And we don't run on the full grid.
Now, let me go back to the slides.
Right?
So on my laptop, this was 20 milliseconds,
which was, yeah, these are the performance gains
that were achieved.
Now, basically, the question is,
how would this perform against the CPU multi-thread, right?
So there, I haven't run the code multi-threaded on the CPU.
However, what's important for this example
is to basically see what are the performance
implications of, of course, running something
in multi-thread fashion, but also what
are the implications of memory accesses and combining threads
in blocks and grid.
Yes?
Can you explain again how you come up with the 256 threads per block?
Could I change this to 512?
Yes.
Does this affect the performance in this case?
I'll talk about that in a second.
So that's an excellent question.
So basically, how do we define this, right? How do we know do we need, whether we need 256 threads
or we do the full maximum of 1024 or 512
or whatever in between, right?
Again, code sample.
So basically, now the other side of the coin comes in, and this is the cost of parallelism. The cost of parallelism is usually an overestimation of hardware resources. So in our example, we had just a pure array addition of not a ridiculously high amount of data.
The GPU that this was run on was a 24-gigabyte GPU, so everything fits comfortably in GPU memory.
We have no issues, right?
However, when we overestimate the hardware resources, we stop utilizing the resources that the GPU has efficiently, and we are adding overhead in terms of control.
So the problem with using the maximum amount of threads and the maximum amount of blocks is that these need to be synchronized at some point, right? So they'll execute the computation in parallel, they'll do efficient memory access to a certain extent, but there is synchronization time and there's scheduling time, and the question is if this pays off.
And for example, so a bit of a different example,
a bit more relevant than adding two arrays.
If let's say we imagine,
so we know how thread blocks are organized,
we know how threads are assigned to thread blocks.
Let's say we have multiple thread blocks that are writing
to the same output files.
So just as a side note, we have thread blocks
that don't share the shared memory
in a streaming multiprocessor.
So these are independent.
The shared memory is reserved for the block.
So now this means that each block
will load its own copy
of the output file.
So immediately, we have redundant data loading
and redundant data synchronization.
And this means that at a certain point, we'll either have a deadlock because of concurrent data accesses, or a large part of the workload will be spent on synchronization.
And basically what this will lead to
is that we have a parallel workload, or a workload
that we designed to be parallel, but that will be inevitably
serialized.
Because we overestimated the hardware distribution of the threads.
So basically, what's the solution here is to assign more work to a single thread block.
So if you remember the example when I was discussing about the physical execution models
where we said, okay, A and B need to be executed only by threads 1 to 4.
This is the reason for this.
And basically, this is called thread coarsening.
It means that we provide more work for a particular set
of threads because we decided that the memory access
and the synchronization cost will be higher
if we over-parallelize this.
And thread coarsening, in the case that I showed with the vector addition, can be interesting because, to answer your question, Moritz, about utilizing all threads, the performance was actually worse. So I tried only 1,024; I didn't try anything in between to find the optimal number of threads.
But 1,024 threads in a single block performed worse than 256 threads.
And it was, I don't remember exactly, but it was at least 2x worse.
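An illustrative sketch of thread coarsening for the same vector addition (the coarsening factor is a made-up example, not a recommendation): each thread handles several grid-stride elements, so proportionally fewer threads and blocks are launched.

```cuda
// Thread coarsening: launch fewer threads and let each one handle COARSE
// elements per step, trading some parallelism for less scheduling and
// synchronization overhead.
#define COARSE 4

__global__ void addCoarsened(int n, const float *x, const float *y, float *out) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride * COARSE) {
        for (int c = 0; c < COARSE; c++) {
            int j = i + c * stride;        // keeps accesses coalesced within the warp
            if (j < n)
                out[j] = x[j] + y[j];
        }
    }
}

// Launched with proportionally fewer blocks, e.g.:
//   int blockSize = 256;
//   int numBlocks = (n + blockSize * COARSE - 1) / (blockSize * COARSE);
//   addCoarsened<<<numBlocks, blockSize>>>(n, x, y, out);
```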
So basically, knowing the workload,
knowing what we need, or basically how the data is
moved, how costly are the computations.
In our case, addition is not costly.
But there are workloads where we need
to optimize the memory access versus the computation.
And basically, the way to do this, we profile the code, right?
You use nsys, or nvprof for earlier generations, and you see where time is being spent and where your threads are idle.
And idle threads will happen if we overestimate the hardware.
So basically, to add to the thread coarsening aspect,
if we have different thread blocks reading the same data, which means that we need to access different parts either in GPU global memory or in CPU memory, then it pays off more to do this sequentially
or to add more work to these thread blocks
rather than infinitely parallelize this
or parallelize this to the maximum.
Because again, we want to reduce global memory
accesses and synchronization.
And in general, I think this is an implicit conclusion
that we can make from this lecture so far.
In GPUs, compute is way cheaper than data transfers.
So if we can assign more compute to a set of threads
to reduce the data transfers, we are already
performing an optimization.
Or we are basically reducing the number of memory accesses
and the need for synchronization and waiting for data.
So to summarize this, again, synchronization and data transfers are the main reasons for the additional cost, for the reduced performance, of excess parallelism. And basically what we need to know is that thread blocks work on
shared memory, so we can utilize this shared memory. If you remember, we saw a table where
the lifecycle of the variables was defined,
and basically what's the level of granularity
of the threads that are accessing these variables.
And basically, reusing shared memory in a single block
is very beneficial to make sure that you don't have redundant
transfers of data and redundant allocations.
Of course, this is not the silver bullet.
Thread coarsening is not beneficial when we have the data distributed across different
thread blocks.
So basically, this means that no additional synchronization
will be required because each thread block
will access their own data, which is all fine.
However, if they need to mix and match,
then this becomes a problem
and it basically pays off more to just allocate the complete workload to a single thread block, or to the thread blocks that are accessing this part of the memory, and execute it in a constrained fashion there. Another two cases where thread coarsening is not beneficial are where the hardware is underutilized anyway, so
this means that we're not parallelizing enough, and the occupancy of the threads and the thread blocks is not maximized.
Just an overview of some of the performance considerations that we had so far in the
lecture, and some of them that we didn't cover so far.
So we've talked about coalesced memory access.
We have the computational benefits and the memory access
benefits of performing such memory access and basically
a way to do this.
I won't go one by one. I'll just mention the ones that we talked in this lecture.
So we mentioned coalesced memory access.
Maximizing occupancy.
This, I didn't mention it explicitly, but this is basically increasing the number of
threads, increasing the number of blocks. And basically, the way to do this is, like we already did,
we do different number of threads
on different number of blocks.
And then we can go even more granular. Some things that we didn't cover, but I mentioned, are the utilization of shared memory.
And also we can access individual registers or basically dedicate registers to individual threads.
Thread coarsening I just mentioned,
basically addressing the cost of parallelization.
What we didn't mention is privatisation or using atomic operations and
atomic variables in GPUs.
We also want to minimise control divergence. So again, this is mainly the mapping of the threads
to the data and rearranging the layout of the data.
And basically, the goal is to achieve high efficiency.
And tiling of the reused data is basically more of the same. Again, data placement and usage of shared memory.
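As a small sketch of tiling reused data in shared memory (an illustrative 1D stencil, not code from the lecture; it assumes the block size equals TILE):

```cuda
#define TILE 256
#define RADIUS 2

// Each block stages its tile of the input, plus a halo of RADIUS elements on
// each side, into shared memory once, so the repeated reads of neighboring
// elements hit shared memory instead of global memory.
__global__ void stencil1D(int n, const float *in, float *out) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lIdx = threadIdx.x + RADIUS;                    // index inside the tile

    tile[lIdx] = (gIdx < n) ? in[gIdx] : 0.0f;          // main element
    if (threadIdx.x < RADIUS) {                         // left and right halo
        int left = gIdx - RADIUS;
        int right = gIdx + blockDim.x;
        tile[lIdx - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[lIdx + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                                    // tile fully loaded

    if (gIdx < n) {
        float sum = 0.0f;
        for (int r = -RADIUS; r <= RADIUS; r++)         // reuse happens here
            sum += tile[lIdx + r];
        out[gIdx] = sum;
    }
}
```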
Right.
So to summarize, I introduced the need for coprocessors.
Why do we need them? I talked about GPUs as one of the examples of coprocessing.
We talked about the architecture, the memory hierarchy, and the programming model basically
on the physical level and also on the logical level. The performance considerations that I mentioned just now are part of where we'll start tomorrow, along with some other considerations and use cases of GPU data processing.
Are there any questions so far?
No? okay.
Well, then, thank you very much.
And we'll see each other tomorrow for the second part of the GPU lecture.