Hardware-Conscious Data Processing (ST 2024) - tele-TASK - Data Processing on GPUs I
Episode Date: June 25, 2024...
Transcript
Welcome everyone. Today I'll give a lecture on data processing on GPUs. Some of you I know already, some of you I haven't met yet. I'm Ilin, I'm a PhD student in Professor Rabl's group for data engineering systems, and I've done some research on data processing on GPUs. So today we'll discuss the introductory part,
or basically what are the main principles
behind GPU data processing, its programming model,
GPU architectures, et cetera.
So this is a short announcement.
We're always on the lookout for student assistants.
So if you're interested in any of the topics presented
in this lecture, you can chat with Professor Abel or myself.
And we can find a nice project topic or maybe a master thesis
topic when the time comes for you to register.
OK, so where are we in the lecture?
Last time, you talked about networking.
And from this lecture on, we start with the accelerators.
So we'll start with two lectures on data processing on GPUs.
And then you'll talk about FPGAs and CXL.
Right, exactly.
With task four coming in the next few weeks.
So in this lecture, we'll talk about what coprocessing is, why we need accelerators, how general-purpose GPUs can be used to accelerate data processing, and what the programming model for data processing on GPUs looks like, both in terms of the physical execution model and from the logical perspective.
Well, finally, we'll talk about some performance considerations
that apply generally to massively
parallel processors.
And since we start now with parallel accelerators, we'll cover some of them as well.
The lecture is based on these two books, mainly the orange one, Programming Massively Parallel
Processors.
It's a very nice introduction. So if somebody is interested in the topic,
I can recommend the book.
It's based on the CUDA programming model,
so mostly NVIDIA GPUs and principles implemented there.
However, the introductory part has enough depth
to basically introduce all the general principles for data processing on GPUs.
And I've also used additional sources like Carsten Binnig's lecture on advanced data management systems and Sebastian Breß's lecture on in-memory databases on modern hardware. Right. So coprocessing, and why do we need it?
So I think you've seen this plot in this lecture.
So we'll just do a quick recap on Moore's law,
and basically, where does it stop?
The main reason for this is that the increase in transistors in the CPU doesn't yield higher single-core performance anymore. And this holds true from the mid-2000s. The main reasons are that performance is capped due to either economic feasibility, power and safety constraints, or all of the above.
So one natural solution is to add more cores.
You've talked in the multi-core CPU lecture.
However, this comes at a cost as well.
So by Dennard scaling, the idea is that the transistor density doubles, but the power consumption stays the same. However, again, since the mid-2000s, as you can see on the plot in the right corner there, this doesn't apply anymore. Because we see that as we scale the transistors, we don't have the same power consumption. Rather, the ratio goes below one, so it doesn't scale linearly. And this is mainly because smaller surfaces increase the risk of current leakage. We need more control units and more space between the cores. If we don't do that, we have either heating problems or power leakage. And this caps the frequency that a processor can achieve.
And again, one temporary solution
is just add more physical cores.
Do not extend the core itself.
However, while this helps to keep the energy consumption per core low, we cannot run all cores simultaneously.
And parts of the physical cores are unused because of that.
And what we end up with is dark silicon,
which means underutilized CPU resources, which were already
expensive to develop and design.
So in this part, we introduce accelerators
because we have them.
Yes?
Why can't they run all at the same time?
So they run, you run them in parallel, right? But physically, they're exchanging turns. You cannot run them all at the same frequency.
OK. So there's like a physical limit to how much electricity we can put in the CPU in total?
Exactly.
And basically, the idea is that we move from the flexible programming models of the
CPU to multi-core CPUs, but then there is hardware that can basically utilize parallelism
even more.
And we call these devices accelerators, and you can see we're talking about GPUs.
This is right in the middle in terms of programming flexibility
and compute efficiency.
And basically, the idea is that we utilize this new hardware.
We apply a different programming model that basically
is focused on the throughput rather than low latency.
And we reach higher throughput levels by utilizing parallelism.
And in the beginning, this was aimed solely towards scientific workloads.
And the main reason was that basically GPUs were highly, highly specialized co-processing chips that were aimed only towards image
rendering, which meant parallel floating point computations.
And some computer scientists noticed
that the performance of GPUs increased 2.5x yearly, while during peak Moore's law, CPU performance increased 1.8x yearly.
So they wanted to try out some alternative data processing
and basically try to program and utilize
the parallelism that the GPU offered.
However, the problem was that the programming model
was based on DirectX,
which is a graphics interface library, and major adaptations were needed to execute the scientific workloads, which came mainly from the biology domain. After seeing the adaptations that were required, they said, okay, let's design a programming framework and design the chips in such a way that we can use them for general-purpose programming.
And this is basically where general-purpose GPUs come into play.
And let's take a look at their design.
So on the left-hand side here, we
have a classical CPU core, or classical CPU with four cores.
And basically, we can see that for each core,
we have a control unit.
We have L1 caches.
We have L2 shared cache, L3 cache, and then we have the main memory.
You can see that the same components are more or less in the GPU cores. However,
there are some fundamental design differences that we need to note.
So CPU cores are aimed towards minimizing the latency of arithmetic operations. Their design is sequential and latency oriented, and for this they require a lot of control units, power management, and of course their designated cache memory, so they can be efficient and achieve this low latency.
GPU cores, on the other hand, need
to process large numbers of floating point operations.
And also, they need concurrent memory access.
So basically, they have a different design. The goal is to achieve higher throughput. And today's GPUs can achieve around 1.5 to 3 terabytes per second of memory throughput.
And basically, this is the result
of having thousands of these cores.
However, the big difference is that the cores require much less power because they have far fewer control units.
And their functionality is way more limited
without any control unit per core,
but rather this is organized in a more coarse way that
we'll discuss in a bit when I go through the architecture.
And basically, the result of this is massive parallelism.
While the CPU peak bandwidth can be around 100 gigabytes per second in today's CPUs, current GPUs, for example the A100, currently one of the top GPUs that NVIDIA offers on the market, offer 1.5 terabytes per second. And the upcoming H100 will offer up to 3 terabytes per second of memory bandwidth.
Now, let's see how this GPU architecture is organized.
Basically, how do the compute units differ from the CPU ones, and how can we access this high bandwidth that I mentioned?
So the main computational units are, in general,
called compute units.
And similarly to the CPU cores, they
contain multiple computational cores, register files, L1 cache
that's tied to the computational core,
and shared memory that can be accessed
by the whole compute unit.
Now, one difference that you can see here
is that each compute unit itself contains multiple cores.
So in general, this is around either 32 or 64,
depending on the architecture.
So basically, 32 cores or up to 64 cores
are managed under a single compute unit, which
means that basically all cores can access the same cache,
which is very low latency and high bandwidth memory.
And also, they can communicate with other compute units
through the shared L2 cache.
And basically, the lower we go in the memory hierarchy,
we increase the latency to access.
And what we can see here, or here rather,
is that to access GPU global memory,
we steadily increase the latency. And while this is high bandwidth memory, like we said, up to three terabytes per second of bandwidth, the cost to access the global memory is quite high.
We'll go into more detail a bit later.
So the memory hierarchy is basically split into two.
We have the on-chip memory that consists of the L1 cache and the shared memory on the compute level,
on the compute unit level, and then L2 cache that's shared across all compute units.
And the sizes differ by quite a lot.
So the L1 cache is around 200 kilobytes.
So I think Intel's new data center GPU just increased this
to 256 kilobytes of L1 cache.
However, the L2 cache can go up to 400 megabytes.
And this is also from Intel's new data center GPU.
And it's an 8x improvement over what was already
state of the art, which was around 50 megabytes
for Nvidia's latest GPUs.
So we have a large L2 cache.
And then the GPU global memory can go up to 188 gigabytes.
However, as high bandwidth as this memory is, we have high access latency. So moving from the L2 cache to the GPU global memory, the access latency increases by up to two orders of magnitude. And yes, the physical distribution of the memory is that we have so-called on-chip memory
and on-device memory.
Regarding terminology, I mentioned compute units as the main computational centers of
the GPU.
And different vendors use different terms for this. I'll mainly focus on AMD's and NVIDIA's terminology or nomenclature because these are the biggest
vendors as of today and their GPUs are most widely spread.
In this lecture, we'll discuss specifically the architecture of NVIDIA GPUs, as these are the de facto standard.
And also, for the programming model that we'll discuss,
this is based on NVIDIA GPUs.
We use them here at HPI in our data lab.
And also, most of the cloud providers
are exclusively offering NVIDIA GPUs.
So we'll focus on that terminology.
And basically, the idea is that you
can map the terms that you already know from the CPU
lectures to the computational parts or the computational units
of the NVIDIA GPUs.
So for example, one multiprocessor or a multicore
CPU, you can look at it as a streaming multiprocessor.
One core, one physical core is one computational core.
In the NVIDIA GPUs, a thread is a thread.
And a vector unit is a SIMT warp in an NVIDIA GPU.
What's a SIMT warp?
And what does a computational core consist of?
And what does a streaming multiprocessor consist of?
I'll talk about in a second.
So independent of the vendors, one compute unit basically consists of four main parts.
We have the computational core.
As I said, we can have at least 32 or up to 64 in current GPU
models.
We have the registers, which are the private memory
for each computational core.
Then we have the local memory that is shared by all cores in the compute unit.
And we have a thread scheduler that basically
assigns the threads to the thread groups
and then to the cores.
What's a thread group?
We'll discuss in a bit.
Just I'm trying to not go too quickly with the terminology.
But what's important to know from this slide
is basically the four different parts
that each computational unit from the GPU has.
And basically, how does this map into a real GPU?
So like I said, we'll take as an example one of the GPUs
that we have at HPI.
One of them is the NVIDIA A100 GPU.
And basically there we have the computational cores.
Again, we have the L1 cache and the shared memory.
We have a warp scheduler.
We have the register files, and we have the dispatch unit.
So we start with the computational cores.
There are two types of computational cores in NVIDIA's streaming multiprocessor.
So basically we have the CUDA cores, which process integers and floating point numbers.
And we also have the tensor core.
The tensor core is a specialized type of core that operates solely on tensors.
I won't go into detail on either of those.
It's just important to remember that there is a distinction between the inputs that they're processing.
And computational cores in a GPU are stream processors.
And you can imagine or you can see it
as an equivalent to a vector lane in a CPU.
And then basically the streaming multiprocessor,
you can see it as the equivalent of a full CPU core.
Since we have multiple computational cores in one streaming multiprocessor, in this GPU specifically we have 64 CUDA cores per streaming multiprocessor and 108 of these streaming multiprocessors, this results in around 7,000 computational cores in this GPU. So you can imagine a 7,000-core CPU and the parallelism that it can achieve. Now, what type of memory does the streaming multiprocessor access?
access?
Basically, we have the L1 cache and the shared memory.
And these were separate until around 2018, which meant that the L1 cache was managed automatically by the hardware, and the shared memory was managed explicitly by the code that we provided. Now they're combined, which means that there are no cache misses.
They can be configured for each streaming
multiprocessor separately.
And they share the same address space, which basically
allows us to have aligned memory accesses,
efficiently use the cache lines, and basically lower
the latency for memory accesses.
And here, it says that they're combined, so we have 192 kilobytes combined L1 cache and
shared memory, and this can be programmed.
So the developer can decide how much of this memory will be an L1 cache and how much of
this will be shared memory.
There is a limit, though: the shared memory can go up to 164 kilobytes, and the remaining 28 kilobytes are exclusively reserved for the L1 cache.
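Just as a rough sketch of that configurability (hypothetical kernel name; the split is a per-kernel preference hint, not a hard guarantee), the CUDA runtime lets you request a shared-memory carveout like this:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    // ... kernel body ...
}

int main() {
    // Ask for roughly 50% of the combined on-chip space as shared memory;
    // the remainder is then available as L1 cache for this kernel.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    myKernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```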
What's the main difference here is basically
the lifetime of the variables that we define
in each of these memories, which we'll
cover when we talk about the programming model a few slides
down the line.
So then outside of L1 caches and shared memory,
we also have the registers, which basically,
they send data directly to the computational core.
And for this, they don't require a load instruction
to fetch the data.
And basically, what this means is that each of the threads
has basically exclusive access to the register files, which are organized in 32 banks.
Each bank is basically the private memory to a single thread.
And basically, these 32 banks correspond to the 32 threads that can be executed at once
or in a single call in the GPU.
So basically, each thread accesses a separate register bank. Does this remind you of something, these banks that the threads access? You should have seen this before in the lectures. So this would be a vector lane, one vector lane in the register file. We'll get to that when we're discussing the programming
model.
And basically, how we manage the threads
and how do these threads access memory and execute compute,
basically, is through the warp scheduler.
And what the warp scheduler does is assigns groups of threads
in an execution queue.
And the naming here, it's called warp scheduler, not thread scheduler, because a group of threads is called a warp. A group of 32 threads is a single warp, which then corresponds to these 32 banks. Executing one warp per cycle allows concurrent execution of 32 threads.
And basically, depending on the design of the GPU,
they allow executing the whole warp in a single cycle,
which means that we have 32
threads per group.
For example, on my work laptop, I have 14 streaming processors
on my GPU.
The core count is around 900.
So basically, I have 64 cores per streaming multiprocessor.
So I can execute two thread groups or two warps
in a single cycle per streaming multiprocessor.
And basically, I can do this for the whole GPU.
I can do 28 warps.
On the A100, this is a bit of a different story.
This is a data center GPU.
It has 108 streaming multiprocessors.
The CUDA core count is close to 7,000.
And we set, again, 64 cores per streaming multiprocessor.
And if we execute two warps in a single cycle
per streaming multiprocessor, we have 216 warps
executed on the whole GPU. Multiply that by 32, and you get 6,912 threads executed in a single cycle.
Now, basically, how is this physically implemented,
and what's the GPU programming model on the physical level?
So let's discuss a bit about that.
So it's, like I said, in one streaming multiprocessor,
threads are scheduled into warps.
Also, I already gave a small spoiler.
Each warp has 32 threads.
And basically, one warp is executed in a single streaming multiprocessor. So these 32 threads that are grouped in the warp cannot be shared between streaming multiprocessors. They all correspond to the same thread block,
which means they access a particular part of the memory that's basically sequential,
and they execute the same instruction in parallel.
And on the warp level, the warps share the control unit in the streaming multiprocessor. By knowing the number of warps per streaming multiprocessor and the number of streaming multiprocessors, we can calculate that around 220,000 active threads can be spawned on the GPU (on the A100: 108 streaming multiprocessors times 64 resident warps times 32 threads per warp, roughly 221,000).
How are these threads scheduled?
Basically, there are two ways that this can be done.
Like we said already,
all threads within one warp
execute the exact same instruction.
However, we can, so this is what physically happens.
What we can define in our programs
can be something completely different.
So basically, we can say, I want the first two threads
to execute some instructions, and the rest something completely different.
And then after this branch is finished,
I want them to execute another instruction, again,
all of them together.
And basically, to execute this, there are two ways.
We can have a lockstep execution, which
happened in the past, or we can have an interleaved execution
model that basically happens or is implemented
in current generations.
So let's start with the lockstep execution model.
Basically, what happens here is that for each warp we have a program
counter and a stack for the whole warp. So we have reduced resources to track the thread
state which means that we don't require additional control units for this. However, this also means that we lose granularity
when processing the data on the threads.
So we don't allow inter-thread communication.
So by reducing the resources to track the thread state, we also limit the communication between threads. And what happens as a result is that we have serialized divergent branches. So all statements within one branch are executed until they're completed,
which means that, for example, in this small code snippet,
for the purposes of executing A and B,
all 32 threads will be active.
However, only four of them will actually execute the code.
The other 28 will wait for the completion
of instructions A and B.
And then, again, all 32 threads will be activated, but only 28 will execute X and Y. The remaining four threads will be idle.
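The snippet on the slide presumably looks roughly like the following (a hypothetical reconstruction; A, B, X, Y, Z stand in for arbitrary per-thread operations):

```cuda
__device__ void A(int *d) { d[threadIdx.x] += 1; }
__device__ void B(int *d) { d[threadIdx.x] += 2; }
__device__ void X(int *d) { d[threadIdx.x] *= 2; }
__device__ void Y(int *d) { d[threadIdx.x] *= 3; }
__device__ void Z(int *d) { d[threadIdx.x] -= 1; }

__global__ void divergentBranches(int *data) {
    if (threadIdx.x < 4) {
        A(data);   // only threads 0-3 execute A and B; under lockstep
        B(data);   // execution, the other 28 threads of the warp idle here
    } else {
        X(data);   // then threads 4-31 execute X and Y while threads 0-3 idle
        Y(data);
    }
    Z(data);       // after reconvergence, all 32 threads execute Z together
}
```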
Yes?
Does this happen at compile time, or does the GPU detect at runtime that there's a branch?
So this is explicitly defined in the CUDA code. We can define this explicitly. And at runtime, the thread scheduler assigns the threads to the memory locations and determines the execution order.
And what happens basically is, again, like all threads are spawned.
Part of them are just idling because of this, until we reach an instruction that is shared by all threads.
And then the state reconverges, and the execution continues.
And so basically the question
is, can we do better than this?
Well, yes, to a certain extent.
Yeah, thread concurrency, as I already mentioned, is not possible in this execution model. Trying to do so explicitly by using locks or mutexes just hurts performance and leads to deadlocks.
The alternative is the interleaved execution model,
where we allow interleaved execution of the instructions
in a thread.
So basically, what happens in contrast to what we've seen
here, in the previous slide, we have now
control units per thread.
So we have the program counter and the call stack
are tracking the state of each thread, not of each warp.
So now we have increased granularity.
This means a slightly higher overhead
because instead of one program counter
and a single call stack, now we have 32 of these.
What this allows us is to have interleaved execution
across code branches and also it
allows the threads to communicate within a warp while being executed.
And basically what happens is if we see the top chart, basically, we have the instructions A and X being interleaved.
Again, all threads are active while executing either of those.
However, we don't wait for the full branch to finish. Rather, we can update the state
by accessing both branches in an interleaved manner rather than wait for one of them to complete. Now, the issue here is that the threads need
to converge at some point.
And basically, the question is, when we get to instruction Z, does the reconvergence happen automatically?
So the answer is somewhat, yes.
So basically what happens is there's auto-reconvergence,
which will happen by default. It happens after the execution
of the first converging statement.
So basically, it will happen after Z is executed,
which basically is what's shown in the code there.
So basically, this __syncwarp synchronizes the state and allows all 32 threads to then execute computations
in parallel.
Why does this happen after the execution of Z?
It's because the interleaved execution model
allows communication between threads.
And if there is no explicit synchronization, the model assumes that this property of thread communication within a warp is being utilized and that there needs to be an update between the threads.
And basically what's shown in the second code block
is what happens implicitly anyway.
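A minimal sketch of that implicit behavior made explicit (placeholder operations, assuming a single warp of 32 threads):

```cuda
__global__ void divergeThenSync(int *data) {
    int i = threadIdx.x;
    if (i < 4) {
        data[i] += 1;      // divergent work for threads 0-3
    } else {
        data[i] *= 2;      // divergent work for threads 4-31
    }
    __syncwarp();          // reconverge the warp right after the divergent block ...
    data[i] -= 1;          // ... so this statement runs with all 32 lanes together
}
```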
Now, my question is, what happens
if we call sync warp before z is executed?
Yeah?
I would assume that the execution of Z is then synchronized across all cores?
Exactly.
And basically, that's the best practice.
If we know that we have a divergent block,
then we need to make sure that immediately after the block,
we do a synchronization so we can utilize the parallelism
to the fullest.
What happens here in the second code block shown,
again, is what happens anyway, implicitly.
And this basically comes at a slight cost of performance,
just because the model assumes that the threads need
to access each other's state within the single warp.
And basically, how are threads executed in a single warp in terms of concurrency?
We can see on this chart here, and basically we can compare this to a CPU core.
And we can see the comparison between what
a high throughput processor is designed to achieve
and what does a low latency processor do.
So the immediate difference is that we
have a significant difference in the memory access time.
So the time that a single GPU thread waits for data to
access the global memory is significantly higher than what
the CPU does.
And basically, what needs to happen, we need to overlap
this memory access time.
And we can kind of calculate what's
the cost of the increased memory access
and how much parallelism we actually need to mask these memory accesses.
So here are some numbers.
So the L1 cache latency in GPU processors is 28 cycles.
In the case of the CPU, this is four.
So basically, to hide L1 cache latency,
we need to spawn at least seven warps to compensate for this.
For GPU global memory, the access latency is around 350 cycles. To hide this, we need at least 88 warps, so that the memory accesses are overlapped with computation and those cycles are effectively free.
And when we issue these memory accesses, we can do this in two ways. We can do this in an aligned way, which means that we coalesce the memory accesses. When accessing addresses in global memory, we want to efficiently use the cache line. And the cache line is conveniently 128 bytes, which corresponds to a single warp with 32 threads accessing four bytes per thread in a single call.
However, if we don't utilize the cache line in an efficient manner, then to access the same 128 bytes we'll need two calls, which means another cycle of thread scheduling and overhead for the two actual cache accesses.
Are there any questions until this point?
Yeah.
In the execution, you showed this one.
You can only execute one branch at one time
on an arbitrary amount of threads, right?
Yeah.
OK.
I was wondering why you can't just
compress this graph and execute x and a.
Yeah, so basically whenever a single instruction is called,
either of a, b, x, or y, all 32 threads in a warp are called.
So basically, now here we are talking about a single warp.
So we are looking at the granularity of 32 threads.
And say we have instruction A that needs to be executed.
All 32 threads in the warp are activated.
But here, we explicitly say that we need only four threads
to do this.
I'll get later in the lecture why
this might be necessary to basically limit or reduce
the number of threads that execute a single instruction.
But yeah, until now, basically, we
say we need the four threads to execute A and B, which
means that 32 threads will be activated anyway.
However, only four of these will process A and B.
The benefit of having the interleaved model
is that we can use the rest of the 28 threads that
will be activated to execute x in the next cycle,
rather than waiting for the whole conditional block
to be finished.
So instead of waiting for a and b,
we can do this in interleaved fashion.
Other questions?
OK, so then let's make a short break.
And then afterwards, we'll discuss the logical execution, how this is implemented on the software level, and how we can develop for this.
Right.
So, so far we talked about the physical execution, what happens on the hardware level, which is a bit abstract
at this point.
However, in this part, we'll talk
about what happens on the logical level,
how we can develop programs that can efficiently utilize
the thousands of cores, the thousands of threads
to achieve high throughput and good levels
of parallelism.
So currently, you can develop GPU programs
in one of three frameworks.
First one that's vendor independent
and the recognized industry standard
across different manufacturers is OpenCL.
If you programmed so far in OpenCL, basically you know that this supports different hardware,
so you can program not only GPUs, but you can do CPU programming, you can program FPGAs,
and it's basically a layer over C and C++.
The proprietary framework is CUDA, developed by NVIDIA for NVIDIA GPUs.
It's based on C++ and Fortran.
And there are wrappers for other programming languages and APIs.
But if you develop pure CUDA, this pretty much looks like C++ with significantly different syntax for particular operators.
And then AMD's response is an open-source framework called ROCm.
So basically, they claim it's vendor independent.
I haven't personally tried it.
I haven't had access to an AMD GPU.
And AMD also develops CPUs and APUs, so basically hybrid CPU-GPUs.
So basically, the claim is that you can develop software
for their different hardware all in the same framework.
So how is a GPU program executed?
Basically, it's a heterogeneous model.
We operate on the CPU and the GPU, and most often it is a sequential execution. We have part of the code that's executed on the CPU; we call it the serial part or the serial code, and it's optimized for sequential execution. Then we have the kernel methods that are executed completely on the GPU; there we want to achieve this high level of parallelism that we've talked about so far.
And then again, we need some synchronization step in the end that's executed again on the CPU.
Mostly this is transferring back to main memory or storage, et cetera. And basically to explain the logical execution on a GPU,
we'll go to a very simple program,
which will basically be adding two one-gigabyte vectors. We'll see how this is executed on the CPU, what the performance is there, and how this is executed on the GPU, and basically how we can optimize this to be more efficient
than the CPU implementation, which is initially the goal of using a GPU to program, right?
So regarding the execution flow, the parallel part
or the kernel code that I mentioned
that's executed exclusively on the GPU
is executed by many threads, or by all of the threads.
So you can imagine a kernel call to be the GPU equivalent
of the execution
of a single CPU thread.
So basically, this kernel executes
the sequence of instructions assigned to the core.
And basically, the way to achieve the massive parallelism
that the GPU architecture offers,
the idea is to spawn a high number of threads at the same time and basically following the
single instruction multiple threads execution model, achieve high throughput.
You've talked about SIMD before.
Now this SIMT versus SIMD is a bit different.
So in SIMD, you execute a single instruction on all data.
What differs in the GPU model is that a single instruction is executed only on the active threads, or as we saw before, only on the designated threads.
Because by default, warps of 32 threads are spawned.
We can manage basically the execution or the instructions
on the thread level.
And if you want to compare this to SIMD, the SIMT execution model is similar to masked SIMD instructions, where you execute the instructions on only part of the vector lanes, not on all of them.
And basically, how can you program for this?
How do you define a kernel?
So a kernel is defined by using the __global__ qualifier in the signature of the method. And then everything that's inside this method will be executed solely on the GPU, by the number of threads that we give it.
And basically, how do we call the kernel? We do this with the second instruction, right? So here
we have the add kernel and then we call the add method. I'm showing this on this screen,
but this doesn't work anyway. So we have this add, then 1 and 256, and then the input.
That's a call to a GPU kernel.
And basically, we'll get through that, but a small spoiler for now.
Basically, 1 means we have one thread block, and we have 256 threads in this single block.
What does a thread block mean?
We'll discuss, or we'll talk about this very soon.
So this is how we define and call a kernel.
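A minimal sketch of what such a kernel definition and launch look like in CUDA (hypothetical names, not the exact demo code). With <<<1, 256>>> all 256 threads here would redundantly compute the same sums; splitting the work by thread index comes a bit later:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel: everything in this method runs on the GPU, once per launched thread.
__global__ void add(int n, const float *x, const float *y, float *out) {
    for (int i = 0; i < n; i++)       // naive version: every thread redundantly does all the work
        out[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y, *out;
    // Unified memory keeps this sketch short; the demo may allocate differently.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<1, 256>>>(n, x, y, out);    // 1 thread block, 256 threads in that block
    cudaDeviceSynchronize();          // wait for the kernel to finish

    printf("out[0] = %f\n", out[0]);
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```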
However, to execute something on the GPU,
we need to send data there.
And we need to, of course, allocate memory for this.
We can do this in two ways.
We can do static allocation,
which is done by using the __device__ prefix. Basically, we say, okay, this integer x needs to be allocated on the GPU.
This is very static, so we define single variables.
However, often we need a chunk of memory, right?
So the dynamic allocation is basically executed
through the cudaMalloc method. There are different ways of calling cudaMalloc.
We can do pinned memory.
So we can do, in separate cases, we
can define the memory location where
we want a particular part of the data to reside.
And then basically, this can be achieved through pinned memory
because then we reserve this location for this for this for
this part of the code or for this chunk of the data and then means that this this will be continuously
loaded into into this part of the memory then we have the memory transfers from host to device and from device to host, host being the CPU and the main memory.
So basically, we have two directions to execute this.
And yeah, basically, that's what happens.
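As a rough sketch of that flow with explicit allocations and transfers (hypothetical sizes; error handling omitted):

```cuda
#include <cuda_runtime.h>

__device__ int d_flag;                     // static allocation: lives in GPU global memory

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_in = nullptr;                 // host-side buffer
    cudaMallocHost((void **)&h_in, bytes); // pinned (page-locked) memory speeds up transfers
    for (size_t i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in = nullptr;                 // device-side buffer
    cudaMalloc((void **)&d_in, bytes);     // dynamic allocation in GPU global memory

    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that work on d_in ...
    cudaMemcpy(h_in, d_in, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_in);
    cudaFreeHost(h_in);
    return 0;
}
```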
And then we can control the memory placement on the GPU
on a bit lower granularity.
So based on how we define a variable,
they can have different lifetimes.
So we mentioned the static allocation with __device__.
Basically, this places the variable explicitly on the GPU.
However, basically, we can define a variable
to be accessible by threads in a block or for the whole kernel.
And basically, if we define a shared memory variable,
it means that this will be accessible to all threads in one block and
will be stored during the execution of a single kernel.
Kernel variables are private for each thread.
So basically these are the variables that are placed in register memory. And basically, other threads cannot access these variables.
And this memory is freed by default
after the kernel completes.
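An illustrative sketch of those lifetimes (a hypothetical kernel that assumes it is launched with 256 threads per block):

```cuda
__device__ float d_scale = 2.0f;          // application lifetime, GPU global memory

__global__ void scaleAndSum(const float *in, float *blockSums, int n) {
    __shared__ float partial[256];        // one copy per thread block,
                                          // lives for the duration of the kernel

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread-private, kept in registers
    float v = (i < n) ? in[i] * d_scale : 0.0f;      // also thread-private

    partial[threadIdx.x] = v;
    __syncthreads();                      // make all shared-memory writes visible block-wide

    if (threadIdx.x == 0) {               // thread 0 reduces the block's values
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; t++)
            s += partial[t];
        blockSums[blockIdx.x] = s;
    }
}
```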
Just for completeness, here is the part of the code that OpenCL would require.
So basically, we want to select a device.
We need to create OpenCL context.
We define the kernel, again, the part where the GPU will process the data.
We create buffers for the vectors,
and we put the device commands or the GPU commands in a queue,
and finally execute the kernel.
This is just for completeness to see how
this is executed in OpenCL.
As I mentioned earlier, we'll stick with the CUDA
implementation.
And now I'll do a short demo so we can basically
see the performance differences.
Yes?
I hope that usually the code for the kernel is loaded from a file and not written as a string, like on the slide before.
Yeah, exactly.
I mean, yes.
That's kind of scary.
Yeah, if I put a kernel.txt there or something, it would be a bit unreadable.
I mean, this is also unreadable in my opinion.
But I put it there just so you know how it's invoked.
And of course, you can do this from a separate file, which
basically then contains more or less C. This
can be called from a C file or C++ file.
So then, as I said, we'll do a short demo.
Again, this is example code.
The example is very basic.
We're just adding two vectors.
And we are going to discuss some principles of how
to basically optimize for the parallelism in CUDA or on the GPU.
So just as a reference, we have a single-threaded addition of two arrays on the CPU. If we just compile this with no optimization
flags and run the code, we have around 700 milliseconds
of execution time.
If we compile this with an O3 flag,
then this drops to 155 milliseconds.
The complete code, you can find it in the slides.
And basically, you can compare the performance if you're interested. Now, the equivalent to vector addition on the GPU would be to define the kernel.
Like we said already, we need to invoke the kernel.
So, as we did a single thread implementation of the CPU code, we'll start with a single thread on the GPU as well. Now, I mentioned that the numbers after the kernel call,
they have a meaning.
And basically, the sequence is the following.
So we define the number of thread blocks.
And we define the number of threads
for executing this kernel.
Then we have memory allocation, right? So we need to allocate the memory for the two vectors, which will be placed in GPU memory and then sent back to CPU memory after the kernel has completed. Some useful code for kernel debugging is the following.
So basically, once you define something in a kernel,
you cannot easily debug this.
And basically, there are defined error checks in CUDA.
Just a second. And so basically you can perform
kernel calls wrapped up in these error checks.
So basically if there is something wrong
or there's any error going on in the kernel,
this will be reported back.
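A commonly used pattern for such checks looks roughly like this (a sketch, not the exact macro from the slides):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap CUDA runtime calls so errors are reported instead of failing silently.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Typical usage around a kernel launch:
//   add<<<1, 256>>>(n, x, y, out);
//   CUDA_CHECK(cudaGetLastError());        // did the launch itself succeed?
//   CUDA_CHECK(cudaDeviceSynchronize());   // did the kernel run without errors?
```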
Yes?
You can absolutely debug the kernel.
So in the... Through nsys, right? Yeah.
So, just at run time, this is not possible, right?
You need to have the nsys suite installed, right?
And there you're more or less doing profiling and debugging. This is a bit more advanced.
This is like the very basic, I want to do like Hello World.
This is what happens.
Yeah?
And printf debugging works as well, so.
Since when?
At least it worked last year.
OK.
So when I tried printf debugging,
this was very unreliable.
So basically, if you execute this on multiple threads,
the problem is that you don't know
which output you're seeing.
Yeah, that, obviously.
Yeah.
Well.
And the buffer for the strings is pretty small, so you can't debug large time intervals, but still.
Yeah.
Again, this is just a safety check.
Also, it's also useful for the memory allocation part,
because sometimes memory allocation errors are just segfaults.
And basically, we want to deal with segfaults as little as we can.
So I found it useful when dealing with some weird memory
allocations or some corner cases.
This helped.
But definitely, if you want to do proper debugging,
you use NSYS.
And then you see what's going on.
You profile the code.
And you can see the state of your variables. So here we have the single thread performance
of this vector addition or array addition
that happened on my laptop.
And if we run this on single thread,
we have an execution time of 19 seconds, which
is a lot slower than the CPU run.
It's two orders of magnitude slower.
And it's not very nice, right?
So let's try to see.
I accessed.
I cannot find.
OK, great.
So I connected to one of the servers.
There we have a bit more advanced GPU,
and let's see, oops.
All right.
Let's see what happens.
And what was the actual performance
in a, say, mid-sized server GPU.
We have the...
So we allocate the memory in the beginning,
and then let's see the results.
For the purposes of the demo,
I'm not running any profiling tool
just because of brevity.
The output is quite large, so the number that we are interested in is this kernel duration. I have it in microseconds; you'll see in a few demos why that is the case. But this is again 15 seconds, which compared to the CPU, 120 milliseconds, is way, way slower.
And this is when we use single thread, right?
When we perform this on a single thread.
But the goal is not to do this on a single thread, right?
So, okay, this should work.
So the question is, how do we improve?
So we need to use, of course, multiple threads.
We've mentioned this so far multiple times.
And to do this, we need to be aware of how the threads are organized and how they are called. To execute a multi-threaded program, we're executing thread blocks. And
each thread block is executed in one streaming multiprocessor. So these threads can communicate now.
And let's not confuse also thread blocks with warps.
So warps are the scheduling units.
So this means that this goes into multiples of 32.
However, a single thread block can have multiple warps inside.
And now basically what we want to have is,
let's say we do a single block with a large number of threads.
And what we want to achieve is high throughput, as we mentioned, but we want to do this by invoking multiple threads. To do this, we need to manage the way in which the threads will access the memory and the data. So there are two ways, or at least two ways, to access the data. One is that thread one is executed on sequential parts of the data.
Then thread two is executed on the next block,
and then thread three and onwards.
The alternative is to have strided access. So basically, one thread doesn't access sequential memory
locations.
However, it jumps ahead: the stride, the number of memory locations that it skips, is the number of threads. So which one do you think would perform better? Which memory access
pattern should achieve higher performance.
Yeah? The strided one is better, because we access a cache line at a time?
Mhm, okay.
I thought it's the opposite, because we don't have to switch cache lines as often. We don't have to switch to the first cache line, then the second, then the third, and so on.
Your reasoning is correct, but basically, the effects
of that reasoning are the opposite.
Basically, to access the cache efficiently,
we need the strided access, because these threads
are executed at once.
So this means that what you were saying,
which is exactly right, you want to utilize the cache.
Now, these three threads will access three memory locations.
And then they'll access basically what's shown here.
Then they'll access the next ones and the following ones.
And what happens in the example above,
thread one, thread two, and thread three
are executed at the same time,
but they're accessing widely separated locations.
And now with three threads, this won't be a problem.
But if we want to execute this on more than 32 threads, which
by default we want to, this will already
mean that we access at least three cache lines, where,
for example, a single warp will access a single cache line here.
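A sketch of the two per-thread access patterns for the single-block vector addition (hypothetical kernel names; the demo code may differ in detail):

```cuda
// Each thread works on a contiguous chunk: thread 0 gets elements [0, chunk),
// thread 1 gets [chunk, 2*chunk), and so on. Neighboring threads in a warp
// touch addresses that are far apart, so their accesses are not coalesced.
__global__ void addSequentialChunks(int n, const float *x, const float *y, float *out) {
    int nThreads = blockDim.x;
    int chunk = (n + nThreads - 1) / nThreads;
    int start = threadIdx.x * chunk;
    int end = min(start + chunk, n);
    for (int i = start; i < end; i++)
        out[i] = x[i] + y[i];
}

// Strided access: thread t handles elements t, t + nThreads, t + 2*nThreads, ...
// In each iteration the 32 threads of a warp touch 32 consecutive elements,
// so the accesses coalesce into full cache lines.
__global__ void addStrided(int n, const float *x, const float *y, float *out) {
    int nThreads = blockDim.x;
    for (int i = threadIdx.x; i < n; i += nThreads)
        out[i] = x[i] + y[i];
}
```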
And let's see what does this mean in terms of performance.
So I have some code from my laptop, but let's see what happens on the server.
So we'll try multi-threaded sequential first. Okay.
So we are already down from 15 seconds to 519 milliseconds, which is quite nice.
So we're executing this on eight warps, meaning 256
threads.
Now let's see what happens if we do.
So this is the sequential access.
And again, with the equals, now we
want to do the strided access.
Are those 12 seconds the copying of the data from memory to the GPU?
No, that's actually initializing one gigabyte
of uniformly distributed integers.
That's the bottleneck in this code, basically.
OK, it's already finished.
Your question, the memory allocation,
is the 80 milliseconds here.
So to compare the performance of the kernel,
we went from 520 milliseconds to 333. So basically, this is the impact of having stride memory access
versus sequential memory access.
And again, the reasoning is the same.
Not the same, but the reasoning is
we want to access one full cache line at a time.
So let's see if my laptop performance numbers agree.
Yes, so here we had around 2x improvement.
Yeah.
And again, the short explanation underneath each.
So basically, we are a bit better than a single thread compiler optimized CPU execution.
So now the question is, is this worth the extra money that you would spend on a GPU
and the overhead of programming in different framework, etc.?
The short answer is no.
So we want to be better than this.
We want to beat the CPU execution time by several multiples.
Again, code references.
So to do that, we go on basically one level
of abstraction higher.
So we know that threads are organized in a single block.
And then thread blocks are organized in a grid.
Basically, what does this mean? It means that we now have a higher abstraction where thread blocks are independent, but they can share results. But more importantly, they are all called in a single kernel call and in a single execution.
And now, basically, what we can do
is we can modify the kernels to access blocks and threads.
So basically, now we want to spread out
the memory access of a single execution unit.
And when we move this to a kernel block,
this allows us to exploit more of the parallelism
from both the computational side,
but also in the memory access part.
So with the block size, we define the number of threads in a block.
Current GPUs allow up to 1,024 threads in a single block.
And basically, the grid size limit for how many blocks
can be in a single grid is basically around a billion.
And now the question is, how can this impact performance?
And here, this is just a small adjustment
of how we access the different blocks in the grid.
So basically, they can be executed in parallel.
So just a quick question.
What amount of speedup do you expect from this?
So we were at around 300 milliseconds with stride memory access and 256 threads.
OK, let's see.
So just before we run the demo, a quick mention of what we changed compared to the previous version. So basically, when we were multi-threading, we were defining the parts of the memory that each thread accesses and how the memory accesses are executed.
Now, if we want to do this concurrently
for multiple blocks, we need to do this also,
like small change here.
So we need to define the stride and the index
for each thread separately.
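The multi-block version essentially becomes a grid-stride loop, roughly like this (same assumptions as the earlier sketches):

```cuda
// Every thread starts at its global index and strides by the total number of
// threads in the grid, so all blocks share the work and the accesses within a
// warp stay coalesced.
__global__ void addGridStride(int n, const float *x, const float *y, float *out) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                 // total threads in the grid
    for (int i = index; i < n; i += stride)
        out[i] = x[i] + y[i];
}

// A common launch: enough blocks to cover n elements, 256 threads per block.
//   int blockSize = 256;
//   int numBlocks = (n + blockSize - 1) / blockSize;
//   addGridStride<<<numBlocks, blockSize>>>(n, x, y, out);
```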
And now let's check what will be the performance application of this.
Okay, so now we want multi-thread, but we also want multi-block, right?
Again, we'll wait for those 10 seconds for the integers to be generated.
Again, this is far from efficient uniform distribution generation.
The goal of the program is a bit different.
So what happened is that we moved from 333 milliseconds to 4.
So basically now we are comparing hundreds.
So on the GPU level, we compare a speed up of,
we started with 15 seconds.
Now we are at 4 milliseconds.
So we have three orders of magnitude improvement here.
Of course, the initial example was extremely bad.
However, it is with just basic utilization of the parallelism
that the GPU offers, so we didn't do any data optimizations.
We just handled the memory access and the thread allocation.
And this is the speedup that we have.
And again, we're not even utilizing the full GPU resources, right?
We access only 256 threads.
And we don't run on the full grid.
Now, let me go back to the slides.
Right?
So on my laptop, this was 20 milliseconds,
which was, yeah, these are the performance gains
that were achieved.
Now, basically, the question is,
how would this perform against the CPU multi-thread, right?
So there, I haven't run the code multi-threaded on the CPU.
However, what's important for this example
is to basically see what are the performance
implications of, of course, running something
in multi-thread fashion, but also what
are the implications of memory accesses and combining threads
in blocks and grid.
Yes?
Can you explain again how you come up with the 256 threads per block?
Could I change this to 512?
Yes.
Does this affect the performance in this case?
I'll talk about that in a second.
So that's an excellent question.
So basically, how do we define this, right? How do we know do we need, whether we need 256 threads
or we do the full maximum of 1024 or 512
or whatever in between, right?
Again, code sample.
So basically, now the other side of the coin comes in, and this is the cost of parallelism. The cost of parallelism is usually an overestimation of hardware resources. So in our example, we had just a pure array addition of not a ridiculously high amount of data.
The GPU that this was run on was a 24-gigabyte GPU, so everything fits comfortably in GPU memory.
We have no issues, right?
However, when we overestimate the hardware resources, we stop utilizing the resources that the GPU has efficiently, and we are adding overhead in terms of control.
So the problem with using the maximum amount of threads and the maximum amount of blocks is that these need to be synchronized at some point, right? So they'll execute the computation in parallel, they'll do efficient memory access to a certain extent, but there is synchronization time and there's scheduling time, and the question is if this pays off.
And for example, so a bit of a different example,
a bit more relevant than adding two arrays.
If let's say we imagine,
so we know how thread blocks are organized,
we know how threads are assigned to thread blocks.
Let's say we have multiple thread blocks that are writing
to the same output files.
So just as a side note, we have thread blocks
that don't share the shared memory
in a streaming multiprocessor.
So these are independent.
The shared memory is reserved for the block.
So now this means that each block
will load its own copy
of the output file.
So immediately, we have redundant data loading
and redundant data synchronization.
And this means that at a certain point, we'll either have a deadlock because of concurrent data accesses, or a large part of the workload will be spent on synchronization.
And basically what this will lead to
is that we have a parallel workload, or a workload
that we designed to be parallel, but that will be inevitably
serialized.
Because we overestimated the hardware distribution of the threads.
So basically, what's the solution here is to assign more work to a single thread block.
So if you remember the example when I was discussing about the physical execution models
where we said, okay, A and B need to be executed only by threads 1 to 4.
This is the reason for this.
And basically, this is called thread coarsening.
It means that we provide more work for a particular set
of threads because we decided that the memory access
and the synchronization cost will be higher
if we over-parallelize this.
And thread coarsening, in the case that I showed with the vector addition, can be interesting because, to answer your question, Moritz, about utilizing all threads, the performance was actually worse. So I tried only 1,024; I didn't try anything in between to find the optimal number of threads.
But 1,024 threads in a single block performed worse than 256 threads.
And it was, I don't remember exactly, but it was at least 2x worse.
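An illustrative sketch of thread coarsening for the same vector addition (the coarsening factor is a made-up example, not a recommendation): each thread handles several grid-stride elements, so proportionally fewer threads and blocks are launched.

```cuda
// Thread coarsening: launch fewer threads and let each one handle COARSE
// elements per step, trading some parallelism for less scheduling and
// synchronization overhead.
#define COARSE 4

__global__ void addCoarsened(int n, const float *x, const float *y, float *out) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride * COARSE) {
        for (int c = 0; c < COARSE; c++) {
            int j = i + c * stride;        // keeps accesses coalesced within the warp
            if (j < n)
                out[j] = x[j] + y[j];
        }
    }
}

// Launched with proportionally fewer blocks, e.g.:
//   int blockSize = 256;
//   int numBlocks = (n + blockSize * COARSE - 1) / (blockSize * COARSE);
//   addCoarsened<<<numBlocks, blockSize>>>(n, x, y, out);
```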
So basically, knowing the workload,
knowing what we need, or basically how the data is
moved, how costly are the computations.
In our case, addition is not costly.
But there are workloads where we need
to optimize the memory access versus the computation.
And basically, the way to do this, we profile the code, right?
You use nsys, or nvprof for earlier generations, and you see where time is being spent and where your threads are idle.
And idle threads will happen if we overestimate the hardware.
So basically, to add to the thread coarsening aspect,
if we have different thread blocks reading the same data, which means that we need to access different parts either in GPU global memory or in CPU memory, then it pays off more to do this sequentially
or to add more work to these thread blocks
rather than infinitely parallelize this
or parallelize this to the maximum.
Because again, we want to reduce global memory
accesses and synchronization.
And in general, I think this is an implicit conclusion
that we can make from this lecture so far.
In GPUs, compute is way cheaper than data transfers.
So if we can assign more compute to a set of threads
to reduce the data transfers, we are already
performing an optimization.
Or we are basically reducing the number of memory accesses
and the need for synchronization and waiting for data.
So to summarize this, again, synchronization and data transfers are the main reasons for the additional cost, for the reduced performance, of excess parallelism. And basically what we need to know is that thread blocks work on
shared memory, so we can utilize this shared memory. If you remember, we saw a table where
the lifecycle of the variables was defined,
and basically what's the level of granularity
of the threads that are accessing these variables.
And basically, reusing shared memory in a single block
is very beneficial to make sure that you don't have redundant
transfers of data and redundant allocations.
Of course, this is not the silver bullet.
Thread coarsening is not beneficial when we have the data distributed across different
thread blocks.
So basically, this means that no additional synchronization
will be required because each thread block
will access their own data, which is all fine.
However, if they need to mix and match,
then this becomes a problem
and it basically pays off more to just allocate the complete workload to a single thread block, or to the thread blocks that are accessing this part of the memory, and execute it in a constrained fashion there. Another two cases where thread coarsening is not beneficial are where the hardware is underutilized anyway, so
this means that we're not parallelizing enough, and the occupancy of the threads and the thread blocks is not maximized.
Just an overview of some of the performance considerations that we had so far in the
lecture, and some of them that we didn't cover so far.
So we've talked about coalesced memory access.
We have the computational benefits and the memory access
benefits of performing such memory access and basically
a way to do this.
I won't go one by one. I'll just mention the ones that we talked in this lecture.
So we mentioned coalesced memory access.
Maximizing occupancy.
This, I didn't mention it explicitly, but this is basically increasing the number of
threads, increasing the number of blocks. And basically, the way to do this is, like we already did,
we do different number of threads
on different number of blocks.
And then we can go even more granular. Some things that we didn't cover, but I mentioned, are the utilization of shared memory.
And also we can access individual registers or basically dedicate registers to individual threads.
Thread coarsening I just mentioned,
basically addressing the cost of parallelization.
What we didn't mention is privatisation or using atomic operations and
atomic variables in GPUs.
We also want to minimise control divergence. So again, this is mainly the mapping of the threads
to the data and rearranging the layout of the data.
And basically, the goal is to achieve high efficiency.
And tiling of the reused data is basically more of the same. Again, data placement and usage of shared memory.
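As a small sketch of tiling reused data in shared memory (an illustrative 1D stencil, not code from the lecture; it assumes the block size equals TILE):

```cuda
#define TILE 256
#define RADIUS 2

// Each block stages its tile of the input, plus a halo of RADIUS elements on
// each side, into shared memory once, so the repeated reads of neighboring
// elements hit shared memory instead of global memory.
__global__ void stencil1D(int n, const float *in, float *out) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gIdx = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int lIdx = threadIdx.x + RADIUS;                    // index inside the tile

    tile[lIdx] = (gIdx < n) ? in[gIdx] : 0.0f;          // main element
    if (threadIdx.x < RADIUS) {                         // left and right halo
        int left = gIdx - RADIUS;
        int right = gIdx + blockDim.x;
        tile[lIdx - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[lIdx + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();                                    // tile fully loaded

    if (gIdx < n) {
        float sum = 0.0f;
        for (int r = -RADIUS; r <= RADIUS; r++)         // reuse happens here
            sum += tile[lIdx + r];
        out[gIdx] = sum;
    }
}
```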
Right.
So to summarize, I introduced the need for coprocessors.
Why do we need them? I talked about GPUs as one of the examples of coprocessing.
We talked about the architecture, the memory hierarchy, and the programming model basically
on the physical level and also on the logical level. The performance considerations that I mentioned just now are part of where we'll start tomorrow, along with some other considerations and use cases of GPU data processing.
Are there any questions so far?
No? okay.
Well, then, thank you very much.
And we'll see each other tomorrow for the second part of the GPU lecture.