Computer Architecture Podcast - Ep 19: Arkaprava Basu - Memory Management and Software Reliability with Dr. Arkaprava Basu, Indian Institute of Science
Episode Date: March 17, 2025
Dr. Arkaprava Basu is an Associate Professor at the Indian Institute of Science, where he mentors students in the Computer Systems Lab. Arka's research focuses on pushing the boundaries of memory management and software reliability for both CPUs and GPUs. His work spans diverse areas, from optimizing memory systems for chiplet-based GPUs to developing innovative techniques to eliminate synchronization bottlenecks in GPU programs. He is also a recipient of the Intel Rising Star Faculty Award, the ACM India Early Career Award, and multiple other accolades, recognizing his innovative approaches to enhancing GPU performance, programmability, and reliability.
Transcript
Hi, and welcome to the Computer Architecture podcast,
a show that brings you closer
to cutting edge work in computer architecture
and the remarkable people behind it.
We are your hosts.
I'm Suvinay Subramanian.
And I'm Lisa Hsu.
Our guest on this episode is Dr. Arkaprava Basu,
an associate professor at the Indian Institute of Science
where he mentors students in the computer systems lab.
Arka's research focuses on pushing the boundaries
of memory management and software reliability
for both CPUs and GPUs.
His work spans diverse areas, from optimizing memory systems for chiplet-based GPUs to developing
innovative techniques to eliminate synchronization bottlenecks in GPU programs.
He is also a recipient of the Intel Rising Star Faculty Award, the ACM India Early Career Award,
and multiple other accolades,
recognizing his innovative approaches to enhancing GPU performance,
programmability, and reliability.
In this episode, we get deeply technical on GPU programmability,
particularly with respect to software reliability and efficiency.
We had a great time nerding out with a few trips down memory lane sprinkled in.
We hope you enjoy it as much as we enjoyed recording it.
A quick disclaimer that all views shared on the show
are the opinions of individuals
and do not reflect the views of the organizations
they work for.
Arka, welcome to the podcast. We're so thrilled to have you here.
Thanks, Lisa and Suvinay. And really, a big thank you for inviting me to this interesting podcast.
Yeah. Well, we're glad to have you. For our listeners, I've known Arka for a very, very,
very long time. He was an intern at AMD research when I was first a young full-time engineer.
And so we go way back and I'm really happy to reconnect here with you now.
So let's start with our typical first question of the podcast.
What's getting you up in the morning these days?
So if I have to literally answer that, then that will be our two-year-old son.
Every day I'm excited to figure out what new word he learns, what new tricks he learns.
So that is the most interesting thing I would say these days.
On the technical side of things, it's how much of our daily life is now controlled or affected by software, and the fact that a lot of today's software relies on accelerators like graphics processing units, or GPUs. And that change has been sudden. For example, for many decades, until a few years ago, it was always the CPU where you ran your software, and there were other devices, including GPUs and so on, that would hang off the bus on the side. It was always a side show. But now, suddenly, the GPU has become the new CPU, and the software that is written for it is quite different from what we were used to for many decades previously.
And that brings new challenges and new opportunities.
And since so much of that software affects our lives, how we use these accelerators, how we write programs for them, and how efficiently the software runs on them really matters. That's what I work on, and that's what's interesting and exciting for me, thinking every day about what I can do and how I can contribute to that.
Very cool. So I think the thing that you said there that really caught my attention the
most is that the software that's being written for GPUs is quite a lot different than the software that you remember writing
for CPUs.
So, of course, CPUs and GPUs, they seem like distant cousins.
They're kind of related and yet very different, such that from an architectural level, there's
a lot of things that we can borrow, but a lot of things that are different.
And of course, that translates then to the software.
What would you say is the biggest difference that if someone were to be a primary
CPU programmer that was converting to a GPU programmer, what's the biggest change that
they would have to wrap their heads around, would you say?
Yeah, the whole programming model is different.
The fact that there is this hierarchical programming model where you have hundreds and thousands
of threads
that can actually run concurrently.
It's not a very large CPU.
It's not a CPU with many, many cores. It's completely different. You have to explicitly give a structure to these threads, and the sheer scale of it is quite different. Then, for example, those who have written multi-threaded programs would expect, say, cache coherence, to use a slightly technical term: you generally expect that if data is written by one thread, you'll see the latest value of it when you read it from another thread. That's something the hardware has always provided for them. Some of these assumptions might not always hold when you come to writing GPU software, because at that scale of concurrency, supporting those nice programming features automatically for the software is hard.
And then there are many other nuances. There are a lot of things that GPUs are really good at, matrix multiplication and so on. But there are things they're not really tuned for: if you have a lot of if-else conditions, which often happens when you write CPU software, those don't run really well and don't utilize the GPU well. So many considerations come up when you're trying to write high-quality GPU software that a typical CPU programmer might not have thought about.
Right. So you talked about cache coherence and closely related problems, consistency, and so on.
And of course, in the CPU world, we are used to fairly, let's say, strong semantics on the CPU side.
And when you go to the GPU side, maybe it doesn't readily come out of the box.
Maybe we can expand on that.
And you talked about applications where you have conditional statements or maybe irregular
control or data flows.
And that's probably one of the places where you hit this problem.
So how do programmers think about this today?
What support does the hardware provide?
And how do you deal with it in the software side, both with respect to correctness, which
you might have to think about explicitly in the context of many, many concurrent threads,
and also in terms of performance?
Now the fundamental reason to go to substrates like GPUs is to get those vast amounts of
parallelism. So how do you think
about both correctness in handling these threads and performance considerations in the context of substrates like GPUs?
Yeah, that's a really nice question. In general, I think about both of these, the correctness that you talked about, like functional correctness, and the performance, say efficiencies or inefficiencies. In my mind, they are two sides of the same coin.
And in general, the broad umbrella that we can actually talk about is quality of the
GPU software.
And it can have two parts.
You want the program that you write to be reliable, the GPU accelerated program to be reliable.
Like it shouldn't be that if you run a program 100 times, 99 times it gives you the right answer, but one time it crashes or gives you something wrong. Because the one time it actually does something wrong can be catastrophic. And we have seen so many examples in real life where a software bug can have a disastrous effect. Now that a lot of software today relies on GPU programs, a bug that only sometimes manifests can cause real trouble.
So in that context, as you rightly hit the nail on the head, one of the big challenges is that when you're talking about GPU software, it is by definition a massively parallel program. Right? And synchronizing parallel programs has always been hard. I remember during my early days of PhD, there was this whole transition from single core to multi-core. So you needed multi-threaded software, and once you had multi-threaded software, there were these different kinds of bugs, like data races, where the program was not properly synchronized, so sometimes it would give you wrong answers or crash. Now, if you map that onto GPU software, this is a problem of a whole different dimension.
Because what happens, particularly in GPUs, is that since you have hundreds of thousands of threads that could be concurrently running on a given GPU, synchronizing across that many threads is costly in terms of overheads. At the same time, given how you have to program GPUs, where you explicitly structure a hierarchy of threads, it's often not necessary to synchronize globally across all the threads of your program.
So there is this unique concept that doesn't exist on the CPU. In GPU programming languages like CUDA or OpenCL, there is this concept of scopes, where the synchronization operations can be qualified to say what subset of the threads is guaranteed to see, or witness, the effect of the synchronization operation you're doing, whether it's an atomic operation, a fence, and so on. Now what happens, and this is again unique to a GPU program, is that you can have, quote unquote, proper synchronization, but the scopes are wrong, because the producer and the consumer of the data are not within the same scope of the operation. So, for example, on the CPU, if you write a program that uses only synchronization operations, you can't have a data race by definition. But you can have such a situation in a GPU program, just because your synchronization operations are not of the correct scope or strength.
So it creates a new dimension in which your software can be wrong. And these synchronization bugs are particularly notorious because, by definition, they are intermittent. Sometimes they show up and cause havoc, and most of the time they might not show up at all. But once they do show up, how do you debug them? The first thing you would try to do to debug that crash, for example, is to reproduce the buggy situation, to make the bug manifest. But if you are unlucky, you might try many times and that synchronization error, because of timing, is not going to manifest itself. So that affects the quality of the program big time.
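To make the kind of scoped-synchronization bug Arka describes concrete, here is a minimal CUDA sketch (a hypothetical kernel of our own, not from any of the work discussed): the producer publishes a flag after a block-scoped fence, but the consumer sits in a different thread block, so the fence's scope is too weak and the consumer may read stale data.

```cuda
#include <cstdio>

__device__ int data;
__device__ volatile int flag;

// Producer in block 0, consumer in block 1. A block-scoped fence only orders the
// producer's writes for threads of its *own* block, so block 1 may observe
// flag == 1 while still seeing a stale value of data.
__global__ void scopedSyncBug()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {      // producer
        data = 42;
        __threadfence_block();                      // BUG: scope too narrow for a cross-block consumer
        // __threadfence();                         // fix: device scope (__threadfence_system() would reach other GPUs/CPU)
        flag = 1;
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {      // consumer (sketch assumes both blocks are co-resident)
        while (flag == 0) { }                       // spin until the flag is published
        printf("consumer read %d\n", data);         // may print a stale value under the buggy fence
    }
}

int main()
{
    scopedSyncBug<<<2, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```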
So before you go on, Arka, that was super helpful.
I think maybe two things.
One is I'll just provide a quick story, which is when we were both at AMD together, I remember
writing a coherence protocol, which is not exactly the same,
but the same sort of problem
where it doesn't manifest every time.
And then in order to test it,
you need to run like 100 million load and store instructions
and see if you have a problem.
And I remember it was like,
oh, found a bug, fixed it.
Now I hit it with 10,000 load stores.
Okay, now I run again.
Oh, I'm so happy because I made it up to 100,000.
It doesn't mean it's bug free.
No, I'm gonna fix another one.
Oh, it made it happen.
And now it's like, you have to wait longer
and longer and longer until you find the bugs.
And it doesn't mean that they're not there.
So I totally get your story.
But the thing that I wanted to sort of maybe step back
for a second and say is, you know,
a lot of our listeners are on the more youthful side.
And so I think their coming of age
tends to be around a lot of ML stuff. And our coming of age was when GPGPU was becoming a real thing and everybody wanted to learn OpenCL or, you know, CUDA. I guess OpenCL for us, since we were at AMD, but you know, CUDA has now taken over the world.
And so many grad students of sort of around our cohort were very intimately familiar with
GPU architecture and what that meant from a computing standpoint.
I feel like it might be a little bit less ubiquitous now.
So maybe you can take a moment to just say, you know, these software scopes that you're talking about are very tightly coupled with the hardware architecture. So maybe you can take a minute
to just say, what is a warp? What is a bunch of threads? What are these scopes that you're
talking about? Because you could imagine, like you were saying, you've got these hundreds of
thousands of threads. And if you say you wanted to do a fence across 100,000 threads, gosh,
that sounds awfully costly. But what is this hierarchical execution
that you're talking about?
Just briefly so that listeners can continue following along
as you keep going.
In GPUs, since you want to support a very large amount of parallelism, the hardware resources are organized in a hierarchy. If you want to relate it back to the CPU, you can loosely think of something called a streaming multiprocessor as being like one core of a multi-core, not exactly, but something like that. And you will have hundreds of these in a given GPU. Within a streaming multiprocessor, the basic structure is single instruction, multiple data, where you try to execute the same instruction on different data items. That's one of the key things; that's how GPUs are so good at parallel processing: single instruction, multiple data.
Now, this hardware hierarchy is also reflected in the software. In this hierarchy there are resources, like the level one cache, for example, that are private to a streaming multiprocessor, the unit we just talked about. And this gets reflected in the programming dialect, say in CUDA or OpenCL. As I said, you eventually launch a kernel, a GPU function is called a kernel, onto the GPU. You essentially provide a grid of threads that can have hundreds of thousands of threads, right? And you then organize these threads so that, say, a group of a thousand threads is part of one thread block. And the programming language says that all the threads in one thread block will always be executed on a single SM, a streaming multiprocessor. Which means that those threads have something in common among them.
They can share a level one cache, for example. They can communicate faster among themselves than with threads that are part of different thread blocks. And this is exactly what scoped synchronization takes advantage of. You want to synchronize because you have a producer of a data item and a consumer of that data item. If, as a programmer, I know that both the producer and the consumer are within the same thread block, the same group, then I know they can communicate faster, say through the level one cache. If they are not, then you have to take a longer route, say to the level two cache. And this makes a big performance difference.
Just to give you a feel of how much it can matter: take a typical fence operation that you use to make data visible to other threads. If it only has to communicate across threads that are part of the same group, the thread block I talked about, it can be, say, 20 times faster than if it has to communicate across multiple thread blocks that are running on the same GPU but in different parts of the GPU. So these scopes are actually exposed to the programmer.
So if somebody is not careful while writing the program, and that can happen, then if you're lucky you'll still see the updated data. But sometimes the program actually has an error, and it will happen that the consumer of the data does not see the latest data, and at that point all bets are off. So that's the kind of complexity that comes in when you're talking about writing good quality programs for GPUs.
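As a quick refresher on the hierarchy Arka just walked through, here is a small illustrative CUDA kernel (names and sizes are ours, not from the episode): a grid of thread blocks is launched, each block runs on one SM, and the threads within a block share fast per-block resources and can synchronize cheaply with a block-level barrier.

```cuda
#include <cstdio>

// Each thread block (256 threads here) reduces its slice of the input using
// shared memory and __syncthreads(), both of which are block-scoped: they only
// involve the threads co-resident on one SM. Cross-block communication would
// need the more expensive device-wide mechanisms discussed above.
__global__ void blockLocalSum(const float* in, float* blockSums)
{
    __shared__ float partial[256];                 // visible only to this block, lives on the SM
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;       // position in the overall grid of threads

    partial[tid] = in[gid];
    __syncthreads();                               // block-scope barrier: ~256 threads, cheap

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = partial[0];
}

int main()
{
    const int threadsPerBlock = 256, numBlocks = 1024;
    float *in, *blockSums;
    cudaMalloc(&in, numBlocks * threadsPerBlock * sizeof(float));
    cudaMalloc(&blockSums, numBlocks * sizeof(float));
    cudaMemset(in, 0, numBlocks * threadsPerBlock * sizeof(float));

    // The launch configuration is the explicit thread hierarchy: a grid of
    // 1024 thread blocks, each with 256 threads (256K threads in total).
    blockLocalSum<<<numBlocks, threadsPerBlock>>>(in, blockSums);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(blockSums);
    return 0;
}
```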
These days, not only are the GPUs getting bigger, but now you have clusters of GPUs
themselves, like these DGXs.
I'm going to show my ignorance here on programming,
actually doing the programming for these sorts of things.
With the standard synchronization primitives that are used right now, the ones that your work, ScopeAdvice, identifies as potentially inefficient, does synchronization potentially span across an entire cluster of GPUs? Does that get even worse? I can imagine it's 20 times faster within a thread block
on one SM than going to another SM. But now if you're potentially also having to go across
NVLink, are there synchronizations that even exist that go across the entire thing and
how slow could that be?
So far we have been looking just within a single GPU, and even there, that much performance is at stake: a single line of code change can make a program something like 30% or 35% faster. But you are right that there is a lot of scope beyond that, in the sense that if you have to go across different GPUs in a cluster, then yes, there is a way to synchronize; there is another, different scope of synchronization for that, the system scope. And if you're not careful, then there's a lot of time you can spend just communicating with another GPU rather than doing compute, which is what these GPUs are essentially built for. So it's very much a wastage of resources.
Yeah, that makes sense. And I'll say, during my time at Microsoft, this kind of
problem comes up over and over again in architecture, right? You want to give a lot of flexibility,
and you want to give programmers the freedom to not have to think about the hardware resources underneath, because otherwise, if the hardware changes, the program becomes non-portable or something like that. So you want to give programmers a lot of flexibility.
At the same time, if you want really, really good performance, then you want to have a very good understanding of the hardware resources underneath so that you can take advantage of them, right?
And so, just at a totally different layer, we would have customers in Azure who say, yes, you want to give us a bunch of virtual machines, and you want the flexibility to just spray them anywhere across the data center. We all want the flexibility. But really, at the end of the day, we want these guys on the same rack, because when communicating within the same rack, all you have to do is go up to the top-of-rack switch and you're fine. I don't want to have to go multiple rows away. And that's essentially the same thing, in an abstract sense, as going to a different
SM. So this notion of hierarchy, which is so useful in computer architecture, and this
notion of leveraging the hierarchy, like trying to balance the tension
between leveraging the hierarchy for performance purposes
and allowing programmers to be flexible and agnostic,
and yet also understanding the hierarchy,
like that's the tricky part.
So it sounds like at the end of the day,
you and your students came up with this tool,
ScopeAdvice, which can essentially come in
and find errors in sort of synchronization scope
in a program.
Can you talk about that for a little bit?
Yeah.
Actually, there are two parts to it, I would say.
One is functional correctness.
What happens is that if you have a producer and a consumer of data that are not properly synchronized, in the sense that the scopes are not big enough to cover both the producer and consumer threads, then you have functional bugs, where things go really bad: the program goes wrong, can crash, can give wrong data, and so on.
The other aspect, which we noticed while doing that, is that programmers are given all this flexibility: you can essentially fine-tune how much cost you want to pay for synchronization, as long as you correctly know where your producer and consumer threads are, right? But programmers prioritize functional correctness over getting the best performance possible. And they are also aware of the fact that if they get it wrong, if they pick a scope of synchronization that is cheaper but doesn't do the job, then there is a bug of the kind we talked about. So what we are finding is that programmers can be conservative, and they might not use these cheaper synchronization operations, the ones that lead to bugs if not used carefully.
So the work that you are talking about, which we published recently at MICRO, builds that into a tool to aid programmers, so that they can focus on writing a functionally correct program, which should be their first job, right? And then rely on the automatic tool to figure out where performance is being left on the table. That way, we make life simpler for the programmers. And that is the series of work that we have been doing, on both the functional correctness side and the performance debugging side, around how we can help with tools, how we can help improve the quality of the software.
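To illustrate the performance side Arka describes, here is a hedged sketch (a toy example of our own, not taken from the MICRO paper) of the kind of downgrade such a tool can suggest: the counter is only ever touched by threads of one block, so a block-scoped atomic suffices, while a cautious programmer might have written the correct but more expensive device-scoped version.

```cuda
// Each block counts its own set flags into its own slot of perBlockCount.
// Only threads of block b ever touch perBlockCount[b], so block scope is enough.
// (atomicAdd_block requires compute capability 6.0 or newer.)
__global__ void countPerBlock(const int* flags, int* perBlockCount, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n && flags[gid]) {
        // Conservative choice: device-scope atomic. Correct, but pays for
        // GPU-wide visibility that this counter never needs.
        // atomicAdd(&perBlockCount[blockIdx.x], 1);

        // Sufficient (and cheaper) here: block-scope atomic -- the kind of
        // single-line change a tool like ScopeAdvice is meant to point out.
        atomicAdd_block(&perBlockCount[blockIdx.x], 1);
    }
}

int main()
{
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    int *flags, *perBlockCount;
    cudaMalloc(&flags, n * sizeof(int));
    cudaMalloc(&perBlockCount, blocks * sizeof(int));
    cudaMemset(flags, 0, n * sizeof(int));
    cudaMemset(perBlockCount, 0, blocks * sizeof(int));
    countPerBlock<<<blocks, threads>>>(flags, perBlockCount, n);
    cudaDeviceSynchronize();
    cudaFree(flags); cudaFree(perBlockCount);
    return 0;
}
```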
And at this point, I should also acknowledge this Computer Architecture Podcast. Long back, I think in 2020, there was an episode I was listening to, 2020 or 2021, but definitely during the COVID time, and Kim Hazelwood was your guest. At one point she talked about how hard it is to make sense of performance when you have a GPU program. And that also affected our thought process: we thought, yeah, we should have better tools that can help programmers write good quality GPU programs. So thanks, Lisa and Suvinay, for having this Computer Architecture Podcast in the first place.
Yeah, absolutely. I think this is definitely one
of the places where you can
exchange ideas and learn about different themes of problems
and maybe how relevant it is in different spaces.
So expanding on tools for thinking about both correctness
and performance, can you tell us about your approach
towards building these tools?
What are the key techniques that you're trying to leverage?
Do you do static analysis?
Do you need dynamic runtime information?
And how does this interplay with both correctness
and performance and the nature of the application itself?
Some applications, you might know all of your data access
patterns statically ahead of time at compile time.
And for others, for example, graph processing applications
or others, you might actually have runtime or input-based
data dependencies. So how do you think about building a tool that can provide the right set of visibility and
guidance to the programmer on how to best get the right balance of correctness, but also getting
the best possible performance for the application that they are targeting?
If it is okay, I'll answer this question in two parts, since you asked how I actually got into it. There is a kind of personal story behind how we started on this project, which actually started initially in 2020 and 2021, so it goes back a long way. And then I'll come back and answer
your question about dynamic and static tools. So during my early days, within about two to two and a half years into my PhD program, I wasn't sure what I was actually going to work on; I was just trying to figure out what my thesis topic would be, and so on.
At that time, I was doing my PhD with Mark Hill at the University of Wisconsin-Madison. A new assistant professor, Shan Lu, had just joined the department. Those who know Shan Lu know she is an expert in bug hunting for multi-threaded CPU software, because at the time, finding bugs, concurrency bugs, in multi-threaded programs was really a hot topic.
So I tried to learn something from her, and even tried to see if we could collaborate and start a project. Ultimately, it didn't completely work out because I got interested in something else, but that learning stayed with me, about these data races, what kinds of errors can happen, and so on.
And then finally, when I started working on GPUs, I initially looked into the GPU memory subsystem, because the virtual memory subsystem is what I had worked on for the CPU for my PhD. But then this learning I had about what can go wrong in concurrent software came back; in the case of a GPU, it's concurrent software on steroids. And many years later, that started this whole thought process that we need to look at what kinds of unique functional correctness issues, what bugs, can happen in GPU software.
So that was a bit of the story of how it all started. Now, getting into the specifics of dynamic tools versus static tools, this is a very timely question. Currently, the tools that we have built and published so far are dynamic in nature, because, as you correctly pointed out, under certain circumstances, for some types of applications, you require runtime information. Graph applications, for example, can be input dependent, right? But dynamic tools have their drawbacks, including very large performance overheads. Even though it's a debugging tool, it's not only the performance overhead but also the memory: it takes a lot of memory space for metadata. And in the case of a GPU, the memory capacity is constrained. So what kinds of applications you can run the tool on becomes limited, because the tool itself is taking up a bunch of the memory that you have.
So the current tools that we have released are dynamic, but we are pretty close to getting something on the static analysis side. We're actually working on something that hopefully we'll submit in a month or so, looking at a static analysis tool and how much information you can actually get. And there are constraints, as I pointed out, like the input information not being there. But we have started from the dynamic analysis tool, and now we are seeing whether, and how much, static analysis can help the dynamic analysis, so that some of the drawbacks of dynamic analysis are mitigated. At the same time, you might not want to rely completely on the static analysis tool.
That is super cool, Arka. First of all, your two-part answer. That is very interesting, how your encounter with Shan Lu at Wisconsin, where you were interested at the time and learned some stuff, didn't pan out then. But here we are, what, probably 10 or 15 years later, and it has actually formed a pretty big cornerstone of your research program. And I think that is really amazing.
And so we usually talk career stuff more at the end of the podcast, but it just seems
like a good time to say, a lot of times what I see in the youth, I don't know why, these days, everybody wants
to get somewhere faster, faster, faster.
So if something is not directly helping you reach your immediate goal, it's like, whatever,
I don't want to do it or something like that.
I mean, I don't want to generalize too much, but that is our society, right?
Fast, fast, fast, go, go, go.
So I just thought it was really cool to hear that story about how this kind of new professor
comes on.
Maybe we can collaborate, maybe we can work together.
Doesn't even work out at the time, but now here we are so much later and it's like the
foundation of your work and you have this nice call out for her here.
And that's another thing.
We have a lot of thank you call outs on this podcast where
people talk about someone who has inspired them, someone they learned from.
And so this is a really collaborative field.
I just want people to really remember that, you know, almost nobody can do any of this by themselves.
So that's a super cool story.
And then with respect to the second piece, the more technical piece, the static versus
dynamic tools.
So on the static side, I guess I've got to sort of go back in my memory a while now.
So when you have something like a thread block, which has a thousand threads in it, that is
being mapped to a particular SM, and it's being mapped to a particular SM in groups
of warps, right?
So like that's what, 32, 64 threads at a time. So that presumably is happening in real time, if I recall. So you don't
have a particular schedule of how exactly those thousand threads are being mapped onto a SM.
So to what extent, I mean, I guess you start with dynamic because it's in some ways a simpler problem in terms
of correctness, right?
And so once you go to static and you lose that information of what order things are
actually happening in, like I know you said something is going to come out soon, but what
would you say is the key difference in what you're able to discover?
And is it any different from, you know, sort of CPU static versus dynamic analysis tools? Was there anything that surprised you there?
Yeah, one thing that always happens with a static tool is that you tend to become conservative, right? And the effect of that conservativeness is that you tend to churn out false positives, because you just don't have the full information. You assume something can happen that actually would not happen given the constraints on the inputs, and then you can report something that doesn't work out. So that is one of the problems with static analysis. And I don't want to give away all our thunder for the upcoming work, but at a high level, what we're looking into is whether static analysis can help the dynamic analysis. You don't need to do everything at runtime. To the question you're asking about how it differs from the CPU side: GPU programs often tend to be more structured than CPU programs. You can draw more semantic information statically, and that can aid your dynamic analysis. So you don't need to do either one or the other; one can help the other.
I think that's a pertinent point,
which is static analysis,
you do lose a lot of information,
but the fact that GPU programs are structured
allows you to glean some semantic information
within the scope of that structure.
And if I recollect correctly, this is not the first place where you have made this observation.
Even in some of your other work, you have tried to extract semantic information where possible
in order to provide the right scaffolding for your
maybe dynamic tools and analysis to either improve performance while guaranteeing correctness or
otherwise. For example, I think some of your other work has looked at unified virtual memory,
so expanding the scope out of only the GPU memory.
So GPUs can also talk to the CPU memory
and that's available in recent GPU platforms.
And in those cases, also you might want to have
like dynamic tools in order to understand
when you should fetch data into your GPU memory,
which is the faster HBM, versus when it needs to be in the CPU address space and at what time do you
trigger these transfers. Maybe you could elaborate a little bit
on how do you think about what is the semantic information that's relevant and
how does this change the flavor of the dynamic tool, like whether it's reactive in terms of what's happening in the application, versus trying to be a little more proactive in figuring out what performance bottlenecks or what performance issues can actually be tackled.
And you're absolutely right.
Like, you know, some of this, what happens often across multiple projects is that you
learn something useful in one project, and then you carry over some of this meta-learning
over to other projects. And on the thing you're pointing out: yes, we have been looking into how to oversubscribe the GPU's memory. As we know, GPUs typically have plenty of compute, but the one thing in short supply is the amount of memory a GPU card can have on board. At the same time, more and more, there's a need to process larger amounts of data. You might be forced to use multiple GPUs just because you're running out of memory on one of them. And GPUs are costly.
It's not the compute; the memory is what's causing the hardship. In that context, as you pointed out, there is a way in modern GPUs to allow programs running on the GPU to access CPU-attached DRAM, which is of course much larger; you can easily have terabytes of memory there. But the problem has been that accessing data in that DRAM from the GPU, over the PCIe interconnect, is pretty slow. And the key observation we had is that you don't need to be reactive to reduce that overhead.
You can be proactive. And proactive in the sense that, as you were pointing out, GPU programs often have more structure in how they are written. So if you do a static analysis of those GPU programs, you can glean a lot of semantic information about the memory access pattern that is going to happen. And then you use this information and feed it to the GPU driver, which is what actually moves the data between the DRAM and the GPU's memory, the HBM. What is really happening then is that the driver itself becomes proactive, in the sense that before the GPU program actually requests some data, it's already there, because your static analysis already told the driver what to expect from this particular program.
So that is a mix of both: static analysis informing how, at runtime, the GPU runtime and the driver react in managing the overall memory, including oversubscription. That is one of the other pieces of work we are currently looking into, how to oversubscribe the GPU memory.
So with UVM, I think the main thing that I heard you say is you do a bunch of static analysis to
figure out access patterns so that you can have the driver maybe pre-perform some memory accesses
so that data can be ready for the GPU.
This allows you to potentially oversubscribe GPU memory.
I guess I just wanted to confirm that what I heard more or less is that you use static
analysis to search for access patterns such that now your driver is almost functioning
like a software prefetcher. Is that pretty much the case? That's part of it and it's an important part of it.
What also happens in a GPU is that you launch so many threads on a GPU card when you launch work on it, but the hardware may not be able to execute all of those threads at the same time. Through static analysis, you can find a relation: say you have a large data structure that won't fit in your GPU memory, you can find through static analysis what subset of threads will access what part of that data structure. And you also know, given the dynamic information coming in at runtime, that the GPU running that code can only run a certain subset of threads at a given time. Combined with the static analysis, you then know which part of that large data structure will be accessed while those thread blocks are running. You try to keep just that part in the GPU memory, while you know this other part of the data structure won't be needed. So the static analysis informs the driver not only what to prefetch, but also what data to kick out, data whose use is done and which is not going to be required anytime soon. That is the information coming from the static analysis.
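For readers who want to picture the mechanism, here is a rough sketch using the public CUDA unified-memory hints (the sizes and the split point are made up; the actual system drives this from its static analysis rather than hand-written hints): the slice the upcoming thread blocks will touch is prefetched into HBM, and the rest of the oversubscribed structure is marked as preferring host DRAM so it stays out of the way.

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t totalBytes = 8ull << 30;          // 8 GB structure: larger than many GPUs' HBM
    const size_t sliceBytes = 1ull << 30;          // 1 GB slice the next thread blocks will touch
    int device = 0;
    char* data;
    cudaMallocManaged(&data, totalBytes);          // unified (managed) memory: oversubscription allowed

    // Part the analysis says is NOT needed soon: prefer host DRAM, keep HBM free.
    cudaMemAdvise(data + sliceBytes, totalBytes - sliceBytes,
                  cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    // Part the upcoming thread blocks WILL touch: stage it in HBM ahead of the kernel.
    cudaMemPrefetchAsync(data, sliceBytes, device, /*stream=*/0);

    // ... launch the kernel that consumes data[0 .. sliceBytes) here ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```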
Okay.
So I guess what I'm trying to figure out exactly is how you would decouple the dynamic execution
pattern of a thread block on an SM and say, like, oh, well, I've
prepared something except oops, you know, this time we're doing the warps in this order instead of
that order. And so then now it's messed up. It sounds like what you're saying is that you're
avoiding that kind of pattern search. And what you're looking for is, okay, if this warp is happening at this time, these access patterns tend to happen, maybe per warp. And therefore, when the next warp shows up, I can identify a pattern and then make sure everything in that warp is gonna be hunky dory.
So, just one clarification: we look at it at a slightly higher level, the thread block level, which is, instead of 32 threads, more like a thousand or 1,024 threads. But at a high level, what you're saying is right: look at what data those threads will actually require. And there is one more thing we found out: in an application, you have different types of data structures. It's not like there's just one data structure that you'll access on the GPU. So what we've also done is handle the cases where static analysis is just not good enough, where you can't figure out what the access pattern would be. Say you have a pointer-based data structure and you're doing a pointer chase; how do you figure out what data you will access? It just depends, right? But the observation there also is that, again,
you don't have to do all or nothing. Because even in an application that has a data structure that is doing a pointer chase, there will be other data structures whose access pattern you can figure out. So what you can do is, for the data structures whose access pattern you figured out, you do the static analysis and guide the driver on what to keep and what not to keep. For the other places, you can just say that static analysis cannot help, so you fall back to reacting as you see those memory accesses, and then you fetch them. So it's not all or nothing.
That makes sense.
I think then if you're looking at things, so two things, if you're looking at things
from within the thread block, I guess the thing that I was wondering is, would the dynamic
scheduling of warps within a thread block affect your prefetch patterns?
If you're doing static analysis, then there's maybe a particular pattern that might be different if the order of the thread blocks ends up being different. So that's one question. Like, how are you making sure that it's not that, I suppose?
And then the second question being then, you know, if it is more or less like a software prefetcher,
and not just a software prefetcher, but you're also like a replacement guide,
I suppose, as well.
What ends up being the trigger?
Because you've got this static analysis,
presumably for a given program,
and now is it saying,
oh, I know this particular program,
because maybe it's different patterns
for different programs.
So what is the actual trigger
that tells it to do something?
So those are the two separate pieces.
Yeah, so we actually do multiple things there, and there are multiple triggers. One of the triggers is for this thing that you called, rightly so, kind of a software prefetcher. What happens is that in GPU-accelerated software, you typically have this pattern where you iteratively call a kernel.
Within a loop on the CPU side, you're calling the same kernel, but it operates on different parts of the data structure. So the trigger is the end of a loop iteration, or you could say the kernel launch for a different part of the data structure, which is when you prefetch the data that will be required. Pictorially, and this is pretty common, you have a loop on the CPU which launches a kernel that operates on different disjoint parts of a data structure. And your trigger is that every time you finish an iteration and go back around, you can prefetch things before the kernel starts running and starts requesting data that you would otherwise have to service, paying the cost of moving that data on the critical path. So this moves off the critical path. That is one example of a trigger.
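Here is a minimal sketch of that trigger (the chunk sizes and the kernel are hypothetical): the CPU loop that launches the same kernel over disjoint chunks also issues a prefetch for the next chunk in the same stream, so its pages are already resident in HBM by the time that iteration's kernel launches, keeping the data movement off the critical path.

```cuda
#include <cuda_runtime.h>

__global__ void processChunk(float* chunk, size_t n)   // stand-in for the per-iteration kernel
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main()
{
    const size_t chunkElems = 1 << 24, numChunks = 16;
    const size_t chunkBytes = chunkElems * sizeof(float);
    int device = 0;
    float* data;
    cudaMallocManaged(&data, numChunks * chunkBytes);   // whole structure may exceed HBM capacity

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemPrefetchAsync(data, chunkBytes, device, stream);          // stage chunk 0

    for (size_t i = 0; i < numChunks; ++i) {
        processChunk<<<(chunkElems + 255) / 256, 256, 0, stream>>>(data + i * chunkElems, chunkElems);
        if (i + 1 < numChunks)                                       // trigger: the next launch in the loop
            cudaMemPrefetchAsync(data + (i + 1) * chunkElems, chunkBytes, device, stream);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```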
Now, about which thread blocks are actually running: first of all, we did a little bit of reverse engineering, and it's not too hard, to figure out in what order these thread blocks will be executed. That is something we do take advantage of, but it's not absolutely necessary. For example, you know that if this set of thread blocks is going to run, they will touch a given part of the data structure, and when you see the first access, you can immediately push all the data that you know will be required for those thread blocks. So we do that too, but we also take advantage of this reverse-engineered information about the order in which thread blocks will get scheduled. It's pretty fixed across generations of GPUs; it doesn't change.
So we talked about a single GPU and multiple GPUs, and then we have expanded the scope to GPUs and CPUs in the context of unified virtual memory.
Maybe expanding that a little further, we have various storage layers in our hierarchy.
You talked about the memory hierarchy in the GPU, and then there is memory on GPU plus
the CPU.
Now, people want access to even more data for different applications.
Some of your upcoming work sort of tackles the problem
of things like KV stores or key value stores.
And those require way more memory
than even what's maybe available in a DRAM.
And so you need to look at other forms of storage
in the entire hierarchy.
So expanding on that, you looked at sort of persistent memory.
It's traditionally been in the realm of CPU systems.
It's also a relatively new memory technology in the sense that it's not
been deployed as widely in real production settings and turns out to be
somewhat finicky to use both in terms of correctness and in terms of performance.
Can you maybe double-click on some of these applications that require these vast amounts of storage,
and how that intersects with tooling to understand what is the best opportunity
to leverage different parts of your memory hierarchy,
how that intersects with concurrency and performance,
like CPUs and GPUs have different attributes,
both in terms of their computational capabilities,
their memory footprint and bandwidth, and also ease of programming.
And ultimately, it intersects with the application
and workload characteristics.
So tell us a little bit about these applications
that require different forms of memory and compute,
or different phases in the application,
and how that intersects with your techniques and tools
to enable these kind of things.
That's a nice question. This also goes back a bit in time. I think in the 2021-22 time frame, you could observe that a lot of different software had started using GPUs as their primary compute platform. But there was one class, one domain of software that needed persistence, like key-value stores or databases, which are pretty important in the general software architecture, that was kind of missing out. And just to show how important this could be: anytime we go to Instagram or Facebook, for example, in the backend, all this data that we fetch probably goes through a software stack named RocksDB. It's a persistent key-value store, a kind of NoSQL database. It remembers things; you can even log your clicks and everything. And you can look up the data with an identifier, which is called a key. We are talking about internet-scale software, which means that throughput is very important: you need to serve a lot of requests together. In that context, GPUs are the throughput engine. However, now you have two things which so far did not gel well, right? You need throughput, which GPUs can provide. At the same time, GPUs are not designed to deal with storage or persistence as such in general. And the reason is that the storage or the persistent memory we're talking about hangs off the CPU, not the GPU. So it is natural to look at how the CPU can leverage them, not the GPU. That is why we started thinking: is there an interesting use case from the software point of view? And we thought that, yes, there is.
Not only this persistent key-value store that I mentioned; remember also that when you are doing many of these GNN trainings, from time to time it is very common to checkpoint the learned weights and so on to somewhere you will not forget them, somewhere in persistent storage. The same happens for really long-running, high-performance scientific computing, like computational fluid dynamics, because if something goes wrong, you don't want to lose all that you have computed through hours or days of computing. So there is a need: a lot of compute has moved down to the GPU, but the GPU doesn't have good, direct access to the storage or the persistence subsystem. That triggered the question of how we can enable this ability for the GPU to directly read and write persistent memory.
That was the starting work, the ASPLOS '22 and '23 works and so on. And then we started digging a little bit more into specific, more commercially deployed software systems that make use of persistence. And that's where we started looking into something called RocksDB, which is used by Meta and actually comes from Meta. Similar software exists from Google, for example, called LevelDB. And what we found was something interesting. Even in this software, and we are not talking about GPU-accelerated software here, but just software that runs on the CPU today and is commercially deployed, if you try to break down how much time is spent where, you find that even though these are what you'd call persistence-aware software, because they need persistence, they want to store something, it's not that they spend all their time there. The majority of the time actually goes in the software manipulating in-memory data structures.
And then only a small part really goes in reading from or writing to the persistent media. This is particularly true if you run them with the fast storage systems of today, including persistent memory. So what it means is that, even for a persistence-aware program, you don't always have to worry about how much bandwidth you have to the persistent medium, or how long it is taking. You are spending a lot of time on the in-memory data structures, and you can speed those up on the GPU, particularly when you have an internet-scale service and you are trying to support a large amount of throughput. So the trick there is to figure out which parts of the current software can be accelerated, which parts have a lot of parallelism and can be accelerated well on the GPU.
Fork them off to the GPU.
Don't try to do everything from the GPU.
Leave the parts that are not too critical to the performance and let the CPU handle
that.
So that division of labor becomes extremely important.
And that's what we show in our upcoming SIGMOD paper, where we show how you can take a commercially deployed software stack, break it up neatly between the CPU and the GPU, and get very high throughput because you're actually able to use the GPUs.
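As a very rough illustration of that division of labor (entirely a toy example of our own; the actual SIGMOD system is far more sophisticated), the GPU side might run a massively parallel batched lookup over an in-memory index, returning 8-byte pointers to values, while the CPU keeps doing what it already does: fetching values from its DRAM and handling the write-ahead log and compaction against persistent storage.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

struct IndexEntry { uint64_t key; uint64_t valuePtr; };  // valuePtr points into host DRAM

// One thread per GET request: probe an open-addressed in-memory index held in HBM.
__global__ void batchedLookup(const IndexEntry* index, size_t indexSize,
                              const uint64_t* keys, uint64_t* results, size_t numKeys)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numKeys) return;
    uint64_t k = keys[i];
    results[i] = 0;                                      // 0 == miss
    for (size_t probe = 0; probe < indexSize; ++probe) { // linear probing
        const IndexEntry e = index[(k + probe) % indexSize];
        if (e.key == k) { results[i] = e.valuePtr; break; }
        if (e.key == 0) break;                           // empty slot: key absent
    }
}

int main()
{
    const size_t indexSize = 1 << 20, numKeys = 1 << 16;
    IndexEntry* index;  uint64_t *keys, *results;
    cudaMalloc(&index, indexSize * sizeof(IndexEntry));
    cudaMalloc(&keys, numKeys * sizeof(uint64_t));
    cudaMalloc(&results, numKeys * sizeof(uint64_t));
    cudaMemset(index, 0, indexSize * sizeof(IndexEntry));
    cudaMemset(keys, 0, numKeys * sizeof(uint64_t));

    batchedLookup<<<(numKeys + 255) / 256, 256>>>(index, indexSize, keys, results, numKeys);
    cudaDeviceSynchronize();

    // CPU side (not shown): dereference the returned pointers in host DRAM,
    // append the write-ahead log, and run compaction against persistent storage.
    cudaFree(index); cudaFree(keys); cudaFree(results);
    return 0;
}
```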
So this is interesting, Arka, because as I was telling you before we started this episode, when you were an intern at AMD and we were starting to think about GPGPU very, very early days,
the use case of cooperation between CPUs and GPUs was a big question.
Was it going to be fine-grained? Is it going to be coarse-grained?
And at the time, we really didn't know.
And so there was the question of how we were going to design these things without having
a solid understanding because there was
no existence of a use case.
And here we are again now.
Gosh, this has got to be like more than 15 years later,
where here you have found a use case of how
to partition work between a CPU and a GPU.
So my question to you here is, so it
sounds like this particular use case that you found
is you're dealing with in-memory databases,
you're doing a lot of work in the in-memory databases,
and then at some point you want to save off a bunch of stuff
to persistent memory.
So it sounds like what you're saying is,
you do the in-memory database manipulation
because that's big and highly parallelizable on the GPU,
and you do the saving off to persistent storage on the
CPU. Because initially, when you first started talking about this observation, I was like,
oh man, are you saying that we want to hang Flash off of GPUs now too? Are they just really going
to become a whole cousin of CPUs where they have the same sort of resources? But eventually you came to, no, we're going to send that stuff back to the CPU to store. I guess the question is, presumably in this kind of use case,
you have a lot of in-memory data manipulation.
But then you're able to boil that down to something smaller
to ship back to the CPU to store.
Is that right?
I imagine it would be a problem if the thing you had to store
was also very large, because you wouldn't necessarily want to ship that back.
Yeah, and the other thing that happens is that when this software was written, the developers, who wrote it for the CPU, were all aware that the storage would be slow, right? So when you start writing such software, you use data structures, like log-structured merge trees, LSM trees, that are suited for that kind of scenario: you want to reduce the time you spend going to the storage. Now, if you attach a very fast storage system, say persistent memory, that time, which your software was already designed to reduce, shrinks big time. That suddenly opens up an opportunity, because if you look at the end-to-end time, you're spending the majority of it manipulating the in-memory data structures before you actually write anything; only a limited amount of time goes into compaction and so on, and then finally writing something to the storage medium. And you can let that part stay on the CPU. That is okay because, with Amdahl's law coming in, that is a small part, so leave it alone. The bigger part, which you are still spending on manipulating in-memory data structures, you can do on something like the GPU. So that is what comes out of that.
There are different challenges that come out of that, because the software expects you to execute some things sequentially, but you are actually executing them in parallel on the GPU. At the same time, you want to keep up the pretense, to the higher-level software that is using RocksDB in the backend, that nothing really changed; it's just a drop-in replacement. What that means is you then need to start playing all these tricks, like keeping multiple versions, to hide the fact that you ran something in parallel whose result you don't yet want to show to the customer, the customer software that is using the backend software, RocksDB. So a lot of tricks come in there, and that's where a lot of the fun is, and the contributions as well.
Yeah, so circling back to the top of your answer, you sort of touched
upon how evolving technologies like in this case storage technologies sort of change the performance
trade-offs, and so the bottlenecks sort of shift across your application, and it also reveals new opportunities for optimization.
One consideration I had in this was, when you talk about the division of labor between
CPUs and GPUs, one constraint that often comes up is that GPUs are typically attached to
the CPU host via a PCIe bus or a PCIe link, and those are typically very low bandwidth.
I wanted to get a sense of how much this plays
a factor into how you think about the division of labor, but also in terms of emerging technologies.
So for example, NVIDIA has the Grace Hopper-based systems where they actually provide a very high
bandwidth between the CPU host and the GPU machine. And so if you have things like NVLink that
provides sufficiently high bandwidth, then would that enable different avenues for optimization for either these
applications in terms of division of labor, or maybe other applications where you have found the PCIe bottleneck is a reason that you're actually hamstrung.
And if you actually had higher bandwidth,
you could maybe go to much higher QPS or queries per second or throughput,
or you could support a different form of division of labor between the
CPU and the GPU.
This is an excellent question. We actually had to handle this.
So as we were talking about division of labor, in general it would actually feel like,
okay, this is division of labor between what happens on the CPU versus GPU.
But there is another division of labor actually happens on the memory side of things.
For example, in a key-value store, your keys are relatively small compared to the values, right? A value can be multiple kilobytes. So one of the things that we actually did is that we kept the values in the DRAM, and put the pointers to those into the in-memory data structures that we kept in the GPU's HBM, the GPU memory. And why was it important? Because a pointer is always eight bytes, but the value can actually be multiple kilobytes. And if we had to move that data to and fro over the PCIe, from the DRAM to the HBM and then back, then the benefit of the speedup that you would get by doing things in parallel on the GPU would be lost. Most of it would be lost. So we had to think about not only the division of labor in terms of what happens where, but also about where the data, the values and the keys, is actually placed. That is also a key part of the decision-making process.
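To make that placement concrete, here is a minimal editorial sketch in CUDA C++: small, fixed-size index entries (a key plus an eight-byte pointer) live in GPU HBM, while the multi-kilobyte values stay in host DRAM and are only dereferenced after the parallel lookup. The open-addressed hash index, the zero-key-means-empty convention, and the names (GpuIndexEntry, gpu_lookup) are simplifying assumptions for illustration, not the actual data structure from the work discussed.

```cuda
#include <cstdint>
#include <cstddef>

// Lives entirely in GPU HBM: 16 bytes per key, regardless of value size.
struct GpuIndexEntry {
    uint64_t key;         // key (or key hash); 0 is reserved to mean "empty"
    uint64_t value_ptr;   // 8-byte pointer to the value, which stays in host DRAM
};

// Each thread resolves one query against the HBM-resident index using
// linear probing. Only 8-byte pointers are produced; no value bytes cross PCIe.
__global__ void gpu_lookup(const GpuIndexEntry* index, size_t capacity,
                           const uint64_t* query_keys, uint64_t* out_value_ptrs,
                           int num_queries) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_queries) return;

    uint64_t k = query_keys[i];
    for (size_t probe = 0; probe < capacity; ++probe) {
        GpuIndexEntry e = index[(k + probe) % capacity];
        if (e.key == k) { out_value_ptrs[i] = e.value_ptr; return; }  // hit
        if (e.key == 0) break;                                        // empty slot: miss
    }
    out_value_ptrs[i] = 0;  // not found
}
```

The host (or the GPU, through pinned host memory) then fetches only the values that were actually hit, which is the point being made about keeping kilobyte-sized payloads off the narrow link during the parallel part of the work.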
Now, coming back to your other question: hey, if you had a much faster link between the CPU and the GPU, would you have designed something differently, and would it actually make some things much faster? The answer is certainly yes.
For example, one of the things that we did not talk about much is that in this persistent key-value store, although ultimately you persist things from the CPU and so on, at the same time, when you are writing something, you log as well. And in our case, all of that logging actually happens from the GPU directly to the persistent storage.
Those logging requests cross the PCIe, which can easily add around 300 nanoseconds of latency each way. If that link were faster, performance would improve significantly. You could actually do much more, because the logging overhead is non-negligible: if I remember correctly, 15 to 16 percent of the time goes into logging. Even though you are doing it from the GPU and there is parallelism, it still takes time, because the persistent medium, the storage, is not attached to the GPU. It is on the other side of the PCIe, on the CPU.
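A rough editorial back-of-the-envelope on that (the per-operation latency below is an assumed figure for illustration, not one reported here): a log record that must cross PCIe pays about 2 × 300 ns = 600 ns round trip. If a logged write takes on the order of 4 microseconds end to end, then 0.6 / 4 ≈ 15 percent, which is in the same ballpark as the 15 to 16 percent logging overhead just mentioned, and it shows why a lower-latency CPU-GPU link would directly shrink that fraction.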
Certainly looks like a very rich area for further research as newer systems and technologies come to the fore. And yeah, we look forward to reading about your work at SIGMOD this year. So you've talked to us about a variety of different projects, and we have peppered this with the origin stories for those projects, but maybe this is a good time to wind the clocks back and ask you about your own origin story. How did you get interested in computer architecture, and how did you end up at IISc?
I'll be happy to provide my perspective. I really wanted to understand how computer systems actually work; I think that is where I got interested in computer architecture. At the end of the day, whatever we do in the algorithms or in the software ultimately needs to run on the silicon. Without that, we get no value. That is why I felt like, okay, I'll start from the bottom of the stack and figure out what's happening there.
And then what really happened, and this is also a personal story: as I mentioned, I started working under Mark Hill at Wisconsin. And as you can see from our earlier conversation, I was still searching for what could be my thesis topic, because none of what I initially worked on actually went into my thesis.
What happened there, and I would say this was maybe a stroke of luck: at that time in Wisconsin, the way it typically worked is that there were rooms where the students would sit, and the computer architecture students would always sit together in a set of rooms. So you are in a room with the computer architecture students, then there are some rooms where the systems students sit, and another floor where the theory students sit, and so on. That's how it was organized.
In my case, what happened, around the second year of my PhD, is that I got assigned to a room where the other person in the room was a systems person. It happened to be Haris Volos, who worked on operating systems. And at that time, he was doing the early work on the CPU side on persistent memory, the memory subsystem and so on. Now, looking back, we spent a lot of time discussing ideas. A lot of them, in hindsight, were not so well thought out, but we discussed a lot.
And I think the reason I went into the virtual memory system, and why in my PhD I did a lot of co-design of the operating system and the hardware, is partly because I was sitting next to a systems person and listening to what he was doing. There is a kind of induction that happens. And that's how I got into virtual memory and so on. After my PhD, I joined AMD Research, and until then I had never spent any time thinking about GPUs. When I joined AMD Research, the exascale project was in full swing, and it was pretty clear that if you want to get to those exascale systems, the flops are mostly going to come from GPUs. So for me, there was no option but to work on the GPU, which turned out to be a big blessing. That's why I started working on GPUs, but I knew virtual memory. So what do you do? You start doing virtual memory for GPUs, right?
Then, as I got more and more comfortable with the GPUs, I eventually had to move back to India. There was a personal reason; my dad was falling sick and so on. So I said, okay, I'll move back to India. I had an opportunity to join industry, Microsoft Research in Bangalore, but one of the things I had come to understand is that I really liked and enjoyed mentoring juniors, the students and the interns. So I thought, okay, I'll give it a try and see if I can create students who are passionate about doing systems research in India. And that's where I landed, at the Indian Institute of Science.
That's quite a history, with a lot of serendipitous things, right? You come across this person. I think the thing we're learning here is that you come across certain people, you learn some things from them, and eventually that provides fertilizer for future work, right? And I think that is something our listeners could take away: every encounter with anyone, cross-field or same field, is potentially fertilizer for future work. And you're a case in point. And now here you are at IISc, a premier institute in India. And we're happy to have you, except now you're on sabbatical, right? At ETH, is that right? Or was it?
I'm actually on sabbatical at EPFL.
EPFL, yeah, awesome. So how long have you been at EPFL?
I've been here a couple of months, almost three months now. I'll spend the next few months here before heading back, so that's the plan.
Cool. India, pretty different from Seattle. You've been all over the world, doing computer architecture research, systems research.
Yeah. As I said, I learn from people, and that's what I'm here for. Hopefully I'll bring back some learning that in the future might become the start of another thing, another project.
Yeah, the traveling computer architecture researcher. It reminds me of Paul Erdős, the mathematician who used to hop from place to place, visiting different mathematician friends in different cities and universities, and everywhere he went he would pick up a new problem based on the conversations, put it in his notebook, and then solve it either right then or at some point later. So maybe in your journey as well, you've interacted with multiple people, and as you said, serendipitous interactions, and structured ones as well, lead to the seeds of new research projects and ideas.
Very cool.
So before we close, do you have any words of advice
for our listeners?
I hope there are young students, maybe PhD students and so on, who are also listening to this. One of the things that I have learned over the years is that as long as one is learning something new, we don't need to look for immediate results. If you learn something new and learn it well, careers are long; you never know how it will come into play later in your career. So you don't always have to look at the clock. And that way, one would actually enjoy the research journey more.
I love it because it's true.
If you think of it as a long game, then everything is future fodder.
If you think of it as a short game, then we're constantly feeling like we're failing.
So it is a long game.
I mean, I've been around a lot.
I feel like I've sort of watched you grow
up in some ways, Arka, because you were just like a baby-faced intern, tons of enthusiasm. And now
here you are, like this well-established, well-published professor at the top institute.
And so it's just been very cool to come back and talk with you.
I really appreciate this opportunity. Thank you.
We are very happy to have you here.
Yeah, absolutely, Arka. I think it was a delight speaking with you.
And to our listeners, thank you for being with us on the Computer Architecture Podcast.
Till next time, it's goodbye from us.