CppCast - Volta and Cuda C++
Episode Date: September 1, 2017
Rob and Jason are joined by Olivier Giroux from NVIDIA to talk about programming for the Volta GPU. Olivier Giroux has worked on eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the forms and semantics of valid GPU programs, present and future. He was the programming model lead for the new NVIDIA Volta architecture. He is a member of WG21, the ISO C++ committee, and is a passionate contributor to C++'s forward progress guarantees and memory model.
News: Visual C++ for Linux Development with CMake; Sourcetrail 2017.3 released - cross platform source explorer; Call for CppCon Lightning Talks and Open Content; C++17 STL Cookbook Book Review
Olivier Giroux: @simt
Links: CppCon: Designing C++ Hardware; Inside Volta: The World's Most Advanced Data Center GPU; Inside Volta Slidedeck; NVIDIA Dev Blog
Sponsors: Backtrace; JetBrains
Hosts: @robwirving @lefticus
Transcript
This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you
spend less time debugging and more time building. Get to the root cause quickly with detailed
information at your fingertips. Start your free trial at backtrace.io slash cppcast.
And by JetBrains, maker of intelligent development tools to simplify your challenging tasks and
automate the routine ones. JetBrains is offering a 25% discount for an
individual license on the C++ tool of your choice, CLion, ReSharper, C++, or AppCode.
Use the coupon code JetBrains for CppCast during checkout at JetBrains.com.
Episode 116 of CppCast with guest Olivier Giroux, recorded August 31st, 2017.
In this episode, we talk about CppCon open content and STL cookbooks.
Then we talk to Olivier Giroux from NVIDIA.
Olivier talks to us about his work on the Volta architecture.
Welcome to episode 116 of CppCast, the only podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
Good, Rob. So we can keep saying that despite Jens' comments last week?
Well, we haven't heard of any new podcast
yet, but yeah, we do encourage
if you're out there and have any interest in starting a podcast,
there is room for more in the
C++ community.
Yeah, we might have to change it to say someday
the only English language one or something.
We'll see what happens.
Yeah, I'll see what happens.
You know anyone down in Texas?
Everyone okay with Harvey that you might know, Jason?
I don't know anyone directly who was affected.
So I honestly haven't been watching the news a whole lot because I've been busy getting ready for conference prep stuff.
Yeah, I don't think I know anyone personally.
But if any of our listeners are down there, I hope you're okay with
the storm.
Well, at the top of every
episode, I'd like to read a piece of
feedback this week.
I'm going to butcher this name. It's
coming from a listener named
Wojcik, maybe?
He writes
in, thank you guys for such a great podcast.
For me, the podcast is not only C++ news and libraries but also a passion. Because of your show, I don't only write code, but I also started to love C++.
Thank you guys. I hope you will never stop recording C++ podcasts, and also a separate thank you to Jason for the great C++ Weekly.
So yeah, thank you for the feedback, and I'm sorry for horribly mispronouncing your name.
Yeah, thanks. That was great.
Yeah, and we'd love to hear your thoughts about
the show as well. You can always reach out to us
on Facebook, Twitter, or
email us at feedback at cppcast.com,
and don't forget to leave us a review on iTunes.
Joining us today
is Olivier Giroux. Olivier
has worked on eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the forms and semantics of valid GPU programs, present and future. He was the programming model lead for the new NVIDIA Volta architecture. He is a member of WG21, the ISO C++ committee, and is a passionate contributor to C++'s forward progress guarantees and memory model.
Olivier, welcome to the show.
Thank you. I'm glad to be here.
How did you get started working on GPU stuff, Olivier?
Oh, I had an internship with Microsoft one time, and then I had just come back from that and NVIDIA's lead architect came and
visited my school and
I'm one of the few people he
interviewed there on site and then he gave me another
internship and then I just never left.
It's
basically been 15 years now.
So I
like it.
Wow.
That's pretty cool.
Yeah, well, we got... oh, go ahead.
Oh, I was just going to say, when I joined back then, all the interviews were, they were all graphics.
NVIDIA didn't do anything other than graphics. I remember doing vector algebra and projecting, you know,
primitives, projecting triangles onto planes
and doing intersections of surfaces and other things.
And it was tough.
It was really tough.
You know, it was like an eight-hour thing back then.
Wow.
Yeah, I would be out of my league.
My 3D geometry skills are not up to that, for sure.
Well, Olivier, we've got a couple of news articles to discuss, and we'll start talking about the work you've been doing recently at NVIDIA, okay?
Yep.
Okay, so this first one is an announcement from the Visual C++ blog. Their 15.4 pre-release is out, and in it there's going to be support for Linux development with CMake, which should be
pretty exciting. You're already able to do Linux development from Visual Studio, but you had to
create a Visual Studio project for it, which is kind of weird.
But now the CMake support that they had
for Windows is being extended to
support Linux development as well.
It's not
explicitly mentioned in here, but
it talks about setting up
your remote Linux target
with a connection manager.
And it doesn't explicitly say
how easy or hard it is
to do that with the Linux subsystem for Windows.
I believe it can be done, though.
If you have a Linux subsystem, you can use your own Linux,
your own Windows box as a Linux server, yeah.
But yeah, they don't go into that explicitly.
I was thinking somewhat cynically before our episode here
that I love Linux, I use it for most of my development,
I run Windows everywhere.
My real problem is when I need to port to macOS.
I need a Mac subsystem for Windows
that I can just easily connect to with Visual Studio or whatever,
and then I don't have to ever boot my Mac box.
I am sure Windows or Microsoft would be happy to do that, but I think Apple would sue them
if they ever tried to.
Is this anything that you work with at all, Olivier, Visual Studio in your work or in these
tools?
Oh, yeah.
Oh, yeah.
I've actually been a bit of a black sheep in the architecture team here for always insisting to work on Visual Studio.
Visual Studio pretty much got me started with C++, actually.
And in fact, it's funny because you're talking about Linux builds from Visual Studio.
And that's been my day-to-day experience at NVIDIA for 10 years now
because 10 years ago,
we wrote a tool that ingests VC project files
and then spits out a make file.
Oh, wow.
And so we...
Actually, for the NVIDIA architecture team,
many of our builds,
those that are not
silicon builds, obviously, but
most of our C++ builds are
enshrined in VC project files as their
canonical build
specification. And then
obviously that works natively
on Windows, and then we use our tool to
convert to make files and make
on Linux.
And yeah, so that's been my normal thing,
moving between Windows and Linux, you know, 10 times a day.
So have you guys considered moving that stuff to CMake or Meson
or any of the other tools that would automatically generate
the various build targets for you?
Well, the thing that... One thing that stopped us from doing that
in the past is we really wanted
a native experience
in the
WYSIWYG part of
Visual Studio.
Tweaking and building in Visual Studio is
really easy.
Right-click on this project, add an option here,
click OK.
And it's very self-discoverable.
You don't know what the options are a priori?
Fine.
Open that properties dialog,
and then see what the options are there.
And then click on the combo box
and see what the values for this thing are.
And in comparison, most of the other build tools
assume you're already a build ninja
and you know everything you want to do.
And that's not...
For many people who are here
doing GPU architecture,
that's not their bread and butter.
That's not what they do.
And so they kind of want
a build system that gets out of the way and is easy to use.
So that just fit the bill at the time.
Now, it's not to say that, you know, this is just like VI and Emacs.
You know, it's not to say that this hasn't been a subjective hot debate for that 10-year period, basically.
I'm sure if you were starting from scratch today, you probably wouldn't roll your own solution, maybe.
Probably not, no.
And to be fair,
I was just working with
Visual Studio CMake integration, and
getting started out of the box
with a CMake build folder is
just trivial, it just works.
But as soon as you want to change any of the configuration options, it seems
to be very non-obvious how to do that. They have not integrated it with their
GUI or anything yet, but their CMake integration is definitely getting better.
Okay, this next article we have
is an update to SourceTrail, which is some software we talked about a while back.
How do they describe this again, Jason? It's more of a source explorer. It's not an IDE, right?
It is not an IDE. No, I actually have played with it a few times since they first mentioned it to us, and then with just this release also. It looks like the main new feature of this release
is interactive tooltips,
where if you're hovering over something,
it gives you the full signature of that function or method
or whatever it is you're looking at.
Right.
It's a pretty darn neat tool.
Did you play with it at all, Rob?
I don't think I have yet.
Okay.
It gives you this great flow chart of what your code,
what functions are calling other functions,
and you can click on it and expand,
and it shows you the source code and does mouseovers
and hyperlinking between everything.
It does a really good job of it,
but I think more interesting than just the fact that you can do that
is it also shows you what templates are instantiated in many cases.
And then you can get slightly more detail there about the template stuff that's happening, which I like.
Because I tried browsing with ChaiScript, which has many, many template instantiations.
And it does a pretty good job of keeping up.
Right.
Yeah.
And this next one is calls for lightning talks and open content at CPPCon.
And I'm not sure if we've mentioned this before,
but this includes an announcement at the bottom of the post saying that all content is going to be free on Friday of CPPCon.
That's including the plenary speaker, Matt Godbolt, and all the other sessions that day.
Oh, I didn't even see that. I didn't even realize that was in this announcement. Yeah.
So yeah, if you're in the Seattle area and weren't planning on going to CppCon already, but
you know, you want to go check out some of the sessions, you can go and do so for free on Friday.
It'll be really interesting for those of us who are there for the whole week to see just how much busier it gets on Friday, or whatever happens with that.
I mean, there's plenty of, you know, tech companies in the area employing C++ developers.
So you'd think there'd be some people interested in that.
You know, you got Microsoft and Amazon right there.
Right.
But yeah, in addition to that, lightning talks and open content,
they're now looking for submissions for that at CppCon.
And the open content,
those are sessions during lunch and breakfast, I guess?
I don't recall that from last year.
Yes, yeah.
They're at various times during the day,
not normal track sessions.
And it does explicitly say again this year, open content does not require conference registration.
So anyone can present, anyone can attend open content sessions.
Right.
And then lightning talks are five minute talks that are usually in the evenings.
I think they did that for two or three nights last year.
I'm not sure if that's the same plan for this year or not.
Right. You know, I will say two years ago,
I gave an open content session for the fun of it when I was also giving a
regular session that year.
And it is so much less stress because you know,
the camera's not recording.
And so if,
if you wanted to try your hand at speaking at a conference,
you didn't get a submission in for CBPCon,
try an open content thing.
There's less pressure to get accepted,
less pressure when you're actually up there talking.
Okay, and then this last article,
C++17 STL Cookbook Book Review.
And this is a new book that has just been released from, I'm going to butcher
this name too, Jacek Galowicz, I think. And it's over 90 recipes that leverage the powerful
features of the standard library and C++17. And the review is making it sound like it's a pretty
good read, good book. And if you are interested, the author of this
blog has managed to get, I think, five copies of the book that he's going to give away. If you're
interested, you can register at the bottom of this post for that. All right, did you look
at this one also, Olivier? Any comments? Yeah, yeah, I looked at that. Obviously, I zoomed straight to item 9, the SG1, my bread and butter.
I might slightly prefer if people used std::async a little bit less.
That's probably the SG1 PSA there.
We're still working on that.
But yeah, seeing the parallel STL
come to life, that's really
exciting.
We worked on...
There was a lot of
new standardese
added there. We worked hard on that.
I am still looking forward
to being able to actually try that in one
of the standard library implementations, if you will.
Last I checked, neither GCC nor Clang was able to do parallel algorithms yet, anyhow.
Surely coming soon, right? You would think, yeah. Well, I've noticed, I mean, I've noticed a lot of
the people who work on that, you know, making noises through reflectors or IRC channels or sending email to one another.
So asking, how did that work?
How did you manage to parallelize that?
And what's the performance like and all that?
So it's clearly being worked on.
Yeah, I've seen a handful of tweets from the Visual Studio team in that regard also
with a couple of comments that there's a couple of algorithms that no matter what they did,
the parallel version was slower than the linear version,
so they were going to recommend against using those ones.
I don't recall what the details were, though, unfortunately. It's perfectly valid to, you know...
std par communicates information from the programmer
to the implementation that, were the lambdas
passed to this algorithm invoked in parallel,
they would not introduce data races.
And then it's up to the implementation
to decide what to do with that,
and running it sequentially is perfectly valid.
There's no rule that says,
if you pass in std par,
it shall run in multiple threads.
Okay.
That's not in the standard.
This is, 'please, I think that it would be a good idea,'
and the library implementation is free to say,
'no, it really wouldn't be a good idea.'
The way that I see it
is you're communicating
semantic information. You're saying it's
possible. It is possible, yes.
The program would not
become undefined if you parallelize this.
Okay.
Which is the case for a normal algorithm, a normal sequential algorithm.
The lambda can do anything.
The lambda is allowed to read and write any C++ object.
And the memory model says if you have multiple threads and more than one is writing, or one is writing and another one is reading, then you have a data race and your program is undefined.
So the implementation cannot just go in and say, I think I'm going to run this on the thread pool now.
It would have to prove the absence of aliasing between the references of different invocations of the lambda.
And we all know that alias analysis is hard, and it's particularly hard in C++.
Right.
So this sort of automatic parallelization doesn't happen.
But once the user tells the implementation that there would not be any data race in the case of par.
And in the case of par_unseq, you're saying further,
not only would there not be any data races,
but this code does not assume forward progress guarantee beyond lock-free.
So then you can also vectorize it, also without any further proof.
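For illustration, here is a minimal C++17 sketch of passing those execution policies; the container, sizes, and lambdas are made up, and an implementation is free to run either call sequentially:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 1.0);

    // par: were these lambdas invoked in parallel, they would not introduce
    // data races; the implementation may still run them sequentially.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](double& x) { x = x * 2.0 + 1.0; });

    // par_unseq: additionally, the lambdas assume no forward progress
    // guarantee beyond lock-free, so they may also be vectorized.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](double x) { return x * x; });
}
```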
So from your standpoint, would you suggest that when C++17 is fully out there,
we all go and look at our usage of standard algorithms and mark them par or par_unseq where we know it's a correct and valid thing to do,
and then just let the implementation do what it wants with it?
Or should we only do that
if we're trying to optimize our use of the standard algorithms? So in general, my first
advice would be, you should always tell the implementation as much semantic information
as you know. And the implementation can be clever about that.
Especially as, and we're going to talk about this in a few minutes probably,
but especially as the variety of implementations increases,
you should not prejudge what the implementation might do with that semantic information. More information
helps implementation. Okay. Now, this said, before executors land, which is the case of
C++17, which does not have executors, before executors land, it's difficult for you to apply constraints on top of that to say,
well, here's the semantic information.
This would not introduce any data races.
But I don't want you to spin up 100 threads to run this
because I know that this would thrash my system
because something else is happening. So my first advice is give the implementation
as much semantic information as you have,
and then I just need to dither on that a little bit.
It might be that for some number of years,
doing that would be a de-optimization.
Okay.
And then in that case, as you optimize,
you may want to roll it back.
Leave a little comment in your code.
You know, 'TODO: C++20, use executors here.'
Having a hard time porting my stuff to C++17 right now,
it's hard to think about C++20 to-dos while I'm at it.
Okay, Olivier.
So your talk at CppCon
is titled Designing C++
Hardware, and your abstract says
you can run C++ on any computer
you want as long as it pretends it's an 80s
computer. What do you mean by that
exactly?
The first thing is I want to get your attention.
But
I have a real thesis here, which is that C and C++ and CPU designs are sort of in orbit of one another.
They're kind of a binary star system, and that really limits how much freedom people have to go and make radically different designs.
You know,
CPU engineers,
so, you know, people working on C++ look at CPU specs
and they say, oh, this is,
you know, we have this feature and that
feature, what can we do with that?
Well, I want to
peel back the curtain and tell you that there's
people on the other side
who are doing the same
thing in reverse. They're looking at the languages and they're like, oh well, but this language, it
needs this, and so we can't build our design that way. Then we need to make this other change, and we
need to add a layer of emulation on top of whatever we were truly intending to do, because if we don't have this layer of emulation,
then
C++ would keel over.
And
there are things that are not getting done.
There are things that are not done.
They're not built. There are things that people talk
about that don't get built
because it's too much risk.
It's too much risk to
depart from the, quote unquote, canonical design.
And so we'll talk about it and we'll say,
well, if we spend five, ten man years building this feature here
and then we really don't have much of a story
for how it could work with C++,
then it gets killed at some point.
I mean, legitimately.
Legitimately, we say, no, we can't do that.
We can't pin a billion-dollar design
on a feature that might not work with the language.
That's interesting.
So you mentioned C++ specifically,
but really this applies to, I guess, all compiled languages
are all kind of similar in what they can do, right?
Well, I think it may surprise you to hear that Fortran is better at this.
Right.
Well, in part because we don't have massive operating systems written in Fortran today,
so it's not too much of a systems language.
In the systems languages, the machine has sort of leaked up.
You know, the abstractions don't hide the machine very well.
And so you can see it better from C and C++, and that makes them a little bit more constraining.
Yeah, I mean, in theory, it does apply to all languages, but pointers of all kinds, function and data, certainly apply the most pressure on the design.
Okay. Well, then I guess that
you've been working on the NVIDIA Volta
and you're at least implying
that it doesn't pretend to be an 80s computer.
So how does it differ?
Actually, I would say
it has a very clever way
of kind of looking like an 80s computer
if you squint at it.
But it really is not internally. very clever way of kind of looking like an 80s computer if you squint at it is but it just it
really is not um internally so the the main thing was you know we we started back back when we when
we started volta um we here's the mandate that we started with um people think gpu programming is hard um you go on on google and
you start typing you know cuda programming and then the little autocomplete comes in and then
and then the autocomplete can be is hard is you know things like that and uh so um
so we said okay well we we we need to make a list of the biggest problems.
And we knew for a really long time, you know, the disjoint address spaces was a very big problem.
80s computers have flat address spaces.
You know, memory is all uniform.
It's all one pool.
It's all one pool.
You can reach all of it from any of your threads. So, you know, so we knew that and we've been working on that for years.
Actually, Pascal, the architecture that comes before Volta, moved the needle quite a bit
on what memory you can address. But then one of the next things down that we ran into was what can threads do?
Like what kind of code can threads run?
We've been talking about GPU threads.
I don't even have another word for it.
We've been talking about GPU threads being threads by definition.
But there was code that they couldn't run. Like, you could go on Stack Overflow
and search for writing a mutex.
And I'm using a mutex as just an example
of a concurrent algorithm.
If I put my SG1 hat on,
the issue here is the distinction
between lock-free programming and starvation-free
programming.
A mutex
is a starvation-free algorithm.
And GPU threads
had a really hard time running that.
And it's because
they're more
like vectors.
They're more like SIMD underneath.
Deep in the bowels of the machine, they're running on SIMD.
But that was shining through all the way up through the execution model.
And then these threads could deadlock when they were trying to execute a mutex
algorithm. So in Volta, we
looked
at that, and
then we wondered,
can we build a different kind of machine?
Can we build a machine that
has the same efficiency at the bottom,
but runs
all C++ concurrent algorithms,
bar none,
at the top.
And that had not been done.
That had not been done before.
You can look around, and GPU-like threads have existed for quite some time.
In the 80s, I don't know, you probably don't know this, Pixar was a hardware company in the 80s.
And they built their own machines.
And they're sort of the first example you can point at of somebody building the same kind of threads that we've been building.
And across this 30-year time span, nobody had managed to solve this problem.
So anyway, so in Volta, we set out to do that because we wanted to be able to run basically anything a std thread can run.
That was the goal.
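As a concrete illustration of the kind of starvation-free algorithm being described, here is a hypothetical device-side spinlock sketch in CUDA C++; the names and the kernel are made up, and memory ordering is reduced to __threadfence() for brevity:

```cuda
#include <cstdio>

__device__ int lock_word = 0;   // 0 = unlocked, 1 = locked

__device__ void spin_lock(int* lock) {
    // Keep trying to flip 0 -> 1; a thread that loses the race spins.
    // On pre-Volta SIMT scheduling, a spinning thread could starve the
    // warp-mate holding the lock and this loop might never exit; Volta's
    // independent thread scheduling guarantees the holder eventually runs.
    while (atomicCAS(lock, 0, 1) != 0) { }
    __threadfence();            // acquire-like fence before the critical section
}

__device__ void spin_unlock(int* lock) {
    __threadfence();            // release-like fence before unlocking
    atomicExch(lock, 0);
}

__global__ void increment_counter(int* counter) {
    spin_lock(&lock_word);
    *counter += 1;              // critical section: serialized read-modify-write
    spin_unlock(&lock_word);
}

int main() {
    int* counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    increment_counter<<<1, 256>>>(counter);
    cudaDeviceSynchronize();
    std::printf("%d\n", *counter);   // expect 256 when the mutex makes progress
    cudaFree(counter);
}
```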
So I guess maybe we should just take a quick step back and describe exactly what Volta is.
Oh, it's a GPU.
Okay. Well, first of all, Volta is an architecture.
Volta is the name of the architecture.
It's a family of GPUs in some way.
You've only seen one
today.
We launched it this spring.
Actually, I was on stage at GTC
launching that in the spring.
So the first implementation of Volta is V100.
And V100 is a big chip.
It's a really big chip.
You guys just went right out the door with version 100?
Most people start at like 1.0.
Oh, well, actually, we have a naming scheme at this point.
I mean, you could go on Wikipedia.
It's pretty amazing how NVIDIA doesn't really communicate
internal product numbering,
but somehow Wikipedia always figures it out.
There must be a mole or something.
But in our current nomenclature,
we'll have a two-letter prefix, and then we'll have three digits. And 100 is the big one. Okay, 100 is the big one. And then,
you know, that third digit down at the bottom, you know, twos are just about as big,
fours are half of that,
and then sixes are half of that,
and then sevens are half of that.
GPUs have a big dial
on how big or small
you make them.
We'll have in the same family
something with
80 SM cores, and then
we'll do one with 40,
and then one with 20,
and then one with 10,
and maybe one with four.
It's a big range.
So it's a GPU architecture
that's just been released, basically.
That's right.
Yeah, that's right.
Sorry, I didn't mean to cut you off
if you were going to go into more detail there.
No, no, actually, that's a good idea.
You should try to help me be concise.
So this is not something that's specifically designed for computation.
It is something that we could buy in our desktop graphics adapter
next time I go to upgrade my machine or something.
Right, so at some point you'll be able to.
Okay.
At some point you'll be able to.
Right now V100 is going out in big HPC boxes.
Okay.
V100 is what the Department of Energy bought for Summit and Sierra,
two big supercomputers coming online.
Where are those two going?
Do you know?
Oh, yeah.
Summit is going to Oak Ridge National Lab.
Oh, okay.
And Sierra is going to Lawrence Livermore.
All right.
Yeah.
So these machines are going to be working on both open problems, you know, like climate change research, modeling of aerodynamics of things and, you know, combustion efficiencies and engines.
And they'll be working on galaxy collisions and things like that.
So big machines. Cool stuff.
So V100 is going in there
right now.
And it's also going into a lot of deep
learning systems
that are
going
out, already out, soon going
out. I'm not sure what the right
timing is.
So from the
C++ programmers perspective
now, how
does programming Volta look different
than programming with CUDA?
Oh,
so CUDA
is your vehicle. I mean, I think
I might even say,
expand that term a little bit and call it CUDA C++.
CUDA C++ is a C++ product, similar to how Visual C++ is a C++ product.
Okay.
Now, CUDA C++, and I won't hide it, is the least conforming of all the C++ products out there.
You know, probably Visual Studio 6.0
was more conforming than we are at the moment.
But it's actually, you know, now that I've said that,
I think you'd be surprised just how conforming it is.
On the language side, there's very little that you can't use on the device.
In fact, you know, in the last month, I've written a ton of CUDA, way more than I've written recently.
You know, I've been waiting to get my hands on a V100 to write a lot of CUDA code.
So I finally did.
I finally got my own.
Sorry, I get them ahead of you.
And I've just been typing.
I have been typing for two months, basically. And my own surprise was
I have not had to think about,
oh, can I use this language feature?
I've been writing in C++14,
in CUDA C++,
and fundamentally,
the only thing where it's obvious to me that this isn't your regular conforming C++ implementation is that you need to prefix all your functions with the __device__ keyword.
Okay.
That's pretty much it.
Interesting.
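For a sense of what that looks like, here's a minimal, made-up CUDA C++ sketch; everything in it is ordinary C++ except the __device__ and __global__ annotations and the kernel launch:

```cuda
#include <cstdio>

__device__ int square(int x) {        // __device__: callable from GPU code
    return x * x;
}

__global__ void kernel(int* out) {    // __global__: a kernel entry point
    out[threadIdx.x] = square(static_cast<int>(threadIdx.x));
}

int main() {
    int* out;
    cudaMallocManaged(&out, 32 * sizeof(int));   // unified (managed) memory
    kernel<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    std::printf("%d\n", out[31]);                // prints 961
    cudaFree(out);
}
```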
So is CUDA C++, did that come out with Volta or has that been around longer?
It's been around for 10 years, man.
Oh, okay.
Mind the thing.
10 years, yeah.
And you don't need to clone a GitHub repo and make it yourself.
You can download it on our website.
Okay.
But the capabilities have changed with Volta.
Right.
We have this concept of compute capability level.
Okay.
And Volta is compute capability level 7.
And so what that means is it can use everything that was in 6,
which includes the unified address space with the illusion of coherence, at least.
And it adds to that two really key things
for C++ developers.
Number one, it adds conforming primitives
that are compatible with the C++ memory model.
And second, the forward progress guarantees,
which allow you to write just any concurrent algorithm you want.
Okay.
I wanted to interrupt this discussion for just a moment
to bring you a word from our sponsors.
Backtrace is a debugging platform that improves software quality,
reliability, and support by bringing deep introspection
and automation throughout the software error lifecycle. Spend less time
debugging and reduce your mean time to resolution by using the first and only platform to combine
symbolic debugging, error aggregation, and state analysis. At the time of error,
Backtrace jumps into action, capturing detailed dumps of application and environmental state.
Backtrace then performs automated analysis on process memory
and executable code to classify errors and highlight important signals such as heap corruption,
malware, and much more. This data is aggregated and archived in a centralized object store,
providing your team a single system to investigate errors across your environments.
Join industry leaders like Fastly, Message Systems, and AppNexus that use Backtrace to
modernize their debugging infrastructure.
It's free to try, minutes to set up,
fully featured with no commitment necessary.
Check them out at backtrace.io slash cppcast.
So what was your role during the design of Volta?
So I was one of the leads in the SM core.
The SM core is the execution core inside of the GPU.
And for the most part, I focused on these two things here.
I focused on the memory model and the execution model.
I have a fairly long history of working on the execution side.
Seven years ago, I wrote a prototype in microcode for Kepler GPU that had this forward progress guarantee.
It was incredibly slow, but it sort of proved out the idea.
You know, people were not thinking about it as much.
People were thinking that this was just how it was.
GPU threads couldn't run concurrent algorithms because obviously.
And then you'd approach them and you'd say, well, we can work on that.
And they're like, oh, yeah, but the performance would never be good.
But the area would be too big, it would be too many transistors, or it would burn
too much power. And so we had to chip away at that until eventually everybody said, well,
it's just plain better, so let's do it. Yeah, so seven years ago I did that first prototype
that showed that it was doable,
and then I followed through.
So would it be fair to say
that you specifically designed Volta
with C++ in mind to be able to target it?
Oh yeah, absolutely.
I quoted paragraphs
out of the C++ standard in specs and in internal communication. You know, the definition of forward
progress in C++ is rather well crafted, actually.
And it's even better crafted in C++17, the SG1 PSA here.
We've done a lot of work to add different kinds of forward progress guarantees and clarify the language.
But even back in C++ 11, there was useful stuff in the spec, in particular the definition of what is a visible machine step.
That was really key.
So I quoted that out of what used to be, you know, Clause 1.
So there was that. And then the second thing is, on the memory model side, we also stole as many good ideas as we could out of C++.
I have a lot more to say about that, if you want to drill in.
Sure, go right ahead. So on the memory model side, I think C++11 was one of the biggest gifts ever given to processor designers, honestly.
Interesting.
Before that, the state of the art was everybody used volatile incorrectly.
Well, right.
And approximately no one
can correctly describe the semantics of
volatile, and in part it's because,
caveat, it doesn't have any semantics.
It's, you know, it's
crack open your CPU
manual, and then the semantics are
whatever that other thing says.
And so people would come to us
And so people would come to us
with code that
they had written on x86,
say, and they'd put volatile in
there and
then they'd say, well, we expect you to run this.
And
as a processor designer,
this is awful.
You're basically asking me
to be bug compatible with x86.
You know, like,
if you use volatile,
then your code
rests directly on the metal.
And then the semantics of your code
are whatever the semantics of the metal are,
bugs included.
And since concurrent programs
are very difficult to debug
and the bugs in concurrent programs
can be latent for a very long time
they just don't activate
then if there's a bug in your code
that x86 did not activate
because it has an extremely strong
memory model
then you coming to me and saying,
this code with volatile needs to work,
is basically saying, I need to build an x86 now,
which I'm not allowed to do.
So then C++11 comes along,
and now there's a memory model, like a spec you can read.
And it's a relatively portable spec.
And there's some really good ideas in there.
You know, the C++11 memory model is based on work by Sarita Adve.
You know, a paper she wrote in, I think, 1990 introduced a model called DRF0.
And the DRF0 model says if you can split all your memory accesses between synchronizing and non-synchronizing, then you make the synchronizing ones sequentially consistent.
And she can prove that there are no non-sequentially consistent executions left.
But here's the good news.
The synchronizing accesses are like 1% and the non-synchronizing are like 99%.
And the non-synchronizing accesses,
they don't have any expectation of memory consistency between threads.
Right.
Any expectation at all.
In fact, the non-synchronizing accesses don't even need coherence.
Now, think about that.
CPUs don't know the difference between these two things today.
Like when you JIT or compile to x86, you do a bunch of reads and writes.
x86 assumes every last one of them is synchronizing.
Okay.
Even though 99% of them are not.
And the 99% that are not don't even value the coherence protocol that x86 is running.
Okay. So 99% of the effort being expended by a cache-coherent CPU design is not providing any value whatsoever.
It is warming your room ever so slightly and occasionally depressing your performance.
Because it needs to assume everything is always synchronizing.
Okay, so that's one of the big things we did. Oh my god, we can take this distinction between
atomic and non-atomic in C++,
matching it to synchronizing
and non-synchronizing in DRF0.
We can take this information, this high-level information
from the program and give it
straight to the silicon
not only carry it through compiler
optimizations but
actually give it to the silicon
and then the silicon
only needs to make
the atomic accesses
really appear coherent.
The non-synchronizing accesses
don't need any coherence whatsoever
up until
the
acquires and releases.
But then at that point, on an
acquire and a release,
you know you can put a fence there. And so you can recover coherence at that point.
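Here is a small, made-up C++ sketch of the split being described: the plain accesses to data are the 99% that need no coherence on their own, and only the atomic flag is a synchronizing access:

```cpp
#include <atomic>
#include <thread>

int data[3];                        // non-synchronizing: plain accesses
std::atomic<bool> ready{false};     // synchronizing: atomic access

void producer() {
    data[0] = 1; data[1] = 2; data[2] = 3;          // no coherence needed yet
    ready.store(true, std::memory_order_release);   // release: publish the writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }   // acquire: observe the flag
    int sum = data[0] + data[1] + data[2];   // plain reads now guaranteed visible
    (void)sum;
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```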
It's really interesting. And I'm trying to process it all because if you're doing
multi-threaded programming correctly, you can never assume that the threads are synchronized,
as you're saying, right? Or the data between them is synchronized,
unless you have some sort of a fence or an atomic
or something that you check on, right?
Right.
So you don't have to do any of the coherency
until you see one of those fences that would force that to happen.
Exactly.
Exactly. So do you anticipate,
not that this is necessarily
your expertise, but that
general purpose
CPUs will take advantage
of this more at some point?
It's really
tough for them. It's really tough
for them because
they have spilled their specs all over the interwebs. And they have backwards compatibility requirements placed on them.
Right. That makes it so, you know, design decisions they made in the 80s constrain what they can do in the 2010s and 2020s.
Very much so.
You know, and x86 must run x86 code, because the day that an x86 stops running x86 code, they lose their stickiness. Right, right, right.
So that is massively important to them. Other processor architectures get to occasionally make
a break, but very occasionally. So, you know, even in the case of ARM, you know, we like
to think that ARM is this new upstart somehow, but they are totally not.
They're from the 80s also, yeah.
They are also from the 80s, yes.
So it's tough for other processor teams to be able to exploit this.
Yeah, and one more thought.
So I didn't mention IBM.
IBM Power is just an awesome, badass product
that unfortunately too few people get to experience.
But they have huge reliability requirements placed on them.
The power system needs to be resilient in the face of the kind of
errors that will make your MacBook Pro just, you know, crash. You know, your laptop will just
panic and reboot, and you'll go like, oh, my computer crashed, it happens, right? That does not happen to an IBM Power, basically. It cannot happen.
So they have reliability features in there that force them to be clean of any possible memory error before they can execute past a certain point.
Okay.
And that affects how strong their memory model can be.
Now, I use the words strong and Power together, which will make some people glitch.
But the IBM Power has a stronger memory model than my machine.
That's right.
I don't know enough about it to nitpick anything that you just said,
so you're good here.
So we've talked a lot about what you can do with C++ and Volta. Are there any limitations worth pointing out?
Yeah, so, right, so we're not done. So I said a little while ago that
CUDA C++ is both surprisingly conforming and also the least conforming implementation. You can hold both in your head at once.
Now, like what's remaining in our conformance?
Some of those things look more durable than others.
You know, you can imagine.
So with Volta, we tried to make significant progress on the top things.
We're not done.
So we're still thinking about this.
We're still working on these things.
Yeah, so the things that are easy or hard.
Well, concurrent with development on Volta, there was development on C++17. And in C++17, we
clarified the nature of forward progress
to include
three different progress guarantees.
There are three different levels, tiers,
of progress guarantees in C++17.
There's the concurrent
forward progress where
you
launch a thread and
you know from the time
that the constructor of the thread returns
that that other thread
shall be making forward progress right now
and it will never stop.
So you can launch a thread
and then wait for it to write to an atomic,
for instance.
That's the concurrent forward progress.
The parallel forward progress
is slightly less than that,
where you know that if it has started to make progress, then it will continue.
But you don't know when it will start unless you block on it. And if you block on it,
then you know it will start. And then weekly parallel is basically no guarantee of independent forward progress.
So that's where that matches with par and seek.
And that's where you can't write a mutex in that, for example.
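A tiny made-up C++ sketch of the strongest tier, the concurrent forward progress just described: once the std::thread constructor returns, the spin loop below is guaranteed to terminate:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> flag{false};

int main() {
    // Concurrent forward progress: the new thread is guaranteed to keep
    // making progress from the moment the constructor returns, so waiting
    // on the atomic here cannot hang.
    std::thread t([] { flag.store(true, std::memory_order_release); });
    while (!flag.load(std::memory_order_acquire)) { }
    t.join();
}
```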
Okay, so concurrent with Volta, there was also a lot of clarification going into C++.
I think that that's going to continue, basically.
We're going to build another architecture.
Spoilers!
We're going to build another architecture,
and we're going to fix some things
that are in our conformance issues.
But then we're also probably going to think about
changes to C++20
that will make it much easier to implement. So, alright, what's in that list?
Thread local storage
is a big one.
You know, back
to this
emulating 80s computer thing,
you know,
threads in C++ are assumed to be fairly
heavyweight.
C++ is
pretty well designed for a machine where the number of threads is like 12.
Okay.
Yeah, 12 or I guess we'll go to 60 or something.
It's pretty well designed for that. But some features like thread local storage are
expensive
for systems that have
lots of short-lived threads.
Volta has
163,840
threads.
Wow.
All of which can run,
all of which can execute
anything a std thread can execute.
But except for TLS.
But
it sounds like you were saying that
virtually all the variables are thread
local anyhow because it's not going to
synchronize them unless it has a reason to.
Oh,
no.
If I were to, am I
being too pedantic or something?
Yeah, yeah, yeah.
So Volta threads can dereference any pointer anywhere in memory.
So you can imagine you build a box and you put 128 gigs of RAM on the CPU and then you slot a Volta in there,
and it has its own 16 gigs
on it.
Volta threads can
dereference any of this, all of this.
And they can synchronize.
So you could use
a moral equivalent of
std::atomic, or hopefully very soon now,
actually std::atomic.
And when they synchronize, they have a view of memory that's compatible with the C++ memory model. Okay.
So you just, you know, you write code as normal.
Okay, but one thing that they don't currently have is access
to thread local storage. So if you declare some stuff
that's thread_local,
then only your CPU
threads will get that right now.
Now you can take a pointer to these
things and then share it
out in memory. Thread local doesn't
preclude sharing, it's just
private by default, but if you take
the address of it, you can share it.
And then GPU
threads will be able to see it and
access it just fine. It's just that
using the
name of the variable,
accessing it by its name,
will not work on
the GPU threads right now.
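A small, made-up C++ sketch of that distinction: the owner's thread_local object is shared by publishing its address, while access by name on another thread would name a different object:

```cpp
#include <atomic>
#include <thread>

thread_local int tls_value = 0;           // private by default
std::atomic<int*> published{nullptr};     // publishes the owner's address
std::atomic<bool> done{false};

void owner() {
    tls_value = 42;
    published.store(&tls_value, std::memory_order_release);   // share by address
    while (!done.load(std::memory_order_acquire)) { }         // keep the object alive
}

void reader() {
    int* p = nullptr;
    while ((p = published.load(std::memory_order_acquire)) == nullptr) { }
    int seen = *p;     // reads the owner's thread_local through the pointer
    (void)seen;        // reading tls_value *by name* here would touch this
                       // thread's own, separate instance instead
    done.store(true, std::memory_order_release);
}

int main() {
    std::thread a(owner), b(reader);
    a.join();
    b.join();
}
```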
I am not sure which way this one
will land.
The lack of conforming thread_local is not unique to us.
In fact, if you look at even CPU parallel programming systems,
they have a similar issue.
I think one of the poster children is actually
Cilk
from Intel.
Cilk, I'm going to say something good about Intel.
Cilk is really
impressive work. It's really nice.
It's fairly well reasoned.
Now, in Cilk,
when you spawn
children
and you block on them,
the parent can be suspended and passed by continuation passing.
And the children can steal the parent continuation.
Which means that a child running on another thread
can steal the continuation of the parent. That came from, you know,
the parent on thread one spawns some children, blocks on them,
children run in thread two,
and then when the children in thread two are done,
they steal the continuation of the parent,
which now resumes execution on thread two.
Your thread local just changed.
You used to reference an object.
Now it's a different object.
What does that mean?
Like, what does that mean?
You could write a conforming C++ program
that is highly dependent
on all your thread local accesses
being to the same object.
Yeah.
Because maybe you put some important data in there
and you're going to get it back later.
And now you're on another thread, and it's not the same one.
So this TLS issue is, I swear, SG1 spends one day per meeting talking about that.
So we have to do something about it the definition that's there right now
is very strong
basically only
works with std threads
and the thread that runs main
the thread that shall not be named
that runs main
and
as we move into
more and more parallelism,
we need to clarify what thread local means.
Even if you ignore Cilk and you stay in C++17,
consider std par and std par and par and seek.
If I run a parallel algorithm with std par,
then more than one of my lambdas will get the same TLS.
But they'll get it in sequence.
But they'll get the same TLS,
which means that they can kind of...
It means you don't get fresh TLS every time.
Did you expect fresh TLS?
Did you expect that it would get reinitialized?
It's not going to be reinitialized.
Par unseq,
multiple lambdas
effectively share the same
TLS. Because
they might run in vectorized
SIMD lanes.
And when they read and write the TLS
object, they're reading and writing the same one.
Right.
So they kind of have a data race.
They could
stomp on each other there.
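A made-up C++17 sketch of that hazard: with par the same thread_local is reused across invocations on a worker thread, and with par_unseq touching it from the lambda would effectively be a race on one object:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

thread_local int scratch = 0;   // per-thread scratch, not reinitialized per lambda

int main() {
    std::vector<int> v(1000, 1);

    // par: several invocations land on the same worker thread and see the
    // same `scratch` object, one after another.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](int& x) { scratch += x; });

    // par_unseq: invocations may be interleaved in SIMD lanes on one thread,
    // so the same access pattern would effectively stomp on a single object:
    // std::for_each(std::execution::par_unseq, v.begin(), v.end(),
    //               [](int& x) { scratch += x; });
}
```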
So I think
you have to expect that in C++20 we're going to write
something about TLS and there's
maybe similar to what we did with the
forward progress guarantees, there's going to be more than one
tier.
There's going to be
has full-fat TLS, that'll be the
thread thing, and then has potentially shared TLS or potentially identity changing TLS.
And that will be maybe the std par thing.
And then has no access to TLS.
Might be like the par_unseq thing.
Wow.
And so what it will mean for our future architectures here at NVIDIA is we'll have to decide which
tier we want to go for.
In Volta and C++17, which were designed
concurrently,
which no one except people
from NVIDIA knew,
which were being developed concurrently,
we decided
to go for the stdpar
tier, even though everyone
in the world expected us to build the
par_unseq tier.
Okay, so what are we going to do with TLS?
I don't know.
Yeah, well,
I have to say, I don't know.
So, you know,
that's one challenge, and then
there's a similar challenge with
sharing pointers to automatics.
I just said, you know,
thread local is private by default, but it
can be shared if you take a pointer to it.
The same thing is true
of automatics.
If you have a local variable,
you can take the address of it and somebody else can
read and write it. As long as you, you know,
and by the way,
be careful because it's easy to run afoul
of the memory model when you do that.
But assuming you dotted all the i's, then you can do that.
Well, you can't do that on the GPU right now.
And that one's been really tough to think about.
We have a super efficient implementation of stacks.
And it's kind of incompatible with addressing.
So, you know, we emulate addressing for the thread that owns the stack,
that particular stack.
But it's tough to emulate it for all threads in the system. So this is a long answer to your question of what incompatibilities remain.
Those are some big ones.
Everything else probably that you can think of, I haven't said the word exceptions.
I think too much ink is spilled on our lack of support
for exceptions. I don't think it's
a durable problem. I think it's a business
question. Somebody needs to say
we want to buy GPUs
and we'll only buy GPUs if they have
exceptions. And then
you have to say it with a straight face and not
burst out laughing in the middle of that sentence.
And then it can be solved. It can be worked on.
There's no reason GPUs can't do
C++ exceptions. It's just control flow.
Right.
Okay. Well, it's been great having you on today,
Olivier.
Oh, thank you.
It's been great.
I'm trying to get the word out everywhere that I can about the C++ stuff we did in Volta.
You'll see me at CppCon.
I'm working on blog posts and other things to put out that communicate straight to the users.
Where should listeners look for that?
On the NVIDIA blog or on your personal blog?
No, on the NVIDIA blog.
We have a blog with a great name called Parallel for All.
Although you might say it should really be Parallel for Each.
But it is currently called Parallel for All.
And that blog, in particular, has posts that are relevant to programmers,
and in general, most of the articles are aimed at C++ programmers.
Okay. Thanks so much for your time today. Yeah, thanks for joining us.
Thank you.
Okay. Thanks so much for listening in as we chat about C++.
I'd love to hear what you think of the podcast. Please let me know if we're discussing the stuff
you're interested in. Or if you have a suggestion for a topic, I'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. I'd also appreciate if you like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at
Rob W. Irving and Jason at Lefticus on Twitter. And of course, you can find all that info and
the show notes on the podcast website at cppcast.com. Theme music for this episode
is provided by podcastthemes.com.