CppCast - Volta and Cuda C++
Episode Date: September 1, 2017
Rob and Jason are joined by Olivier Giroux from NVIDIA to talk about programming for the Volta GPU. Olivier Giroux has worked on eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the forms and semantics of valid GPU programs, present and future. He was the programming model lead for the new NVIDIA Volta architecture. He is a member of WG21, the ISO C++ committee, and is a passionate contributor to C++'s forward progress guarantees and memory model.
News: Visual C++ for Linux Development with CMake; Sourcetrail 2017.3 released - cross platform source explorer; Call for CppCon Lightning Talks and Open Content; C++17 STL Cookbook Book Review
Olivier Giroux: @simt
Links: CppCon: Designing C++ Hardware; Inside Volta: The World's Most Advanced Data Center GPU; Inside Volta Slidedeck; NVIDIA Dev Blog
Sponsors: Backtrace; JetBrains
Hosts: @robwirving @lefticus
Transcript
This episode of CppCast is sponsored by Backtrace, the turnkey debugging platform that helps you
spend less time debugging and more time building. Get to the root cause quickly with detailed
information at your fingertips. Start your free trial at backtrace.io slash cppcast.
And by JetBrains, maker of intelligent development tools to simplify your challenging tasks and
automate the routine ones. JetBrains is offering a 25% discount for an
individual license on the C++ tool of your choice, CLion, ReSharper, C++, or AppCode.
Use the coupon code JetBrains for CppCast during checkout at JetBrains.com.
Episode 116 of CppCast with guest Olivier Giroux, recorded August 31st, 2017.
In this episode, we talk about CppCon open content and STL cookbooks.
Then we talk to Olivier Giroux from NVIDIA.
Olivier talks to us about his work on the Volta architecture.
Welcome to episode 116 of CppCast, the only podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
Good, Rob. So we can keep saying that despite Jens' comments last week?
Well, we haven't heard of any new podcast
yet, but yeah, we do encourage
if you're out there and have any interest in starting a podcast,
there is room for more in the
C++ community.
Yeah, we might have to change it to say someday
the only English language one or something.
We'll see what happens.
Yeah, I'll see what happens.
You know anyone down in Texas?
Everyone okay with Harvey that you might know, Jason?
I don't know anyone directly who was affected.
So I honestly haven't been watching the news a whole lot because I've been busy getting ready for conference prep stuff.
Yeah, I don't think I know anyone personally.
But if any of our listeners are down there, I hope you're okay with
the storm.
Well, at the top of every
episode, I'd like to read a piece of
feedback this week.
I'm going to butcher this name. It's
coming from a listener named
Wojcik, maybe?
He writes
in, thank you guys for such a great podcast.
For me, the podcast is not only C++ news and libraries but also a passion. Because of your show, I don't only write code, but I also started to love C++.
Thank you guys. I hope you will never stop recording C++ podcasts, and also a separate thank you to Jason for the great C++ Weekly.
So yeah, thank you for the feedback, and I'm sorry for horribly mispronouncing your name.
Yeah, thanks. That was great.
Yeah, and we'd love to hear your thoughts about
the show as well. You can always reach out to us
on Facebook, Twitter, or
email us at feedback at cppcast.com,
and don't forget to leave us a review on iTunes.
Joining us today
is Olivier Giroux. Olivier
has worked on eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the forms and semantics of valid GPU programs, present and future. He was the programming model lead for the new NVIDIA Volta architecture. He is a member of WG21, the ISO C++ committee, and is a passionate contributor to C++'s forward progress guarantees and memory model.
Olivier, welcome to the show.
Thank you. I'm glad to be here.
How did you get started working on GPU stuff, Olivier?
Oh, I had an internship with Microsoft one time, and then I had just come back from that and NVIDIA's lead architect came and
visited my school and
I'm one of the few people he
interviewed there on site and then he gave me another
internship and then I just never left.
It's
basically been 15 years now.
So I
like it.
Wow.
That's pretty cool.
Yeah, well, we got... oh, go ahead.
Oh, I was just going to say, when I joined back then, all the interviews were, they were all graphics.
NVIDIA didn't do anything other than graphics. I remember doing vector algebra and projecting, you know,
primitives, projecting triangles onto planes
and doing intersections of surfaces and other things.
And it was tough.
It was really tough.
You know, it was like an eight-hour thing back then.
Wow.
Yeah, I would be out of my league.
My 3D geometry skills are not up to that, for sure.
Well, Olivier, we've got a couple of news articles to discuss, and we'll start talking about the work you've been doing recently at NVIDIA, okay?
Yep.
Okay, so this first one is an announcement from the Visual C++ blog. Their 15.4 pre-release is out, and in it there's going to be support for Linux development with CMake, which should be
pretty exciting. You're already able to do Linux development from Visual Studio, but you had to
create a Visual Studio project for it, which is kind of weird.
But now the CMake support that they had
for Windows is being extended to
support Linux development as well.
It's not
explicitly mentioned in here, but
it talks about setting up
your remote Linux target
with a connection manager.
And it doesn't explicitly say
how easy or hard it is
to do that with the Linux subsystem for Windows.
I believe it can be done, though.
If you have a Linux subsystem, you can use your own Linux,
your own Windows box as a Linux server, yeah.
But yeah, they don't go into that explicitly.
I was thinking somewhat cynically before our episode here
that I love Linux, I use it for most of my development,
I run Windows everywhere.
My real problem is when I need to port to macOS.
I need a Mac subsystem for Windows
that I can just easily connect to with Visual Studio or whatever,
and then I don't have to ever boot my Mac box.
I am sure Windows or Microsoft would be happy to do that, but I think Apple would sue them
if they ever tried to.
Is this anything that you work with at all, Olivier, Visual Studio in your work or in these
tools?
Oh, yeah.
Oh, yeah.
I've actually been a bit of a black sheep in the architecture team here for always insisting to work on Visual Studio.
Visual Studio pretty much got me started with C++, actually.
And in fact, it's funny because you're talking about Linux builds from Visual Studio.
And that's been my day-to-day experience at NVIDIA for 10 years now
because 10 years ago,
we wrote a tool that ingests VC project files
and then spits out a make file.
Oh, wow.
And so we...
Actually, for the NVIDIA architecture team,
many of our builds,
those that are not
silicon builds, obviously, but
most of our C++ builds are
enshrined in VC project files as their
canonical build
specification. And then
obviously that works natively
on Windows, and then we use our tool to
convert to make files and make
on Linux.
And yeah, so that's been my normal thing,
moving between Windows and Linux, you know, 10 times a day.
So have you guys considered moving that stuff to CMake or Meson
or any of the other tools that would automatically generate
the various build targets for you?
Well, the thing that... One thing that stopped us from doing that
in the past is we really wanted
a native experience
in the
WYSIWYG part of
Visual Studio.
Tweaking and building in Visual Studio is
really easy.
Right-click on this project, add an option here,
click OK.
And it's very self-discoverable.
You don't know what the options are a priori?
Fine.
Open that properties dialog,
and then see what the options are there.
And then click on the combo box
and see what the values for this thing are.
And in comparison, most of the other build tools
assume you're already a build ninja
and you know everything you want to do.
And that's not...
For many people who are here
doing GPU architecture,
that's not their bread and butter.
That's not what they do.
And so they kind of want
a build system that gets out of the way and is easy to use.
So that just fit the bill at the time.
Now, it's not to say that, you know, this is just like VI and Emacs.
You know, it's not to say that this hasn't been a subjective hot debate for that 10-year period, basically.
I'm sure if you were starting from scratch today, you probably wouldn't roll your own solution, maybe.
Probably not, no.
And to be fair,
I was just working with
Visual Studio CMake integration, and
getting started out of the box
with a CMake build folder is
just trivial, it just works.
But as soon as you want to change any of the configuration options, it seems
to be very non-obvious how to do that. They have not integrated it with their
GUI or anything yet, but their CMake integration is definitely getting better.
Okay, this next article we have
is an update to SourceTrail, which is some software we talked about a while back.
How do they describe this again, Jason? It's more of a source explorer. It's not an IDE, right?
It is not an IDE. No, I actually have played with it a few times since they first mentioned it to us, and then with just this release also. It looks like the main new feature of this release
is interactive tooltips,
where if you're hovering over something,
it gives you the full signature of that function or method
or whatever it is you're looking at.
Right.
It's a pretty darn neat tool.
Did you play with it at all, Rob?
I don't think I have yet.
Okay.
It gives you this great flow chart of what your code,
what functions are calling other functions,
and you can click on it and expand,
and it shows you the source code and does mouseovers
and hyperlinking between everything.
It does a really good job of it,
but I think more interesting than just the fact that you can do that
is it also shows you what templates are instantiated in many cases.
And then you can get slightly more detail there about the template stuff that's happening, which I like.
Because I tried browsing with ChaiScript, which has many, many template instantiations.
And it does a pretty good job of keeping up.
Right.
Yeah.
And this next one is calls for lightning talks and open content at CPPCon.
And I'm not sure if we've mentioned this before,
but this includes an announcement at the bottom of the post saying that all content is going to be free on Friday of CPPCon.
That's including the plenary speaker, Matt Godbolt, and all the other sessions that day.
Oh, I didn't even see that. I didn't even realize that was in this announcement. Yeah.
So yeah, if you're in the Seattle area and weren't planning on going to CppCon already, but
you know, you want to go check out some of the sessions, you can go and do so for free on Friday.
It'll be really interesting for those of us who are there for the whole week to see just how much busier it gets on Friday, or whatever happens with that.
I mean, there's plenty of, you know, tech companies in the area employing C++ developers.
So you'd think there'd be some people interested in that.
You know, you got Microsoft and Amazon right there.
Right.
But yeah, in addition to that, lightning talks and open content,
they're now looking for submissions for that at CppCon.
And the open content,
those are sessions during lunch and breakfast, I guess?
I don't recall that from last year.
Yes, yeah.
They're at various times during the day,
not normal track sessions.
And it does explicitly say again this year, open content does not require conference registration.
So anyone can present, anyone can attend open content sessions.
Right.
And then lightning talks are five minute talks that are usually in the evenings.
I think they did that for two or three nights last year.
I'm not sure if that's the same plan for this year or not.
Right. You know, I will say two years ago,
I gave an open content session for the fun of it when I was also giving a
regular session that year.
And it is so much less stress because you know,
the camera's not recording.
And so if,
if you wanted to try your hand at speaking at a conference,
you didn't get a submission in for CBPCon,
try an open content thing.
There's less pressure to get accepted,
less pressure when you're actually up there talking.
Okay, and then this last article,
C++17 STL Cookbook Book Review.
And this is a new book that has just been released from, I'm going to butcher
this name too, Jacek Galowicz, I think. And it's over 90 recipes that leverage the powerful
features of the standard library and C++17. And the review is making it sound like it's a pretty
good read, good book. And if you are interested, the author of this
blog has managed to get, I think, five copies of the book that he's going to give away. If you're
interested, you can register at the bottom of this post for that. All right, did you look
at this one also, Olivier? Any comments? Yeah, yeah, I looked at that. Obviously, I zoomed straight to item 9, the SG1, my bread and butter.
I might slightly prefer if people used std::async a little bit less.
That's probably the SG1 PSA there.
We're still working on that.
But yeah, seeing the parallel STL
come to life, that's really
exciting.
We worked on...
There was a lot of
new standardese
added there. We worked hard on that.
I am still looking forward
to being able to actually try that in one
of the standard library implementations, if you will.
Last I checked, neither GCC nor Clang was able to do parallel algorithms yet, anyhow.
Surely coming soon, right? You would think, yeah. Well, I've noticed, I mean, I've noticed a lot of
the people who work on that, you know, making noises through reflectors or IRC channels or sending email to one another.
So asking, how did that work?
How did you manage to parallelize that?
And what's the performance like and all that?
So it's clearly being worked on.
Yeah, I've seen a handful of tweets from the Visual Studio team in that regard also
with a couple of comments that there's a couple of algorithms that no matter what they did,
the parallel version was slower than the linear version,
so they were going to recommend against using those ones.
I don't recall what the details were, though, unfortunately. It's perfectly valid to, you know...
std par communicates information from the programmer
to the implementation that, were the lambdas
passed to this algorithm invoked in parallel,
they would not introduce data races.
And then it's up to the implementation
to decide what to do with that,
and running it sequentially is perfectly valid.
There's no rule that says,
if you pass in std par,
it shall run in multiple threads.
Okay.
That's not in the standard.
This is, 'please, I think that it would be a good idea,'
and the library implementation is free to say,
'no, it really wouldn't be a good idea.'
The way that I see it
is you're communicating
semantic information. You're saying it's
possible. It is possible, yes.
The program would not
become undefined if you parallelize this.
Okay.
Which is the case for a normal algorithm, a normal sequential algorithm.
The lambda can do anything.
The lambda is allowed to read and write any C++ object.
And the memory model says if you have multiple threads and more than one is writing, or one is writing and another one is reading, then you have a data race and your program is undefined.
So the implementation cannot just go in and say, I think I'm going to run this on the thread pool now.
It would have to prove the absence of aliasing between the references of different invocations of the lambda.
And we all know that alias analysis is hard, and it's particularly hard in C++.
Right.
So this sort of automatic parallelization doesn't happen.
But once the user tells the implementation that there would not be any data race in the case of par.
And in the case of par_unseq, you're saying further,
not only would there not be any data races,
but this code does not assume forward progress guarantee beyond lock-free.
So then you can also vectorize it, also without any further proof.
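For illustration, here is a minimal C++17 sketch of passing those execution policies; the container, sizes, and lambdas are made up, and an implementation is free to run either call sequentially:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v(1'000'000, 1.0);

    // par: were these lambdas invoked in parallel, they would not introduce
    // data races; the implementation may still run them sequentially.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](double& x) { x = x * 2.0 + 1.0; });

    // par_unseq: additionally, the lambdas assume no forward progress
    // guarantee beyond lock-free, so they may also be vectorized.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](double x) { return x * x; });
}
```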
So from your standpoint, would you suggest that when C++17 is fully out there,
we all go and look at our usage of standard algorithms and mark them par or par_unseq where we know it's a correct and valid thing to do,
and then just let the implementation do what it wants with it?
Or should we only do that
if we're trying to optimize our use of the standard algorithms? So in general, my first
advice would be, you should always tell the implementation as much semantic information
as you know. And the implementation can be clever about that.
Especially as, and we're going to talk about this in a few minutes probably,
but especially as the variety of implementations increases,
you should not prejudge what the implementation might do with that semantic information. More information
helps implementation. Okay. Now, this said, before executors land, which is the case of
C++17, which does not have executors, before executors land, it's difficult for you to apply constraints on top of that to say,
well, here's the semantic information.
This would not introduce any data races.
But I don't want you to spin up 100 threads to run this
because I know that this would thrash my system
because something else is happening. So my first advice is give the implementation
as much semantic information as you have,
and then I just need to dither on that a little bit.
It might be that for some number of years,
doing that would be a de-optimization.
Okay.
And then in that case, as you optimize,
you may want to roll it back.
Leave a little comment in your code.
You know, 'TODO: C++20, use executors here.'
Having a hard time porting my stuff to C++17 right now,
it's hard to think about C++20 to-dos while I'm at it.
Okay, Olivier.
So your talk at CppCon
is titled Designing C++
Hardware, and your abstract says
you can run C++ on any computer
you want as long as it pretends it's an 80s
computer. What do you mean by that
exactly?
The first thing is I want to get your attention.
But
I have a real thesis here, which is that C and C++ and CPU designs are sort of in orbit of one another.
They're kind of a binary star system, and that really limits how much freedom people have to go and make radically different designs.
You know,
CPU engineers,
so, you know, people working on C++ look at CPU specs
and they say, oh, this is,
you know, we have this feature and that
feature, what can we do with that?
Well, I want to
peel back the curtain and tell you that there's
people on the other side
who are doing the same
thing in reverse. They're looking at the languages and they're like, oh well, but this language, it
needs this, and so we can't build our design that way. Then we need to make this other change, and we
need to add a layer of emulation on top of whatever we were truly intending to do, because if we don't have this layer of emulation,
then
C++ would keel over.
And
there are things that are not getting done.
There are things that are not done.
They're not built. There are things that people talk
about that don't get built
because it's too much risk.
It's too much risk to
depart from the, quote unquote, canonical design.
And so we'll talk about it and we'll say,
well, if we spend five, ten man years building this feature here
and then we really don't have much of a story
for how it could work with C++,
then it gets killed at some point.
I mean, legitimately.
Legitimately, we say, no, we can't do that.
We can't pin a billion-dollar design
on a feature that might not work with the language.
That's interesting.
So you mentioned C++ specifically,
but really this applies to, I guess, all compiled languages
are all kind of similar in what they can do, right?
Well, I think it may surprise you to hear that Fortran is better at this.
Right.
Well, in part because we don't have massive operating systems written in Fortran today,
so it's not too much of a systems language.
In the systems languages, the machine has sort of leaked up.
You know, the abstractions don't hide the machine very well.
And so you can see it better from C and C++, and that makes them a little bit more constraining.
Yeah, I mean, in theory, it does apply to all languages, but pointers of all kinds, function and data, certainly apply the most pressure on the design.
Okay. Well, then I guess that
you've been working on the NVIDIA Volta
and you're at least implying
that it doesn't pretend to be an 80s computer.
So how does it differ?
Actually, I would say
it has a very clever way
of kind of looking like an 80s computer
if you squint at it.
But it really is not internally. very clever way of kind of looking like an 80s computer if you squint at it is but it just it
really is not um internally so the the main thing was you know we we started back back when we when
we started volta um we here's the mandate that we started with um people think gpu programming is hard um you go on on google and
you start typing you know cuda programming and then the little autocomplete comes in and then
and then the autocomplete can be is hard is you know things like that and uh so um
so we said okay well we we we need to make a list of the biggest problems.
And we knew for a really long time, you know, the disjoint address spaces was a very big problem.
80s computers have flat address spaces.
You know, memory is all uniform.
It's all one pool.
It's all one pool.
You can reach all of it from any of your threads. So, you know, so we knew that and we've been working on that for years.
Actually, Pascal, the architecture that comes before Volta, moved the needle quite a bit
on what memory you can address. But then one of the next things down that we ran into was what can threads do?
Like what kind of code can threads run?
We've been talking about GPU threads.
I don't even have another word for it.
We've been talking about GPU threads being threads by definition.
But there was code that they couldn't run. Like, you could go on Stack Overflow
and search for writing a mutex.
And I'm using a mutex as just an example
of a concurrent algorithm.
If I put my SG1 hat on,
the issue here is the distinction
between lock-free programming and starvation-free
programming.
A mutex
is a starvation-free algorithm.
And GPU threads
had a really hard time running that.
And it's because
they're more
like vectors.
They're more like SIMD underneath.
Deep in the bowels of the machine, they're running on SIMD.
But that was shining through all the way up through the execution model.
And then these threads could deadlock when they were trying to execute a mutex
algorithm. So in Volta, we
looked
at that, and
then we wondered,
can we build a different kind of machine?
Can we build a machine that
has the same efficiency at the bottom,
but runs
all C++ concurrent algorithms,
bar none,
at the top.
And that had not been done.
That had not been done before.
You can look around, and GPU-like threads have existed for quite some time.
In the 80s, I don't know, you probably don't know this, Pixar was a hardware company in the 80s.
And they built their own machines.
And they're sort of the first example you can point at of somebody building the same kind of threads that we've been building.
And across this 30-year time span, nobody had managed to solve this problem.
So anyway, so in Volta, we set out to do that because we wanted to be able to run basically anything a std thread can run.
That was the goal.
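As a concrete illustration of the kind of starvation-free algorithm being described, here is a hypothetical device-side spinlock sketch in CUDA C++; the names and the kernel are made up, and memory ordering is reduced to __threadfence() for brevity:

```cuda
#include <cstdio>

__device__ int lock_word = 0;   // 0 = unlocked, 1 = locked

__device__ void spin_lock(int* lock) {
    // Keep trying to flip 0 -> 1; a thread that loses the race spins.
    // On pre-Volta SIMT scheduling, a spinning thread could starve the
    // warp-mate holding the lock and this loop might never exit; Volta's
    // independent thread scheduling guarantees the holder eventually runs.
    while (atomicCAS(lock, 0, 1) != 0) { }
    __threadfence();            // acquire-like fence before the critical section
}

__device__ void spin_unlock(int* lock) {
    __threadfence();            // release-like fence before unlocking
    atomicExch(lock, 0);
}

__global__ void increment_counter(int* counter) {
    spin_lock(&lock_word);
    *counter += 1;              // critical section: serialized read-modify-write
    spin_unlock(&lock_word);
}

int main() {
    int* counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    increment_counter<<<1, 256>>>(counter);
    cudaDeviceSynchronize();
    std::printf("%d\n", *counter);   // expect 256 when the mutex makes progress
    cudaFree(counter);
}
```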
So I guess maybe we should just take a quick step back and describe exactly what Volta is.
Oh, it's a GPU.
Okay. Well, first of all, Volta is an architecture.
Volta is the name of the architecture.
It's a family of GPUs in some way.
You've only seen one
today.
We launched it this spring.
Actually, I was on stage at GTC
launching that in the spring.
So the first implementation of Volta is V100.
And V100 is a big chip.
It's a really big chip.
You guys just went right out the door with version 100?
Most people start at like 1.0.
Oh, well, actually, we have a naming scheme at this point.
I mean, you could go on Wikipedia.
It's pretty amazing how NVIDIA doesn't really communicate
internal product numbering,
but somehow Wikipedia always figures it out.
There must be a mole or something.
But in our current nomenclature,
we'll have a two-letter prefix, and then we'll have three digits. And 100 is the big one. Okay, 100 is the big one. And then,
you know, that third digit down at the bottom, you know, twos are just about as big,
fours are half of that,
and then sixes are half of that,
and then sevens are half of that.
GPUs have a big dial
on how big or small
you make them.
We'll have in the same family
something with
80 SM cores, and then
we'll do one with 40,
and then one with 20,
and then one with 10,
and maybe one with four.
It's a big range.
So it's a GPU architecture
that's just been released, basically.
That's right.
Yeah, that's right.
Sorry, I didn't mean to cut you off
if you were going to go into more detail there.
No, no, actually, that's a good idea.
You should try to help me be concise.
So this is not something that's specifically designed for computation.
It is something that we could buy in our desktop graphics adapter
next time I go to upgrade my machine or something.
Right, so at some point you'll be able to.
Okay.
At some point you'll be able to.
Right now V100 is going out in big HPC boxes.
Okay.
V100 is what the Department of Energy bought for Summit and Sierra,
two big supercomputers coming online.
Where are those two going?
Do you know?
Oh, yeah.
Summit is going to Oak Ridge National Lab.
Oh, okay.
And Sierra is going to Lawrence Livermore.
All right.
Yeah.
So these machines are going to be working on both open problems, you know, like climate change research, modeling of aerodynamics of things and, you know, combustion efficiencies and engines.
And they'll be working on galaxy collisions and things like that.
So big machines. Cool stuff.
So V100 is going in there
right now.
And it's also going into a lot of deep
learning systems
that are
going
out, already out, soon going
out. I'm not sure what the right
timing is.
So from the
C++ programmers perspective
now, how
does programming Volta look different
than programming with CUDA?
Oh,
so CUDA
is your vehicle. I mean, I think
I might even say,
expand that term a little bit and call it CUDA C++.
CUDA C++ is a C++ product, similar to how Visual C++ is a C++ product.
Okay.
Now, CUDA C++, and I won't hide it, is the least conforming of all the C++ products out there.
You know, probably Visual Studio 6.0
was more conforming than we are at the moment.
But it's actually, you know, now that I've said that,
I think you'd be surprised just how conforming it is.
On the language side, there's very little that you can't use on the device.
In fact, you know, in the last month, I've written a ton of CUDA, way more than I've written recently.
You know, I've been waiting to get my hands on a V100 to write a lot of CUDA code.
So I finally did.
I finally got my own.
Sorry, I get them ahead of you.
And I've just been typing.
I have been typing for two months, basically. And my own surprise was
I have not had to think about,
oh, can I use this language feature?
I've been writing in C++14,
in CUDA C++,
and fundamentally,
the only thing where it's obvious to me that this isn't your regular conforming C++ implementation is that you need to prefix all your functions with the __device__ keyword.
Okay.
That's pretty much it.
Interesting.
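For a sense of what that looks like, here's a minimal, made-up CUDA C++ sketch; everything in it is ordinary C++ except the __device__ and __global__ annotations and the kernel launch:

```cuda
#include <cstdio>

__device__ int square(int x) {        // __device__: callable from GPU code
    return x * x;
}

__global__ void kernel(int* out) {    // __global__: a kernel entry point
    out[threadIdx.x] = square(static_cast<int>(threadIdx.x));
}

int main() {
    int* out;
    cudaMallocManaged(&out, 32 * sizeof(int));   // unified (managed) memory
    kernel<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    std::printf("%d\n", out[31]);                // prints 961
    cudaFree(out);
}
```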
So is CUDA C++, did that come out with Volta or has that been around longer?
It's been around for 10 years, man.
Oh, okay.
Mind the thing.
10 years, yeah.
And you don't need to clone a GitHub repo and make it yourself.
You can download it on our website.
Okay.
But the capabilities have changed with Volta.
Right.
We have this concept of compute capability level.
Okay.
And Volta is compute capability level 7.
And so what that means is it can use everything that was in 6,
which includes the unified address space with the illusion of coherence, at least.
And it adds to that two really key things
for C++ developers.
Number one, it adds conforming primitives
that are compatible with the C++ memory model.
And second, the forward progress guarantees,
which allow you to write just any concurrent algorithm you want.
Okay.
I wanted to interrupt this discussion for just a moment
to bring you a word from our sponsors.
Backtrace is a debugging platform that improves software quality,
reliability, and support by bringing deep introspection
and automation throughout the software error lifecycle. Spend less time
debugging and reduce your mean time to resolution by using the first and only platform to combine
symbolic debugging, error aggregation, and state analysis. At the time of error,
Backtrace jumps into action, capturing detailed dumps of application and environmental state.
Backtrace then performs automated analysis on process memory
and executable code to classify errors and highlight important signals such as heap corruption,
malware, and much more. This data is aggregated and archived in a centralized object store,
providing your team a single system to investigate errors across your environments.
Join industry leaders like Fastly, Message Systems, and AppNexus that use Backtrace to
modernize their debugging infrastructure.
It's free to try, minutes to set up,
fully featured with no commitment necessary.
Check them out at backtrace.io slash cppcast.
So what was your role during the design of Volta?
So I was one of the leads in the SM core.
The SM core is the execution core inside of the GPU.
And for the most part, I focused on these two things here.
I focused on the memory model and the execution model.
I have a fairly long history of working on the execution side.
Seven years ago, I wrote a prototype in microcode for Kepler GPU that had this forward progress guarantee.
It was incredibly slow, but it sort of proved out the idea.
You know, people were not thinking about it as much.
People were thinking that this was just how it was.
GPU threads couldn't run concurrent algorithms because obviously.
And then you'd approach them and you'd say, well, we can work on that.
And they're like, oh, yeah, but the performance would never be good.
But the area would be too big, it would be too many transistors, or it would burn
too much power. And so we had to chip away at that until eventually everybody said, well,
it's just plain better, so let's do it. Yeah, so seven years ago I did that first prototype
that showed that it was doable,
and then I followed through.
So would it be fair to say
that you specifically designed Volta
with C++ in mind to be able to target it?
Oh yeah, absolutely.
I quoted paragraphs
out of the C++ standard in specs and in internal communication. You know, the definition of forward
progress in C++ is rather well crafted, actually.
And it's even better crafted in C++17, the SG1 PSA here.
We've done a lot of work to add different kinds of forward progress guarantees and clarify the language.
But even back in C++ 11, there was useful stuff in the spec, in particular the definition of what is a visible machine step.
That was really key.
So I quoted that out of what used to be, you know, Clause 1.
So there was that. And then the second thing is, on the memory model side, we also stole as many good ideas as we could out of C++.
I have a lot more to say about that, if you want to drill in.
Sure, go right ahead. So on the memory model side, I think C++11 was one of the biggest gifts ever given to processor designers, honestly.
Interesting.
Before that, the state of the art was everybody used volatile incorrectly.
Well, right.
And approximately no one
can correctly describe the semantics of
volatile, and in part it's because,
caveat, it doesn't have any semantics.
It's, you know, it's
crack open your CPU
manual, and then the semantics are
whatever that other thing says.
And so people would come to us
And so people would come to us
with code that
they had written on x86,
say, and they'd put volatile in
there and
then they'd say, well, we expect you to run this.
And
as a processor designer,
this is awful.
You're basically asking me
to be bug compatible with x86.
You know, like,
if you use volatile,
then your code
rests directly on the metal.
And then the semantics of your code
are whatever the semantics of the metal are,
bugs included.
And since concurrent programs
are very difficult to debug
and the bugs in concurrent programs
can be latent for a very long time
they just don't activate
then if there's a bug in your code
that x86 did not activate
because it has an extremely strong
memory model
then you coming to me and saying,
this code with volatile needs to work,
is basically saying, I need to build an x86 now,
which I'm not allowed to do.
So then C++11 comes along,
and now there's a memory model, like a spec you can read.
And it's a relatively portable spec.
And there's some really good ideas in there.
You know, the C++11 memory model is based on work by Sarita Adve.
You know, a paper she wrote in, I think, 1990 introduced a model called DRF0.
And the DRF0 model says if you can split all your memory accesses between synchronizing and non-synchronizing, then you make the synchronizing ones sequentially consistent.
And she can prove that there are no non-sequentially consistent executions left.
But here's the good news.
The synchronizing accesses are like 1% and the non-synchronizing are like 99%.
And the non-synchronizing accesses,
they don't have any expectation of memory consistency between threads.
Right.
Any expectation at all.
In fact, the non-synchronizing accesses don't even need coherence.
Now, think about that.
CPUs don't know the difference between these two things today.
Like when you JIT or compile to x86, you do a bunch of reads and writes.
x86 assumes every last one of them is synchronizing.
Okay.
Even though 99% of them are not.
And the 99% that are not don't even value the coherence protocol that x86 is running.
Okay. So 99% of the effort being expended by a cache-coherent CPU design is not providing any value whatsoever.
It is warming your room ever so slightly and occasionally depressing your performance.
Because it needs to assume everything is always synchronizing.
Okay, so that's one of the big things we did. Oh my god, we can take this distinction between
atomic and non-atomic in C++,
matching it to synchronizing
and non-synchronizing in DRF0.
We can take this information, this high-level information
from the program and give it
straight to the silicon
not only carry it through compiler
optimizations but
actually give it to the silicon
and then the silicon
only needs to make
the atomic accesses
really appear coherent.
The non-synchronizing accesses
don't need any coherence whatsoever
up until
the
acquires and releases.
But then at that point, on an
acquire and a release,
you know you can put a fence there. And so you can recover coherence at that point.
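Here is a small, made-up C++ sketch of the split being described: the plain accesses to data are the 99% that need no coherence on their own, and only the atomic flag is a synchronizing access:

```cpp
#include <atomic>
#include <thread>

int data[3];                        // non-synchronizing: plain accesses
std::atomic<bool> ready{false};     // synchronizing: atomic access

void producer() {
    data[0] = 1; data[1] = 2; data[2] = 3;          // no coherence needed yet
    ready.store(true, std::memory_order_release);   // release: publish the writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }   // acquire: observe the flag
    int sum = data[0] + data[1] + data[2];   // plain reads now guaranteed visible
    (void)sum;
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```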
It's really interesting. And I'm trying to process it all because if you're doing
multi-threaded programming correctly, you can never assume that the threads are synchronized,
as you're saying, right? Or the data between them is synchronized,
unless you have some sort of a fence or an atomic
or something that you check on, right?
Right.
So you don't have to do any of the coherency
until you see one of those fences that would force that to happen.
Exactly.
Exactly. So do you anticipate,
not that this is necessarily
your expertise, but that
general purpose
CPUs will take advantage
of this more at some point?
It's really
tough for them. It's really tough
for them because
they have spilled their specs all over the interwebs. And they have backwards compatibility requirements placed on them.
Right. That makes it so, you know, design decisions they made in the 80s constrain what they can do in the 2010s and 2020s.
Very much so.
You know, and x86 must run x86 code, because the day that an x86 stops running x86 code, they lose their stickiness. Right, right, right.
So that is massively important to them. Other processor architectures get to occasionally make
a break, but very occasionally. So, you know, even in the case of ARM, you know, we like
to think that ARM is this new upstart somehow, but they are totally not.
They're from the 80s also, yeah.
They are also from the 80s, yes.
So it's tough for other processor teams to be able to exploit this.
Yeah, and one more thought.
So I didn't mention IBM.
IBM Power is just an awesome, badass product
that unfortunately too few people get to experience.
But they have huge reliability requirements placed on them.
The power system needs to be resilient in the face of the kind of
errors that will make your MacBook Pro just, you know, crash. You know, your laptop will just
panic and reboot, and you'll go like, oh, my computer crashed, it happens, right? That does not happen to an IBM Power, basically. It cannot happen.
So they have reliability features in there that force them to be clean of any possible memory error before they can execute past a certain point.
Okay.
And that affects how strong their memory model can be.
Now, I use the words strong and Power together, which will make some people glitch.
But the IBM Power has a stronger memory model than my machine.
That's right.
I don't know enough about it to nitpick anything that you just said,
so you're good here.
So we've talked a lot about what you can do with C++ and Volta. Are there any limitations worth pointing out?
Yeah, so, right, so we're not done. So I said a little while ago that
CUDA C++ is both surprisingly conforming and also the least conforming implementation. You can hold both in your head at once.
Now, like what's remaining in our conformance?
Some of those things look more durable than others.
You know, you can imagine.
So with Volta, we tried to make significant progress on the top things.
We're not done.
So we're still thinking about this.
We're still working on these things.
Yeah, so the things that are easy or hard.
Well, concurrent with development on Volta, there was development on C++17. And in C++17, we
clarified the nature of forward progress
to include
three different progress guarantees.
There are three different levels, tiers,
of progress guarantees in C++17.
There's the concurrent
forward progress where
you
launch a thread and
you know from the time
that the constructor of the thread returns
that that other thread
shall be making forward progress right now
and it will never stop.
So you can launch a thread
and then wait for it to write to an atomic,
for instance.
That's the concurrent forward progress.
The parallel forward progress
is slightly less than that,
where you know that if it has started to make progress, then it will continue.
But you don't know when it will start unless you block on it. And if you block on it,
then you know it will start. And then weekly parallel is basically no guarantee of independent forward progress.
So that's where that matches with par and seek.
And that's where you can't write a mutex in that, for example.
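A tiny made-up C++ sketch of the strongest tier, the concurrent forward progress just described: once the std::thread constructor returns, the spin loop below is guaranteed to terminate:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> flag{false};

int main() {
    // Concurrent forward progress: the new thread is guaranteed to keep
    // making progress from the moment the constructor returns, so waiting
    // on the atomic here cannot hang.
    std::thread t([] { flag.store(true, std::memory_order_release); });
    while (!flag.load(std::memory_order_acquire)) { }
    t.join();
}
```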
Okay, so concurrent with Volta, there was also a lot of clarification going into C++.
I think that that's going to continue, basically.
We're going to build another architecture.
Spoilers!
We're going to build another architecture,
and we're going to fix some things
that are in our conformance issues.
But then we're also probably going to think about
changes to C++20
that will make it much easier to implement. So, alright, what's in that list?
Thread local storage
is a big one.
You know, back
to this
emulating 80s computer thing,
you know,
threads in C++ are assumed to be fairly
heavyweight.
C++ is
pretty well designed for a machine where the number of threads is like 12.
Okay.
Yeah, 12 or I guess we'll go to 60 or something.
It's pretty well designed for that. But some features like thread local storage are
expensive
for systems that have
lots of short-lived threads.
Volta has
163,840
threads.
Wow.
All of which can run,
all of which can execute
anything a std thread can execute.
But except for TLS.
But
it sounds like you were saying that
virtually all the variables are thread
local anyhow because it's not going to
synchronize them unless it has a reason to.
Oh,
no.
If I were to, am I
being too pedantic or something?
Yeah, yeah, yeah.
So Volta threads can dereference any pointer anywhere in memory.
So you can imagine you build a box and you put 128 gigs of RAM on the CPU and then you slot a Volta in there,
and it has its own 16 gigs
on it.
Volta threads can
dereference any of this, all of this.
And they can synchronize.
So you could use
a moral equivalent of
std::atomic, or hopefully very soon now,
actually std::atomic.
And when they synchronize, they have a view of memory that's compatible with the C++ memory model. Okay.
So you just, you know, you write code as normal.
Okay, but one thing that they don't currently have is access
to thread local storage. So if you declare some stuff
that's thread_local,
then only your CPU
threads will get that right now.
Now you can take a pointer to these
things and then share it
out in memory. Thread local doesn't
preclude sharing, it's just
private by default, but if you take
the address of it, you can share it.
And then GPU
threads will be able to see it and
access it just fine. It's just that
using the
name of the variable,
accessing it by its name,
will not work on
the GPU threads right now.
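A small, made-up C++ sketch of that distinction: the owner's thread_local object is shared by publishing its address, while access by name on another thread would name a different object:

```cpp
#include <atomic>
#include <thread>

thread_local int tls_value = 0;           // private by default
std::atomic<int*> published{nullptr};     // publishes the owner's address
std::atomic<bool> done{false};

void owner() {
    tls_value = 42;
    published.store(&tls_value, std::memory_order_release);   // share by address
    while (!done.load(std::memory_order_acquire)) { }         // keep the object alive
}

void reader() {
    int* p = nullptr;
    while ((p = published.load(std::memory_order_acquire)) == nullptr) { }
    int seen = *p;     // reads the owner's thread_local through the pointer
    (void)seen;        // reading tls_value *by name* here would touch this
                       // thread's own, separate instance instead
    done.store(true, std::memory_order_release);
}

int main() {
    std::thread a(owner), b(reader);
    a.join();
    b.join();
}
```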
I am not sure which way this one
will land.
The lack of conforming thread_local is not unique to us.
In fact, if you look at even CPU parallel programming systems,
they have a similar issue.
I think one of the poster children is actually
Cilk
from Intel.
Cilk, I'm going to say something good about Intel.
Cilk is really
impressive work. It's really nice.
It's fairly well reasoned.
Now, in Cilk,
when you spawn
children
and you block on them,
the parent can be suspended and passed by continuation passing.
And the children can steal the parent continuation.
Which means that a child running on another thread
can steal the continuation of the parent. That came from, you know,
the parent on thread one spawns some children, blocks on them,
children run in thread two,
and then when the children in thread two are done,
they steal the continuation of the parent,
which now resumes execution on thread two.
Your thread local just changed.
You used to reference an object.
Now it's a different object.
What does that mean?
Like, what does that mean?
You could write a conforming C++ program
that is highly dependent
on all your thread local accesses
being to the same object.
Yeah.
Because maybe you put some important data in there
and you're going to get it back later.
And now you're on another thread, and it's not the same one.
So this TLS issue is, I swear, SG1 spends one day per meeting talking about that.
So we have to do something about it the definition that's there right now
is very strong
basically only
works with std threads
and the thread that runs main
the thread that shall not be named
that runs main
and
as we move into
more and more parallelism,
we need to clarify what thread local means.
Even if you ignore Cilk and you stay in C++17,
consider std par and std par and par and seek.
If I run a parallel algorithm with std par,
then more than one of my lambdas will get the same TLS.
But they'll get it in sequence.
But they'll get the same TLS,
which means that they can kind of...
It means you don't get fresh TLS every time.
Did you expect fresh TLS?
Did you expect that it would get reinitialized?
It's not going to be reinitialized.
Par unseq,
multiple lambdas
effectively share the same
TLS. Because
they might run in vectorized
SIMD lanes.
And when they read and write the TLS
object, they're reading and writing the same one.
Right.
So they kind of have a data race.
They could
stomp on each other there.
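A made-up C++17 sketch of that hazard: with par the same thread_local is reused across invocations on a worker thread, and with par_unseq touching it from the lambda would effectively be a race on one object:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

thread_local int scratch = 0;   // per-thread scratch, not reinitialized per lambda

int main() {
    std::vector<int> v(1000, 1);

    // par: several invocations land on the same worker thread and see the
    // same `scratch` object, one after another.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](int& x) { scratch += x; });

    // par_unseq: invocations may be interleaved in SIMD lanes on one thread,
    // so the same access pattern would effectively stomp on a single object:
    // std::for_each(std::execution::par_unseq, v.begin(), v.end(),
    //               [](int& x) { scratch += x; });
}
```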
So I think
you have to expect that in C++20 we're going to write
something about TLS and there's
maybe similar to what we did with the
forward progress guarantees, there's going to be more than one
tier.
There's going to be
has full-fat TLS, that'll be the
thread thing, and then has potentially shared TLS or potentially identity changing TLS.
And that will be maybe the std par thing.
And then has no access to TLS.
Might be like the par_unseq thing.
Wow.
And so what it will mean for our future architectures here at NVIDIA is we'll have to decide which
tier we want to go for.
In Volta and C++17, which were designed
concurrently,
which no one except people
from NVIDIA knew,
which were being developed concurrently,
we decided
to go for the stdpar
tier, even though everyone
in the world expected us to build the
par_unseq tier.
Okay, so what are we going to do with TLS?
I don't know.
Yeah, well,
I have to say, I don't know.
So, you know,
that's one challenge, and then
there's a similar challenge with
sharing pointers to automatics.
I just said, you know,
thread local is private by default, but it
can be shared if you take a pointer to it.
The same thing is true
of automatics.
If you have a local variable,
you can take the address of it and somebody else can
read and write it. As long as you, you know,
and by the way,
be careful because it's easy to run afoul
of the memory model when you do that.
But assuming you dotted all the i's, then you can do that.
Well, you can't do that on the GPU right now.
And that one's been really tough to think about.
We have a super efficient implementation of stacks.
And it's kind of incompatible with addressing.
So, you know, we emulate addressing for the thread that owns the stack,
that particular stack.
But it's tough to emulate it for all threads in the system. So this is a long answer to your question of what incompatibilities remain.
Those are some big ones.
Everything else probably that you can think of, I haven't said the word exceptions.
I think too much ink is spilled on our lack of support
for exceptions. I don't think it's
a durable problem. I think it's a business
question. Somebody needs to say
we want to buy GPUs
and we'll only buy GPUs if they have
exceptions. And then
you have to say it with a straight face and not
burst out laughing in the middle of that sentence.
And then it can be solved. It can be worked on.
There's no reason GPUs can't do
C++ exceptions. It's just control flow.
Right.
Okay. Well, it's been great having you on today,
Olivier.
Oh, thank you.
It's been great.
I'm trying to get the word out everywhere that I can about the C++ stuff we did in Volta.
You'll see me at CppCon.
I'm working on blog posts and other things to put out that communicate straight to the users.
Where should listeners look for that?
On the NVIDIA blog or on your personal blog?
No, on the NVIDIA blog.
We have a blog with a great name called Parallel for All.
Although you might say it should really be Parallel for Each.
But it is currently called Parallel for All.
And that blog, in particular, has posts that are relevant to programmers,
and in general, most of the articles are aimed at C++ programmers.
Okay. Thanks so much for your time today. Yeah, thanks for joining us.
Thank you.
Okay. Thanks so much for listening in as we chat about C++.
I'd love to hear what you think of the podcast. Please let me know if we're discussing the stuff
you're interested in. Or if you have a suggestion for a topic, I'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com. I'd also appreciate if you like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at
Rob W. Irving and Jason at Lefticus on Twitter. And of course, you can find all that info and
the show notes on the podcast website at cppcast.com. Theme music for this episode
is provided by podcastthemes.com.