CppCast - stdpar

Episode Date: September 10, 2020

Rob and Jason are joined by David Olsen from NVIDIA. They first discuss the news from the ISO Committee that C++20 has been approved and work on C++23 will continue virtually. Then they talk with David about his work on NVIDIA's C++ compiler to run parallel algorithm code on the GPU and a proposal he's working on to introduce 16-bit floats to standard C++.

News

- C++20 approved, C++23 meetings and schedule update
- If everyone hates it, why is OOP still so widely spread?
- New safety rules in C++ Core Check

Links

- Accelerating Standard C++ with GPUs using stdpar
- P1467R4 Extended floating-point types and standard names

Sponsors

- Clang Power Tools

Transcript
Starting point is 00:00:00 Thank you. Modernize your code now. Get ahead of bugs by using the LLVM static analyzer and C++ Core Guidelines checks from the comfort of your IDE. Start today at clangpowertools.com. In this episode, we talk about some news from the ISO committee. Then we talk to David Olsen from NVIDIA. David talks to us about the NVIDIA HPC C++ compiler and stdpar. Welcome to CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Rob Irving, joined by my co-host, Jason Turner. Jason, how are you doing today? I'm doing all right, Rob. How are you doing?
Starting point is 00:01:40 I'm doing good. Looking forward to virtual CppCon next week. How about you? Yeah, right. Monday that starts, huh? Yeah. I am. Or technically Sunday is, like, the training? It reads like there's... No, no, there's still like a meet and greet, effectively, on the calendar. Okay, something like that. And I'm like, I don't really know what that means. I might not be able to join in myself because I'm going to be, uh... Sunday. Okay. The training courses have already started. Oh, wow. Okay. Yeah, they're the week before instead of the weekend before.
Starting point is 00:02:12 That's interesting. Okay. Yeah, and they were adjusted to make sense for time zones and people's availability and all that stuff too. So this will be very interesting to see how a virtual conference works. Yeah, it certainly will be. I mean, and oh sorry rob go ahead i was gonna say in a way uh cbp con gets the the advantage of not being the first one right because there's already been like four or five so they get to learn from everyone else's mistakes hopefully, well, at the top of our episode, I'd like to read a piece of feedback. We got this tweet from Kobe last week.
Starting point is 00:02:51 Great episode and fun to listen to. This is in response to our unit testing episode with Oleg Rabiv. And yeah, great episode, fun to listen to. Sometimes I'd use this CPP member accessor library to poke around non-public interface and members. It makes it easier versus extracting something inherent in the class type. Method is based on a few articles, one of them from Herb Sutter. And I'll put a link to this in the show notes. But it's a way to get at private members without just changing
Starting point is 00:03:25 the public-private definition we talked about last week. Right. I saw this thing. I don't remember when. I think it's the first time I've seen it. It's interesting, though. Maybe I'll have to take another look.
Starting point is 00:03:38 Yeah. Well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback at speedcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is David Olson.
Starting point is 00:03:52 David's first job out of college in 1993 was working for several years on the front end of the Rational Software C++ compiler. Then he didn't touch C++ at all for a couple decades, instead programming mostly in Java and C Sharp, working on various software development tools and mobile apps. For the last four years, David has been the lead developer on the PGI C++ compiler, recently renamed to the NVIDIA HPC C++ compiler. He has been a member of the ISO C++ Standards Committee since November 2018. David, welcome to the show. Thank you. I've been listening to this podcast for several years. I'm glad to be a guest. Awesome. Long time listener, be a guest. Awesome.
Starting point is 00:04:26 Long-time listener, first-time guest. Yes. I'm curious. I don't think I'm at all familiar with Rational's C++ compiler. It was not a success. Nobody heard of it. It had very few sales. I guess that explains why you changed over to Java and C Sharp for a while.
Starting point is 00:04:51 Yeah, I moved off to, it was shut down and I was moved to other projects. 1993, huh? Back in the era when you could still charge for a compiler, basically. Yep. Okay, well, David, we've got a couple news articles to discuss. Feel free to comment on any of these, and we'll start talking more about the NVIDIA C++ compiler you're working on. Okay?
Starting point is 00:05:08 Okay. All right. So this first one is on Herb Sutter's blog, and this is C++20 approved, C++23 meetings and schedule update. So, yeah, the big news, I saw this over Twitter and on Reddit over the weekend is that C++20 has passed unanimously and is on track to publish later this year, which I think we were all expecting, although maybe there was some concern that with the virtual meetings and the pandemic that things might not go as planned. But it's certainly great news that C++20 seems to still be on track and getting published soon.
Starting point is 00:05:48 Yeah, the pandemic didn't affect the publishing. The committee finished all our technical work at the final face-to-face meeting in February. So this is just waiting for the countries to approve their ballots. I think, to me, the most interesting news from this, because I wasn't at all concerned that C++20 was going to ship at this point, was that all future face-to-face meetings have been postponed tentatively. So Kona, the first one that I actually bought a ticket to and was planning to go to, has officially been canceled, I guess, right? Correct. But that's fine. I will still go to has officially been canceled i guess right correct yeah yeah that's fine i will still go to kona so yeah everything's canceled at least until the end of march and then you know hopefully after
Starting point is 00:06:34 then things will be in a state that they can start picking up again uh well herb and other communication to the committee has said he thinks it'll be a long time really yeah the international committee meetings will be one of the last things to get back to normal okay it does seem like one of the more risky things because you're asking a bunch of people from different parts of the world to come all together into a room with 200 people again or something correct yeah it's that gathering from around the world that is the real is a is a problem so just do it on a cruise ship oh god so yes the committee is trying to figure out how to make make progress virtually without these meetings how do you think it's been going as a member of the committee it has slowed down
Starting point is 00:07:21 and so while we still plan to have c++ 23 and we still have the same priorities there's a good chance we won't make we won't accomplish as much as had originally been hoped right yeah so that's one of the key points also from this article is yeah 23 and 26 whatever 29 are still going to happen it just won't be necessarily all the features that were hoped for right correct yeah okay uh next article we have here is a post on the stack overflow blog blog and it's uh if everyone hates it why is oop still so widely spread and this kind of interesting goes into the history of object-oriented programming and uh kind theorizes that OOP might not be that popular on its own merits, but it might just be kind of piggybacking on language trends.
Starting point is 00:08:14 Right, Jason? Yeah, I thought it was fascinating. Effectively, to some extent, this article blames C++ for making OOP popular. Yeah, and Java as well, since Java kind of even doubled down on object-oriented programming and became very popular at the time. Yeah, and meanwhile, C++, from our perspective as C++ developers, like, whoa, hold on a minute here. No one ever said C++ was exclusively an object-oriented language. That's just one of the things it can do.
Starting point is 00:08:45 Do you have any thoughts on this, David? Yeah, I haven't done much UI programming, but when I have, it's been object-oriented. At least in those situations, I found that that object-oriented works pretty well. So I think object-oriented programming has its uses. There are some domains where it's a good fit for the problem, for the programming.
Starting point is 00:09:10 I think it'll always have a place it'll always be around i feel like uh for anyone who's curious to dig into what you just said you can do an apples to apples comparison here with gtk which is the c gui toolkit from you know the gimp to gui gimp toolkit or whatever. And GTKMM, which is the C++ wrapper around it. And GTKMM is like 100 times more usable than GTK because the things are actually organized into classes and objects. Okay, and then the last article we have here is on the Visual C++ blog, and this is about new safety rules in C++ CoreCheck. And I think it's probably been a little while since we talked about the CoreCheck,
Starting point is 00:09:50 but they're adding some new stuff here, checking for having a default case in your switch statement, expensive copies with the auto keyword. So it's all good stuff. Yes, I like that they're finding potential problems and warning about them. And especially like that in three of the four, the warning message contains a suggestion of how to fix it.
Starting point is 00:10:16 Yeah, that's very nice. I also found it very interesting that, and I think also three out of the four of these or so, the example, the counter example is Rust. Like, look, Rust won't let you do this, basically. Yeah, I thought it was interesting how they talked about Rust in here because there's still no support for Rust with Visual Studio, but interesting way to show how another language handles these things.
Starting point is 00:10:43 Yeah. I asked people on Twitter recently, what's your favorite non-C++ programming language? And a considerable number of the answers were like, you mean other than Rust? Okay. Well, David, we talked a little bit about
Starting point is 00:10:58 Stoodpar and the NVIDIA C++ compiler a few weeks ago when we saw a blog post I think from you. Could you start off by telling us a little bit more about it? Okay, the NVIDIA HPC C++ compiler. Sorry, let me go back a bit. PGI, Portland Group Incorporated,
Starting point is 00:11:22 has been around since 1989 with a Fortran C and later a C++ compiler for high-performance computing. We've been recently rebranded to the NVIDIA HPC compiler group. Okay. And so, yes, our market is high-performance computing, especially the scientific computing on the supercomputers. And our big project of the last couple of years is getting the C++ standard parallel algorithms to run on GPUs. Okay. And so we have come up with the term, STD PAR stands for standard parallelism, which is any construct in standard language that'll get you the parallelism. So in C++, that's the parallel algorithms
Starting point is 00:12:09 that were introduced in C++17. Fortran recently introduced a standard parallelism, the Duke concurrent loops. So we're using stdpart to refer to that too. We think that will be useful. Yes, our group still has, we still produce a Fortran compiler and a lot of people use it. It's has, we still produce a Fortran compiler.
Starting point is 00:12:27 And a lot of people use it. There still is a lot of Fortran code out there in the HPC market. So yes, we introduced it this year, feature of running standard C++ code on GPUs. And using the parallel algorithms. Okay, so it exclusively applies to the parallel algorithms. Okay, so it exclusively applies to the parallel algorithms. That's the only mechanism for parallelism in C++. So yes, that's what we latched on to, to get the code onto GPUs.
Starting point is 00:12:57 GPUs are good at running code in parallel. So like a stood thread, I'm saying, that's not going to automatically ship off to a GPU or something, just to clarify. No, and that's not a good choice to put on your GPU. Just making sure we're on the same page. GPUs are best when you run the exact same code in multiple threads, where the threads are running the same code, just on different data. Okay.
Starting point is 00:13:20 And so, yes, you want one of the algorithms where each iteration you can run in a separate thread, but they're all running essentially the same code. So from the programmer's perspective, what does this look like? I'm using a standard algorithm. Do I do anything special? You just write standard C++. And to get it onto the GPU, you use the NVIDIA compiler, NVC++, with the right option to turn this on. Okay. And it will take any algorithm that you've used a parallel execution policy with, and it will run that on the GPU.
Starting point is 00:13:57 Okay. Sounds very easy to use then. It's almost too good to be true. It is easy to get going, but there are some limitations, which are inherent with GPU programming. Okay. Do you want to go into those? Yeah. GPUs are not just CPUs with more threads, and they've always been difficult.
Starting point is 00:14:22 They've often been hard to program. Look, they have a different instruction set. So the compiler has to generate different code to run it on the GPU. And GPUs have separate physical memory. They don't share the same memory banks. So you have to worry about getting your data from the CPU memory to the GPU memory.
Starting point is 00:14:45 And those have always been or usually been done explicitly. In the early days of GPU programming, the programmers would have to specifically say, here, I want this function to be compiled for the GPU. And then you have function calls to your runtime to move the data, to explicitly move data between the CPU and GPU memories. So it can be available on the GPU for your code. So in Studpar, we are working on eliminating or reducing those limitations.
Starting point is 00:15:21 Try to make it easier, but we can't quite get rid of all of them right away. Is that a limitation of the C++ memory model or the hardware capabilities or something else? Well, the limitations come from the GPUs. Removing the limitations from or hiding those limitations from the programmer is just a lot of work. It takes coordination with our compiler, with our runtime, with the driver team. So everyone needs to cooperate. From the operating system, we need to make changes to Linux to get some of the memory to remove some of these memory limitations. So what we have accomplished, and the limitations that we have gotten rid of, the compiler will figure out which functions or lambdas or which code you want to run on the GPU.
Starting point is 00:16:12 And will automatically compile that for the GPU. So you don't need to explicitly say, I want this function compiled for GPU. So there's no host or device annotations on your functions. Okay. For memory, we have arranged through work that NVIDIA has done. We have arranged so that all heat memory, all dynamically allocated memory, is available on the GPU without you having to do anything. How does that work? Actually, the compiler intercepts calls to the allocation routines to operator new and allocates it in a special way.
Starting point is 00:16:52 It basically informs the driver for the GPU, hey, make this memory available. And then the GPU driver will use page faults to automatically copy the memory as necessary. So it'll detect the first time you access it from the GPU and do the copy then, copy it over to the GPU memory. Kind of sounds like how memory checkers like address sanitizer also work, having to kind of hook into these things. Yes, sanitizers would want to do similar things like hooking into page faults and seeing when you access them. So I feel like whenever I go to upgrade my computer, I can never keep track of PCI Express 4.0 3X with 16 lanes.
Starting point is 00:17:37 I don't even know. How much overhead is it to be sharing this memory between the main memory and the GPU? There's some. It depends on the application. Okay. We have many programs where this managed memory, as we call it, or unified memory, is just as fast or is about as fast as if you moved the memory yourself explicitly.
Starting point is 00:18:05 Okay. But there are others where it's significantly slower. So it is something that you might have to be aware of if you're sharing lots of data. Right. Um, so this is, this is one example of, of sort of the perform, the performance productivity trade-off, which we, which we see. So this stood par feature is automatically standard C++ on GPUs is going for productivity. We think this is one of the most productive ways to get good performance.
Starting point is 00:18:37 But because standard C++ does not have any mechanism for fine-tuning the details of how you run these algorithms in parallel. We have to make assumptions and go with things that work generally. There's no way to specify the data movement. So that's why we have to use this unified memory. To get the very best performance, you will sometimes want to do this fine-grained control. You want more control over how things run. And so you would have to use some other programming model, like OpenACC or OpenMP or compiling in CUDA. Okay.
Starting point is 00:19:18 So CUDA is NVIDIA's sort of programming model they've had from the beginning for programming for GPUs, which gives you absolute complete control over everything. So you can get the best performance by programming in CUDA, but it's at the cost of more work and less portability. So you said the compiler detects which functions we're calling and tries to automagically handle that. Is there any limitations in the kind of functions we can pass to our parallel algorithms?
Starting point is 00:19:49 On the GPU, there's limited access to the operating system. It really does not have access to the operating system, except for limited ways that we've provided. So if your function's doing I.O., that won't work.
Starting point is 00:20:08 That makes sense. We can compile it for the GPU, IO that won't work we can compile it for the GPU but it won't work what does the failure case look like actually if I ask it to do something on the GPU and it can't do I get a compile time error or run time sometimes you get a compile time error sometimes it's a run time error the better error reporting is an area we need to work on Sometimes you get a compile time error, sometimes it's a run time error. The better error reporting is an area we need to work on.
Starting point is 00:20:34 At the moment, though, that's understood. It sounds like a very difficult problem. Is it an error that I can catch in some way with an operating system signal handler or standard exception or something like that? No. Actually, if you're using the parallel algorithms, the definition in C++17 for the parallel versions is if any exception escapes, then terminate is called.
Starting point is 00:20:55 So that's what we do. We have in our implementation, if anything goes wrong, if the GPU kernel fails, then we call terminate. It's better than corrupted memory and continuing on or something like that. Right. Want to interrupt the discussion for just a moment to bring you a word from our sponsors. Clang Power Tools is the open source Visual Studio extension on a mission to bring LLVM tools like Clang++, Clang Tidy, and Clang Format to C++ developers on Windows.
Starting point is 00:21:23 Increase your productivity and modernize your C++ code with automatic code transformations powered by Clang Tidy. Find subtle latent bugs by using LLVM Static Analyzer and CPP Core Guidelines checks. Don't fiddle with command line switches or learn hundreds of compiler flags. Clang Power Tools comes with a powerful user interface to help you manage the complexity
Starting point is 00:21:40 directly from the comfort of your IDE. Never miss a potential issue by integrating Clang Power Tools into your workflows with a configurable PowerShell script for CI CD automation. Start experimenting on your code today. Check it out at clangpowertools.com. I know there's different execution flags with parallel algorithms. How do those work with Steadpar?
Starting point is 00:22:00 Okay. C++17 introduced three execution policies for the parallel algorithms. C++20 added a fourth, so the other. Now four of them, seek, S-E-Q, stands for sequential. That means no parallelism. Unseek really means vectorized. But some people objected to the name vectorized or vectorization.
Starting point is 00:22:26 So it's called unsequenced. And then par stands for parallel. So just run this in parallel. You can run it on multiple threads. And then par unseq says it's safe to do both parallel and vectorized. And the vectorized one, the unsequenced, has more restrictions on the code in question because you can run multiple instances,
Starting point is 00:22:52 multiple iterations of your loop on the same thread at the same time interleaved. So if you're doing unseq, one of the unsequenced ones, you can't use any locks inside your code. For the parallel ones, you have to avoid data races. It's up to the programmer to avoid the data races that can fire. So you have to be careful about what each iteration of your loop accesses to avoid data races. Right. So those are the four
Starting point is 00:23:18 execution policies that you can apply to these algorithms. In our stdpar implementation, the two parallel ones, par and par and seek, we will run those on the GPU. Okay, so we will not do sequential execution on the GPU? No. That would make sense. There's no point. There's not a benefit of doing sequential execution on a GPU. Right, right. Someone we had on a long time ago, Rob, I don't even remember who it was. We were talking about the parallel algorithms. And our guest suggested that we should treat every algorithm call as if it were a parallel algorithm call and always pass one of those execution flags to it. And then just let the runtime or the compiler decide what to do in that case, if it's enough data to parallelize it or not, or if we specify that it can't be. I was just
Starting point is 00:24:11 curious if you had an opinion on that since you have been working on these things. Auto parallelization is quite hard. It's hard for compute. That's still a little bit beyond what compilers can do. Okay. Detecting the potential data races is a very hard problem. I think that's where programmers still earn their money, is avoiding problems like that. PGI has tried auto-parallelization in the past in their OpenACC, and it's not something we continue to pursue.
Starting point is 00:24:49 We're not quite ready for that. It may make sense as you write your algorithms, you can look at every one and say if this one is safe to parallelize I may as well throw in that par policy
Starting point is 00:25:03 and let the compiler parallelize it if it can. But you've got to be careful that you don't tell the compiler to parallelize one that would be unsafe, that would result in a data race. Right. And I guess if we took that standpoint, then we would be ready to roll as soon as we wanted to compile with NVCC. Right. It's interesting. It's like future-proofing your code in a sense, if, as you say, you can know whether or not this algorithm call can be parallel. Right, but you've got to watch out for those other limitations that are, say, specific to the GPU.
Starting point is 00:25:34 So I talked about the memory. The heap memory is automatically shared. Right. But that means stack memory or global memory is not. And so if you try to access a stack variable from within your parallel algorithm call, and we put that on the GPU, that will fail. Oh, okay. That would be easy to accidentally do, or at least unintentionally,
Starting point is 00:26:00 with a reference capture and a lambda or something. Exactly. That's why in my presentations I put that. watch out for reference captures in your lambdas. But if I pass a lambda in with a copy capture, that's fine because that's part of the function state? Yes, the lambda object, the capture itself, gets actually copied because it's passed to the function. It actually gets copied to the GPU memory. So there's a separate copy of the Lambda capture in GPU memory, and that works fine. Interesting. Yeah, I mean, I guess if you were doing a reference capture into a Lambda that you're passing into a parallel algorithm,
Starting point is 00:26:37 you're probably already asking for trouble because you're probably doing some unchecked access on it, right? I mean, not probably, but there's at least a good chance. It depends on whether you're modifying it or you just, you may have used a reference capture just because it's expensive to copy. Right. And you're only reading it. And that would work in a CPU multi-threaded environment.
Starting point is 00:26:59 It just might have problems on the GPU. Might have problems. Okay, cool. So I actually have used used in some of our example code some of some of our code we've used uh reference capture by reference but for something that was expensive to copy but we had to make sure that the object itself was on the heap okay uh i know you mentioned linux uh is that currently the only supported platform? Yes. The NVIDIA HPC SDK is Linux only,
Starting point is 00:27:34 though we support a variety of CPUs, x86, Power, and ARM. We are working on a Windows port. Windows version will come sometime. Okay. And what types of GPU hardware are supported supported is it kind of only the latest and greatest from nvidia uh forested par yes it is uh volta or newer volta was released two years ago or three years ago some support for the earlier architecture um pascal but to get to get the full feature you need to be on volta or. The other GPU programming we support,
Starting point is 00:28:08 other GPU programming models we support all the way back to the oldest ones. So if you're using OpenACC to get onto the GPU, then we support older ones. But the Stoodpar is just the newer ones. Because Volta was the first one to have independent thread scheduling um where which is what you need to guarantee that the the parallel algorithms will work okay um i i'm sorry i had to look up power because i'm like power i know i'm supposed to know that that's ibm's architecture right yes that is i yeah ibm's uh architecture. Okay, so I feel like I see a theme here.
Starting point is 00:28:45 If you say x86 ARM and Power, you are specifically aiming for the supercomputing world, then it sounds like. Okay. What kind of response have you gotten since introducing the NVIDIA C++ compiler? I mean, I'm sure there's lots of programmers out there who are used to using something like CUDA
Starting point is 00:29:07 and having full control, but are they more interested in getting the productivity by using Stdpar? Many of them are, we think. People are still trying it out. We haven't had a whole rash of bugs. We've had enough bug reports that we know people are trying it, but not enough to make us worried.
Starting point is 00:29:30 The, the GA release first one was only in July. So it's only been out there easily available for a couple months. Oh, wow. Yeah. I mean, if you're not getting very many bug reports, the other option is that the system is perfect and there's no bugs for
Starting point is 00:29:45 people to report. Right. I know that's not the case. But our messaging is that the parallel algorithms are a good way to go for new code and new projects. If you have existing code that runs on the GPU or some other programming model, don't go try to change it. If it works to your satisfaction. So this is aimed at newer projects. Okay. So if you're already having success with CUDA or something else, don't try switching. Correct. So we encourage it for the productivity reason, we encourage you start with standard C++ and measure your performance. So that'll get you the best productivity, the best portability. And in many cases, we think the performance will be good enough for your environment.
Starting point is 00:30:37 But if you need better performance, then you can move on to other programming models. You can mix in OpenACC. Let me get you some. If you need the best performance, then you can rewrite just the core parts of your program that need the performance. You could rewrite them in CUDA. Okay, so you said it's only recommended for new code. That's the way we're recommending it. But you can kind of mix it in like if you're writing you know some new code but your code base already has some cuda you could write the new code and
Starting point is 00:31:10 use the nvc compiler yes okay i'm curious what kind of challenges you hit i'm trying to like visualize okay you have to compile this function for offloading to the GPU. Then when you go to actually execute the algorithm, you've got your C++ memory, whatever that is in, whatever byte ordering makes sense with whatever calling conventions make sense on this architecture, and then you have to do something different on the GPU. Is there like a translation layer or something?
Starting point is 00:31:44 How does that work? Well, we've been doing GPU programming for a long time. Right. And so we basically pulled together technology that we already had. So our compiler, through the OpenACC support, already knows how to generate code for the GPU. We know how to identify functions, and we have our GPU code generator the GPU. We know how to identify functions, and we have our GPU code generator
Starting point is 00:32:05 for them. So those functions get fed to both the CPU and GPU code generators. Okay. And then the unified memory that gets the heap automatically shared, that's been available in CUDA and OpenACC for several years already. So that's a well-known technology. Okay. So it really was piecing them all together and writing the header files that calls the right things to launch your kernel on the GPU.
Starting point is 00:32:35 With the important key being that it looks to the user like magic. Right. Yeah, so there was quite a bit of work of piecing it all together, getting all these pieces together, but most of the really big chunks were already existing stuff that just had to be adapted okay uh so david we mentioned in your bio that you've been a member of the iso committee since
Starting point is 00:32:55 2018 uh do you want to tell us a little bit about this paper you're working on about floating types floating point types okay um right this is one of the things NVIDIA wants in the standard. Because 16-bit floating point types have been out there for a while. People are using them. And there is hardware support for them on GPUs, on NVIDIA GPUs, and some ARM CPUs and some other hardware out there. And people want to use them. 16-bit floating point is really small. and some ARM CPUs and some other hardware out there. And people want to use them.
Starting point is 00:33:30 16-bit floating point is really small, and there's limitations, and you don't want to use it generally. What are the use cases people are using it for? Sorry. I'm not a math expert, or a math programming expert, so I don't have a lot of experience. But I think if you have lots of data, so storage becomes, the amount of storage becomes an issue.
Starting point is 00:33:52 16-bit floating point is less storage. But if you have lots of data and you're more interested in the averages or in the trends, and you don't need precision in every single data point, then this smaller, less precise, less precision value may be useful. It may be a good space trade-off. The space you save may be a good trade-off. So yeah, there's these 16, people are using 16-bit floating point, but there was no good way to do that. You can't do that in the standard. In standard C++. Standard C++ only has three floating point types, float, double, and long double.
Starting point is 00:34:30 No provisions for more. And by convention, float and double are pretty much locked into 32-bit and 64-bit. If you changed float to be a 16-bit type, your users would scream at you.
Starting point is 00:34:46 It might break some existing code. So this paper is an effort to expand the standard's ability for floating point. The paper is titled Extended Floating-Point Types. We are proposing a way to allow compilers to define additional floating point types. These would be optional and implementation defined. Okay. But we specify the rules for how all these floating point types have to interact.
Starting point is 00:35:16 So if your implementation defines a couple of 16-bit floating point types and a 128-bit floating point type, these are the rules for implicit conversions, how they convert with each other, how overload resolution works, and how the arithmetic conversions work, so what the resulting types of your math are. And then the second part of the paper proposes some standard names for well-known types, well-known layouts. Okay. So we are proposing standard names for IEEE 16-bit, 32-bit, 64-bit, and 128-bit, and for BFloat16.
Starting point is 00:35:58 And that is, BFloat16 is IEEE 32-bit with 16 bits of the mantissa just chopped off. Okay. And that is in use. So there's actually two 16-bit floating point types that are out there in use with hardware support, the IEEE 16-bit and this BFloat16. So just to maybe clarify for me and for the listeners, IEEE floating point is, that's what, 754 or something like that? Yeah, that was the original standard number. Okay, that's what we think of when we think of floating point. That's what our FPUs and our general purpose computers generally work with. Yes, that's what the vast majority of
Starting point is 00:36:45 hardware out there is now, IEEE. Okay. So when you said this BFloat chops off 16 bits of the mantissa, I just cannot visualize what that means. What does that mean? There's normally, IEEE 32-bit, if I'm remembering right, uses 8 bits for the exponent, 1 bit for the sign bit, and then 23 bits for the mantissa. Okay. And BFloat16 just gets rid of 16 of those bits, so it only has 7 bits for the mantissa.
Starting point is 00:37:24 And the mantissa... But it still uses 7 bits for the Mantissa. And the Mantissa... But it still uses 8 bits for the exponent. Okay. Weird. Yeah, so it's very low precision. Low precision, but high... But you have the same range. The same range, right. As a 32-bit.
Starting point is 00:37:38 Okay. So who uses that? I don't know exactly who it is, but again, that's actually maybe a good fit for this case where you have lots and lots of data, but you're only interested in averages and trends.
Starting point is 00:37:58 Okay. So you don't need the precision on every data point. You just need to know what the trends are. Right. Okay. So this paper is in its fourth revision. Is it going well through the standard? Do you expect it to get into C++23? I am hopeful that it will get into C++23.
Starting point is 00:38:18 Okay. I guess with us slowing down, it's a little less likely, so it's definitely not guaranteed. It's not on the official priority list of things to do. But yes, it is making progress. There will be a couple more revisions. There's still more work to do. But the language rules for how these types interact are mostly worked out. Go ahead, Jason. On a half-serious, half-joking note, if you're adding 16, 32, 64, 128, why not float8?
Starting point is 00:38:52 Because people aren't using that in the world very much. I have heard about an 8-bit float, but it's not your usual IEEE. It's not similar to the IEEE types. Okay. It's something completely different. But it's not widely used. But a compiler could support it as an extended floating point type.
Starting point is 00:39:17 It just wouldn't have a standard name. Right. And we would have the rules there for how it would interact with the other floating point types. Okay. What type of work is still to be done on the paper? To finish up, get consensus on how overload resolution should work. And then there's a bunch of library work to do. So we're proposing adding overloads or template specializations for all the extended floating point types in a bunch of places.
Starting point is 00:39:45 So we have to figure out exactly which places we should do that and what exactly those mean, especially which sort of IO we should support for extended floating point types. And then we have to pick the actual names. So we've proposed the types that
Starting point is 00:40:02 we'll have standard names, but we haven't settled on what those standard names are yet. You mean like short short float or something like that? Short will not be in there. There was an earlier proposal of just adding short float as a floating point type, and that failed and then led us to be this more expansive one that allows more extensibility. So one of the proposals is name it STD float 32T. So some people want sort of IEEE in there,
Starting point is 00:40:37 or actually IEC is the formal name of what this layout is. So IEC something in there. So there are several different proposals out there. I'm not worried that we won't be able to agree on something. We just haven't done the work to come to that agreement on what these will be named. I could see the argument for wanting the IEC or IEEE in there. Yeah. So I'm curious, because something that's always kind of annoyed me with the standard is if I add two 8-bit values, I get back a 32-bit int because of integer promotion rules. And you were just saying a large part of this paper is describing the rules between these floating point values.
Starting point is 00:41:17 What happens if I add two float16s? You get a float16. And that matches the current floating point behavior. Right now, if you add two floats, you get a float. Oh, it doesn't automatically get promoted to the default of a double in that case. There are some places where float promotes to double. Okay. That's preferred during overload resolution, for example, over a standard conversion. But floating point arithmetic does not repeat the integer behavior of everything promoting to int. Sounds great. And so we will
Starting point is 00:41:51 preserve that behavior. Just curious what things you are looking forward to in C++23 besides this floating point support to help with your parallel stuff? Well, NVIDIA's goal is standard C++ on GPUs, so a part of that is expanding the standard in ways that are more GPU friendly. The big one in there is executors, which, to greatly oversimplify, is a standard way to control how and where your code gets run. And so once executors get in, it will be easier to do some fine tuning of how you want this to run on a GPU. You can specify: yes, I want this chunk of code to run on a GPU with certain properties.
Starting point is 00:42:41 so yes we are very excited about executors hopefully getting in. Does that look like executors is on track for C++ 23, hopefully? It is one of the top priorities, and yes, it is making progress. It's now being in library evolution and has had
Starting point is 00:42:59 a lot of virtual meetings have been devoted to reviewing that and improving it. And there's other features that are kind of waiting for executors too, right? Like networking, I believe? Yeah, right. And the other area that we are working on that will help GPU programming is linear algebra. Specifically, there's a proposal to have a C++ interface for BLAS,
Starting point is 00:43:25 which is basic, though I can't remember what it stands for. The L-A in there is linear algebra. Linear algebra system. Let's just, I don't know, yes. Which has been around for many decades. It was developed for Fortran many, many years ago and has become a standard. But its interface is ugly, hard to use, very Fortran or C-specific.
Starting point is 00:43:55 And so there's a proposal underway to provide a C++ interface for that functionality. So some form of BLAS is used very widely in scientific computing. There's lots of ports of it. It's something that parallelizes well. So there's lots of implementations of BLAS for multi-core CPU or for GPU. And we want to have a better interface for it, so it's much easier to use from C++. Awesome. I did not know anything about that being worked on.
Starting point is 00:44:26 That's something that likely will not make C++23. Okay. Because it wasn't near the top of the priority list. Right. But there's lots of different people working on it. We're working in conjunction with the National Labs on this. So I'm sure it will happen. I'm just not sure when.
Starting point is 00:44:48 Right. Okay. Well, David, it's been great having you on the show today. Listeners who are interested in getting started with the NVIDIA compiler, where should they go? Go to the blog post that you will have in the show notes. Okay. That's the best description of it and has links to how to download the product
Starting point is 00:45:09 and getting it started section. Sounds great. Thanks for coming on the show. Thank you for having me. Thanks for joining. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast.
Starting point is 00:45:20 Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too. You can email all your thoughts to feedback@cppcast.com. We'd also appreciate it if you can like CppCast on Facebook and follow CppCast on Twitter. You can also follow me at @robwirving and Jason at @lefticus on Twitter. We'd also like to thank all our patrons who help support the show through Patreon. If you'd like to support us on Patreon, you can do so at patreon.com/cppcast. And of course, you can find all that info and the show notes on the podcast website at cppcast.com. Theme music for this episode was provided by podcastthemes.com.
