CppCast - SIMD Wrapper Libraries

Episode Date: November 22, 2018

Rob and Jason are joined by Jeff Amstutz to discuss SIMD and SIMD wrapper libraries. Jeff is a Software Engineer at Intel, where he leads the open source OSPRay project. He enjoys all things ray tracing, high performance and heterogeneous computing, and code carefully written for human consumption. Prior to joining Intel, Jeff was an HPC software engineer at SURVICE Engineering where he worked on interactive simulation applications for the U.S. Army Research Laboratory, implemented using high performance C++ and CUDA.

News
- Freestanding in San Diego
- Getting Started with Qt for WebAssembly
- Trip Report: Fall ISO C++ standards meeting (San Diego)

Guest
- Jeff Amstutz (@jeffamstutz)

Links
- CppCon 2018: Jefferson Amstutz "Compute More in Less Time Using C++ SIMD Wrapper Libraries"
- tsimd - Fundamental C++ SIMD types for Intel CPUs (SSE to AVX-512)
- OSPRay

Sponsors
- Download PVS-Studio
- We Checked the Android Source Code by PVS-Studio, or Nothing is Perfect
- JetBrains

Hosts
- @robwirving
- @lefticus

Transcript
Starting point is 00:00:00 Episode 176 of CppCast with guest Jeff Amstutz, recorded November 21st, 2018. Use the coupon code CPPCAST during checkout at JetBrains.com. In this episode, we discuss Qt WebAssembly apps. Then we talk to Jeff Amstutz from Intel. Jeff talks to us about SIMD and SIMD wrapper libraries. Welcome to episode 176 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Rob Irving, joined by my co-host, Jason Turner. Jason, how are you doing today? Happy Turkey Day. Happy Turkey Day. Well, it's Wednesday, technically. Yeah, it's my Thanksgiving now.
Starting point is 00:01:54 Well, yeah. And happy Thanksgiving. Sorry, I guess I'm a little distracted and tired and such. It's okay. It's okay. But I was just thinking this time as you read the first podcast, because we've read this many times, right? The posts that you make
Starting point is 00:02:13 to the ISO CPP website still say the only podcast for C++ developers. I need to update that. I always just copy-paste it from the previous post. I think it's fine if you leave it. Well, Phil actually pointed out to me that the tag and the metadata for the website still said the only podcast. Oh, is that right?
Starting point is 00:02:36 He asked me, so I did update it recently. Yeah, we'll just keep it subversively like that for a while. Don't tell anyone. Shh. Okay. Uh, do you have anything fun planned for Thanksgiving? Uh, Thanksgiving,
Starting point is 00:02:48 hanging out with family, but it'll be a pretty simple Thanksgiving. Okay. Yeah. We're doing a Friendsgiving down here. With the level of travel that I've been doing lately, our plan is basically to not travel at all these days. That's a good plan.
Starting point is 00:03:01 I'm not a fan of Thanksgiving travel either. Yeah. Okay. Well, at the top of every episode, I'd like to read a piece of feedback. This might be our first YouTube comment. This is from John, on the episode from two weeks ago with Devon Labrie. And he wrote: I'm not sure if this is the best place to post a comment for the podcast, but I'm really impressed with CppCast. I like how you guys mix things up, having folks from various levels of C++ skills on the show. I've had an itch multiple times to delve into the game arena, but have been redirected by career choices. I really feel for Devon and hope him well on his game dev adventure.
Starting point is 00:03:46 He also says that outside of the other awesome game-related guests in past episodes of CPP Guest, I would suggest Devin have a listen to the Game Dev Unchained podcast to get a good feel of the game industry, in particular the episode with John Podlasek. This was an eye-opener for me, so it may be of interest to him as well. Thanks, Rob and Jason, for posting these great podcasts, and I look forward to the future episodes. Yeah, I'm glad to see we got some nice
Starting point is 00:04:04 YouTube comments. Usually YouTube comments are a bit of a cesspool, but we're getting good ones. And I guess for our listeners who don't know this, we do, you do, post the episodes to YouTube every week, right? Yeah, I think we just started doing that this year, but I've got like three years worth of the show up there on YouTube now. And a kind of a side benefit of that is
Starting point is 00:04:29 YouTube's automatic transcription service is going to come into play, so there are transcriptions of the episodes there. I'm sure not perfect, but they do exist. Yeah, they are there. Okay, well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com. And don't forget to leave us reviews on iTunes. Joining us today is Jeff Amstutz.
Starting point is 00:04:54 Jeff is a software engineer at Intel, where he leads the open source OSPRay project. He enjoys all things ray tracing, high performance and heterogeneous computing, and code carefully written for human consumption. Prior to joining Intel, Jeff was an HPC software engineer at SURVICE Engineering, where he worked on interactive simulation applications for the U.S. Army Research Lab, implemented using high performance C++ and CUDA. Jeff, welcome back to the show. Hey, thanks for having me. Jeff, you know, I mean, this is your second time on the show. And it's been a long time. And we'll talk about that more later. And I can't remember if I asked this before. But you say you enjoy all things ray tracing. How did you get started with ray tracing?
Starting point is 00:05:34 The oldest ray tracer you played with? Well, I got started in ray tracing because of my first job at SURVICE Engineering, doing ballistic simulation. Okay. So I didn't really play with a lot of ray tracing packages. I played with the one that SURVICE made. It's called Rayforce. It's actually open source, up on SourceForge. And that one was good.
Starting point is 00:05:58 But SURVICE Engineering was about making products to solve problems for customers. So eventually, the Embree ray tracing kernels that we have at Intel kind of eclipsed the need for Rayforce to exist. At least when I was at SURVICE... I believe they continue to use our Embree ray tracing kernels. So now the ones that I was playing with at SURVICE are the ones that I help maintain at Intel. Okay. Yeah. And that was starting back in 2012. So, yeah.
Starting point is 00:06:30 2012. So, do you have any experience with the old POV-Ray, Moray, some of those old open source ones that started that world? No, I definitely need to go back and educate myself on the history of ray tracing before my time. That'd be a fun time. Yeah, it could be fun. Yeah.
Starting point is 00:06:50 Okay, Jeff, we have a couple news articles to discuss. Feel free to comment on any of these, and we'll start talking more about maybe ray tracing and more about SIMD, okay? Sounds great. Okay, so this first one is a trip report focused on the freestanding proposal, which Ben Craig is working on. We talked about freestanding with him earlier this year, and he was at the San Diego meeting presenting updates to the freestanding proposal, I guess, and he's giving a rundown on how that went.
Starting point is 00:07:23 This seems like it was kind of a mix of both good and bad. It seems like they're getting a lot of direction, and he pointed out that he kind of did his whole proposal backwards, where he made all these suggested changes first, and is just now getting direction on what the committee will go for. Yeah, it does seem like the backwards way to do it. Yeah, which he pointed out. It does feel like good news that it is moving forward, it seems.
Starting point is 00:07:50 Yes, yes. Yeah, and it seems like he is kind of getting some recognition at or around the committee meetings. Like everyone recognizes him as the freestanding guy now, I guess. That's cool. That's nice, yeah. Is there anything else you want to point out in this, Jason? No, but I was just in my head thinking the freestanding guy.
Starting point is 00:08:07 He's the one over there by himself, without anyone. I mean, it's freestanding. Anyhow, sorry. That's bad. Next, we have a post on the Qt blog, and this is Getting Started with Qt for WebAssembly. And this is pretty cool. It's a post
Starting point is 00:08:28 just going through what you want to do if you want to make kind of a Hello World type application using Qt GUI controls in WebAssembly, which is pretty awesome that you can do that. And they also have a link at the bottom to a sample application,
Starting point is 00:08:44 which you can go and check out, which is like a little image editor. But it's running in WebAssembly, which is pretty cool. That's just amazing to me. Have you messed with this at all, Jeff? So back in my DoD contracting days, I had lots of experience with Qt. That'd be the late 4.8 and early 5 stuff. And so I haven't played with Emscripten at all, but when I scrolled down in this article and I just looked at the screenshot of this little Slate
Starting point is 00:09:11 application that they made with this, I went, oh my gosh, that's really cute, in a web browser. That's impressive. Yeah, your brain kind of gets confused, because you're like, okay, there's a URL bar at the top, and then, wait a minute, there's like a full application. Yeah. Yeah. It's crazy. So that's really impressive. I think that's some great work, and maybe one day I'll play with it. Yeah. I need to play with it as well. Some of the projects I'm working on right now would be well suited to playing with it, at the very least. One thing I want to highlight for our listeners in this post is that if you're interested in this,
Starting point is 00:09:49 they're going to be hosting a Qt for WebAssembly webinar on November 27th. That's sometime next week. I'm assuming that's free to join and learn more about it. It does appear to be free. Yeah. Does it say how long the class is going to be?
Starting point is 00:10:09 The webinar? It says it starts at two and ends at three. Yeah. Okay. Although it doesn't say what time zone it's in unless it's automatically adjusted to the current time zone of the viewer. That would make sense. So last thing we have is Herb Sutter's trip report from the San Diego meeting. And obviously we talked a bit about this last week with Ashley.
Starting point is 00:10:40 And I think we're still going to have a part two trip report with another guest, hopefully next week. But is there anything you wanted to highlight from Herb's trip report, Jason? Yes, there was one thing that caught my eye that I did not hear anything about before this. Like, looking through the trip papers and everything else, I had not seen anyone else mention bind_front, which Herb mentions here. bind_front, it's like one of the last things. Well, bind1st from back in the day, from C++98, was terrible. It only worked with, like, a two-parameter function or something like that, and returned a new callable thing. And std::bind is just in general terrible, because it's way overcomplicated, adds considerably to compile times, and if you're not lucky, it also adds runtime overhead, depending on how much the compiler can
Starting point is 00:11:31 see. But it looks like this bind_front is just: bind the first argument and return a new function that takes the remaining set of arguments. And I've written this a couple of times myself with variadic lambdas, and it's a fairly straightforward thing to do in the simple cases, but definitely a handy thing to have when you need to be passing functions around and you're like, oh, well, I just want to partially bind just the first argument, pass it on to the next thing that doesn't care about it, and bind the next thing onto it, or whatever. It definitely has uses. It's just cool to me to see something that's kind of in between this crazy world of std::bind and the old outdated world of bind1st. So would that make it straightforward to implement, like, a bind_n, like bind the first three? Yes, but binding only the
Starting point is 00:12:20 third element is still quite tricky. Yeah, yes, it would be up to n, not the nth one, right? Yeah, yeah, you could totally do that. You could easily just write a function that takes three arguments and calls bind_front three times. Now granted, you're making the compiler work a lot, because it's going to generate effectively a variadic lambda for each of these things and then have to optimize them all. But it can. Yep. So. Okay, cool.
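For listeners following along, here's a small sketch of the pattern being discussed. std::bind_front is the C++20 facility; the variadic lambda is roughly the hand-rolled equivalent Jason mentions, shown under the assumption of a simple three-argument function:

```cpp
#include <functional>
#include <iostream>

int add3(int a, int b, int c) { return a + b + c; }

int main() {
    // C++20 std::bind_front: bind the leading argument(s), get back
    // a callable that accepts the remaining ones.
    auto add_from_10 = std::bind_front(add3, 10);
    std::cout << add_from_10(2, 3) << '\n'; // prints 15

    // Roughly the hand-rolled equivalent with a variadic lambda,
    // as described in the discussion.
    auto bound = [](auto&&... rest) { return add3(10, rest...); };
    std::cout << bound(2, 3) << '\n'; // prints 15
}
```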
Starting point is 00:12:47 How about you, Jeff? Is there anything you wanted to call out from this trip report? There's all kinds of stuff. Uh, you know, the fact
Starting point is 00:12:54 that ranges and concepts are in C++20 is just huge. It's really... I heard an anecdote. I was at the Supercomputing Conference last week,
Starting point is 00:13:05 and I think I overheard someone talk about when ranges got voted in, there was just this standing ovation in the room. That's because it's a huge body of work and such a big change for C++. That's really cool. And then the one thing I'm looking forward to, based on this trip report, is std::is_constant_evaluated. That sounds like it's going to be really useful, to kind of be able to constrain... like, if someone accidentally uses some kind of constexpr feature in your library or your code, and it ends up becoming runtime, and you want to explicitly say, don't do that, now here's a trait that you can use to explicitly
Starting point is 00:13:46 maybe use static assert to enforce that. Yeah. It definitely opens up a lot of doors for craziness. Yep. Go ahead. I was just going to say one more thing. I'm not sure if we mentioned this earlier.
Starting point is 00:14:04 They're creating two new study groups, Machine Learning and Education, SG19 and SG20. And the Education one sounds interesting. It has the aim of trying to improve the quality of C++ education in universities. And I think we've talked to various guests on the show about how that's very hit or miss how well they teach C++ at universities these days. Yeah. Well, you know, Reddit. Sometimes you'll read a comment on some C++ related thing and someone will be like, oh, well, my professor told us to do this. And then you basically get responses from other Redditors that are like, you need to go to a new university, basically. But I am curious, Jeff, you just said you were at the supercomputing conference, what kind of a representation does C++ have there?
Starting point is 00:14:57 It's a mix, because, you know, you get some old school Fortran and C codes that have been around forever. And then, you know, there's definitely a big demographic of C++ programmers looking to do very state-of-the-art C++ kinds of things. There's a number of libraries out there that are focused on kind of HPC distributed computing. There's HPX, RAJA, Legion, and there's some others. And those folks are always trying to push C++ forward to do things like heterogeneous computing. How do I, like, write a program that will get distributed on a cluster without having to manage all the networking stuff?
Starting point is 00:15:42 There's really cool stuff going on. And then on top of that, there were even a number of C++ standard committee members that showed up at SC exhausted from San Diego. And there was a birds of a feather session on C++ for heterogeneous computing. So in HPC, being able to have a number of GPU accelerators or FPGAs alongside your CPUs,
Starting point is 00:16:07 like, how do you program all those? That's definitely a problem bigger than HPC, but the HPC folks are also interested in those. Cool. And if I heard this right, is this the conference that's going to be rotating back to Denver next year? Yeah, so if I play my cards right, if I go to C++Now, I'll be in Denver for C++Now, CppCon, and SC. All right. Stop by and visit my meetup. Yeah, yeah. If I'm in town, I'll definitely
Starting point is 00:16:35 look for it. Okay, well, as we said earlier, Jeff, you were on the show once before, but it was back in March 2016, about two and a half years ago. So what's happened since then in your life? Yeah. So the OSPRay project, I've kind of stepped into a role of leading that. There's a number of engineers at Intel that all contribute to it, and we're all kind of one team. It's been great to work with them, mostly based out of Austin.
Starting point is 00:17:07 We also have some folks that work out of Germany. And the project's been maturing over time, which has been great. The user base is just growing and growing. It's been a really fun project to work on. And now I've kind of also added onto my plate... there is future undisclosed hardware coming from Intel, as you would always expect. I'm stepping into the, how do I talk to the right compiler folks at Intel to make sure that things like OSPRay are going to properly map to whatever hardware comes out. And so that's been really fun as well, because Intel has definitely, you know,
Starting point is 00:17:57 made plenty of its money and profit, and user-base buy-in, with its CPUs, with, like, OpenMP, pragma-based SIMD programming. And we'll talk about that maybe a little later. But kind of branching out into other things is really fun. And we'll see what happens in the coming months. It seems pretty crazy to be able to get to kind of live in this world of experimental hardware with compilers, and will it work with the tools that you're working on? Yeah, it's simultaneously exciting and terrifying. Because it's one thing to, like, play with it. It's another thing to
Starting point is 00:18:35 say, well, I also have to ship this library that's going to, like, live or die by how well these toolchains work. So we'll see. It also seems like, your program crashes, and now you're saying, is this a bug in my program, a bug in the compiler, or a bug in the hardware? Yeah. Which most people don't have to stop and ask, usually. Yeah. My best experiences have always been
Starting point is 00:19:02 when the hardware is not buggy anymore. I try to not get involved with hardware that's so early revision that it's going to have known problems. So by the time it gets to me, it's a matter of programming it right versus making sure the thing's actually working. Crazy stories about that stuff. But yeah, that's for another time. Yeah, probably some you can't talk about as well.
Starting point is 00:19:26 Yeah, yeah. So you just presented at CppCon: Compute More in Less Time Using C++ SIMD Wrapper Libraries. You want to tell us more about your talk? Yeah. So, you know, there's multiple ways you can get a compiler to generate SIMD instructions, and I'll just back up and explain what that is, and then what the talk is covering. So SIMD stands for Single Instruction Multiple Data. And what that is, is, if you think of, like, a normal mathematical expression, like A plus B minus C, and we'll assign that to D. So it's an add and a subtract. Normally, each operator is going to operate on one value on each side of that operator. So if those are all floats, it'll do, you know, add two floats together
Starting point is 00:20:12 and then take that result and then subtract two floats and store that in a float, or whatever. And so SIMD says, okay, each one of those operators, the plus and the minus, those are each basically an instruction. So what if we could have one instruction that instead operates on multiple data values at the same time? So if I have an addition followed by a subtraction, but I can apply those to more than one value at once, then I can actually increase the amount of computation I've done in the same number of instructions. So that's... it's called vectorization. Instead of having a single value, which we call scalar values, you instead widen those to a vector width of something
Starting point is 00:20:57 your hardware supports. And then that lets you just do this more computation at one time, which hopefully speeds up something that needs lots of computation done. So with that concept, there's a number of ways you can make a compiler generate these special instructions that exist on all kinds of different CPUs, and it turns out GPUs are pretty similar. Obviously, there's going to be lots of devil in the details that are vastly different, but at kind of like a
Starting point is 00:21:28 high level, they're looking for the same thing. I want a single stream of instructions that's going to map to collections of values instead of single values. And so there's multiple ways you can get these instructions generated by your compiler. There's some trade-offs with different ones, but when you want to look at idiomatic C++, like I want to use the type system to do work for me, I think the best way to do this is with what we call a SIMD wrapper,
Starting point is 00:21:57 which is: I represent these... So if I take a built-in float, and I want to instead create a type that represents, you know, four or eight floats at a time, we can use the type system in C++ to create that type for me. So then I can say, you know, A plus B minus C, and those can each be represented as a float8 instead of just a float. But everything else then is the same. The plus looks the same. The minus looks the same. The assignment. So the idea is, things that are normally kind of complicated, getting these instructions to deterministically come out of the compiler, these types of SIMD wrapper libraries make that look like your normal code.
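To make that concrete, here's a minimal sketch of such a wrapper type. The name vfloat8 and the AVX implementation are illustrative assumptions, not any particular library's API; real libraries like tsimd or Vc are far more complete:

```cpp
#include <immintrin.h> // x86 AVX intrinsics; compile with AVX enabled (e.g. -mavx)

// A minimal 8-wide float wrapper in the spirit of the libraries discussed.
struct vfloat8 {
    __m256 v; // one 256-bit register holding eight 32-bit floats
};

inline vfloat8 operator+(vfloat8 a, vfloat8 b) { return { _mm256_add_ps(a.v, b.v) }; }
inline vfloat8 operator-(vfloat8 a, vfloat8 b) { return { _mm256_sub_ps(a.v, b.v) }; }

// Reads exactly like scalar code, but each operator is now a single
// SIMD instruction acting on eight values at once.
vfloat8 compute(vfloat8 a, vfloat8 b, vfloat8 c) {
    return a + b - c;
}
```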
Starting point is 00:22:47 And so the talk was not trying to sell anyone on a particular library because there's a number of them out there. There's actually one now voted into the standard. So that's all great. But rather the talk was, hey, here's the kinds of problems we're solving. Here's the common things that are going to be in all of these libraries. So when you go play with one, these are the things to look for. And I hope that's like a nice foundation that then maybe next year there can be some more advanced like, all right, how do we take, you know, scalar existing code that isn't vectorized? And, you know, now what are some techniques that we can use these libraries to then write vectorized code? I want to take a quick aside to data sizes, since you're talking about these things being packed into things that the CPU can support. And you say float, and I know floats
Starting point is 00:23:32 tend to be used a lot because they're small; you can pack many of them. And with, like, a double, you can have fewer; a long double, maybe only two or something in a SIMD instruction, if the CPU even supports long doubles, I don't even know. Where's the trade-offs on these kinds of decisions? So, a couple of things there. One is, most SIMD instruction sets, pretty much all SIMD instruction sets, measure their register sizes in bits and not in number of elements of a particular type. So, like, for machine learning, there's a lot of cases where precision is not as important. So you can get more speedup by cramming more lower-precision floats into a register. Like, let's say AVX and AVX2: most modern Core i-series CPUs out of Intel,
Starting point is 00:24:27 will have 256-bit registers. So if I can cram eight 32-bit floats into that, that's great. But if I don't need the precision, I can make that smaller and use maybe half precision or the other way around, which is the world that I usually trade off with, the 32-bit float being the less precise for doubles, where scientific computing, all that precision is desirable.
Starting point is 00:24:53 And so, yeah, it's pretty much precision and what you're doing. And then, of course, for integers, the same thing is true. If I only need to represent 255 values in my computation, then I can cram a bunch of 8-bit integers into a register versus using full 32-bit ints or 64-bit ints.
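As a quick sanity check on those numbers, the lane count is just the register width in bits divided by the element width in bits:

```cpp
// lanes(register_bits, element_bits) = how many elements fit per register
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}
static_assert(lanes(256, 32) == 8);  // AVX: eight 32-bit floats
static_assert(lanes(256, 64) == 4);  // AVX: four 64-bit doubles
static_assert(lanes(256, 8) == 32);  // AVX: thirty-two 8-bit ints
static_assert(lanes(512, 32) == 16); // AVX-512: sixteen 32-bit floats
```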
Starting point is 00:25:13 So are these AVX or whatever on the appropriate platform instructions flexible enough to say I've got a pack of 8-bit floats if you had some way of representing that?
Starting point is 00:25:25 Yeah. The way to... I don't want to lose any listeners by getting too into the weeds. That's fine, they tell us to get more technical anyhow. So, I can represent: here's a collection of ints. But most instructions will then have, like: here is the expected collection of elements you're going to use. So, man, I can't remember the intrinsic names.
Starting point is 00:25:56 So intrinsics, for listeners that don't know, are little C-style functions that more or less correspond to an exact instruction. And so they usually are the implementation detail of what an implementer of a SIMD wrapper library uses. Like, when I say float8 plus float8, I'm going to call a particular intrinsic. Okay. So the intrinsics, then they say, like: here's a 256-bit register of 32-bit floats, then it's an add. So there's like
Starting point is 00:26:28 this kind of naming convention to organize all that. So you actually represent the register with the same type; it'd still be, like, an __m256 or an __m256i. And then it's the particular instructions you choose that determine what the element-wise operations are going to be, which of course is, like, maddeningly easy to get wrong. So that's why these wrapper libraries are nice, because now we use the type system to enforce some things, to do proper conversions and stuff like that. So the answer is: yes, the hardware is very flexible, but it's not arbitrary. It does have fixed 8, 16, 32, 64, or whatever. Yeah. And usually, as instruction sets have moved on through time... so, you know, the difference between AVX and AVX2 is that there were instructions added for additional wider vectors for ints.
Starting point is 00:27:27 For AVX1, it was just floats. Okay. I mean, this is super oversimplified, but take it for what it is. It was like, okay, we'll go from 128-bit to 256-bit, but AVX1 was like, we're going to do it with floats for the 256-bit operations. And then AVX2 was like, oh, we're going to add the int support as well. And then AVX-512, that goes to 512 bits, also added lots of instructions across the board for the 512-bit wide stuff.
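To make the naming-convention point concrete, here's a small example with real AVX2 intrinsics: the register type stays __m256i either way, and it's the instruction you pick that fixes the element-wise interpretation:

```cpp
#include <immintrin.h>

void naming_convention_example() {
    __m256i r = _mm256_set1_epi32(1);   // one 256-bit integer register
    __m256i a = _mm256_add_epi32(r, r); // ...treated as eight 32-bit ints
    __m256i b = _mm256_add_epi8(r, r);  // ...treated as thirty-two 8-bit ints
    (void)a; (void)b; // silence unused-variable warnings
}
```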
Starting point is 00:27:56 Yeah, you can get lost in understanding what exact instructions are supported in the instruction set per generation of ISA, then how that maps to your CPU generation, what your CPU supports at runtime. It's a fun problem to solve, so that other people then don't have to think about it. Right. I'd like to interrupt the discussion for just a moment to bring you a word from our sponsors. Authors of the PVS-Studio analyzer suggest downloading and trying the demo version of the product. A link to the distribution package is in the description of this podcast episode. You might not even be aware of the large number
Starting point is 00:28:33 of errors and potential vulnerabilities that the PVS-Studio analyzer is able to detect. A good example is the detection of errors that are classified as CWE-14, according to the Common Weakness Enumeration: compiler removal of code to clear buffers. PVS-Studio's creators demonstrated the detection of such an error type, for example, in one of their latest articles, We Checked the Android Source Code by PVS-Studio, or Nothing is Perfect. A link to this article is also in the description of this episode.
Starting point is 00:29:05 PVS-Studio works in Windows, Linux, and macOS environments. The tool supports the analysis of C, C++, and C# code, and Java support is coming soon. Try it today. Okay. Do you want to tell us a little bit more about SIMD and maybe how you use it in your day job? Yeah. So, I'll say this. If anyone who's listening watched my talk, I had a little example, and I got called out for it in the comments, because I was dumb and I read the comments for the talk. And, yeah. So my first example is called SAXPY, which stands for A times X plus Y. It's just like, here's a little formula. You apply it to every element in, like, two arrays and store the results in a third array.
Starting point is 00:29:54 It's your, like, hello world of vector computing? Right. And I said in the talk, like, SAXPY's nonsense. And the comment was, you know, this is like the basis for a ton of machine learning algorithms. And what I meant, what I was thinking in my head, was: this particular thing I wrote is nonsense. Like, I'm not computing anything. I'm just taking garbage data in and storing garbage data out, just to exercise the machine. And so, yes, there are plenty of algorithms out there that are trivial to parallelize in this way, which is, for what it's worth, orthogonal to threading. So vector parallelism is, like, what every thread would be executing.
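For reference, SAXPY, the hello world of vector computing being discussed, is just this loop: one fused multiply-add expression applied uniformly to every element, which is exactly the shape that vectorizes well:

```cpp
#include <cstddef>

// SAXPY: y[i] = a * x[i] + y[i] for every element.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```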
Starting point is 00:30:37 So we're talking about just one of the many types of parallelism you have to go and consider when you're optimizing code. But for me, it's basically any time I have an operation where I'm going to do a lot of the same thing to a bunch of elements, that's a great candidate for using SIMD. Now, OSPRay doesn't use C++ SIMD wrappers. It may in the future. What it uses is actually a small custom language called ISPC, the Intel SPMD Program Compiler. It's a free, open source thing, up on GitHub. And it's neat, has some trade-offs, and there's tons of details. The simple way to describe it would be: if I'm trying to model light in some virtual scene, I define some geometric objects, some spheres and cubes or, like, triangle meshes. What I can do to get a picture out of that is define a virtual camera and define an image plane,
Starting point is 00:31:46 which is basically going to be the screen. And I can basically trace light backwards. You know, you trace a ray from the camera out into the scene, and when you find a hit point, you then figure out what material you hit and what angle you hit it at. And then you can figure out, okay, well, what's the next bounce that the light would have traveled to get to that point? And you keep going until you get to light sources. And that's the basic rendering algorithm. So there's a number of ways you can vectorize these kinds of things. But the way we do it is, we take what we call packets of rays. So we'll take eight rays out of the screen at a time, and we will trace
Starting point is 00:32:28 all of those rays through acceleration structures. When we want to create a bounced set of rays, we can do all of that because... like, for instance, if I'm creating a secondary ray
Starting point is 00:32:42 from one hit point, I'm going to do the same calculation for all of those rays that hit. So that's a great candidate for SIMD. And where these kinds of SIMD wrappers play in is the way you model that. If I'm going to do four or eight or 16 rays at a time, or even giant buffers of rays, ray streams, I don't really necessarily have to care about the exact width I'm working with. Of course, when you tune, like if I was tuning for SSE or tuning for AVX, maybe I would want to get specific
Starting point is 00:33:19 and use, like, float4s or float8s. But the majority of the time, I just want to say, like, a vfloat, like a vector float, or what I call a varying float. I just want it to be something that can be widened. And by the time I get to compiling for a particular SIMD register width, I want it to just pick the right thing. So if I'm compiling this for SSE, that vfloat turns into a vfloat4. Or if I'm compiling for AVX-512, that turns into a vfloat16. Okay, so you're saying, just to be clear, four floats in a pack, sixteen floats in a pack. Okay. Yeah. I was at first thinking you were saying four-bit-wide floats, and I'm like, I don't even... what could you possibly represent with that? Man, that's a really long mantissa and exponent and all.
Starting point is 00:34:08 That'd be the most useful floating point ever. Anyway, yeah, yeah, no. Number of floats in a register. 32-bit floats is what I use most commonly. And so I'd say the thing there is, it's always easy, when you're looking at something like SAXPY or doing machine learning or something, to say, oh, why can't the compiler just do this all the time? And it's like, if your math is as simple as that, and you're just doing that very
Starting point is 00:34:40 simple expression to a bunch of pieces of data, that's great. You might not need fancy SIMD wrappers for that. But ray-trace-based rendering is the opposite. What we have is these very deep function calls, where, you know, I take a packet of rays, and I have to generate the rays from the camera, then shoot them out. I've got to traverse them down a bounding volume hierarchy, into ray-primitive intersections, to get the normals, the hit point distance. Then I have my material, and I've got to go and figure out,
Starting point is 00:35:18 based on that material, how am I going to calculate the next ray that bounces. It gets to the point where you get things very deep, as in several function calls deep on the stack, where it's not all just in this one for loop. I can actually reason about a packet of rays anywhere in the function call stack, because I know this is, like, a packet of four rays, this is a packet of eight rays. And then the second half of that is, with that packet of rays, I can have all these user-defined structures all over the place. And that's where
Starting point is 00:35:58 it gets really interesting. Because if I define, like, an algebraic vector, like a vec3, representing a point in space, or maybe it represents a direction... so you could define a ray as a point and a direction. So each one of those x, y, z components is going to be, in the naive sense, scalar: just float x, float y, float z. But if I want to have, like, a group of origins, then I could say, well, actually, x is a float8, y is a float8, and z is a float8. And I can compose this all the way up. So if I have a float8, I can then create, like, a point8, and with a point8, I can create a ray8, and maybe, you know, bigger structures, like a screen-sample8. And you walk this all the way up, where I can say, if I have a varying ray,
Starting point is 00:36:53 I now can say that could be widened to be a ray4, ray8, ray16, depending on my architecture. And when I say, like, ray-intersect-triangle or something, I can actually implement that in such a way that all the dot products with that algebraic vector just don't care about the width. You can basically express an entire
Starting point is 00:37:16 library in such a way that I can write all of this math the exact same way I would as if I was just playing with plain floats, but now I have a very tight understanding of how it's going to be widened. And this is called a structure of arrays. Where... you know, interject any time if you've got questions or anything, because I can just ramble
Starting point is 00:37:40 forever. Actually, I will pause you for just a moment here, because I'm trying to make sure I'm wrapping my mind around this, and my brain still keeps... when you say float8, I still keep wanting to think an 8-bit-wide float. No, you mean eight floats wide. Fine. Yeah.
Starting point is 00:38:00 Okay. So when we... because I'm thinking, you know, int8_t, you know, the built-in types from C++11. Yeah. All right. So what I'm thinking is, this sounds a lot like data-oriented design. It definitely overlaps with that. But that tends to be geared more towards fitting your structs into cache lines and that kind of thing. And you, on the other hand, are doing data-oriented design, if you will, but with a goal of fitting it within the vectorization of the hardware capabilities. Yep, yep. So it's like data-oriented design for an ISA, not for a memory hierarchy.
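Here's a minimal sketch of the composition being described, in the structure-of-arrays style. The type names follow the conversation's vocabulary (vfloat8, vec3, ray) but are illustrative assumptions, not a real library's API:

```cpp
#include <immintrin.h>

// The 8-wide float from earlier (AVX assumed for illustration).
struct vfloat8 { __m256 v; };
inline vfloat8 operator+(vfloat8 a, vfloat8 b) { return { _mm256_add_ps(a.v, b.v) }; }
inline vfloat8 operator*(vfloat8 a, vfloat8 b) { return { _mm256_mul_ps(a.v, b.v) }; }

// Structure-of-arrays vec3: T is plain float for one scalar point, or
// vfloat8 for eight points stored as x[8], y[8], z[8].
template <typename T>
struct vec3 { T x, y, z; };

// The math is written once and doesn't care about the width.
template <typename T>
T dot(const vec3<T>& a, const vec3<T>& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Compose upward: ray<float> is one ray, ray<vfloat8> is a packet of eight.
template <typename T>
struct ray { vec3<T> origin, direction; };
```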
Starting point is 00:38:45 Right. Okay. How do those two things interact with each other, then? Well, so that's where you can get endlessly lost in design trade-offs that are very specific to the context of the code you're implementing. So it's tricky to say, like, here's the one prescription for this problem you'll have everywhere, just do this. It's always in context of what you're doing. So for instance, if I take and measure code that is running on a single core... yeah, so if I'm optimizing for better vector instruction utilization,
Starting point is 00:39:29 I can be reasoning about that code. But if I'm running it single-threaded, if I'm looking at my cache, my cache probably is going to be mostly stocked full of stuff that that core is going to be working on. And then as soon as you multi-thread it, now my cache is going to be thrashed in and out with things that maybe separate threads are working on.
Starting point is 00:39:52 And so there's all these contexts of, like, when I'm optimizing, you have to understand what state the system is going to be in. It's not just, what's this code doing? It's, what's this code doing in
Starting point is 00:40:05 context of what the whole machine's doing. So that's why it's tricky to say. For filling cache lines, that's not an issue. But if you're looking to say, could I keep the data an entire function is working on in cache, that might be trickier, because that's then the issue. Obviously, cache lines themselves won't be divided by threads, but the greater context of what you're optimizing might. So you might want to arrange threads in a way that they're all trying to work on smaller pieces together, instead of spreading them out in your data set
Starting point is 00:40:42 and having them working on vastly different pieces. But then all of that is going to influence how you're looking at your vector code utilization. Like, sometimes, for various reasons, it might make sense in ray tracing to represent your problem with shorter vectors than what you natively can have. So for instance, I talked about that issue where I have rays that come out of the screen and they all hit an object. Well, what happens if some of them hit one object and some of them hit another? Let's say I have eight rays, and four of them hit one object and four of them hit another object. Well, it might make sense, as we do the calculation then to create the secondary
Starting point is 00:41:26 rays for each of those, where they'll have to load different data, because they might have hit different materials or something... they definitely hit different primitives. It might make sense to rearrange them into two packets of four and then move on from there, instead of keeping them all as one packet of eight,
Starting point is 00:41:42 because they're what we call diverging. They're going different places, and they're not going to be doing the same things, kind of by design of the algorithm. So then there's other design trade-offs, where you can say, well, I might change it so that I have a single scalar ray, and then I march that down, and I can do batches of primitive intersections at one time. So instead of doing many rays to one primitive, like one sphere, maybe I'll do one ray and have several spheres that are candidates and then test them all at once. So all of these things require ways to express user-defined structures
Starting point is 00:42:22 where I can say, I want to load, like, a group of spheres that I'm going to call, like, a varying sphere8, and then test that against a ray, or vice versa. And it just gets tricky. If you want to do that with intrinsics, there's two problems. One is you end up only implementing it for one instruction set. And it's really complicated to look at. It's almost as bad as assembly. So there's that. And then the other half is, things like OpenMP that were more built for Fortran and just pure C don't have the rich type hierarchies or type composition
Starting point is 00:43:07 that idiomatic C++ has. And so it's really tricky to be like: pragma omp, parallelize, SIMD-ify this, and then all of your data structures just magically get widened correctly. Because it lives outside
Starting point is 00:43:23 the C++ type system. It's extracurricular to C++, the pragma stuff. So for some customers it's good, but for my particular use case with ray tracing, it just is not effective. That's a lot of the motivation for
Starting point is 00:43:40 why these SIMD wrappers, I think, are really useful, and why, for programmers looking to do more idiomatic C++, like representing your problem as a type, they can be very effective. It sounds, yes, very cool. And we've covered an awful lot of ground so far, so I'm just kind of thinking where to go next. There was a comment on YouTube, I know we all agreed you're not supposed to read the comments, that said, basically, why didn't you just let the compiler vectorize this? And I gather from everything
Starting point is 00:44:15 you've said, you're effectively... first of all, you want to fit it into specific sizes, and then second of all, you're doing things more complicated than the compiler is able to see through and vectorize. Yeah, so OpenMP and the stuff that predates it... So OpenMP you can view as two things. One is the threading side. I can just put a #pragma omp parallel for, and my for loop turns into scheduling these loop iterations on different threads. That's really useful. So the piece I'm objecting to is the other half, which is, they have ways that you can say, pragma omp, basically, vectorize this. And there's two things that that gets tricky with. One is, like, I can't make C++ type decisions
Starting point is 00:44:42 on if something was widened or not. So remember, we had, like, a vfloat, like a varying float, in a C++ SIMD wrapper library. Well, if I just have a float, and I say, you know, A plus B minus C equals D,
Starting point is 00:45:12 and those are all plain floats, that simple expression the compiler can widen. But the problem is, I've expressed it in terms of scalar code. I've said, this is just a single float. That's what the C++ language says. When I say float, I get this 32-bit float, or whatever the implementation decided float was going to be. Is that actually guaranteed to be 32-bit? No.
Starting point is 00:45:41 No, okay. The only guarantees with C++'s types, or with C's types, are things like: double must be greater than or equal to float, long double must be greater than or equal to double, basically. But the point is, it's still scalar. It's still, like, this is a value, not a collection of values that you can iterate over or decide to work with a subset of. You've still said, this is a single variable, a single element. And so OpenMP says, I can figure out that that can be widened to a collection that then has special instructions that can work on it. But if I wanted to have a template, I couldn't, let's say, specialize a function call based on, am I working on a scalar or am I working on a vector's worth?
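A minimal sketch of the contrast being drawn here. The pragma is standard OpenMP; vfloat8 is the illustrative wrapper type from earlier, not a real library's API:

```cpp
struct vfloat8; // the 8-wide wrapper type sketched earlier

// OpenMP's vectorization half: the compiler may widen this loop, but
// "float" stays scalar, so the widening never appears in the type system.
void scale_omp(float* data, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        data[i] = data[i] * 2.0f - 1.0f;
}

// Wrapper approach: the width is part of the type, so you can overload
// (or specialize a template) on scalar vs. varying, which pragmas
// cannot express.
float   do_math(float x);
vfloat8 do_math(vfloat8 x);
```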
Starting point is 00:46:32 And so, like, all of the stuff that I would want to do in the type system is now off limits, because now only the OpenMP vectorizer is allowed to work with that. And the second part that's really tricky is, everything has to be inlinable. So if I wanted to make a function call into something that I can't inline... there are some solutions that exist in, like, ICC, but the quality of implementation is very, very non-portable between other compilers, to be able to say, like, this function call can be accepting scalar and widened varying versions of this function. So maybe if I have a function, like, do some math,
Starting point is 00:47:15 and it takes float x and float y and returns a float, I think ICC will actually figure out how to say, I can create a version that takes a varying float x, a varying float y, and returns a varying float. But in general, again, that's another thing that I don't have a lot of control over. It's just how you happen to use it, and it needs to be inlinable. So the SIMD wrapper libraries, because they work in the type system,
Starting point is 00:47:42 I can now either template a function and say, this will work for any width. So for instance, in the SIMD library I wrote called tsimd, you have to provide vectorized versions of, like, math. So, like, trigonometric functions, think sine or cosine or tangent. It could take a single float and return a float, you'll get that in your standard library, or it could take a varying float and return a varying float, where every element had sine applied to it, or cosine. So those kinds of things are actually width-independent. So in tsimd, I implemented it as a template that always takes floats, but any width, and then it's the exact same implementation for every single one. Like, I could do that because it was in the type system, where I was able to write a template based on the characteristics of this
Starting point is 00:48:36 SIMD register I'm trying to program with. So those are just some of the trade-offs. If I let the compiler do too much, I then can't take control at various points. Because, for instance, let's say I have that A plus B minus C equals D, and I say, hey, OpenMP, vectorize that for me. That's great when it does. But as soon as I say, I want to use this user-defined... I want to turn that into, like, a vec3f, so I took the result of that and used it as, like, I'm going to subtract an algebraic xyz from it, and that would result in another one. As soon as I start doing stuff like that, OpenMP is like, oh, wait, I don't know what you're doing anymore, and gets paranoid and won't vectorize it.
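Here's a hedged sketch of that width-independent template pattern. The pack type stands in for tsimd's N-wide type; the exact API shown is an assumption for illustration, not quoted from the library:

```cpp
#include <cmath>

// Stand-in for the library's N-wide type (tsimd uses a similar shape;
// the exact API here is assumed, not quoted).
template <typename T, int W>
struct pack {
    T v[W];
    T&       operator[](int i)       { return v[i]; }
    const T& operator[](int i) const { return v[i]; }
};

// One templated implementation serves every width: pack<float, 4>,
// pack<float, 8>, and pack<float, 16> all reuse the same code.
template <int W>
pack<float, W> sin(const pack<float, W>& x) {
    pack<float, W> r;
    for (int i = 0; i < W; ++i)
        r[i] = std::sin(x[i]); // a real library would substitute a vectorized approximation
    return r;
}
```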
Starting point is 00:49:26 So compiler paranoia is one of the biggest barriers to compiler auto-vectorization. Let's be clear: you mean our compilers wanting to generate correct code for us all the time. Yeah, the paranoia is justified, because I, as the programmer, need to say exactly what's okay. Right. And so when I do it in the type system, the compiler can reason about what that should be. When I'm just ambiguous... this is, like, one of my big objections to using raw pointers: I have no idea what you mean, so I'm going to be paranoid and assume that it's the worst case that it could be. It could be an array.
Starting point is 00:50:01 It could be an optional thing. Is null valid? There's all these things that you're not expressing when you just say float*. And the same thing is true with, can I widen something or not? And when I have a user-defined structure, is the user
Starting point is 00:50:15 depending on the layout of that structure in memory being the same? There's all these things the compiler is justifiably paranoid about, so if you just rely on the compiler to do it and the compiler says no, then that's just what you got. Right. So with all of these SIMD wrappers that you're talking about, do they have to make runtime choices as to which intrinsic to execute? So that is left up to... I think every single library does the same thing,
Starting point is 00:50:50 which is, you as the programmer say, I will compile my translation units, maybe in multiple versions. And then you decide at some high-level point. Like, the way we solve this in OSPRay is, we have multiple dynamic shared libraries, and we just, at runtime, load the one that has everything for a particular ISA. So the OSPRay API is such high level that we can load the AVX-512 version of OSPRay or the AVX2 version, and then we just assume everything is uniform.
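A rough sketch of that load-the-matching-library approach. The library names and the CPU check are illustrative assumptions, not OSPRay's actual scheme; __builtin_cpu_supports is a GCC/Clang builtin:

```cpp
#include <dlfcn.h> // POSIX dynamic loading

// Pick the build of the renderer that matches the host CPU at runtime.
void* load_renderer() {
    const bool avx512 = __builtin_cpu_supports("avx512f") != 0;
    const char* lib = avx512 ? "librenderer_avx512.so"  // hypothetical names
                             : "librenderer_avx2.so";
    return dlopen(lib, RTLD_NOW); // every symbol inside is ISA-specific
}
```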
Starting point is 00:51:28 There are other solutions. Like, I know the Intel compiler can create, like, fat binaries that do more granular function selection. But the actual wrappers themselves only make compile-time decisions on what the implementation is going to be. Okay, cool. So you talked about a couple different SIMD wrapper libraries in your talk, including the one you worked on. I think you said that one of them is currently being standardized.
Starting point is 00:51:52 Is that being aimed at C++20 or some future version? Yeah, I believe it's aimed at 20, and it was voted in. That's the Vc library. That's up on GitHub. Then, of course, the next question is, well, why'd you write your own? And it's, of course, because there's objections... not, like, at the high level of, I'm going to do the entire thing differently.
Starting point is 00:52:21 It's more that there's these, like, couple of small design decisions that we needed to do differently, and, you know, for various reasons, I did it differently. So Vc and, like, tsimd, the one I wrote, are kind of the same as you would use them, but with tsimd supporting AVX-512 in the very specific ways we needed it for ray tracing. And we're talking, like, super-in-the-weeds kinds of design decisions, not things users should be really concerned about.
Starting point is 00:52:49 But yeah, so Vc is the one that I point people to, because that's the one that's being standardized and lets you use standard stuff as much as possible. One question I have regarding that is, you talked about how, in tsimd, you implemented your own vectorized versions of, like, the math functions, you know, sine and cosine. Are there other parts of the standard library, or C++ in general, that are going to kind of take this vectorization, SIMD wrapper stuff into account? Are we going to get vectorized versions of the standard algorithms or anything like that? So that is a part of the parallel STL, which... yeah. So there's OpenMP,
Starting point is 00:53:33 there's intrinsics, auto-vectorization, and then there's the one we didn't mention, which is parallel STL, where you say, like, I want to do std::sort, but I want to do it in parallel. And the way that works is, there's execution policies, where you say, I want to do this std::execution::par, which would be parallel. Then there was par_unseq, which was technically not vectorize this, but maybe implement this in a way that could be vectorized. And I believe in 20... people out in the C++ world will correct me if I'm wrong,
Starting point is 00:54:05 but there's additional execution policies being worked out by SG1, which is the study group on concurrency and parallelism, where there's, like, a std vec execution policy. And what that is, is, when you provide the little lambda to your standard algorithm, it really says what license the compiler can take to vectorize that. So it's in a similar vein to the pragma-based stuff, but it's at least in core C++, and it's not something else. So these actually should compose together.
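For reference, the execution-policy mechanism looks like this. par_unseq is standard C++17; the vec policy mentioned was still an SG1 proposal at the time:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

void normalize(std::vector<float>& v, float scale) {
    // par_unseq grants the implementation license to both parallelize
    // across threads and vectorize within each thread.
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [scale](float& x) { x *= scale; });
}
```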
Starting point is 00:54:47 So for instance, I could have... and for Jason, for your sake, I'll call it vfloat8, not float8. If I have a vfloat8, I can actually have a std::vector of vfloat8s as my container, and I could write my lambda function to take a vfloat8 instead. So, like, it composes with standard algorithms already. And really, I think the parallel STL solution to vectorization should be viewed more as: I have some simple math I'm going to do that is all inlinable in this
Starting point is 00:55:22 one, you know, lambda that I'm going to pass into a std algorithm. You know, it's useful for more of those use cases, I would think; the real deep, very rich environment that we do with ray tracing, maybe that wouldn't be the best fit. But, you know, the cool thing is, both of these things are aimed at the standard. Like, the SIMD wrappers are in the standard, and the parallel STL vectorization execution policies are also there. So it's: choose the tool that better fits your problem,
Starting point is 00:55:52 which I think is fantastic. Yeah, that's cool. Yeah. Is there anything we haven't gone over yet that you really wanted to talk about today? Yeah, I'll just say that concurrency and parallelism and high-performance computing, all that stuff is hard. So I'm really glad that there's tools out there to make some of the stuff easier. Like, hey, I love that Sean Parent brought this up in some of his talks, like
Starting point is 00:56:20 C++ Seasoning. Like, if you have a problem that needs to be multi-threaded, go find a tasking system. Don't implement one yourself, until you're the expert in the room that knows the problem the best and a new tasking system implementation would be better than the existing ones. But same thing with SIMD libraries, tasking systems... with a lot of this stuff,
Starting point is 00:56:42 it's always good to find a library that solves that problem first, figure out a reason you should object to it, and then implement your own. This is all outside of people that implement them for academic purposes. If you want to learn how to do this, go for it. But I mean for more production code, that is. There are tools out there that make these big, complicated problems
Starting point is 00:57:04 easier to reason about. So for listeners out there that want to get into this stuff, if you go look at those libraries, it doesn't have to be as bad as you think. And so hopefully blog posts and books and stuff out there, maybe videos will lower the bar to entry on getting involved in this stuff because lighting up CPUs and GPUs and stuff
Starting point is 00:57:29 to go bang out real fast computation is a lot of fun. So I'd encourage people to go write a ray tracer or something like that. It's a good time. And I'll add to that: if you want to get into ray tracing specifically, check out OSPRay. Check out Embree. These are cool open source libraries on GitHub. But if you really want to, like,
Starting point is 00:57:51 implement it yourself, there's a great book out there called Ray Tracing in One Weekend, put out by Pete Shirley out of NVIDIA. And he just lays out, like, you implement your little algebraic vec3f, all the way up to you then render, with ray tracing, this scene of, like, spheres, and have cool materials that are reflective and stuff. And it's, like, literally in the title: Ray Tracing in One Weekend. So if you really want to get into that domain, it's an e-book up on Amazon, Kindle, and all that. So I encourage listeners to go enjoy the world of ray tracing, even if it's not your day job. Awesome.
Starting point is 00:58:31 Okay, and where can listeners find you online, Jeff? Yeah, I'm on Twitter. I'm on GitHub. I'm at jeffamstutz for both of those. And then I have a blog, jeffamstutz.io, that I have been terrible about updating, but I really want to get back to it. It's been a busy year, but writing is still fun, so I'm going to find some time to start doing that again. Awesome. Thanks again today, Jeff.
Starting point is 00:58:57 Thanks for having me. Yeah, thanks for coming on. Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too. You can email all your thoughts to feedback at cppcast.com. We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
Starting point is 00:59:20 We'd also like to thank all our patrons who help support the show through Patreon. If you'd like to support us on Patreon, you can do so at patreon.com/cppcast. And of course, you can find all that info and the show notes on
