CppCast - SIMD Wrapper Libraries
Episode Date: November 22, 2018

Rob and Jason are joined by Jeff Amstutz to discuss SIMD and SIMD wrapper libraries. Jeff is a Software Engineer at Intel, where he leads the open source OSPRay project. He enjoys all things ray tracing, high performance and heterogeneous computing, and code carefully written for human consumption. Prior to joining Intel, Jeff was an HPC software engineer at SURVICE Engineering where he worked on interactive simulation applications for the U.S. Army Research Laboratory, implemented using high performance C++ and CUDA.

News: Freestanding in San Diego; Getting Started with Qt for WebAssembly; Trip Report: Fall ISO C++ standards meeting (San Diego)

Guest: Jeff Amstutz @jeffamstutz

Links: CppCon 2018: Jefferson Amstutz "Compute More in Less Time Using C++ SIMD Wrapper Libraries"; tsimd - Fundamental C++ SIMD types for Intel CPUs (SSE to AVX-512); OSPRay

Sponsors: Download PVS-Studio; We Checked the Android Source Code by PVS-Studio, or Nothing is Perfect

Hosts: @robwirving @lefticus
Transcript
Episode 176 of CppCast with guest Jeff Amstutz, recorded November 21st, 2018. Use the coupon code JETBRAINS for CppCast during checkout at JetBrains.com.
In this episode, we discuss Qt WebAssembly apps.
Then we talk to Jeff Amstutz from Intel.
Jeff talks to us about SIMD and SIMD wrapper libraries. Welcome to episode 176 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today? Happy Turkey Day.
Happy Turkey Day. Well, it's Wednesday, technically.
Yeah, it's my Thanksgiving now.
Well, yeah. And happy Thanksgiving.
Sorry, I guess I'm a little distracted and tired and such.
It's okay. It's okay.
But I was just thinking this time as you read "the first podcast," because we've read this many times, right? The posts that you make to the ISO C++ website still say "the only podcast for C++ developers."
I need to update that. I always just copy-paste it from the previous post.
I think it's fine if you leave it.
Well, Phil actually pointed out to me that the tag and the metadata for the website still said the only podcast.
Oh, is that right?
He mentioned it to me so I could update it recently.
Yeah, we'll just keep it subversively like that for a while.
Don't tell anyone.
Okay.
Do you have anything fun planned for Thanksgiving?
Thanksgiving, hanging out with family, but it'll be a pretty simple Thanksgiving.
Okay.
We're doing a Friendsgiving down here.
With the level of travel that I've been doing lately, our plan is basically to not travel at all.
That's a good plan.
I'm not a fan of Thanksgiving travel either.
Yeah. Okay. Well, at the top of every episode I'd like to read a piece of feedback. This might be our first YouTube comment.
This is from John, on the episode from two weeks ago with Devin Labrie, and he wrote: I'm not sure if this is the best place to post a comment for the podcast, but I'm really impressed with CppCast.
I like how you guys mix things up, having folks from various levels of C++ skills in the show.
I've had an itch multiple times to delve into the game arena, but have been redirected by career choices.
I really feel for Devon and hope him well on his game dev adventure.
He also says that outside of the other awesome game-related guests in past episodes of CppCast, I would suggest Devin have a listen to the Game Dev Unchained podcast to get
a good feel of the game industry, in particular
the episode with John Podlasek.
This was an eye-opener for me, so it may
be of interest to him as well. Thanks, Rob
and Jason, for posting these great podcasts, and I look forward
to the future episodes.
Yeah, I'm glad to see we got some nice
YouTube comments.
Usually YouTube comments are a bit of a cesspool,
but we're getting good ones.
And I guess for our listeners who don't know this,
we do, you do, post the episodes to YouTube every week, right?
Yeah, I think we just started doing that this year,
but I've got like three years worth of the show up there on YouTube now.
And a kind of a side benefit of that is
YouTube's automatic transcription service is going to come into play,
so there are transcriptions of the episodes there.
I'm sure not perfect, but they do exist.
Yeah, they are there.
Okay, well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com.
And don't forget to leave us reviews on iTunes.
Joining us today is Jeff Amstutz.
Jeff is a software engineer at Intel where he leads the open source Osprey project.
He enjoys all things ray tracing, high performance and heterogeneous computing, and code carefully written for human consumption. Prior to joining Intel, Jeff was an HPC software engineer at SURVICE Engineering,
where he worked on interactive simulation applications for the U.S. Army Research Lab,
implemented using high performance C++ and CUDA. Jeff, welcome back to the show.
Hey, thanks for having me.
Jeff, you know, I mean, this is your second time on the show. And it's been a long
time. And we'll talk about that more later. And I can't remember if I asked this before. But
you say you enjoy all things ray tracing. How did you get started with ray tracing?
What's the oldest ray tracer you played with?
Well, I got started in ray tracing because of my first job at SURVICE Engineering doing ballistic simulation.
Okay.
So I didn't really play with a lot of ray tracing packages.
I played with the one that SURVICE made.
It's called Rayforce.
It's actually open source up on SourceForge.
And that one was good.
But SURVICE Engineering was about making products to solve problems for customers. So eventually, the Embree ray tracing kernels that we have at Intel kind of eclipsed the need for Rayforce to exist, at least when I was at SURVICE. I believe they continue to use our Embree ray tracing kernels. So now the ones that I was playing with at SURVICE are now the ones that I help maintain at Intel.
Okay.
Yeah.
And that was starting back in 2012.
So, yeah.
2012.
So, do you have any experience with the old POV-Ray, Moray,
some of those old open source ones that started that world?
No, I definitely need to go back and educate myself
on the history of ray tracing before my time.
That'd be a fun time.
Yeah, it could be fun.
Yeah.
Okay, Jeff, we have a couple news articles to discuss.
Feel free to comment on any of these, and we'll start talking more about maybe ray tracing and more about SIMD, okay?
Sounds great.
Okay, so this first one is a trip report focused on the freestanding proposal,
which Ben Craig is working on.
We talked about freestanding with him earlier this year,
and he was at the San Diego meeting presenting updates to the freestanding proposal, I guess,
and he's giving a rundown on how that went.
This seems like it was kind of a mix of both good and bad.
It seems like they're getting a lot of direction, and he pointed out that he kind of did his whole proposal backwards, where he made all these suggested changes first, and is just now getting direction on what the committee will go for.
Yeah, it does seem like the backwards way to do it.
Which he pointed out.
It does feel like good news that it is moving forward, it seems.
Yes, yes.
Yeah, and it seems like he is kind of getting some recognition
at or around the committee meetings.
Like everyone recognizes him as the freestanding guy now, I guess.
That's cool.
That's nice, yeah.
Is there anything else you want to point out in this, Jason?
No, but I was just in my head thinking the freestanding guy.
He's the one over there by himself without anyone.
I mean, it's freestanding.
Anyhow, sorry.
That's bad.
Next, we have a post on the Qt blog,
and this is getting started with Qt for WebAssembly.
And this is pretty cool.
It's a post
just going through what you
want to do if you want to make
kind of a Hello World type application
using Qt
GUI controls in WebAssembly,
which is pretty awesome that you can do that.
And they also have a link at the bottom
to a sample application,
which you can go and check out, which is like a little image editor.
But it's running in WebAssembly, which is pretty cool.
That's just amazing to me.
Have you messed with this at all, Jeff?
So back in my DoD contracting days, I had lots of experience with Qt.
That'd be the late 4.8 and early 5 stuff.
And so I haven't played with Emscripten at all, but when I scrolled down this article
and just looked at the screenshot of this little Slate application that they made with this, I went, oh my gosh, that's really cute in a web browser. That's impressive.
Yeah, your brain kind of gets confused because you're like, okay, there's a URL bar at the
top, and then wait a minute, there's like a file application. Yeah. It's crazy. So that's
really impressive. I think that's some great work and maybe one day I'll play with it.
Yeah. I need to play with it as well. Some of the projects I'm working on right now would be
well suited to playing with it at the very least. One thing I want to highlight for our listeners in this
post is that if you're interested in this,
they're going to be hosting a Qt for WebAssembly
webinar on November 27th.
That's sometime next week. I'm assuming that's
free to join and learn more about it.
It does appear to be free.
Yeah. Does it say how long. It does appear to be free. Yeah.
Yeah.
Does it say how long the class is going to be?
The webinar?
It says it starts at two and ends at three.
Yeah.
Okay.
Although it doesn't say what time zone it's in unless it's automatically adjusted to the current time zone of the viewer.
That would make sense.
So last thing we have is Herb Sutter's trip report from the San Diego meeting.
And obviously we talked a bit about this last week with Ashley.
And I think we're going to still have a second part two trip report with another guest hopefully next week.
But is there anything you wanted to highlight from Herb's trip report, Jason?
Yes, there was one thing that caught my eye that I did not hear anything about before this. Looking through the trip papers and everything else, I had not seen anyone else mention bind_front, which Herb mentions here. It's like one of the last things. Well, bind1st from back in the day, from C++98, was terrible: it only worked with like a two-parameter function or something like that and returned a new callable thing. And std::bind is just in general terrible because it's way overcomplicated, adds considerably to compile times, and if you're not lucky, it also adds runtime overhead, depending on how much the compiler can see. But it looks like this bind_front is just: bind the first argument and return a new function that takes the remaining set of arguments. And I've written this a couple of times myself with variadic lambdas, and it's a fairly straightforward thing to do in the simple cases, but definitely a handy thing to have when you need to be passing functions around and you're like, oh, well, I just want to partially bind the first argument, pass it on to the next thing that doesn't care about it, bind the next thing onto it, or whatever. It definitely has uses. It's just cool to me to see something that's kind of in between this crazy world of std::bind and the old outdated world of bind1st.
So would that make it straightforward to implement like a bind_n, like bind the first three?
Yes, but binding only the third element is still quite tricky.
Yeah, it would be up to n, not the nth one, right?
Yeah, you could totally do that. You could easily just write a function that takes three arguments and calls bind_front three times. Now granted, you're making the compiler work a lot, because it's going to generate effectively a variadic lambda for each of these things and then have to optimize them all. But it can.
Yep.
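For listeners following along, here's a rough sketch of what they're describing. std::bind_front is the real C++20 facility; the my_bind_front lambda below is a hypothetical hand-rolled equivalent in the spirit of what Jason says he's written, not the standard library's actual implementation.

```cpp
#include <functional>

int sum3(int a, int b, int c) { return a + b + c; }

int main() {
    // C++20: bind the leading argument(s), get back a callable
    // that takes the remaining arguments.
    auto add10 = std::bind_front(sum3, 10);
    int r1 = add10(2, 3);  // 15

    // A hand-rolled equivalent with a variadic lambda (simplified:
    // captures by value, skips perfect forwarding).
    auto my_bind_front = [](auto f, auto... bound) {
        return [=](auto... rest) { return f(bound..., rest...); };
    };
    auto add20 = my_bind_front(sum3, 20);
    int r2 = add20(2, 3);  // 25
    return r1 + r2;
}
```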
Okay, cool. How about you, Jeff? Is there anything you wanted to call out from this trip report?
There's all kinds of stuff. You know, the fact that ranges and concepts are in C++20 is just huge. I heard an anecdote. I was at the Supercomputing Conference last week, and I think I overheard someone talk about when ranges got voted in, there was just this standing ovation in the room. That's because it's a huge body of work and such a big change for C++.
That's really cool.
And then the one thing I'm looking forward to based on this trip report is std::is_constant_evaluated. That sounds like it's going to be really useful, to kind of be able to constrain things. Like if someone accidentally uses some kind of constexpr feature in your library or your code and it ends up becoming runtime, and you want to explicitly say, don't do that, now there's a trait that you can use, maybe with static_assert, to enforce that.
Yeah. It definitely opens up a lot of doors for craziness.
Yep.
Go ahead.
I was just going to say one more thing.
I'm not sure if we mentioned this earlier.
They're creating two new study groups, Machine Learning and Education, SG19 and SG20.
And the Education one sounds interesting.
It has the aim of trying to improve the quality of C++ education in universities.
And I think we've talked to various guests on the show about how that's very hit or miss how well they teach C++ at universities these days.
Yeah. Well, you know, Reddit. Sometimes you'll read a comment on some C++ related thing and
someone will be like, oh, well, my professor told us to do this. And then you basically
get responses from other Redditors that are like, you need to go to a new university, basically. But I am curious, Jeff, you just said you were at the
supercomputing conference, what kind of a representation does C++ have there?
It's a mix, because, you know, you get some old-school Fortran and C codes that have been around forever. And then there's definitely a plentiful demographic of C++ programmers looking to do very state-of-the-art C++ kinds of things.
There's a number of libraries out there that are focused on kind of HPC distributed computing. There's HPX, Raja, Legion, and there's some others.
And those folks are always trying to push C++ forward
to do things like heterogeneous computing.
How do I like write a program
that will get distributed on a cluster
without having to manage all the networking stuff?
There's really cool stuff going on.
And then on top of that,
there were even a number of C++ standard committee members
that showed up at SC exhausted from San Diego.
And there was a birds of a feather session
on C++ for heterogeneous computing.
So in HPC, being able to have a number of GPU accelerators
or FPGAs alongside your CPUs,
like how do you program all those? That's definitely a problem bigger than HPC, but
the HPC folks are also interested in those. Cool. And if I heard this right, is this the
conference that's going to be rotating back to Denver next year?
Yeah, so if I play my cards right, if I go to C++Now, I'll be in Denver for C++Now, CppCon, and SC.
Alright.
Stop by and visit my meetup.
Yeah, yeah. If I'm in town, I'll definitely
look for it.
Okay, well, as we said earlier, Jeff,
you were on the show once
before, but it was back in March 2016,
about two and a half years ago. So what's,
what's happened since then in your life?
Yeah. So the OSPRay project, I've kind of stepped into a role of leading that. There's a number of engineers at Intel that all contribute to it, and we're all kind of one team. It's been great to work with them, mostly based out of Austin. We also have some folks that work out of Germany. And the project's been maturing over time, which has been great. The user base is just growing and growing. It's been a really fun project to work on. And now I've also added onto my plate, there's future undisclosed hardware coming from Intel, as you would always expect, stepping into the, how do I talk to the right compiler folks at Intel to make sure that things like OSPRay are going to properly map
to whatever hardware comes out.
And so that's been really fun as well
because Intel has definitely, you know, made plenty of its money and profit and user-base buy-in with its CPUs, like OpenMP, pragma-based SIMD programming. And we'll talk about that maybe a little later. But
kind of branching out into other things is really fun. And we'll see what happens in the coming
months. It seems pretty crazy to be able to get to kind of live in this world of experimental
hardware with compilers
and will it work with the tools that you're working on.
Yeah, it's simultaneously exciting and terrifying, because it's one thing to, like, play with it; it's another thing to say, well, I also have to ship this library that's going to live or die by how well these toolchains work. So we'll see.
It also seems like, your program crashes, and now you're saying, is this a bug in my program, a bug in the compiler, or a bug in the hardware?
Yeah.
Which most people don't have to stop and ask, usually.
Yeah.
My best experiences have always been
when the hardware is not buggy anymore.
I try to not get involved with hardware that's so early revision
that it's going to have known problems.
So by the time it gets to me, it's a matter of programming it right
versus making sure the thing's actually working.
Crazy stories about that stuff.
But yeah, that's for another time.
Yeah, probably some you can't talk about as well.
Yeah, yeah.
So you just presented at CppCon: Compute More in Less Time Using C++ SIMD Wrapper Libraries. Do you want to tell us more about your talk?
Yeah. So, you know, there's multiple ways you can get a compiler to generate SIMD instructions. And I'll just back up and explain what that is and then what the talk is covering.
So SIMD stands for Single Instruction Multiple Data.
And what that is is if you think of like a normal mathematical expression like A plus B minus C, and we'll assign that to D.
So it's an add and a subtract.
Normally, each operator is going to operate on one value on
each side of that operator. So if those are all floats, it'll do, you know, add two floats together
and then take that result and then subtract two floats and store that in a float or whatever.
And so SIMD says, okay, each one of those operators, the plus and the minus, those are each basically an instruction. So what if we could have one instruction that instead operates on multiple data values at the same time? So if I have an addition followed by a subtraction, but I can apply those on more than one value at once, then I can actually increase the amount of computation I've done in the same number of instructions. That's called vectorization. Instead of having a single value, which we call scalar values, you instead widen those to a vector width of something your hardware supports. And then that lets you do more computation at one time, which hopefully speeds up something that needs lots of computation done.
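In loop form, the expression he's describing looks like this; the function and variable names are just illustrative:

```cpp
// Scalar form: each + and - instruction handles one float.
// With 8-wide SIMD, the same loop body can be executed on
// 8 elements per instruction, so roughly n/8 iterations.
void compute(const float* a, const float* b, const float* c,
             float* d, int n) {
    for (int i = 0; i < n; ++i)
        d[i] = a[i] + b[i] - c[i];
}
```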
So with that concept,
there's a number of ways you can make a compiler
generate these special instructions
that exist on all kinds of different CPUs,
and it turns out GPUs are pretty similar.
Obviously, there's going to be lots of devil in the details that are vastly different, but at kind of like a
high level, they're looking for the same thing. I want a single stream of instructions that's going
to map to collections of
values instead of single values. And so there's multiple ways you can get these
instructions generated by your compiler. There's some trade-offs with different
ones,
but when you want to look at idiomatic C++, like, I want to use the type system to do work for me, I think the best way to do this is with what we call a SIMD wrapper, which is: I represent these widths as types. So if I take a built-in float, and I want to instead create a type that represents, you know, four or eight floats at a time, we can use the type system in C++ to create that type for me. So then I can say, you know, A plus B minus C: those each can be represented as a float8 instead of just a float, but everything else then is the same. The plus looks the same. The minus looks the same. The assignment. So the idea is, things that are normally kind of complicated, getting these instructions to deterministically come out of the compiler, these types of SIMD wrapper libraries make that look like your normal code.
And so the talk was not trying to sell anyone on a particular library because there's a number of them out there.
There's actually one now voted into the standard.
So that's all great.
But rather the talk was, hey, here's the kinds of problems we're solving.
Here's the common things that are going to be in all of these libraries.
So when you go play with one, these are the things to look for. And I hope that's like a nice foundation that then maybe next year there can be some more advanced like, all right, how do we take, you know, scalar existing code that isn't vectorized?
And, you know, now what are some techniques that we can use these libraries to then write vectorized code?
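A minimal sketch of what such a wrapper type can look like, assuming AVX; vfloat8 and its operators are illustrative names, not the API of any particular library he mentions:

```cpp
#include <immintrin.h>

// An 8-wide float wrapper in the spirit of tsimd/Vc.
struct vfloat8 {
    __m256 v;  // one 256-bit AVX register: 8 x 32-bit float
};
inline vfloat8 operator+(vfloat8 a, vfloat8 b) {
    return { _mm256_add_ps(a.v, b.v) };  // one add, 8 lanes
}
inline vfloat8 operator-(vfloat8 a, vfloat8 b) {
    return { _mm256_sub_ps(a.v, b.v) };  // one subtract, 8 lanes
}

// Reads exactly like the scalar expression, but runs 8 wide.
vfloat8 compute(vfloat8 a, vfloat8 b, vfloat8 c) {
    return a + b - c;
}
```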
I want to take a quick aside about data sizes, since you're talking about these things being packed into things that the CPU can support. And you say float, and I know floats tend to be used a lot because they're small, you can pack many of them. And with, like, a double, you can have fewer; a long double, maybe only two or something in a SIMD instruction, if the CPU even supports long doubles, I don't even know. Where are the trade-offs on these kinds of decisions?
So, a couple of things there. One is, most SIMD instruction sets, pretty much all of them, measure their register sizes in bits and not in number of elements of a particular type. And, like, for machine learning, there's a lot of cases where precision is not as important, so you can get more speedup by cramming more lower-precision floats into a register. Like, let's say AVX and AVX2: most modern Core i-series CPUs out of Intel will have 256-bit registers. So if I can cram eight 32-bit floats into that, that's great. But if I don't need the precision, I can make that smaller and use maybe half precision. Or the other way around, which is the world that I usually trade off with, the 32-bit float being the less precise, versus doubles, where for scientific computing all that precision is desirable. And so, yeah, it's pretty much precision and what you're doing.
And then, of course, for integers, the same thing is true. If I only need to represent 255 values in my computation, then I can cram a bunch of 8-bit integers into a register versus using full 32-bit ints or 64-bit ints.
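The lane math he's describing is just register width divided by element width; a small sketch:

```cpp
// Lane count = register bits / element bits, e.g. for AVX:
//   256 / 32 = 8 floats, 256 / 64 = 4 doubles, 256 / 8 = 32 int8s.
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}
static_assert(lanes(256, 32) == 8);   // AVX floats
static_assert(lanes(512, 32) == 16);  // AVX-512 floats
static_assert(lanes(256, 8)  == 32);  // AVX2 8-bit ints
```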
So are these AVX, or whatever on the appropriate platform, instructions flexible enough to say, I've got a pack of 8-bit floats, if you had some way of representing that?
Yeah. To not get too into the, I don't want to lose any listeners by getting too into the weeds.
That's fine, they tell us to get more technical anyhow.
So I can represent, here's a collection of ints, but most instructions will then have, like, here is the expected collection of elements you're going to use.
So, man, I can't remember the intrinsic names. So intrinsics, for listeners that don't know, are little C-style functions that more or less correspond to an exact instruction. And so they usually are the implementation detail of what an implementer of a SIMD wrapper library uses; like, when I say float8 plus float8, I'm going to call a particular intrinsic.
Okay.
So the intrinsics, they say, like, here's a 256-bit register of 32-bit floats, and then it's an add. So there's this kind of naming convention to organize all that. So you actually represent the register with the same type, it'd still be like an __m256 or an __m256i, and then it's the particular instructions you choose that determine what the element-wise operations are going to be, which of course is, like, maddeningly easy to get wrong. So that's why these wrapper libraries are nice, because now we use the type system to enforce some things, to do proper conversions and stuff like that.
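Concretely, the naming convention he's reaching for looks like this; these are real AVX/AVX2 intrinsics from immintrin.h, while the little wrapper functions are just for illustration:

```cpp
#include <immintrin.h>

// The register types only say how many bits you have; the
// element interpretation lives entirely in the instruction name.
__m256 add_floats(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);     // "ps": packed single floats, 8 lanes
}
__m256i add_ints(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);  // "epi32": packed 32-bit ints (AVX2)
}
```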
So the answer is yes, the hardware is very flexible, but it's not arbitrary; it does have fixed 8-, 16-, 32-, 64-bit, or whatever.
Yeah. And usually, as instruction sets have moved on through time, you know, the difference between AVX and AVX2 is that there were instructions added for additional wider vectors for ints. For AVX1, it was just floats.
Okay.
I mean, this is super oversimplified, but take it for what it is. It was like, okay, we'll go from 128-bit to 256-bit, but AVX1 was like, we're going to do it with floats for the 256-bit operations. And then AVX2 was like, oh, we're going to add the int support as well. And then AVX-512, that goes to 512 bits, added lots of instructions across the board for the 512-bit wide stuff.
Yeah, you can get lost in understanding the exact instructions that are supported per generation of ISA, then how that maps to your CPU generation, what your CPU supports at runtime. It's a fun problem to solve that other people then don't have to think about.
Right.
I'd like to interrupt the discussion for just a moment to bring you a word from our sponsors.
Authors of the PVS-Studio analyzer suggest downloading and trying the demo version of the product. A link to the distribution package is in the description of this podcast episode. You might not even be aware of the large number of errors and potential vulnerabilities that the PVS-Studio analyzer is able to detect. A good example is the detection of errors that are classified as CWE-14, according to the Common Weakness Enumeration: Compiler Removal of Code to Clear Buffers. PVS-Studio's creators demonstrated the detection of such an error type, for example, in one of their latest articles, We Checked the Android Source Code by PVS-Studio, or Nothing is Perfect. A link to this article is also in the description of this episode. PVS-Studio works in Windows, Linux, and macOS environments. The tool supports the analysis of C, C++, and C# code, and Java support is coming soon.
Try it today.
Okay. Do you want to tell us a little bit more about SIMD and maybe how you use it in your day job?
Yeah. So I'll say this: if anyone who's listening watched my talk, I had a little example and I got called out for it in the comments, because I was dumb and I read the comments for the talk. So my first example is called saxpy, which stands for A times X plus Y. It's just, like, here's a little formula you apply to every element in two arrays and store in a third array. It's your, like, hello world of vector computing.
Right.
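For reference, the saxpy kernel in its usual scalar form (a standard BLAS-style routine, not code taken from the talk):

```cpp
// saxpy: "single-precision a*x plus y", the hello world of
// vector computing. One pattern applied to every element,
// which makes it trivially vectorizable.
void saxpy(float a, const float* x, const float* y,
           float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}
```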
And I said in the talk, like, saxpy's nonsense. And the comment was, you know, this is like the basis for a ton of machine learning algorithms. And what I meant in my head was, this particular thing I wrote is nonsense, like, I'm not computing anything. I'm just taking garbage data in and storing garbage data out, just to exercise the machine. And so yes, there are plenty of algorithms out there that are trivial to parallelize in this way, which is, for what it's worth, orthogonal to threading. So vector parallelism is like what every thread would be executing. So we're talking about just one of the many types of parallelism you have to go and consider when you're optimizing code. But for me, it's basically any time I have an operation where I'm going to do a lot of the same thing to a bunch of elements, that's a great candidate for using SIMD. Now, OSPRay doesn't use C++ SIMD wrappers. It may in the future. What it uses is actually a small custom language called ISPC, the Intel SPMD Program Compiler. It's a free open source thing up on GitHub. And it's neat, has some trade-offs, and there's tons of details. The simple way to describe it would be, if I'm trying to model light in some virtual scene.
So like I define some geometric objects, some spheres and cubes or like triangle meshes.
What I can do to get a picture out of that is define a virtual camera and define an image plane, which is basically going to be the screen. And I can basically trace light backwards: you trace a ray from the camera out into the scene, and when you find a hit point, you then figure out what material you hit and what angle you hit it at. And then you can figure out, okay, well, what's the next bounce that the light would have traveled to get to that point? And you keep going until you get to light sources. And that's the basic rendering algorithm. So there's a number of ways you can vectorize these
kinds of things. But the way we do it is we take what we call packets of rays. So we'll take eight rays out of the screen at a time, and we will trace all of those rays through acceleration structures. When we want to create a bounce set of rays, we can do all of that because, like, for instance, if I'm creating a secondary ray from one hit point, I'm going to do the same calculation for all of those rays that hit. So that's a great candidate for SIMD.
And where these kinds of SIMD wrappers play in is the way you model that. If I'm going to do four or eight or 16 rays at a time, or even giant buffers of ray streams, I don't necessarily have to care about the exact width I'm working with. Of course, when you tune, like if I was tuning for SSE or tuning for AVX, maybe I would want to get specific and use, like, float4s or float8s. But the majority of the time, I just want to say, like, a vfloat, a vector float, or what I call a varying float: I just want it to be something that can be widened. And by the time I get to compiling for a particular SIMD register width, I want it to just pick the right thing. So if I'm compiling this for SSE, that vfloat turns into a vfloat4. Or if I'm compiling for AVX-512, that turns into a vfloat16.
Okay, so you're saying, just to be clear, four floats in a pack, 16 floats in a pack. Okay. I was at first thinking you were saying 4-bit-wide floats, and I'm like, I don't even know what you could possibly represent with that. Man, that's a really long mantissa and exponent and all.
That'd be the most useful floating point ever.
Anyway, yeah, no. Number of floats in a register. 32-bit floats is what I use most commonly.
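One way that compile-time width selection can be spelled; the ISA macros are the real compiler-defined ones on GCC/Clang, while vfloatn is a stand-in wrapper type, not a real library's:

```cpp
// Stand-in for a real wrapper like tsimd's; W is the lane count.
template <int W>
struct vfloatn { float lane[W]; };

// Pick the "varying float" width for the target ISA at compile time.
#if defined(__AVX512F__)
using vfloat = vfloatn<16>;  // 512-bit registers: 16 floats
#elif defined(__AVX__)
using vfloat = vfloatn<8>;   // 256-bit registers: 8 floats
#else
using vfloat = vfloatn<4>;   // 128-bit SSE: 4 floats
#endif
```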
And so I'd say the thing there is, when you're looking at something like saxpy or doing machine learning or something, it's like, oh, why can't the compiler just do this all the time? And it's like, if your math is as simple as that and you're just applying that very simple expression to a bunch of pieces of data, that's great, you might not need fancy SIMD wrappers for that. But ray-trace-based rendering is the opposite.
What we have is these very deep function calls where, you know, I take a packet of rays, and I have to generate the rays from the camera, then shoot them out. I've got to traverse them down a bounding volume hierarchy into ray-primitive intersections to get the normals, the hit point distance. Then I have my material, and I've got to go and figure out, based on that material, how am I going to calculate the next ray that bounces. It gets to the point where you get things very deep, as in several function calls deep on the stack, where it's not all just in this one for loop. I can actually reason about a packet of rays anywhere in the function call stack
because I know this is like a packet of four rays, this is a packet of eight rays. And then the second half of that is, with that packet of rays, I can have all these user-defined structures all over the place. And that's where it gets really interesting. Because if I define, like, an algebraic vector, like a vec3, representing a point in space, or maybe it represents a direction, then you could define a ray as a point and a direction. So each one of those x, y, z components is going to be, in the naive sense, scalar: just float x, float y, float z. But if I want to have, like, a group of origins, then I could say, well, actually x is a float8, y is a float8, and z is a float8. And I can compose this all the way up. So if I have a float8, I can then create, like, a point8, and with a point8, I can create a ray8, and maybe, you know, bigger structures like a screen sample8. And you walk this all the way up, where I can say, if I have a varying ray, that could be widened to be a ray4, ray8, ray16, depending on my architecture. And when I say, like, ray-intersect-triangle or something, I can actually implement that in such a way that all the dot products with that algebraic vector just don't care about the width. You can basically express an entire library in such a way that I can write all of this math the exact same way I would as if I was just playing with plain floats, but now I have a very tight understanding of how it's going to be widened. And this is called a structure of arrays.
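A sketch of that composition, reusing the vfloat8 wrapper idea from earlier; all the type names here are illustrative, not OSPRay's or tsimd's actual ones:

```cpp
#include <immintrin.h>

struct vfloat8 { __m256 v; };
inline vfloat8 operator*(vfloat8 a, vfloat8 b) { return { _mm256_mul_ps(a.v, b.v) }; }
inline vfloat8 operator+(vfloat8 a, vfloat8 b) { return { _mm256_add_ps(a.v, b.v) }; }

// Structure of arrays: each component holds 8 lanes, so one
// vec3f8 is really 8 points (or directions) at once.
struct vec3f8 { vfloat8 x, y, z; };
struct ray8   { vec3f8 org, dir; };  // a packet of 8 rays

// Written once, exactly like the scalar math, but 8 wide.
inline vfloat8 dot(const vec3f8& a, const vec3f8& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```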
Interject any time if you've got questions or anything, because I can just ramble forever.
Actually, I will pause you for just a moment here, because I'm trying to make sure I'm wrapping my mind around this, and my brain still keeps thinking, when you say float8, I still want to think an 8-bit-wide float. No, you mean eight floats wide. Fine. Okay.
Because I'm thinking, you know, int8_t, uint8_t, the built-in types from C++11.
Yeah.
All right. So what I'm thinking is, this sounds a lot like data-oriented design.
It definitely overlaps with that.
But that tends to be geared more towards fitting your structs into cache lines and that kind of thing. And you, on the other hand, are doing data-oriented design, if you will, but with a goal of fitting it within the vectorization of the hardware capabilities.
Yep, yep.
So it's like data-oriented design for an ISA, not for a memory hierarchy.
Right. Okay.
How do those two things interact with each other then?
Well, so that's where you can get endlessly lost in design trade-offs that are very specific to the context of the code you're implementing. So it's tricky to say, like, here's the one prescription for this problem you'll have everywhere, just do this. It's always in context of what you're doing. So for instance, if I take and measure code that is running on a single core, if I'm optimizing for better vector instruction utilization, I can be reasoning about that code.
But if I'm running it single-threaded,
if I'm looking at my cache,
my cache probably is going to be mostly stocked full of stuff
that that core is going to be working on.
And then as soon as you multi-thread it,
now my cache is going to be thrashed in and out with things that maybe
separate threads are working on.
And so there's all these contexts. Like, when I'm optimizing, you have to understand what state the system is going to be in. It's not just, what's this code doing; it's, what's this code doing in context of what the whole machine's doing. So that's why it's tricky. For filling cache lines, that's not an issue. But if you're looking to say, could I keep the data an entire function is working on in cache, that might be trickier, because that's then the issue.
Obviously, cache lines themselves won't be divided by threads, but the greater context of what you're optimizing might.
So you might want to arrange threads in a way
that they're all trying to work on smaller pieces together
instead of spreading them out in your data set
and having them working on vastly different pieces. But now all of that is going to influence how you're looking at your vector code utilization. Like, sometimes, for various reasons in ray tracing, it might make sense to represent your problem with shorter vectors than what you natively can have.
sense to represent your problem with shorter vectors than what you natively can have.
So for instance, I talked about that issue where I have rays that come out of the screen
and they all hit an object.
Well, what happens if some of them hit one object and some of them hit another?
Let's say I have eight rays and four of them hit one object and four of them hit another object.
Well, it might make sense that as we do the calculation to create the secondary rays for each of those, where they'll have to
load different data because they might have hit different
materials or something. They definitely hit different
primitives. It might
make sense to rearrange them
into two packets of four
and then move on from there
instead of keeping them all as one packet of eight
because they're what we call diverging.
They're going different places and they're not going to be doing the same things,
kind of by design of the algorithm. So then there's other design trade-offs, where you can say, well, I might change it so that I have a single scalar ray, and then I march that down, and I can do
batches of primitive intersections at one time. So instead of doing many rays to one primitive, like one sphere,
maybe I'll do one ray and have several spheres that are candidates
and then test them all at once.
So all of these things require ways to express user-defined structures, where I can say, I want to load, like, a group of spheres that I'm going to call, like, a varying sphere8, and then test that against a ray, or vice versa. And it gets tricky. If you want to do that with intrinsics, there's two problems. One is you end up only implementing it for one instruction set. And it's really complicated to look at; it's almost as bad as assembly. So there's that. And then the other half is, things like OpenMP that were built more for Fortran and just pure C don't have the rich type hierarchies or type composition that idiomatic C++ has. And so it's really tricky to be like, pragma, OpenMP, parallelize, SIMD-ify this, and then all of your data structures just magically get widened correctly. Because it lives outside the C++ type system. It's extracurricular to C++, the pragma stuff. So for some customers it's good, but for my particular use case with ray tracing, it just is not effective. That's a lot of the motivation for why these SIMD wrappers I think are really useful, and why, for programmers looking to do more idiomatic C++, like representing your problem as a type, they can be very effective.
It sounds very cool. And we've covered an awful lot of ground so far, so I'm just kind of thinking where to go next. There was a comment on YouTube, I know we all agreed you're not supposed to read the comments, that said, basically, why didn't you just let the compiler vectorize this? And I gather from everything you've said, you're effectively, first of all, you want to fit it into specific sizes, and then second of all, you're doing things more complicated than the compiler is able to see through and vectorize.
Yeah, so OpenMP and the stuff that predates it.
So OpenMP you can view as two things.
One is the threading side.
I can just put a pragma OMP parallel for
and my for loop turns into scheduling these loop iterations
on different threads.
That's really useful.
So the piece I'm objecting to is the other half,
which is they have ways that you can say
pragma, OpenMP, basically like vectorize this.
And there's two things that that gets tricky with.
One is like I can't make C++ type decisions
on if something was widened or not.
So remember we had, like, a vfloat, a varying float, in, like, a C++ SIMD wrapper library.
Well, if I just have a float that I say, you know, A plus B minus C equals D, and those are all plain floats, that simple expression, the compiler can widen.
But the problem is I've expressed it in terms of scalar code.
I've said, this is just a single float.
That's what the C++ language says.
When I say float, I get this 32-bit float
or whatever the implementation decided float was going to be.
Is that actually guaranteed to be 32-bit?
No.
No, okay.
The only guarantees with C++ types, or with C's types, are things like double must be greater than or equal to float, long double must be greater than or equal to double, basically. But the point is, it's still scalar. It's still, like, this is a value, not a collection of values that you can iterate over or decide to work with a subset of. You've still said this is a single variable, a single element.
And so OpenMP says, I can figure out that that can be widened to a collection that then has special instructions that can work on it. But if I wanted to have a template, I couldn't, let's say, specialize a function call based on, am I working on a scalar or am I working on a vector's worth? And so all of the stuff that I would want to do in the type system is now off limits, because now only the OpenMP vectorizer is allowed to work with that. And the second part that's
really tricky is everything has to be inlinable. So if I wanted to make a function call into something that I can't inline, there are some solutions that exist in, like, ICC, but it's very, very non-portable how the quality of implementation is between other compilers, to be able to say, like, this function call can accept scalar and widened varying versions. So maybe if I have a function, like, do-some-math, and it takes float x and float y and returns a float, I think ICC will actually figure out how to say, I can create a version that takes a varying float x, a varying float y, and returns a varying float. But in general, again, that's another thing that I don't have a lot of control over; it's just how you happen to use it, and it needs to be inlinable.
So the SIMD wrapper libraries, because they work in the type system, I can now either template a function and say, this will work for any width. So for instance, in the SIMD library I wrote called tsimd, you have to provide vectorized versions of, like, math. So, like, trigonometric functions, think sine or cosine or tangent: it could take a single float and return a float, you'll get that in your standard library, or it could take a varying float and return a varying float, where every element had sine or cosine applied to it. So those kinds of things are actually width-independent. So in tsimd, I implemented it as a template that always takes floats, but any width, and then it's the exact same implementation for every single one. I could do that because it was in the type system; I was able to write a template based on the characteristics of this SIMD register I'm trying to program with.
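In the spirit of what he's describing, a width-independent math function might look like this; a real library would use polynomial approximations over whole registers, so the per-lane loop here is only a sketch, and vfloatn is a hypothetical stand-in type:

```cpp
#include <cmath>

// Stand-in varying-float wrapper; W is the lane count.
template <int W>
struct vfloatn { float lane[W]; };

// One template covers vfloatn<4>, vfloatn<8>, vfloatn<16>, ...
template <int W>
vfloatn<W> v_sin(const vfloatn<W>& x) {
    vfloatn<W> r;
    for (int i = 0; i < W; ++i)
        r.lane[i] = std::sin(x.lane[i]);  // same code at every width
    return r;
}
```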
like the, if I let the compiler do too much i then
can't take control at various points um because for instance like let's say i have that a plus b
minus c equals d um and i say hey openmp vectorize that for me um that's great when it does um
but as soon as i say like i want to use this user defined i want to turn that into like a
a vec 3f so i took the result of that and used it as like a i'm gonna i'm gonna subtract an
algebraic xyz from it and then that would result another one as soon as i start doing stuff like
that openmp is like oh wait i don't know what you're doing anymore and gets paranoid and won't vectorize it.
So compiler paranoia is one of the biggest barriers to compiler auto-vectorization.
Let's be clear. You mean our compiler is wanting to generate correct code for us all the time.
Yeah, the paranoia is justified because I, as the programmer, need to say exactly what's okay.
Right.
And so when I do it in the type system, the compiler can reason about what that should be. When I just give it something ambiguous, this is like one of my big objections to using raw pointers, it's like, I have no idea what you mean, so I'm going to be paranoid and assume that it's the worst case that it could be. It could be an array. It could be an optional thing. Is null valid? There's all these things that you're not expressing when you just say float*. And the same thing is true with, can I widen something or not? And when I have a user-defined structure, is the user depending on the layout of that structure in memory being the same? There's all these things that the compiler is justifiably paranoid about, and if you just rely on the compiler to do it and the compiler says no, then that's just what you got.
Right.
So with all of these SIMD wrappers that you're talking about, do they have to make runtime choices as to which intrinsic to execute?
So that is left up to, I think every single library does the same thing, which is: you as the programmer say, I will compile my translation units maybe in multiple versions, and then you decide at some high-level point. The way we solve this in OSPRay is we have multiple dynamic shared libraries, and we just at runtime load the one that has everything for a particular ISA. So the OSPRay API is such high level that we can load the AVX-512 version of OSPRay or the AVX2 version, and then we just assume everything is uniform. There are other solutions; like, I know the Intel compiler can create fat binaries where it does more granular function selection. But the actual wrappers themselves only make compile-time decisions on what the implementation is going to be.
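A common shape for that kind of runtime selection; this is a generic sketch using GCC/Clang's __builtin_cpu_supports, not OSPRay's shared-library mechanism:

```cpp
// Each kernel is compiled in its own translation unit with the
// appropriate -m flags; at startup we pick one via CPUID.
using kernel_fn = void (*)(const float* in, float* out, int n);

kernel_fn select_kernel(kernel_fn avx512, kernel_fn avx2, kernel_fn sse) {
    if (__builtin_cpu_supports("avx512f")) return avx512;
    if (__builtin_cpu_supports("avx2"))    return avx2;
    return sse;  // baseline fallback
}
```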
Okay, cool.
So you talked about a couple different SIMD wrapper libraries in your talk, including the one you worked on. I think you said that one of them is currently being standardized. Is that being aimed at C++20 or some future version?
Yeah, I believe it's aimed at 20, and it was voted in. That's the Vc library. That's up on GitHub. Then, of course, the next question is, well, why'd you write your own? And it's, of course, because there's objections, not to, like, the high level, I'm going to do the entire thing differently. It's more that there's these, like, couple of small design decisions that we needed to do differently, and, for various reasons, I did it differently. So Vc and, like, tsimd, the one I wrote, are kind of the same as you would use them, but, like, supporting AVX-512 in the very specific ways we needed it for ray tracing. And we're talking, like, super-in-the-weeds kinds of design decisions, not things users should be really concerned about. But yeah, Vc is the one that I point people to, because that's the one that's being standardized and lets you do standard stuff as much as possible.
One question I have regarding that: you talked about how in tsimd you implemented your own vectorized versions of, like, the math functions, you know, sine and cosine. Are there other parts of the standard library, or C++ in general, that are going to kind of take this vectorization, SIMD wrapper stuff into account? Are we going to get vectorized versions of the standard algorithms or anything like that?
So that is a part of the parallel STL. So there's OpenMP, there's intrinsics, auto-vectorization, and then there's the one we didn't mention, which is the parallel STL, where you say, like, I want to do std::sort, but I want to do it in parallel. And the way that works is there's execution policies, where you say, I want to do this std::par, which would be parallel. Then there was par_unseq, which was technically not "vectorize this," but "maybe implement this in a way that could be vectorized." And I believe in 20, people out in the C++ world will correct me if I'm wrong, but there's additional execution policies being worked out by SG1, which is the study group on concurrency and parallelism, where there's, like, a std::vec execution policy. And what that is, is when you provide the little lambda to your standard algorithm, it really says what license the compiler can take to vectorize that. So it's in a similar vein to the pragma-based stuff, but it's at least in core C++, and it's not something else.
So these actually should compose together. So for instance, I could have, and for Jason, for your sake, I'll call it vfloat8, not float8. If I have a vfloat8, I can actually have a std::vector of vfloat8s in my container, and I could write my lambda function to take a vfloat8 instead. So it composes with standard algorithms already. And really, I think the parallel STL solution to vectorization should be viewed more as: I have some simple math I'm going to do, that is all inlinable, in this one lambda that I'm going to pass into a standard algorithm. It's useful for more of those use cases, I would think, whereas the real deep, very rich environment that we do with ray tracing, maybe that wouldn't be the best. But you know, the cool thing is, both of these things are aimed at the standard: the SIMD wrappers are in the standard, the parallel STL vectorization execution policies are also there. So it's choose the tool that better fits your problem, which I think is fantastic.
Yeah, that's cool.
Yeah.
Is there anything we haven't gone over yet
that you really wanted to talk about today?
Yeah, I'll just say that concurrency and parallelism and high-performance computing, all that stuff is hard. So I'm really glad that there's tools out there to make some of this stuff easier. Like, I love that Sean Parent brought this up in some of his talks, like C++ Seasoning: if you have a problem that needs to be multi-threaded, go find a tasking system. Don't implement one yourself until you're the expert in the room that knows the problem the best, such that a new tasking system implementation will be better than existing ones. But same thing with SIMD libraries, tasking systems, with a lot of this stuff: it's always good to find a library that solves that problem first, figure out a reason you should object to it, and then implement your own.
This is all outside of people that implement them for academic purposes.
If you want to learn how to do this, go for it.
But I mean for more production code, that is.
There are tools out there that make these big, complicated problems
easier to reason about.
So for listeners out there that want to get into this stuff,
if you go look at those libraries,
it doesn't have to be as bad as you think.
And so hopefully blog posts and books and stuff out there, maybe videos, will lower the bar to entry on getting involved in this stuff, because lighting up CPUs and GPUs to go bang out real fast computation is a lot of fun. So I'd encourage people to go write a ray tracer or something like that. It's a good time.
And I'll add to that: if you want to get into ray tracing specifically, check out OSPRay. Check out Embree. These are cool open source libraries on GitHub. But if you really want to, like, implement it yourself, there's a great book out there called Ray Tracing in One Weekend, put out by Pete Shirley out of NVIDIA. He just lays it out, from implementing your little algebraic vec3f all the way up to rendering, with ray tracing, this scene of, like, spheres, with cool materials that are reflective and stuff. And it's, like, literally in the title, Ray Tracing in One Weekend. So if you really want to get into that domain, it's an e-book up on Amazon, Kindle, and all that.
So encourage listeners to go enjoy the world of ray tracing,
even if it's not your day job.
Awesome.
Okay, and where can listeners find you online, Jeff?
Yeah, I'm on Twitter.
I'm on GitHub at Jeff Amstutz for both of those.
And then I have a blog, jeffamstutz.io,
that I have been terrible with updating,
but I really want to get back to it.
It's been a busy year, but writing is still fun, so I'm going to find some time to start doing that again.
Awesome. Thanks again today, Jeff.
Thanks for having me.
Yeah, thanks for coming on.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon. If you'd like to support us on Patreon, you can do so at patreon.com/cppcast. And of course, you can find all that info and the show notes on the podcast website at cppcast.com.