CppCast - SIMD
Episode Date: December 15, 2023
Matthias Kretz joins Phil and Timur. Matthias talks about SIMD, including what it is, how it works, and what it's useful for. We also discuss his proposal to introduce SIMD vocabulary types and functionality into the C++ standard and how it relates to what was in the Parallelism TS. News: MISRA C++ 2023 published; Sonar webinar on MISRA C++ 2023 with Andreas Weis; nlohmann/json 3.11.3 released; reflect-cpp - Reflection library for C++20. Links: P1928R7 - "std::simd — merge data-parallel types from the Parallelism TS 2"; Matthias' CppCon 2023 talk on std::simd
Transcript
Episode 372 of CppCast with guest Matthias Kretz, recorded 8th September 2023.
This episode is sponsored by Sonar, the home of clean code. In this episode, we talk about the new MISRA C++ 2023 guidelines,
a new release of the nlohmann.json library,
and a new C++ Reflection library.
Then we are joined by Matthias Kretz.
Matthias talks to us about SIMD and his proposal to standardize
SIMD for C++26. Welcome to episode 372 of CppCast, the first podcast for C++ developers
by C++ developers. I'm your host, Timur Doumler, joined by my co-host, Phil Nash.
Phil, how are you doing today?
I'm all right, Timur. How are you doing?
I'm not too bad, actually. Got some sleep, got a coffee mug here in front of me, so I'm good.
Yeah, if you're getting sleep, that's progress, I think.
Yes, it's definitely improved a lot. Again, like if anybody has kids, you know how it goes,
but we're two months in now and it's definitely getting a lot better.
Been there, done that.
Yeah.
All right.
So how are you, Phil?
Yeah, I'm back at home this week.
I was traveling last week.
So apologies for the bad audio last week.
I was trying out a new travel mic.
It wasn't as good as I'd hoped.
So back on my good mic this week.
As we record next week, by the time this is released, I'll be traveling again.
So you get the benefit of my good mic this week.
All right.
Well, that's good.
At the top of every episode, we'd like to read a piece of feedback.
This time, actually, there wasn't much feedback on the show itself,
but there were quite a few comments from people on X and on Reddit
about how they like doctest more than Catch
and how they like Rust more than C++.
So, yeah, thank you very much to everybody
for all your comments.
It's apparently quite a hot debate,
both of those things, which is not over.
So thanks for your comments on our show material.
We do appreciate it.
We'd like to hear your thoughts about the show as well.
And you can always reach out to us on X and Mastodon
and LinkedIn or email us at feedback at cppcast.com.
Joining us today is Matthias Kretz.
Matthias began programming in primary school and got serious with C++ when he joined the development of the KDE 2 desktop in its alpha stages.
Working on different GUI applications at first, moving on to library work all over the KDE core infrastructure.
He studied physics in Heidelberg, Germany, and got into SIMD for his thesis.
There he worked on porting parts of the online reconstruction software for a CERN experiment
to the Intel Larrabee GPU.
This motivated his work on SIMD and Vc, an abstraction for expressing data parallelism
via the type system.
Vc was the first solution of this kind released as free software.
His PhD in computer science at the University of Frankfurt
was a continuation of his SIMD work,
developing higher-level abstractions and vectorization of challenging problems.
He has been contributing his SIMD work and his expertise in HPC
and scientific computing to the C++ committee since 2013.
Since 2022, he has been the chair of SG6,
which is the numerics study group of the C++ committee.
He's also a contributor to GCC and has founded and chaired
C++ user groups at Frankfurt Institute for Advanced Studies
and at GSI Helmholtz Centre for Heavy Ion Research.
Matthias, welcome to the show.
Hey, thanks for having me. So you worked in physics and you are chair of SG6, the numerics group.
So you would be the perfect person to weigh in on our discussion of random numbers.
So is there such a thing as real random numbers and how does that affect C++?
Wait, I thought we settled that last time, didn't we?
Let's do it again.
Timur made a good argument.
And as someone who studied physics, yes, I support what Timur said there, even though I tend to forget lots of what I learned in physics. Use it or lose it, and I don't use physics enough, so I lose much of what I learned there.
Yeah, same. I finished my PhD in physics in, like, what, 2012 or something, and I don't remember any of that stuff.
Yeah.
The only takeaway is I learned how to program,
and that's why I'm here, basically.
That was actually my goal when I was targeting what I wanted to study.
My goal in the end was really to do computer science,
but I wanted to have more than just the computer science background.
I wanted to have some science background, and physics was like the most interesting there.
All right. So we'll get more into your bio and your work and potentially also physics in just
a few minutes. But first we have a couple of news articles to talk about. So feel free to comment
on any of these. Okay. The first news item for today is that MISRA C++ 2023 got published. That actually already happened in October, but it was only posted on Reddit last week, so I'll consider it a piece of news for this week.
So I think that's one of those cases of, it depends on your definition of published. I think technically it was published in October, but it wasn't publicly released until just last week, and that's why it appeared on Reddit then.
All right, so it is actually brand new information.
Yeah.
All right, so MISRA, for those of you who don't know,
are guidelines for the use of C++ in critical systems.
I think it's mostly used in automotive.
MISRA is like a motor industry kind of industry association kind of thing.
And so it tells you things like do not allocate dynamic memory or do not call a function recursively or do not do
a bunch of other things that could lead to unsafe
and undefined behavior in C++,
which is allowed in C++ but not in MISRA
C++. And so
there's also a bunch of static analysis tools that can
enforce these guidelines to an extent
and it's been waiting for an
update for quite a while.
So the previous MISRA guidelines were targeting C++03.
Now they're targeting C++17.
That's a reasonably modern standard.
They updated a lot of the rules.
They also integrated AUTOSAR guidelines,
which I think is the other well-known set of coding guidelines
for safety-critical C++.
It doesn't mean they just merged all of AUTOSAR in,
but it means that they considered everything that's in there
and they decided whether what's in there fills an existing gap
and belongs into MISRA as well.
It replaces the previous version, which was from 2008.
So yeah, I think it's quite interesting for those
who work in kind of safety critical software.
Phil, I think you have actually a virtual event coming up
talking about what's new in MISRA 2023, don't you?
So I do.
It's coming up as we record.
Unfortunately, it will have already happened
by the time this is released
because we're recording it a week ahead.
But I will put a link to the recording of the webinar
in the show notes.
That will be with Andreas Weis,
who was on the show just a few episodes back,
going to be discussing the MISRA C++ 2023, as we've just been talking about, and what that really means and what's different.
So one of the things, actually, you sort of covered it, but just to emphasize a bit, as well as considering every one of the AUTOSAR guidelines, they went back and reconsidered every one of the original MISRA C++ 2008 guidelines.
Do they still make sense?
Is there a better way to do it in modern C++?
It really was a proper modernizing effort
from many people who are themselves
on the C++ standards committee.
So it's actually a really good update.
And you should watch the webinar.
Right. Well, that sounds very interesting.
I will definitely watch it.
The second news item for today is that the nlohmann::json library, so that's the JSON library by Niels Lohmann, which I'm pretty sure is the most popular JSON library in C++, published a new version, 3.11.3, which doesn't sound remarkable in itself because that replaces 3.11.2. But the remarkable thing about this
is that it's actually the first release in 473 days.
Like 3.11.2 was released on the 12th of August, 2022.
So I think that's actually remarkable
because if you look at that, it seems like,
oh, is this library still maintained?
But yes, it is.
So that's good to know that it is actively maintained
because it is so popular.
It has some new features. It allows a custom base class as a node customization point: there's an additional template parameter, so you can set a custom base class for your nlohmann::json object, which stores a JSON object. So you can add your own functionality to a JSON node, like metadata or additional member functions like visitors or whatever.
So that's good.
There's a few other features like they support Bazel now, they support Apple's Swift package manager, which I'm not entirely sure why you want that for a C++ library, but maybe I'm missing something.
Is that about interop maybe?
Yeah, I'm not sure either.
It could be, but obviously, you know,
we have a number of package managers in the C++ world now,
so why not add one more?
All right.
So yeah, lots of bugs fixes.
So yeah, it's really good to see
that this library is alive and kicking,
considering how popular it is.
I'm actually thinking,
maybe you should try to get Niels on the show.
What do you think, Phil?
I think that would be a great
idea. All right, all right. So we'll reach
out. Niels, if you're
listening to this, we're coming for you.
Also kind of tangentially related
to JSON,
but also a lot more related to reflection,
which is what we were talking about last time.
So we talked about
the new proposal that's heading for C++26
about adding reflection to the language
While we don't have reflection in the language yet, people still do reflection without proper language support, using all kinds of hacks, right? And so there is a new reflection library that popped up on the internet called reflect-cpp, which can serialize a C++ struct to JSON and deserialize it again from JSON.
And what it does is it actually includes the member names of your struct fields
in the JSON file.
And that's obviously something you need reflection for.
And usually this is done, like,
if you don't have reflection, which we don't,
you're using kind of compiler-specific macros
and things like that.
And ReflectCPP actually does it just using C++20
without any compiler-specific intrinsics,
which I thought was interesting.
So I was like, oh, how does this work?
So I had a bit of a closer look at it.
And I'm not sure.
I'm not sure this is a piece of information I needed to know.
So what it does is, so it's C++20. It doesn't work in earlier standards. C++20 adds std::source_location, and it provides a function, std::source_location::current().function_name(). So you can call that, and that gives you the name of the current function that you're in. And if the current function is a template, you will also get the template parameters passed in and stuff like that.
And so the library then
kind of declares the struct as an extern
struct.
And then somehow, I didn't quite understand
what the machinery behind
that is and why you need that
extern. But somehow you can then
basically pass pointers to the fields of that struct that you declared as extern to a function calling std::source_location::current().function_name(), and then the resulting function name kind of contains the name of the field.
And so all you have to do at that point then
is to kind of parse that string,
and then you get the field name.
And you can also get the name of the struct itself
using the same trick.
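A rough sketch of the trick as described, assuming a C++20 compiler whose source_location::current().function_name() includes template arguments (as GCC and Clang do); the names here are illustrative, not reflect-cpp's actual internals:

```cpp
#include <iostream>
#include <source_location>

struct Point { int x; int y; };

// An object with static storage duration, so that &probe.x can be used as a
// non-type template argument (C++20 relaxed the rules for subobject addresses).
Point probe{};

// The pointer passed as a template argument shows up in the enclosing
// function's signature, which source_location exposes as a string.
template <auto Ptr>
const char* field_signature() {
    return std::source_location::current().function_name();
}

int main() {
    // Implementation-defined output; on GCC it looks roughly like
    // "const char* field_signature() [with auto Ptr = (& probe.Point::x)]"
    // and a library can parse the field name "x" out of that string.
    std::cout << field_signature<&probe.x>() << '\n';
}
```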
So it's not strictly portable
because the standard says that source location function
name returns an implementation-defined string,
which is supposed to contain the member name,
but it's not guaranteed to contain the member name.
It can contain whatever.
If it's actually implementation-defined,
then you can look at the compiler documentation
and do the right thing.
But there's still no guarantee that it doesn't change from version to version.
Right, right. But all major compilers do actually include the member name in that string, so it actually does kind of work. But yeah, I can't wait for actual reflection, because as helpful as this is, I mean, the functionality is really cool, right? You can just serialize something into JSON and it just works. You don't need to add any macros to your struct, which is I think what all the gaming people do with their reflection engines, where you have to actually add macros. Been there, done that. So not having to do any of that is good. But there are still showcases where we really need proper reflection support.
And so I guess for deserialization, it then searches for the right string and then assigns to that member.
I guess.
I didn't look into that deserialization part,
but I think that's...
Sounds fun.
Yeah, that sounds fun.
This is very similar to the enum trick
that's been around for a few years now.
I think Magic Enum was one of the first libraries
that did that.
There's a few more now.
But there's a similar trick, not with source_location but even before that, with the __func__ predefined identifier. Again, when you expand a template name, it will give you the values passed to the template, and if you pass an enum value, it'll actually give you the name of the enum value in the string, which you can then parse out. But of course it's completely implementation-defined, even more so than this, I think. I think you would have to take the __PRETTY_FUNCTION__ macro instead of __func__, I forget which one it is, but __PRETTY_FUNCTION__ is longer. So it's an implementation-defined hack on top of an implementation-defined hack. If anything, this one is actually slightly more standardized; there's only a bit of implementation-defined in it, but it's a popular thing to do.
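For reference, the enum version of the trick looks roughly like this on GCC and Clang (entirely implementation-defined, as just discussed; MSVC spells the macro differently):

```cpp
#include <iostream>

enum class Color { Red, Green, Blue };

// __PRETTY_FUNCTION__ spells out the template arguments on GCC and Clang,
// so the enumerator name appears in the string and can be parsed out.
template <auto V>
const char* pretty() {
    return __PRETTY_FUNCTION__;
}

int main() {
    // Prints something like:
    //   const char* pretty() [with auto V = Color::Green]
    std::cout << pretty<Color::Green>() << '\n';
}
```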
Right, so that concludes our news items for this time, and we can transition to our main topic and our main guest. Hi again, Matthias.
Hi, it's great to have you on the show.
Great to be here.
So we always kind of prepare in advance a list of possible questions that we want to ask the guest. It doesn't always go exactly that way, but this time my list of possible questions is way too long, so I'm probably not going to get to ask all of them. That's going to be interesting. I think I have enough material for a five-hour episode here, so let's see how that goes.
As long as we ask multiple questions at the same time,
then we should be fine.
Well, do you also get multiple answers at the same time?
Or how's that going to work?
Yeah, well, it doesn't really work.
It should be single question, multiple answer.
Right.
It's multiple data coming in.
The question is the single instruction.
I don't think that works.
It's more like task parallelism, not data parallelism.
Maybe if you just get the same answer to each question, we should be good.
Right. So as you might have guessed by now, our topic for today is SIMD, which stands for single instruction, multiple data. So Matthias, do you want to just give us a very quick introduction to what SIMD is and why we should care?
Sure, sure. So SIMD has been a part of computer architecture since, like, the 70s, but nowadays it's a bit different. Back then it was vector computers; nowadays we'd rather call it short vectors. So what happens is that you have a register in your computer, or many of them, that can hold multiple values, and then instructions that do the same thing that they would do on a single value, but on multiple values at the same time. So for example, you have an integer and you do an add. Now, instead of doing it on just one integer value, you put multiple integer values into a bigger register, and then you say, do a SIMD add, and then it will do that element-wise.
So it does an addition for the first integer pair, and for the second integer pair, because you're passing it two registers, of course, in order to do a binary operation, and so on.
So you get a higher throughput of computing.
And nowadays, most CPUs actually implement them in a way that the throughput and latency of the instructions (how long does it take to get an answer, and how long do you have to wait before you can issue a new instruction) are the same for the scalar and the SIMD instructions of your CPU.
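As a minimal illustration of the element-wise add Matthias describes, here is a hedged sketch using the Parallelism TS 2 types that GCC ships in <experimental/simd> (not the exact interface being proposed for the standard):

```cpp
#include <experimental/simd>
#include <array>
#include <cstddef>
#include <cstdio>

namespace stdx = std::experimental;

int main() {
    using intv = stdx::native_simd<int>;

    // Two arrays of plain ints, exactly one register wide.
    std::array<int, intv::size()> lhs{}, rhs{};
    for (std::size_t i = 0; i < intv::size(); ++i) {
        lhs[i] = static_cast<int>(i);
        rhs[i] = static_cast<int>(10 * i);
    }

    // Load a whole register's worth of values at once...
    intv a(lhs.data(), stdx::element_aligned);
    intv b(rhs.data(), stdx::element_aligned);

    // ...and add all lanes with a single expression (one SIMD add).
    intv c = a + b;

    for (std::size_t i = 0; i < c.size(); ++i)
        std::printf("%d ", static_cast<int>(c[i]));
    std::printf("\n");
}
```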
And this can get worse because it's not just SIMD that is parallelism in the CPU on a single core.
There's also instruction level parallelism.
Most of this, the compiler will try to do as much as possible to let it run in parallel, even though it's not explicit.
It's all implicit.
But in order for instruction-level parallelism to work, you need independent instruction streams.
So you need to be, or the CPU needs to be able to, issue an operation in one cycle and the next operation in the next cycle, and so on, one operation at each cycle, or rather two if you look at modern CPUs. You can do, like, two FMAs, fused multiply-add instructions, and then it takes like seven cycles or something until you get an answer. So that means you need seven independent instruction streams in order to be able to issue an instruction at every single clock cycle or something.
So the example I showed at CppCon and also last time in Kona
for the C++ committee is that in an extreme case,
you have a parallelism of a factor of 128
on a single thread, on a single core,
on a modern CPU.
Actually, not so modern.
It's a Skylake with AVX-512,
so Skylake is like old by now.
But that's a factor of 128 of parallelism, right?
Parallelism on a single-threaded execution stream.
This is like the most extreme example you can build up,
but there's like so much parallelism, single-threaded,
and it's not used in most of the cases.
And if you really would be able to take it
all over your program where you're doing computing, and not just isolate it in a few parts, there are huge gains.
Yeah, I like the tagline on your blog site.
There's too much unused parallelism on a single core.
It sums up what you just said. Yeah. Yeah, it always makes me sad looking at code
or rather disassembling code
and seeing how it's not using the CPU's resources.
But I mean, what sorts of programs or algorithms
can actually benefit from SIMD then?
Is it just to do with number crunching
like you do in the numerics group?
No, I mean, it's all over the place. Whenever you see a loop going over a container, there is probably good potential for using SIMD. And a string, in a way, is a container of chars, and when you're looping over a string, yeah, you really want to do that with SIMD.
Now, most of the low-level libraries like libc, string operations, they have SIMD implementations nowadays.
But using a simple approach, using a SIMD type, you can write it directly.
And it's sometimes even faster than those implementations
because it can get inlined
whereas the libc implementation
needs to do an indirect jump
because it has to work on
all kinds of different platforms
and so it first has to figure out
what do I have available
and then it jumps to the best implementation
and from then on
it directly takes that indirect jump,
but it's always an indirect jump in a function call.
Whereas if you use SIMD types in C++,
you can do that directly without any function call,
and so we can get even faster.
So I don't see SIMD being a niche thing to do.
In principle, it's applicable all over the place.
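For a concrete flavor of the string case, here's a hedged sketch, again using the Parallelism TS 2 types from <experimental/simd> rather than the proposed std::simd: count occurrences of a character, a full register at a time, with a scalar tail for the leftovers.

```cpp
#include <experimental/simd>
#include <string_view>
#include <cstddef>
#include <cstdio>

namespace stdx = std::experimental;

std::size_t count_char(std::string_view s, char needle) {
    using V = stdx::native_simd<char>;
    std::size_t count = 0;
    std::size_t i = 0;
    // Full-width chunks: load, compare all lanes at once, reduce the mask.
    for (; i + V::size() <= s.size(); i += V::size()) {
        V chunk(&s[i], stdx::element_aligned);
        count += stdx::popcount(chunk == needle);
    }
    // Scalar tail (the "epilogue" that comes up later in the episode).
    for (; i < s.size(); ++i)
        count += (s[i] == needle);
    return count;
}

int main() {
    std::printf("%zu\n", count_char("hello, simd world", 'l'));
}
```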
Right.
But it's not used all over the place.
No, it's not.
So is the reason for that that it's just complex to get into?
Well, one of the things is you need to think a bit differently, and that takes some getting used to. In principle, at this point, just say: all the types that you know, the basic types, all the integers, all the floating points, just don't ever use them anymore, just put them in a SIMD. That's what you want to do.
And so the other reason why it's not used all over the place is because it's just hard to program, to express. At this point, you either write a loop and hope for the best; you write a loop and try to annotate it with "I know this can go in parallel" and still hope for the best, because it's not forcing the compiler to do anything; you write intrinsics, and it's completely unreadable and unportable; or you look for some kind of library, especially as we have them here in our science codes where we have to say, these have to live for 20 years. Who can guarantee us that this will be around and working still in 20 years? So there's like, how can you trust that going forward?
And that is one of the reasons why I am working with the C++ committee on this front, because I think this should be vocabulary type.
Something that where you can express that intent of, I just want to be able to scale this to any kind of parallelism here.
How it's then implemented is not that important.
I want to express that data parallelism.
And that is just not there in the language at this point.
So you just preempted several of the questions that I wanted to ask.
That's really great.
Thank you.
So the system works.
We've been doing it in parallel.
So it's like speculative execution. It's another good CPU feature. But anyway.
Oh, yeah.
I have lots of experience there.
I mean, I've done astrophysics and then audio.
So I've done some SIMD as well. And yeah, it's basically, as you said, you either rely on the autovectorizer or you write your own intrinsics or then you rely on some kind of library.
And I think you basically just said that none of these are really great solutions, right?
Because if you rely on the autovectorizer,
then it might just not autovectorize it or it might.
And then you add like a tiny change
and then everything falls apart
and it suddenly is like eight times slower.
Maintenance of autovectorization is a hard problem.
Yeah, yeah, yeah.
And then, you know, writing SIMD instructions,
I think, yeah, you said it's not portable.
The other problem is that sometimes it's actually slower
than the code that the autovectorizer produces,
even though you think it's not,
but then you benchmark it and it's actually really slow.
I've actually seen that in production as well.
That's one of the other tricks you don't realize.
What I do in my implementation is trying to help the compiler
to understand what's going on.
So when you write intrinsics, you sometimes wave your magic wand
and say, compiler, forget what I'm doing.
I'm telling you what to do.
And the compiler says, okay, I don't see through all these intrinsics.
I'm going to do what you say. Whereas in the SIMD library implementation, I try to make sure that all
those places where that happens, I give the compiler the path out. And when the compiler
says, I actually know the values of these parameters coming in, I'm not going to use
the intrinsics for implementing them
because then the compiler will not know what's going on.
I will choose a different path
so that the compiler can still optimize through
and do constant propagation all over the place.
And that is something that you don't want to write
in your own code anywhere
because that's like you write everything
three times or four times
i do that for you in in the simd implementation and we can go into testing now because testing
this is like a nightmare um i have i can write the same code and i then have to test it in all
kinds of different like is it constant expiry? Is one of them a constant propagated value?
Is the other one?
Are both of them?
None of them.
And how do I tell the compiler
that it's not constant propagated
when I actually want to test something
and I write the value right there?
So when I began testing things,
the compiler would just throw away everything I wrote
and say, okay, return zero, you're good.
And none of the code that I wrote
was actually executed when I ran the tests.
And so I didn't see some of the bugs
because the constant propagation just did the right thing.
But there were actual bugs when it was executing
when it didn't know the values.
So I had to invent all kinds of ways of how to test things properly
in all these different scenarios,
because bugs happen all over the place.
And sometimes they happen in constant propagation.
I mean, I was just debugging for like three days now
for why a left shift on integers is wrong.
Because I'm doing some kind of trick
using floating point exponents
in order to do the shifting more efficiently.
And what happens if you have one shifted by 31?
So like the highest bit of an unsigned 32-bit integer is set, right?
Yep.
And then you say that is a float value.
So that's a conversion without loss.
And then you convert that float value back to a signed integer.
So the integral instruction that does it
and the corresponding intrinsic documents exactly
what happens you get the same integer back that you put in as a value for the float right but when
it's constant propagated the compiler says well i can do something else this is not specified how
this has to happen and what it actually does it saturates so you get max int it's like the value
minus one no it's not what the hardware does but it's what the c++ language allows the compiler to
do because like it can just assume that certain things can't happen because it would be ub yes
yes right yeah so even though even though i was using an intrinsic that was documented to work
that way the compiler said, I know that intrinsic,
it does a conversion from loading point to integer.
I know how to concept propagate that.
And so I'm going to give you something else.
Finding these kinds of bugs is like...
So even if you write intrinsics, basically you don't really know.
No, you don't have a guarantee that you actually get the instruction that's behind that.
So if you look at the headers that define all the intrinsics, some of them are implemented as built-in IA32 something, whatever it calls.
The compiler then has no idea what's going on. But some of them are actually implemented as different built-ins
where the compiler has its own notion
of SIMD vector types
and then it can do operations on those.
And so some of these intrinsics
are actually implemented using this
because then the optimizer understands
what's going on
and can do better optimizations.
And so you're never sure
when you call an intrinsic
without looking what's going to happen.
And there are many surprises hidden in there.
So it's one of the reasons why I don't think
many people should be implementing this.
It's so full of pitfalls.
And there are bugs all over the place in compilers. It is pretty good and solid by now, but I've been working on this since, like, 2009, and there were bugs all over the place, and people shouldn't have been using that back then. It was not reliable enough.
Right. So if you don't want to write intrinsics, which the user shouldn't do, I guess you can go two ways, right? You can just write inline assembly, I guess, that's basically one way out of it. Or the other way out of it would be to go higher on the abstraction level and just use a SIMD library. But you know, 10 years ago, when I got into music tech, there was Boost.SIMD, and then I was working at a relatively large music tech company that had their own in-house SIMD library that was a pain to maintain. Nowadays there's a lot more stuff, we have Highway, xsimd, and a whole spectrum of other SIMD libraries, and they all kind of have different paradigms, different trade-offs. So it's kind of unclear, you know, how to approach this. I guess your solution is, that's why we need a standard, right?
Yeah. So I don't want any of these to go away. What I hope will happen there is that the standard SIMD type will allow interfaces that work across library boundaries. And then in a library, you can implement that with a different library, because you just take a std::simd, as I want to define it, and you bit-cast it to whatever else you want.
You can just bitcast it to an intrinsic type or something else,
and then you go from there.
And so that means you can have interfaces that interact,
that work with each other,
but implementation-wise, you're not fixed
to just using the standard implementation.
You could also do something else. So I don't see that as a competition. With std::simd,
I don't want to say, okay, you don't need any of these anymore. They are doing innovation
in areas that I would like to be able to do, but I have to focus on getting the basis right here.
Okay, so a few episodes back,
I think Phil was on his vacation.
I did an episode with Mark Gillard
about his library Sogen,
which gives you like auto-generated,
which gives you the ability to like
make either struct of arrays or array of
structs and kind of have an automated way of like changing one,
one to the other.
What's your opinion on that?
Like,
because that's ultimately also about enabling,
enabling SIMD optimizations,
right?
And Mark was saying that he was getting lots of,
you know,
performance improvements just by switching from array of structs to
struct of arrays and then
letting the autovectorizer do its thing. So is that something that we should be doing, or what's
your take on this? So first of all, maybe explain the reason why this is important. So what happens
is that when you look at those SIMD registers, they have all these values next to each other in the register.
And in order to be efficient in putting those values there, you want to have them close by in memory so that you can load them all at once. So if they are all in a single cache line
in the correct order, then you issue a so-called load instruction, and then it just copies the whole thing into the
register, and then you can work. Now, if you have a struct, and then you have like in the struct an
x, y, and a z, and then you have an array of those, and you want all the x's in one SIMD register,
then the x's are not next to each other in memory, and you cannot do a SIMD load instruction.
And that's often where the compiler then says,
okay, that's not going to be efficient
because the load will be so slow
in order to get all the values into the SIMD register
that the speedup that I can get
from actually then working in parallel is not worth it.
And so I'm not going to vectorize this.
But when you reorder them
to have all the X next to each other in memory,
then the autovectorizer says,
okay, that's just not costing me anything to do that in a vector.
And then I'm going to get some speed up from vectorizing.
So then the autovectorizer kicks in.
And the interesting thing here is that you're also losing something. Like if you have the case of an X, Y, Z point, and you always need to process X, Y, and Z together,
then you want to have that locality, right?
When you say, I'm going to access this object,
then all of these need to be in the cache at that time.
Now, if you have a struct of array, you have three streams that the hardware prefetcher now has to look into.
And typically that works well enough.
You might get into cache associativity issues for more things.
So that when you're actually starting to load from one of those
arrays, you're evicting stuff from
another one.
So a library that is smart there
makes sure that the distances are
not powers of two
between those arrays.
But otherwise,
yeah, for
SIMD, that can give you a significant
speedup. But now the story is not as simple.
And you cannot, in general, say what is better.
Because often you have IO.
You have things coming in.
And they don't come in as struct of arrays, typically.
So the first thing you would have to do is you have to go over all your memory, reshuffle the things.
Then you can do efficient work.
And then for output, you have to reshuffle everything again, and then you're done.
And now it's so you have to measure the whole thing.
Is it now faster with all the reshuffling at the input and output?
And so sometimes you have just problems where input and output is important in terms of
dominating the cost of runtime. So then it makes sense to actually vectorize on an array of structs.
But in order to do that efficiently,
you'd need to do some tricks that the compiler,
the auto-vectorizer, I've never seen it do.
I've shown in my CPCON talk that I can do it manually
and I can implement it for you,
giving you like a std for each
or a std transform algorithm implementation
that goes over an array of scalar structs
and it passes you the whole thing as SIMDs.
Then you can do efficient calculation
and even store it back.
So it's not like do that and you get better vectorization.
I can't tell you that.
There's even a third one, the one that I like to use most.
That's an array of a vectorized struct.
So again, to the-
An array of a struct of an array, essentially?
Yeah, but that last array is like just a SIMD.
Oh, so array of structs of SIMD packs.
Yeah, so again, of the example of XYZ,
now instead of having that being a float,
make it a SIMD float.
So that's what I call a vectorized struct,
because now instead of just having a single struct, I have a vector, a SIMD vector struct, that I can efficiently work with, because the x that I access is in memory in the correct order, so that it's a single SIMD load. And then, because I need to work with more values than just whatever the SIMD width of the hardware is, I have an array of the whole thing.
And then it's really simple just to iterate over the whole array and work with each of those things.
And in terms of the code you write, you write it, the algorithm, you write it once for the scalar case.
Then you vectorize the struct and you, except for branches, you don't have to touch the code.
It just compiles with a vectorized struct.
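To make the three layouts concrete, a hedged sketch (again with the TS types from <experimental/simd>; the names are just for illustration):

```cpp
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;

// Array of structs (AoS): x, y, z interleaved in memory.
struct PointAoS { float x, y, z; };
using PointsAoS = std::vector<PointAoS>;

// Struct of arrays (SoA): all x's contiguous, then all y's, then all z's.
struct PointsSoA { std::vector<float> x, y, z; };

// "Vectorized struct" (AoSoA): each member is one SIMD register wide;
// an array of these keeps each component contiguous within a chunk,
// so loading the x's of a chunk is a single SIMD load.
struct PointChunk { floatv x, y, z; };
using PointsAoSoA = std::vector<PointChunk>;
```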
So I think we can summarize all that by just saying it depends.
Sure, yes.
Or perhaps more importantly,
yeah, these libraries will help you,
but you still have to understand
what's going on under the hood
to know what the best thing to use is.
My takeaway from all I've learned is
vectorizing algorithms is almost trivial, but getting speed out of them is a question of data structures.
And getting data structures right is the actual challenge of SIMD.
Right.
Okay, so there's no free lunch.
I actually want to transition to the standardization effort. You've hinted at it a couple of times already, and I wanted to dig a bit deeper. But before we do that, it's a good time to have our sponsor break.
We are once again sponsored by Sonar, the home of clean code. SonarLint is a free plugin for your IDE that helps you to find and fix bugs and security issues from the moment you start writing code. I'm not sure if it will find SIMD bugs though. I'll have to find out about that.
You can also add SonarQube or SonarCloud to extend your CI/CD pipeline and enable your whole
team to deliver clean code consistently and efficiently on every check-in or pull request.
SonarCloud is completely free for open source projects and integrates with all of the main cloud DevOps platforms. So back to SIMD and getting into the C++ standard. What is the
current state of, I think it's P1928, is it? Yeah, that's the paper that is proposing the
merge of the TS into the standard. So the goal there was to not increase the scope,
the feature set, but keep it at what the TS was
and incorporate the feedback of what to improve
and also to look at language features,
what is now possible that wasn't possible back then
or integration with library.
The TS depends on C++ 17.
So we didn't have ranges then.
We didn't have contiguous iterator.
And that is like so important for SIMD.
I don't care for random access.
I care for contiguous memory if I do SIMD.
So that was something that I wanted to incorporate there
to generalize the interfaces a bit.
So yeah,
that passed the library evolution working group back in Varna in June.
And so we're in library wording with that paper now. In terms of design review, that means we're done. There might be a bit of an issue going forward without adding more papers or more features, because some people in the committee think, and I mostly agree, that the feature set is not enough for C++26. If that is all, then it shouldn't be in 26; there should be more in order to be good enough for seeing the light of day for the first time.
I totally agree there should be more.
And so I will try to focus my efforts on getting more features in and then getting all these through library review.
But we're doing most of the things in parallel as we can.
So we have a pipeline and out of order in the execution here.
Yeah.
The problem is that I'm the bottleneck in many cases.
So which are the kinds of features that you have in the paper
and which are the kinds of features
that you feel like are missing
but really need to be in there
to have a meaningful offer for C++ 26?
So I see the TS
and also the merge of the TS feature set
as the low-level thing
that you need to have
in order to do anything on top.
And 10 years ago, I was working on higher level things.
I haven't been since then because I've been at the committee working on this low level
stuff and trying to get that right and also implementing it.
And I really think we need the high-level interfaces, especially for the memory access. Like, I don't want people to do the loads and stores themselves; it's a bit of a footgun in terms of out-of-bounds access. So consider you have any range. You don't know the exact size.
And now you want to iterate over that with SIMDs.
The SIMD has a fixed chunk size.
So typically that doesn't match the number of elements you have in the range
and the number of elements you would then process if you go over it by SIMD.
So you always need the SIMD epilogue,
where you make sure that at the end where there's this mismatch of numbers, you don't get an out-of-bounds access on the array.
So at this point, if all you have is that SIMD types and the loads and stores operations that it provides, then you always have to write this epilogue yourself.
And you always have to make sure that you don't forget it, because otherwise you're getting out-of-bounds access. And I would like to provide
the necessary
interfaces
to just
give me a range, and I will
give you
the SIMDs to work with,
and at the end of that range,
those SIMDs might get smaller and smaller,
just until you're done.
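A hedged sketch of the kind of helper being described, not a proposed interface: it walks a contiguous range in full SIMD chunks and handles the tail itself, so the caller never writes the epilogue (here the tail simply falls back to width-1 chunks, and the callback has to be generic):

```cpp
#include <experimental/simd>
#include <span>
#include <cstddef>

namespace stdx = std::experimental;

template <typename T, typename F>
void simd_for_each(std::span<const T> data, F&& f) {
    using V = stdx::native_simd<T>;
    std::size_t i = 0;
    // Full-width chunks.
    for (; i + V::size() <= data.size(); i += V::size())
        f(V(&data[i], stdx::element_aligned));
    // Epilogue: smaller chunks (here simply one element at a time) so we
    // never load past the end of the range.
    for (; i < data.size(); ++i)
        f(stdx::fixed_size_simd<T, 1>(&data[i], stdx::element_aligned));
}
```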
So, I would really like to see that in 26,
but at this point, time is already running out.
I don't think I get that ready for 26.
I have to focus on getting all the low-level interfaces right and complete.
And there were always some things missing,
like permutations were missing, the gather and scatter access.
So when you have to do random access into memory,
there is some hardware support to do so,
though it's not actually making things much faster.
It's a bit of help, but yeah.
So the other question that came up
and that I will have to investigate
is the UB that you get on loads and stores
when you're doing this out-of-bounds access,
whether we can push that UB out of SIMD
and up to the user
so that the user basically would have to give me
like a span of static extent or something
and say, you can load from that whole thing here. And then if the user, with the span of static extent, did the out-of-bounds access, that's the user's problem, and SIMD just did what it was told.
That there is, I mean, for the static extent, it's simple.
There is also the question of passing a range of dynamic extent.
But then when you do bounds checking at runtime in a SIMD low-level interface,
that kind of goes against the idea of this is a low-level performance primitive.
So if we do that, it has to be without cost, at least in the scenarios that are interesting.
So there is some homework I can and have to do.
I've already done some compiler exploration of some test cases.
I already did a few benchmarks.
And so there are some cases where all these checks can get optimized away.
And that's certainly interesting to use.
So the feature set basically consists of,
you can do almost anything that you can do with an int or a float in terms of all the operations that are there.
In terms of cmath, what is still missing are some of the integer operations that we got since C++17, like the bit operations.
But they have papers.
So in many cases, you just replace the type from an int to a simd<int> and things continue working. The biggest thing that I'm missing is expressing conditionals in a way that works for an int and a simd<int> with the same syntax. An if statement obviously can work.
If you do a comparison on SIMDs,
you get multiple answers
because you always work element-wise.
And the elements are not related.
There is no single answer.
So what you need to do in SIMD code
is not to branch,
but you execute both branches
and then you blend the results
of those branches together
into the result that you were looking for.
This is also what happens if you do any kind of CUDA programming, and that's translated to a GPU or potentially OpenCL, and that's translated to SIMD, which is in theory possible.
And it would have to do exactly that for if statements.
It executes both branches,
but parts of the execution units just don't do work when a branch is executed,
and then the results are blended together. And so the natural thing to do there for C++ would be to
use the conditional operator, question mark, colon.
So you can write a condition on the left-hand side,
and then you get either the second or the third argument. And for a SIMD, that means you would evaluate both the second and the third,
and then element-wise apply the question mark.
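In the TS as it exists today, that blend is spelled with where() rather than the conditional operator; a minimal sketch:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;

// Branch-free "if": compute element-wise, then overwrite only the lanes
// where the condition holds. The wish described here is for (cond ? a : b)
// to work directly on SIMD types.
floatv clamp_negative_to_zero(floatv x) {
    floatv result = x;
    where(x < 0.f, result) = 0.f;  // only lanes with x < 0 are changed
    return result;
}
```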
So this aspect, actually, I remember when I was in Kona,
I wasn't really there when you were discussing SIMD.
I was busy with contracts and other things.
But then later I went to this bar, the Shark Shack,
and everybody there was discussing this thing where comparisons
and things like operator equals equals,
where,
as you said,
in std::simd, operator equals equals actually returns a bit mask.
And like,
because you get like a different result for every element,
like what you just explained basically.
And then all the like non-SIMD people were saying,
oh,
but that just breaks how types work in C++.
Types need to be regular; operator equals equals, like, needs to return a bool.
Otherwise the type system breaks.
And like, if you want to do anything else,
it needs to be like a named function or something.
So it was such a hotly debated topic.
I'm just curious what your take is on that.
Yes, it's a very important topic to me
because the way I understand it,
I wrote a blog post about it after the Kona meeting
about regularity in this question, right? It's about the things that Stepanov wrote and how you
can use types to do any kind of reasoning. So for example, associativity, you define associativity
by being able to say, well, the left-hand side is equal to
the right-hand side by just putting the parentheses at a different place. But if equality doesn't work,
then you cannot define associativity. Now for SIMD, if you would say everything happens element-wise
and we ignore comparisons for now, you would still say, right, it's the same answer that I get.
So it is associative.
If I put the parentheses on the A plus B or on the B plus C, that doesn't make a difference.
At least for integers.
So it is associative.
But equality doesn't give me a Boolean.
It's not regular. So according to what Stepanov wrote in his book, it appears like, okay, then it's not regular. But the way I see it, the value type of a SIMD int is the int.
I need to preserve regularity of that single int. Now, you can consider what you do with a SIMD.
Like a very simple transformation is you take a loop over four elements, And then you do something with it and you do a comparison in there.
And now you translate that into a SIMD
by saying, okay, all these were independent.
They had nothing to do with each other.
Each of those four iterations,
they were running independently.
Now you translate that to using a SIMD
because that's the idea of a SIMD.
These things are running independently
so I can just do the same operations,
the same instructions, but on multiple data.
But now if you come to the comparison, and that comparison now suddenly mashes all of
them together and they become dependent, then it's not a valid transformation anymore.
Because now you would say
every four elements next to each other,
they interact with each other.
And the way that one compares
has an influence on how the next three compare.
That doesn't sound right to me.
You now break regularity of the elements
if the whole SIMD comparison returns a single boolean.
Okay, so basically operator equals equals needs to return a bitmask for SIMD types.
In a way, it's a bitmask, right?
It's a bit more abstract.
I'm not saying how it's represented and whether it's actually a bitmask, but in spirit, it
is one, yes.
And so when you do a comparison on a SIMD, you do multiple comparisons.
And whenever you do multiple things and you want a single answer, that means you need a reduction.
And so you can reduce the answer of the comparison and say, are all of them true?
Is any of them true?
Is none of them true?
And then you get a Boolean and then you can branch on that. And this is also what you do in SIMD code. You typically don't branch unless you know my data is biased to this
side. And so it is possible that in many cases, none of them are true,
even though there are 16 in there.
And then you can do a branch like,
otherwise do something slower
and don't execute everything.
And then in the end, when you blend it together,
throw everything away again.
So sometimes you do branches in SIMD code,
but in many cases you don't.
And so having the comparison reflect that, I think, is the correct behavior to teach how to use SIMD correctly in terms of performance.
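In code, with the TS types, the reduction step looks roughly like this:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;
using floatv = stdx::native_simd<float>;

bool any_negative(floatv x) {
    auto mask = x < 0.f;        // element-wise comparison: one answer per lane
    return stdx::any_of(mask);  // reduce to a single bool before branching
    // also available: all_of, none_of, some_of, popcount
}
```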
And from the concepts side, I see it as SIMD is a generalization of what we do with all the types that we have in the language.
So an int is regular.
And the generalization of that is the SIMD int that is element-wise regular.
You can reduce it to a single answer.
Now, you can do that also with a single answer. You can reduce a single answer to a single answer. Now, you can do that also with a single answer.
You can reduce a single answer to a single answer trivially.
It's just the same, right?
So the scalar case is like the special case of the SIMD.
The SIMD is more general in how you would have to define those concepts.
I'm not ready to write down all those concepts
or propose any of them as actual C++ concepts,
but in the way that I program them in my code, it's exactly what happens.
I want to be able to say, well, this function, I need an actual regular type because I'm
going to branch on it on comparisons.
But in other cases, I want to be able to say, I can work with a parallel regular type because
I'm going to reduce in the cases where I actually need to branch.
But otherwise I don't need to branch
and I can just go on.
And it's more general in that sense.
Right. So before you wrap up,
I would like to pick your brain
on like two particular things
that I've encountered
as like things you have to deal with
when you actually do SIMD.
And like in my case,
it wasn't like audio
stuff so one one and i'm just really curious how the the standard proposal deals with those things
like one is uh kind of can i use to simd like with my own types for example can i have a std simd of
a std complex or whatever like std okay std complex is actually a bad example because it forces you to
like zero initialize everything so you know nobody no high performance people actually use that like i have my own like simd type which
you know lets me like have the things uninitialized if i so i don't have to pay for like initializing
it like but then complex is the only type we have so far so that's another proposal for
so that's another problem but if i have like my own custom type can i just plug that
into stetsimdy uh like that's one question yeah that that is a very important thing and i've been
solving that problem also for 10 years and i'm nowhere near to standardizing any of that
so yes like i said i i have these vectorized structs that i want to work with. And I have them all over the place in my program.
But I also want to be able to just go back to the scalar struct.
So I don't want to write those twice.
The way that you have to think about, like, complex is a good example.
You can have a complex of SIMD or a SIMD of complex.
And it should not be the same thing.
Because a SIMD of complex means that your elements are still complex, so they have a real and imaginary value inside.
So in terms of looking at the register, you have in the register real image, real image, real image.
It's interleaved.
Whereas if you say, I want a complex of SIMD,
you're saying my complex has a member real of type SIMD and a member image of type SIMD.
So in terms of looking at memory,
you have all the reals next to each other
and all the imaginaries next to each other.
So it's again, the SOE versus AOS problem, right?
Yeah, kind of.
I mean, it's now on just the struct level.
We're talking about this vectorized struct.
And a complex of SIMD would be what I call the vectorized struct.
So in most cases, you don't want a SIMD of your user-defined type
because that would mean interleaving that
and you would have to define all the meaning
of all the operations, things you would do with it.
But that said, there are things like fixed point
and all kinds of other interesting user-defined types
that we want to do that basically are just an int underneath.
Yeah, like an int with saturation math or whatever.
There's lots of them.
Lots of them.
And SG6 has a call for paper to give us these.
And yeah, they need to interact.
And that needs to work.
Then all these types that just have a single value type, basically,
they need to work out of the box.
And you somehow need to be able
to define what is the behavior of all
these operations of the operators,
of the operator overloads.
It's not obvious
how to do the
vectorization without having to rely on
an auto-vectorization,
but at least I've done it
for enums. For enums, it's
also just like, okay, now that I'm using SIMD, suddenly all the enums that I have to use, I have done it for enums. For enums, it's also just like,
okay, now that I'm using SIMD,
suddenly all the enums that I have to use,
I have to use an int instead.
I'm going back in time, right?
That's not useful.
So you want to be able to just continue using your enums and have a SIMD of your enum.
I also, I have implemented that.
I just don't have the time
to actually make a proposal out of it and get it into the standard at this point because I have to that. I just don't have the time to actually make a proposal out of it
and get it into the standard at this point
because I have to focus on the rest.
But yeah, those should work.
But all the ones that have multiple data members,
these are typically the ones that you don't want to make a SIMD of struct,
but you want to have a struct of SIMD.
So the way that I've always used them is to just write a template
and then I say template, typename, capital-F Float,
and then I define
all my float data members using that
template parameter and then I either
instantiate that with a float or with a SIMD
float.
And then that auto-vectorizes
even the functions in there
because now the functions,
the member functions that you define in that class,
they use that float template parameter.
And now by using just a scalar float, it's scalar. If you instead use a SIMD float, it auto-vectorizes in a sense, right?
It vectorizes the implementation.
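A minimal sketch of that pattern (illustrative names, TS types):

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Write the struct and its member functions once, templated on the
// floating-point type...
template <typename Float>
struct Point {
    Float x, y, z;
    Float length_squared() const { return x * x + y * y + z * z; }
};

// ...then instantiate it either with a scalar float (one point) or with a
// SIMD float (a register's worth of points); the member function
// "vectorizes" without touching its body.
using ScalarPoint = Point<float>;
using SimdPoint   = Point<stdx::native_simd<float>>;
```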
And I would really like to automate that.
And I would need powerful reflection support in the language to do it right.
At this point, I do struct reflection.
So you can give me a struct that is not a template. I will look at, okay, it has all these members of these types.
I can give you a std::tuple with std::simd types in it instead. I cannot reconstruct a struct, because I would have to name those members somehow. So all I can do is give you a tuple. So the proof of concept, in terms of I can do these transformations and they are useful, it's all there. But yeah, in order to be complete and nice, I really would have to work with reflection.
Well, the C++26,
kind of the reflection proposal
that's currently targeting C++26 gives you,
I think some of these capabilities you can have,
like you can unreflect with these splicers
and I think do this kind of stuff.
But like, we don't really have time to get into that more.
Before I let you go,
I have one more SIMD-specific question that I'm really curious about, and where std::simd is on that. So if I look at a std::simd of float, for example, the first question I'm asking myself is: how many floats are there actually in there? What's the width of this thing? Because that depends on the particular architecture that you're compiling for.
And so,
and things like audio,
video games,
or, you know,
stuff like that,
like you're actually targeting
consumer hardware.
So it's not like the kind of
probably physics simulations
that you do,
but you know exactly like
what hardware you run on,
but you run on
somebody's laptop
from 10 years ago, right?
And you have no idea
if it supports like AVX,
whatever,
or like ARM,
Neon, whatever, right? You just don't know.
So the way you do this for consumer software,
or like one approach to solve this for consumer software
is to say, okay, I'm gonna,
I have like my kernel with all the maths
that like my game or whatever needs to do.
And I'm gonna compile that for SSE3,
for SSE4, for AVX1, for AVX512, whatever.
And you have like these multiple kernels.
And then when your game starts up,
you like basically at runtime on the user's machine,
you ask, hey, does the CPU support this instruction?
And then you do kind of dynamic dispatch.
Yes.
And that actually can have an influence
of whether your std::simd of float has four floats
or eight floats or whatever.
And so, yeah, it's kind of annoying to do this.
And so I'm wondering if std simd gives us
anything better than that
or how you do that with std simd.
So at this point, I see no way
that the C++ standard can help you
because all we have is an abstract machine.
There is no notion of you compile
for different ISA extensions or not.
And in the problem, described, the one problem that never goes away is that you tell the compiler what instructions it can use in the binary.
And those, I mean, you need to have these as like different compilations and then dispatch at runtime there is no other way because
you have to say okay this is this is the set of instructions that you can use and if it's the
smallest one then it's not the most efficient one so you want the bigger ones and then you have to
do a runtime dispatch because otherwise the program will just stop with like unknown instruction yeah
yeah sure like on the binary level yeah but i was hoping that we don't have to do that myself anymore, you know,
that the library does that for me.
How do you do the dispatch, right?
And this is something
that I don't see the C++
standard solving.
This is something that tooling has
to solve.
So I see that either as
compilers that can solve
it by doing multiple passes over the whole thing they're compiling
and then emitting different binary outputs
and then somehow doing that dispatch.
It's hard, especially if you have headers.
Now, you have this attribute target clones on
GCC, where you say,
compile this function for many
different targets, and whenever it's called,
automatically do an ifunc resolve
to take the one that your
machine is running on, or supports.
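For reference, on GCC that attribute looks roughly like this (a sketch; the clone list is just an example):

```cpp
#include <cstddef>

// GCC emits one version of the function per listed target plus a resolver
// (an ifunc) that picks the best one when the program starts.
__attribute__((target_clones("default", "avx2", "avx512f")))
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;  // each clone is auto-vectorized for its own ISA
}
```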
That works
as long as it's not using anything
from headers that depend on
predefined macros that depend on these kinds of, is AVX enabled or not.
And so the SIMD header,
it has to be implemented using these macros.
So it has already been parsed.
These macros don't change any of the code
that has been seen before.
So that doesn't work.
The target clones just doesn't work.
There are a few ideas, but no compiler vendor that I've seen has tried to do anything in that area, because it's a huge change to make that work. What I see more realistically as a solution is, on the dynamic linker level, you compile your application into a dynamic library and put it in a subdirectory for the target machine.
And then on that target machine, you have just like a small main function that then is linked to the library. And the
dynamic linker, when it looks for the library name, it first goes through all those subdirectories,
knowing what architectures it supports, ignoring the subdirectories that it doesn't support.
And in the first subdirectory where there is a library that matches that name, it will take that
instead of the generic one. And so that way you have the fallback on the linker level, or you have the dispatch on the
linker level. And when all the linking gets resolved, which are kind of indirect calls anyway,
then you're done. And so you don't have additional indirection anymore. So I see that as one of the most interesting solutions, but it still requires tooling work, because you have to compile it multiple times into all these libraries that have the same name and then install them in those subdirectories. It would be nicer if tooling would just allow you to say, okay, this one switch, do it for me.
I don't have to repeat everything.
But you need to have some notion of the target machine in the library, right?
Like, what is the size of a std::simd of float?
I mean, that's going to depend on the target machine, right?
Yeah, so that is a kind of orthogonal problem. The way I like to work
is to explicitly
not know the number of elements in there.
That's how I like to write my algorithms.
So that means
whenever you start thinking about
mySIMD has four elements,
you're probably encoding
that information into your algorithm and it doesn't scale to wider architectures.
So I rather want the SIMD users to think as I cannot know the number of elements in there and I have to write my algorithm in a way that I don't have to know.
And then it's scalable and works most efficient on most machines.
And it typically also leads you to using the right parallelism.
People that think my vector has four elements
like to put like X, Y, and Z in there.
And they take away from instruction level parallelism put it into simd and in total get a slowdown because they thought oh well i have
four elements available using three is efficient but if i tell you you don't know how many you
have in there that typically doesn't work for them they will not do that anymore then they
will look for okay I need to take multiple
points, take all the
Xs, put them in one SIMD,
take all the Ys, put them in one SIMD.
And that scales and is
way more efficient anyway.
So I acknowledge
there are problems where you need to know the
size, and I've had a problem myself
where the size was 12,
not even a power of two
and it was really useful to have an implementation that said, okay, 12, I can implement that as one AVX vector and one SSE vector. And yeah, sometimes there is just no other or better way to do it, and then you have to be able to say that. And std::simd allows you to say, I need 12.
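In the TS spelling, the two options look like this:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Width-agnostic: however many floats the target's registers hold.
using floatv  = stdx::native_simd<float>;

// Explicit width for the cases where the problem dictates it; the
// implementation decides how to map 12 lanes onto the hardware
// (e.g. one AVX register plus one SSE register on x86).
using float12 = stdx::fixed_size_simd<float, 12>;
```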
But for most of the time, start from, I cannot know how many I'm going to get.
That's how I have to write my algorithm.
That's how I have to write my data structures.
Right.
So you mentioned a situation where the runtime can be longer than expected.
And that is true of our recording today.
So we are going to have to wrap up there.
But I think we could have carried on talking for a long time.
So thanks very much for coming on the show
and for everything you've been telling us.
Is there anything else you do want to let us know
before we do let you go?
Or perhaps how to find you?
Oh.
People want to find out more.
How to find me.
Yeah, you can surely find me.
The bigger problem is me answering sometimes.
So I'm on Mastodon.
I'm not posting a lot, but you can find me there.
You can find me on GitHub, which is probably the best to follow also
what I'm doing.
I'm pushing my
paperwork to the committee up into
repositories there. I have
a feedback repository
on the TS that is
used to capture everything
we want to put into 26.
I have a SIMD
prototype for a C++ 26 repository
in there and some other work.
So GitHub might be
most interesting in order to see what I'm
doing and also to give me some feedback on
some of those repositories.
Great. We'll put those links in the
show notes or on the website.
So if you want to
reach out to Matthias, you'll know
where to go.
So again, thank you very much for coming on the show.
Yeah, thanks for the nice and interesting talk.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in.
Or if you have a suggestion for a guest or a topic, we'd love to hear about that too.
You can email all your thoughts to feedback@cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil individually on Twitter or Mastodon. All those links, as well as the show notes, can be found on the podcast website at cppcast.com.