CppCast - SIMD

Episode Date: December 15, 2023

Matthias Kretz joins Phil and Timur. Matthias talks about SIMD, including what it is, how it works, and what it's useful for. We also discuss his proposal to introduce SIMD vocabulary types and functionality into the C++ standard and how it relates to what was in the Parallelism TS.

News:
- MISRA C++ 2023 published
- Sonar webinar on MISRA C++ 2023 with Andreas Weis
- nlohmann/json 3.11.3 released
- reflect-cpp - Reflection library for C++20

Links:
- P1928R7 - "std::simd — merge data-parallel types from the Parallelism TS 2"
- Matthias' CppCon 2023 talk on std::simd

Transcript
Starting point is 00:00:00 Episode 372 of CppCast with guest Matthias Kretz, recorded 8th September 2023. This episode is sponsored by Sonar, the home of clean code. In this episode, we talk about the new MISRA C++ 2023 guidelines, a new release of the nlohmann/json library, and a new C++ reflection library. Then we are joined by Matthias Kretz. Matthias talks to us about SIMD and his proposal to standardize SIMD for C++26. Welcome to episode 372 of CppCast, the first podcast for C++ developers by C++ developers. I'm your host, Timur Doumler, joined by my co-host, Phil Nash.
Starting point is 00:01:05 Phil, how are you doing today? I'm all right, Timur. How are you doing? I'm not too bad, actually. Got some sleep, got a coffee mug here in front of me, so I'm good. Yeah, if you're getting sleep, that's progress, I think. Yes, it's definitely improved a lot. Again, if anybody has kids, you know how it goes, but we're two months in now and it's definitely getting a lot better. Been there, done that. Yeah.
Starting point is 00:01:28 All right. So how are you, Phil? Yeah, I'm back at home this week. I was traveling last week. So apologies for the bad audio last week. I was trying out a new travel mic. It wasn't as good as I'd hoped. So back on my good mic this week.
Starting point is 00:01:41 As we record next week — by the time this is released — I'll be traveling again. So you get the benefit of my good mic this week. All right. Well, that's good. At the top of every episode, we'd like to read a piece of feedback. This time, actually, there wasn't much feedback on the show itself, but there were quite a few comments from people on X and on Reddit about how they like doctest more than Catch
Starting point is 00:02:02 and how they like Rust more than C++. So, yeah, thank you very much to everybody for all your comments. It's apparently quite a hot debate, both of those things, which is not over. So thanks for your comments on our show material. We do appreciate it. We'd like to hear your thoughts about the show as well.
Starting point is 00:02:19 And you can always reach out to us on X and Mastodon and LinkedIn or email us at feedback at cppcast.com. Joining us today is Matthias Kretz. Matthias began programming in primary school and got serious with C++ when he joined the development of the KDE 2 desktop in its alpha stages. Working on different GUI applications at first, moving on to library work all over the KDE core infrastructure. He studied physics in Heidelberg, Germany, and got into SIMD for his thesis. There he worked on porting parts of the online reconstruction software for a CERN experiment to the Intel Larrabee GPU.
Starting point is 00:02:57 This motivated his work on SIMD and VC and abstraction for expressing data parallelism via the type system. VC was the first solution of this kind with ESS-free software. His PhD in computer science at the University of Frankfurt was a continuation of his SIMD work, developing higher-level abstractions and vectorization of challenging problems. ITS has been contributing his SIMD work and his expertise in HPC and scientific computing in the C++ committee since 2013.
Starting point is 00:03:24 In 2022, he is the chair of the SG6, which is the numeric study group on the C++ Committee. He's also a contributor to GCC and has founded and chaired C++ user groups at Frankfurt Institute for Advanced Studies and at GSI Helmholtz Center for Heavy Iron Research. Matthias, welcome to the show. Hey, thanks for having me. So you worked in physics and you are chair of SG6, the numerics group. So you would be the perfect person to weigh in on our discussion of random numbers.
Starting point is 00:03:55 So is there such a thing as real random numbers and how does that affect C++? Wait, I thought we settled that last time, didn't we? Let's do it again. Timo made a good argument. And as someone who studied physics uh yes i support what timo said there even though i just i i tend to forget lots of what i learned in physics use it or lose it and i don't use physics and not enough so i i lose much of what i learned there but yeah i same i i finished my like phd in physics in like what 2012 or something and i
Starting point is 00:04:24 don't remember any of that stuff. Yeah. The only takeaways I learned how to program, and that's why I'm here, basically. That was actually my goal when I was targeting what I wanted to study. My goal in the end was really to do computer science, but I wanted to have more than just the computer science background. I wanted to have more than just the computer science background. I wanted to
Starting point is 00:04:45 have some science background and physics was like the most interesting there. All right. So we'll get more into your bio and your work and potentially also physics in just a few minutes. But first we have a couple of news articles to talk about. So feel free to comment on any of these. Okay. The first news item for today is that MISura c++ 2023 got published so that actually already happened in october but it was only posted on reddit last week so i'll consider it a piece of news for this week so i think that's one of those cases of there's a depends on your definition of published so i think technically it was published in october but um it wasn't publicly released until just last week and that that's why it appeared on Reddit then.
Starting point is 00:05:28 All right, so it is actually brand new information. Yeah. All right, so MISRA, for those of you who don't know, are guidelines for the use of C++ in critical systems. I think it's mostly used in automotive. MISRA is like a motor industry kind of industry association kind of thing. And so it tells you things like do not allocate dynamic memory or do not call a function recursively or do not do a bunch of other things that could lead to unsafe
Starting point is 00:05:50 and undefined behavior in C++, which is allowed in C++ but not in MISRA C++. And so there's also a bunch of static analysis tools that can enforce these guidelines to an extent and it's been waiting for an update for quite a while. So the previous MISRA guidelines were targeting C++03.
Starting point is 00:06:09 Now they're targeting C++17. There's a reasonably modern standard. They updated a lot of the rules. They also integrated AUTOSAR guidelines, which I think is the other well-known set of coding guidelines for safety-critical C++. It doesn't mean they just merged all of AUTOSAR in, but it means that they considered everything that's in there
Starting point is 00:06:27 and they decided whether what's in there fills an existing gap and belongs into MISRA as well. It replaces the previous version, which was from 2008. So yeah, I think it's quite interesting for those who work in kind of safety critical software. Phil, I think you have actually a virtual event coming up talking about what's new in MISRA 2023, don't you? So I do.
Starting point is 00:06:48 It's coming up as we record. Unfortunately, it will have already happened by the time this is released because we're recording it a week ahead. But I will put a link to the recording of the webinar in the show notes. That will be with Andreas Veess, who was on the show just a few episodes back,
Starting point is 00:07:04 going to be discussing the MISRA C++ 2023, as we've just been talking about, and what that really means and what's different. So one of the things, actually, you sort of covered it, but just to emphasize a bit, as well as considering every one of the AUTOSAR guidelines, they went back and reconsidered every one of the original MISRA C++ 2008 guidelines. Do they still make sense? Is there a better way to do it in modern C++? It really was a proper modernizing effort from many people who are themselves on the C++ stand as committee. So it's actually a really good update.
Starting point is 00:07:37 And you should watch the webinar. Right. Well, that sounds very interesting. I will definitely watch it. The second news item for today is that the NLoman JSON library, so that's the JSON library by Niels Loman, which I'm pretty sure is the most popular JSON library in C++, published a new version 3.11.3, which doesn't sound remarkable in itself because that replaces 3.11.2. But the remarkable thing about this is that it's actually the first release since 473 days. Like 3.11.2 was released on the 12th of August, 2022. So I think that's actually remarkable because if you look at that, it seems like,
Starting point is 00:08:19 oh, is this library still maintained? But yes, it is. So that's good to know that it is actively maintained because it is so popular. It has some new features. it allows like custom base class as a node customization points they have like an additional template parameter you can set a custom base class for your nomad colon colon json object which like stores like a json object so that's like a you can add kind of your own functionality to json node, like metadata or additional member functions like visitors or whatever. So that's good.
Starting point is 00:08:50 There's a few other features like they support Bazel now, they support Apple's Swift package manager, which I'm not entirely sure why you want that for a C++ library, but maybe I'm missing something. Is that about interop maybe? Yeah, I'm not sure either. It could be, but obviously, you know, we have a number of package managers in the C++ world now, so why not add one more? All right. So yeah, lots of bugs fixes.
Starting point is 00:09:15 So yeah, it's really good to see that this library is alive and kicking, considering how popular it is. I'm actually thinking, maybe you should try to get Niels on the show. What do you think, Phil? I think that would be a great idea. All right, all right. So we'll reach
Starting point is 00:09:28 out. Niels, if you're listening to this, we're coming for you. Also kind of tangentially related to JSON, but also a lot more related to reflection, which is what we were talking about last time. So we talked about the new proposal that's heading for CSS 26
Starting point is 00:09:44 about adding reflection to the language while we don't have reflection yet in language people still do reflection without proper language support using all kinds of hacks right and so there is a new reflection library that popped up on the internet called reflect cpp which can serialize the c++ struct to json and deserialize it again from JSON. And what it does is it actually includes the member names of your struct fields in the JSON file. And that's obviously something you need reflection for.
Starting point is 00:10:14 And usually this is done, like, if you don't have reflection, which we don't, you're using kind of compiler-specific macros and things like that. And ReflectCPP actually does it just using C++20 without any compiler-specific intrinsics, which I thought was interesting. So I was like, oh, how does this work?
Starting point is 00:10:32 So I had a bit of a closer look at it. And I'm not sure. I'm not sure. This is like a piece of information I needed to know. So what it does is, so it's C++20. It doesn't work in earlier it's C++20. It doesn't work in earlier standards. C++20 adds std source location
Starting point is 00:10:49 and it provides a function called std source location current like dot function name. So you can call that and that gives you the name of the current function that you're in. And so if the current function is a template, you will also get a template parameters passed in and stuff like that. And so if the current function is a template, you will also get a template parameters passed in and stuff like that.
Starting point is 00:11:06 And so the library then kind of declares the struct as an extern struct. And then somehow, I didn't quite understand what the machinery behind that is and why you need that extern. But somehow you can then basically pass pointers to
Starting point is 00:11:22 the fields of that struct that you declared as X10 to a function containing source, location, current, function name, and then the resulting function name kind of contains the name of the field. And so all you have to do at that point then is to kind of parse that string, and then you get the field name.
Starting point is 00:11:40 And you can also get the name of the struct itself using the same trick. So it's not strictly portable because the standard says that source location function name returns an implementation-defined string, which is supposed to contain the member name, but it's not guaranteed to contain the member name. It can contain whatever.
Starting point is 00:11:58 If it's actually implementation-defined, then you can look at the compiler documentation and do the right thing. But it's still no guarantee that it doesn't change from version to version right right but like so all major compilers do actually contain the memory name and that string so it actually does kind of work but yeah um i yeah i can't wait for actual reflection because as helpful as this is i mean the functionality is really cool right you can just serialize something into json and it just works right you don't need any add any macros to your like struct which is i
Starting point is 00:12:29 think what all the gaming people do with their reflection engines that you have to actually add macros being there done that so so not having to do any of that is good but like i still like there's still showcases where we really need like proper reflection support and so i guess for deserialization it then searches for the right string and then assigns to that member. I guess. I didn't look into that deserialization part, but I think that's...
Starting point is 00:12:53 Sounds fun. Yeah, that sounds fun. This is very similar to the enum trick that's been around for a few years now. I think Magic Enum was one of the first libraries that did that. There's a few more now. But there's a similar trick, not source location but even before that with the um
Starting point is 00:13:08 underscore underscore funk name preprocessor identifier again when you expand a template name it will give you the the values passed to the template and if you pass an enum value it'll actually give you the name of the enum value in the string that you can then parse out but of course it's completely implementation defined even more so than this i think i think you would have to take the pretty function uh macro instead of func but yeah i forget which one it is but pretty function is longer so it's an implementation defined hack on top of an implementation defined hack so if anything this is actually slightly more um standardized um there's only a bit this implementation you find but it's a popular thing to do right so that concludes our news items for
Starting point is 00:13:51 this time and so we can transition to our main topic and our main guest uh hi again matthias hi it's great to have you on the show great to be here so so i actually uh you know we always kind of prepare in advance like a list of possible questions that you want to ask the guests i mean it doesn't always go exactly that way but uh actually this time uh my list of possible questions is like way too long so i i probably not going to get to ask all of them um so that's going to be interesting i think i have enough material for like a five-hour episode here so let's see how that to go. As long as we ask multiple questions at the same time, then we should be fine. Well, do you also get multiple answers at the same time?
Starting point is 00:14:30 Or how's that going to work? Yeah, well, it doesn't really work. It should be single question, multiple answer. Right. It's multiple data coming in. The question is the single instruction. I don't think that works. It's more like task parallelism, not data parallelism.
Starting point is 00:14:43 Maybe if you just get the same answer to each question we should be good right so as you might have guessed by now our topic for today is simd which stands for single instruction multiple data so matthias do you want to just give us a very quick introduction what simd is and why we should care sure sure um so simdy is a part of computer Arctic architecture since like the 70s but nowadays it's a bit different back back then it was vector computers nowadays we call it call it rather short vectors so what happens is that you have a uh a register in your computer or many of them that can hold multiple values and then instructions that do the same thing that they would do on a single value on multiple values at the same time
Starting point is 00:15:34 so for example you have an integer you do an add and now instead of doing just it one on one integer value you put multiple integer values into a bigger register, and then you say do a SIMD add, and then it will do that element-wise. So for the first integer pair, it does an addition for the second integer pair because you're passing it two registers, of course, in order to do a binary operation and so on.
Starting point is 00:16:02 So you get a higher throughput of computing. And nowadays, most CPUs actually implement them in a way that the throughput and latency of the instructions. So how long does it take to get an answer? And how long does it, do you have to wait before you can issue a new instruction? Those numbers are the same for the scalar and the SIMD, of your CPU. And this can get worse because it's not just SIMD that is parallelism in the CPU on a single core.
Starting point is 00:16:57 There's also instruction level parallelism. Most of this, the compiler will try to do as much as possible to let it run in parallel, even though it's not explicit. It's all implicit. But in order for instruction-level parallelism to work, you need independent instruction streams. So you need to be, or the CPU needs to be able to issue an operation in one cycle and the next operation in the next cycle and so on at each cycle one operation or rather two if you look at modern cpus you can do like two fmas a fuse multiply instruction and then it takes like seven cycles or something until you get an answer so that means you need seven independent instruction streams
Starting point is 00:17:46 in order to be able to issue at every single instruction, at every single clock cycle or something. So the example I showed at CppCon and also last time in Kona for the CppVas committee is that in an extreme case, you have a parallelism of a factor of 128 on a single thread, on a single core, on a modern CPU. Actually, not so modern.
Starting point is 00:18:13 It's a Skylake AVX512C, so Skylake is like old by now. But that's a factor of 128 of parallelism, right? Parallelism on a single-threaded execution stream. This is like the most extreme example you can build up, but there's like so much parallelism, single-threaded, and it's not used in most of the cases. And if you really would be able to take it
Starting point is 00:18:44 all over your program where you're doing computing, and not just isolate it in a few parts, there are huge gains. Yeah, I like the tagline on your blog site. There's too much unused parallelism on a single core. It sums up what you just said. Yeah. Yeah, it always makes me sad looking at code or rather disassembling code and seeing how it's not using the CPU's resources. But I mean, what sorts of programs or algorithms can actually benefit from SIMD then?
Starting point is 00:19:21 Is it just to do with number crunching like you do in the numerics group? No, I mean, it's all over the place um whenever you see a loop going over a container there is probably a good potential for using simd and a string in a way is a container of jars and when you're looping over a string it's's, yeah, you really want to do that with SIMD. Now, most of the low-level libraries like libc, string operations, they have SIMD implementations nowadays. But using a simple approach, using a SIMD type, you can write it directly. And it's sometimes even faster than those implementations because it can get inlined
Starting point is 00:20:07 whereas the libc implementation needs to do an indirect jump because it has to work on all kinds of different platforms and so it first has to figure out what do I have available and then it jumps to the best implementation and from then on
Starting point is 00:20:23 it directly takes that indirect jump, but it's always an indirect jump in a function call. Whereas if you use SIMD types in C++, you can do that directly without any function call, and so we can get even faster. So I don't see SIMD being a niche thing to do. In principle, it's applicable all over the place. Right.
Starting point is 00:20:49 But it's not used all over the place. No, it's not. So is the reason for that that it's just complex to get into? Well, one of the things is you need to think a bit differently, um that that takes some getting used to in principle at this point just say all the types that you know from the basic types all the integers all the floating points just don't ever use them anymore just put them in a simd and that's what you want to do and so so the other reason why it's not used all over the place is because it's just hard to program to express at this point you either write a loop and hope for the best
Starting point is 00:21:35 you write a loop and try to annotate it with i know this can go in parallel and still hope for the best because it's not forcing the compiler to do anything. You write intrinsics and it's completely unreadable and unportable. Or you look for some kind the codes, especially as we have them here in our science codes where we have to say, these have to live for 20 years. Who can guarantee us that this will be around and working still in 20 years? So there's like, how can you trust that going forward? And that is one of the reasons why I am working with the C++ committee on this front, because I think this should be vocabulary type.
Starting point is 00:22:43 Something that where you can express that intent of, I just want to be able to scale this to any kind of parallelism here. How it's then implemented is not that important. I want to express that data parallelism. And that is just not there in the language at this point. So you just preempted several of the questions that I wanted to ask. That's really great. Thank you. So the system works.
Starting point is 00:23:01 We've been doing it in parallel. So it's like, was it speculative execution? Is it another good CPU feature? But anyway. Oh, yeah. So it was a speculative execution. It's another good CPU feature. But anyway. Oh, yeah. I have lots of experience there. I mean, I've done astrophysics and then audio. So I've done some SIMD as well. And yeah, it's basically, as you said, you either rely on the autovectorizer or you write your own intrinsics or then you rely on some kind of library.
Starting point is 00:23:23 And I think you basically just said that none of these are really great solutions, right? Because if you rely on the autovectorizer, then it might just not autovectorize it or it might. And then you add like a tiny change and then everything falls apart and it suddenly is like eight times slower. Maintenance of autovectorization is a hard problem. Yeah, yeah, yeah.
Starting point is 00:23:44 And then, you know, writing SIMD instructions, I think, yeah, you said it's not portable. The other problem is that sometimes it's actually slower than the code that the autorectorizer produces, even though you think it's not, but then you benchmark it and it's actually really slow. I've actually seen that in production as well. That's one of the other tricks you don't realize.
Starting point is 00:24:04 What I do in my implementation is trying to help the compiler to understand what's going on. So when you write intrinsics, you sometimes wave your magic wand and say, compiler, forget what I'm doing. I'm telling you what to do. And the compiler says, okay, I don't see through all these intrinsics. I'm going to do what you say. Whereas in the SIMD library implementation, I try to make sure that all those places where that happens, I give the compiler the path out. And when the compiler
Starting point is 00:24:38 says, I actually know the values of these parameters coming in, I'm not going to use the intrinsics for implementing them because then the compiler will not know what's going on. I will choose a different path so that the compiler can still optimize through and do constant propagation all over the place. And that is something that you don't want to write in your own code anywhere
Starting point is 00:25:01 because that's like you write everything three times or four times i do that for you in in the simd implementation and we can go into testing now because testing this is like a nightmare um i have i can write the same code and i then have to test it in all kinds of different like is it constant expiry? Is one of them a constant propagated value? Is the other one? Are both of them? None of them.
Starting point is 00:25:31 And how do I tell the compiler that it's not constant propagated when I actually want to test something and I write the value right there? So when I began testing things, the compiler would just throw away everything I wrote and say, okay, return zero, you're good. And none of the code that I wrote
Starting point is 00:25:48 was actually executed when I ran the tests. And so I didn't see some of the bugs because the constant propagation just did the right thing. But there were actual bugs when it was executing when it didn't know the values. So I had to invent all kinds of ways of how to test things properly in all these different scenarios, because bugs happen all over the place.
Starting point is 00:26:15 And sometimes they happen in constant propagation. I mean, I was just debugging for like three days now for why a left shift on integers is wrong. Because I'm doing some kind of trick using floating point exponents in order to do the shifting more efficiently. And what happens if you have one shifted by 31? So like the highest bit of an unsigned 32-bit integer is set, right?
Starting point is 00:26:48 Yep. And then you say that is a float value. So that's a conversion without loss. And then you convert that float value back to a signed integer. So the integral instruction that does it and the corresponding intrinsic documents exactly what happens you get the same integer back that you put in as a value for the float right but when it's constant propagated the compiler says well i can do something else this is not specified how
Starting point is 00:27:19 this has to happen and what it actually does it saturates so you get max int it's like the value minus one no it's not what the hardware does but it's what the c++ language allows the compiler to do because like it can just assume that certain things can't happen because it would be ub yes yes right yeah so even though even though i was using an intrinsic that was documented to work that way the compiler said, I know that intrinsic, it does a conversion from loading point to integer. I know how to concept propagate that. And so I'm going to give you something else.
Starting point is 00:27:55 Finding these kinds of bugs is like... So even if you write intrinsics, basically you don't really know. No, you don't have a guarantee that you actually get the instruction that's behind that. So if you look at the headers that define all the intrinsics, some of them are implemented as built-in IA32 something, whatever it calls. The compiler then has no idea what's going on. But some of them are actually implemented as different built-ins where the compiler has its own notion of SIMD vector types and then it can do operations on those.
Starting point is 00:28:34 And so some of these intrinsics are actually implemented using this because then the optimizer understands what's going on and can do better optimizations. And so you're never sure when you call an intrinsic without looking what's going to happen.
Starting point is 00:28:50 And there are many surprises hidden in there. So it's one of the reasons why I don't think many people should be implementing this. It's so full of pitfalls. And there are bugs all over the place in in compilers it is pretty good and solid by now but i've been working on this like since 2009 and there was bugs all over the place and people shouldn't be using that back then it was like not reliable enough right so if you don't want to write intrinsics which the user shouldn't do i guess you can go two ways right you can you can just write inline assembly
Starting point is 00:29:33 i guess basically that's one way out of it or the kind of other way out of it would be to go like higher on the abstraction level and just use a simd library but you know when 10 years ago when i got into music tech you know there was boost simd and then i was working at a kind of relatively large music tech company that had like their own in-house simd library that was a pain to maintain that nowadays there's like a lot more stuff we have highway xcindy vector if we have this like whole spectrum of of cindy libraries but and they all kind of have different paradigms different trade-offs so it's kind of unclear you know how to approach this i i guess i guess your solution is that's why we needed a standard right uh yeah so i i don't want any of these to go away. What I hope will happen there is that the standard SIMD type will allow interfaces that work across library boundaries. And then in a library, you can implement that with a different library, because you just take a std simd, as I want to define it, and you bitcast it to whatever else you want.
Starting point is 00:30:46 You can just bitcast it to an intrinsic type or something else, and then you go from there. And so that means you can have interfaces that interact, that work with each other, but implementation-wise, you're not fixed to just using the standard implementation. You could also do something else. So I don't see that as a competition. With Stetsimdi, I don't want to say, okay, you don't need any of these anymore. They are doing innovation
Starting point is 00:31:18 in areas that I would like to be able to do, but I have to focus on getting the basis right here. Okay, so a few episodes back, I think Phil was on his vacation. I did an episode with Mark Gillard about his library Sogen, which gives you like auto-generated, which gives you the ability to like make either struct of arrays or array of
Starting point is 00:31:45 structs and kind of have an automated way of like changing one, one to the other. What's your opinion on that? Like, because that's ultimately also about enabling, enabling SIMD optimizations, right? And Mark was saying that he was getting lots of,
Starting point is 00:32:01 you know, performance improvements just by switching from array of structs to struct of arrays and then letting the autoregulator do its thing. So is that something that we should be doing or what's your take on this? So first of all, maybe explain the reason why this is important. So what happens is that when you look at those SIMD registers, they have all these values next to each other in the register. And in order to be efficient in putting those values there, you want to have them in a close by in memory so that you can load them all at once. So if they are all in a single cache line in the correct order, then you issue a so-called load instruction, and then it just copies the whole thing into the
Starting point is 00:32:47 register, and then you can work. Now, if you have a struct, and in the struct an x, a y, and a z, and then you have an array of those, and you want all the x's in one SIMD register, then the x's are not next to each other in memory, and you cannot do a SIMD load instruction. And that's often where the compiler then says, okay, that's not going to be efficient, because the load will be so slow in order to get all the values into the SIMD register that the speedup that I can get
Starting point is 00:33:14 from actually then working in parallel is not worth it. And so I'm not going to vectorize this. But when you reorder them to have all the X next to each other in memory, then the autovectorizer says, okay, that's just not costing me anything to do that in a vector. And then I'm going to get some speed up from vectorizing. So then the autovectorizer kicks in.
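As an editorial illustration of the layout difference being described here (not code from the episode), these are the two layouts in plain C++; the struct-of-arrays version keeps all the x's contiguous, which is what lets a simple loop over them auto-vectorize:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 1024;

// AoS: x, y, z interleaved -- the x's are 12 bytes apart, so gathering
// "all the x's" into one SIMD register needs an expensive strided load.
struct PointAoS { float x, y, z; };
using AoS = std::array<PointAoS, N>;

// SoA: all x's contiguous -- one SIMD load fills a register with x's.
struct SoA {
    std::array<float, N> x, y, z;
};

// Over the SoA layout this loop reads contiguous memory and vectorizes
// trivially; over AoS the access strides by sizeof(PointAoS).
float sum_x(const SoA& p) {
    float s = 0.f;
    for (std::size_t i = 0; i < N; ++i) s += p.x[i];
    return s;
}

float sum_x(const AoS& p) {
    float s = 0.f;
    for (std::size_t i = 0; i < N; ++i) s += p[i].x; // strided access
    return s;
}
```

Both functions compute the same result; only the memory traffic, and therefore the vectorizer's cost model, differs.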
Starting point is 00:33:48 And the interesting thing here is that you're also losing something. Like if you have the case of an X, Y, Z point, and you always need to process X, Y, and Z together, then you want to have that locality, right? When you say, I'm going to access this object, then all of these need to be in the cache at that time. Now, if you have a struct of array, you have three streams that the hardware prefetcher now has to look into. And typically that works well enough. You might get into cache associativity issues for more things. So that when you're actually starting to load from one of those
Starting point is 00:34:26 arrays, you're evicting stuff from another one. So a library that is smart there makes sure that the distances are not powers of two between those arrays. But otherwise, yeah, for
Starting point is 00:34:41 SIMD, that can give you a significant speedup. But now the story is not as simple. And you cannot, in general, say what is better. Because often you have IO. You have things coming in. And they don't come in as struct of arrays, typically. So the first thing you would have to do is you have to go over all your memory, reshuffle the things. Then you can do efficient work.
Starting point is 00:35:04 And then for output, you have to reshuffle everything again, and then you're done. And now you have to measure the whole thing: is it now faster with all the reshuffling at the input and output? And so sometimes you just have problems where input and output are important, in terms of dominating the cost at runtime. So then it makes sense to actually vectorize on an array of structs. But in order to do that efficiently, you'd need to do some tricks that I've never seen the compiler, the auto-vectorizer, do.
Starting point is 00:35:40 I've shown in my CppCon talk that I can do it manually, and I can implement it for you, giving you a std::for_each or a std::transform algorithm implementation that goes over an array of scalar structs and passes you the whole thing as SIMDs. Then you can do efficient calculation and even store it back.
Starting point is 00:36:04 So it's not like do that and you get better vectorization. I can't tell you that. There's even a third one, the one that I like to use most. That's an array of a vectorized struct. So again, to the- An array of a struct of an array, essentially? Yeah, but that last array is like just a SIMD. Oh, so array of structs of SIMD packs.
Starting point is 00:36:31 Yeah, so again, with the example of XYZ, now instead of that being a float, make it a SIMD float. So that's what I call a vectorized struct, because now instead of just having a single struct, I have a SIMD-vector struct that I can efficiently work with, because the x that I access is in memory in the correct order, so that it's a single SIMD load. And then, because I need to work with more values than just whatever the SIMD width of the hardware is,
Starting point is 00:37:06 I have an array of the whole thing. And then it's really simple just to iterate over the whole array and work with each of those things. And in terms of the code you write, you write it, the algorithm, you write it once for the scalar case. Then you vectorize the struct and you, except for branches, you don't have to touch the code. It just compiles with a vectorized struct. So I think we can summarize all that by just saying it depends. Sure, yes. Or perhaps more importantly,
Starting point is 00:37:35 yeah, these libraries will help you, but you still have to understand what's going on under the hood to know what the best thing to use is. My takeaway from all I've learned is vectorizing algorithms is almost trivial, but getting speed out of them is a question of data structures. And getting data structures right is the actual challenge of SIMD. Right.
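The "array of vectorized structs" layout from a moment ago can be sketched in plain C++, with a std::array standing in for simd<float> (an assumption for illustration; this is not the proposed std::simd API):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;        // stand-in for the hardware SIMD width
using pack = std::array<float, W>;  // stand-in for simd<float>

// element-wise operators so the pack behaves like the scalar it replaces
inline pack operator*(pack a, const pack& b) {
    for (std::size_t i = 0; i < W; ++i) a[i] *= b[i];
    return a;
}
inline pack operator+(pack a, const pack& b) {
    for (std::size_t i = 0; i < W; ++i) a[i] += b[i];
    return a;
}

// the "vectorized struct": W x's contiguous, then W y's, then W z's
struct PointPack { pack x, y, z; };

// squared length of W points at once; the source reads like the scalar code
pack length2(const PointPack& p) {
    return p.x * p.x + p.y * p.y + p.z * p.z;
}
```

An array of `PointPack` is then the "array of vectorized structs": iterating over it hands you whole registers' worth of x, y, and z at a time.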
Starting point is 00:37:58 Okay, so there's no free lunch. I actually want to transition to the standardization effort. You've hinted at it a couple of times already; I wanted to dig a bit deeper. But before we do that, it's a good time to have our sponsor break. We are once again sponsored by Sonar, the home of clean code. SonarLint is a free plugin for your IDE that helps you to find and fix bugs and security issues from the moment you start writing code. I'm not sure if it will find SIMD bugs, though. I'll have to find out about that. You can also add SonarQube or SonarCloud to extend your CI/CD pipeline and enable your whole
Starting point is 00:38:35 team to deliver clean code consistently and efficiently on every check-in or pull request. SonarCloud is completely free for open source projects and integrates with all of the main cloud DevOps platforms. So back to SIMD and getting into the C++ standard. What is the current state of, I think it's P1928, is it? Yeah, that's the paper that is proposing the merge of the TS into the standard. So the goal there was to not increase the scope, the feature set, but keep it at what the TS was and incorporate the feedback of what to improve and also to look at language features, what is now possible that wasn't possible back then
Starting point is 00:39:19 or integration with library. The TS depends on C++ 17. So we didn't have ranges then. We didn't have contiguous iterator. And that is like so important for SIMD. I don't care for random access. I care for contiguous memory if I do SIMD. So that was something that I wanted to incorporate there
Starting point is 00:39:42 to generalize the interfaces a bit. So yeah, that passed the Library Evolution Working Group back in Varna in June, and so we're in library wording with that paper now. In terms of design review, that means we're done. There might be a bit of an issue going forward without adding more papers or more features, because some people on the committee think, and I mostly agree, that the feature set is not enough for C++26. If that is all, then it shouldn't be
Starting point is 00:40:22 in 26; there should be more in order to be good enough for seeing the light of day for the first time. I totally agree there should be more. And so I will try to focus my efforts on getting more features in and then getting all of these through library review. But we're doing as many of the things in parallel as we can. So we have a pipeline and out-of-order execution here. Yeah. The problem is that I'm the bottleneck in many cases. So which are the kinds of features
Starting point is 00:41:05 that you have in the paper and which are the kinds of features that you feel like are missing but really need to be in there to have a meaningful offer for C++ 26? So I see the TS and also the merge of the TS feature set as the low-level thing
Starting point is 00:41:21 that you need to have in order to do anything on top. And 10 years ago, I was working on higher-level things. I haven't been since then, because I've been at the committee working on this low-level stuff, trying to get that right and also implementing it. And I really think we need the high-level interfaces, especially for the memory access. I don't want people to do the loads and stores themselves; it's a bit of a footgun in terms of out-of-bounds access. So consider you have some range. You don't know the exact size.
Starting point is 00:42:05 And now you want to iterate over that with SIMDs. The SIMD has a fixed chunk size. So typically that doesn't match the number of elements you have in the range and the number of elements you would then process if you go over it by SIMD. So you always need the SIMD epilogue, where you make sure that at the end where there's this mismatch of numbers, you don't get an out-of-bounds access on the array. So at this point, if all you have is that SIMD types and the loads and stores operations that it provides, then you always have to write this epilogue yourself. And you always have to make sure that you don't forget it, because otherwise you're getting out-of-bounds access. And I would like to provide
Starting point is 00:42:46 the necessary interfaces to just give me a range, and I will give you the SIMDs to work with, and at the end of that range, those SIMDs might get smaller and smaller,
Starting point is 00:43:02 just until you're done. So, I would really like to see that in 26, but at this point, time is already running out. I don't think I get that ready for 26. I have to focus on getting all the low-level interfaces right and complete. And there were always some things missing, like permutations were missing, the gather and scatter access. So when you have to do random access into memory,
Starting point is 00:43:31 there is some hardware support to do so, though it's not actually making things much faster. It's a bit of help, but yeah. So the other question that came up and that I will have to investigate is the UB that you get on loads and stores when you're doing this out-of-bounds access, whether we can push that UB out of SIMD
Starting point is 00:43:58 and up to the user, so that the user basically would have to give me a span of static extent or something and say, you can load from that whole thing here. And then if there's a problem, it was the user who created the span of static extent and did the out-of-bounds access; the SIMD just did what it was told. For the static extent, it's simple. There is also the question of passing a range of dynamic extent.
Starting point is 00:44:37 But then when you do bounds checking at runtime in a SIMD low-level interface, that kind of goes against the idea that this is a low-level performance primitive. So if we do that, it has to be without cost, at least in the scenarios that are interesting. So there is some homework I can and have to do. I've already explored some test cases on the compiler side, and I already did a few benchmarks. And so there are some cases where all these checks can get optimized away. And that's certainly interesting to use.
Starting point is 00:45:17 So the feature set basically consists of: you can do almost anything that you can do with an int or a float, in terms of all the operations that are there. In terms of cmath, what is still missing are some of the integer operations that we got since C++17, like the bit operations. But they have papers. So in many cases, you just replace the type from an int to a simd<int> and things continue working. The biggest thing that I'm missing is expressing conditionals in a way that works for an int and a simd<int> with the same syntax. An if statement obviously can't work.
Starting point is 00:46:05 If you do a comparison on SIMDs, you get multiple answers because you always work element-wise. And the elements are not related. There is no single answer. So what you need to do in SIMD code is not to branch, but you execute both branches
Starting point is 00:46:21 and then you blend the results of those branches together into the result that you were looking for. This is also what happens if you do any kind of CUDA programming and that's translated to a GPU, or potentially OpenCL and that's translated to SIMD, which is in theory possible. And it would have to do exactly that for if statements. It executes both branches, but parts of the execution units just don't do work when a branch is executed, and then the results are blended together. And so the natural thing to do there for C++ would be to
Starting point is 00:46:59 use the conditional operator, question mark, colon. So you can write a condition on the left-hand side, and then you get either the second or the third argument. And for a SIMD, that means you would evaluate both the second and the third, and then element-wise apply the question mark. So this aspect, actually, I remember when I was in Kona, I wasn't really there when you were discussing SIMD. I was busy with contracts and other things. But then later I went to this bar, the Shark Shack,
Starting point is 00:47:38 and everybody there was discussing this thing where comparisons, things like operator equals equals, where, as you said, in std::simd operator equals equals actually returns a bit mask, because you get a different result for every element, like what you just explained basically.
Starting point is 00:47:56 And then all the non-SIMD people were saying, oh, but that just breaks how types work in C++. Types need to be regular; operator equals equals needs to return a bool. Otherwise the type system breaks. And if you want to do anything else, it needs to be a named function or something.
Starting point is 00:48:13 So it was such a hotly debated topic. I'm just curious what your take is on that. Yes, it's a very important topic to me because the way I understand it, I wrote a blog post about it after the Kona meeting about regularity in this question, right? It's about the things that Stepanov wrote and how you can use types to do any kind of reasoning. So for example, associativity, you define associativity by being able to say, well, the left-hand side is equal to
Starting point is 00:48:45 the right-hand side by just putting the parentheses at a different place. But if equality doesn't work, then you cannot define associativity. Now for SIMD, if you would say everything happens element-wise and we ignore comparisons for now, you would still say, right, it's the same answer that I get. So it is associative. If I put the parentheses on the A plus B or on the B plus C, that doesn't make a difference. At least for integers. So it is associative. But equality doesn't give me a Boolean.
Starting point is 00:49:23 It's not regular. So according to what Stepanov wrote in his book, it appears like, okay, then it's not regular. But the value type of a SIMD int is the int. I need to preserve regularity of that single int. Now, you can consider what you do with a SIMD. A very simple transformation is: you take a loop over four elements, and then you do something with it, and you do a comparison in there. And now you translate that into a SIMD by saying, okay, all these were independent. They had nothing to do with each other. Each of those four iterations, they were running independently.
Starting point is 00:50:17 Now you translate that to using a SIMD because that's the idea of a SIMD. These things are running independently so I can just do the same operations, the same instructions, but on multiple data. But now if you come to the comparison, and that comparison now suddenly mashes all of them together and they become dependent, then it's not a valid transformation anymore. Because now you would say
Starting point is 00:50:45 every four elements next to each other, they interact with each other. And the way that one compares has an influence on how the next three compare. That doesn't sound right to me. You now break regularity of the elements if the whole SIMD comparison returns a single boolean. Okay, so basically operator equals equals needs to return a bitmask for SIMD types.
Starting point is 00:51:13 In a way, it's a bitmask, right? It's a bit more abstract. I'm not saying how it's represented and whether it's actually a bitmask, but in spirit, it is one, yes. And so when you do a comparison on a SIMD, you do multiple comparisons. And whenever you do multiple things and you want a single answer, that means you need a reduction. And so you can reduce the answer of the comparison and say, are all of them true? Is any of them true?
Starting point is 00:51:44 Is none of them true? And then you get a Boolean and then you can branch on that. And this is also what you do in SIMD code. You typically don't branch unless you know my data is biased to this side. And so it is possible that in many cases, none of them are true, even though there are 16 in there. And then you can do a branch like if none of them are true, even though there are 16 in there. And then you can do a branch like, if none of them, then go the fast path, otherwise do something slower and don't execute everything.
Starting point is 00:52:15 And then in the end, when you blend it together, throw everything away again. So sometimes you do branches in SIMD code, but in many cases you don't. And so having the comparison reflect that, I think, is the correct behavior to teach how to use SIMD correctly in terms of performance. And from the concepts side, I see it as SIMD is a generalization of what we do with all the types that we have in the language. So an int is regular. And the generalization of that is the SIMD int that is element-wise regular.
Starting point is 00:52:59 You can reduce it to a single answer. Now, you can do that also with a single answer. You can reduce a single answer to a single answer. Now, you can do that also with a single answer. You can reduce a single answer to a single answer trivially. It's just the same, right? So the scalar case is like the special case of the SIMD. The SIMD is more general in how you would have to define those concepts. I'm not ready to write down all those concepts or propose any of them as actual C++ concepts,
Starting point is 00:53:27 but in the way that I program them in my code, it's exactly what happens. I want to be able to say, well, this function, I need an actual regular type because I'm going to branch on it on comparisons. But in other cases, I want to be able to say, I can work with a parallel regular type because I'm going to reduce in the cases where I actually need to branch. But otherwise I don't need to branch and I can just go on. And it's more general in that sense.
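The element-wise comparison, reduction, and blend semantics described over the last few minutes can be sketched in plain C++ (a bool array stands in for the simd mask; this illustrates the semantics, not the proposed API):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;
using pack = std::array<float, W>;
using mask = std::array<bool, W>;   // stand-in for the simd mask type

// element-wise comparison: one answer per element, not a single bool
mask less(const pack& a, const pack& b) {
    mask m{};
    for (std::size_t i = 0; i < W; ++i) m[i] = a[i] < b[i];
    return m;
}

// reductions: collapse the multiple answers to one bool before branching
bool any_of(const mask& m)  { for (bool b : m) if (b)  return true;  return false; }
bool all_of(const mask& m)  { for (bool b : m) if (!b) return false; return true;  }
bool none_of(const mask& m) { return !any_of(m); }

// blend: element-wise "m ? a : b"; both inputs were already computed
pack blend(const mask& m, const pack& a, const pack& b) {
    pack r{};
    for (std::size_t i = 0; i < W; ++i) r[i] = m[i] ? a[i] : b[i];
    return r;
}
```

Branching happens only on the reductions (for example, a fast path when `none_of` holds); everywhere else both sides are evaluated and blended, exactly as described.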
Starting point is 00:53:53 Right. So before you wrap up, I would like to pick your brain on like two particular things that I've encountered as like things you have to deal with when you actually do SIMD. And like in my case, it wasn't like audio
Starting point is 00:54:05 stuff. So, I'm just really curious how the standard proposal deals with those things. One is: can I use std::simd with my own types? For example, can I have a std::simd of a std::complex or whatever? Okay, std::complex is actually a bad example, because it forces you to zero-initialize everything, so no high-performance people actually use that. I have my own type which lets me have the things uninitialized, so I don't have to pay for initializing them. But then complex is the only such type we have so far, so that's another problem. But if I have my own custom type, can I just plug that into std::simd? That's one question. Yeah, that is a very important thing, and I've been
Starting point is 00:54:52 solving that problem also for 10 years, and I'm nowhere near standardizing any of that. So yes, like I said, I have these vectorized structs that I want to work with. And I have them all over the place in my program. But I also want to be able to just go back to the scalar struct. So I don't want to write those twice. The way that you have to think about it: complex is a good example. You can have a complex of SIMD or a SIMD of complex. And it should not be the same thing. Because a SIMD of complex means that your elements are still complex, so they have a real and imaginary value inside.
Starting point is 00:55:36 So in terms of looking at the register, you have in the register real, imag, real, imag, real, imag. It's interleaved. Whereas if you say, I want a complex of SIMD, you're saying my complex has a member real of type SIMD and a member imag of type SIMD. So in terms of looking at memory, you have all the reals next to each other and all the imaginaries next to each other. So it's again the SoA versus AoS problem, right?
Starting point is 00:56:05 Yeah, kind of. I mean, it's now on just the struct level. We're talking about this vectorized struct. And a complex of SIMD would be what I call the vectorized struct. So in most cases, you don't want a SIMD of your user-defined type because that would mean interleaving that and you would have to define all the meaning of all the operations, things you would do with it.
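The two complex layouts just contrasted can be written out in plain C++ (plain arrays stand in for simd<float>; an editorial sketch, not the proposed API):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;

// simd<complex>-style: elements are still complex, so memory is interleaved
struct InterleavedComplex {
    std::array<float, 2 * W> data; // re0, im0, re1, im1, ...
};

// complex<simd>-style, the "vectorized struct": split and SIMD-friendly
struct SplitComplex {
    std::array<float, W> re; // all reals contiguous: one SIMD load
    std::array<float, W> im; // all imaginaries contiguous: one SIMD load
};

// Converting between them is the same interleave/deinterleave cost as the
// general AoS-to-SoA reshuffle discussed earlier.
SplitComplex deinterleave(const InterleavedComplex& c) {
    SplitComplex s{};
    for (std::size_t i = 0; i < W; ++i) {
        s.re[i] = c.data[2 * i];
        s.im[i] = c.data[2 * i + 1];
    }
    return s;
}
```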
Starting point is 00:56:30 But that said, there are things like fixed point and all kinds of other interesting user-defined types that we want to do, that basically are just an int underneath. Yeah, like an int with saturation math or whatever. There's lots of them. Lots of them. And SG6 has a call for papers to give us these. And yeah, they need to interact.
Starting point is 00:56:54 And that needs to work. Then all these types that just have a single value type, basically, they need to work out of the box. And you somehow need to be able to define what is the behavior of all these operations of the operators, of the operator overloads. It's not obvious
Starting point is 00:57:13 how to do the vectorization without having to rely on an auto-vectorization, but at least I've done it for enums. For enums, it's also just like, okay, now that I'm using SIMD, suddenly all the enums that I have to use, I have done it for enums. For enums, it's also just like, okay, now that I'm using SIMD, suddenly all the enums that I have to use,
Starting point is 00:57:29 I have to use an int instead. I'm going back in time, right? That's not useful. So you want to be able to just continue using your enums and have a SIMD of your enum. I also, I have implemented that. I just don't have the time to actually make a proposal out of it and get it into the standard at this point because I have to that. I just don't have the time to actually make a proposal out of it and get it into the standard at this point
Starting point is 00:57:47 because I have to focus on the rest. But yeah, those should work. But all the ones that have multiple data members, these are typically the ones that you don't want to make a SIMD of struct, but you want to have a struct of SIMD. So the way that I've always used them is to just write a template and then I say template type name capital float
Starting point is 00:58:09 and then I define all my float data members using that template parameter and then I either instantiate that with a float or with a SIMD float. And then that auto-vectorizes even the functions in there because now the functions,
Starting point is 00:58:27 the member functions that you define in that class, they use that float template parameter. And now, by using just a scalar float, it's scalar. If you use a SIMD float, it auto-vectorizes in a sense, right? It vectorizes the implementation. And I would really like to automate that.
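The "write it once, instantiate twice" pattern just described looks roughly like this (a std::array pack stands in for simd<float>, an assumption for illustration):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;
using pack = std::array<float, W>;  // stand-in for simd<float>

// just enough element-wise operators for the pack to act like a float
inline pack operator*(pack a, const pack& b) {
    for (std::size_t i = 0; i < W; ++i) a[i] *= b[i];
    return a;
}
inline pack operator+(pack a, const pack& b) {
    for (std::size_t i = 0; i < W; ++i) a[i] += b[i];
    return a;
}

// the struct is templated on its element type, so the same member function
// is scalar with float and processes W points at once with the pack
template <typename Float>
struct Point {
    Float x, y, z;
    Float length2() const { return x * x + y * y + z * z; }
};

using ScalarPoint     = Point<float>; // one point
using VectorizedPoint = Point<pack>;  // W points, identical source
```

Except for branches, the member-function bodies never need to be touched when swapping the element type, which is the point being made here.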
Starting point is 00:58:49 And I would need powerful reflection support in the language to do it right. At this point, I do struct reflection. So you can give me a struct that is not a template. I will look at, okay, it has all these members of these types; I can give you a std::tuple that has std::simd types in it instead. I cannot reconstruct a struct, because I would have to name those members somehow. So all I can do is give you a tuple. So the proof of concept, in the sense that I can do these transformations and they are useful, it's all there. But yeah, in order to be complete and nice, I really would have to work with reflection. Well, the C++26,
Starting point is 00:59:28 kind of the reflection proposal that's currently targeting C++26 gives you, I think, some of these capabilities; you can unreflect with these splicers and, I think, do this kind of stuff. But we don't really have time to get into that more. Before I let you go, I have one more SIMD-specific question that I'm really curious about, and where std::simd is on
Starting point is 00:59:49 that. So if I look at a std::simd of a float, for example, the first question I'm asking myself is: how many floats are there actually in there? What's the width of this thing? Because that depends on the particular architecture that you're compiling for. And in things like audio, video games, or, you know, stuff like that,
Starting point is 01:00:09 like you're actually targeting consumer hardware. So it's not like the kind of probably physics simulations that you do, but you know exactly like what hardware you run on, but you run on
Starting point is 01:00:17 somebody's laptop from 10 years ago, right? And you have no idea if it supports like AVX, whatever, or like ARM, Neon, whatever, right? You just don't know. So the way you do this for if it supports like AVX, whatever, or like ARM, Neon, whatever, right? You just don't know.
Starting point is 01:00:26 So the way you do this for consumer software, or like one approach to solve this for consumer software is to say, okay, I'm gonna, I have like my kernel with all the maths that like my game or whatever needs to do. And I'm gonna compile that for SSE3, for SSE4, for AVX1, for AVX512, whatever. And you have like these multiple kernels.
Starting point is 01:00:47 And then when your game starts up, you basically, at runtime on the user's machine, ask, hey, does the CPU support this instruction? And then you do kind of dynamic dispatch. Yes. And that actually can have an influence on whether your simd<float> has four floats or eight floats or whatever.
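The dispatch pattern just described can be sketched with a function pointer resolved once at startup. This is an editorial sketch: the CPU probe is a stub, and both kernels are plain scalar code so it stays portable; real code would use something like __builtin_cpu_supports on GCC/Clang and compile each kernel in its own translation unit with different -m flags.

```cpp
#include <cstddef>

// stub probe (assumption): pretend the feature is absent so the sketch
// runs anywhere; real code would query the CPU here
bool cpu_has_avx2() { return false; }

// in a real build these live in separate TUs compiled for different ISAs
float sum_generic(const float* p, std::size_t n) {
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}
float sum_avx2(const float* p, std::size_t n) { return sum_generic(p, n); }

using kernel_t = float (*)(const float*, std::size_t);

// resolved once; afterwards every call is a plain indirect call
kernel_t pick_kernel() { return cpu_has_avx2() ? sum_avx2 : sum_generic; }
```

The probe runs once, so the per-call cost is only the indirect call, which is the property the linker-level dispatch discussed below also aims for.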
Starting point is 01:01:04 And so, yeah, it's kind of annoying to do this. And so I'm wondering if std simd gives us anything better than that or how you do that with std simd. So at this point, I see no way that the C++ standard can help you because all we have is an abstract machine. There is no notion of you compile
Starting point is 01:01:22 for different ISA extensions or not. And in the problem you described, the one problem that never goes away is that you tell the compiler what instructions it can use in the binary. And, I mean, you need to have these as different compilations and then dispatch at runtime; there is no other way, because you have to say, okay, this is the set of instructions that you can use. And if it's the smallest set, then it's not the most efficient one, so you want the bigger ones, and then you have to do a runtime dispatch, because otherwise the program will just stop with an unknown instruction. Yeah, sure, on the binary level. But I was hoping that we don't have to do that ourselves anymore, you know, that the library does that for me.
Starting point is 01:02:08 How do you do the dispatch, right? And this is something that I don't see the C++ standard solving. This is something that tooling has to solve. So I see that either as compilers that can solve
Starting point is 01:02:24 it by doing multiple passes over the whole thing they're compiling and then emitting different binary outputs and then somehow doing that dispatch. It's hard, especially if you have headers. Now, you have this attribute target clones on GCC, where you say, compile this function for many different targets, and whenever it's called,
Starting point is 01:02:52 automatically do an IFUNC resolve to take the one that your machine is running on, or supports. That works as long as it's not using anything from headers that depend on predefined macros that depend on these kinds of, is AVX enabled or not. And so the SIMD header,
Starting point is 01:03:11 it has to be implemented using these macros. So it has already been parsed. These macros don't change any of the code that has been seen before. So that doesn't work. The target clones just doesn't work. There are a few ideas but no compiler vendor that i've seen has like targeted like tried to do anything in that area because it's like a huge change to make that work um what i see more realistically as a solution is like on a dynamic linker level, you compile your application into a dynamic library and put it in a subdirectory for the target machine.
Starting point is 01:03:57 And then on that target machine, you have just like a small main function that then is linked to the library. And the dynamic linker, when it looks for the library name, it first goes through all those subdirectories, knowing what architectures it supports, ignoring the subdirectories that it doesn't support. And in the first subdirectory where there is a library that matches that name, it will take that instead of the generic one. And so that way you have the fallback on the linker level, or you have the dispatch on the linker level. And when all the linking gets resolved, which are kind of indirect calls anyway, then you're done. And so you don't have additional indirection anymore. So I see that as one of the most interesting solutions but it still requires tooling work because you have to compile it multiple times into all these libraries that
Starting point is 01:04:54 have the same name and then install them in those subdirectories. It would be nicer if tooling would just allow you to say, okay, with this one switch, do it for me; I don't have to repeat everything. But you need to have some notion of the target machine in the library, right? Like, what is the size of a std::simd<float>? I mean, that's going to depend on the target machine, right? Yeah, so that is a kind of orthogonal problem. The way I like to work is to explicitly
Starting point is 01:05:29 not know the number of elements in there. That's how I like to write my algorithms. So that means whenever you start thinking about mySIMD has four elements, you're probably encoding that information into your algorithm and it doesn't scale to wider architectures. So I rather want the SIMD users to think as I cannot know the number of elements in there and I have to write my algorithm in a way that I don't have to know.
Starting point is 01:06:06 And then it's scalable and works most efficiently on most machines. And it typically also leads you to using the right parallelism. People that think my vector has four elements like to put X, Y, and Z in there. And they take away from instruction-level parallelism, put it into SIMD, and in total get a slowdown, because they thought, oh well, I have four elements available, using three is efficient. But if I tell you, you don't know how many you have in there, that typically doesn't work for them. They will not do that anymore; then they will look for, okay, I need to take multiple
Starting point is 01:06:46 points, take all the Xs, put them in one SIMD, take all the Ys, put them in one SIMD. And that scales and is way more efficient anyway. So I acknowledge there are problems where you need to know the size, and I've had a problem myself
Starting point is 01:07:02 where the size was 12, not even a power of two, and it was really useful to have an implementation that said, okay, 12, I can implement that as one AVX vector and one SSE vector. And yeah, sometimes there is just no other or better way to do it, and then you have to be able to say that. And std::simd allows you to say, I need 12. But for most of the time, start from: I cannot know how many I'm going to get. That's how I have to write my algorithm. That's how I have to write my data structures.
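The width-agnostic style being recommended can be sketched by making the chunk width a parameter the algorithm never hard-codes (editorial sketch; a real version would take the width from the simd type itself):

```cpp
#include <cstddef>
#include <vector>

// The same source serves W = 1 (scalar), 4, 8, ... because nothing in the
// body assumes a particular element count.
template <std::size_t W>
float sum(const std::vector<float>& v) {
    float partial[W] = {};                 // W running partial sums
    std::size_t i = 0;
    for (; i + W <= v.size(); i += W)      // main loop in chunks of W
        for (std::size_t j = 0; j < W; ++j)
            partial[j] += v[i + j];
    float s = 0.f;
    for (std::size_t j = 0; j < W; ++j)    // horizontal reduction at the end
        s += partial[j];
    for (; i < v.size(); ++i)              // scalar epilogue for the leftovers
        s += v[i];
    return s;
}
```

Because W is never baked into the algorithm, recompiling for a wider machine, or for a different ISA, changes only the constant, not the code.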
Starting point is 01:07:38 Right. So you mentioned a situation where the runtime can be longer than expected. And that is true of our recording today. So we are going to have to wrap up there. But I think we could have carried on talking for a long time. So thanks very much for coming on the show and for everything you've been telling us. Is there anything else you do want to let us know
Starting point is 01:07:58 before we do let you go? Or perhaps how to find you, if people want to find out more? Oh, how to find me. Yeah, you can surely find me. The bigger problem is me answering sometimes. So I'm on Mastodon.
Starting point is 01:08:16 I'm not posting a lot, but you can find me there. You can find me on GitHub, which is probably the best place to follow what I'm doing. I'm pushing my paperwork for the committee up into repositories there. I have a feedback repository on the TS that is
Starting point is 01:08:37 used to capture everything we want to put into 26. I have a SIMD prototype for a C++ 26 repository in there and some other work. So GitHub might be most interesting in order to see what I'm doing and also to give me some feedback on
Starting point is 01:08:53 some of those repositories. Great. We'll put those links in the show notes or on the website. So if you want to reach out to Matthias, you'll know where to go. So again, thank you very much for coming on the show. Yeah, thanks for the nice and interesting talk.
Starting point is 01:09:13 Thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a guest or a topic, we'd love to hear about that too. You can email all your thoughts to feedback@cppcast.com. We'd also appreciate it if you can follow CppCast on Twitter or Mastodon. You can also follow me and Phil individually on Twitter or Mastodon. All those links, as well as the show notes, can be found on the podcast website at cppcast.com.
