CppCast - EVE - the Expressive Vector Engine
Episode Date: October 14, 2021

Rob and Jason are joined by Joël Falcou and Denis Yaroshevskiy. They first talk about the 6.2 release of Qt and the range-based for loop bug that won't be getting fixed in C++23. Then they talk to Joël and Denis about EVE, a C++20 SIMD library that evolved from Boost.SIMD.

News
- Qt 6.2 LTS released
- GDBFrontend
- C++ committee doesn't want to fix range-based for loop in C++23 (broken for 10 years)

Links
- EVE on GitHub
- EVE example on Compiler Explorer
- CppCon 2021: SIMD in C++20: EVE of a New Era
- Meeting C++ 2021 - EVE: a new, powerful open source C++20 SIMD library
- C++ Russia EVE talk
- Denis Yaroshevskiy - my first SIMD - Meeting C++ online

Sponsors
- Use code JetBrainsForCppCast during checkout at JetBrains.com for a 25% discount
Transcript
Discussion (0)
Episode 321 of CppCast with guests Joël Falcou and Denis Yaroshevskiy, recorded October 7th, 2021.
This episode of CppCast is sponsored by JetBrains. JetBrains has a range of C++ IDEs to help you
avoid the typical pitfalls and headaches that are often associated with coding in C++.
Exclusively for CppCast, JetBrains is offering a 25% discount for purchasing or renewing a yearly
individual license on the C++ tool of your choice, CLion, ReSharper C++, or AppCode. Use the coupon code
JetBrainsForCppCast during checkout at www.jetbrains.com. In this episode, we discuss a Qt update
and a bug with range-based for loops.
Then we talk to Joël Falcou and Denis Yaroshevskiy.
Joel and Dennis talk to us about EVE,
a C++20 SIMD library. Welcome to episode 321 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing all right.
You got anything you feel like opening up the show with today?
Well, I don't know. I was just thinking, when this airs, it'll be really close to CppCon, won't it?
It will be. It will be. And really close to hopefully your Norway trip.
Yes, I will be theoretically getting on an airplane or something.
Theoretically. Hopefully that is able to happen. Yeah.
No way to know at the moment. Okay, well, at the top of every episode I like to read a piece of feedback. I got this DM from Paul Leslie.
Paul says that he's from the Auckland, New Zealand C++ user group, and they've been on
a very long hiatus because of the pandemic, but they're going to do a virtual meetup this coming Wednesday,
October 20th,
and was hoping that we could give a shout out to all those Kiwis in the C++
community.
So, yeah,
if you're in the Auckland area and want to rejoin the C++ group,
it's reopening very soon.
Check it out.
That's exciting.
Yeah.
I guess anyone could join if it's virtual,
but, you know, time zones and everything.
Yeah. Time zones are hard with New Zealand, but it's possible. Yeah. I have not yet missed a meeting, but it looks entirely
possible that I'll end up missing one in November. I mean, my meetup won't have one in November
because my co-host, not co-host, co-leader, is currently unavailable.
And I am going to be, I think, unavailable with everything going on.
But that's pretty impressive that you kept it going this whole time during the pandemic,
though.
Oh, I know.
We've never missed a month like ever since starting the meetup four years ago or whatever.
That's very good.
Yeah.
All right.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com.
And don't forget to leave us reviews on iTunes or subscribe on YouTube.
Joining us today is Joel Falco.
Joel is an associate professor at the University Paris-Saclay in Orsay, France.
His research focuses on studying generative programming idioms and techniques to design tools for parallel software development.
The main parts of his work are the exploration of embedded domain-specific languages designed for parallel computing on various architectures and the definition of a formal framework for reasoning about metaprograms. He is also host of the French C++ meetup, president of the French C++ association, co-organizer of
the CPPP conference, and part entrepreneur, being one of the founders of Code Reckons,
a company focused on bringing people and companies up to date with the best and newest C++.
Joel, welcome back to the show.
Thanks again for having me, Jason, Rob.
Sure.
So, you know, while I've got your unique attention at the moment, do you want to give us a quick
update on PPPP, PPPP, CPPP?
So we closed the call for papers last week, September 30th.
And we are currently reviewing the paper submissions right now.
And we aim to publish some results, a program, soon. We will
probably start, you know, advertising for the conference. We have our keynotes and
stuff like that to talk about. So it should be on track for December. It will be fully online. And we will try to, how to say that,
accommodate our North American audience
by placing talks on the proper time zone for everybody.
So that's also something we want to do.
And yes, so that will be our second edition,
first online edition.
And we will be trying to see
if we can actually get a third edition
next year, but I hope
in person this time. I think
it will be long overdue
for a lot of reasons.
And you were
speaking about the virtual
meetup for the New Zealand
people.
We have been running
the C++ FRUG meetup again
for almost a year now.
We do that on Discord.
We have a Discord server if you want to join
in and follow
the meetups there if you ever
happen to want to
see French people with bad English accents
speaking about things.
Or the other way around.
And you're also very welcome
if you want to give a talk or something
because we like having people from elsewhere
because it's rather easy now that we do that online.
We actually had a lot more participants.
One problem with the meetup was that, you know,
like when it's in person,
you are in Paris
and you know France.
Well, France is basically
the size of Texas.
But, you know,
Americans have this a bit of,
you know, like driving long distance,
doing things,
getting from places to places.
And it's not something
that French people do a lot,
especially above, like, a 100 or 200 kilometer range.
So it was, you know, a very Parisian, in-Paris meetup, and now that we are on Discord, we have a lot of people from all over France.
That's kind of cool. That was, let's say, the unseen benefit of the situation.
But if you are French-Canadian
and you want to participate,
you are very welcome. Or even if you are
not necessarily
a French speaker, we can accommodate
that. Be our guest and come over.
And the entire thing is hosted on
Discord? On Discord, yes. And we have
a special setup so we can actually
manage
Discord's limitation on
how many people can actually have a look at
stuff and so on. So it's
actually okay.
And it's also a trial for
CPPP, which will
also be hosted
mostly on Discord.
Interesting. That is actually really interesting.
Oh, and you said you're going to try to
accommodate North American time zones
a little bit with CPPP.
Yeah. I forgot a P, I guess.
I think I forgot a P.
I did just notice, and I think it's worth
commenting for the sake of our listeners, that
CppCon is
also attempting to accommodate
Europe a little bit.
The talks that would
be really late in Europe
are going to be replayed
the next morning with the
speaker live to answer questions.
That's a great idea.
That's a very good idea.
It's a very good idea.
You have the other side of the coin I was
joking about on my
Twitter a while ago, when I actually got my time slot for CppCon.
It was at 7:45 in the morning.
I was like, ooh.
So it's fine for me, because it's the middle of the afternoon.
But I was like, oh, God.
Because at 7 a.m., you really want to be dealing with this kind of thing?
So we'll see how it goes.
But it's actually very good to be able to interact with people again.
That's the one thing I missed a lot from all these times.
Okay.
Also joining us today is Denis Yaroshevskiy.
Denis is a semi-active member of the C++ community.
He's mostly interested in algorithms
and has done a few things in that area,
such as research and implementation of Chromium's flat set,
a couple of tiny contributions to the libc++ algorithm library,
a few algorithm-related talks,
and one sole paper to the C++ standard
that didn't get consensus.
For the last couple of years, in his free time,
Denis has been implementing STL algorithms portably using SIMD.
Denis, welcome to the show.
Hi, nice to be here. What's the paper you submitted that they didn't get consensus on?
Well, the paper was against accepting flat map in the shape that it was at the time.
It didn't get consensus, but as you can see, we don't have a flat map.
Oh my goodness.
I've re-implemented a simple flat map probably 10 times at this point.
It's such a handy thing
to have in the simple case.
Well, they wanted to do
parallel arrays, and my paper
was against doing that,
especially until we get zip.
So right now, the zip proposal, I think, has gone through,
for C++23.
Yeah.
But I said, so the main point was that, if you know zip,
it has a problem with proxy references.
And the paper was basically arguing that, if we're going to do proxy references
in C++ with all of the downsides they bring,
we should do it with open eyes
and do zip first,
and then maybe do a flat map as parallel arrays after that.
But also I measured,
and it's not actually a win for a flat map
to do parallel arrays.
So, kind of, the paper was about that.
Okay.
Well, Joel and Denis, we've got a couple of news articles to discuss, so feel free to
comment on any of these, and then we'll start talking about EVE and SIMD. Okay, sure. All right, so this first
one we have is, Qt 6.2 has been released, and this is the first LTS release for Qt 6, which is a long-term support release.
So Qt 6, I guess, was a really large change.
I think we covered when that first came out a while ago.
They're now using C++17.
They're using CMake.
And with this release, though,
they brought back a bunch of modules
that were not available in Qt 6.0 or Qt 6.1.
So now if you're using any of those in previous versions of Qt,
you should be able to use 6.2.
Are either of you Qt users?
Not much for me myself.
I don't know, Denis, are you doing Qt things?
No, I'm not doing Qt.
But I looked through the release notes, and there is so much.
It's even more than Boost, when you have C++ libraries that are not just a library or a framework to do things.
But maybe because we don't have package management
or some other reasons,
we have this gigantic monster of a thing
that brings everything together.
Yeah.
Yeah.
I missed the point where they switched to
a proper C++ standard.
I think that's very, very
important.
I mean, maybe some of you
remember the...
What was it, like three years ago or four years ago
when we had the Qt guy at
C++?
Oh, yeah.
That was a bunch of time ago.
One of the keynotes.
Yeah, I guess.
And you could actually feel, it was some kind of, not tension, that's not tension,
but you could see some disconnect between, you know,
the way he was presenting Qt and how it works,
and the way some people from the audience
were looking at it.
So I think it's a great move,
that they moved that much.
The CMake thing is also a great change
because I still have nightmares about QMake
from when I had to use it ages ago.
So no, that's cool.
It's actually moving in a good direction, so,
I mean, I think,
I don't have any opinion about whether it's,
you know, like, the best
UI library or not, but I know it does
the job. It's
consistent. It's,
I mean, it exists, and it
does the work. And I guess that,
for such a thing, that's probably
the main, you know, the main feature you want. You
want something that just works. And the fact that they have these ongoing changes, it's actually
quite good for the library and the Qt ecosystem itself, which, as Denis says, is like a massive brick of things.
So, no, that's the good news that they went forward like this.
So, yeah, maybe I should try to do that at some point, you know,
probably brush up my, you know, like dynamic C++ a bit.
I have a friend.
Sorry, go ahead, Denis. I just want to say that, you know,
in any project that I saw in C++,
there is this whole ecosystem built.
In my case, it was mostly like really big companies
that have their own ecosystem.
But, you know, if you don't have one,
like maybe Qt is a good shout, right?
Oh, yeah, I need networking and, you know,
UIs and all that stuff. You get it all in one,
yeah. Yeah. I have
one friend who
is not a C++ programmer
and I told him that there were a bunch of people
asking for me to do Qt tutorials
on my YouTube channel and I'm like,
it's been like three years since I've touched Qt
and he's like, well, that's a GUI library,
right? And I'm like, yeah.
Sure. It is. It is a GUI library.
It's also a networking library.
And, yeah.
What was it,
what's the name of the one that doesn't do
GUI? Poco. Poco is also a huge
mass of, you know,
let's say,
productivity tools.
I know they have networking and stuff,
but they don't do GUI.
It's like everything except that, yeah.
So,
okay. The one thing I always
found funny is that you end up in
those huge, you know,
companies,
or, you know, like, I don't know,
places where they will happily, you know,
gobble up the dependencies to Qt or Poco or sometimes both.
I don't want to know why.
But then you suggest to them, oh, you should use this Boost thing.
And they are like, oh, no, think about the dependencies.
And I'm like, guys, I mean, you already have,
I don't know, like, I mean, Qt is basically
the Node.js of C++, you know, like,
and you think that's okay.
In fact, it just, you know, like comes with so many things.
It's not even funny at the end.
So I was always, you know, like pondering about
why Poco is fine, Qt is fine,
but Boost,
no, I don't want to do that, it's too many dependencies.
I never really understood this position, but I mean, maybe there was a reason or
something. I've definitely worked on
projects where, you know,
when we had both Qt
and Boost in the project,
it was a lot to
manage. One or
the other is fine,
but both,
and then having things like Boost expecting some version of OpenSSL,
while Qt's expecting a different version of OpenSSL.
And then the whole world just goes to crap.
That's,
that's where the problems happen.
I think we had four different versions of zlib linked into that project at
one point.
It just happens. Once you add four large
dependencies, that's just where you are. All right, next thing we have is this project on GitHub.
It's GDBFrontend, and I'm not a GDB user, so Jason, I might need you to talk more about this
one, but it looks like it's pretty powerful for giving you a GUI on top of GDB. Yeah, the author of this contacted
me and said, hey, you might be interested in this. And I have not yet played with it. But it
looks pretty powerful, if you're not already using CLion or MSVC or something else with a
nice debugger front end. I know, Joel, you do everything at compile time so you don't have any use for a debugger.
Well,
I took a look. It looks very nice, actually.
From the
screenshots and the
demo things, it looks quite
impressive.
I maybe don't
use it that much, but
I know that I always have this
trouble.
GDB is something which is very complicated to teach to people, because, you know, like, okay, let's say that
it's off-putting, because it looks so antiquated when you just, you know, fire it up,
when you have people that are, you know, used to, I don't know, like, IDEs where you click on a button and stuff.
So sometimes it's very complicated
to get people inside it.
But once they are inside
and you teach them, actually, the ropes, it's okay.
So maybe it's actually something that could actually
get more people to jump on GDB.
And I think the fewer people actually debug
with printf, the better it is,
if you see what I mean.
And I know that some people
are just frightened
by the raw GDB things,
so it can actually help people
jump on the bandwagon.
So I'd probably give it a try
because it really, really looks
like very nice, polished,
and, you know,
like well-thought-out
on the way it interacts
with the GDB basic functionality.
So it's definitely something to have a look at, I guess.
But you, Denis, are you a GDB user?
When I have to.
Like, basically, you know, I've seen that there were talks by people from GDB, right?
And they're like, well, we can do all of these amazing things.
And if I manage to put a watch point on a variable,
I'm really proud of myself.
Realistically speaking, I can open a core dump and look at that.
And that's the extent of my GDB mastery.
I should know more, but there is no way I can do UIs.
It's all on the server somewhere. So if I could
do SSH setup, I would
be much better at setting up things than I am.
There's one of those guys from
Undo or something
who has given a bunch of talks
at CppCon on GDB, which I think
might be the ones you're referring to, Denis.
It's practically like watching
a magic show.
I mean, it's like, and now what we can do with GDB with the Python plugin.
And you're like, what just happened?
I thought the one I was referring to was by a GDB maintainer.
I might be wrong about that.
I'm not sure.
There was a short one and then an hour-long one.
Right.
Okay, and then the last thing we have is on Reddit,
linking to a post from Nicolai Josuttis, who we had on, I think, a year ago.
And he is pointing out that this paper he submitted
to fix the range-based for loop in C++23 was not accepted.
And yeah, the C++ committee
doesn't want to fix it. Apparently
it's been broken for 10 years. I didn't
realize there was a problem with the range based for
loop until reading this. It's not something I've
run into. We discussed it very briefly
when we had him on the show. Okay.
Just very briefly, I think.
I wrote that blog the other day.
So like
that specific blog, I wrote it.
So for people who didn't see,
so it's talking about...
We didn't mention what the bug is.
So basically, you have a temporary object
and you try to iterate over it.
So if you just iterate over the temporary object itself, it works.
But if you try to iterate over something you get from a temporary object,
it doesn't work, because the temporary object goes away.
If you have a function that returns something with a map member,
and you try to iterate over that member from a getter, that will not work.
Yeah, I wrote that bug.
And somebody mentioned lifetime extension,
that it should catch it.
It would be really nice.
Yeah, it seems to be the committee's consensus here
is that they want a general lifetime solution,
which Joel seems to be thrilled about.
It's not that they want that.
I mean, lifetime extension standardization
is a good thing.
I have, like, flashbacks of that one guy, or
a couple of people, that came with a very simplistic way to get, you know, like,
the name of an enumerator printed or something, very smooth things. And then it was, okay,
what about having full-blown reflection? So we are again in this situation where we could have a very, you know,
limited, surgical
strike on the problem, and it gets fixed.
And somehow we
decided to go for the
complete, you know, like
world-saving scenario instead
of locally fixing the things.
That's
what I'm, you know, like,
rolling my eyes over, more than the actual issue.
The issue is an issue.
I have been, like, falling into this trap, I don't know, many times.
And each time, you know, like, it's time for a huge sigh, you know, right?
And so I don't even understand why they didn't pick it,
because the fix that was proposed in the paper, if I remember, was rather simple,
because it just changed
the equivalent syntax
that the range-for loop expands to.
for-auto loop expands to.
And it should have been a
no-brainer in my opinion.
So, deal with it.
And especially because
it's not, like, something
that happened for, I don't remember what, recently.
It's not like it's something we just came upon right now.
Say, oh, crap, we forgot about that.
It's late.
We have to think about it properly.
So let's delay.
It's a 10-year bug.
Yeah.
It's a 10-year bug.
So, I mean, well, whatever.
Interestingly, I don't think I've ever personally
seen this in the real world.
It happens.
And sometimes you
do it. I mean, what happens
a lot to me is that you do it,
you know, like you forgot about that,
and, you know, it works on your machine,
and you ship the code, it doesn't work
on the other machine, and you're like, what the heck?
And then, oh, God, yeah, the map from the temporary,
you know, like, or the vector or whatever,
and then you're screwed.
So the local fix for people is not very complicated,
but it's just, you know, like, a bit ugly to have this,
you know, you put the variable before,
and then you do your whatnot.
It's probably sometimes not efficient,
because you probably have to make, well,
if everything goes well,
maybe you just get a move,
but worst case scenario,
you probably have a copy or something,
if you don't, you know,
like, end up doing it properly,
or you are in a situation where
it can't be moved or whatever.
So my opinion is that it should have been fixed
with the surgical strike.
And then later, yeah,
let's have, you know,
like a decade-long discussion about
general lifetime management
and see what we can do
for C++33 or something.
But, well,
yeah, sorry for being so jaded
about the process,
which I am actually part of, so I
have no excuses on that. But,
well, that's two or three,
you know, like, occurrences of
things that just got
decided like that, and okay,
whatever.
That's the
downside of
design by
committee consensus.
That's the
rule of the game, so whatever.
I want to interrupt the discussion for just a moment
to bring you a word from our sponsor.
CLion is a smart cross-platform IDE
for C and C++ by JetBrains.
It understands all the tricky parts of modern C++
and integrates with essential tools
from the C++ ecosystem,
like CMake, Clang tools, unit testing frameworks,
sanitizers, profilers, Doxygen, and many others.
CLion runs its code analysis to detect unused and unreachable code, dangling pointers, missing
typecasts, no matching function overloads, and many other issues.
They are detected instantly as you type and can be fixed with a touch of a button while
the IDE correctly handles the changes throughout the project.
No matter what you're involved in, embedded development, CUDA, or Qt, you'll find specialized support for it in CLion. Use the coupon code JetBrainsForCppCast during checkout for a 25% discount
off the price of a yearly individual
license.
Alright, well, let's start
talking about
EVE, which I don't think we've
discussed before on the show,
but could one of you give us
a little intro to that library?
You start, and then I'll talk about how I came to EVE.
Yeah, yeah, let's do that.
So a long time ago, once upon a time,
I was actually, I mean, Eve is like,
I mean, let me count, 2021.
It's basically the end result of, let's say, 13 years
of, you know, playing around with that.
EVE is actually the name of my first SIMD numerical array library that I wrote at some point,
which tried to get out as Boost.SIMD for a while, and then went dormant again for a lot of reasons, and I went back to the problem in 2018, something like that.
Well, yes, 18, it is.
Because the problem still wasn't solved.
I mean, you have this huge amount of processing power in a regular CPU.
Let's put the GPU outside.
And you have a lot of things.
You have the mythical parts. You have all the GPU outside. And you have a lot of things. You have the multi-core parts, you have all the internals, and you have those special
multimedia instruction sets that are there since 1995, 2006, something like that, that
are supposed to help you write efficient code when it fits the model of CMD, which means that your
data are well laid out,
contiguous, and you do the same operation
on everything, which is basically
what happens when you're dealing with,
I don't know, like image processing, signal
processing,
scientific computing, whatever.
That's the obvious things.
And you have the non-obvious things
because, you know, well,
if you have a very, very long string, it's basically an array of characters, so it's just an array,
somehow, so maybe you can actually, you know, make your string processing faster or whatnot.
And it was always seen by developers as something which is very obscure,
because
the API is
a nightmare, both
in the naming scheme and the way
it works and the fact that it relies on
half macro, half intrinsic.
You never know which one is what
or whatever. And the other problem
is that you probably have
10 or 12
variants just for x86
and God forbid
you need to write code that uses those things
and you have to go on like
multiple kinds of architectures, like, I don't know,
like x86 and ARM if we stay on
mainstream architectures.
And nobody wanted actually
to write that.
Compilers, which are supposed to help, try;
you've had auto-vectorization in most compilers for ages now.
But as soon as it's not what the auto-vectorizer
is trained to recognize, to analyze,
well, you don't get it.
So it was an issue in 2005 when I started my PhD.
It was still an issue when I went back to the problem 15 years afterwards, 10 years afterwards, something like that.
And I wasn't sure it was something we needed to do again, because I was under the impression that current compilers had gotten far better by now.
And it's probably something we don't need, except for very specific cases, and in that case, you know what, I will just write it by
hand. And after getting involved in a couple of projects that were quite performance critical,
and actually witnessing it was clearly not the sole problem, I went back to the drawing board.
So the main idea was to say that whenever you compute on a regular types,
like you do computation on an array of float or doubles,
you have a bunch of operators, you have a bunch of functions
that work on those types, and they do one computation.
And one thing that we actually spoke about and worked on
was to have like a type wrapper that says,
this is not a float, this is a bunch of floats.
And the size of the bunch depends on your architectures.
You don't have to care about it.
And whenever you operate on this bunch of values,
the library will take care of actually figuring out
the best intrinsic to call,
the best combination of intrinsic to generate,
and so on.
So you still write code that looks like,
let's say, normal, you know, scalar code.
The only thing you did is change the variable types,
and if you are doing a loop,
you do the loop in a slightly different way.
And by actually using a bunch of metaprogramming tricks
and some other template-related shenanigans,
you could actually hide a lot of things,
including the fact that you can detect
which architecture you actually need to target,
and then you can decide what you want to do with your bunch size,
and so on and so forth.
And so we started working on that back again, 2018,
and I was
using it as a way to say,
okay, now you know that C++17
is quite well supported
on multiple compilers.
Let's put it in C++17
so I can actually get access
to the new features
and so on and so forth.
And I was mainly
keeping a hopeful eye
on if constexpr, which was
looking like a perfect
dream operation for
what we wanted to do. And so we started
writing a C++ version of that,
which was a mishmash of
what Boost.SIMD was in terms
of design and what the old
Eve was.
And then we started playing around with C++17.
One thing I quickly learned was that you cannot just take a design from ages ago,
change the standard number, you know, and sprinkle some stuff on top. You need to actually rethink the design
with the new features in mind.
And I learned that the hard way.
So we started actually writing, you know, like,
okay, can I do that actually?
What if I do this if constexpr trick
in this Lambda thing and so on and so forth?
And we started to get new, better design.
And at some point, I don't remember exactly when,
it's probably late 2019,
we started having
partial support for C++20
in GCC, when it
was still, you know, like,
-std=c++2a, something like
that. And I was like, oh yeah, you know what,
we should make a branch and try
to see if the new C++20
concept things and so on
actually help us.
So whenever it's, you know, like actually supported,
we already have, you know, a base to work with.
And six weeks later, the conclusion was,
okay, you know what?
Let's stop C++17 and jump on to C++20 right now
because it's so much easier.
And we basically divided the code size by two.
Wow.
Something like that.
By moving a lot of, you know, complicated, you know,
let's do a constexpr function that evaluates the thing
so I can put it back into a trait that tells me
if I can call this or that, into, you know what,
let's just make a concept and put it into the template stuff,
and thanks for playing, you know.
And it was
easier to
how to say that, to
express what we wanted to do
this way. And
I know Denis will love it
when I say that. And
the compile time was so much better.
It was so much better than before.
You know, like, let's say it's this way.
And so we decided to just say,
okay, anyway, currently it's not something
that is, like, complete anyway.
We have this, you know,
we have this chance when you do research
that you can actually say, you know what?
I will just scrap that, start again
and use these bleeding-edge things,
and we will see later who wants to use that anyway.
So we took this opportunity to rewrite again a lot of stuff.
As I said, we probably dumped like half the code we wrote.
And basically what was left was the actual, you know,
math things that compute the functions. And
everything else was completely, you
know, stripped down and rebuilt
better. And
that was basically when you came in,
Denis, I guess, around 2020,
something like that, end of 2019?
I don't remember.
Something like that, yeah.
And so Denis just pinged me on Slack
and said, I don't remember
what it was for, actually. At the beginning, I was doing a talk. Oh yeah, yeah, yeah.
Okay, so, yeah. Right, so if you look at anything online that talks about
SIMD, right, they will say: if our data is perfectly aligned and divisible by our register size and the stars align and all of that stuff, then you can, you know, add two vectors together and get a third vector.
And in the meantime, you know, I know that Sterlain exists, that is completely vectorized.
I know and doesn't require, your strings don't have to be aligned to anything, right?
There are papers that say, hey,
we should have a vectorized sort,
right? We should have
vectorized all of those things.
And I wanted to know how it's done,
right? So I basically... I decided...
I bugged a bunch of people.
I bugged Bryce, who said,
go to GPU,
what do you want that for anyway?
I bugged Gašper, I bugged a bunch of people, but
the answer was on Stack Overflow
and
there is an amazing community on Stack Overflow.
Shout out to Peter
Cordes, most of all. Anybody who
asks anything about
x86 on Stack Overflow will find his answer.
Alright,
there are some other amazing people.
There is a Thiago, I don't know his last name,
on Slack, who was also a massive help.
And basically,
I'm just trying to remember people
who are really helpful.
So, yeah, and I learned all of those things,
and I gave a talk.
That was my goal.
And in the middle,
so I went online to look for a SIMD library. I saw a talk that was my goal and in the middle um so i went online to look
for a cindy library i saw a talk by jefferson astmus i think i'm sorry for butchering your last
name who talked about cindy rappers libraries right but basically what gerald talked about so
you have a class that says this is my register of integers let's say and i was looking for a library
like that to do my talk where i want to write Sterlen. I want to write, you know,
sort maybe,
or the algorithms I did was Sterlen,
reduce inclusive scan and remove,
right?
That was the algorithm of my talk then.
And so basically he talked about this libraries.
I looked at a few of them.
I looked at the VC,
which is probably the most popular one at the moment.
It's a basic forest DCMD. I the moment. It's the basic for STD, SIMD.
I looked at tsimd, xsimd,
and they scared me.
They look just very
complicated, and if you open the
examples, it's like all of that
metaprogramming.
For example, VC is
probably the most mature of all of them,
but they try to
do a lot of things for you
and you kind of lose.
I think that, you know,
probably you have the same experience with Eve,
to be honest.
Like if you open EVE
and you try to dig through it,
it's probably also scary.
Anyways, and then I stumbled upon Joël,
just on Slack,
and I asked him,
hey, you were doing this thing with SIMD.
If this is what you're doing,
I'm giving a talk that builds on top. Do you want me to build my talk on top of your library? Because it's kind
of a cool use case: can you use a library to build a find? And this is how we started doing things
together. So now, I gave my talk, and I have ported all of my algorithms between
all of x86 and ARM.
So, for example, we now can take
a remove and vectorize
it on any x86
and ARM, and on my machine, for example,
for chars, if you want
to remove, let's say, zeros from an array of chars, we can
do it up to 25 times faster
on a single core
for a thousand bytes.
It doesn't have to be a lot.
Yeah, and so
we ported all of that. We also support things
like the last effectively
cool thing we did was support
of parallel ranges. So, for example,
right now you can do inclusive
scan of complex numbers.
So you have one vector with the real parts, one vector with the imaginary parts,
and you apply inclusive scan on all of that. So yeah,
that's my story with it. So now I'm an active contributor doing awesome stuff.
When you said standard algorithms, is it algorithms that look like standard algorithms,
or does it somehow actually work with the actual,
your standard library standard algorithms?
So, okay, you can pass it a standard vector.
That works.
But they're our algorithms, right?
So, like, the implementation of inclusive scan
is our implementation of inclusive scan.
And even the interface looks slightly different
to the standard one.
So specifically,
for example, for inclusive scan, you need
to know what zero for your plus
is. So
you have a plus operation
and a plus operation
needs to know its identity element.
There's a thing you add with anything
and you get the same value.
And it's different from the base value you want to inclusive scan over.
People usually mix them up.
You need your base value, and your operator must be paired with an identity element.
And currently the best thing we found out is that you actually pass a pair
of the identity element
and the actual function,
and it just works.
I don't know if that's super confusing
and a lot of information.
No, that's fine. I'm just trying
to like, I'm thinking
like a little bit more
the use case, like, okay, you're trying to sell
this library to me right
now. Theoretically, can I throw, like, a vector of strings at it, and it's just gonna say,
you know what, I can't do anything with strings, that's special, so I'm just gonna pass
on that, you know, whatever, I'm not gonna try to vectorize it. But if I pass a vector of ints
to one of your algorithms, it'll be like, okay, now I've got your back,
it's going to be 300 times faster than it was.
No, it's going to be 8 times
faster, but...
No, that's the limit.
On my machine, it's going to be 8 times faster.
No, we will not allow you to pass in a string.
It will say, I'm not going to
compile, and I'm not sure.
But you can do, if you have two vectors of integers,
yeah, we can do that.
Or if you have structs.
So, for example, one of the things we have an example of
is like a data-oriented design system,
where, you know,
Joël wrote bouncing balls
that are not terribly correct in terms of semantics, but we wanted
to showcase the example.
They are not balls, but
they bounce.
They are characters that are
under gravity, and all of that is vectorized.
It's not the most useful application
of vectorization possible, but
it looks like an entity component system.
Interesting.
And the one thing we wanted to do,
and that was part of the discussion,
the reflection, the thought process we tried to have at some point,
was: okay, everybody and his grandmother
has this "let me wrap the whatever SIMD type behind a box
and you can play with it".
And we wanted to go further than that
because most of the time it just stops there, actually.
And as soon as you have something complicated,
yeah, you have the bits and nuts to do the things,
but it's not trivial.
So what we wanted to do at some point,
and for a long time, actually,
was to actually have this support for:
you tell me you have a struct of things,
and you want to make this an array of structs,
and you want to vectorize operations on this struct.
And most of the time, when you have to do that by hand,
it's quite cumbersome.
You need to do this structure-of-arrays thing and so on.
And we wanted to have something that can just say,
okay, okay, just describe me the structure
and I will deal with it.
And whenever you want to vectorize this vector of,
let's say, balls or polygons or, I don't know, pixels,
we have a protocol you can follow,
and we understand it, and then we can vectorize.
Because most people don't want to deal with having, like,
10 or 16 arrays that, you know, just go everywhere,
because you lose this data structure feel, okay?
And it's easier to think about the structures
than just a bunch of values.
So we wanted to do that.
And the other thing we kind of went to,
even if it wasn't planned,
when we started doing the algorithms,
at some point, it became evident that,
okay, we can pass a ready-to-use algorithm to people.
You have this array of floats,
and you want to do the sum or the reduce.
You can do that.
You have this vector of whatever pair of int,
and you want to do a find, whatever.
But it happens that a lot of time people,
they just don't want to do that.
It looks like they need a reduce,
but at some point they need to do something else.
It's like a find, but...
And so what we wanted to have at some point was:
can we actually write all these algorithms
in a way that we have, you know,
basic blocks that you can use as an advanced user
if you really, really want to make
that slightly different algorithm
and you don't want to manually deal with
all the things we need to deal with
with the vectorization.
We actually had a case like that.
What's the name of the guy?
Max?
De Marzi, I want to say.
I can look him up.
He was the main contributor to the RageDB project.
He was making a series of blog posts about the fact that he wanted to do some performance work in his database implementation.
And at some point, he basically stumbled upon the fact that the next step in his work is basically,
well, I need to vectorize that if I need to go faster.
And, you know, he happened to stumble upon Denis.
And Denis, basically... yeah, you wrote him
some kind of fork. Yeah, I'm sorry,
I cannot find...
I will search it.
Yeah, so, yeah,
basically, so this is one of the
examples we're going to show how it's done in the talk
we're giving, is
you have an array, you have a predicate,
and you want to collect all the indexes
that match the predicate, right?
So just get a vector of indexes.
And we can vectorize that.
And it's a custom.
You need to write a custom loop that will do it.
But he claims that he got more than five times the requests per second
in his database.
Oh, wow.
Okay, so you've got all the basic building blocks
for building SIMD things in your library,
and then you've got...
What's that?
Not all, but...
Well, I mean, you can build SIMD things,
and then you've got higher-level building blocks, it sounds like,
and then you have a selection of very high-level algorithms
that you've implemented.
Is that accurate?
Okay.
And I can just drop this into my program.
What?
Sorry, Jason?
I mean, the, you know,
classical algorithms you are used to if you do C++.
Okay.
We have some of them.
I'm working on it.
Yes.
And somewhere in the, like, four to eight times
performance improvement depending
on your architecture and the size
of the things that you can vectorize.
Yes.
So, well,
the speedup can be smaller
if the scalar code hits its perfect case. So probably
the worst will be something like:
if you do a remove,
which is probably one of the more complicated algorithms we have,
and your branch predictor is perfect for the scalar version,
then we'll be about the same.
At this point, I don't think I know any case where we're worse than the scalar code.
And sometimes we're massively better.
That's cool.
And the important thing to understand about that is that it's single core.
That means that you are still on one core,
and you have this two-digit speedup.
That means that you are currently saving
whatever the extra speedup of multithreading
should or can bring you to put it on top
if you really need that.
And I mean, let's take a very conservative 10x speedup.
And now imagine... it's conservative, I mean, it's conservative.
One order of magnitude, okay?
Let's say it like that.
So you have this 10x thing, okay?
And the idea is to say, okay, now,
I don't have it here, okay, I should put my marketing hat on,
but basically that means that
if you are running actual code right now,
you know, like on a real big system
and you need to get 10 times faster,
what do you want to do?
Do you want to just change your code and get 10x faster
or do you need to allocate 10 more times resources
to get more machines or more cloud time or whatever.
That's basically where the thing is.
And that's something that people don't actually take advantage of
because the current state of
"I need to write complex SIMD code" is abysmal.
So you have this one order of magnitude of performance
that just go to the drain
because nobody wants to do the effort to use it.
You have to, like, your code has to be structured properly, right?
So, like, you have to already get to the point where the problem is the loop, right?
Or if you got to the point where your problem is a loop,
and you have your parallel arrays, and now you want to get faster, right,
something like that. I think the Unity project has also been doing this. I saw at least a talk where
they said that you can say, instead of a vector, like an
SoA of that vector, and they're going to use, like, SSE4, which is one of the extensions, to
vectorize your code, at least the basic stuff,
but that's already really good.
We also have an SOA vector,
and we also kind of
try to give you the same experience
in a C++ library.
Jason asked about,
can I take that into my project.
A couple of caveats.
So just off the bat,
you need to compile as C++20, right?
That's a library requirement.
The regular thing about SIMD applies:
it's a processor extension.
You need to compile for a specific
architecture, right?
Or at least a specific
architecture, right? You need to understand,
okay, I'm targeting, let's say,
you know,
processors from 10 years ago, and like,
this is a minimum requirement.
Right. Or you need to do
the horrible, very difficult thing and do
the dynamic dispatch, which is what your
libc will do. Right. So libc
will have an implementation
in your libc on your local machine,
and it will use that one. So if you
call... correct me if I'm wrong,
because it might be wrong,
but it will pick up the, you know,
system implementation on your machine
for a specific architecture.
So if you say compile for SSE3,
then you're not going to get any kind of runtime behavior.
You're going to get SSE3
no matter where you ship that binary to.
Yes.
Yes.
There is,
so sometimes people do...
People just built in the dynamic switch.
For example, Microsoft STL implementation
nowadays provides you with some vectorized algorithms
out of the box.
If you hit the specific type,
like a reverse will be vectorized.
And they do it...
They have a dynamic sort of switch.
So basically, they detect what's your runtime architecture,
and they will select the correct implementation of reverse.
It's a really cool piece of work,
but it only works for specific types, if you hit them,
and we can do all the other cool stuff.
Right.
And what platforms are supported right now?
Currently, mostly x86, up to Skylake.
Visual Studio, GCC, Clang.
GCC, Clang, no problem.
Visual Studio is, how to say that, in the works.
I'm still fighting against the fact that I'm sure I'm writing C++20
that Visual Studio should understand,
but it probably doesn't understand it anyway.
I did some trials on the Clang-based Visual Studio things.
Yes.
And it was okay.
I mean, it compiled.
I didn't have time to do any actual code gen investigation,
but it compiled most of what we wanted it to compile.
I mean,
the thing that does not compile on
Visual Studio is basically
two things. Stuff that should be
in the standard library, but apparently
are still not there for whatever reasons.
But usually it's easy
to fix.
That's the easy thing to fix.
The other easy thing to fix is that you've written
proper C++20,
but it doesn't want to compile it because, I don't know,
like it's confused or something.
One thing that comes up a lot of times is when you do, you know,
a fold expression in a context where the syntax is a bit complicated. Like, you want to do a plus
dot dot dot
in parentheses, with one
variadic type
divided by another
one, which is actually other stuff
inside another template.
It says, okay, I don't understand.
And those are actually also easy to fix.
It's a bit ugly in
our code, because you need to, you know, slice the thing
so it understands every bit, and then you do the thing.
But that's okay.
And then there is a bunch of stuff
it just doesn't want to compile.
And I have no clue why,
because when I do something similar,
like in Compiler Explorer, it compiles.
So it's probably a problem of context, you know,
like, I don't know, something is not parsed properly.
It's basically this.
We are basically fighting
the parser or
something. So that's a bit of
not very glorious work,
but it has to be done.
I really want to be able to get
Visual Studio supported, because
that's a massive
segment of people that could be interested.
And I know that, for example,
from past experience
with Boost.SIMD and with the old EVE,
but more recently now, as Denis says,
there are a lot of things that Visual Studio is able
to vectorize very well, for whatever it does.
And I know that we have some benchmarks
where the success of the benchmark on Visual Studio will be,
we go just as fast, because they just have this one perfect, you know, vectorized, I don't know, logarithm or whatnot.
And it just happens to be called in a perfect loop and so on.
So, yeah, so Visual Studio is in the works.
Clang and GCC will require the latest versions for now.
The plan is that, once they both support modules correctly,
it will be our baseline version, and we'll start actually modularizing the library,
which is kind of done right now, but just at the physical level,
we basically know where the stuff are going to be sliced,
but it's not physically done.
We also support ARM, mostly the 64-bit parts.
We probably support the 32-bit ARMv7 stuff,
but I think that we have some glitches in the code
that try to call the
64-bit stuff when they should not, but that's
something we... We have an issue on that, I think.
That's an active issue, I guess.
What kind of... Like, we are pretty
close on that. Yeah, yeah. I mean, it's
probably like a bunch of...
I mean, a dozen of things to fix.
The thing is that for a very
long time, I mean, you can
right now go on the internet and you can
fetch what they call
the Intel Intrinsics Guide,
which is a massive web page
where you type something and you have
all the variants of the
x86 intrinsics that do the thing
with whatever architectures
what not it's a very very useful tool
and
I mean, if I
wanted to make a bad joke
EVE is basically
a C++
version of the Intrinsics Guide,
because basically
we have the page open, and
we need to do that... what's the intrinsic?
I mean
it has to be done.
It's an odd set, come on.
No, I said a bad joke.
No, I mean, no.
But the basic thing is,
as soon as we go up, it's better than that.
And for a very long time,
if you wanted to get documentation
on the ARM CMD things,
it was quite a challenge
to find the proper version for the proper architectures
you wanted to target.
And now, they basically have the equivalent of the Intel pages for ARM, and
it's very, very easy now to find where we did things wrong, or to find intrinsics
we didn't even know existed before.
So it's a bit of a whack-a-mole thing because you need to fill all the holes you can find
in the thing we should be supporting and we forgot about or we didn't know about and so on.
We have a very, very unsupported support, if I can say that, for PowerPC.
It's in there because I had the code back from my PhD.
You said PowerPC?
PowerPC, yes.
Okay.
They do, yeah, yeah, yeah.
This one thing that guy over there probably has in the back.
But, I mean, the code was there since my
PhD, so I just recycled it, because, well, it doesn't change that much, if you see what I mean.
I mean, POWER8 and POWER9 added a bunch of things. The main issue with
IBM is that, contrary to ARM, which has a very good cross-compiler and very good QEMU support,
so we can actually do emulation to test,
PowerPC is like you get that one version with that one version of GCC,
and it's very complicated to get the other one.
So it's more like we don't have the resources to do it.
Because all the systems work the same anyway.
So that's there because it's there,
but our main
targets are x86
and ARM.
You mentioned the C++20
requirement and I think, Joel, you mentioned
being a big fan of if constexpr.
Dennis, any other C++20
features you want to highlight
that the library uses?
I want to complain about concepts if that's cool.
Sure.
It's massively
helpful.
If you're an expert, you cannot write the library
without concepts, whatever Joël says.
It's a lie.
With concepts, there are a couple of things that people
might need to learn.
First of all, perfect forwarding
and concepts. If you all, perfect forwarding and concepts.
You know that with perfect forwarding, when you do the deduction, you might get a T ref as the deduced type, right?
You might get a T, or you might get a T ref.
Like, let's say you pass a vector; your type might deduce as vector reference.
And it means that the concept will be applied to that vector reference, and not to the vector,
right? Which sometimes leads to very unpleasant situations, where basically... I don't know
what to do about it yet, right? But your concept might not work, even though that's what you meant. At
the moment, I'm just removing references in all the concepts, which I'm not sure is the correct call.
Maybe it should be remove_reference in every invocation.
The second thing to watch out for is that the concept can fail silently,
and you get a different overload. For example,
if you miss a typename, right, then the concept just fails,
and you get a different
overload, and you didn't even think about it.
Wow.
In our case, we have
overloads that are more powerful, and then
overloads that are less powerful, and we get
a less powerful overload, and it just works
silently.
Yeah, but
also, people say, you know,
good error messages. Not from what I have
seen, for the most part, for our code at least. Not yet.
No, there's a fundamental reason, right?
When you have a template, you have a failure.
That's one failure.
When you have a concept and let's say you had the restriction, you had an overload set.
It was supposed to select overload number three, right? It didn't.
The concept check failed because something
wasn't right. You get a failure
for each of your overloads.
And that can be quite deep.
And so
that is a bit painful. But still
massively powerful tools that we
use all over the place.
Yeah, concepts and if constexpr
are, I
think, the two things... I
don't know what I was doing before,
actually. Let's be honest.
More than 100%.
Yeah, but
we have this very fine stuff
where basically what we do is that we use concept
to discriminate on the architectures.
So it's a very high-level
discrimination of the overloads. And then
when we know that it's that,
whatever, I mean, that's for x86
in this case. Okay, right.
And then inside, we have
those massive, like, I
should actually call them, like,
some kind of constexpr switch,
if you see what I mean, those huge nests
of if constexpr that just go over
every case you can encounter
for this version of the function.
And they'll just say, oh, yeah, just return that or just return that.
And the thing is that it's very helpful
because what we had before, we actually had an overload
for each of those if constexpr things.
So if I just take abs, absolute value of x, that means that
on x86, we
had something like
18 overloads, something like that,
when you factor in all the architectures.
And how
to say that?
5 overloads, it's okay. 6 is okay.
18 starts to be just ridiculous,
you see?
So now that we can just say,
okay, this is this case, and you go there,
and then inside, let's pick out which subversion you need.
So you have less overloads, compile faster,
because you just have to do not this one, not this one, this one. And then inside, you just pick up the correct if constexpr and be done.
It has its limitation,
because sometimes the if constexpr nest starts to look like a mess, let's say it this way.
And then sometimes you want to try to find the best angle
to slice your if constexpr so you have the minimum amount of them
and the nesting is not too deep or whatever.
And one thing we had to
fight against
with if constexpr is,
sometimes you do:
if constexpr something happens,
then if constexpr
something else happens,
do this, or do the
fallback. Else,
if something again different happens,
do another thing.
Else, do the fallback again.
And the problem is that if you want to factor out the fallback,
all the conditionals just look ugly,
and you're stuck between duplicating code
or having stupidly complex conditionals.
And so we have to navigate that.
But most of the time, it's okay, especially on ARM
and the latest x86, because it's just like, okay, we just have to basically make a mapping between
the type, the size, and the actual name of the intrinsic, because nobody except PowerPC had the good idea to have intrinsics which do
not contain the type
and the size in the name of the function,
which, when you write
generic code, is like
a huge mess, because you just have to
enumerate all the cases.
And even if you look at
ARM, which is very fine, because
if you want to do a vectorized
plus, it's probably
vadd, and then
something that looks like, you know, the
stdint
name, like vadd_s8
or vadd_u8, stuff like this. So you can actually
find what you need, but you
still have to do these 10 or 12,
you know, sub-stubs
that don't bring anything to the table
except being this mapping.
It helps
also in the algorithms
because we basically have to
rebuild the basic concept for
ranges and stuff. Tell me if I'm saying
stupid stuff, Denis.
I guess it is.
It was
the same thing, I guess.
Algorithms without concepts, very complicated.
Yeah, so it's like, this is a range, and this is, you know,
well, mostly it's a range, it's an iterator,
we have a different notion of an iterator, things like that.
All right.
Well, you're going to be doing a talk at CppCon,
and obviously listeners should go and check that out.
But it was great having you both on the show today, and everyone should definitely go and check that out. It was great having you both on
the show today, and everyone should
definitely go check out EVE. Thanks, guys.
Yeah.
Can you shout out Jean-Thierry?
I don't know how to pronounce it. We have a third person
who works with us.
Yeah, of course.
I will be
beaten with the rod.
Yeah, so
we
are actually three people working on the library.
Jean-Thierry Lapreste is also working with us.
He's one of my past life thesis advisors.
And we kept working together on that.
And he's basically the guy that does the math.
We have a lot of math functions here.
Some I know what they do, some I don't.
And he is very good at numerical stuff.
So half of the library wouldn't exist if he wasn't there.
So yeah, big shout out to Jean-Thierry for helping us on this part.
All right.
Thanks, Joel.
Thanks, Dennis.
Thank you very much.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter. You can
also follow me at Rob W. Irving and Jason at Lefticus on Twitter. We'd also like to thank all
our patrons who help support the show through Patreon. If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast. And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode is provided by podcastthemes.com.