CppCast - HPX and DLA Future
Episode Date: July 22, 2021
Rob and Jason are joined by Hartmut Kaiser and Mikael Simberg. They first discuss some blog posts on returning multiple values from a function and C++ ranges. Then they talk to Hartmut Kaiser and Mikael Simberg about the latest version of HPX, how easy it is to gain performance improvements with HPX, and DLA-Future, the distributed linear algebra library built using HPX.
News: An Unexpected Article About Our Unicorn: Who Is the PVS-Studio Mascot?; How to Return Several Values from a Function in C++; C++20 ranges benefits: avoid dangling pointers
Links: HPX; HPX on GitHub; HPX 1.7.0 released; DLA Future on GitHub
Sponsors: C++ Builder
Transcript
Episode 309 of CppCast with guests Mikael Simberg and Hartmut Kaiser, recorded July 21st, 2021.
This episode is sponsored by C++ Builder, a full-featured C++ IDE for building Windows apps five times faster than with other IDEs.
That's because of the rich visual frameworks and expansive libraries.
Prototyping, developing, and shipping are easy with C++ Builder.
Start for free at Embarcadero.com.
In this episode, we discuss returning multiple values from a function.
Then we talk to Hartmut Kaiser and Mikael Simberg.
Hartmut and Mikael tell us about the latest version of HPX and the DLA-Future library. Welcome to episode 309 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing just fine. You got anything going on?
You make any CppCon submissions? This is the last day.
This is the last day.
The last day is in the past, but the day we're recording is the last day.
Yes, the day this airs will be different.
I did make one submission.
Okay.
Hopefully it'll get accepted.
That's all I know at the moment, I guess.
I did also submit to teach a class at CppCon.
Oh,
very cool.
So we'll see what happens there as well.
I'm sure we'll find news.
They're usually pretty quick to do approvals for submissions, right?
Well, I don't know. The deadline is today, and I'm not saying it'll be immediate.
Yeah, no, people have to make travel arrangements and stuff if they're going to come.
Right. Okay. Well, at the top of every episode, I'd like to read a piece of feedback. This week we got a tweet from Antonio Nutella commenting on last week's episode, saying, "This was so interesting to listen to. Thank you all for the episode."
And yeah, it was great talking to Ivica last week about performance.
It's been a while since we kind of just talked about the subject of performance.
We kind of talk about it a lot without talking directly to it.
Right, right.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube. scientist at LSU's Center for Computation and Technology. He's probably best known through his involvement in open source software projects, such as being the author of several C++ libraries.
He's contributed to Boost, which are in use by thousands of developers worldwide.
His current focus is research on leading the Stellar Group at CCT, working on the practical design and implementation of future execution models and programming methods. These things are tried and tested using HPX, a C++ standard library for concurrency and parallelism.
Hartmut's goal is to enable the creation of a new generation of scientific applications in powerful,
though complex environments such as high performance and distributed computing,
spatial information systems, and compiler technologies. Hartmut, welcome back to the show.
That's quite a mouthful, right? Welcome back.
How are you doing, Rob?
Hey, Jason.
Nice to see you again.
Yeah.
Well, we may talk about this a little bit later, but the last time you were on the show
was the first time I was co-hosting the show.
Which was exactly 300 episodes ago, actually.
Wow.
That's great.
And congratulations to you guys for making that happen for the last six years.
That's quite an accomplishment on its own. Pretty crazy.
But also joining us today is Mikael Simberg. Mikael works at the Swiss National Supercomputing Centre (CSCS) as a scientific software developer. He has a degree in operations research and computer science.
He worked in industry doing embedded programming before joining CSCS in 2017.
At CSCS, he works on improving HPX itself
and helps users integrate it into their libraries and applications.
Mikael, welcome to the show.
Thanks for having me.
Nice to be on the show.
It seems like we have four time zones covered today, then.
Yeah, that's true.
Slightly unusual.
Yeah, but we've got it working.
This is still sort of the
manageable time zones.
For me and Hartmut, this is
quite a regular occurrence.
Right.
It works. Awesome.
I have to get early out of bed and he has
to stay longer.
So just to make that happen.
Exactly.
Actually, I'm just kind of curious.
Where is the Swiss National Supercomputing Center?
So the actual supercomputer, the headquarters are in southern Switzerland,
on the border to Italy, in the canton of Ticino in Lugano,
the sort of sunny part of Switzerland.
I myself, I'm in
Zurich,
a bit further north, which is also
sunny at the moment, but the weather
tends to be better down south.
Yeah, most people at
CSCS are down in Lugano, and then
we have a small 30 people
or so up here in Zurich
who are mainly software developers.
I don't think I've spent any time in any of the Italian-speaking Canton region down there.
There's only one.
There's only the one? Okay.
There's only one, yeah.
But it's really beautiful down there.
I'm sure it is.
If you ever get a chance to go down there, and whenever I get a chance to go there for work, it's usually a nice opportunity to just get to, you know, get the sun and get to see some colleagues.
Right.
And you get both the good side of Switzerland and the good side of Italy without having the disadvantages of either side, right?
It's kind of unique.
Exactly. Exactly.
We got a couple news articles to discuss. Feel free to comment on any of these
and we'll start talking more about HPX.
All right. So this first one is an article
from PVS-Studio, and they've been sponsoring the podcast for a while. I really appreciate that. But this article is all about their mascot, and we always see the mascot in the blog posts that are coming from PVS-Studio. It was kind of fun to read this article about its genesis and evolution over the years that they've been blogging and using their unicorn mascot.
Sorry, I still don't get why it's a unicorn. That's kind of still not quite clear. I guess it doesn't matter. If you like it, then that's all that matters.
Yeah, I mean, they mentioned at the very top, they just used it originally as like a free piece of
clip art that was out there and
people just loved it and they kind of made it their own and I guess that evolved it.
A unicorn vomiting rainbow conveyed what they wanted to convey.
Yep. I suppose I have to admit, I don't think I've ever used PVS-Studio. And I was first wondering why you have a post about how the mascot developed and so on.
Why does it belong on the show?
But I don't know if I'm reading too much into it.
But then when I actually read the post, I was like, well, you know, when you're writing software, these small things actually matter and can have an impact on how people see your product or project.
So, yeah, as I said, maybe I'm reading too much into it, but I thought it was fun to think about it like that. They also write a lot of articles about the kinds of bugs that their software finds. And those articles, besides the fact that they've been a sponsor,
those articles tend to end up being articles that we discuss
because often it's interesting.
I think the one that stands out to me the most is, what is it, shoot, the last copy rule, I think, is what they've dubbed it, where if you copy and paste a block of code like four times, you always make a mistake on the last copy. And they showed example after example of this, where you correct the things you're supposed to change the first three times, but on the fourth time, or whatever the last time is, that's where you make your mistake. Yeah, anyhow, interesting stuff.
So PVS-Studio can actually
detect things like that.
Copy-paste
bug detection is one of the
things that they can do.
The first
kind of thought I had when I was
reading that article after you
sent it is, damn, we really need
an HPX logo.
This can be your problem to come up with something.
You just can't use a rainbow-barfing unicorn.
We'll work on that, Hartmut.
We'll work on that.
So the next thing we have is a post on Fluent CPP,
and this is how to return several values from a function in C++.
And when I was reading this, I thought we'd use this as an article on CppCast a few weeks ago.
But I don't think we did.
I definitely was looking at it, though, a few weeks ago.
Yes.
It was dated July 9th.
It's a good post and, you know, talks about different ways to return multiple values from a function.
It definitely recommends against using a pointer or reference as an output parameter and modifying it within the function body.
It's definitely a bad practice.
But it talks about using a struct or using a pair or tuple.
I'm turning it that way.
Is it tuple or is it tuple?
I'd say tuple, but it's probably tuple.
I say tuple as well, but I never know what's right.
Yeah, I very much prefer named structs. And I think tuples, or tuples, whatever you want to call them, are nice for prototyping, but I don't think they really belong in production code in the end. The only place is if you really have just a pack of arguments, parameters which are really just first parameter, second parameter, and so on, and then a tuple is perfect. But if you can give names to them, I always prefer doing that.
Yeah, tuples are very useful in generic programming, especially pre-C++20 when you have to capture a whole template parameter pack in a lambda and things like that. And they're very useful for prototyping, as Mikael said. If you do a bit of Python programming,
then you kind of get used to that, right?
You just return x, y, z,
and then you unpack it on the other end,
which is very convenient.
So in that context, it's certainly very similar.
But for production code,
I'd rather prefer to have named structs,
so I completely agree with Mikael here.
I think the one option that wasn't covered in the article that I'm aware of, which is admittedly questionable practice, would be to use an auto return type on the function and declare a local struct that has the named members that you want. If you say, well, it's only got a use in this one particular case, then you're kind of relying heavily on the person using an IDE, so that they can, you know, dot-member whatever and get the names of the things back out, because discovering that is harder. But it's just another tool.
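To make those options concrete, a small sketch; the function names and fields are made up for illustration:

    #include <string>
    #include <tuple>

    // Option 1: a named struct -- the names document the result at the call site.
    struct ParseResult { bool ok; std::string message; };

    ParseResult parse(const std::string& input)
    {
        return {!input.empty(), "parsed: " + input};
    }

    // Option 2: a tuple, convenient with structured bindings but nameless.
    std::tuple<bool, std::string> parse_tuple(const std::string& input)
    {
        return {!input.empty(), "parsed: " + input};
    }

    // Option 3: the auto-return trick -- a local struct still gives named members,
    // but callers have to discover them through auto and their IDE.
    auto parse_local(const std::string& input)
    {
        struct Result { bool ok; std::string message; };
        return Result{!input.empty(), "parsed: " + input};
    }

    int main()
    {
        auto [ok, message] = parse_tuple("hello");  // structured bindings
        (void) message;
        auto r = parse_local("hello");              // r.ok, r.message
        return (parse("hello").ok && ok && r.ok) ? 0 : 1;
    }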
Yeah. Okay. And then the last post we have here is "C++20 ranges benefits: avoid dangling pointers." Jason, do you want to tell us more about this one?
So this is interesting. It's one of the... sorry, I lost the link myself. This is Andreas Fertig; he's working on a new book, and this is one of the chapters from it.
But this is just about avoiding the problem if you're trying to iterate over some sort of container or range or something that is a temporary. And this is actually a problem with C++ range-based for loops as well, one that is rarely hit by people. Have you all seen that issue where, if you try to do a range-for over a value that's returned from a function that's returned from a function, you end up with a dangling object?
So anyhow, one thing that stood out to me about this article
was, well, so Andreas is going over what standard ranges did.
But one of the things that stood out to me
is that there's usually a mentality
of that your compile error should be as soon as possible,
but the technique taken here
is to make the compile error as late as possible
so you can give the most descriptive error.
I've not seen that technique before, personally.
That's interesting.
Yeah.
Didn't realize that when reading the article.
Yeah, I hadn't thought about that either before, but I like the technique. And in a way, you're
just tagging this type for later use. Okay, this can only be used in particular contexts
because it gets a different type than what it would originally have had.
And I think that applies in many other places as well.
If you can put additional metadata in your types,
you can actually avoid quite a lot of errors that way.
It's always a trade-off.
It's more work for you when you're defining your APIs,
but it can pay off.
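A minimal sketch of that late-error technique with just standard C++20 ranges; the helper function is made up:

    #include <algorithm>
    #include <ranges>
    #include <vector>

    std::vector<int> make_values() { return {1, 2, 3}; }   // returns a temporary

    int main()
    {
        // Calling the algorithm on a temporary (non-borrowed) range compiles,
        // but the "iterator" comes back tagged as std::ranges::dangling.
        auto it = std::ranges::find(make_values(), 2);

        // The error fires only at the point of misuse, with a message that
        // names std::ranges::dangling:
        // int v = *it;   // error: no operator* on std::ranges::dangling
        (void) it;
    }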
Right.
Okay.
Well, Hartmut, as we mentioned, it's been 300 episodes since we first had you on, talking about HPX.
Do you want to start off by kind of telling our listeners all about what HPX is again and
what has changed?
Well, I'll try to be brief.
Sure, I'm not sure there's a lot.
Well, HPX itself, we call it the standard library for concurrency and parallelism.
In the end, at its core,
it's a very efficient threading library.
We have put a lot of effort into making it really, really efficient and reduce
overheads as much as possible so that thread creation, scheduling, termination can be done
in less than, I don't know, almost half a microsecond, which is very handy because
you stop thinking about creating threads as overhead.
You can manage very fine-grained parallelism, much more fine-grained than in conventional threading systems.
And on top of that core, we have built several interesting features which essentially make HPX distinct from other, what's nowadays called, asynchronous many-tasking runtime systems. First of all, and that is what I would like to highlight,
is the C++ conforming interface conforming to the C++ standard. So what we have tried to do there is to say,
okay, if somebody has a standards-conforming program
that uses standard facilities like std::thread or std::future or similar things, and flips the namespace from std to hpx,
it will work correctly.
So the semantics of all the APIs are exactly following the standard, but you get the benefit that things run usually faster than what you had before.
So: a standards-conforming interface to everything that the recent standards have specified, like barriers, latches, threads, futures, you name it, everything related to parallelism and concurrency.
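As a rough sketch of what that namespace flip looks like (the hpx_main.hpp convenience header and the exact header names are assumptions based on recent HPX releases):

    #include <hpx/hpx_main.hpp>   // lets plain main() run on the HPX runtime
    #include <hpx/future.hpp>

    int compute() { return 42; }

    int main()
    {
        // Same shape as std::async / std::future, just a different namespace.
        hpx::future<int> f = hpx::async(compute);
        return f.get() == 42 ? 0 : 1;
    }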
The second thing we have implemented on top of that efficient tasking system is a full implementation of the standard algorithms,
the parallel algorithms that are in C++17, and we are currently extending that to be conforming to the range-based algorithms in C++20,
which opens up very nice additional things. We can talk about that later if that's interesting.
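A sketch of the parallel-algorithm side, on the assumption that hpx::for_each and hpx::execution::par mirror the C++17 std::for_each / std::execution::par pair (header names assumed):

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<double> v(10'000'000, 1.0);

        // Same call shape as std::for_each(std::execution::par, ...).
        hpx::for_each(hpx::execution::par, v.begin(), v.end(),
                      [](double& x) { x = x * 2.0 + 1.0; });
        return 0;
    }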
And the third thing that we have implemented on top of the tasking is we have extended or we have tried to extend most of those facilities to the distributed
case so that you have a runtime system or you can write code that is completely agnostic to
whether things are happening on your local machine or when you run it on a cluster,
for instance, on a high-performance computing system, or whether things happen on the neighboring
node. And all the networking, all the remote procedure invocation and everything is hidden behind
the scenes so that you essentially do a simple async with a function and you get a future
back and that function can run on a different node completely transparently.
So in that mode, when you run an application on a high-performance system, you usually have the problem that you have several memory spaces, right? Each
node has its own physical memory space. So you have that distinction, which reflects in these
programming models, which are used nowadays, that you have to specifically work differently when you
access local data and when you access remote data, because it's different memory spaces, right? You can't directly address.
And there we have added a global address space
so that you can treat the whole application
essentially running as if it was running on one giant machine,
even if it runs on a thousand nodes under the hood.
And that is nice because you can move things around between nodes,
which is a big problem in high-performance computing, because you want to achieve load balancing and you want to move things around in a way that each node has the same amount of work. And doing that with conventional methods is very difficult. So those are kind of the three things we have in HPX, and I think that is what distinguishes HPX from other similar implementations.
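A rough sketch of the distributed side described here, following HPX's documented plain-action pattern; treat the header names and details as assumptions rather than gospel:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/actions.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/runtime.hpp>

    int square(int x) { return x * x; }
    // Register the function so it can be invoked remotely.
    HPX_PLAIN_ACTION(square, square_action)

    int main()
    {
        // Pick a locality (node). The call looks the same whether it is local
        // or remote; serialization and networking happen behind the scenes.
        hpx::id_type where = hpx::find_here();
        hpx::future<int> f = hpx::async(square_action{}, where, 7);
        return f.get() == 49 ? 0 : 1;
    }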
Is that enough for now about what HPX is?
Mikael, you want to add something?
I can expand on some of the things we sort of touched on there.
I mean, I could maybe clarify, when Hartmut says that we, you know, essentially run a thread pool underneath the hood: we have these lightweight threads, and specifically for people maybe coming from other languages or other frameworks, they're stackful threads in the end. So, you know, just like std threads,
you can yield them and suspend them and so on. And this is why we can actually have things
like mutexes inside these threads.
On the other hand, it also means that while they're
a lot cheaper than standard threads,
they're still not for free.
So there is some overhead there as well,
which is something you still need to keep in mind.
HPX just shifts the minimum grain size.
So we usually talk about the grain size
when we're talking about sort of the task size
or the size of the amount of work
that you want to actually submit as a task.
That's usually the grain size.
And with standard threads,
you need to have quite a lot of work for it to make sense
to actually spawn a new thread with HPX
that shifts much further down. So you can have much smaller tasks, but there's still a limit.
The other thing is, okay, our futures, again, I haven't sort of implied it,
but standard futures just are on their own.
You can call std::async, you get a future, but you can't really do anything with it
except get the value at some point.
With our futures, you can chain them.
So the dot-then. There are other futures libraries that allow you to do that as well. But I think that's one of the key features, actually, that you can chain these tasks and build up complicated graphs of tasks using these primitives.
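A small sketch of that chaining, assuming hpx::future's .then passes the predecessor future into the continuation:

    #include <hpx/hpx_main.hpp>
    #include <hpx/future.hpp>

    int produce() { return 21; }

    int main()
    {
        // Chain a continuation instead of blocking on get() in between;
        // task graphs are built up from exactly this building block.
        hpx::future<int> result =
            hpx::async(produce)
                .then([](hpx::future<int> f) { return f.get() * 2; });
        return result.get() == 42 ? 0 : 1;
    }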
And then I suppose the third thing I wanted to add is that, for the parallel algorithms, again, the ones in the standard library take various execution policies, but they're all blocking calls. One of the things that was already there when I joined or started working on HPX is that we have a sort of task policy for the parallel algorithms, which allows you to run things either sequentially or in parallel, but then on top of that you get a future to the result. That means that you can actually not just have your individual tasks that run on one thread, but you can have a full parallel region as part of your task graph, which is pretty cool.
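A sketch of the task policy being described, assuming the hpx::execution::par(hpx::execution::task) spelling from recent HPX releases:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<double> v(1'000'000, 1.0);

        // par(task) makes the algorithm return a future instead of blocking,
        // so the whole parallel region becomes one node in a task graph.
        auto f = hpx::for_each(hpx::execution::par(hpx::execution::task),
                               v.begin(), v.end(),
                               [](double& x) { x += 1.0; });

        // ... do other, possibly unrelated, work here while it runs ...

        f.get();   // join only when the result is actually needed
        return 0;
    }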
Well, the main benefit of that, if I may add, relates to one of the questions you wanted to ask today: am I happy with parallel algorithms and stuff, right? Well, yeah, in some sense, I'm very happy about them. On the other hand, they are very limiting, because they still give you fork-join parallelism, right? You fork a couple of threads, and then you join them in the end.
And what many people don't see when they use that in OpenMP or even the parallel algorithms
is that that join operation is what costs you dearly, especially when you have a large
number of cores you want to, or threads you want to join at that point. So that join operation can add significant overhead
to your overall execution.
By launching a parallel algorithm on the side asynchronously,
you still have that join operation in the end,
but part of the cores can free up early, right?
And part of the cores have to do the join operation in the end.
By moving them on a side and executing it asynchronously,
you can do other work when the cores free up.
So you can kind of mitigate that join operation very nicely in the end
by doing other possibly unrelated work on your main execution flow,
if you want.
So that gives you performance benefits,
even if you would think, hey, I'm adding more overhead
because I'm getting a future back now.
No, you're saving on the join operation,
which is one of the main culprits, actually,
of the existing execution models we have.
People just slap an OpenMP pragma in front of their loop
and say, yeah, great. I parallelized that thing.
No, bummer.
Will not work.
Yeah.
Okay.
Yeah, sorry.
So I just wanted to say, I guess, specifically, it's Amdahl's law that kicks in, where if you have your fork-join, it's the serial regions that kill you in the end.
You try to run it on a 128-core dual EPYC system or something like this,
and it just won't scale the way you want if you're doing fork-join.
Yeah, that brings up two questions I had. One is, when Hartmut said, if you have a lot of cores, I'm like, what is a lot of cores to them?
What is a lot of cores to you?
The biggest run we've ever done with HPX was using 650,000 cores.
Okay, that's definitely a lot.
On, I don't know, 5,000 nodes or something like that. Each node has up to, I don't know, 256 physical cores on the high-performance computing systems. So there, you know, if you run it on your laptop with four cores, then the join operation probably will not be noticeable or not that significant. But if you run it on a large system, then these join operations, really, Amdahl hits you over the head here, right? You add sequential execution and you force your execution into some sequential piece.
And that is exactly what kills your scalability. So then also I was thinking about, we were talking
about how you can, some of the cores will get freed up earlier in a parallel operation. So
I want to make sure I understand. You're suggesting if you know that you have like three algorithms
that could run in parallel, then with your version of these algorithms, you could say, OK, future, future, future for those three.
And then you ultimately get as close as you can to complete utilization of your cores: as soon as some cores start to free up on the first algorithm, the second algorithm will just automatically start using them.
Is that correct?
OK.
Yes. Yes. If you have three algorithms in a row, then when you do it
synchronously, you get these three join operations. And that is a big pitfall. And that's, by the way, where ranges come in. And that's what we have added there; as I said,
we are currently implementing the C++20 range-based algorithms.
And what we have done, we have added parallel versions for those algorithms as well,
even if the standard doesn't specify them.
So essentially, you can pass an execution policy to the range-based algorithms as well
and parallelize the range-based ones.
And that is very interesting and non-obvious
because what libraries like the standard range library or range v3 do for you
is silent loop fusion.
If you have three algorithms and you join them with a pipe
in the range-based world, you pipe them, right?
That essentially means,
and the main reason you want to do that, or the goal of allowing that is to avoid the temporary arrays in between the algorithms. And that means that these three algorithms are executed not one
by one, but element by element. So the first element will be kind of dragged through
the three algorithms, and the second element will be dragged through three algorithms, and so on,
because it's the only way for you to avoid the temporaries in between. And that means you fuse
the three loops into one bigger loop, where one loop executes the three operations for each
element consecutively. And that means if you parallelize that one, you suddenly gain a lot of potential because you parallelize the fused loop
instead of parallelizing these three separate loops. And you don't have to do anything. You
just use the range-based parallel algorithms and pass a sequence of piped algorithms,
piped range-based algorithms to it, and you get loop fusion,
which is another big gain. That's what you normally would do by hand, right?
If you have three algorithms in a row, you would try to fuse the three loops yourself by hand.
By combining parallel range-based algorithms with libraries like RangeV3, you get that for free.
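A sketch of what that can look like, assuming hpx::ranges::for_each accepts an execution policy plus a standard views pipeline; the per-element work function is made up:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <ranges>
    #include <vector>

    void process(double) { /* imagine real per-element work here */ }

    int main()
    {
        std::vector<double> data(1'000'000, 1.0);

        // The piped views form one lazy, fused range -- no temporary arrays
        // in between the steps.
        auto fused = data
                   | std::views::transform([](double x) { return x * 2.0; })
                   | std::views::transform([](double x) { return x + 1.0; });

        // One parallel loop over the fused range instead of separate passes.
        hpx::ranges::for_each(hpx::execution::par, fused, process);
        return 0;
    }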
So for your parallel range-based algorithms, do they actually have different names like par underscore for each or something?
What does this look like?
No, they are just in a different namespace.
Oh, different namespace, okay.
They are in the ranges namespace and take the normal execution policy, the par execution policy.
And the interesting thing is that they don't do much.
They essentially just rip the range apart
and to begin and end and pass that
to the underlying implementation.
That's the whole trick there.
But it opens up these loop fusion,
automatic loop fusion effects, which are very beneficial.
So do you use like a expression template kind of technique or something
to fuse the loops? How does that actually...
I'm curious how it's actually implemented.
Range v3 is doing that.
We don't do anything. Oh, it doesn't...
Okay, then I didn't realize that.
And the standard library has these piping
operators as well, I believe. Yes.
Do you know that? So it will do
the same thing. It has to do it
to avoid the temporaries in between the algorithms.
Otherwise, it wouldn't work, right?
Okay, okay.
The only way to implement that without
creating temporary arrays
or temporary containers in between
the steps is to do it element by element
ways. There's no other way of doing that.
So they have to fuse the loops.
Okay, I missed that detail.
That's pretty cool.
You brought up one other question in my mind as you were describing all of this.
And actually, coincidentally, I wasn't even thinking about this.
I just released an episode on C++ Weekly this morning that is about the C++17 parallel algorithms.
But it's a super high-level episode.
It just shows, look, if you have the right workflow, the right data, you can magically get a faster algorithm. But one of the sore points for me is that libc++ still doesn't have parallel algorithm support. Can I take HPX and just drop
it into my code? And now it'll work on Clang, on macOS, and boom, I magically have parallel algorithms available to me.
I mean, is it portable across all the platforms?
Wouldn't that be cool?
But I think Mikael has something.
Pretty much.
Yeah, so that's the goal.
So Clang and GCC are well-supported.
Hartmut uses Windows, so we support Windows as well,
or Visual Studio.
macOS is a bit difficult.
We're not too many people working on this,
so we sometimes have breakages there.
And really, you need to have someone
who is actually dedicated to fixing these things
to be able to support it.
So we try to do it as well as we can.
We often have patches or PRs coming in
from people who fix compilation on their platform,
sometimes on BSDs, things like this.
We can't, unfortunately, support it fully ourselves,
but wherever we can, we try to be as portable as possible.
Okay. Well, I think that was part of the question, which I would like to add to: can I just use the HPX parallel algorithms and drop them into the existing code I have?
And that is a bit more problematic at least at the moment because currently the HPX algorithms use the HPX threading system underneath, which is not fully compatible with other threading you might do yourself.
Ah, all in or not then?
Yeah, at the moment, yes. But, and that's where I was hoping Mikael would jump in, because what C++23 is aiming at,
and you might have heard about that,
is the executor discussion,
which is nowadays not an executor discussion anymore.
They call it schedulers nowadays, but whatever.
But the idea here is to create infrastructure that allows it on the standard
library level to combine different threading implementations into one. And this is based on
senders and receivers. So, a very low-level abstraction mechanism for building asynchronous execution chains by having senders, which give you some value, and receivers, which receive those values, and that is being standardized. And the effect here is that you will be able to
move execution between execution environments like GPUs or threading systems like HPX or
standard threading. And once all of this is in place,
and that's our goal to get that implemented,
we will be able to use the HPX standard algorithm implementations directly in the context of your normal C++ program.
Because then these algorithms will be agnostic
to the execution environment they are running on.
And we hope to achieve that at some point.
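A very rough sketch of that sender/receiver style in the P2300 shape; every name here (thread_pool_scheduler, schedule, then, sync_wait under hpx::execution::experimental) is an assumption that may differ between HPX versions:

    #include <hpx/hpx_main.hpp>
    #include <hpx/execution.hpp>
    #include <utility>

    namespace ex = hpx::execution::experimental;

    int main()
    {
        // A sender describes work to run on some scheduler; nothing runs yet.
        ex::thread_pool_scheduler sched{};
        auto work = ex::schedule(sched)
                  | ex::then([] { return 40; })
                  | ex::then([](int x) { return x + 2; });

        // Block until the chain has completed.
        ex::sync_wait(std::move(work));
        return 0;
    }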
Yes.
Okay.
So let's just say, well, I have two questions now. If, in my code, I am currently only using the C++17 standard parallel algorithms, and I say, okay, I'm going to switch fully to the HPX model based on your lighter-weight threads, would I expect to see, without changing anything else, a performance difference?
Yes.
Okay.
That's a very straightforward answer.
I would expect that, yes.
Okay.
And then the other question is... no, I don't think I have another question. But the point is, I have to go all in on your threading model at the moment if I want to do that.
At the moment, yes.
This was already mentioned with the benefit that ideally you just need to change std to HPX.
So the transition would be fairly straightforward.
Well, I'm just thinking, like, on another project that needs to support all three major operating systems: if macOS isn't very well supported, but we can get the performance boost that we want on Linux and Windows, then it sounds like I could just alias namespace hpx into a local namespace somewhere and swap it at compile time. Assuming I stick to the standard things and don't use your more advanced features, then that sounds like a perfectly reasonable option for some people.
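A sketch of that compile-time swap, assuming the code sticks to the std-conforming subset; USE_HPX is a made-up build flag:

    #if defined(USE_HPX)
        #include <hpx/hpx_main.hpp>
        #include <hpx/future.hpp>
        namespace par = hpx;      // hpx::async, hpx::future, ...
    #else
        #include <future>
        namespace par = std;      // std::async, std::future, ...
    #endif

    int work() { return 7; }

    int main()
    {
        par::future<int> f = par::async(work);
        return f.get() == 7 ? 0 : 1;
    }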
By the way, if I may have a plug in here.
So if somebody of the audience is interested in helping us with supporting macOS, please
get in contact.
We are always a very open community and we would like to have more people helping with
developing.
I wanted to interrupt the discussion for just a moment
to bring you a word from our sponsor, C++ Builder,
the IDE of choice to build Windows applications
five times faster while writing less code.
It supports you through the full development lifecycle
to deliver a single source code base
that you simply recompile and redeploy.
Featuring an enhanced Clang-based compiler,
Dyncomware STL, and packages like Boost and STL2
in C++ Builder's Package Manager, and many more.
Integrate with continuous build configurations quickly
with MSBuild, CMake, and Ninja Support,
either as a lone developer or as part of a team.
Connect natively to almost 20 databases
like MariaDB, Oracle, SQL Server, Postgres, and more
with FireDAC's high-speed direct access.
The key value is C++ Builder's frameworks
powerful libraries that do more than
other C++ tools. This includes
the award winning VCL framework for high performance
native Windows apps and the powerful
FireMonkey framework for cross platform
UIs. Smart developers and
agile software teams write better code faster
using modern OOP practices
and C++ Builder's robust
frameworks and feature-rich IDE.
Test drive the latest version at Embarcadero.com.
So I think we've mentioned a couple of the features that were released in the newest version of HPX,
which is 1.7.0, I believe, right?
So I think the range is supported.
What else is new in this new version?
I suppose one of the things, and I mean, I hope you can tell more about this, but GCC, or libstdc++, has support for the SIMD types now, experimental, but still. And something we've been doing for a while is Google Summer of Code. We currently have a student working on essentially implementing policies based on GCC's SIMD implementation, which is quite cool. And as far
as I've understood, the performance seems to be quite good and it's been quite a smooth transition.
We used to have, or we still have support for VC, which is another vectorization library.
But eventually that will just get replaced by the standard support for SIMD.
So that's one big thing.
I don't know if you want to add anything about that.
Well, I'm going to go ahead and interject right here that you should probably give our listeners an overview of what SIMD means and what that actually looks like to use it.
Well, SIMD stands for single instruction, multiple data, and is an execution model which is used when you vectorize or when
the compiler is vectorizing code. Essentially, let's assume you have a for loop over floats,
and you calculate something for each of the floats
in your input array.
Then compilers might be able to vectorize that
by instead of doing one operation at a time,
just put it into the vector registers
and do four or eight operations in one cycle.
So, these SIMD types that are now in the std::experimental namespace in GCC and its libstdc++, right?
libstdc++, I think that's the GNU one.
Yeah, in the GNU library. They essentially introduce types that encapsulate vector registers in the CPU, and overload all the operators on those types, so that you can use those types as if it was a single float. So I can say very explicitly, I am using eight floats or whatever in this register.
Now do these things on them.
Okay.
And that maps onto compiler intrinsics.
So it generates fairly efficient assembly code underneath,
which is optimally close to what you would do by hand when you add vectorization yourself.
So it's a very convenient way, and a C++ way, to do vectorization, explicit vectorization, and not rely on the compiler, but do it yourself and essentially have full control over what's happening.
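A small sketch of those SIMD types, using the std::experimental::simd implementation that ships with GCC's libstdc++:

    #include <experimental/simd>
    #include <cstddef>

    namespace stdx = std::experimental;

    // Multiply-add over a float array, one full vector register per step.
    void scale(float* data, std::size_t n)
    {
        using pack = stdx::native_simd<float>;   // e.g. 8 floats with AVX
        std::size_t i = 0;
        for (; i + pack::size() <= n; i += pack::size())
        {
            pack v;
            v.copy_from(data + i, stdx::element_aligned);
            v = v * 2.0f + 1.0f;                 // operators overloaded on the pack
            v.copy_to(data + i, stdx::element_aligned);
        }
        for (; i < n; ++i)                       // scalar tail
            data[i] = data[i] * 2.0f + 1.0f;
    }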
The relation to the algorithms, which Mikael mentioned, is that we have added special execution policies, like par or seq, which are called simd and simd_par. Instead of calling the lambda which represents your loop function, your iteration, with one element, they call it with a SIMD type that holds eight or four elements. So your lambda now can operate on vector types automatically.
So essentially you write the code as if it was normal, non-vectorized code, but the loop
function will be called with a vector type.
So you have a Lambda with auto as a parameter type,
and the compiler will deduce that correctly.
And since all operations are overloaded for the SIMD data types,
it's fairly straightforward to make it work properly.
Okay.
So by just changing the execution policy,
you can, for arithmetic-based kernels,
you can easily achieve vectorization and full control over it,
even in the standard parallel algorithm world.
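A sketch of those policies in use; the simd and simd_par spellings follow the names used in the conversation, and the exact namespaces are assumptions that may vary between HPX versions:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<float> data(1'000'000, 1.0f);

        // The same generic lambda works whether it is called with a single
        // float or with a SIMD pack, because the operators are overloaded.
        auto kernel = [](auto& x) { x = x * 3.0f + 1.0f; };

        hpx::for_each(hpx::execution::simd, data.begin(), data.end(), kernel);
        hpx::for_each(hpx::execution::simd_par, data.begin(), data.end(), kernel);
        return 0;
    }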
So it's a new execution policy, parallelization policy that you added.
Yeah. Well, vectorization.
simd is for sequential-only vectorization, and simd_par is doing both, vectorization and parallelization.
And the nice thing here is for compute bound kernels, we see speed up a factor of 250 on eight cores.
No, 16 cores. 16 cores. I don't want to lie.
On eight wide vector registers, which is just amazing, right?
If you compare that to sequential execution,
just getting a speedup of 250 by changing the execution policy is quite a feat.
So you're saying going from serial to parallel with vectorization is 250 times faster.
Okay.
On compute bound arithmetic kernels.
And so this does kind of,
you must be operating over floating-point types.
Is that correct?
Well, anything the vector units support.
Okay.
So it could be integers, could be floating points.
So some integral data type supported by your CPU.
So now the system then basically falls apart.
If it was even just a struct of two
ints, then there's no way for you
to automatically decompose that and
do a thing on it.
There are tricks you can do that
by using zip iterators and
adding support for
zip iterators that you take it apart and
repackage it as a zip off
of two vector registers, things like that.
Right.
But we haven't done that explicitly at this point.
Yeah, I know, I'm just, now I'm just super curious.
So I'm just kind of sussing out where the corner's here
on this, because that, I think I can visualize
what you're doing and it sounds really cool, so.
That work has been, is driven by Matthias Kretz.
He is very active in the standardization community
and he has implemented the SIMD types for GCC, I believe.
So he's doing all the work there and specifying things.
And the SIMD power and the SIMD execution policy
are his idea as well.
So we just took, again, what the standard is trying to do
and where we are trying to go as a standard
and create some experimental
implementations to see how much effect we get. So if I do the SIMD thing and get this awesome speedup, but then I go and recompile on Windows, it'll still compile, it just won't be able to do the SIMD thing, because you don't have the intrinsics, or rather, you don't have the SIMD types that GCC offers, on Visual Studio.
Correct.
But there you can use Vc,
as I said, the other library.
That is not really supported anymore, but at least for the time being, you can use other libraries there.
Okay.
I do very similar things.
Yeah.
And I guess...
You just mentioned standardization again.
I was just curious how involved are HPX team members with the future of the standards? You know, obviously,
we talked a little bit about executors, which is no longer executors. So how involved is HPX in
the direction of this? So there's, I guess, a long history. Hartmut has been part of the
executors discussions years ago. I guess at some point, you know, it just drags on and on.
You know, people come and go.
I've recently just, you know, started joining some meetings, just to listen in. And, you know,
eventually, we hope we can actually contribute something in the form of feedback, because now
we actually have an implementation of at least parts of the new proposals. So that was sort of the third big topic in our last release
is support for senders and receivers.
And I guess, yeah, it's an interesting topic because, okay,
we have standard futures,
which are very much tied to standard threads,
quite heavyweight.
We have our own implementation of HPX futures.
We will keep those around.
But these always involve heap allocation for shared states and things like this.
And senders and receivers are really sort of a generalization of futures and promises,
in a sense, where you can, in the best case, avoid lots of allocations.
There are very generic framework
for chaining various asynchronous operations,
not necessarily even asynchronous.
You can also write synchronous code,
but if that's all you wanna do,
then you might just write it sequentially.
But it's quite a neat framework,
and I actually hope that it's going to finally take off and maybe even make it into C++23, because from our experience it looks quite nice and has worked quite well. There are certain things when you start dealing with network communication: for example, MPI is one of the big things we deal with in the HPC world. And then, of course, accelerators with CUDA, AMD's HIP, and there's Intel's SYCL, for example. These are all quite important, because in all these cases you're dealing with a GPU, and trying to combine these is one thing, which we have been able to do with futures, but there are certain overheads when you're dealing with just plain futures. There are certain optimizations that you can't, at least cleanly, do with just plain futures, and senders and receivers allow you to do that sort of officially, according to the sender concepts, the receiver concepts, and so on.
So I didn't go to all the meetings,
but I attended quite a couple of meetings.
And at least we haven't been really
able to directly influence things in the standardization
committee because we're kind of on the fringes, right?
University.
You're dealing with big companies there when you're doing standardization. But I think we have at least reported back on the results we had, and we have influenced things or confirmed things and helped move things forward, even if no features we have used have made it directly into the standard yet. But the fact that
we have confirmed, for instance, the concurrency TS1, which was specifying the futures exactly in
the same way as we have them now in HPX. So the dot-then and these methods of combining futures were all specified there.
So we have taken all the things from there.
And our experience has at least helped deciding whether to go ahead with that or not.
And in the end, we decided not to go ahead with it because of the things Mikael mentioned, right?
There's quite some overhead associated with it, and we can't do better than that.
And that's one of the reasons why the concurrency
TS1, the technical specification
one, didn't go into the standard
in 2017,
I believe.
It was dropped on the floor because
of that.
So we had some impact,
even if not directly visible.
I think that's the best we can do at this point.
I think I just wanted to add that I think that's one of the key things that, you know,
implementation experience, I think that's quite important for standardizing things.
There are a few other implementations currently of senders and receivers.
But I think with HPX, we have actually quite a large and diverse code base where we can test these ideas out, especially, like I said, with accelerators and in actual applications.
Not saying that other people don't have applications, but we have at least one interesting use case in HPX applications. The other thing I wanted to mention is that even if we don't directly influence the standards,
I hope at least that we're making people aware of the ideas that are being proposed and hopefully
actually going into the standard library, like you're on the show right now.
But also, for example, at CSCS, they've been quite involved. And, you know, users and
scientists writing scientific applications, you know, get directly exposed to these ideas. And,
you know, slowly, they will hopefully trickle into their applications as well. If there were
fewer people doing that, maybe, you know, 2030, we might see people actually adopting these things.
If we tried to push, it might happen a bit earlier.
These conversations just make me feel like I'm getting old.
I'm like, in 2030, am I going to care still?
That's like nine years from now.
I'm going to be thinking about retirement.
If you don't mind if I do just a little bit of historical digging, because Hartmut, if I recall correctly, you were involved.
Well, you did mention explicitly that you were involved in Boost, have been involved in Boost for a long time.
But you were involved in like Boost Futures specification or implementation a long time ago, were you?
Well, I proposed the library in 2004 implementing something that was very similar to Standard Futures.
Yes.
At that point, nobody picked up on it because they said, hey, why do we need that crap?
So it didn't go anywhere.
But that was the first time I kind of dug into Futures.
And I still think it's a very neat feature which allows to do a lot of things
in a very convenient way. Mostly because it gets away from the notion of threads,
because I believe people shouldn't use threads directly. No raw threads, just to add one more
paradigm to Sean Parent's list of things, right? No raw new, no raw loops,
no raw pointers, no raw threads.
Right, right. Also,
no raw synchronization primitives. That
goes along with raw threads.
Well, you still need some sometimes, right?
Even with futures, you have
to do things. But if you
work with value semantics and
try to make your tasks self-sufficient with
no global side effects, then you don't need synchronization at all. So that works very well.
Yeah. Well, for the last 10 years or so, if I have to write parallel code, I aim for futures,
no explicit synchronization, and then, since C++11, maybe pass around an atomic occasionally
if I have to. It's like a stop token. But now we have stop tokens, so I don't have to do that
anymore. Yeah. Mikael, I was wondering, in your bio, you mentioned that in addition to working on
improving HPX itself, you help users integrate into their libraries and applications. I'm just
curious what that looks like. Obviously, you know, we've talked about how you can just change the
namespace from std to HPX and get a lot of performance improvements. What else are you
doing with those users? Yeah, so one of the main drivers of us developing HPX at CSCS is
distributed linear algebra.
This is a sort of big topic for many
of the scientific applications where, you know,
linear algebra comes up surprisingly
in scientific applications
and they need to do it efficiently.
And many of the existing libraries, they, you know,
they do fork-join.
They do a lot of sort of explicit synchronization, sort of global barriers using MPI, which is if you have a distributed application, you synchronize between each step.
And just like on a node, you end up with sequential portions of the code that doesn't scale that well.
So one of the developments we have is a library called DLA Future,
which is the future of distributed linear algebra,
but also based on HPX Futures.
And this is, you know, a really, a good test case
and a good, I think, application also for HPX.
It plays to the strengths of HPX so what what you're essentially doing is
you want to do linear algebra on your matrices and you split these matrices up into blocks
and they live on different compute nodes and then you need to do some operations on these
and by dividing your work up like this, you end up with discrete tasks.
You want to do a matrix multiplication on one node.
Then you wait for the result from some other node.
And it kind of plays very naturally into this task-based programming flow.
And then, like we already mentioned with parallel algorithms,
if you have multiple parallel algorithms that are independent,
again, this sort of naturally just load balances.
You get another algorithm filling in the gaps on one worker thread once it runs out of work.
So what we're developing at CSCS is not my project.
It was led by a guy called Raffaele Solcà.
And essentially, we're helping them build this library.
At the moment, it's a sort of generalized eigenvalue solver.
Specifically, I know little about the actual science behind it.
The goal is essentially to beat the existing state of the art implementations,
which I think we're actually doing at the moment already.
There's still some development work to be done there, but at the moment it's looking actually very nice.
And then the next steps would be actually integrating this into applications that do the real science.
This is just the library that does the linear algebra and it sort of goes stepwise closer and closer to users.
You know, I'm thinking, Rob, through the course of this conversation here that our guest a
few weeks ago, Ondřej Čertík, working on his Fortran implementation, should use HPX, because the nature of Fortran means you can prove some of these things right out of the gate. You can say, oh, well, I could apply a parallel algorithm here, it's only floats, use HPX, and just automagically get super fast C++ output from his Fortran-to-C++ compiler.
You all should collaborate on this somehow.
In the end, that's an application HPX was made for.
It's a runtime system.
It's to be used by others to speed up their library,
their application, whatever they want to do. So it's never meant to be used as a thing on its own.
It's always really just the environment that sits on top of the operating system and gives you more
efficient facilities, very much along the lines of what the operating system gives you but focused on a
single application. The operating system is kind of the mother of everybody and tries to make it good for all applications that run on the same node at the same time. A runtime system is very egoistic: it just cares about the application it's related to and linked to. And it doesn't care what the others do. It just grabs all the resources it can get
and utilizes those.
And so if you can integrate that
as something in a compiler,
like you mentioned,
or into a library,
you don't even have to expose that to the user.
Yeah.
You use it really just as a facility
that allows you to speed up the code you're
generating. So that's one of the perfect applications for this. Yeah, I would love to see that happening.
Sounds pretty cool. So maybe to close this all out, is there anything you can tell us, Hartmut or Mikael, about the upcoming features of HPX that you want to preview?
No. So we still definitely have some work to do with senders and receivers. That's at least a high priority on our side at CSCS. Instead of integrating that into applications, getting more
field experience with how that works and seeing what kind of optimizations we can do with those.
That's, you know, number one for us, I guess.
Yeah, Mikael and his team are mostly worried about using HPX for local parallelism
and leave the distributed part to things like MPI or other systems.
We here at LSU are very interested
in the distributed part of HPX as well.
And we use it for several applications,
very large scale applications.
And I really hope that we can move forward on the,
and perhaps extend the sender receiver concept
into the distributed world, which sounds very promising.
So senders receivers is really a big topic I see for the future.
But standards conformance in the general case is very important to us
because that lessens the learning curve for people, right?
When they come with their own C++ code and they know C++,
it's not a problem for them to switch to HPX because in the end,
all they have to do in the first place is change the namespace and be done with it.
And from our experience as well, when people come on to the project, they are not struggling with HPX itself.
They mostly struggle with C++ and with modern C++, which we really try to use for implementing things.
So using the HPX is very simple and straightforward if you know C++ well.
If you believe that you know C++ and you know how it is, right? Everybody thinks they know
C++ on a scale from one to ten at about 5, right? No matter who you ask. Whether it's
a newbie who has attended one lecture in university, they believe, oh, 5, yeah, I know what that
is about. Or if you ask Bjarne, right? Bjarne says, I'm at 7 or something like this.
Yeah, and you get really nervous if someone says 9 or 10 because you're like... I know only a few people I would believe what they said.
Okay.
So just to round that up, if you're interested in HPX and want to try it, feel free to get in contact.
We are more than happy to support you.
The team around HPX is a very large team. We are kind of representing people contributing
all over the world, from South America,
North America, Europe.
I don't think we have anybody in Africa.
I think we had somebody in
Egypt at some point.
Russia, China, Japan.
So please join the team.
It's a very nice,
large, open-source team with a very
open atmosphere.
One of the few channels, IRC channels, that are not hostile.
So everybody's welcome.
I guess on that note, where should our listeners go if they do want to either join the team
or just maybe looking for help in getting started, getting your own stuff to run using HPX?
So I guess, well, we're on IRC and Matrix.
On Libera.Chat, there's the Stellar channel.
But the easiest way to find that is through hpx.stellargroup.org.
That'll be in the show notes.
There are links to different ways you can contact us.
We also have a mailing list, both
for users and developers.
And you can follow us
on GitHub for
releases and just checking what's
going on with pull requests and so on.
It's all public and open.
We try to hide as little as possible.
Warts and all.
Yeah, I think the website is the best entry point to actually find us.
Perhaps one thing to add: the license HPX has been published under is the Boost Software License.
So it's very liberal.
You don't even have to tell us when you use it.
So very open, no strings attached.
Oh, is it supported by any of the
C++ package managers?
Can I use vcpkg or Conan or something?
Yes. Well, not Conan. We have vcpkg. We have Spack. Spack is more HPC related.
We would love to have somebody
working on a Conan package.
People are asking for it,
but we didn't have the bandwidth
to look into that.
Well, Mikael and Hartmut,
thank you so much for coming on the show today.
Thanks for having us.
Thanks for having us.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.