CppCast - HPX and DLA Future
Episode Date: July 22, 2021
Rob and Jason are joined by Hartmut Kaiser and Mikael Simberg. They first discuss some blog posts on returning multiple values from a function and C++ ranges. Then they talk to Hartmut Kaiser and Mikael Simberg about the latest version of HPX, how easy it is to gain performance improvements with HPX, and DLA-Future, the distributed linear algebra library built using HPX.
News: An Unexpected Article About Our Unicorn: Who Is the PVS-Studio Mascot?; How to Return Several Values from a Function in C++; C++20 ranges benefits: avoid dangling pointers
Links: HPX; HPX on GitHub; HPX 1.7.0 released; DLA Future on GitHub
Sponsors: C++ Builder
Transcript
Episode 309 of CppCast with guests Mikael Simberg and Hartmut Kaiser, recorded July 21st, 2021.
This episode is sponsored by C++ Builder, a full-featured C++ IDE for building Windows apps five times faster than with other IDEs.
That's because of the rich visual frameworks and expansive libraries.
Prototyping, developing, and shipping are easy with C++ Builder.
Start for free at Embarcadero.com.
In this episode, we discuss returning multiple values from a function.
Then we talk to Hartmut Kaiser and Mikael Simberg.
Hartmut and Mikael tell us about the latest version of HPX and the DLA-Future library. Welcome to episode 309 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing just fine. You got anything going on?
You make any CppCon submissions? This is the last day.
This is the last day.
The last day is in the past, but the day we're recording is the last day.
Yes, the day this airs will be different.
I did make one submission.
Okay.
Hopefully it'll get accepted.
That's all I know at the moment, I guess.
I did also submit to teach a class at CppCon.
Oh,
very cool.
So we'll see what happens there as well.
I'm sure we'll find news.
They're usually pretty quick to do approvals for submissions, right?
Well, I don't know. The deadline is today, and I'm not saying it'll be immediate.
Yeah, no, people have to make travel arrangements and stuff if they're going to come.
Right. Okay. Well, at the top of every episode, I'd like to read a piece of feedback. This week we got a tweet from Antonio Nutella commenting on last week's episode, saying, "This was so interesting to listen to. Thank you all for the episode."
And yeah, it was great talking to Ivica last week about performance.
It's been a while since we kind of just talked about the subject of performance.
We kind of talk about it a lot without talking directly to it.
Right, right.
Well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter, or email us at feedback at cppcast.com. And don't forget to leave us a review on iTunes or subscribe on YouTube. scientist at LSU's Center for Computation and Technology. He's probably best known through his involvement in open source software projects, such as being the author of several C++ libraries.
He's contributed to Boost, which are in use by thousands of developers worldwide.
His current focus is research on leading the Stellar Group at CCT, working on the practical design and implementation of future execution models and programming methods. These things are tried and tested using HPX, a C++ standard library for concurrency and parallelism.
Hartmut's goal is to enable the creation of a new generation of scientific applications in powerful,
though complex environments such as high performance and distributed computing,
spatial information systems, and compiler technologies. Hartmut, welcome back to the show.
That's quite a mouthful, right? Welcome back.
How are you doing, Rob?
Hey, Jason.
Nice to see you again.
Yeah.
Well, we may talk about this a little bit later, but the last time you were on the show
was the first time I was co-hosting the show.
Which was exactly 300 episodes ago, actually.
Wow.
That's great.
And congratulations to you guys for making that happen for the last six years.
That's quite an accomplishment on its own. Pretty crazy.
But also joining us today is Mikael Simberg. Mikael works at the Swiss National Supercomputing Centre (CSCS) as a scientific software developer. He has a degree in operations research and computer science.
He worked in industry doing embedded programming before joining CSCS in 2017.
At CSCS, he works on improving HPX itself
and helps users integrate it into their libraries and applications.
Mikael, welcome to the show.
Thanks for having me.
Nice to be on the show.
It seems like we have four time zones covered today, then.
Yeah, that's true.
Slightly unusual.
Yeah, but we've got it working.
This is still sort of the
manageable time zones.
For me and Hartmut, this is
quite a regular occurrence.
Right.
It works. Awesome.
I have to get early out of bed and he has
to stay longer.
So just to make that happen.
Exactly.
Actually, I'm just kind of curious.
Where is the Swiss National Supercomputing Center?
So the actual supercomputer, the headquarters are in southern Switzerland,
on the border to Italy, in the canton of Ticino in Lugano,
the sort of sunny part of Switzerland.
I myself, I'm in
Zurich,
a bit further north, which is also
sunny at the moment, but the weather
tends to be better down south.
Yeah, most people at
CSCS are down in Lugano, and then
we have a small 30 people
or so up here in Zurich
who are mainly software developers.
I don't think I've spent any time in any of the Italian-speaking Canton region down there.
There's only one.
There's only the one? Okay.
There's only one, yeah.
But it's really beautiful down there.
I'm sure it is.
If you ever get a chance to go down there, and whenever I get a chance to go there for work, it's usually a nice opportunity to just get to, you know, get the sun and get to see some colleagues.
Right.
And you get both the good side of Switzerland and the good side of Italy without having the disadvantages of either side, right?
It's kind of unique.
Exactly. Exactly.
We got a couple news articles to discuss. Feel free to comment on any of these
and we'll start talking more about HPX.
All right. So this first one is an article
from PVS-Studio, and they've been sponsoring the podcast for a while. I really appreciate that. But this article is all about their mascot, and we always see the mascot in the blog posts that are coming from PVS-Studio. It was kind of fun to read this article about its genesis and evolution over the years that they've been blogging and using their unicorn mascot.
Sorry, I still don't get why it's a unicorn. That's kind of still not quite clear. I guess it doesn't matter. If you like it, then that's all that matters.
Yeah, I mean, they mentioned at the very top, they just used it originally as like a free piece of
clip art that was out there and
people just loved it and they kind of made it their own and I guess that evolved it.
A unicorn vomiting rainbow conveyed what they wanted to convey.
Yep. I suppose I have to admit, I don't think I've ever used PVS-Studio. And I was first wondering why you have a post about how the mascot developed and so on.
Why does it belong on the show?
But I don't know if I'm reading too much into it.
But then when I actually read the post, I was like, well, you know, when you're writing software, these small things actually matter and can have an impact on how people see your product or project.
So, yeah, as I said, maybe I'm reading too much into it, but I thought it was fun to think about it like that. They also write a lot of articles about the kinds of bugs that their software finds. And those articles, besides the fact that they've been a sponsor,
those articles tend to end up being articles that we discuss
because often it's interesting.
I think the one that stands out to me the most is, what is it, shoot, the last copy rule, I think, is what they've dubbed it, where if you copy and paste a block of code like four times, you always make a mistake on the last copy. And they showed example after example of this, where you correct the things you're supposed to change the first three times, but on the fourth time, or whatever the last time is, that's where you make your mistake. Yeah, anyhow, interesting stuff.
So PVS-Studio can actually
detect things like that.
Copy-paste
bug detection is one of the
things that they can do.
The first
kind of thought I had when I was
reading that article after you
sent it is, damn, we really need
an HPX logo.
This can be your problem to come up with something.
You just can't use a rainbow-barfing unicorn.
We'll work on that, Hartmut.
We'll work on that.
So the next thing we have is a post on Fluent CPP,
and this is how to return several values from a function in C++.
And when I was reading this, I thought we'd use this as an article on CppCast a few weeks ago.
But I don't think we did.
I definitely was looking at it, though, a few weeks ago.
Yes.
It was dated July 9th.
It's a good post and, you know, talks about different ways to return multiple values from a function.
It definitely recommends against using a pointer or reference as an output parameter and modifying it within the function body.
It's definitely a bad practice.
But it talks about using a struct or using a pair or tuple.
I'm turning it that way.
Is it tuple or is it tuple?
I'd say tuple, but it's probably tuple.
I say tuple as well, but I never know what's right.
Yeah, I very much prefer named structs. And I think tuples, or tuples, whatever you want to call them, are nice for prototyping, but I don't think they really belong in production code in the end. The only place is if you really have just a pack of arguments, parameters which are really just first parameter, second parameter, and so on, and then a tuple is perfect. But if you can give names to them, I always prefer doing that.
Yeah, tuples are very useful in generic programming, especially pre-C++20 when you have to capture a whole template parameter pack in a lambda and things like that. And they're very useful for prototyping, as Mikael said. If you do a bit of Python programming,
then you kind of get used to that, right?
You just return x, y, z,
and then you unpack it on the other end,
which is very convenient.
So in that context, it's certainly very similar.
But for production code,
I'd rather prefer to have named structs,
so I completely agree with Mikael here.
I think the one option that wasn't covered in the article that I'm aware of, which is admittedly questionable practice, would be to use an auto return type on the function and declare a local struct that has the named members that you want. If you say, well, it's only got a use in this one particular case, then you're kind of relying heavily on the person using an IDE, so that they can, you know, dot-member whatever and get the names of the things back out, because discovering that is harder. But it's just another tool.
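To make those options concrete, a small sketch; the function names and fields are made up for illustration:

    #include <string>
    #include <tuple>

    // Option 1: a named struct -- the names document the result at the call site.
    struct ParseResult { bool ok; std::string message; };

    ParseResult parse(const std::string& input)
    {
        return {!input.empty(), "parsed: " + input};
    }

    // Option 2: a tuple, convenient with structured bindings but nameless.
    std::tuple<bool, std::string> parse_tuple(const std::string& input)
    {
        return {!input.empty(), "parsed: " + input};
    }

    // Option 3: the auto-return trick -- a local struct still gives named members,
    // but callers have to discover them through auto and their IDE.
    auto parse_local(const std::string& input)
    {
        struct Result { bool ok; std::string message; };
        return Result{!input.empty(), "parsed: " + input};
    }

    int main()
    {
        auto [ok, message] = parse_tuple("hello");  // structured bindings
        (void) message;
        auto r = parse_local("hello");              // r.ok, r.message
        return (parse("hello").ok && ok && r.ok) ? 0 : 1;
    }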
Yeah. Okay. And then the last post we have here is "C++20 ranges benefits: avoid dangling pointers." Jason, do you want to tell us more about this one?
So this is interesting. It's one of the... sorry, I lost the link myself. This is Andreas Fertig; he's working on a new book, and this is one of the chapters from it.
But this is just about avoiding the problem if you're trying to iterate over some sort of container or range or something that is a temporary. And this is actually a problem with C++ range-based for loops as well, one that is rarely hit by people. Have you all seen that issue where, if you try to do a range-for over a value that's returned from a function that's returned from a function, you end up with a dangling object?
So anyhow, one thing that stood out to me about this article
was, well, so Andreas is going over what standard ranges did.
But one of the things that stood out to me
is that there's usually a mentality
of that your compile error should be as soon as possible,
but the technique taken here
is to make the compile error as late as possible
so you can give the most descriptive error.
I've not seen that technique before, personally.
That's interesting.
Yeah.
Didn't realize that when reading the article.
Yeah, I hadn't thought about that either before, but I like the technique. And in a way, you're
just tagging this type for later use. Okay, this can only be used in particular contexts
because it gets a different type than what it would originally have had.
And I think that applies in many other places as well.
If you can put additional metadata in your types,
you can actually avoid quite a lot of errors that way.
It's always a trade-off.
It's more work for you when you're defining your APIs,
but it can pay off.
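A minimal sketch of that late-error technique with just standard C++20 ranges; the helper function is made up:

    #include <algorithm>
    #include <ranges>
    #include <vector>

    std::vector<int> make_values() { return {1, 2, 3}; }   // returns a temporary

    int main()
    {
        // Calling the algorithm on a temporary (non-borrowed) range compiles,
        // but the "iterator" comes back tagged as std::ranges::dangling.
        auto it = std::ranges::find(make_values(), 2);

        // The error fires only at the point of misuse, with a message that
        // names std::ranges::dangling:
        // int v = *it;   // error: no operator* on std::ranges::dangling
        (void) it;
    }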
Right.
Okay.
Well, Hartmut, as we mentioned, it's been 300 episodes since we first had you on, talking about HPX.
Do you want to start off by kind of telling our listeners all about what HPX is again and
what has changed?
Well, I'll try to be brief.
Sure, I'm not sure there's a lot.
Well, HPX itself, we call it the standard library for concurrency and parallelism.
In the end, at its core,
it's a very efficient threading library.
We have put a lot of effort into making it really, really efficient and reduce
overheads as much as possible so that thread creation, scheduling, termination can be done
in less than, I don't know, almost half a microsecond, which is very handy because
you stop thinking about creating threads as overhead.
You can manage very fine-grained parallelism, much more fine-grained than in conventional threading systems.
And on top of that core, we have built several interesting features which essentially make HPX distinct from other, what's nowadays called, asynchronous many-tasking runtime systems. First of all, and that is what I would like to highlight,
is the C++ conforming interface conforming to the C++ standard. So what we have tried to do there is to say,
okay, if somebody has a standards-conforming program
that uses standard facilities like std::thread or std::future or similar things, and flips the namespace from std to hpx,
it will work correctly.
So the semantics of all the APIs are exactly following the standard, but you get the benefit that things run usually faster than what you had before.
So: a standards-conforming interface to everything that the recent standards have specified, like barriers, latches, threads, futures, you name it, everything related to parallelism and concurrency.
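As a rough sketch of what that namespace flip looks like (the hpx_main.hpp convenience header and the exact header names are assumptions based on recent HPX releases):

    #include <hpx/hpx_main.hpp>   // lets plain main() run on the HPX runtime
    #include <hpx/future.hpp>

    int compute() { return 42; }

    int main()
    {
        // Same shape as std::async / std::future, just a different namespace.
        hpx::future<int> f = hpx::async(compute);
        return f.get() == 42 ? 0 : 1;
    }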
The second thing we have implemented on top of that efficient tasking system is a full implementation of the standard algorithms,
the parallel algorithms that are in C++17, and we are currently extending that to be conforming to the range-based algorithms in C++20,
which opens up very nice additional things. We can talk about that later if that's interesting.
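A sketch of the parallel-algorithm side, on the assumption that hpx::for_each and hpx::execution::par mirror the C++17 std::for_each / std::execution::par pair (header names assumed):

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<double> v(10'000'000, 1.0);

        // Same call shape as std::for_each(std::execution::par, ...).
        hpx::for_each(hpx::execution::par, v.begin(), v.end(),
                      [](double& x) { x = x * 2.0 + 1.0; });
        return 0;
    }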
And the third thing that we have implemented on top of the tasking is we have extended or we have tried to extend most of those facilities to the distributed
case so that you have a runtime system or you can write code that is completely agnostic to
whether things are happening on your local machine or when you run it on a cluster,
for instance, on a high-performance computing system, or whether things happen on the neighboring
node. And all the networking, all the remote procedure invocation and everything is hidden behind
the scenes so that you essentially do a simple async with a function and you get a future
back and that function can run on a different node completely transparently.
So in that mode, when you run an application on a high-performance system, you usually have the problem that you have several memory spaces, right? Each
node has its own physical memory space. So you have that distinction, which reflects in these
programming models, which are used nowadays, that you have to specifically work differently when you
access local data and when you access remote data, because it's different memory spaces, right? You can't directly address.
And there we have added a global address space
so that you can treat the whole application
essentially running as if it was running on one giant machine,
even if it runs on a thousand nodes under the hood.
And that is nice because you can move things around between nodes,
which is a big problem in high-performance computing, because you want to achieve load balancing and you want to move things around in a way that each node has the same amount of work. And doing that with conventional methods is very difficult. So those are kind of the three things we have in HPX, and I think that is what distinguishes HPX from other similar implementations.
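A rough sketch of the distributed side described here, following HPX's documented plain-action pattern; treat the header names and details as assumptions rather than gospel:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/actions.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/runtime.hpp>

    int square(int x) { return x * x; }
    // Register the function so it can be invoked remotely.
    HPX_PLAIN_ACTION(square, square_action)

    int main()
    {
        // Pick a locality (node). The call looks the same whether it is local
        // or remote; serialization and networking happen behind the scenes.
        hpx::id_type where = hpx::find_here();
        hpx::future<int> f = hpx::async(square_action{}, where, 7);
        return f.get() == 49 ? 0 : 1;
    }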
Is that enough for now about what HPX is?
Mikael, you want to add something?
I can expand on some of the things we sort of touched on there.
I mean, I could maybe clarify, when Hartmut says that we, you know, essentially run a thread pool underneath the hood: we have these lightweight threads, and specifically for people maybe coming from other languages or other frameworks, they're stackful threads in the end. So, you know, just like std threads,
you can yield them and suspend them and so on. And this is why we can actually have things
like mutexes inside these threads.
On the other hand, it also means that while they're
a lot cheaper than standard threads,
they're still not for free.
So there is some overhead there as well,
which is something you still need to keep in mind.
HPX just shifts the minimum grain size.
So we usually talk about the grain size
when we're talking about sort of the task size
or the size of the amount of work
that you want to actually submit as a task.
That's usually the grain size.
And with standard threads,
you need to have quite a lot of work for it to make sense
to actually spawn a new thread with HPX
that shifts much further down. So you can have much smaller tasks, but there's still a limit.
The other thing is, okay, our futures, again, I haven't sort of implied it,
but standard futures just are on their own.
You can call std::async, you get a future, but you can't really do anything with it
except get the value at some point.
With our futures, you can chain them.
So the dot-then. There are other futures libraries that allow you to do that as well. But I think that's one of the key features, actually, that you can chain these tasks and build up complicated graphs of tasks using these primitives.
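A small sketch of that chaining, assuming hpx::future's .then passes the predecessor future into the continuation:

    #include <hpx/hpx_main.hpp>
    #include <hpx/future.hpp>

    int produce() { return 21; }

    int main()
    {
        // Chain a continuation instead of blocking on get() in between;
        // task graphs are built up from exactly this building block.
        hpx::future<int> result =
            hpx::async(produce)
                .then([](hpx::future<int> f) { return f.get() * 2; });
        return result.get() == 42 ? 0 : 1;
    }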
And then I suppose the third thing I wanted to add is that, for the parallel algorithms, again, the ones in the standard library take various execution policies, but they're all blocking calls. One of the things that was already there when I joined or started working on HPX is that we have a sort of task policy for the parallel algorithms, which allows you to run things either sequentially or in parallel, but then on top of that you get a future to the result. That means that you can actually not just have your individual tasks that run on one thread, but you can have a full parallel region as part of your task graph, which is pretty cool.
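A sketch of the task policy being described, assuming the hpx::execution::par(hpx::execution::task) spelling from recent HPX releases:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<double> v(1'000'000, 1.0);

        // par(task) makes the algorithm return a future instead of blocking,
        // so the whole parallel region becomes one node in a task graph.
        auto f = hpx::for_each(hpx::execution::par(hpx::execution::task),
                               v.begin(), v.end(),
                               [](double& x) { x += 1.0; });

        // ... do other, possibly unrelated, work here while it runs ...

        f.get();   // join only when the result is actually needed
        return 0;
    }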
Well, the main benefit of that, if I may add, relates to one of the questions you wanted to ask today: am I happy with parallel algorithms and stuff, right? Well, yeah, in some sense, I'm very happy about them. On the other hand, they are very limiting, because they still give you fork-join parallelism, right? You fork a couple of threads, and then you join them in the end.
And what many people don't see when they use that in OpenMP or even the parallel algorithms
is that that join operation is what costs you dearly, especially when you have a large
number of cores you want to, or threads you want to join at that point. So that join operation can add significant overhead
to your overall execution.
By launching a parallel algorithm on the side asynchronously,
you still have that join operation in the end,
but part of the cores can free up early, right?
And part of the cores have to do the join operation in the end.
By moving them on a side and executing it asynchronously,
you can do other work when the cores free up.
So you can kind of mitigate that join operation very nicely in the end
by doing other possibly unrelated work on your main execution flow,
if you want.
So that gives you performance benefits,
even if you would think, hey, I'm adding more overhead
because I'm getting a future back now.
No, you're saving on the join operation,
which is one of the main culprits, actually,
of the existing execution models we have.
People just slap an OpenMP pragma in front of their loop
and say, yeah, great. I parallelized that thing.
No, bummer.
Will not work.
Yeah.
Okay.
Yeah, sorry.
So I just wanted to say, I guess, specifically, it's Amdahl's law that kicks in, where if you have your fork-join, it's the serial regions that kill you in the end.
You try to run it on a 128-core dual EPYC system or something like this,
and it just won't scale the way you want if you're doing fork-join.
Yeah, that brings up two questions I had. One is, when Hartmut said, if you have a lot of cores, I'm like, what is a lot of cores to them?
What is a lot of cores to you?
The biggest run we've ever done with HPX was using 650,000 cores.
Okay, that's definitely a lot.
On, I don't know, 5,000 nodes or something like that. Each node has up to, I don't know, 256 physical cores on the high-performance computing systems. So there, you know, if you run it on your laptop with four cores, then the join operation probably will not be noticeable or not that significant. But if you run it on a large system, then these join operations, really, Amdahl hits you over the head here, right? You add sequential execution and you force your execution into some sequential piece.
And that is exactly what kills your scalability. So then also I was thinking about, we were talking
about how you can, some of the cores will get freed up earlier in a parallel operation. So
I want to make sure I understand. You're suggesting if you know that you have like three algorithms
that could run in parallel, then with your version of these algorithms, you could say, OK, future, future, future for those three.
And then you ultimately get as close as you can to complete utilization of your cores: as soon as some cores start to free up on the first algorithm, the second algorithm will just automatically start using them.
Is that correct?
OK.
Yes. Yes. If you have three algorithms in a row, then when you do it
synchronously, you get these three join operations. And that is a big pitfall. And that's, by the way, where ranges come in. And that's what we have added there; as I said,
we are currently implementing the C++20 range-based algorithms.
And what we have done, we have added parallel versions for those algorithms as well,
even if the standard doesn't specify them.
So essentially, you can pass an execution policy to the range-based algorithms as well
and parallelize the range-based ones.
And that is very interesting and non-obvious
because what libraries like the standard range library or range v3 do for you
is silent loop fusion.
If you have three algorithms and you join them with a pipe
in the range-based world, you pipe them, right?
That essentially means,
and the main reason you want to do that, or the goal of allowing that is to avoid the temporary arrays in between the algorithms. And that means that these three algorithms are executed not one
by one, but element by element. So the first element will be kind of dragged through
the three algorithms, and the second element will be dragged through three algorithms, and so on,
because it's the only way for you to avoid the temporaries in between. And that means you fuse
the three loops into one bigger loop, where one loop executes the three operations for each
element consecutively. And that means if you parallelize that one, you suddenly gain a lot of potential because you parallelize the fused loop
instead of parallelizing these three separate loops. And you don't have to do anything. You
just use the range-based parallel algorithms and pass a sequence of piped algorithms,
piped range-based algorithms to it, and you get loop fusion,
which is another big gain. That's what you normally would do by hand, right?
If you have three algorithms in a row, you would try to fuse the three loops yourself by hand.
By combining parallel range-based algorithms with libraries like RangeV3, you get that for free.
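A sketch of what that can look like, assuming hpx::ranges::for_each accepts an execution policy plus a standard views pipeline; the per-element work function is made up:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <ranges>
    #include <vector>

    void process(double) { /* imagine real per-element work here */ }

    int main()
    {
        std::vector<double> data(1'000'000, 1.0);

        // The piped views form one lazy, fused range -- no temporary arrays
        // in between the steps.
        auto fused = data
                   | std::views::transform([](double x) { return x * 2.0; })
                   | std::views::transform([](double x) { return x + 1.0; });

        // One parallel loop over the fused range instead of separate passes.
        hpx::ranges::for_each(hpx::execution::par, fused, process);
        return 0;
    }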
So for your parallel range-based algorithms, do they actually have different names like par underscore for each or something?
What does this look like?
No, they are just in a different namespace.
Oh, different namespace, okay.
They are in the ranges namespace and take the normal execution policy, the par execution policy.
And the interesting thing is that they don't do much.
They essentially just rip the range apart
and to begin and end and pass that
to the underlying implementation.
That's the whole trick there.
But it opens up these loop fusion,
automatic loop fusion effects, which are very beneficial.
So do you use like a expression template kind of technique or something
to fuse the loops? How does that actually...
I'm curious how it's actually implemented.
Range v3 is doing that.
We don't do anything. Oh, it doesn't...
Okay, then I didn't realize that.
And the standard library has these piping
operators as well, I believe. Yes.
Do you know that? So it will do
the same thing. It has to do it
to avoid the temporaries in between the algorithms.
Otherwise, it wouldn't work, right?
Okay, okay.
The only way to implement that without
creating temporary arrays
or temporary containers in between
the steps is to do it element by element
ways. There's no other way of doing that.
So they have to fuse the loops.
Okay, I missed that detail.
That's pretty cool.
You brought up one other question in my mind as you were describing all of this.
And actually, coincidentally, I wasn't even thinking about this.
I just released an episode on C++ Weekly this morning that is about the C++17 parallel algorithms.
But it's a super high-level episode.
It just shows, look, if you have the right workflow, the right data, you can magically get a faster algorithm. But one of the sore points for me is that libc++ still doesn't have parallel algorithm support. Can I take HPX and just drop
it into my code? And now it'll work on Clang, on macOS, and boom, I magically have parallel algorithms available to me.
I mean, is it portable across all the platforms?
Wouldn't that be cool?
But I think Mikael has something.
Pretty much.
Yeah, so that's the goal.
So Clang and GCC are well-supported.
Hartmut uses Windows, so we support Windows as well,
or Visual Studio.
macOS is a bit difficult.
We're not too many people working on this,
so we sometimes have breakages there.
And really, you need to have someone
who is actually dedicated to fixing these things
to be able to support it.
So we try to do it as well as we can.
We often have patches or PRs coming in
from people who fix compilation on their platform,
sometimes on BSDs, things like this.
We can't, unfortunately, support it fully ourselves,
but wherever we can, we try to be as portable as possible.
Okay. Well, I think that was part of the question, which I would like to add to: can I just use the HPX parallel algorithms and drop them into the existing code I have?
And that is a bit more problematic at least at the moment because currently the HPX algorithms use the HPX threading system underneath, which is not fully compatible with other threading you might do yourself.
Ah, all in or not then?
Yeah, at the moment, yes. But, and that's where I was hoping Mikael would jump in, because what C++23 is aiming at,
and you might have heard about that,
is the executor discussion,
which is nowadays not an executor discussion anymore.
They call it schedulers nowadays, but whatever.
But the idea here is to create infrastructure that allows it on the standard
library level to combine different threading implementations into one. And this is based on
senders and receivers. So, a very low-level abstraction mechanism for building asynchronous execution chains by having senders, which give you some value, and receivers, which receive those values, and that is being standardized. And the effect here is that you will be able to
move execution between execution environments like GPUs or threading systems like HPX or
standard threading. And once all of this is in place,
and that's our goal to get that implemented,
we will be able to use the HPX standard algorithm implementations directly in the context of your normal C++ program.
Because then these algorithms will be agnostic
to the execution environment they are running on.
And we hope to achieve that at some point.
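A very rough sketch of that sender/receiver style in the P2300 shape; every name here (thread_pool_scheduler, schedule, then, sync_wait under hpx::execution::experimental) is an assumption that may differ between HPX versions:

    #include <hpx/hpx_main.hpp>
    #include <hpx/execution.hpp>
    #include <utility>

    namespace ex = hpx::execution::experimental;

    int main()
    {
        // A sender describes work to run on some scheduler; nothing runs yet.
        ex::thread_pool_scheduler sched{};
        auto work = ex::schedule(sched)
                  | ex::then([] { return 40; })
                  | ex::then([](int x) { return x + 2; });

        // Block until the chain has completed.
        ex::sync_wait(std::move(work));
        return 0;
    }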
Yes.
Okay.
So let's just say, well, I have two questions now. If, in my code, I am currently only using the C++17 standard parallel algorithms, and I say, okay, I'm going to switch fully to the HPX model based on your lighter-weight threads, would I expect to see, without changing anything else, a performance difference?
Yes.
Okay.
That's a very straightforward answer.
I would expect that, yes.
Okay.
And then the other question is... no, I don't think I have another question. But the point is, I have to go all in on your threading model at the moment if I want to do that.
At the moment, yes.
This was already mentioned with the benefit that ideally you just need to change std to HPX.
So the transition would be fairly straightforward.
Well, I'm just thinking, like, on another project that needs to support all three major operating systems: if macOS isn't very well supported, but we can get the performance boost that we want on Linux and Windows, then it sounds like I could just alias namespace hpx into a local namespace somewhere and swap it at compile time. Assuming I stick to the standard things and don't use your more advanced features, then that sounds like a perfectly reasonable option for some people.
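A sketch of that compile-time swap, assuming the code sticks to the std-conforming subset; USE_HPX is a made-up build flag:

    #if defined(USE_HPX)
        #include <hpx/hpx_main.hpp>
        #include <hpx/future.hpp>
        namespace par = hpx;      // hpx::async, hpx::future, ...
    #else
        #include <future>
        namespace par = std;      // std::async, std::future, ...
    #endif

    int work() { return 7; }

    int main()
    {
        par::future<int> f = par::async(work);
        return f.get() == 7 ? 0 : 1;
    }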
By the way, if I may have a plug in here.
So if somebody of the audience is interested in helping us with supporting macOS, please
get in contact.
We are always a very open community and we would like to have more people helping with
developing.
I wanted to interrupt the discussion for just a moment
to bring you a word from our sponsor, C++ Builder,
the IDE of choice to build Windows applications
five times faster while writing less code.
It supports you through the full development lifecycle
to deliver a single source code base
that you simply recompile and redeploy.
Featuring an enhanced Clang-based compiler,
Dyncomware STL, and packages like Boost and STL2
in C++ Builder's Package Manager, and many more.
Integrate with continuous build configurations quickly
with MSBuild, CMake, and Ninja Support,
either as a lone developer or as part of a team.
Connect natively to almost 20 databases
like MariaDB, Oracle, SQL Server, Postgres, and more
with FireDAC's high-speed direct access.
The key value is C++ Builder's frameworks
powerful libraries that do more than
other C++ tools. This includes
the award winning VCL framework for high performance
native Windows apps and the powerful
FireMonkey framework for cross platform
UIs. Smart developers and
agile software teams write better code faster
using modern OOP practices
and C++ Builder's robust
frameworks and feature-rich IDE.
Test drive the latest version at Embarcadero.com.
So I think we've mentioned a couple of the features that were released in the newest version of HPX,
which is 1.7.0, I believe, right?
So I think the range is supported.
What else is new in this new version?
I suppose one of the things, and I mean, I hope you can tell more about this, but GCC, or libstdc++, has support for the SIMD types now, experimental, but still. And something we've been doing for a while is Google Summer of Code. We currently have a student working on essentially implementing policies based on GCC's SIMD implementation, which is quite cool. And as far
as I've understood, the performance seems to be quite good and it's been quite a smooth transition.
We used to have, or we still have support for VC, which is another vectorization library.
But eventually that will just get replaced by the standard support for SIMD.
So that's one big thing.
I don't know if you want to add anything about that.
Well, I'm going to go ahead and interject right here that you should probably give our listeners an overview of what SIMD means and what that actually looks like to use it.
Well, SIMD stands for single instruction, multiple data, and is an execution model which is used when you vectorize or when
the compiler is vectorizing code. Essentially, let's assume you have a for loop over floats,
and you calculate something for each of the floats
in your input array.
Then compilers might be able to vectorize that
by instead of doing one operation at a time,
just put it into the vector registers
and do four or eight operations in one cycle.
So, these SIMD types that are now in the std::experimental namespace in GCC and its libstdc++, right?
libstdc++, I think that's the GNU one.
Yeah, in the GNU library. They essentially introduce types that encapsulate vector registers in the CPU, and overload all the operators on those types, so that you can use those types as if it was a single float. So I can say very explicitly, I am using eight floats or whatever in this register.
Now do these things on them.
Okay.
And that maps onto compiler intrinsics.
So it generates fairly efficient assembly code underneath,
which is optimally close to what you would do by hand when you add vectorization yourself.
So it's a very convenient way, and a C++ way, to do vectorization, explicit vectorization, and not rely on the compiler, but do it yourself and essentially have full control over what's happening.
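A small sketch of those SIMD types, using the std::experimental::simd implementation that ships with GCC's libstdc++:

    #include <experimental/simd>
    #include <cstddef>

    namespace stdx = std::experimental;

    // Multiply-add over a float array, one full vector register per step.
    void scale(float* data, std::size_t n)
    {
        using pack = stdx::native_simd<float>;   // e.g. 8 floats with AVX
        std::size_t i = 0;
        for (; i + pack::size() <= n; i += pack::size())
        {
            pack v;
            v.copy_from(data + i, stdx::element_aligned);
            v = v * 2.0f + 1.0f;                 // operators overloaded on the pack
            v.copy_to(data + i, stdx::element_aligned);
        }
        for (; i < n; ++i)                       // scalar tail
            data[i] = data[i] * 2.0f + 1.0f;
    }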
The relation to the algorithms, which Mikael mentioned, is that we have added special execution policies, like par or seq, which are called simd and simd_par. Instead of calling the lambda which represents your loop function, your iteration, with one element, they call it with a SIMD type that holds eight or four elements. So your lambda now can operate on vector types automatically.
So essentially you write the code as if it was normal, non-vectorized code, but the loop
function will be called with a vector type.
So you have a Lambda with auto as a parameter type,
and the compiler will deduce that correctly.
And since all operations are overloaded for the SIMD data types,
it's fairly straightforward to make it work properly.
Okay.
So by just changing the execution policy,
you can, for arithmetic-based kernels,
you can easily achieve vectorization and full control over it,
even in the standard parallel algorithm world.
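A sketch of those policies in use; the simd and simd_par spellings follow the names used in the conversation, and the exact namespaces are assumptions that may vary between HPX versions:

    #include <hpx/hpx_main.hpp>
    #include <hpx/algorithm.hpp>
    #include <hpx/execution.hpp>
    #include <vector>

    int main()
    {
        std::vector<float> data(1'000'000, 1.0f);

        // The same generic lambda works whether it is called with a single
        // float or with a SIMD pack, because the operators are overloaded.
        auto kernel = [](auto& x) { x = x * 3.0f + 1.0f; };

        hpx::for_each(hpx::execution::simd, data.begin(), data.end(), kernel);
        hpx::for_each(hpx::execution::simd_par, data.begin(), data.end(), kernel);
        return 0;
    }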
So it's a new execution policy, parallelization policy that you added.
Yeah. Well, vectorization.
simd is for sequential-only vectorization, and simd_par is doing both, vectorization and parallelization.
And the nice thing here is for compute bound kernels, we see speed up a factor of 250 on eight cores.
No, 16 cores. 16 cores. I don't want to lie.
On eight wide vector registers, which is just amazing, right?
If you compare that to sequential execution,
just getting a speedup of 250 by changing the execution policy is quite a feat.
So you're saying going from serial to parallel with vectorization is 250 times faster.
Okay.
On compute bound arithmetic kernels.
And so this does kind of,
you must be operating over floating-point types.
Is that correct?
Well, anything the vector units support.
Okay.
So it could be integers, could be floating points.
So some integral data type supported by your CPU.
So now the system then basically falls apart.
If it was even just a struct of two
ints, then there's no way for you
to automatically decompose that and
do a thing on it.
There are tricks you can do that
by using zip iterators and
adding support for
zip iterators that you take it apart and
repackage it as a zip off
of two vector registers, things like that.
Right.
But we haven't done that explicitly at this point.
Yeah, I know, I'm just, now I'm just super curious.
So I'm just kind of sussing out where the corner's here
on this, because that, I think I can visualize
what you're doing and it sounds really cool, so.
That work has been, is driven by Matthias Kretz.
He is very active in the standardization community
and he has implemented the SIMD types for GCC, I believe.
So he's doing all the work there and specifying things.
And the SIMD power and the SIMD execution policy
are his idea as well.
So we just took, again, what the standard is trying to do
and where we are trying to go as a standard
and create some experimental
implementations to see how much effect we get. So if I do the SIMD thing and get this awesome speedup, but then I go and recompile on Windows, it'll still compile, it just won't be able to do the SIMD thing, because you don't have the intrinsics, or rather, you don't have the SIMD types that GCC offers, on Visual Studio.
Correct.
But there you can use Vc,
as I said, the other library.
That is not really supported anymore, but at least for the time being, you can use other libraries there.
Okay.
I do very similar things.
Yeah.
And I guess...
You just mentioned standardization again.
I was just curious how involved are HPX team members with the future of the standards? You know, obviously,
we talked a little bit about executors, which is no longer executors. So how involved is HPX in
the direction of this? So there's, I guess, a long history. Hartmut has been part of the
executors discussions years ago. I guess at some point, you know, it just drags on and on.
You know, people come and go.
I've recently just, you know, started joining some meetings, just to listen in. And, you know,
eventually, we hope we can actually contribute something in the form of feedback, because now
we actually have an implementation of at least parts of the new proposals. So that was sort of the third big topic in our last release
is support for senders and receivers.
And I guess, yeah, it's an interesting topic because, okay,
we have standard futures,
which are very much tied to standard threads,
quite heavyweight.
We have our own implementation of HPX futures.
We will keep those around.
But these always involve heap allocation for shared states and things like this.
And senders and receivers are really sort of a generalization of futures and promises,
in a sense, where you can, in the best case, avoid lots of allocations.
There are very generic framework
for chaining various asynchronous operations,
not necessarily even asynchronous.
You can also write synchronous code,
but if that's all you wanna do,
then you might just write it sequentially.
But it's quite a neat framework,
and I actually hope that it's going to finally take off and maybe even make it into C++23, because from our experience it looks quite nice and has worked quite well. There are certain things when you start dealing with network communication: for example, MPI is one of the big things we deal with in the HPC world. And then, of course, accelerators with CUDA, AMD's HIP, and there's Intel's SYCL, for example. These are all quite important, because in all these cases you're dealing with a GPU, and trying to combine these is one thing, which we have been able to do with futures, but there are certain overheads when you're dealing with just plain futures. There are certain optimizations that you can't, at least cleanly, do with just plain futures, and senders and receivers allow you to do that sort of officially, according to the sender concepts, the receiver concepts, and so on.
So I didn't go to all the meetings,
but I attended quite a couple of meetings.
And at least we haven't been really
able to directly influence things in the standardization
committee because we're kind of on the fringes, right?
University.
You're dealing with big companies there when you're doing standardization. But I think we have at least reported back on the results we had, and we have influenced things or confirmed things and helped move things forward, even if no features we have used have made it directly into the standard yet. But the fact that
we have confirmed, for instance, the concurrency TS1, which was specifying the futures exactly in
the same way as we have them now in HPX. So the dot-then and these methods of combining futures were all specified there.
So we have taken all the things from there.
And our experience has at least helped deciding whether to go ahead with that or not.
And in the end, we decided not to go ahead with it because of the things Mikael mentioned, right?
There's quite some overhead associated with it, and we can't do better than that.
And that's one of the reasons why the concurrency
TS1, the technical specification
one, didn't go into the standard
in 2017,
I believe.
It was dropped on the floor because
of that.
So we had some impact,
even if not directly visible.
I think that's the best we can do at this point.
I think I just wanted to add that I think that's one of the key things that, you know,
implementation experience, I think that's quite important for standardizing things.
There are a few other implementations currently of senders and receivers.
But I think with HPX, we have actually quite a large and diverse code base where we can test these ideas out, especially, like I said, with accelerators and in actual applications.
Not saying that other people don't have applications, but we have at least one interesting use case in HPX applications. The other thing I wanted to mention is that even if we don't directly influence the standards,
I hope at least that we're making people aware of the ideas that are being proposed and hopefully
actually going into the standard library, like you're on the show right now.
But also, for example, at CSCS, they've been quite involved. And, you know, users and
scientists writing scientific applications, you know, get directly exposed to these ideas. And,
you know, slowly, they will hopefully trickle into their applications as well. If there were
fewer people doing that, maybe, you know, 2030, we might see people actually adopting these things.
If we tried to push, it might happen a bit earlier.
These conversations just make me feel like I'm getting old.
I'm like, in 2030, am I going to care still?
That's like nine years from now.
I'm going to be thinking about retirement.
If you don't mind if I do just a little bit of historical digging, because Hartmut, if I recall correctly, you were involved.
Well, you did mention explicitly that you were involved in Boost, have been involved in Boost for a long time.
But you were involved in like Boost Futures specification or implementation a long time ago, were you?
Well, I proposed the library in 2004 implementing something that was very similar to Standard Futures.
Yes.
At that point, nobody picked up on it because they said, hey, why do we need that crap?
So it didn't go anywhere.
But that was the first time I kind of dug into Futures.
And I still think it's a very neat feature which allows to do a lot of things
in a very convenient way. Mostly because it gets away from the notion of threads,
because I believe people shouldn't use threads directly. No raw threads, just to add one more
paradigm to Sean Parent's list of things, right? No raw new, no raw loops,
no raw pointers, no raw threads.
Right, right. Also,
no raw synchronization primitives. That
goes along with raw threads.
Well, you still need some sometimes, right?
Even with futures, you have
to do things. But if you
work with value semantics and
try to make your tasks self-sufficient with
no global side effects, then you don't need synchronization at all. So that works very well.
Yeah. Well, for the last 10 years or so, if I have to write parallel code, I aim for futures,
no explicit synchronization, and then, since C++11, maybe pass around an atomic occasionally
if I have to. It's like a stop token. But now we have stop tokens, so I don't have to do that
anymore. Yeah. Mikael, I was wondering, in your bio, you mentioned that in addition to working on
improving HPX itself, you help users integrate into their libraries and applications. I'm just
curious what that looks like. Obviously, you know, we've talked about how you can just change the
namespace from std to HPX and get a lot of performance improvements. What else are you
doing with those users? Yeah, so one of the main drivers of us developing HPX at CSCS is
distributed linear algebra.
This is a sort of big topic for many
of the scientific applications where, you know,
linear algebra comes up surprisingly
in scientific applications
and they need to do it efficiently.
And many of the existing libraries, they, you know,
they do fork-join.
They do a lot of sort of explicit synchronization, sort of global barriers using MPI, which is if you have a distributed application, you synchronize between each step.
And just like on a node, you end up with sequential portions of the code that doesn't scale that well.
So one of the developments we have is a library called DLA Future,
which is the future of distributed linear algebra,
but also based on HPX Futures.
And this is, you know, a really, a good test case
and a good, I think, application also for HPX.
It plays to the strengths of HPX so what what you're essentially doing is
you want to do linear algebra on your matrices and you split these matrices up into blocks
and they live on different compute nodes and then you need to do some operations on these
and by dividing your work up like this, you end up with discrete tasks.
You want to do a matrix multiplication on one node.
Then you wait for the result from some other node.
And it kind of plays very naturally into this task-based programming flow.
And then, like we already mentioned with parallel algorithms,
if you have multiple parallel algorithms that are independent,
again, this sort of naturally just load balances.
You get another algorithm filling in the gaps on one worker thread once it runs out of work.
So what we're developing at CSCS is not my project.
It was led by a guy called Raffaele Solcà.
And essentially, we're helping them build this library.
At the moment, it's a sort of generalized eigenvalue solver.
Specifically, I know little about the actual science behind it.
The goal is essentially to beat the existing state of the art implementations,
which I think we're actually doing at the moment already.
There's still some development work to be done there, but at the moment it's looking actually very nice.
And then the next steps would be actually integrating this into applications that do the real science.
This is just the library that does the linear algebra and it sort of goes stepwise closer and closer to users.
You know, I'm thinking, Rob, through the course of this conversation here that our guest a
few weeks ago, Ondřej Čertík, working on his Fortran implementation, should use HPX, because the nature of Fortran means you can prove some of these things right out of the gate. You can say, oh, well, I could apply a parallel algorithm here, it's only floats, use HPX, and just automagically get super fast C++ output from his Fortran-to-C++ compiler.
You all should collaborate on this somehow.
In the end, that's an application HPX was made for.
It's a runtime system.
It's to be used by others to speed up their library,
their application, whatever they want to do. So it's never meant to be used as a thing on its own.
It's always really just the environment that sits on top of the operating system and gives you more
efficient facilities, very much along the lines of what the operating system gives you but focused on a
single application. The operating system is kind of the mother of everybody and tries to make it good for all applications that run on the same node at the same time. A runtime system is very egoistic: it just cares about the application it's related to and linked to. And it doesn't care what the others do. It just grabs all the resources it can get
and utilizes those.
And so if you can integrate that
as something in a compiler,
like you mentioned,
or into a library,
you don't even have to expose that to the user.
Yeah.
You use it really just as a facility
that allows you to speed up the code you're
generating. So that's one of the perfect applications for this. Yeah, I would love to see that happening.
Sounds pretty cool. So maybe to close this all out, is there anything you can tell us, Hartmut or Mikael, about the upcoming features of HPX that you want to preview?
No. So we still definitely have some work to do with senders and receivers. That's at least a high priority on our side at CSCS. Instead of integrating that into applications, getting more
field experience with how that works and seeing what kind of optimizations we can do with those.
That's, you know, number one for us, I guess.
Yeah, Mikael and his team are mostly worried about using HPX for local parallelism
and leave the distributed part to things like MPI or other systems.
We here at LSU are very interested
in the distributed part of HPX as well.
And we use it for several applications,
very large scale applications.
And I really hope that we can move forward on the,
and perhaps extend the sender receiver concept
into the distributed world, which sounds very promising.
So senders receivers is really a big topic I see for the future.
But standards conformance in the general case is very important to us
because that lessens the learning curve for people, right?
When they come with their own C++ code and they know C++,
it's not a problem for them to switch to HPX because in the end,
all they have to do in the first place is change the namespace and be done with it.
And from our experience as well, when people come on to the project, they are not struggling with HPX itself.
They mostly struggle with C++ and with modern C++, which we really try to use for implementing things.
So using the HPX is very simple and straightforward if you know C++ well.
If you believe that you know C++ and you know how it is, right? Everybody thinks they know
C++ on a scale from one to ten at about 5, right? No matter who you ask. Whether it's
a newbie who has attended one lecture in university, they believe, oh, 5, yeah, I know what that
is about. Or if you ask Bjarne, right? Bjarne says, I'm at 7 or something like this.
Yeah, and you get really nervous if someone says 9 or 10 because you're like... I know only a few people I would believe what they said.
Okay.
So just to round that up, if you're interested in HPX and want to try it, feel free to get in contact.
We are more than happy to support you.
The team around HPX is a very large team. We are kind of representing people contributing
all over the world, from South America,
North America, Europe.
I don't think we have anybody in Africa.
I think we had somebody in
Egypt at some point.
Russia, China, Japan.
So please join the team.
It's a very nice,
large, open-source team with a very
open atmosphere.
One of the few channels, IRC channels, that are not hostile.
So everybody's welcome.
I guess on that note, where should our listeners go if they do want to either join the team
or just maybe looking for help in getting started, getting your own stuff to run using HPX?
So I guess, well, we're on IRC and Matrix.
On Libera.Chat, there's the Stellar channel.
But the easiest way to find that is through hpx.stellargroup.org.
That'll be in the show notes.
There are links to different ways you can contact us.
We also have a mailing list, both
for users and developers.
And you can follow us
on GitHub for
releases and just checking what's
going on with pull requests and so on.
It's all public and open.
We try to hide as little as possible.
Warts and all.
Yeah, I think the website is the best entry point to actually find us.
Perhaps one thing to add: the license HPX has been published under is the Boost Software License.
So it's very liberal.
You don't even have to tell us when you use it.
So very open, no strings attached.
Oh, is it supported by any of the
C++ package managers?
Can I use vcpkg or Conan or something?
Yes. Well, not Conan. We have vcpkg. We have Spack. Spack is more HPC related.
We would love to have somebody
working on a Conan package.
People are asking for it,
but we didn't have the bandwidth
to look into that.
Well, Mikael and Hartmut,
thank you so much for coming on the show today.
Thanks for having us.
Thanks for having us.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com slash cppcast.
And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.