CppCast - Reducing Memory Allocations
Episode Date: March 12, 2021

Rob and Jason are joined by Arnaud Desitter. They first discuss blog posts on parameter passing, fuzzing, and push_back vs emplace_back. Then they talk to Arnaud Desitter about his successes improving application performance by reducing memory allocations found using heaptrack.

News
- Hacking on Clang is surprisingly easy
- Parameter Passing in C and C++
- Fuzzing Image Parsing in Windows, Part Two: Uninitialized Memory
- Don't blindly prefer emplace_back to push_back

Links
- PVS-Studio Episode Transcripts
- Reducing Memory Allocations in a Large C++ Application - Arnaud Desitter [C++ on Sea 2020]
- Reducing Memory Allocations in a Large C++ Application - Slides - Arnaud Desitter [C++ on Sea 2020]
- heaptrack

Sponsors
- PVS-Studio. Write #cppcast in the message field on the download page and get one month license
- The Evil within the Comparison Functions
- Top 10 Bugs Found in C++ Projects in 2020
Transcript
Thank you. In this episode, we discuss parameter passing and fuzzing.
Then we talk to Arnaud Desitter.
Arnaud talks to us about reducing memory allocations, tooling, and more. Welcome to episode 290 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing okay. Do you want to talk about what happened at your house this week?
Yeah, sure. Why not? I had a little bit of excitement when I went to bathe my dog this weekend, and at that moment learned that in some of this very cold weather we had had, I ended up with a frozen pipe up here in the corner. It's a thing they do in cold regions, at least in the U.S.: we've got these frost-proof sillcocks, as they call them. When you turn off the water at the outside spigot, it's actually part of a long pipe, and it shuts the water off back deep inside the house. So if any water got trapped in that section of pipe that's more exposed to the cold air, it might freeze, it might burst, it might do whatever. And you're not going to know it until you turn on the spigot, and then it's going to let water flow through the now-busted piece of pipe.
So that happened. And both my wife and myself were like, oh, it sounds a little weird, but Jason's just running water, you know, washing the dog. And I'm out there like, it sounds weird, and I only have half the water pressure I expect to have, but I'm sure it's fine. And then after a minute I'm like, wait a minute. Oh no.
So this is outside using the hose, and as soon as you went back in the house you realized what had happened?
As I sprinted down to the basement as fast as I safely could, which is where I do my recording here. It's a basement (I have a nice bright window, but it is a basement), and I discovered enough water to make us rush around very quickly and then try to find a professional who could make sure that was repaired properly and all that stuff. It was exciting.
Yeah, sounds exciting.
I'm glad it all got resolved.
I don't recommend it, just for the record.
No.
I also learned there's a phrase that basically doesn't exist in the English language: "a minor basement flood." Those words can't go together.
But it is pretty accurate for what we had. Any kind of water in the house is just not good, though. But I thought it was fairly minor, and no actual damage to anything.
So, very good.
Before we move on, one quick programming note I wanted to mention. PVS-Studio has been working with us for a while as a sponsor, and we really appreciate it. One thing they asked us about a couple weeks ago, and then started putting together, is generating episode transcripts. They've put these out for a couple of episodes now, and I've gone back and updated the show notes for those episodes so that they have a link to the transcript, which is on PVS-Studio's website. They're making them in both English and Russian. So if you are a listener to the podcast and you think you might want to actually read through the transcript, or if you maybe know someone who might be interested but is deaf, or who would just prefer to read it versus listening to us, there are transcripts available for a couple of episodes, and I think they'll keep making more of those, which we really appreciate.
And it can be handy just for searching and referencing as well.
Oh yeah, if you want to be like, hey, I thought I heard someone say this thing at this one time. Which right now is kind of possible if you go and look at the automatic transcriptions on the YouTube postings of these videos.
Right. The automatic YouTube stuff is certainly not perfectly accurate. You know, if I say "CPP," they probably write out "the ocean, see" something. They don't know what I'm talking about.
You know, before we move on, Rob, we may as well talk about the fact that the view behind you has changed considerably as well.
Yeah. I don't recall if I mentioned it, but we are preparing to move.
So right now we're getting ready to stage our house to have people come in and take pictures
and then do open houses for people to buy the house.
So it's been a little chaotic around here, too. But looking forward to the move.
Yeah.
Good luck with that.
Yeah.
Okay.
Well, at the top of every episode, I like to read a piece of feedback.
We got this tweet from Martin Durham.
And he wrote, Rob and Jason, hey, listening to your old new thing episode and heard you talking about lambda syntax.
I 100% agree.
We need an arrow syntax.
I went so far as to implement it in Clang and wrote up a thing about it.
Love your show.
So he's got a link here, which I'll put in the show notes.
And the post is titled Hacking on Clang is surprisingly easy.
And I think he actually wrote a second tweet saying how, yeah,
putting this type of thing in the compiler is actually really easy.
But going through the process of standardizing it and trying to get it into a future version of C++ is a different story.
I know there's been other proposals along this line.
So there's certainly got to be someone Martin could collaborate with.
Yeah.
And this is definitely syntax that I've seen with lambdas in other programming languages, so I think it would make sense for us to get it adopted into C++ as well, the simpler syntax.
Yeah. Okay, well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Arnaud Desitter.
Arnaud is a senior software engineer based in Oxford with 25 years experience in scientific
programming.
He has a special interest in software reliability and optimization.
He holds a civil engineering master's from École des Ponts ParisTech, France.
He is a member of ACCU and has done several presentations at their conference and local events.
He's worked on reservoir simulators for the last 15 years.
He cycles to work and is currently midway through his third virtual trip around the world.
In his spare time, he sails an old wooden dinghy on the River Thames.
Arnaud, welcome to the show.
Yeah, thanks for having me.
I'm curious how you ended up getting interested in scientific programming in the first place.
Well, I studied a civil engineering degree,
and I got a lot of lectures on mechanics, fluid mechanics, CFD, and soil mechanics.
My first internship was about studying surfing waves
and movement of sediment on the bottom of the sea.
I got really interested, and it all unravelled from there on.
I ended up being in Britain simulating groundwater flow at the University of Bristol.
And then I found similar jobs ever since.
That's pretty cool.
Yeah, very cool.
All right.
Well, Arnaud, we got a couple of news articles to discuss.
Feel free to comment on any of these, and then we'll start talking more about your work and about that C++ on Sea talk you gave a year ago, right?
Yeah.
Okay.
So this first one is a blog post parameter passing in C and C++.
And I guess it's kind of a deep dive into how parameter passing works at the assembly level. Jason, you want to talk about this one?
Yeah, and specifically the x86-64 calling conventions on Linux and Unix that we're used to seeing; the reference is the System V AMD64 ABI. But yeah, it's a deep dive into how these values actually get saved in registers and pushed onto the stack when you're passing values between functions or returning values. Actually, one comment in here is interesting: the author says he actually changed this code because the return parameter was uninteresting without complicating the example, basically.
Arnaud, do you have any thoughts on this one?
Yeah, I thought it was very interesting,
because my own guideline is always,
if your parameter is bigger in size than two words,
which in 64-bit would be two times 64-bit,
then you pass it by reference.
And I'm kind of vindicated,
because that's exactly what the study shows, right?
And as a matter of fact, I just looked it up in the so-called Core Guidelines, and you've got the F.16 rule. It says indeed that their advice is two to three words, and it turns out that on x64 it's exactly two words in practice. Which is why, for example, you should pass a string_view by value.
That's a good point, yeah.
Otherwise you end up with basically a pointer to a pointer if you pass a string view by reference for no good reason.
Absolutely.
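(Editor's note: a minimal sketch of the two-word rule of thumb being discussed; this example is ours, not from the episode or the article.)

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// A std::string_view is two words (pointer + length), so it fits in
// registers: pass it by value. A reference would just add indirection.
std::size_t count_spaces(std::string_view sv) {
    std::size_t n = 0;
    for (char c : sv) {
        if (c == ' ') ++n;
    }
    return n;
}

// A std::string is bigger than two words (and owns memory), so pass it
// by const reference when you only need to read it.
bool is_comment(const std::string& line) {
    return !line.empty() && line.front() == '#';
}
```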
One of the things that really stood out to me, and this is the kind of thing I love to show students when I'm teaching,
and I'm like, yeah, look, it's easy to see how the compiler is passing these parameters in Compiler Explorer or whatever.
But I had never stopped to consider this example, makeColor.
It's about halfway.
Well, not halfway.
It's towards the top, I guess.
makeColor and makeColorBad.
And the author is showing the example of passing RGBA to a function, and then at some point swapping the arguments so it goes ARGB. If you pass them straight through in the same order, the compiler can do all kinds of optimizations, because they're just going to stay in the same registers no matter what. But if you swap them just a little bit, the compiler ends up having to copy all of the values around in the registers.
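(Editor's note: a hypothetical reconstruction of the pattern being described; the function names follow the post, but the bodies and the Color type are guesses.)

```cpp
#include <cstdint>

struct Color { std::uint8_t r, g, b, a; };

Color makeRGBA(std::uint8_t r, std::uint8_t g, std::uint8_t b, std::uint8_t a);

// Arguments forwarded in the same order: each value stays in the register
// it arrived in, so the compiler can often emit little more than a jump.
Color makeColor(std::uint8_t r, std::uint8_t g, std::uint8_t b, std::uint8_t a) {
    return makeRGBA(r, g, b, a);
}

// Same call with the parameters reordered: every value now sits in the
// "wrong" register, so the compiler must shuffle them all before the call.
Color makeColorBad(std::uint8_t a, std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return makeRGBA(r, g, b, a);
}
```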
Well, I think one of the cool things is that around 2000-ish, when x64 was designed by AMD, they took a lot of care to redesign the argument passing. And I think they did quite a good job, especially considering how bad it was before in x86. It's one of these decisions you make once a decade, and it sticks around for 30 or 40 years, so I'm glad they did it quite well.
Right. Yeah, it's interesting. Is anyone working on, you know, 128-bit calling conventions or anything? I haven't heard any discussion about that.
Well, I doubt it. But maybe in the GPU space.
It seems like 64 bits should be able to stick around for a while. 64 bits should be enough for anyone.
Okay. Next thing we have is another blog post. This is on FireEye, and it is Fuzzing Image Parsing in Windows, Part Two: Uninitialized Memory.
And I might need you to talk me through this one, Jason.
This is one you found.
Actually, one of my friends who watches my YouTube channel,
but otherwise doesn't do any C or C++ development,
but has been involved in security in one way or another,
basically his entire career, sent this article to me. And he was like, I thought you might find this interesting. And I'm like, well, yes, heck yes, I do actually. And it's just this cool exploration into how the author found uses of uninitialized memory in Windows's image parsers. And it's starting from the idea of fuzzing, which we've talked about on here several times, but most tools can't detect a read of uninitialized memory. So one technique that's in here, which I had never seen or heard of before, is actually running the same fuzz input multiple times and then detecting whether you get the same output. If you get a different output in one of the multiple runs, then you know that you have a read of uninitialized memory somewhere that went undetected. So it's a whole deep dive into how techniques like that were used to find two different security vulnerabilities in Microsoft's image parsers, which, I do believe they said, are getting patched; Microsoft is acknowledging the issue.
Yeah. Well, the CVEs were from 2020, so reading between the lines here...
My friend who's been involved in security says that his opinion is that there's a good chance these were actually submitted to Microsoft for a bug bounty and only released after Microsoft had fixed them.
Oh, yeah.
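(Editor's note: a minimal sketch of that run-it-twice idea; this is our own illustration, not code from the article, and parse_image is a hypothetical function under test.)

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical parser under test: returns serialized output for an input.
std::string parse_image(const std::string& input);

// A deterministic parser must produce identical output for identical
// input. Divergence across runs suggests uninitialized memory (or other
// hidden state) leaking into the result. In practice the runs happen in
// fresh processes with a perturbed heap, so stale bytes actually differ.
bool output_is_stable(const std::string& input, int runs = 3) {
    const std::size_t first = std::hash<std::string>{}(parse_image(input));
    for (int i = 1; i < runs; ++i) {
        if (std::hash<std::string>{}(parse_image(input)) != first) {
            std::fprintf(stderr, "unstable output: possible uninitialized read\n");
            return false;
        }
    }
    return true;
}
```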
When I read this article, what came to me is that, rightly so, the author said that Valgrind, or something similar to Valgrind, is not available on Windows. I think that's maybe the crux of the matter for this kind of software, right? I remember very well when Julian Seward released Valgrind in March 2000, and I discovered it like two months later at the time. And at the software company where I was working, a colleague just took this very, very early version of Valgrind, and we found bugs in production code straight away. And to me, it was like, wow, this thing is going to change everything, at least on Linux. And I think it did, to a very large extent, right? So why Microsoft never invested in having something similar on Windows, I am not so sure, but I think it's probably just the way it is.
Well, there is a side comment in here that there are tools like Dr. Memory on Windows, and every time I've seen Dr. Memory come up, that's basically how it's treated: there is this Dr. Memory on Windows, but I've never seen anyone actually give a report of how they used it and found issues with it that they were able to fix. And I've tried to run Dr. Memory myself on Windows and have never gotten satisfying results from it, but it's supposed to do the same things that Valgrind can do.
And the other thing I had was about the reference to lcamtuf, which is Michal Zalewski. This guy is quite amazing. He's been producing American Fuzzy Lop and various similar fuzzing technologies for the last 20 years. And it looks like he must be responsible for so many security vulnerabilities being found thanks to these ideas. That's truly bewildering, right?
Yeah.
Going back to the sanitizers for one moment, I did see that Microsoft announced that in the latest version of Visual Studio, I think they're calling the address sanitizer ready for production use, which is great to see. Hopefully they keep investing in that kind of tooling, because I know there's memory sanitizer too. It'll be nice to see that get into Visual Studio.
Yeah, I think memory sanitizer can catch it. But my understanding, or my recollection from when I tried it years ago, is that you have to recompile every single library with memory sanitizer.
Oh, that's right.
Including the third-party libraries. So in practice, usually people will use address sanitizer and that's it, right?
Right, right. That's a fair amount of work.
But Valgrind, being a JIT, doesn't have that limitation. Valgrind has got a penalty, though, which is easily 100 times slower, which usually you cannot afford, right?
Right. Depending on what we do, usually we can't, right?
Yeah, I was recently using Valgrind to try to find an issue, and it was 20, 30, whatever times slower. Even on a relatively small test that took like 10 seconds before I ran it through Valgrind, I'm sitting there going, all right, it's about time for this to be done now.
I've got a few old stories where we had to let it run for like 50, 60, 70 hours before finding the bug.
That's painful, right?
That's very painful.
Could be kind of fun to get one of our security researcher guests back on to discuss an issue like this.
Yeah.
All right.
And the last thing we have is a post on Arthur O'Dwyer's blog, and this is Don't blindly prefer emplace_back to push_back. This is pointing out how some static analyzers will suggest using emplace_back over push_back. But if you're not using emplace_back the right way, then it's really doing the same thing as push_back as far as performance is concerned, except the code will actually take longer to compile, which I don't think I was aware of. Like three levels of template instantiations to use emplace_back if you're not using it properly.
Well, even if you are using it properly.
Right. So it's worth it if you're using it properly. But if you're just substituting push_back with emplace_back, then you're not using it the right way, and it's definitely not worthwhile.
Yeah, it was an interesting blog post. I use that exact same clang-tidy check, and if you use the fixit, I'm pretty sure it does the right thing.
Yeah, if you use the fixit. If you misread what the fixit said to do, then you end up doing the wrong thing.
Right. But one advantage of emplace_back in C++17 is that it gives you a reference to the last element, which actually can simplify your code a little bit, and sometimes the readability is worth it for that reason alone.
Yeah, I think that's a really important point
that should have been mentioned in this article as well.
Yeah, that's true.
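(Editor's note: a quick sketch of the distinction being discussed; this example is ours, not from the blog post.)

```cpp
#include <string>
#include <vector>

struct Widget {
    Widget(int id, std::string name) : id(id), name(std::move(name)) {}
    int id;
    std::string name;
};

int main() {
    std::vector<Widget> v;

    // No benefit: a Widget temporary is constructed either way, and
    // emplace_back only adds template-instantiation work at compile time.
    v.push_back(Widget(1, "a"));
    v.emplace_back(Widget(2, "b"));

    // The point of emplace_back: forward constructor arguments so the
    // Widget is built in place, with no temporary at all.
    v.emplace_back(3, "c");

    // Since C++17, emplace_back also returns a reference to the new element.
    Widget& w = v.emplace_back(4, "d");
    w.name += "!";
}
```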
And the last thing that comes to my mind reading that article is about reducing compile times. I've been on this odyssey for years, trying to reduce the waiting, and the grumbling of my colleagues while waiting for the build to complete. And the one nugget of information, with GCC at least, is that usually you might compile with -g to get debugging information, for example to have a stack trace when you get a crash. If you're not interested in using a debugger, you can move to -g1, and usually that cuts a third off your compilation time. On top of that, if you run your build in the cloud, if you virtualize your build, it might decrease the money you pay every month by quite a bit, right?
So -g is equivalent to -ggdb? It's like full debugging information, right? And you're saying with -g1 you're cutting it back?
So with -g1, basically, I've got far less debugging information. I still have enough to get a stack trace where you crash, which, if you run things in batch, is good enough, right? The size of the object files is much smaller. Now, if you need to debug interactively, that's not sufficient, but you can always do a special build at that moment, once you have a stack trace. Valgrind works perfectly well with -g1, for example.
That is really interesting. So there you go.
That was my trick of the day.
I will accept that trick, and I wish I had brought my notebook down with me to write that down. Oh, I have a notebook computer in front of me; I guess that's related. You know, since we just mentioned decreasing build time in the context of emplace_back: I accept Arthur's argument that it's slower to compile.
I would be shocked if it's actually a problem.
Maybe it is on some larger builds.
But I was just recently working on some code, and I think our listeners will find this an interesting anecdote, so I guess I'll share real quick.
About 30% of the total build time comes down to the fact that we're using a custom container where standard vector would have sufficed instead.
Is it a project you're working on? Wow.
Yes.
Were you able to make the substitution and get that increase in build performance?
I made a tiny wrapper class that uses vector in the background, because there's a little bit of functionality that they added on. And it probably would be relatively easy to swap out my tiny wrapper class with vector, with just a little bit of thought, for most of the use cases.
30% of the entire project's build time.
That's crazy.
It is crazy.
So you do have tools. Clang especially has got a command-line option to give you some feedback, and I think there's a project on GitHub, whose name I can't remember, that lets you do your build and then aggregate statistics to know where the time was spent inside your compiler, at least.
Yeah, that is actually what I used.
At least on the software where I work, what you find is that you spend a lot of time compiling unique_ptr. And I was thinking, well, that's actually rather good news, because at least we are using it.
Right.
We did find an outlier: using Boost Asio was pulling in an enormous number of headers, and you can just put a macro, say, Asio-only or something like that, and it cuts something like 50 seconds of build time.
Wow.
So that sounds very good, but it might be only 0.1% of your whole build anyway. So there you go.
So -ftime-trace is the Clang flag, and I think ClangBuildAnalyzer is the tool that you're referring to.
Oh, it's pretty good, right?
Yes, I used those tools to help me isolate those. Just for the record, I didn't just magically stare at the source code and say, clearly, this is the slow part.
Sponsor of this episode of CppCast is the PVS Studio team.
The team develops the PVS Studio Static Code Analyzer, which detects errors in C, C++, C Sharp, and Java code.
When you use the analyzer regularly,
you can spot and fix many errors right after you write new code. This means your team is more
productive during code reviews and has more time to discuss algorithms and high-level errors.
Let the analyzer that never gets tired do the tedious work of sifting through the boring parts
of code looking for typos. For example, let it check comparison functions. Why comparison
functions?
Click the link in the podcast description to find out. Remember that you can extend the PVS Studio
trial period from one week to one month. Just use the CppCast hashtag when requesting your license.
So you did a talk at C++ on Sea last year, I think, Arnaud, which was on the topic of improving performance by reducing memory allocations.
Is that right?
Yeah, I did indeed, right.
Yeah, I can talk a bit about where it comes from anyways.
Sure.
So as it turns out, I work for a company called Schlumberger, so I can say a few words about the company. It's an engineering company, a very large one, and we provide services to oil companies.
And one tiny portion, which is big in money but tiny for them, is producing simulation tools.
So I work specifically on the reservoir simulator, which is the kind of software used to simulate oil and gas fields. So we simulate flow in porous media.
So that's as much as I want to say about the domain.
It's not the topics of this podcast.
Oh, that's fine. People will enjoy it regardless.
So go into whatever detail you want to.
It's actually quite an important piece of software, right?
I mean, a large amount of the energy you're consuming will actually have been simulated by this kind of software, or by this product itself. So it's definitely production software.
So as a software engineer, what you end up with is several million lines of C++ code for the simulation part. A simulation might last a few minutes on a laptop, or it can keep a cluster of a thousand processors busy for several days, depending on the size of the problem.
So definitely reducing the time is a good thing to make the software faster.
So that's one thing.
And the thing we don't want is to change the results, right?
So that's a big no-no.
We can fix bugs, but we don't want to change the formulation.
And one reason for that is that if you calculate how much oil you can produce,
and you have a new version of the software,
and you realize that now the numbers are different,
usually people are not happy whatsoever, right?
So we have to be very careful.
So in other words, if I fix something,
the last thing I want is to break it.
I just want to keep the result identical.
So to come back to the reducing memory allocation,
I mean, a large part of it comes from just looking at the source code. And as a C++ developer, you have been brainwashed by the literature, like Scott Meyers and so on, saying memory allocation is really, really bad; you should reduce it at any cost.
The fact is that, until semi-recently, I had never found a tool that tells me where these allocations take place.
So you see a lot of micro-benchmarks.
But when you take something which is very big,
finding this allocation and the one that
actually matters is actually quite difficult.
So what you can do is some profiling with whatever technology you want, VTune or perf. Usually what you find for most C++ codebases is that about 5-10%, let's say 5%, of the time is spent inside the heap allocator, so in malloc and free.
Intuitively, you think, well, actually, if I remove half of my allocations, I might gain something like 2.5%, which is quite frankly not worth it. It's not a lot. So that was it, right?
And then in 2016, a customer contacted us saying he was very unhappy because his simulation was 20% slower with the new version of the simulator than the previous version, which is not very desirable. So we actually backtracked inside the revision control system, and we traced it back to an upgrade of CMake, which is rather counterintuitive. How could CMake reduce the performance of an HPC, a high-performance computing, application by 20%? That makes no sense whatsoever. And what was even more curious is that taking the head of the tree, the bug had actually disappeared, but we didn't know why. So clearly something happened.
So a colleague called John Herring did some detective work and managed to crack it, so kudos to him. He tried a lot of things, and he finally attached to the process with VTune and managed to find the root cause. And it goes this way: you could see that somebody by mistake had introduced a copy of a small vector, a vector that contained something like five or six elements, ints actually. So it was a std::vector of int, and he introduced a copy in a critical loop. So basically we started copying these five or six elements, copying the vector again and again and again, millions or billions of times. But that by itself shouldn't be a problem. And surely that has nothing to do with CMake. What happened is that the new version of CMake basically changed the order of the linking. And somewhere, when you do the link, the specific instantiation of std::vector of int has to be picked up from somewhere, and only one is kept at link time. And this one was picked up from a third-party library that we linked statically into our application, and sadly, it was actually compiled without optimization on. So the copy constructor of std::vector was actually used completely unoptimized, right? So we amplified those copies by a gigantic factor, and we ended up adding 20% to a simulation that might take something like 10 hours.
That might take something like 10 hours.
So what happened in the head of the tree
is that we did some profiling,
and we actually found that problem,
and I actually added from it,
but I didn't even remember it,
so I fixed it and moved on.
So anyway, that's a nice little story.
What it tells you is that, well, first of all,
you should be quite careful
when you compile your third-party library, and I think we learn our lessons now that that is now fully automated with
conan so we don't have that issue anymore but the other thing is that um okay so i got 20
percent but maybe once you fix the optimization problem it will cost you let's say 0.25 but what
about if i've got many more of these copies, which I'm not aware of?
How much do they add up? And I didn't have an answer
to that. I even considered recompiling the whole simulator,
changing libstc++, and make the copy constructor
very slow indeed, to find the origin of all these things.
But it was not very practical, and I don't have the time
for that. So that was it, right?
The follow-up to the story is that I went to the ACCU conference in Bristol in 2017, and I attended one of the many talks by John Lakos. John Lakos at Bloomberg is Mr. Allocator; he's been selling allocators for years. I remember attending his talk, which was truly fascinating, and I was thinking: I don't even know where my allocations are, so let's try to replace them with a specific allocator. And that was it. So I left at the end of the session, and I talked with a friend called Matthias Schulz, who was attending the conference. And I said, well, I wish I had a tool. And he said, well, why don't you use heaptrack? I said, heaptrack? I've never heard of that. So he opened his laptop and started showing me heaptrack straight away, and it had this amazing ability of capturing at a very fast rate. Pretty much like Valgrind, you put heaptrack in front of your command-line application, and then it basically logs every single one of your allocations, every malloc and free. And then you can go to a GUI later on to display all these allocations, with flame charts and very nice GUI capabilities. I was like, well, this is it, right? One of these Valgrind moments. I mean, suddenly I've got a tool I can work with, right? And so the talk I gave three years later was really, in a way, to give back to the community what I learned thanks to this amazing open source tool.
Right. So we've talked about lots of different tools on the show, but I don't think we've ever talked about heaptrack, right, Jason?
I don't know. I think the first time I saw heaptrack was in one of the poster sessions at CppCon, like in 2017 or 2018.
Maybe. That might be right.
2017 sounds like the right ballpark for what you're saying, too. But I've never used it before myself. I feel like it's possible we may have mentioned it because of a lightning talk or something, but I don't know.
I don't know. Yeah.
Certainly never gone in depth on it.
No, we haven't.
So heaptrack: the author is Milian Wolff, a German software engineer who works for KDAB. He's done, in my opinion, an amazing job. When you watch his videos, he says, oh, I'm standing on the shoulders of giants, and I was thinking, well, that's pretty good going; you've done quite an amazing job. So what this tool does is basically capture the stack traces and then use something called libbacktrace, which is part of GCC, to symbolize them. It does it in a queue, so the slowdown is actually very acceptable. As long as your application doesn't call malloc and free at an extremely fast rate, the overhead might end up being very, very small indeed. And then you can analyze it, find out where the allocations are, and start to try to fix them. So if you watch the talk I gave at C++ on Sea, which was online, it was really the tale of, first of all, how can you install it? And it turns out that, again, Milian Wolff made it very straightforward indeed, because you can use a technology called AppImage, where basically you download a blob, make it executable, and off you fly, right? So it's very, very straightforward to try.
But then I tried it on the simulator, and what I found was, well, not very pretty. It was totally unexpected to me. I mean, we had a runaway allocation inside the inner loop, and another one. One of them was basically copying a vector and never ever using it, right? That was probably a leftover from a debugging experiment. But if you think about it, that's the kind of tragedy of C++: if you declare a std::vector and put 20 elements in it, it's logically an unused variable, but not as far as the compiler is concerned, because it's got side effects. So you get no warning whatsoever. So that was easy to fix. I just removed a piece of code and had a 2% speedup. And I said, wow, okay.
Instead, you could have put a comment: to-do, remove this line when we need more performance.
and chasing up one after all the problems I was seeing,
at least on the common code pipe and i ended reduce it by a factor of two order of magnitude right so when
we had something like allocating 500 millions now we're under 101 millions of allocations and um
and if we go back to the story that's um that would consume only 5% of our time. I'm surprised that, at least in serial,
the speedup was more like 20%.
So that doesn't quite make sense.
And the reason for that is when you look at the profiling,
the time spent in malloc and free
is indeed something you pay for.
But what it does is destroying all your caches.
So actually, it's really hard to estimate
what this allocation costs you
because it could be that it's in a relatively cold loop
or something when your cache behavior wasn't great anyway.
So that costs you the cost of malloc and free.
But if you're in a potentially hot loop,
it costs you far more than that.
It costs you the fact that basically
you don't use memory efficiently anymore.
Your cache hierarchy
is being destroyed
by going out of line and so on and so forth.
And
yeah, so potentially
you've got
much more to gain by actually
fixing this issue.
So go for it.
When you launch, when you run heaptrack, you said you basically run it like Valgrind: it's heaptrack and then your executable?
That's right.
And you make it sound like it's just that straightforward. You pull up the interface or whatever after it's done and go, oh, look at that, line 12. I should probably take a look at line 12 because it's doing way more.
Absolutely. So you say heaptrack, the command-line utility, and it will leave your application working as normal, right? And when you stop the application, it basically writes a compressed file containing all the information. And then you have a good GUI that comes with heaptrack, which you can install very easily because it's pre-compiled on Linux (oh, I forgot to say, all this is Linux only), and you can basically read this file and analyze it.
And the most useful part is to use flame charts. I don't know if you're clear on what a flame chart is.
I don't think we've ever talked about it on the show, but I'm familiar with it from Chrome's tracing.
Absolutely. So you've got this enormous quantity of data to wade through, and the flame chart makes it quite easy, because you're going to display, if you're interested in the number of allocations, 100% of them on the x-axis, and on the y-axis you put the stack traces. And then what you care about is the big rectangles. You want to reduce the allocations that are very frequent, or the allocations which are repeated inside a loop; those are the first ones you want to target.
Right.
And so you go try to do something about it, and if you succeed, you just repeat the experiment and see if things have improved, right? And rinse and repeat thousands of times until you get it sorted. So yeah, to me, heaptrack is an amazing technology in that respect. I'm quite surprised that people don't use it more often, and one reason I did that talk was to make it more popular.
I've got a bunch of questions right now. The first one I would like to ask: you started the story by saying that you were in one of John Lakos's talks, and you're like, oh, allocators, clearly that's our answer. Then you came over to heaptrack. I'm curious if you ever came back to allocators and said, well, okay, there are still these things left that I wasn't able to remove, so I'm going to use allocators here. Or did heaptrack just get you all of the help that you needed?
Well, so again, referring to my own talk, what I tried to do is to categorize,
first of all, what I found.
I'm not saying that the application I work on is necessarily that typical, but it has been written by a lot of people over the years. You've got 20 years of development by a team of 20 to 40 people, constantly. We've had a lot of coders. So you would expect it to be probably quite typical. And are allocators useful, or what in C++17 are the PMR allocators? And the answer is yes, right?
But I would say they are actually at the back of the list, right?
Okay.
There are actually much simpler techniques to deploy first before you get to allocators. And I can elaborate on that if you're interested.
Yeah, go ahead.
Yeah, sure.
So where allocators are useful is specifically for associative containers. The usual case is that you've got an unordered_map, or a map if you're unlucky. And the typical thing is that you're going to populate it, but you never remove any element. Once it's populated, it becomes kind of immutable, and then you keep doing a lot of lookups. Because it's a node-based container, if you put two million elements in a map, and that's very frequent in what I work on, you end up with two million allocations. So if your problem is of size 10 million, you might do something like 50 times that 10 million. And that's for a map which is populated only once. The answer, basically, is to use a monotonic buffer resource. You put a monotonic buffer resource behind that map, fill everything in, and it will probably reduce to something like 30 allocations, due to the amortized growth. It works beautifully, as long as you don't copy the map around. You have to be very careful about that. Presumably it's inside an object that should be made non-copyable. That works really well.
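(Editor's note: a minimal sketch of that populate-once pattern, assuming C++17's <memory_resource>; the names and sizes are illustrative, not from his codebase.)

```cpp
#include <map>
#include <memory_resource>

struct LookupTable {
    // One upstream arena; the map's nodes are carved out of it, so a
    // populate-once, read-many map does a handful of large allocations
    // instead of one allocation per node.
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::map<int, double> table{&arena};

    // The arena can't be copied, which conveniently makes the owning
    // object non-copyable; copying the map around would defeat the scheme.
    LookupTable() = default;
    LookupTable(const LookupTable&) = delete;
};
```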
Another use of allocators, which I found quite useful, is the typical situation where, at least in a lot of numerical applications, you have a lot of vectors whose size you know. If you've got physics, you might have something like a number of components, and you've got 20 vectors with different names, but you don't know at compile time how many components you will have. It will be something quite small, right? So you end up with vector one of size n, vector two of size n, vector three of size n, and you allocate those inside the kernel. And they all add up to something like two kilobytes of memory or so. So again, because you know the size at runtime and it doesn't change, you can use a monotonic buffer resource: you put two kilobytes of memory on the stack, which is passed to your buffer resource, and then the only thing you have to do is transform your vectors into PMR vectors, pass in that resource, and the job is done.
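(Editor's note: a sketch of the stack-backed variant he describes; the field names are made up, and the 2 KB figure follows the conversation.)

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

void kernel(std::size_t n) {
    // Scratch space on the stack, handed to a monotonic resource.
    // If the vectors outgrow it, the resource falls back to the heap,
    // so this is an optimization, not a hard limit.
    std::byte buffer[2048];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    // The sizes are known up front and never change, so all of these
    // vectors are carved out of the one stack buffer: no malloc calls.
    std::pmr::vector<double> pressure(n, &arena);
    std::pmr::vector<double> saturation(n, &arena);
    std::pmr::vector<double> density(n, &arena);

    // ... fill and use the vectors ...
}
```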
Right. And you saw a significant impact making that kind of change?
Well, again, it depends what you call significant, but yes, in production I will gain one percent here and there in speed.
One percent here and there adds up. Yeah.
Well, if you take an extremely optimized major industrial system and you speed it up by 10%, in any case, that's pretty good going.
Yeah, that's definitely good, yeah.
Well, the same thing is that, first of all, this kind of software I'm talking about runs on clusters and keeps them busy and warm for a long time. So if you can decrease your electricity bill by 10%, that's quite a good thing. And I think it's a good thing for the environment in any circumstance, right? It's just better use of resources.
Right.
But having said that, I think all these PMR facilities are probably the last thing you should do. There are actually usually much easier techniques to deploy first.
Okay. Just eliminating the dynamic allocations in the first place.
Well, the poster child for that, unsurprisingly in C++, is that you will find you spend a lot of time copying strings around. So the first thing to do is probably to use string_view when you can. A typical one: you might do some parsing and use substr, which creates a new string, only for the benefit of parsing. If you actually make it a string_view, you end up with something very lightweight, because it does zero allocations. So usually that's a solution to a lot of problems.
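(Editor's note: a small sketch of the substr point; the example is ours.)

```cpp
#include <string>
#include <string_view>

// Each substr() here allocates and copies a brand-new std::string,
// even though the result is only inspected and thrown away.
std::string key_of_copy(const std::string& line) {
    return line.substr(0, line.find('='));
}

// The string_view version points into the original buffer:
// same parsing, zero allocations.
std::string_view key_of_view(std::string_view line) {
    return line.substr(0, line.find('='));
}
```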
And the other one, which is very interesting, is C++20 span.
Right.
And the reason is that in a lot of numerical simulations, you end up wanting to pass a lot of vectors. So usually the way people code in C++, quite naively, is to pass a reference to a vector. And there's not really a problem with that. It's just that, first of all, if you pass a reference to a vector, you never know if the function you're calling is going to resize your vector behind your back. Whereas if you pass a span of double, you know that it can't possibly resize that span, right? You can only populate it. So first of all, by doing that, you make your function stricter in terms of contract. You clarify that you're not going to resize that vector.
Interesting point, yeah.
So that's one thing. But the other thing is: imagine now that your physical problem is to pass a vector of size 2, which might be, for example, oil and gas, the number of phases.
Right.
If you make it a vector, you're obliged to allocate a vector of size 2, which is a bit silly, really, because that can go on the stack. So if you make it a std::array of double, 2, and the function you're passing it to is coded in terms of span, then you don't force the allocation scheme on the caller. The caller of the function can actually use any kind of contiguous container, whether it be a vector, an array, or a small-vector implementation, which I've been using extensively. In a lot of situations your vector might be of a relatively small size, and a small vector is the way to go, right? Or you might want to use a PMR vector.
Right.
And, well, to me, span was really one of the very useful additions of C++20.
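(Editor's note: a sketch of that contract, assuming C++20's <span>; the function names are ours.)

```cpp
#include <array>
#include <span>
#include <vector>

// Taking std::span<double> promises the callee will not resize the
// sequence, and leaves the allocation scheme to the caller.
void compute_phase_fractions(std::span<double> phases);

void caller() {
    // Two phases (say, oil and gas): no need to heap-allocate a vector;
    // a stack array works just as well through the same interface.
    std::array<double, 2> two_phases{};
    compute_phase_fractions(two_phases);

    // A vector (or a small-vector type) works unchanged too.
    std::vector<double> many_phases(8);
    compute_phase_fractions(many_phases);
}
```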
We need to talk about how you are already using C++20. But if you don't mind, I have one other question about heaptrack. You commented that std::string, having lots of copies of strings, is a pretty typical problem in C++, which I fully agree with. I'm not going to argue with that at all. But if I have a bunch of small copies of strings spread across 200 functions, does heaptrack help me, at least point out, hey, you're copying strings in far too many places, and it's going to be up to you to dig into where? Is there any way to bubble that kind of information up?
Yes, absolutely. You use the flame chart, and usually what I do is use the reversed flame chart, putting the leaf at the bottom. So you can now search for the copy constructor for string.
Okay.
But the beauty of the flame chart is that maybe you don't even have this problem; you can see how many allocations take place, right? So if it actually occupies only 0.1% of all your allocations, your problem is elsewhere, and you shouldn't waste time on that, right?
Right, very good.
So actually, it's very efficient.
Having said that, the other thing is,
imagine you could, well,
if you take the example of ChaiScript, for example,
your problem,
depending on what you parse,
you will see that the allocation will take place in different places, right?
Yeah.
So what I'm trying to say here is that you optimize a particular code path depending on the workflow. In my case, in terms of simulation, if I take a different field or a different type of physics, it will go to different places, and of course your distribution of allocations will change.
Yeah, these kinds of scripting and simulation environments are difficult to profile, because you have to try all the possible different code paths that someone might take and choose what you're optimizing.
Well, the danger there is to say, because I can't do everything, I do nothing. What you can do is take a few typical cases and have a good goal, and usually that can make a big difference.
So do you have any techniques or automated process of any kind to run heaptrack and say, okay, nothing new and scary has popped up? Is there any way to do this in your CI or something like that, just to give you a sanity check?
Can you do it? Yes, absolutely, because it's a batch application, so heaptrack will basically give you how many allocations take place, and then you can just put a threshold on it.
Oh, just look at total allocations or something.
Yeah, at the end.
Oh, okay.
So it would be very easy. We have not implemented that, but we could.
That's a really interesting idea.
It's quite easy to automate, right? Now the question is, how many things do you track in your CI? That's the other question, right?
So typically, on what I work on, if I put in any modification,
there's three hours of test before actually the pull request
can be accepted.
So we run a lot
of tests, a lot of unit tests, and we run
a lot of them on the address sanitizer
and this and that.
At some point, you have to decide where the
limit is, right?
Right.
And that is only the first level of testing.
But yes, you could absolutely automate to some extent
the fact that you don't actually re-inject
an enormous number of allocations.
So you made some references to using C++20 already. How are you on the bleeding edge of the standard?
So maybe I slightly overstated this. I do not use C++20; what I do use is span, specifically. I needed a span, so I looked around, and many people have coded a span. Watching the standardization process, I've got this good feeling that span is going to be successful, and there's a fair number of implementations. There's one in the GSL, which I used to start with, and these days we use the implementation by Tristan Brindle, which is really excellent. It's just a single file, so you put it inside your project, and you call it, I don't know, sss::span or something like that, in a different namespace, hopefully three letters. And then you can just change it to std next time around, which is exactly what I did for string_view in 2016, when I imported the string_view implementation written by Marshall Clow. And only something like four weeks ago, I changed everything to std::string_view once more. But it gave us four years ahead of the curve, being able to use string_view way before everybody else.
So you are able to keep using new versions of the standard, though. You are on C++17 then?
Yes, we are. For a while, C++ was kind of vegetating a bit, and people were slow to adopt C++11. We finally got the go-ahead in 2015 on the project I work on, and I went slightly crazy. We basically changed the code base to use as many features as I could. Well, no, not as many; as many useful features as I could at the time. And the first thing I did was very aggressively remove all the naked new and delete. When I counted, I had about 3,000 naked deletes, and I reduced that to 100. So that was pretty good going. And that was done manually.
Just to make_unique, you're saying?
Yeah, just to unique_ptr or whatever was necessary.
But otherwise, clang-tidy was actually the way to go, or is the way to go these days. And it's incredibly reliable. Well, it's two things. First of all, a lot of the transforms are reliable; I ran clang-tidy transforms on several million lines of code. And the other thing is that when I did find bugs in clang-tidy, usually they fixed the bug within a few days. So it's actually well worth filing bugs, with both Clang and GCC actually. I've had many bugs I filed against GCC get fixed in a matter of hours. They're always quite amazing, these guys.
So what's the process like? You said the first thing you did was remove a bunch of manual news and deletes, but are you then able to keep upgrading? You said you're on C++17 now, I believe. Did I hear that correctly?
Yeah. So the process would be: well, C++11 and 14 are well consumed. C++17's string_view, for example, was very useful. Oh, another nugget of information for you, though you can refer to my talk: if you use a PMR map and you insert elements into it, it's actually a very, very bad idea. The reason is that when you insert into the map, it creates a temporary node, which might be inserted but might actually be discarded. And if you use a monotonic buffer resource, you end up with slow memory growth. So the trick there is to use try_emplace, which is C++17, and which basically guarantees you not to have temporary nodes, again avoiding slow memory growth. So that's another example of C++17.
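(Editor's note: a sketch of the pitfall and the fix; the example is ours, and uses emplace for the bad case, since emplace must build a node before it can compare keys, which is exactly what try_emplace was designed to avoid.)

```cpp
#include <map>
#include <memory_resource>
#include <string>

int main() {
    std::pmr::monotonic_buffer_resource arena; // deallocation is a no-op
    std::pmr::map<int, std::string> m{&arena};

    for (int i = 0; i < 1000; ++i) {
        // emplace() typically allocates and constructs a node up front to
        // obtain the key; on a duplicate the node is discarded, and the
        // monotonic arena never reuses it, so memory quietly grows.
        m.emplace(42, "value");

        // try_emplace() checks the key first and only builds a node when
        // the key is absent, so duplicate inserts allocate nothing.
        m.try_emplace(42, "value");
    }
}
```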
And I think a large part of all these transforms was more political than technical, in a sense. I mean, to me, the reason we removed manual memory allocation was first of all to remove bugs, but it's also to make the project more attractive to new people. And in a way, we've been very successful in that, because we've had people join the project since. And I can say, yes, we use a lot of memory, but we know we don't have memory leaks, because we only use smart pointers. It's like, yes, we won the battle. So I think there's a large aspect that, if you're a young graduate, the last thing you want is to work on an old code base which is in a terrible state, right? So if you've got something which is more palatable, or kept to a higher standard, it makes the project far more interesting.
That is really a fascinating point. I haven't interviewed for a regular job in quite some time, but yeah, if they said, we're working on this great code base, it's so much fun, and then you find out later that it's nothing but a bunch of manual memory management, it'll be a shock, right? Using it as a way to recruit people, to say, no, we're actually taking the best advantage we can of modern techniques, that sounds...
Yeah.
So part of the origin of the talk I gave was that I had to convince my colleagues of the usefulness of these techniques. I did a lot of internal talks, so I had to write a lot of slides and convince them, bouncing up and down, showing how great all these things were. And because I had all these slides, I basically reshuffled them and gave them at a conference.
I basically reshuffled them and gave it to a conference.
That's a great way to do it.
Exactly.
So maybe some homework for you, Jason. You can do an episode of your YouTube show, C++ Weekly, on heaptrack.
Yeah, I might do that. Actually, you're making me feel bad, honestly, because like I said, I know I was made aware of this a long time ago at CppCon, and I've never tried it. It definitely sounds like a really useful tool. And considering I pride myself on being, like, the use-all-the-tools guy, and this is an easy-to-use one and I've never used it, I just feel bad now.
Well, I'm surprised that people don't use it more regularly or aren't aware of it, but maybe the reality is that there are so many new things happening and so many things to know, and there are only so many things you can learn.
Yeah, and there's a certain level of awareness, right? At this point, quote, everyone knows about the sanitizers. So saying, oh, I'm going to deploy the sanitizers in this project, most people aren't going to argue with that, or Valgrind, or whatever. People have used these things. But you have to, like, take this to your coworkers and be like, oh, by the way, I ran heaptrack on Monday morning because it seemed like it would be fun, and I got a 3% performance improvement in 10 minutes. You know, then people are going to notice, right?
So I might have to...
Yeah, no, it's...
Or maybe you can just...
I can only recommend it, basically, right?
Yeah, I have an idea.
We'll see how it plays out.
Thank you very much for sharing this with us.
So there you go.
Okay, well, Arnaud, is there anything else you want to plug, since we're starting to run out of time? Anything else you want to go over?
No, not really. I want to thank you for inviting me. You said in my bio that I cycle to work. A few years ago, I discovered that cycling on ice was a rather bad idea, so I had to replace my cycling to work with a long walk to the station, taking the train, taking the bus, because I couldn't cycle due to a small accident I had. And that basically gave me a lot of time to listen to your whole backlog of episodes. So I have to thank you for making that period of my life much more enjoyable. I went through something like 50 or 60 episodes.
Wow. Well, I'm glad that was helpful.
Yeah, definitely. So there you go. And I know a lot of colleagues who listen carefully or regularly to your podcast, and it makes it very interesting.
Thank you. Okay, well, Arnaud, it was great having you on the show today.
Thank you very much for having me.
Thanks for coming on.
All right. So we are going to hit end broadcast. And thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com.
And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.