CppCast - Reducing Memory Allocations
Episode Date: March 12, 2021

Rob and Jason are joined by Arnaud Desitter. They first discuss blog posts on parameter passing, fuzzing, and push_back vs emplace_back. Then they talk to Arnaud Desitter about his successes improving application performance by reducing memory allocations found using heaptrack.

News
- Hacking on Clang is surprisingly easy
- Parameter Passing in C and C++
- Fuzzing Image Parsing in Windows, Part Two: Uninitialized Memory
- Don't blindly prefer emplace_back to push_back

Links
- PVS-Studio Episode Transcripts
- Reducing Memory Allocations in a Large C++ Application - Arnaud Desitter [C++ on Sea 2020]
- Reducing Memory Allocations in a Large C++ Application - Slides - Arnaud Desitter [C++ on Sea 2020]
- heaptrack

Sponsors
- PVS-Studio. Write #cppcast in the message field on the download page and get one month license
- The Evil within the Comparison Functions
- Top 10 Bugs Found in C++ Projects in 2020
Transcript
Thank you. In this episode, we discuss parameter passing and fuzzing.
Then we talk to Arnaud Desitter.
Arnaud talks to us about reducing memory allocations, tooling, and more. Welcome to episode 290 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
Doing okay. Do you want to talk about what happened at your house this week?
Yeah, sure. Why not? I had a little bit of excitement when I went to bathe my dog this weekend, and at that moment learned that in some of this very cold weather we had had, I ended up with a frozen pipe up here in the corner. It's a thing they do in cold regions, at least in the U.S.: we've got these frost-proof sillcocks, as they call them. When you turn off the water at the outside spigot, it's actually part of a long pipe, and it shuts the water off back deep inside the house. So if any water got trapped in that section of pipe that's more exposed to the cold air, it might freeze, it might burst, it might do whatever. And you're not going to know it until you turn on the spigot, and then it's going to let water flow through the now-busted piece of pipe.
So that happened. And both my wife and myself were like, oh, it sounds a little weird, but Jason's just running water, you know, washing the dog. And I'm out there like, it sounds weird, and I only have half the water pressure I expect to have, but I'm sure it's fine. And then after a minute I'm like, wait a minute. Oh no.
So this is outside using the hose, and as soon as you went back in the house you realized what had happened?
As I sprinted down to the basement as fast as I safely could, which is where I do my recording here. It's a basement (I have a nice bright window, but it is a basement), and I discovered enough water to make us rush around very quickly and then try to find a professional who could make sure that was repaired properly and all that stuff. It was exciting.
Yeah, sounds exciting.
I'm glad it all got resolved.
I don't recommend it, just for the record.
No.
I also learned there's a phrase that basically doesn't exist in the English language: "a minor basement flood." Those words can't go together.
But it is pretty accurate for what we had. Any kind of water in the house is just not good, though. But I thought it was fairly minor, and no actual damage to anything.
So, very good.
Before we move on, one quick programming note I wanted to mention. PVS-Studio has been working with us for a while as a sponsor, and we really appreciate it. One thing they asked us about a couple weeks ago, and then started putting together, is generating episode transcripts. They've put these out for a couple of episodes now, and I've gone back and updated the show notes for those episodes so that they have a link to the transcript, which is on PVS-Studio's website. They're making them in both English and Russian. So if you are a listener to the podcast and you think you might want to actually read through the transcript, or if you maybe know someone who might be interested but is deaf, or who would just prefer to read it versus listening to us, there are transcripts available for a couple of episodes, and I think they'll keep making more of those, which we really appreciate.
And it can be handy just for searching and referencing as well.
Oh yeah, if you want to be like, hey, I thought I heard someone say this thing at this one time. Which right now is kind of possible if you go and look at the automatic transcriptions on the YouTube postings of these videos.
Right. The automatic YouTube stuff is certainly not perfectly accurate. You know, if I say "CPP," they probably write out "the ocean, see" something. They don't know what I'm talking about.
You know, before we move on, Rob, we may as well talk about the fact that the view behind you has changed considerably as well.
Yeah. I don't recall if I mentioned it, but we are preparing to move.
So right now we're getting ready to stage our house to have people come in and take pictures
and then do open houses for people to buy the house.
So it's been a little chaotic around here, too. But looking forward to the move.
Yeah.
Good luck with that.
Yeah.
Okay.
Well, at the top of every episode, I like to read a piece of feedback.
We got this tweet from Martin Durham.
And he wrote, Rob and Jason, hey, listening to your old new thing episode and heard you talking about lambda syntax.
I 100% agree.
We need an arrow syntax.
I went so far as to implement it in Clang and wrote up a thing about it.
Love your show.
So he's got a link here, which I'll put in the show notes.
And the post is titled Hacking on Clang is surprisingly easy.
And I think he actually wrote a second tweet saying how, yeah,
putting this type of thing in the compiler is actually really easy.
But going through the process of standardizing it and trying to get it into a future version of C++ is a different story.
I know there's been other proposals along this line.
So there's certainly got to be someone Martin could collaborate with.
Yeah.
And this is definitely syntax that I've seen with lambdas in other programming languages, so I think it would make sense for us to get it adopted into C++ as well, the simpler syntax.
Yeah. Okay, well, we'd love to hear your thoughts about the show. You can always reach out to us on Facebook, Twitter, or email us at feedback@cppcast.com, and don't forget to leave us a review on iTunes or subscribe on YouTube. Joining us today is Arnaud Desitter.
Arnaud is a senior software engineer based in Oxford with 25 years experience in scientific
programming.
He has a special interest in software reliability and optimization.
He holds a civil engineering master's from École des Ponts ParisTech, France.
He is a member of ACCU and has done several presentations at their conference and local events.
He's worked on reservoir simulators for the last 15 years.
He cycles to work and is currently midway through his third virtual trip around the world.
In his spare time, he sails an old wooden dinghy on the River Thames.
Arnaud, welcome to the show.
Yeah, thanks for having me.
I'm curious how you ended up getting interested in scientific programming in the first place.
Well, I studied a civil engineering degree,
and I got a lot of lectures on mechanics, fluid mechanics, CFD, and soil mechanics.
My first internship was about studying surfing waves
and movement of sediment on the bottom of the sea.
I got really interested, and it all unravelled from there on.
I ended up being in Britain simulating groundwater flow at the University of Bristol.
And then I found similar jobs ever since.
That's pretty cool.
Yeah, very cool.
All right.
Well, Arnaud, we got a couple of news articles to discuss.
Feel free to comment on any of these, and then we'll start talking more about your work and about that C++ on Sea talk you gave a year ago, right?
Yeah.
Okay.
So this first one is a blog post parameter passing in C and C++.
And I guess it's kind of a deep dive into how parameter passing works at the assembly level. Jason, you want to talk about this one?
Yeah, and specifically the x86-64 calling conventions on Linux and Unix that we're used to seeing; the reference is the System V AMD64 ABI. But yeah, it's a deep dive into how these values actually get saved in registers and pushed onto the stack when you're passing values between functions or returning values. Actually, one comment in here is interesting: the author says he actually changed this code because the return parameter was uninteresting without complicating the example, basically.
Arnaud, do you have any thoughts on this one?
Yeah, I thought it was very interesting,
because my own guideline is always,
if your parameter is bigger in size than two words,
which in 64-bit would be two times 64-bit,
then you pass it by reference.
And I'm kind of vindicated,
because that's exactly what the study shows, right?
And as a matter of fact, I just looked it up in the so-called Core Guidelines, and you've got the F.16 rule. It says indeed that their advice is two to three words, and it turns out that on x64 it's exactly two words in practice. Which is why, for example, you should pass a string_view by value.
That's a good point, yeah.
Otherwise you end up with basically a pointer to a pointer if you pass a string view by reference for no good reason.
Absolutely.
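(Editor's note: a minimal sketch of the two-word rule of thumb being discussed; this example is ours, not from the episode or the article.)

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// A std::string_view is two words (pointer + length), so it fits in
// registers: pass it by value. A reference would just add indirection.
std::size_t count_spaces(std::string_view sv) {
    std::size_t n = 0;
    for (char c : sv) {
        if (c == ' ') ++n;
    }
    return n;
}

// A std::string is bigger than two words (and owns memory), so pass it
// by const reference when you only need to read it.
bool is_comment(const std::string& line) {
    return !line.empty() && line.front() == '#';
}
```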
One of the things that really stood out to me, and this is the kind of thing I love to show students when I'm teaching,
and I'm like, yeah, look, it's easy to see how the compiler is passing these parameters in Compiler Explorer or whatever.
But I had never stopped to consider this example, makeColor.
It's about halfway.
Well, not halfway.
It's towards the top, I guess.
makeColor and makeColorBad.
And the author is showing the example of passing RGBA to a function, and then at some point swapping the arguments so it goes ARGB. If you pass them straight through in the same order, the compiler can do all kinds of optimizations, because they're just going to stay in the same registers no matter what. But if you swap them just a little bit, the compiler ends up having to copy all of the values around in the registers.
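(Editor's note: a hypothetical reconstruction of the pattern being described; the function names follow the post, but the bodies and the Color type are guesses.)

```cpp
#include <cstdint>

struct Color { std::uint8_t r, g, b, a; };

Color makeRGBA(std::uint8_t r, std::uint8_t g, std::uint8_t b, std::uint8_t a);

// Arguments forwarded in the same order: each value stays in the register
// it arrived in, so the compiler can often emit little more than a jump.
Color makeColor(std::uint8_t r, std::uint8_t g, std::uint8_t b, std::uint8_t a) {
    return makeRGBA(r, g, b, a);
}

// Same call with the parameters reordered: every value now sits in the
// "wrong" register, so the compiler must shuffle them all before the call.
Color makeColorBad(std::uint8_t a, std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return makeRGBA(r, g, b, a);
}
```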
Well, I think one of the cool things is that around 2000-ish, when x64 was designed by AMD, they took a lot of care to redesign the argument passing. And I think they did quite a good job, especially considering how bad it was before in x86. It's one of these decisions you make once a decade, and it sticks around for 30 or 40 years, so I'm glad they did it quite well.
Right. Yeah, it's interesting. Is anyone working on, you know, 128-bit calling conventions or anything? I haven't heard any discussion about that.
Well, I doubt it. But maybe in the GPU space.
It seems like 64 bits should be able to stick around for a while. 64 bits should be enough for anyone.
Okay. Next thing we have is another blog post. This is on FireEye, and it is Fuzzing Image Parsing in Windows, Part Two: Uninitialized Memory.
And I might need you to talk me through this one, Jason.
This is one you found.
Actually, one of my friends who watches my YouTube channel,
but otherwise doesn't do any C or C++ development,
but has been involved in security in one way or another,
basically his entire career, sent this article to me. And he was like, I thought you might find this interesting. And I'm like, well, yes, heck yes, I do actually. And it's just this cool exploration into how the author found uses of uninitialized memory in Windows's image parsers. And it's starting from the idea of fuzzing, which we've talked about on here several times, but most tools can't detect a read of uninitialized memory. So one technique that's in here, which I had never seen or heard of before, is actually running the same fuzz input multiple times and then detecting whether you get the same output. If you get a different output in one of the multiple runs, then you know that you have a read of uninitialized memory somewhere that went undetected. So it's a whole deep dive into how techniques like that were used to find two different security vulnerabilities in Microsoft's image parsers, which, I do believe they said, are getting patched; Microsoft is acknowledging the issue.
Yeah. Well, the CVEs were from 2020, so reading between the lines here...
My friend who's been involved in security says that his opinion is that there's a good chance these were actually submitted to Microsoft for a bug bounty and only released after Microsoft had fixed them.
Oh, yeah.
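(Editor's note: a minimal sketch of that run-it-twice idea; this is our own illustration, not code from the article, and parse_image is a hypothetical function under test.)

```cpp
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical parser under test: returns serialized output for an input.
std::string parse_image(const std::string& input);

// A deterministic parser must produce identical output for identical
// input. Divergence across runs suggests uninitialized memory (or other
// hidden state) leaking into the result. In practice the runs happen in
// fresh processes with a perturbed heap, so stale bytes actually differ.
bool output_is_stable(const std::string& input, int runs = 3) {
    const std::size_t first = std::hash<std::string>{}(parse_image(input));
    for (int i = 1; i < runs; ++i) {
        if (std::hash<std::string>{}(parse_image(input)) != first) {
            std::fprintf(stderr, "unstable output: possible uninitialized read\n");
            return false;
        }
    }
    return true;
}
```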
When I read this article, what came to me is that, rightly so, the author said that Valgrind, or something similar to Valgrind, is not available on Windows. I think that's maybe the crux of the matter for this kind of software, right? I remember very well when Julian Seward released Valgrind in March 2000, and I discovered it like two months later at the time. And at the software company where I was working, a colleague just took this very, very early version of Valgrind, and we found bugs in production code straight away. And to me, it was like, wow, this thing is going to change everything, at least on Linux. And I think it did, to a very large extent, right? So why Microsoft never invested in having something similar on Windows, I am not so sure, but I think it's probably just the way it is.
Well, there is a side comment in here that there are tools like Dr. Memory on Windows, and every time I've seen Dr. Memory come up, that's basically how it's treated: there is this Dr. Memory on Windows, but I've never seen anyone actually give a report of how they used it and found issues with it that they were able to fix. And I've tried to run Dr. Memory myself on Windows and have never gotten satisfying results from it, but it's supposed to do the same things that Valgrind can do.
And the other thing I had was about the reference to lcamtuf, which is Michal Zalewski. This guy is quite amazing. He's been producing American Fuzzy Lop and various similar fuzzing technologies for the last 20 years. And it looks like he must be responsible for so many security vulnerabilities being found thanks to these ideas. That's truly bewildering, right?
Yeah.
Going back to the sanitizers for one moment, I did see that Microsoft announced that in the latest version of Visual Studio, I think they're calling the address sanitizer ready for production use, which is great to see. Hopefully they keep investing in that kind of tooling, because I know there's memory sanitizer too. It'll be nice to see that get into Visual Studio.
Yeah, I think memory sanitizer can catch it. But my understanding, or my recollection from when I tried it years ago, is that you have to recompile every single library with memory sanitizer.
Oh, that's right.
Including the third-party libraries. So in practice, usually people will use address sanitizer and that's it, right?
Right, right. That's a fair amount of work.
But Valgrind, being a JIT, doesn't have that limitation. Valgrind has got a penalty, though, which is easily 100 times slower, which usually you cannot afford, right?
Right. Depending on what we do, usually we can't, right?
Yeah, I was recently using Valgrind to try to find an issue, and it was 20, 30, whatever times slower. Even on a relatively small test that took like 10 seconds before I ran it through Valgrind, I'm sitting there going, all right, it's about time for this to be done now.
I've got a few old stories where we had to let it run for like 50, 60, 70 hours before finding the bug.
That's painful, right?
That's very painful.
Could be kind of fun to get one of our security researcher guests back on to discuss an issue like this.
Yeah.
All right.
And the last thing we have is a post on Arthur O'Dwyer's blog, and this is Don't blindly prefer emplace_back to push_back. This is pointing out how some static analyzers will suggest using emplace_back over push_back. But if you're not using emplace_back the right way, then it's really doing the same thing as push_back as far as performance is concerned, except the code will actually take longer to compile, which I don't think I was aware of. Like three levels of template instantiations to use emplace_back if you're not using it properly.
Well, even if you are using it properly.
Right. So it's worth it if you're using it properly. But if you're just substituting push_back with emplace_back, then you're not using it the right way, and it's definitely not worthwhile.
Yeah, it was an interesting blog post. I use that exact same clang-tidy check, and if you use the fixit, I'm pretty sure it does the right thing.
Yeah, if you use the fixit. If you misread what the fixit said to do, then you end up doing the wrong thing.
Right. But one advantage of emplace_back in C++17 is that it gives you a reference to the last element, which actually can simplify your code a little bit, and sometimes the readability is worth it for that reason alone.
Yeah, I think that's a really important point
that should have been mentioned in this article as well.
Yeah, that's true.
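(Editor's note: a quick sketch of the distinction being discussed; this example is ours, not from the blog post.)

```cpp
#include <string>
#include <vector>

struct Widget {
    Widget(int id, std::string name) : id(id), name(std::move(name)) {}
    int id;
    std::string name;
};

int main() {
    std::vector<Widget> v;

    // No benefit: a Widget temporary is constructed either way, and
    // emplace_back only adds template-instantiation work at compile time.
    v.push_back(Widget(1, "a"));
    v.emplace_back(Widget(2, "b"));

    // The point of emplace_back: forward constructor arguments so the
    // Widget is built in place, with no temporary at all.
    v.emplace_back(3, "c");

    // Since C++17, emplace_back also returns a reference to the new element.
    Widget& w = v.emplace_back(4, "d");
    w.name += "!";
}
```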
And the last thing that comes to my mind reading that article is about reducing compile times. I've been on this odyssey for years, trying to reduce the waiting, and the grumbling of my colleagues while waiting for the build to complete. And the one nugget of information, with GCC at least, is that usually you might compile with -g to get debugging information, for example to have a stack trace when you get a crash. If you're not interested in using a debugger, you can move to -g1, and usually that cuts a third off your compilation time. On top of that, if you run your build in the cloud, if you virtualize your build, it might decrease the money you pay every month by quite a bit, right?
So -g is equivalent to -ggdb? It's like full debugging information, right? And you're saying with -g1 you're cutting it back?
So with -g1, basically, I've got far less debugging information. I still have enough to get a stack trace where you crash, which, if you run things in batch, is good enough, right? The size of the object files is much smaller. Now, if you need to debug interactively, that's not sufficient, but you can always do a special build at that moment, once you have a stack trace. Valgrind works perfectly well with -g1, for example.
That is really interesting. So there you go.
That was my trick of the day.
I will accept that trick, and I wish I had brought my notebook down with me to write that down. Oh, I have a notebook computer in front of me; I guess that's related. You know, since we just mentioned decreasing build time in the context of emplace_back: I accept Arthur's argument that it's slower to compile.
I would be shocked if it's actually a problem.
Maybe it is on some larger builds.
But I was just recently working on some code, and I think our listeners will find this an interesting anecdote, so I guess I'll share real quick.
About 30% of the total build time comes down to the fact that we're using a custom container where standard vector would have sufficed instead.
Is it a project you're working on? Wow.
Yes.
Were you able to make the substitution and get that increase in build performance?
I made a tiny wrapper class that uses vector in the background, because there's a little bit of functionality that they added on. And it probably would be relatively easy to swap out my tiny wrapper class with vector, with just a little bit of thought, for most of the use cases.
30% of the entire project's build time.
That's crazy.
It is crazy.
So you do have tools. Clang especially has got a command-line option to give you some feedback, and I think there's a project on GitHub, whose name I can't remember, that lets you do your build and then aggregate statistics to know where the time was spent inside your compiler, at least.
Yeah, that is actually what I used.
At least on the software where I work, what you find is that you spend a lot of time compiling unique_ptr. And I was thinking, well, that's actually rather good news, because at least we are using it.
Right.
We did find an outlier: using Boost Asio was pulling in an enormous number of headers, and you can just put a macro, say, Asio-only or something like that, and it cuts something like 50 seconds of build time.
Wow.
So that sounds very good, but it might be only 0.1% of your whole build anyway. So there you go.
So -ftime-trace is the Clang flag, and I think ClangBuildAnalyzer is the tool that you're referring to.
Oh, it's pretty good, right?
Yes, I used those tools to help me isolate those. Just for the record, I didn't just magically stare at the source code and say, clearly, this is the slow part.
Sponsor of this episode of CppCast is the PVS Studio team.
The team develops the PVS Studio Static Code Analyzer, which detects errors in C, C++, C Sharp, and Java code.
When you use the analyzer regularly,
you can spot and fix many errors right after you write new code. This means your team is more
productive during code reviews and has more time to discuss algorithms and high-level errors.
Let the analyzer that never gets tired do the tedious work of sifting through the boring parts
of code looking for typos. For example, let it check comparison functions. Why comparison
functions?
Click the link in the podcast description to find out. Remember that you can extend the PVS Studio
trial period from one week to one month. Just use the CppCast hashtag when requesting your license.
So you did a talk at C++ on Sea last year, I think, Arnaud, which was on the topic of improving performance by reducing memory allocations.
Is that right?
Yeah, I did indeed, right.
Yeah, I can talk a bit about where it comes from anyways.
Sure.
So as it turns out, I work for a company called Schlumberger, so I can say a few words about the company. It's an engineering company, a very large one, and we provide services to oil companies.
And one tiny portion, which is big in money but tiny for them, is producing simulation tools.
So I work specifically on the reservoir simulator, which is the kind of software used to simulate oil and gas fields. So we simulate flow in porous media.
So that's as much as I want to say about the domain.
It's not the topics of this podcast.
Oh, that's fine. People will enjoy it regardless.
So go into whatever detail you want to.
It's actually quite an important piece of software, right?
I mean, a large amount of the energy you're consuming will actually have been simulated by this kind of software, or by this product itself. So it's definitely production software.
So as a software engineer, what you end up with is several million lines of C++ code for the simulation part. A simulation might last a few minutes on a laptop, or it can keep a cluster of a thousand processors busy for several days, depending on the size of the problem.
So definitely reducing the time is a good thing to make the software faster.
So that's one thing.
And the thing we don't want is to change the results, right?
So that's a big no-no.
We can fix bugs, but we don't want to change the formulation.
And one reason for that is that if you calculate how much oil you can produce,
and you have a new version of the software,
and you realize that now the numbers are different,
usually people are not happy whatsoever, right?
So we have to be very careful.
So in other words, if I fix something,
the last thing I want is to break it.
I just want to keep the result identical.
So to come back to the reducing memory allocation,
I mean, a large part of it comes from just looking at the source code. And as a C++ developer, you have been brainwashed by the literature, like Scott Meyers and so on, saying memory allocation is really, really bad; you should reduce it at any cost.
The fact is that, until semi-recently, I had never found a tool that tells me where these allocations take place.
So you see a lot of micro-benchmarks.
But when you take something which is very big,
finding this allocation and the one that
actually matters is actually quite difficult.
So what you can do is some profiling with whatever technology you want, VTune or perf. Usually what you find for most C++ codebases is that about 5-10%, let's say 5%, of the time is spent inside the heap allocator, so in malloc and free.
Intuitively, you think, well, actually, if I remove half of my allocations, I might gain something like 2.5%, which is quite frankly not worth it. It's not a lot. So that was it, right?
And then in 2016, a customer contacted us saying he was very unhappy because his simulation was 20% slower with the new version of the simulator than the previous version, which is not very desirable. So we actually backtracked inside the revision control system, and we traced it back to an upgrade of CMake, which is rather counterintuitive. How could CMake reduce the performance of an HPC, a high-performance computing, application by 20%? That makes no sense whatsoever. And what was even more curious is that taking the head of the tree, the bug had actually disappeared, but we didn't know why. So clearly something happened.
So a colleague called John Herring did some detective work and managed to crack it, so kudos to him. He tried a lot of things, and he finally attached to the process with VTune and managed to find the root cause. And it goes this way: you could see that somebody by mistake had introduced a copy of a small vector, a vector that contained something like five or six elements, ints actually. So it was a std::vector of int, and he introduced a copy in a critical loop. So basically we started copying these five or six elements, copying the vector again and again and again, millions or billions of times. But that by itself shouldn't be a problem. And surely that has nothing to do with CMake. What happened is that the new version of CMake basically changed the order of the linking. And somewhere, when you do the link, the specific instantiation of std::vector of int has to be picked up from somewhere, and only one is kept at link time. And this one was picked up from a third-party library that we linked statically into our application, and sadly, it was actually compiled without optimization on. So the copy constructor of std::vector was actually used completely unoptimized, right? So we amplified those copies by a gigantic factor, and we ended up adding 20% to a simulation that might take something like 10 hours.
That might take something like 10 hours.
So what happened in the head of the tree
is that we did some profiling,
and we actually found that problem,
and I actually added from it,
but I didn't even remember it,
so I fixed it and moved on.
So anyway, that's a nice little story.
What it tells you is that, well, first of all,
you should be quite careful
when you compile your third-party library, and I think we learn our lessons now that that is now fully automated with
conan so we don't have that issue anymore but the other thing is that um okay so i got 20
percent but maybe once you fix the optimization problem it will cost you let's say 0.25 but what
about if i've got many more of these copies, which I'm not aware of?
How much do they add up? And I didn't have an answer
to that. I even considered recompiling the whole simulator,
changing libstc++, and make the copy constructor
very slow indeed, to find the origin of all these things.
But it was not very practical, and I don't have the time
for that. So that was it, right?
The follow-up to the story is that I went to the ACCU conference in Bristol in 2017, and I attended one of the many talks by John Lakos. John Lakos at Bloomberg is Mr. Allocator; he's been selling allocators for years. I remember attending his talk, which was truly fascinating, and I was thinking: I don't even know where my allocations are, so let's try to replace them with a specific allocator. And that was it. So I left at the end of the session, and I talked with a friend called Matthias Schulz, who was attending the conference. And I said, well, I wish I had a tool. And he said, well, why don't you use heaptrack? I said, heaptrack? I've never heard of that. So he opened his laptop and started showing me heaptrack straight away, and it had this amazing ability of capturing at a very fast rate. Pretty much like Valgrind, you put heaptrack in front of your command-line application, and then it basically logs every single one of your allocations, every malloc and free. And then you can go to a GUI later on to display all these allocations, with flame charts and very nice GUI capabilities. I was like, well, this is it, right? One of these Valgrind moments. I mean, suddenly I've got a tool I can work with, right? And so the talk I gave three years later was really, in a way, to give back to the community what I learned thanks to this amazing open source tool.
Right. So we've talked about lots of different tools on the show, but I don't think we've ever talked about heaptrack, right, Jason?
I don't know. I think the first time I saw heaptrack was in one of the poster sessions at CppCon, like in 2017 or 2018.
Maybe. That might be right.
2017 sounds like the right ballpark for what you're saying, too. But I've never used it before myself. I feel like it's possible we may have mentioned it because of a lightning talk or something, but I don't know.
I don't know. Yeah.
Certainly never gone in depth on it.
No, we haven't.
So heaptrack: the author is Milian Wolff, a German software engineer who works for KDAB. He's done, in my opinion, an amazing job. When you watch his videos, he says, oh, I'm standing on the shoulders of giants, and I was thinking, well, that's pretty good going; you've done quite an amazing job. So what this tool does is basically capture the stack traces and then use something called libbacktrace, which is part of GCC, to symbolize them. It does it in a queue, so the slowdown is actually very acceptable. As long as your application doesn't call malloc and free at an extremely fast rate, the overhead might end up being very, very small indeed. And then you can analyze it, find out where the allocations are, and start to try to fix them. So if you watch the talk I gave at C++ on Sea, which was online, it was really the tale of, first of all, how can you install it? And it turns out that, again, Milian Wolff made it very straightforward indeed, because you can use a technology called AppImage, where basically you download a blob, make it executable, and off you fly, right? So it's very, very straightforward to try.
But then I tried it on the simulator, and what I found was, well, not very pretty. It was totally unexpected to me. I mean, we had a runaway allocation inside the inner loop, and another one. One of them was basically copying a vector and never ever using it, right? That was probably a leftover from a debugging experiment. But if you think about it, that's the kind of tragedy of C++: if you declare a std::vector and put 20 elements in it, it's logically an unused variable, but not as far as the compiler is concerned, because it's got side effects. So you get no warning whatsoever. So that was easy to fix. I just removed a piece of code and had a 2% speedup. And I said, wow, okay.
Instead, you could have put a comment: to-do, remove this line when we need more performance.
and chasing up one after all the problems I was seeing,
at least on the common code pipe and i ended reduce it by a factor of two order of magnitude right so when
we had something like allocating 500 millions now we're under 101 millions of allocations and um
and if we go back to the story that's um that would consume only 5% of our time. I'm surprised that, at least in serial,
the speedup was more like 20%.
So that doesn't quite make sense.
And the reason for that is when you look at the profiling,
the time spent in malloc and free
is indeed something you pay for.
But what it does is destroying all your caches.
So actually, it's really hard to estimate
what this allocation costs you
because it could be that it's in a relatively cold loop
or something when your cache behavior wasn't great anyway.
So that costs you the cost of malloc and free.
But if you're in a potentially hot loop,
it costs you far more than that.
It costs you the fact that basically
you don't use memory efficiently anymore.
Your cache hierarchy
is being destroyed
by going out of line and so on and so forth.
And
yeah, so potentially
you've got
much more to gain by actually
fixing this issue.
So go for it.
When you launch, when you run heaptrack, you said you basically run it like Valgrind: it's heaptrack and then your executable?
That's right.
And you make it sound like it's just that straightforward. You pull up the interface or whatever after it's done and go, oh, look at that, line 12. I should probably take a look at line 12 because it's doing way more.
Absolutely. So you say heaptrack, the command-line utility, and it will leave your application working as normal, right? And when you stop the application, it basically writes a compressed file containing all the information. And then you have a good GUI that comes with heaptrack, which you can install very easily because it's pre-compiled on Linux (oh, I forgot to say, all this is Linux only), and you can basically read this file and analyze it.
And the most useful part is to use flame charts. I don't know if you're clear on what a flame chart is.
I don't think we've ever talked about it on the show, but I'm familiar with it from Chrome's tracing.
Absolutely. So you've got this enormous quantity of data to wade through, and the flame chart makes it quite easy, because you're going to display, if you're interested in the number of allocations, 100% of them on the x-axis, and on the y-axis you put the stack traces. And then what you care about is the big rectangles. You want to reduce the allocations that are very frequent, or the allocations which are repeated inside a loop; those are the first ones you want to target.
Right.
And so you go try to do something about it, and if you succeed, you just repeat the experiment and see if things have improved, right? And rinse and repeat thousands of times until you get it sorted. So yeah, to me, heaptrack is an amazing technology in that respect. I'm quite surprised that people don't use it more often, and one reason I did that talk was to make it more popular.
I've got a bunch of questions right now. The first one I would like to ask: you started the story by saying that you were in one of John Lakos's talks, and you're like, oh, allocators, clearly that's our answer. Then you came over to heaptrack. I'm curious if you ever came back to allocators and said, well, okay, there are still these things left that I wasn't able to remove, so I'm going to use allocators here. Or did heaptrack just get you all of the help that you needed?
Well, so again, referring to my own talk, what I tried to do is to categorize,
first of all, what I found.
I'm not saying that the application I work on is necessarily that typical, but it has been written by a lot of people over the years. You've got 20 years of development by a team of 20 to 40 people, constantly. We've had a lot of coders. So you would expect it to be probably quite typical. And are allocators useful, or what in C++17 are the PMR allocators? And the answer is yes, right?
But I would say they are actually at the back of the list, right?
Okay.
There are actually much simpler techniques to deploy first before you get to allocators. And I can elaborate on that if you're interested.
Yeah, go ahead.
Yeah, sure.
So where allocators are useful is specifically for associative containers. The usual case is that you've got an unordered_map, or a map if you're unlucky. And the typical thing is that you're going to populate it, but you never remove any element. Once it's populated, it becomes kind of immutable, and then you keep doing a lot of lookups. Because it's a node-based container, if you put two million elements in a map, and that's very frequent in what I work on, you end up with two million allocations. So if your problem is of size 10 million, you might do something like 50 times that 10 million. And that's for a map which is populated only once. The answer, basically, is to use a monotonic buffer resource. You put a monotonic buffer resource behind that map, fill everything in, and it will probably reduce to something like 30 allocations, due to the amortized growth. It works beautifully, as long as you don't copy the map around. You have to be very careful about that. Presumably it's inside an object that should be made non-copyable. That works really well.
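(Editor's note: a minimal sketch of that populate-once pattern, assuming C++17's <memory_resource>; the names and sizes are illustrative, not from his codebase.)

```cpp
#include <map>
#include <memory_resource>

struct LookupTable {
    // One upstream arena; the map's nodes are carved out of it, so a
    // populate-once, read-many map does a handful of large allocations
    // instead of one allocation per node.
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::map<int, double> table{&arena};

    // The arena can't be copied, which conveniently makes the owning
    // object non-copyable; copying the map around would defeat the scheme.
    LookupTable() = default;
    LookupTable(const LookupTable&) = delete;
};
```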
Another use of allocators, which I found quite useful, is the typical situation where, at least in a lot of numerical applications, you have a lot of vectors whose size you know. If you've got physics, you might have something like a number of components, and you've got 20 vectors with different names, but you don't know at compile time how many components you will have. It will be something quite small, right? So you end up with vector one of size n, vector two of size n, vector three of size n, and you allocate those inside the kernel. And they all add up to something like two kilobytes of memory or so. So again, because you know the size at runtime and it doesn't change, you can use a monotonic buffer resource: you put two kilobytes of memory on the stack, which is passed to your buffer resource, and then the only thing you have to do is transform your vectors into PMR vectors, pass in that resource, and the job is done.
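(Editor's note: a sketch of the stack-backed variant he describes; the field names are made up, and the 2 KB figure follows the conversation.)

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

void kernel(std::size_t n) {
    // Scratch space on the stack, handed to a monotonic resource.
    // If the vectors outgrow it, the resource falls back to the heap,
    // so this is an optimization, not a hard limit.
    std::byte buffer[2048];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    // The sizes are known up front and never change, so all of these
    // vectors are carved out of the one stack buffer: no malloc calls.
    std::pmr::vector<double> pressure(n, &arena);
    std::pmr::vector<double> saturation(n, &arena);
    std::pmr::vector<double> density(n, &arena);

    // ... fill and use the vectors ...
}
```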
Right. And you saw a significant impact making that kind of change?
Well, again, it depends what you call significant, but yes, in production I will gain one percent here and there in speed.
One percent here and there adds up. Yeah.
Well, if you take an extremely optimized major industrial system and you speed it up by 10%, in any case, that's pretty good going.
Yeah, that's definitely good, yeah.
Well, the same thing is that, first of all, this kind of software I'm talking about runs on clusters and keeps them busy and warm for a long time. So if you can decrease your electricity bill by 10%, that's quite a good thing. And I think it's a good thing for the environment in any circumstance, right? It's just better use of resources.
Right.
But having said that, I think all these PMR facilities are probably the last thing you should do. There are actually usually much easier techniques to deploy first.
Okay. Just eliminating the dynamic allocations in the first place.
Well, the poster child for that, unsurprisingly in C++, is that you will find you spend a lot of time copying strings around. So the first thing to do is probably to use string_view when you can. A typical one: you might do some parsing and use substr, which creates a new string, only for the benefit of parsing. If you actually make it a string_view, you end up with something very lightweight, because it does zero allocations. So usually that's a solution to a lot of problems.
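(Editor's note: a small sketch of the substr point; the example is ours.)

```cpp
#include <string>
#include <string_view>

// Each substr() here allocates and copies a brand-new std::string,
// even though the result is only inspected and thrown away.
std::string key_of_copy(const std::string& line) {
    return line.substr(0, line.find('='));
}

// The string_view version points into the original buffer:
// same parsing, zero allocations.
std::string_view key_of_view(std::string_view line) {
    return line.substr(0, line.find('='));
}
```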
And the other one, which is very interesting, is C++20 span.
Right.
And the reason is that in a lot of numerical simulations, you end up wanting to pass a lot of vectors. So usually the way people code in C++, quite naively, is to pass a reference to a vector. And there's not really a problem with that. It's just that, first of all, if you pass a reference to a vector, you never know if the function you're calling is going to resize your vector behind your back. Whereas if you pass a span of double, you know that it can't possibly resize that span, right? You can only populate it. So first of all, by doing that, you make your function stricter in terms of contract. You clarify that you're not going to resize that vector.
Interesting point, yeah.
So that's one thing. But the other thing is: imagine now that your physical problem is to pass a vector of size 2, which might be, for example, oil and gas, the number of phases.
Right.
If you make it a vector, you're obliged to allocate a vector of size 2, which is a bit silly, really, because that can go on the stack. So if you make it a std::array of double, 2, and the function you're passing it to is coded in terms of span, then you don't force the allocation scheme on the caller. The caller of the function can actually use any kind of contiguous container, whether it be a vector, an array, or a small-vector implementation, which I've been using extensively. In a lot of situations your vector might be of a relatively small size, and a small vector is the way to go, right? Or you might want to use a PMR vector.
Right.
And, well, to me, span was really one of the very useful additions of C++20.
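(Editor's note: a sketch of that contract, assuming C++20's <span>; the function names are ours.)

```cpp
#include <array>
#include <span>
#include <vector>

// Taking std::span<double> promises the callee will not resize the
// sequence, and leaves the allocation scheme to the caller.
void compute_phase_fractions(std::span<double> phases);

void caller() {
    // Two phases (say, oil and gas): no need to heap-allocate a vector;
    // a stack array works just as well through the same interface.
    std::array<double, 2> two_phases{};
    compute_phase_fractions(two_phases);

    // A vector (or a small-vector type) works unchanged too.
    std::vector<double> many_phases(8);
    compute_phase_fractions(many_phases);
}
```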
We need to talk about how you are already using C++20. But if you don't mind, I have one other question about heaptrack. You commented that std::string, having lots of copies of strings, is a pretty typical problem in C++, which I fully agree with. I'm not going to argue with that at all. But if I have a bunch of small copies of strings spread across 200 functions, does heaptrack help me, at least point out, hey, you're copying strings in far too many places, and it's going to be up to you to dig into where? Is there any way to bubble that kind of information up?
Yes, absolutely. You use the flame chart, and usually what I do is use the reversed flame chart, putting the leaf at the bottom. So you can now search for the copy constructor for string.
Okay.
But the beauty of the flame chart is that maybe you don't even have this problem; you can see how many allocations take place, right? So if it actually occupies only 0.1% of all your allocations, your problem is elsewhere, and you shouldn't waste time on that, right?
Right, very good.
So actually, it's very efficient.
Having said that, the other thing is,
imagine you could, well,
if you take the example of ChaiScript, for example,
your problem,
depending on what you parse,
you will see that the allocation will take place in different places, right?
Yeah.
So what I'm trying to say here is that you optimize a particular code path depending on the workflow. In my case, in terms of simulation, if I take a different field or a different type of physics, it will go to different places, and of course your distribution of allocations will change.
Yeah, these kinds of scripting and simulation environments are difficult to profile, because you have to try all the possible different code paths that someone might take and choose what you're optimizing.
Well, the danger there is to say, because I can't do everything, I do nothing. What you can do is take a few typical cases and have a good goal, and usually that can make a big difference.
So do you have any techniques or automated process of any kind to run heaptrack and say, okay, nothing new and scary has popped up? Is there any way to do this in your CI or something like that, just to give you a sanity check?
Can you do it? Yes, absolutely, because it's a batch application, so heaptrack will basically give you how many allocations take place, and then you can just put a threshold on it.
Oh, just look at total allocations or something.
Yeah, at the end.
Oh, okay.
So it would be very easy. We have not implemented that, but we could.
That's a really interesting idea.
It's quite easy to automate, right? Now the question is, how many things do you track in your CI? That's the other question, right?
So typically, on what I work on, if I put in any modification,
there's three hours of test before actually the pull request
can be accepted.
So we run a lot
of tests, a lot of unit tests, and we run
a lot of them on the address sanitizer
and this and that.
At some point, you have to decide where the
limit is, right?
Right.
And that is only the first level of testing.
But yes, you could absolutely automate to some extent
the fact that you don't actually re-inject
an enormous number of allocations.
So you made some references to using C++20 already. How are you on the bleeding edge of the standard?
So maybe I slightly overstated this. I do not use C++20; what I do use is span, specifically. I needed a span, so I looked around, and many people have coded a span. Watching the standardization process, I've got this good feeling that span is going to be successful, and there's a fair number of implementations. There's one in the GSL, which I used to start with, and these days we use the implementation by Tristan Brindle, which is really excellent. It's just a single file, so you put it inside your project, and you call it, I don't know, sss::span or something like that, in a different namespace, hopefully three letters. And then you can just change it to std next time around, which is exactly what I did for string_view in 2016, when I imported the string_view implementation written by Marshall Clow. And only something like four weeks ago, I changed everything to std::string_view once more. But it gave us four years ahead of the curve, being able to use string_view way before everybody else.
So you are able to keep using new versions of the standard, though. You are on C++17 then?
Yes, we are. For a while, C++ was kind of vegetating a bit, and people were slow to adopt C++11. We finally got the go-ahead in 2015 on the project I work on, and I went slightly crazy. We basically changed the code base to use as many features as I could. Well, no, not as many; as many useful features as I could at the time. And the first thing I did was very aggressively remove all the naked new and delete. When I counted, I had about 3,000 naked deletes, and I reduced that to 100. So that was pretty good going. And that was done manually.
Just to make_unique, you're saying?
Yeah, just to unique_ptr or whatever was necessary.
But otherwise, clang-tidy was actually the way to go, or is the way to go these days. And it's incredibly reliable. Well, it's two things. First of all, a lot of the transforms are reliable; I ran clang-tidy transforms on several million lines of code. And the other thing is that when I did find bugs in clang-tidy, usually they fixed the bug within a few days. So it's actually well worth filing bugs, with both Clang and GCC actually. I've had many bugs I filed against GCC get fixed in a matter of hours. They're always quite amazing, these guys.
So what's the process like? You said the first thing you did was remove a bunch of manual news and deletes, but are you then able to keep upgrading? You said you're on C++17 now, I believe. Did I hear that correctly?
Yeah. So the process would be: well, C++11 and 14 are well consumed. C++17's string_view, for example, was very useful. Oh, another nugget of information for you, though you can refer to my talk: if you use a PMR map and you insert elements into it, it's actually a very, very bad idea. The reason is that when you insert into the map, it creates a temporary node, which might be inserted but might actually be discarded. And if you use a monotonic buffer resource, you end up with slow memory growth. So the trick there is to use try_emplace, which is C++17, and which basically guarantees you not to have temporary nodes, again avoiding slow memory growth. So that's another example of C++17.
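(Editor's note: a sketch of the pitfall and the fix; the example is ours, and uses emplace for the bad case, since emplace must build a node before it can compare keys, which is exactly what try_emplace was designed to avoid.)

```cpp
#include <map>
#include <memory_resource>
#include <string>

int main() {
    std::pmr::monotonic_buffer_resource arena; // deallocation is a no-op
    std::pmr::map<int, std::string> m{&arena};

    for (int i = 0; i < 1000; ++i) {
        // emplace() typically allocates and constructs a node up front to
        // obtain the key; on a duplicate the node is discarded, and the
        // monotonic arena never reuses it, so memory quietly grows.
        m.emplace(42, "value");

        // try_emplace() checks the key first and only builds a node when
        // the key is absent, so duplicate inserts allocate nothing.
        m.try_emplace(42, "value");
    }
}
```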
And I think a large part of all these transforms was more political than technical, in a sense. I mean, to me, the reason we removed manual memory allocation was first of all to remove bugs, but it's also to make the project more attractive to new people. And in a way, we've been very successful in that, because we've had people join the project since. And I can say, yes, we use a lot of memory, but we know we don't have memory leaks, because we only use smart pointers. It's like, yes, we won the battle. So I think there's a large aspect that, if you're a young graduate, the last thing you want is to work on an old code base which is in a terrible state, right? So if you've got something which is more palatable, or kept to a higher standard, it makes the project far more interesting.
That is really a fascinating point. I haven't interviewed for a regular job in quite some time, but yeah, if they said, we're working on this great code base, it's so much fun, and then you find out later that it's nothing but a bunch of manual memory management, it'll be a shock, right? Using it as a way to recruit people, to say, no, we're actually taking the best advantage we can of modern techniques, that sounds...
Yeah.
So part of the origin of the talk I gave was that I had to convince my colleagues of the usefulness of these techniques. I did a lot of internal talks, so I had to write a lot of slides and convince them, bouncing up and down, showing how great all these things were. And because I had all these slides, I basically reshuffled them and gave them at a conference.
I basically reshuffled them and gave it to a conference.
That's a great way to do it.
Exactly.
So maybe some homework for you, Jason. You can do an episode of your YouTube show, C++ Weekly, on heaptrack.
Yeah, I might do that. Actually, you're making me feel bad, honestly, because like I said, I know I was made aware of this a long time ago at CppCon, and I've never tried it. It definitely sounds like a really useful tool. And considering I pride myself on being, like, the use-all-the-tools guy, and this is an easy-to-use one and I've never used it, I just feel bad now.
Well, I'm surprised that people don't use it more regularly or aren't aware of it, but maybe the reality is that there are so many new things happening and so many things to know, and there are only so many things you can learn.
Yeah, and there's a certain level of awareness, right? At this point, quote, everyone knows about the sanitizers. So saying, oh, I'm going to deploy the sanitizers in this project, most people aren't going to argue with that, or Valgrind, or whatever. People have used these things. But you have to, like, take this to your coworkers and be like, oh, by the way, I ran heaptrack on Monday morning because it seemed like it would be fun, and I got a 3% performance improvement in 10 minutes. You know, then people are going to notice, right?
So I might have to...
Yeah, no, it's...
Or maybe you can just...
I can only recommend it, basically, right?
Yeah, I have an idea.
We'll see how it plays out.
Thank you very much for sharing this with us.
So there you go.
Okay, well, Arnaud, is there anything else you want to plug, since we're starting to run out of time? Anything else you want to go over?
No, not really. I want to thank you for inviting me. You said in my bio that I cycle to work. A few years ago, I discovered that cycling on ice was a rather bad idea, so I had to replace my cycling to work with a long walk to the station, taking the train, taking the bus, because I couldn't cycle due to a small accident I had. And that basically gave me a lot of time to listen to your whole backlog of episodes. So I have to thank you for making that period of my life much more enjoyable. I went through something like 50 or 60 episodes.
Wow. Well, I'm glad that was helpful.
Yeah, definitely. So there you go. And I know a lot of colleagues who listen carefully or regularly to your podcast, and it makes it very interesting.
Thank you. Okay, well, Arnaud, it was great having you on the show today.
Thank you very much for having me.
Thanks for coming on.
All right. So we are going to hit end broadcast. And thanks so much for listening in as we chat about C++. We'd love to hear what you think of the podcast. Please let us know if we're discussing the stuff you're interested in, or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons who help support the show through Patreon.
If you'd like to support us on Patreon, you can do so at patreon.com.
And of course, you can find all that info and the show notes on the podcast website at cppcast.com.
Theme music for this episode was provided by podcastthemes.com.