CppCast - Performance Matters
Episode Date: October 1, 2020

Rob and Jason are joined by Emery Berger from the University of Massachusetts Amherst. They first discuss updates to GCC and the September ISO mailing. Then they talk to Emery Berger about performance tooling and how improvements in performance should be measured.

News:
- New C++ features in GCC 10
- September C++ ISO mailing
- std::exchange patterns: Fast, Safe, Expressive, and Probably Underused
- #include <C++> Meeting C++ 2020 scholarship

Links:
- Plenary: Performance Matters - Emery Berger - CppCon 2020
- Quantifying the Performance of Garbage Collection vs. Explicit Memory Management
- Stabilizer
- Coz

Sponsors:
- PVS-Studio. Write #cppcast in the message field on the download page and get one month license
- PVS-Studio is now in Compiler Explorer!
- Free PVS-Studio for Students and Teachers
Transcript
Episode 267 of CppCast with guest Emery Berger recorded September 30th, 2020.
The sponsor of this episode of CppCast is the PVS-Studio team.
The team promotes regular usage of static code analysis and the PVS-Studio static analysis tool. In this episode, we talk about C++20 support in GCC and pattern matching.
Then we talk to Emery Berger from the University of Massachusetts Amherst.
Emery talks to us about performance profiling tools and more. Welcome to episode 267 of CppCast, the first podcast for C++ developers by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. How are you doing?
I'm doing okay.
Still recovering from all the conferencing,
but I'm glad to get back to normal.
All the virtual conferencing.
Not quite as exhausting as actually flying out for a week,
but it can be a little bit. I don't know. In some ways, I think it was more exhausting.
I don't know.
Yeah, maybe.
You have something to share, I think, right, Jason?
Yeah, we've mentioned that I've been working on a C++ best practices book, but I did officially
release it last week. So it is available. And I'm guessing we can probably put a link to it in the
notes here at some point. It's at leanpub.com slash cpp best practices. And well, it's out
there now, I guess. I won't spend a whole lot of time talking about it since we have a guest today.
But how long is the finished book, though?
About 125 pages, and a kind of typical technical-sized book, like, you know, about
the size that you would expect for a programming book, you know, one of the smaller ones,
not the tome kind that has, you know, the complete C++
reference and they're full, like, 8.5 by 11, John Lakos's thousand-page book, I think. Right, right, right. So it
is intentionally short. This is one of the things I've learned as a contractor: if you
write a lot of words, people don't tend to read them. So I aimed for fewer words, hoping that people would read them. Very cool. Okay, well, I'll have to read that myself and I'll put the link in the show notes like we
said. Thanks, Rob. Yeah. Well, at the top of the episode, I'd like to read a piece of feedback.
This week we got this tweet from Harold saying it was a bit confusing to get CppCast already
on Thursday morning, because usually it's an indicator that the weekend is close, but it was again a very interesting episode, especially
if you have some interest in organizing C++ events. And yes, for anyone who follows, you
know, when the episodes actually drop online, I did accidentally post this last episode a day
early. Yeah, I apologize for that. You know, I did the same thing
with C++ Weekly a few weeks ago, also accidentally dropped it a day early, and I'm like, oh, oh well, I
guess I'm just sticking with it now. Yeah, yeah, there's no undoing that. No, you really can't.
Well, we'd love to hear your thoughts about the show. You can always reach us on Facebook, Twitter,
or email at feedback at cppcast.com.
And don't forget to leave us a review on iTunes or subscribe on YouTube.
Joining us today is Emery Berger.
Emery is a professor at the College of Information and Computer Science at the University of Massachusetts Amherst.
He graduated with a PhD in computer science from the University of Texas at Austin in 2002 and has been a visiting scientist at Microsoft Research.
Professor Berger's research spans programming languages, runtime systems, and operating systems with a particular focus on systems that transparently improve reliability, security,
and performance. Emery, welcome to the show. Thanks so much. Thanks for having me. I feel
like there's some other people that we know who were at the University of Texas, also in the C++ world. Isn't that where, no wait, where was
Bjarne and Gabby? Right, so they were at Texas A&M, which is the permanent rival of the University
of Texas. I realized as soon as you corrected me that it was Texas A&M that, well, clearly your
computer science department or that you work in is better than the Texas A&M one, though, right?
I'm not going to say that because I have some very good friends who actually went from UT Austin to Texas A&M.
So all fine departments.
Well, sorry about that.
That makes up there.
I know school rivalries are weird.
Since I get the opportunity to slam Texas A&M for a moment, I will take advantage of it.
Go for it.
It's a very asymmetric rivalry.
So they are obsessed with the University of Texas and lurk on message boards and things like this.
And their big insult is to refer to UT Austin, which is UT, as lowercase T-U.
And we're all supposed to be clutching our pearls.
This is so upsetting.
And they actually have us in their fight song.
Really?
Yeah, yeah.
It's completely insane.
They're obsessed with UT Austin.
And UT Austin is not at all obsessed with them.
So it's pretty funny.
Virginia Tech has University of Virginia in their fight song as well, actually.
Virginia Tech is where I went.
So I'd have no idea if University of Virginia feels the same way.
I mean, the college football games are always a big deal.
But other than that, yeah.
Yeah, yeah.
Anyway.
Yeah.
It's pretty hilarious.
That's funny.
Okay, well, Emery, we got a couple news articles to discuss.
So feel free to comment on any of these.
And we'll start talking more about
your work on performance.
Sounds great.
So this first one we have is a
blog post on the Red Hat Developer blog
and it's about new C++ features
in GCC 10
which
adds some of the C++ 20 features
that are official
now. So we can get some of them.
Not everything is available yet from C++20, though.
They're still working on some features, right, Jason?
Yeah, but I've been using...
Well, for the stuff that I've been prototyping recently in C++20,
GCC right now is my go-to compiler
because it has, at the moment, the most solid support
for the features that I care about. So I'm using, you know,
concepts pretty extensively and lots of constexpr things. So that's obviously where my,
my interests are going to pop up on these, which that seems to have pretty solid support for at
the moment. Yeah. That's the first thing they mentioned, that concepts, I think they're
saying, is complete based on what's in C++20, right?
Yeah, I don't know. It's all early stuff, right? I mean, if they say it's complete,
I haven't hit a problem, but I assume I'm going to at some point, right? Because
this is still largely untested, to be fair, right? So yes, I'm sure it's complete, but how complete? How stable is
it? I don't know. I definitely take a pessimistic view on that, just out of... nothing personal to
the GCC developers. If any compiler said, we're complete on the C++20 features right now, I'd be
like, I'm sure there's corner cases left. Sure, it's just how it goes. Emery, have you... are you also playing with C++20 features
right now? So I haven't started, mostly because of compatibility fears. Right? I mean, I'm
definitely one of those people who likes to move to, to a newer standard as soon as possible.
But there are people who use software that I make, who are not so quick. And so that's always a challenge.
I mean, I'm actually, for reasons that are related to performance, I'm really excited about stuff
like constinit, because this has been a thorn in my side for quite some time. So we do a lot of
stuff where we interpose on libraries. And so we have some static initialization that happens.
And, you know, the initialization definitely happens at the beginning, it's guaranteed, but we suffer because we have to take the check every single time it gets accessed. Being able to actually just tell the compiler, look, this is initialized at compile time, guaranteed, makes a pretty big difference in performance for some cases, like replacing the memory manager, replacing some thread libraries, and things like that.
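As a quick illustration of what constinit buys you, here is a minimal sketch; the library-interposition scenario Emery describes is not shown, and the names below are made up for illustration, but constinit itself is standard C++20:

    #include <atomic>

    int compute_initial() { return 42; }   // not constexpr, so it forces dynamic initialization

    // A function-local static with a dynamic initializer is initialized on first
    // use, so every call pays an "already initialized?" guard check.
    int& counter_with_guard() {
        static int counter = compute_initial();
        return counter;
    }

    // constinit guarantees initialization at compile time, so there is no runtime
    // guard on the hot path; it fails to compile if that guarantee cannot be met.
    constinit std::atomic<long> hits{0};

    void record_hit() {
        hits.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        counter_with_guard()++;
        record_hit();
    }

The guard branch in the first version is the per-access check Emery is talking about; constinit makes it go away.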
Awesome.
So that's pretty cool.
So yeah, so I'm always looking for the things that will enhance performance.
Also, all the enhancements for, I know that this is like a bad word for some people, but like meta compilation support.
I depend on that pretty heavily.
And it's also a way to get amazing performance gains.
And obviously at the expense of amazing compile times.
Do you mean like template metaprogramming kind of things?
Yeah, so I mean, I used to use template metaprogramming a lot.
And I've been gradually moving out into the constexpr world
as that's been made possible. And so, you know, there's an infrastructure that I built years ago for
my PhD called Heap Layers that we use to build a bunch of fast memory managers. So there's this
thing called Hoard that I worked on for my PhD, that forms the basis of the macOS
allocator, actually. And it depended on a lot of template metaprogramming
to compute a bunch of stuff at compile time
that avoided lookups at runtime.
And so moving all that overhead to compile time
makes a lot of sense,
but it can get pretty gory, as you know.
So cleaning that all up is really nice.
And just being able to put constexpr and constants
to every place will be tremendous.
It's true. I find myself doing almost nothing that I would call template metaprogramming
anymore. It's just constexpr, just constexpr. Yeah. Yeah. I mean, there's still cases
where I need it, but I would absolutely love to, you know, be in a place where I don't have to do it at all.
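As a made-up illustration of that shift (this is not code from Hoard or Heap Layers, just the general idea): a compile-time computation that used to need a recursive template metaprogram can now be a plain constexpr function.

    #include <cstddef>

    // Old style: a recursive template metaprogram to compute floor(log2(N)).
    template <std::size_t N>
    struct Log2 {
        static constexpr std::size_t value = 1 + Log2<N / 2>::value;
    };
    template <>
    struct Log2<1> {
        static constexpr std::size_t value = 0;
    };

    // New style: an ordinary constexpr function, evaluated at compile time,
    // that reads like normal code.
    constexpr std::size_t log2_floor(std::size_t n) {
        std::size_t result = 0;
        while (n > 1) {
            n /= 2;
            ++result;
        }
        return result;
    }

    static_assert(Log2<4096>::value == 12);
    static_assert(log2_floor(4096) == 12);

    int main() {}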
Yeah. Okay. Next thing we have is another mailing list from the ISO. So this is,
you know, mailing list for September, 2020. I didn't go through any of these specific articles,
but I definitely wanted to bring it up. Jason, do you have a chance to look at any of these?
I did. Anything particularly interesting for you? Yeah. I flipped through a fair number of them.
I want to make what I think will be a controversial statement. I think the most important paper for C++23
is pattern matching. And that's in here. And we talked about that a while ago with Michael Park.
I hope it makes it into C++23. Yeah, I do hope some version of it makes it into C++23. Honestly, since
I played with Rust and saw some of the power of what a good pattern matching syntax can do,
I've been doing versions of that using variant visitors in C++, I guess I'm using C++20, yes.
And seeing like, okay, I need this to go to the next step.
It can clean up so many things.
Yeah, that's good.
How about you, Emery?
So I had not been aware of this.
I've not been following C++ 23 development.
It's actually, it's crazy to see something like this.
I mean, so I spent a year when I was in an undergrad in England,
and I was at one of the homes of functional programming.
And so one of the languages that we learned was a predecessor to Haskell.
And it's just crazy to look at the code that comes out of the pattern matching stuff.
I mean, it just looks straight out of, well, you know, with a few ampersands thrown in.
There's always more ampersands. There we go. That's right.
We need, we need triple ampersands. I think that'll really, that'll really clear up all the
confusion. But, but yeah, I mean, a lot of it makes it look, you know, like this very, very
nice, clean, you know, ML style, Haskell style way of doing pattern matching. It just avoids a lot
of boilerplate.
And it's really, you know, you look at the code and you're like, I know what this code does.
So that's super cool. It really, you know, if I show this, I mean, I'm totally after I get up this call, I'm going to send this paper around to some of my colleagues.
And I'll be like, how is this C++?
It just no longer looks like C++ at all.
I was I read through the paper this time looking for more implementation details.
Because if it were an enhancement to lambdas, they would say, it's as if the compiler did this, for example, and show you how it could be translated into C++ 20.
But there's nothing like that in the pattern matching paper that I can see.
I'm pretty sure it's going to be very much, for lack of a better word, compiler magic compared to most features that get added these days, not just syntactic sugar.
Yeah, I mean, these things, I mean, they all get, I mean, they are kind of syntactic sugar in a way, but I mean, in a very deep way, right? Like a compiler has to really, it has to do a lot of work. And there's no straightforward translation of these things. So yeah, but it's, you know, this is I
will say this, I mean, this is incredibly mature language technology, right? People have been doing
so, you know, there's this term for pattern matching, I don't know how much it implements
of this, but it's called Hindley-Milner type inference. And you can actually do this kind
of destructuring without even adding types, and it'll infer the right types for you.
And this is something that functional programmers have had literally since the 80s,
right? And so, you know, that's a long time ago. And so seeing this enter C++ this way,
it's great. That would be super cool. I can see why it will be controversial, though,
because it's definitely a shocking change in the look of everything. I think one thing that
I mean, it's been since my first C++... the second C++Now conference I went to, so five or six
years ago, I've been watching talks from people who want multi-method dispatch kind of libraries, and
that's painful to do in C++. You can do it with some tricks today, but this pattern matching, it just falls out of
there if you need that kind of thing. So it's... yeah, I'll be interested to see... I'm sorry, go ahead. No,
no, that's all right. What I was going to say is, I was just interested to see what's going to be the
interaction of these things with the type system writ large. I mean, you've got all the virtual
dispatch, and you have all the crazy things that can happen with, you know, diamond inheritance and other insanity, you know, things that people should maybe try to avoid,
but how this plays with this... I mean, this is a traditional PL problem, where you're like, I'm going
to put in some type thing, and does this touch all the other types, and do I have to deal with the
cross product of all the interactions, or is it something that's kind of separable? So that I'm not sure. I mean, it looks really clean on paper right now, but I would be personally
terrified of trying to implement this and making sure everything is good. So we'll see what happens.
That is definitely something that has been happening. So, as I said, I've been using
std::visit with variants
to do similar kinds of pattern matching things. For our listeners who don't know, you can take
std::visit, pass in the visitor, and then pass in multiple variants, or... I forget which order those go in,
regardless. So if you've got four variants that you pass in, it's going to generate, like you
said, the cross product of all the possible interactions of all of these to try to generate
all of the calls to the visitors. That can be very painful at compile time, but it does
seem to actually generate efficient code.
It's a couple of, you know, nested lookup tables in it and it's good, but the compile
time can get a little scary on it.
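For listeners who haven't tried the approach Jason describes, here is a minimal sketch; the event types and the overloaded helper below are made up for illustration:

    #include <iostream>
    #include <variant>

    struct KeyPress  { char key; };
    struct MouseMove { int x, y; };
    struct Quit      {};

    using Event = std::variant<KeyPress, MouseMove, Quit>;

    // The usual "overloaded" helper: glues several lambdas into one visitor.
    template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;  // deduction guide (not needed in C++20)

    void handle(const Event& e) {
        std::visit(overloaded{
            [](const KeyPress& k)  { std::cout << "key "  << k.key << '\n'; },
            [](const MouseMove& m) { std::cout << "move " << m.x << ',' << m.y << '\n'; },
            [](const Quit&)        { std::cout << "quit\n"; },
        }, e);
    }

    int main() {
        handle(KeyPress{'a'});
        handle(MouseMove{10, 20});
        handle(Quit{});
    }

Passing more than one variant to std::visit is what produces the cross product of cases mentioned above, and that is where compile times start to hurt.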
Yeah. I mean, it's clearly a concern. I mean, one of the things that... So years ago,
Rob Pike gave this keynote talking about Go when the language was just brand new.
And the primary motivation, not the primary motivation, but one of the primary motivations
was compile time concerns that they had with C++.
And I was like, Jesus Christ, we're building a whole new language because of compile time?
It seems insane.
But you're Google.
You can do things like this.
And they're recompiling stuff all the time, so it makes sense.
And if you put something in that is going to lead to quadratic or worse explosion in compile time, it's definitely a risk.
So we'll see what happens. Yeah. But, but I mean, you can already do it today. I mean,
you can write your template metaprograms that compute, you know, the Ackerman function or
something if you feel like it. So there's nothing stopping you from shooting yourself in the foot
in the finest of C++ traditions. You know you're doing something right
if you have to up the template recursion limit
on your compiler parameters, right?
I have to confess,
I have that in a couple of my projects
and it's set in the thousands.
I think I could have assumed that
based on what you said.
Yeah, dirty little secret.
Don't look too closely at the command line, everybody.
That's awesome.
Okay.
Next thing we have is a guest post on the Fluent CPP blog,
which I don't think we've talked about for a while,
but there's always good posts here.
And this is a guest post from Ben Deane,
who we've obviously had on the show before.
And this is a guest post about std::exchange,
which I'll admit I was not that familiar with, and I think that was kind of the point of his post. I did a C++ Weekly episode about it, Rob.
Did you do one? Okay. Yeah.
I think that was kind of the point of his post, though, that a lot of people don't know about std::exchange, and he's trying to kind of
showcase some of its uses. Yes, and he definitely showed uses in here that I
had never considered
but that's what Ben does. Yeah, I'm not going to go too much into detail on this, but I definitely
would encourage reading the blog post, especially if you're like me and weren't too familiar with std::exchange before.
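For anyone who wants a quick taste before reading the post, the classic place std::exchange shows up is in move operations; this little type is made up for illustration:

    #include <utility>

    // A toy handle-owning type: std::exchange grabs the other object's pointer
    // and leaves nullptr behind, all in a single expression.
    class Buffer {
        int* data_ = nullptr;
    public:
        Buffer() = default;
        explicit Buffer(int n) : data_(new int[n]{}) {}
        Buffer(Buffer&& other) noexcept
            : data_(std::exchange(other.data_, nullptr)) {}
        Buffer& operator=(Buffer&& other) noexcept {
            if (this != &other) {
                delete[] data_;
                data_ = std::exchange(other.data_, nullptr);
            }
            return *this;
        }
        ~Buffer() { delete[] data_; }
    };

    int main() {
        Buffer a(16);
        Buffer b(std::move(a));  // a no longer owns the array; b does
        a = std::move(b);        // and back again
    }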
Okay, and then the last thing I wanted to mention is Meeting C++
is coming up soon. I think it's, yeah, the second week of November, the 12th to the 14th.
And #include <C++> is a group we've certainly talked about
on the show before.
And they like to support scholarships
for sending attendees to some of these conferences
who might not otherwise be able to go.
And they are running a scholarship
for Meeting C++ this year,
which is still
going to have an in-person component. So if you're interested in going to Meeting C++ and you're
interested in applying to the scholarship, this link will be in the show notes. And I'm not seeing
a link for, like, supporting the scholarship, though, but they're probably looking for that
type of support as well, I would assume, if you go to includecpp.org. So these scholarship tickets are given by Meeting C++ to #include <C++>,
but also, to clarify, this is specifically for the online tickets. Yes, Meeting C++ is still, as far as I can tell,
planning to have an in-person version, but this does say that these are free online tickets for
Meeting C++. Okay, gotcha. Cool. Well, yeah, like I said, if you're interested in going to Meeting C++ virtually,
you should definitely check out and apply to the free ticket sponsorship.
Yeah, scholarship.
Okay.
I'm so curious how this goes, just for the record,
how the in-person version is going to work out for them.
I hope it goes.
Yeah, I hope it goes well.
Okay, so Emery,
so you gave a great talk at CppCon this year,
one of the keynotes where you talked about performance.
I definitely encourage listeners to go and watch that talk
once it's available on YouTube.
I'm not sure if it's up there yet,
but I wanted to just start off
by talking a little bit about performance
and some of the things that you think can affect performance with C++.
Sure. So the talk is actually up. I'll share it with you.
Or you can find it yourself.
So, yeah, so performance in C++.
I mean, the real issue, I mean, obviously, you know, people use C++ primarily because it is close to the metal and can give you really great performance.
You know, the code gen is great.
There's, you know, very little in the way.
You have access to assembly language.
You have all this sort of stuff.
There's no garbage collection, which I should stress is something we've also studied that is mostly a space tradeoff.
So you can run your C++ programs in way, way smaller memory footprints than you could run something like Java or JavaScript, no matter how good the code is.
Because of garbage collection?
Yeah.
So I'll explain it briefly.
So we have a whole paper on this.
It's quite old, but the lessons still hold.
Essentially, the issue is as follows.
So suppose you have, so most garbage collectors trigger collection once the heap fills up to a certain amount, right? And you have some heap size parameter. And so if you set the heap super,
super tight, you could be in a situation where you've got a bunch of memory in use, and then
you allocate something and then you allocate something, and then
you free it, in effect, right? It goes away. You're not using it anymore. But you're bumped
up against the edge of the heap limit. And so it triggers a full garbage collection, and it
reclaims one object. And then you call new again, and then rinse and repeat. And so you can be in
a situation where the runtime just goes through the roof because the heap is too small. And so
as the heap gets smaller and
smaller, you get almost an exponential curve that just goes up and up and up. It's actually power
law, but anyway. And then as the heap gets bigger and bigger, then the runtime that you spend
collecting decreases because you allocate, allocate, a bunch of stuff dies, a bunch of stuff
dies, a bunch of stuff dies. You do a GC, you reclaim a ton of stuff, right? And then you have
a lot of headroom until you have to collect again. And in the limit, like in the limit
where the heap is infinitely large, you never collect, right? And so what happens is at some
point you get to a steady state where you're pretty close to barely collecting at all.
This is especially true for a generational garbage collector that is periodically reclaiming
very, very short-lived objects.
But it holds for any garbage collector.
But the problem is that the amount of space that you need before you get to the point where the runtime is basically the same as C or C++ running malloc and free or new and
delete is like three to five times as much memory.
Interesting.
Yeah.
So people just are like, oh, garbage collection is great.
And yeah, garbage collection is super convenient, but it comes at a great space cost. And if you have
plenty of RAM, terrific. But if you would need that RAM, or you're making really a lot of use
of the RAM, like it's a cache, or it's an in-memory database or key value store, you know,
you choose, you end up throwing away a lot of capacity by using a garbage collected language.
Go ahead.
Quantifying the performance of garbage collection?
Yeah, that's the paper. Exactly.
Okay. I know there's at least a few listeners who will be very curious in that.
Yeah. So actually Chris Lattner, who is the creator of LLVM and the designer,
or at least co-designer of the Swift language, specifically cited that paper as a justification
for why Swift doesn't use ordinary garbage collection
and uses reference counting.
Wow, okay.
So anyway, sorry for that segue.
Sorry.
Anyway, so be that as it may,
if you get rid of your garbage collection,
what are you left with?
You're left with the metal.
You're left with whatever machine you're running on.
And the problem is these machines have gotten enormously complex. So, you know, processors got, you know, they used to be really simple.
When I started out, I actually had an Apple II Plus, and it had a 6502. And in the 6502,
the instructions in the reference manual literally said how many cycles it takes for every instruction, which now is hilarious because it's like, well, did it hit in the cache?
Oh, there were no caches.
So there were no caches.
There were no branch predictors.
There was no virtual memory.
So there was no TLB.
There were all of these things.
There's no pipeline.
There was no dependency on the past.
There's all kinds of complexity in modern hardware.
And this complexity, unfortunately, surfaces in ways that can be very surprising.
So if you, you know, what they do, so just briefly talk about like branch predictors.
Branch predictors essentially record history of which way your if was taken.
Did it go the if way or the else way?
And it does this so it can prefetch the instructions and start loading them and executing them speculatively. And if it guesses correctly, most of the time,
it saves a lot of time. So it's not just stuck waiting to evaluate the if expression. It just
goes forwards and keeps running. So you have all of this parallelism that's happening.
So it has to be pretty accurate. And when it is, that's terrific. The way that it actually
manages all these history tables is by hashing the program counter,
which is just the instruction pointer, the address.
So this means that if you have a bunch of things that map to the same address, they
can actually overflow the buffers.
And then you get misses.
And so then the predictor doesn't work as well.
So this is referred to as aliasing for branch predictors.
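For anyone who wants to see the branch predictor in action, here is a classic stand-alone demonstration (a standard benchmark trick, not something specific to Emery's work); with optimizations the compiler may turn the branch into a conditional move and flatten the difference, so treat the numbers as illustrative:

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>

    long long sum_big(const std::vector<int>& v) {
        long long sum = 0;
        for (int x : v)
            if (x >= 128) sum += x;   // this is the branch the predictor learns
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 24);
        std::mt19937 gen(42);
        std::uniform_int_distribution<int> dist(0, 255);
        for (int& x : data) x = dist(gen);

        auto time_it = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            long long s = sum_big(data);
            auto t1 = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
            std::printf("%s sum=%lld time=%lld ms\n", label, s, static_cast<long long>(ms));
        };

        time_it("unsorted (branch outcome ~random):");
        std::sort(data.begin(), data.end());
        time_it("sorted   (branch outcome predictable):");
    }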
But it's the same problem for caches, for the instruction-level caches, for data caches, for the TLB. It's a problem because the TLB maps your pages of virtual memory
to physical memory. So can you explain TLB, or tell us what TLB means? Sure, sure. So it's a
stupid name. Unfortunately, it's almost better to not know what it means. So it stands
for translation lookaside buffer. So again, a terrible name.
Like I told you the name and now you're like, now you understand, right?
Yeah, so it's a stupid name.
But the idea basically, you can think of it as just being a map.
It's just a map that maps the start address of a page that's in virtual memory to the
start address of the page that's actually the physical memory in your machine.
Okay.
Right?
So, you know, your machine has a bunch of RAM and it goes and it puts pages wherever.
And then your virtual memory, you have this abstraction, which is really pretty. Like
everything seems to be organized like in a contiguous way. But of course you're sharing
your, your thing with many processes and they're actually picking things that are discontinuous.
So you have to have this map. And so this map is, it's just a,
so it's actually stored in memory in its full glory, but there's a cache to that map.
And the cache to that map is the TLB.
That's all it is.
It really should be called virtual page cache or something,
but it's not.
And so if you don't,
if you have an application that actually spans more pages
than fit in that cache,
then on like an Intel, it will go out to a data structure that's in RAM. And this happens every
single time you access any data at all, or any instructions at all. So if you're in a situation
where it's in cache, then it's free. It returns typically in a cycle. So it's like essentially
invisible. If it goes to RAM, it could miss in the cache,
it could miss in the L3 cache, it could go all the way out to RAM, and it could take hundreds
of cycles. So imagine you could be in a situation where something accidentally takes 100x longer
just because of this failure to fit in the cache. Yeah. So this stuff is really nasty.
And it's, I think, poorly understood just how brittle performance can be.
You can change a line of code,
you can add another new,
you can restructure things,
you can change your makefile
so it has the .o files in different places
and this can lead to gigantic performance swings.
Yeah, so that kind of brings us
to some of the tooling you went over in your talk.
You talked about how performance is so brittle
and you introduced these tools
that can be used to analyze performance
in such a way to get around that brittleness.
Can you tell us a little bit about those?
Yeah, sure.
So I'm going to have to say up front
that the tool that we used for this experiment,
unfortunately, LLVM itself is a moving target.
And we relied on some internal plumbing because we were trying to make LLVM do things it's not meant to do.
And so LLVM changed.
And it's changed to an extent where it would require months of work for somebody to go and port it forwards.
And, of course, you port it forwards and then LLVM could change again.
So it's not the happiest story for a graduate student who wants to like get out
with a PhD. So unfortunately, it suffered from bit rot. We've talked about reviving it,
but it just seems like such a thankless task that we're just like, look, we did this. We
understand what the performance is. We know how to do this. If somebody wants to actually do this
as an industrial tool or bring this into LLVM, that'd be great. But we're not going to do it. So let me just explain what it does. So the problem, like I described, is things in memory
shift around. Your performance can go one way or the other. And you can be like, oh, I had an
awesome performance increase because I'm a genius. Or you change something like, oh, oh, darn.
Things went south, like performance fell apart. I need to, you know, I have this giant regression.
I better roll it back, but it could just be a complete accident. And it's just depending on
where things ended up getting laid out in memory. And, you know, this can even be affected by what
directory you're in, what the day of the week is, because it adjusts your environment variables and
your environment variables shift the stack. Yeah. Freaking crazy. I actually had code once. I didn't
talk about this in the talk, but I had code that ran faster on Wednesdays than on Tuesdays. And you were able
to actually, like, quantify this? Yeah, yeah, yeah. I actually changed the clock and went back to
Tuesday, and that was the problem. It was the day of the week that somebody
was storing in an environment variable, and the length of the string: literally W-E-D-N-E-S-D-A-Y is longer
than Tuesday. Yeah, yeah. So, yeah, you know, the moral of the story is obviously only program on
Wednesdays. So anyway... Wait, Monday should be faster yet then, right? Yeah, yeah, but
well, but then it just screws everything up. No. Anyway. Yeah, so it's obviously terrible. This is
a bad situation. So the challenge is, how do you tease apart the thing you did with the code, even what a compiler
does when it optimizes the code, from where the stuff got laid out in memory and all these
interactions below. So what we did is we built this system that we jokingly called Stabilizer,
because it actually, like, messes everything up. What it does is it randomly moves everything in
memory periodically during the runtime of the program.
And the reason that we do this, first, if you just do random at startup, that's not enough.
That's just like a fixed layout.
And it doesn't, like the effects will still manifest.
Like you ended up in one layout.
So what you want to do is the moral equivalent of a randomized control trial.
You're basically randomizing everything.
We randomize where the globals are, where the functions are. We have a randomizing heap. So when you allocate
new objects, it's a little bit decorrelated with where the previous one had been freed.
And then you can actually, you know, run the code a bunch of times, try it with your optimization or
whatever your code change is, try it with, you know, try it with something else, and then you can compare,
and you know whatever the change is
has nothing to do with the layout.
Okay, so you mentioned that this tool
has suffered from some bit rot.
Do you know if there's any other similar tools
out there that someone could try if they wanted to?
Yeah, that's a great question.
So Stabilizer is super extensive in what it does.
And this is why it relied on LVM. Like it literally changes where the stacks are laid out.
And that's, you know, that's right at the heart of the compiler, um, you know, generating stacks.
So, um, that said, we have a few randomizing allocators that we've built for various purposes
and any of these undermines the effect of layout in the heap. So it doesn't affect the stack, it doesn't affect the globals, it doesn't affect functions, but it changes where objects will be laid out in the heap, and so that particular kind of confounding factor goes away. One of these is called DieHard, which we built actually for reliability. If you have a program with memory errors, DieHard makes it probabilistically
likely that your program will run correctly. But as a side effect, it also, well, it's part of the
whole thing is it's randomizing the location of things. Let me, I see lots of puzzled faces. So
let me explain how it helps really quickly. So the way that it helps with use after free errors
or dangling pointer errors, whatever you want to call them, where you free something too soon and it gets recycled.
So a conventional allocator, when you free something, it's immediately available for reclamation.
And then when you call new, it's almost certain to be the next object.
So you call delete, you call new, you probably get that same object right back, which is the worst possible situation.
If you had something pointing in that wanted the old value,
it gets immediately clobbered, right?
And so this is the problem that garbage collection solves.
Like garbage collection makes sure nobody has a pointer into anything
before anything gets reclaimed, right?
So what DieHard does is it actually has a bitmap-based allocator,
and it randomly chooses among all of the freed objects
for the next object to be used.
So when you call delete, it just sets a bit, and the bit is set to zero. So zero means it's free.
And then it just says, hey, find me a zero, and it randomly pokes into this bitmap. And if it
finds a zero, it returns that object. So if you have a very large heap, suppose you have a million
objects on the heap, a million objects have been freed. When you call new, you have a one in a million chance of clobbering the object you just
freed, right? Because you're basically like, you know, here's a million bits in my hand,
right? Like, here's the one that you freed, there's a million other ones,
right? So, one in a million chance. And that chance continues to exist, right? So it's not
going to be... you know, maybe it's not going to survive a million news.
But it could actually survive in principle for a very, very long time.
So that's one thing that Die Hard does.
The other thing, which is maybe even easier to understand, is that the heap that it allocates is somewhat larger than required.
And then it randomly places the objects in memory.
So there's a likelihood that if you have an overflow,
it'll overflow into nothing.
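To make the bitmap idea concrete, here is a toy sketch of a randomized fixed-size-object pool; it is nothing like the real DieHard code, just an illustration of "pick a random free slot" versus "hand back the most recently freed one":

    #include <cstddef>
    #include <random>
    #include <vector>

    // Toy fixed-size-object pool: one bit per slot, false = free, true = in use.
    // Allocation picks a *random* free slot instead of the most recently freed
    // one, so a dangling pointer is unlikely to be clobbered right away.
    // (Assumes the pool is never completely full; the real DieHard over-provisions.)
    class RandomPool {
        std::vector<char> storage_;   // count slots of object_size_ bytes each
        std::vector<bool> in_use_;    // the "bitmap"
        std::size_t object_size_;
        std::mt19937 rng_{12345};
    public:
        RandomPool(std::size_t object_size, std::size_t count)
            : storage_(object_size * count), in_use_(count, false),
              object_size_(object_size) {}

        void* allocate() {
            std::uniform_int_distribution<std::size_t> pick(0, in_use_.size() - 1);
            for (;;) {                      // expected O(1) probes while the pool stays sparse
                std::size_t i = pick(rng_);
                if (!in_use_[i]) {
                    in_use_[i] = true;
                    return &storage_[i * object_size_];
                }
            }
        }

        void deallocate(void* p) {
            std::size_t i = (static_cast<char*>(p) - storage_.data()) / object_size_;
            in_use_[i] = false;             // just clear the bit; don't hand it straight back
        }
    };

    int main() {
        RandomPool pool(32, 1024);
        void* a = pool.allocate();
        pool.deallocate(a);
        void* b = pool.allocate();   // very unlikely to be the same slot as 'a'
        (void)b;
    }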
Okay.
Okay.
I get it.
I understand what you're saying and I understand why it makes the program more stable,
but my,
the rest of my brain says,
but I wish that it made the program less stable so I could use it to find
these kinds of errors.
Totally.
Totally.
I, so, um, so this is actually part of a line of work
that we did. Some of this found its way into Windows.
As Rob mentioned, I've spent actually a lot of time at Microsoft.
And so Microsoft had this genius thing that they did, which is an adaptation
of this idea that they called the fault-tolerant heap. So they would detect
if a program had been
crashing more than a certain number of times over a certain period, they would swap out the heap
with a heap kind of like DieHard, inspired by DieHard, that would then, like... it's sort of like,
all right, there's pain that this person is experiencing. Let's see if we can do something
to lessen the pain. And then we built other stuff that follows on that actually is designed to automatically find the bugs and fix them.
So we had a follow-on paper called Exterminator, which builds on DieHard to do that.
And then we have another paper called DieHarder.
You can see that I really liked... I like Bruce Willis a lot.
And so DieHarder is a secure allocator. And so DieHarder actually is the opposite.
It's more like what you're asking for, Jason, in a way, which is it makes it very, very unlikely that you'll have any information that you can leverage for an attack.
And so one of the things that it does is it randomly allocates things, but everything is a chunk that's separated in space, in virtual address space.
So it's super
far away from the next chunk and it's randomly located. So if you try, if you do a buffer overflow,
it's very likely you'll segfault. Interesting. Okay. So I know this is not really why I got
brought on to talk about stuff today, but anyway, this is how this goes sometimes. Yeah. Yeah. So
yeah, I mean, we actually also have, I mean, inside of exterminator, we have a thing
called DieFast, which is designed to do what it sounds like.
But, but the secret sauce with exterminator is if you have a program and you run a program
a bunch of times and you normally look at the heap, like suppose it was deterministic,
the heap is the same, like suppose you hit the exact same error, you know, five times in a row. If you look at the heap, the heap state
is identical. So it gives you no information. You could run it five times, a thousand times,
you're getting the same heap over and over and over again. But by using the randomization,
all the heaps are different. And so you can actually identify like, oh, it's this, when this
object is this many bytes before this object,
that's when the thing fails. So it's a buffer overflow of this size from this object. And then
we could use that information to basically create these things so that when you run the program,
again, it would patch the allocator. So we would say, hey, when you allocate something at this
line of code, pad it by this many bytes.
Wow.
So you can use that information to send it home to the developer and also keep the program running.
So win-win.
I have no idea if you said you've spent a lot of time at Microsoft Research.
I don't know if you can speak to this at all.
But as you're telling these stories, I'm thinking,
is this how we still get some of these old Win32, Windows 3.1 applications that can still run on Windows 10?
Is it because they can do some of these tricks and just be like, pretend like it's all working fine?
Yeah, I wish I could say yes, but the answer is no.
I mean, they've done a lot of engineering, obviously, to keep these things alive.
I will say this.
One of the things that is pretty funny, when I first went to Microsoft when I was a PhD student, I was like, hey, I have this super fast allocator.
Your allocator is garbage.
I'm going to replace the Windows allocator.
I'll speed up all Microsoft products.
You know, woo, win.
And what I found was, so then I had access to, like, Microsoft's code, right?
So I could recompile things.
So I went to recompile some Office code and some, I think, SQL Server, and everything crashed.
Every single thing crashed as soon as I replaced the memory allocator.
And I was like, oh, my God, I guess my memory allocator has a bug.
But that was not the problem.
The problem was that everybody who had written this code had debugged their code with Microsoft's allocator.
And so, for example, if you were using Microsoft allocator,
suppose you requested 24 bytes, this is all made up, but this is the idea. You request 24 bytes,
Microsoft allocator might've rounded it up to 32. So you have an overflow from the 24 bytes,
25th, 26th, 27th, 28th, like you're writing way past the end. Doesn't matter if it lands in the
32. So you had all these latent errors that were lying around because of course they debugged
it with the allocator they had.
But as soon as I replaced it with another allocator that didn't have those exact same
sizes and the exact same way it reused memory, then everything fell apart.
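A tiny made-up illustration of that kind of latent bug (the sizes are invented, exactly as Emery says):

    #include <cstdlib>
    #include <cstring>

    int main() {
        // Request 24 bytes. Many allocators round this up to a larger size class
        // (say 32 bytes), so writing a few bytes past the end "works" by accident.
        // It is still undefined behavior, and it blows up the moment you switch to
        // an allocator with different size classes or different reuse behavior.
        char* p = static_cast<char*>(std::malloc(24));
        std::memset(p, 0, 28);   // off the end: a latent bug hidden by the rounding
        std::free(p);
    }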
Wow.
Yeah.
So this is, it's an interesting, it was a good lesson to learn.
It's like, you know, legacy software is hard. And maintaining things forever is hard. And it means you can't often, you know, move forwards, because you're stuck in this situation. You can't just be like, hey, could we just stop your billion dollar revenue stream for a few weeks and fix this? Like, that's not how this works. Right.
Well, so, we started this... so the scary part of your talk, and the thing that you've kind of touched on, is when you said, like,
the day of the week can change how fast the program runs. And then we talked about
Stabilizer, right? And then you said... just for reviewing in my own mind, to make sure I got
this all right: the other products that you have that do still work today
without requiring Herculean effort
to port them to a current LLVM
all affect the heap.
And so I was curious
how much the stack layout and memory
is important versus the heap layout and memory
for stable execution timing runs and these kind of
things? Yeah, yeah, that's a good question. I mean, I think I mean, I'm not sure I can give you
a solid answer. I will say this: I mean, the stack is always allocated by the compiler as
a contiguous chunk, right? So, you know, you've got all your local variables, they all appear in the
stack frame, and they're all together. So this means that, you know, they're all going to be in cache
almost certainly, right? So that's something that, you know, takes away a certain
performance question. The cache is almost always hot, right? Because you're accessing the
cache, you're executing functions, you're just visiting the same memory over and over and over
again. So the memory at, you know, depending on your definition, the top
or the bottom of the stack is always hot, right? So that means that the stack has less performance
impact in general than the heap, because the heap is kind of like, you have many objects,
they're spread around, it's all dependent. Like if I allocate one more object here or a different sized object, then it can change the whole layout of everything.
So it's a lot more brittle than the stack.
That said, the start of the stack, we found significant impact just from moving it around.
And in fact, the stack is exactly what gets moved by the environment variable shift.
Okay.
Yeah.
Interesting.
So that part is fixable, by the way, in GCC, or at least in LD.
You can tell it, you can give it a linker script.
This is super obscure, but you can give it a linker script, and you can align the segments
of where things get mapped to page boundaries, and then this brittleness goes away.
It may not be a good, it may be that when you align it, it actually is slower than when you don't align it because of random effects. But it guarantees
much more reliability. So that's something they could do today that they just don't do for some
reason. And telling people, hey, put in this, you know, scary compiler, like linker flags or
linker script stuff is maybe a bridge too far for some people, but that can definitely make
the performance a lot less susceptible to that kind of issue.
Wow.
Today's sponsor is the PVS-Studio team.
The company develops the PVS-Studio static code analyzer, designed to detect errors in
the code of programs written in C, C++, C#, and Java.
Recently, the team has released a new analyzer version.
In addition to working under Windows,
the C Sharp part of the analyzer can also operate under Linux and macOS.
However, for C++ programmers, it will be much more interesting to find out
that now you can experiment with the analyzer in online mode
on the godbolt.org website.
The project is called Compiler Explorer
and lets you easily try various compilers and code analyzers.
This is an indispensable tool for studying the capabilities of compilers.
Besides that, it's a handy assistant when it comes to demonstration of code examples.
You'll find all the links in the description of this episode.
So you mentioned that Stabilizer can't really be used today,
but if you want to go back to an earlier version of LLVM,
could you build and run a program with Stabilizer?
And is that still worth doing
if you want to do some profiling?
Yeah, I mean, you could do it.
I'm not sure if it's good or not.
I mean, LLVM has moved on, right?
So maybe the code gen is a lot better.
I don't think it's a gigantic difference,
but it's going to be some difference.
There's going to have been bug fixes and so on. I mean, obviously, people are welcome to do it.
There's the specific version of LLVM. If you go to the GitHub site, right, it says, you know,
here's all the information you need. But honestly, I think the, you know, using some sort of
randomizing heap is probably the easiest thing for somebody to do, you know, just to try to iron out these things. But at the end of the day, maybe it makes the
most sense to just be like, look, performance is going to go up and down. It's going to change a
lot. And it could change by huge amounts. And so I need to be really careful when I get a regression,
like, one of my former students at UMass worked on the V8 project at Google, the JIT compiler for
JavaScript, and he was saying that they would actually roll back things that caused a performance
regression on their benchmarks of one percent. And I was like, dude, no, that is ridiculous, right? Like,
one percent? It's like, I mean, if you jump up and down in the same room where the program
is running, right, it's like maybe it'll warm up the temperature by one degree, and then it'll
throttle down the CPU, right? That's just crazy noise. You can't possibly be making, like, software
engineering decisions on the back of a one percent change. That's just freaking noise. Don't do that.
I feel like you just need to repeat that again, because I know lots of people who,
you know, live on the 1%. They're constantly aiming to get the half a percent faster, the
microseconds. I mean, you know, we talked about high frequency traders on the show,
we have them on that kind of thing. So you're saying just that's just crazy talk,
you have no way of knowing if
that 1% is real. It's exceedingly unlikely. I mean, I think again, like literally, when a CPU
gets warmer, it lowers its clock speed. Right? So just ambient temperature changes are going to
affect the performance of your program. This is noise. Run your program a few
times, see what the noise is. If you're, you know, uh,
and even then you're still running it in the exact same configuration. This is something that I
stress. People will be like, Oh, I ran my program 30 times and only has this much noise. It's like,
yes, but it's always 30 times in the exact same layout. And as soon as you change a tiny little
thing, right, somebody blows on that house, it's going to all fall down. Um, it's yeah,
it's kind of crazy. I mean, you know,
people talk about like, performance optimization being, they kind of misquote Knuth, right? He
says, like, you know, basically, don't optimize, it's the root of all evil. That's
the line: premature optimization is the root of all evil. And then he immediately followed it up by saying,
but sometimes it really matters. So sometimes it does really
matter. And you know, if you can get consistently significant improvements, then great. But I do
think that, I'm going to give you a very simple example. So when I was working on Horde, I was
like, okay, I'm going to try to, to like speed up this particular path. And because replacing the
memory allocator screws with all your tools, like all your tools cease to work.
So I'm basically putting in some other way to track which paths I went down.
So I literally had a printf on a path to see if this path was taken.
The path was never taken.
Sped up the program by 15%.
Adding the printf, then.
Yes.
Yes, adding the printf.
And for a moment, I was tempted to leave in the printf.
Right? Because I was like, like, this speeds it up by 15%. But of course, you know, if I change
any other code, maybe it will, you know, so that that will go away. So like having code that
literally never executes can speed up or slow down your program by, you know, up to 40%. So
focusing on 0.5% is probably crazy.
So do you find that, like, in extremely large, complicated systems, this kind of printf
somewhere is going to have less impact or more impact? Or is it still just completely
you have no way of knowing?
So yes to all three. So I will say one thing that definitely has the effect of smoothing things out
is concurrency. So if you have a multi-threaded program, and the multi-threaded program is
constantly running different threads that are not being synchronized, right? So it's doing a good
job. Everything is actually running more or less independently. This is going to smooth things out
because everybody is just
kind of running code and that code is changing when it's executing and what it's operating on
all the time. It's like an electron cloud, right? It's just fuzz. And so, you know, what people
often do with performance optimization, they'll be like, let's turn off all the, you know,
the non-determinism. Let's like, you know, lock everything down and let's make it, you know,
kill all the background processes and, you it, make it run the exact same job
over and over again. But that's, that's no good. You're like missing all the noise that's inherent
in the system. And so you'll get, it will be noisier, but it will also tend to cancel out a
lot of effects. Okay. All right. So, I mean, you can, like, you can think of it, you're going to
have some peaks and you're going to have some valleys. And then those peaks and valleys are going to kind of smooth out and you'll get just kind of a mean.
So it's actually, in that situation, probably a lot better in a highly complex app if it is full of stuff that is asynchronous.
So which brings us to Coz, then.
Yeah.
Okay.
Do you want to tell us a little bit about Coz?
Sure. So basically, what we discovered when we were doing some performance analysis and
research into things is that the existing profilers that were out there really didn't
help. And the reason that they didn't help was they were kind of designed for the applications
of the 80s or earlier. So if you have a sequential program that you just care about how
long it takes from the start to the end, then these programs, these profilers are fine. They're
not great, but they're fine. You know, they tell you like, oh, here's the line, if you're lucky,
the line of code, you know, how much time it spent on the line of code, you know, how often that line
of code is executed, and that can help point you to a place to optimize your code, right? So this is how
classically prof worked from Unix and then gprof, which is included with GNU.
And they've been improved for concurrent programs. There's some stuff on finding critical paths.
The problem with the critical path is that there's no, well, there might be a critical path,
which just means the longest amount, like from start to finish, the longest sequence of code, right? That's the critical path. And so in principle, you should always optimize the critical path. Because if you have a concurrent program, like the critical path is what's slowing everything down. Like if everything is finishing super fast, and there's the one thing that takes a long time, that's the blocker, right? That's the bottleneck. The problem is in a real program, if you focus all your efforts on one critical path, it's like whack-a-mole.
Like that critical path goes away and then another thing becomes a critical path.
It's not like you suddenly went from, oh, I got the critical path and now my program runs 10 times
faster. It could be much worse. It could be like, here's a critical path, here's the other critical
path, except it's 0.0001 seconds faster, right? Imagine
if you went and spent weeks working on critical path one, and then you're done and you optimize
it, you could optimize it to zero. And then critical path two will mean that you actually
had no impact at all. And then, you know, we also care about other things these days. Like we don't
care, like start up a program and see how long it takes to run. Like you don't start up your browser and, and like optimize it. So it quits faster,
right. Um, or your server, right. You have programs that run forever and you care about
things like latency and throughput, not like program total execution time. And profilers just
tend not to do like, that's not what they do. Like profilers, traditional profilers are like
start to end. That's all I care about.
So we were like, what can we do to attack this problem?
And what we were looking for was some way
where we could have a profiler just tell us,
here is what would happen
if you optimize this line of code.
What would the impact be on latency
or what would the impact be on throughput?
And what we wanted, ideally, was like a graph
where we could literally be like,
here on the x-axis is how much I optimize this line of code from zero to a hundred percent. And then on the Y axis is how much does the overall program speed up or how much does
latency decrease or how much does throughput increase? And so if you have a graph that looks
like this, right, you would never optimize that line of code at all. This is a flat line,
right? A flat line says, no matter how much I speed up this line of code,
program still takes, it's unaffected.
It doesn't affect performance.
It's not on a critical path.
Maybe it's IO bound, who knows?
But if you had one where it was like,
man, you optimize that line of code by 10%
and your program speeds up by a large factor,
you would definitely work on that line of code.
So what we were looking for was,
so this thing we call a causal profile. So it tells you, if you do this, it will have this effect for sure,
as opposed to the profilers in the past, where, well, they tell you where something spent its time,
but it's kind of just like, that's where it spent its time. But that doesn't mean if you change it,
it's going to make things faster. Right. So that's what led to Cause. So Cause, C-O-Z, is a causal profiler for
Linux. And it gets these graphs through a kind of trickery. So you can't really know, just by
looking at a line of code, how much performance would increase by optimizing it from zero to 100%.
If you could optimize everything by zero to 100%, you would just optimize everything by 100%, right?
Like, who needs Coz?
Just use that.
So instead, what it does,
it basically takes advantage of this kind of nice insight,
which is that you can get the effect
of speeding something up
by slowing everything else down.
So if I've got some line of code,
what I can do is I can look around and see all the other
threads that are running and tell them to wait for a certain amount of time. So I literally just
signal them. And then they pause for some amount of time. And then I can, and I do this with
sampling, I don't actually run the thing forever. I just hit it for a little bit, slow everything
down with kind of a pulse. And then I can observe the effect out the other side. And so that's what Coz does. It randomly
injects these delays, which it does with sampling. So it doesn't have much effect on overall runtime.
Like, you can run Coz in production. And yet it produces these profiles that you can send
out on a socket if you want. And you can look at the performance profile and say,
Oh, here's these lines of code that I really should be working on.
So is it just for multi-threaded applications?
Good question.
So you can run it for a single-threaded application.
I personally, obviously I'm super biased,
but I actually use it even for sequential code.
It's just convenient to get this result where you get these causal graphs,
as opposed to like poring over, you know, perf results or something. It really shines though,
when you have concurrency, when you have asynchrony. So it doesn't have to have multiple
threads; using async I/O, for example, will also, you know, introduce concurrency, right? So
you could have like an event-driven
server that conceptually has no threads, but the OS has threads. So concurrency is in there and
Coz will work better on that. And then anytime you care about latency or throughput, a conventional
profiler has nothing to say about those things. So with Coz, you can actually say, here's the
start of something and here's the end. Like imagine it was just a sequential server that just took in an input, did something with it and produced a result.
Right. So classic kind of event loop.
You could just say, here's the begin point.
So we call them progress points.
This is progress begin.
This is progress end.
And then what Coz will do, it will try to find the lines of code that will reduce the latency. So for a game developer, would it be fair to say, like, the start of a game tick is the begin?
Do you do all the calculations, whatever, before you like dump it to the screen?
That's the end.
Do we actually have to put, like, markers in our code to tell Coz that this is what we care about?
Yeah, yeah, you have to do that.
So there's literally just three macros.
One is a progress point, which is for throughput. And the others are begin and end, which are for latency. So you do have to say something. But I think most, you know, if you're a game developer, you probably know where the start is of when you're actually doing the calculations and where the end is. And so you literally just put in a line at the beginning, a line at the end, you can have multiple ones too. So you can give them, um, identifiers if you want. Okay. And then we run it and it produces a magic graph that says
Okay. And then we run it and it produces a magic graph that says, speed up line 12 and pretty much it'll be faster. Yeah, yeah, exactly. Yeah. I mean, we actually...
it was funny, because we built it, and we're like, you know, there's some, like,
light theorems in the paper. They're not super deep. But anyway, we were like, okay,
so we have mathematical proofs in effect that this is going to work. And we demonstrated with
some simple programs. And then it was like, all right, here's the plan. I told my student who's
now a professor at Grinnell College, Charlie Curtsinger, I was like, Charlie, go and take
this benchmark suite of concurrent multi-threaded programs that Intel put together with Princeton and spend no more than an hour on each program and see how much you can optimize it
with Coz. And so these are programs he'd never even looked at. So he had no idea at all of what
was going on inside. He ran the programs, looked at the code, very quickly found places where he
could insert these progress points and came out the other side with optimizations ranging from 10% to 70%.
Wow.
And Coz doesn't rely on any weird LLVM internals or anything like that,
so it should be around for a while?
Yeah, yeah, yeah.
So it's pretty stable.
I mean, you can go and install it like, you know, with apt or snap or what have you.
It's, it relies on this package that Austin Clements put together.
He's the head of Go development at Google.
It's called libelfin.
So it manages, like, reading stuff out of ELF format, which is something that it relies on.
And that's it.
And so I happen to know Austin,
and I was like, there was some little change we needed him to make because they changed
the ELF format, so I was like, please do this. And so he did it, which is great. But yeah,
that's it. So it's good to have few dependencies. Do we need to compile with -g to get symbols, or... Right. Um, so,
so let's see. So it can find, if I recall correctly, it can actually find, like, whatever information is out there. But I don't think that the... yeah, I actually can't remember. I always use -g when I do it.
It doesn't matter if it's -O3 or whatever. Like, that's just fine. It's totally fine.
It does line-granularity stuff even though it is optimized, so that part is totally fine. You may
need to use -g. Like I said, we always use it. I always use it. Yeah, I have other friends who
recommend just always using it, because why not? You're going to want debugging symbols at some point.
Exactly. It doesn't affect the resulting binary, right? I mean, it just produces this stuff in
DWARF format, and you can just access it.
I should also add, even though this is a, you know, C++ podcast, like, we're among friends, we can talk about other languages.
It works for C, of course. It works for Rust as well.
Okay.
And somebody made a version of it for Java,
so there's a version called JCoz
that works for Java programs.
And in principle, it could work for anything
that generates debugging output.
So it's possible to do for JavaScript,
but we just haven't done it.
Interesting.
Yeah, very cool.
Well, I think we are running a little low on time, Emery,
but is there anything else you want to share with us
before we let you go?
Jeez.
I've shared so much. You have, you have. So yeah, I mean, the only thing
I would say is, obviously, you know, we welcome feedback about these things. So you know, for
listeners out there who go and use Coz, you know, please, you know, if you discover some issue,
let us know on GitHub. If you optimize something and you have a great experience with Coz, we'd love to hear about that.
If you discover some use cases that Coz doesn't work for, there's some other interesting things.
We're always interested in hearing from real developers out there solving real problems.
Awesome. Very cool. Thanks, Emery.
All right. Thanks so much, guys.
Thanks.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow
me at Rob W Irving and Jason at Lefticus on Twitter. We'd also like to thank all our patrons
who help support the show through Patreon. If you'd like to support us on Patreon you can do
so at patreon.com slash CppCast and of course you can find all that info and the show notes
on the podcast website at cppcast.com. Theme music for this episode was provided by podcastthemes.com.