Two's Complement - Performance
Episode Date: February 3, 2022
Our most efficient podcast ever. Ben and Matt talk performance testing and optimization in fewer than 30 minutes.
Transcript
I'm Matt Godbolt, and I'm Ben Rady, and this is Two's Complement, a programming podcast.
Hey Ben. Hey Matt. How are you doing? I'm doing great. Awesome. Well, I've been thinking a lot about performance recently.
It's not something I've been worrying about for a while.
I've been doing more functional stuff.
But now we're sort of trying to think about how best to measure, track, and improve the performance of code I'm working on. And I wondered, you know, we should just talk about it because, frankly, that's what we do on this podcast
is just talk about the things we're doing in our day job
in some way or other or allude to them.
I have been thinking about the exact same thing, actually.
So this is good. This is good.
Yeah, make it right and then make it fast.
Well, the make it fast part.
Right, right, right, right.
Measure it, right?
Measure. Well, I think that's the key thing, I think.
You know, I think if there's, and I think we've mentioned this before,
one of the key things is always measure, never trust your instincts,
because no matter how many years you've been doing this,
and I've been doing this for nearly three decades on and off,
you're always wrong.
And even when you account for the fact that you know that you're always wrong,
you're like, well, I know I'm always wrong.
So it won't be A, it must be the B thing.
And then you measure it and it's some other thing.
All right, I think I've got a presentation.
I don't know, is this on my YouTube?
I don't know.
But there's a presentation I give
when I go around to universities
when we're trying to, like, encourage graduates,
newly graduated folks, to apply.
And I do like a deep dive on like a C++ optimization problem.
And I set it up.
And the funny thing is I set it up for myself thinking,
ha ha, everyone will expect A, but it will be B.
And then I ran it.
Of course, it was neither of those things.
It was something I'd never even heard of suddenly rearing its head.
I'm like, this is amazing.
This is exactly what I want to do.
So that's what the presentation is about.
And so always measure.
I think we all know that. And, you know, profiling is one thing, but
what's your general approach to performance?
I mean, I think we hit on this a little bit already, but I think you've got to
really start with the basics, which is, you know, don't be superstitious, right?
What do you mean by that?
I mean, that's intriguing, right?
I mean, the superstitions, there's so many of them.
You hear them all the time.
It's like, oh, you can't use this structure or this call
or this function because it's too slow, right?
That's the kind of stuff, right?
And all of that stuff is sometimes true,
but it's always context dependent. So if it was true, it was only because
you got lucky, right? It's like, A is slower than B. Well, you know, there's a 50% chance you're right,
that sort of thing. Right, right. So, don't be superstitious, right?
And part of that, and so I think part of the reason people fall into that trap is they're like,
well, but if I don't learn what things are good, what things are fast, what things are slow,
which things are memory efficient, which things aren't, if I don't learn these things,
how am I going to write any code without performance testing every single line of it as I write it, right? And I think that is where you get the very old, you know,
gray beard programmer wisdom of first make it right and then make it fast. Because the first
thing in measuring is figuring out which parts of the code actually matter, because a lot of them
just don't, right? And if you're doing sort of superstitious memory tricks in your command line parsing,
your arg parsing, it doesn't matter.
I'm with you.
Now I get what you mean by that.
I would sort of take the other side of that just a little bit.
There are some things that are, in particular in C++, that you could do quote unquote wrong
forever.
And it doesn't really matter.
But if you've written an entire
application where you've passed everything as a string by value or whatever, then you profile it
and it's like, why is my application slow? It's not slow anywhere in particular, it's just slow
everywhere, because you've done something that's kind of harmed you in the approach you've taken,
and it can be hard to unpick that after the fact, even if it's right.
Right, absolutely, absolutely. So in those cases, I think the key there is you still need to measure.
And something that I do a lot with these kinds of things,
this is like a classic example of this.
Let's say that you have, I'm going to just pick on Java for a minute.
Let's say that you're...
Other languages are available for picking on.
Other languages are, and this technique would work in any of those languages.
But let's say that you've got some data structure in the JDK
that you want to consider using, some list implementation or something like that.
And you're like, I want to go and write my own version of this
instead of using the JDK one because my own version will be faster, right?
Let's just say that you want to make that claim.
We've all made that mistake before now. X is slow, I'll just write my own, and then very quickly you discover, oh, it's harder
than I thought. Or... Yes, yes. Sorry, continue. So if you're going to make that very bold claim, right,
I would submit that what you need to do is build a performance test that shows the
performance of the built-in thing, the performance
of your new fancy thing
at the most basic level.
Make sure that when you run that once
you're actually correct that it is faster
in that context, but also
keep that around and have it part
of your pipeline so that as
you, again, in the case
of Java, upgrade your JDK
or run on a different architecture or change some Java heap setting or whatever it is, that it holds true in all situations.
Because guess what?
The guys that wrote the JDK thought a lot about all of those different situations and you maybe have not, right?
Right.
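(For illustration, a rough sketch of what that kind of comparison benchmark might look like, using C++ and Google Benchmark as a stand-in for the Java/JMH setup being discussed; `MyList` and its header are hypothetical names for the hand-rolled container under test.)

```cpp
#include <benchmark/benchmark.h>
#include <vector>

#include "my_list.h"  // hypothetical header providing MyList<T>

// Baseline: the built-in container we claim to beat.
static void BM_StdVectorPushBack(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    for (int i = 0; i < state.range(0); ++i) v.push_back(i);
    benchmark::DoNotOptimize(v.data());  // keep the work observable
  }
}
BENCHMARK(BM_StdVectorPushBack)->Arg(1000)->Arg(100000);

// Challenger: the "my version will be faster" container.
static void BM_MyListPushBack(benchmark::State& state) {
  for (auto _ : state) {
    MyList<int> v;
    for (int i = 0; i < state.range(0); ++i) v.push_back(i);
    benchmark::DoNotOptimize(&v);
  }
}
BENCHMARK(BM_MyListPushBack)->Arg(1000)->Arg(100000);

BENCHMARK_MAIN();
```

The point is less the specific benchmark and more that it lives in the repository and runs in the pipeline, so the "mine is faster" claim gets re-checked whenever the toolchain, runtime, or hardware changes.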
So writing those tests, though, that's probably the real difficult thing, right?
In my experience.
It is difficult. And I mean, if we're picking...
you know, you picked Java.
Java has some extra fun and games
on top of, like, what I would normally worry about
when I'm writing benchmarks or similar tests
that are trying to measure performance,
inasmuch as, you know, the JVM and the JIT
are going to come along and optimize
and give you even more of a hard time
than your equivalent in other compiled
languages. Although they have their own problems too, right? Like the number of times that one's
written a benchmark around a C++ thing, only to discover that because you passed in constant
values to your tests, the compiler has just gone, you know what, the answer is 12, we're done here.
No code runs at all, right? Yeah. So there are definitely challenges in this.
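(A minimal sketch of that failure mode and the usual fix, using Google Benchmark; `my_sqrt` is just a stand-in for whatever function is being measured.)

```cpp
#include <benchmark/benchmark.h>
#include <cmath>

// Stand-in for the function under test.
static double my_sqrt(double x) { return std::sqrt(x); }

// Broken: the input is a compile-time constant, so the compiler can evaluate
// my_sqrt(144.0) once (or delete it entirely) and the loop measures nothing.
static void BM_ConstantFolded(benchmark::State& state) {
  for (auto _ : state) {
    double r = my_sqrt(144.0);
    (void)r;  // result unused; the whole call can be optimized away
  }
}
BENCHMARK(BM_ConstantFolded);

// Better: hide the input from the optimizer and keep the result alive.
static void BM_Honest(benchmark::State& state) {
  double x = 144.0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(x);  // input is no longer "known" to the compiler
    double r = my_sqrt(x);
    benchmark::DoNotOptimize(r);  // result must actually be materialized
  }
}
BENCHMARK(BM_Honest);

BENCHMARK_MAIN();
```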
Yes, definitely. And, you know, microbenchmarks are great at showing very specific cases,
but they're often not necessarily very representative of your program at large.
There are macro effects sometimes,
especially if you're like,
oh, I can be more cache efficient with my implementation,
then you might discover that actually that all goes out the window
when you're actually running in a program
where someone's memcpy-ing two gig of data around
before they call your function, or whatever. You know, there's a lot of things to think about. So
how do you even go about writing that apparently straightforward, you know, just-write-the-test-of-my-list-versus-the-built-in thing?
Right. Well, no, it's not straightforward, and that
was sort of my point, which is: get the easy stuff out of the way. Don't be superstitious, right? Measure these things, don't assume that you're
super smarter than all the people that wrote the standard library, right? Like, don't do that.
But now, if you get all the easy stuff out of the way, now you're at the actual hard part,
the meat of the problem, which is, okay, you're going to write the
benchmark that shows that your super fancy implementation is faster. How do you do it in a way that's really real, right?
Fair as well, because it's like, hey, I tested it and it's faster. Good. Don't worry about it.
What happens if you pass null values? Oh, yeah, don't worry about that. There's some special
cases thing.
Right, right, right. So yeah, so to me, that is the tricky problem, right? One of the things that
I was just sort of, we were talking about this I think a month or so ago.
So I was kind of wondering, it's like, I wonder if you could use,
and this is of course on brand for me,
but it's like I wonder if you could use some of the generative
testing frameworks to generate.
Because you're just talking about like null values,
constant values, stuff like that.
And you can also generate random data, right?
But I was kind of wondering, you know, generative testing frameworks have these things
where you define, you know, a function that can create
a range of properties, can create different types for you. And
you can handle that stuff yourself, but maybe you could use some of those
to generate some data for performance testing. Or is that
totally bonkers?
I don't know. I mean, I've certainly, I've written stuff before
for like maps when we've been doing stuff
for the market data.
There's a lot of map accesses in the finance world
of adding and removing and looking things up
by giant 64-bit keys.
And if you're trying to do things fast,
there's a hundred different ways you can skin that.
And I, yeah, I've ended up generating effectively add, modify, remove loops in the generative thing.
And they have to balance: for every add,
there must be a remove at some point, but there could be some modifies in between.
So you write this sort of, as you say, generative approach.
And you're like, OK, here's 100,000 operations that I'm going to stream through.
And I can randomize them with a seed so that I know they're going to be the same each time to be somewhat fair.
And then I can run it through map choice one,
map choice two, map choice three
and get some idea that it's representative.
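(A rough sketch of that kind of seeded, balanced operation stream; `replay` can be pointed at any candidate map type, and the 40/40/20 mix is an arbitrary placeholder for whatever your measured production workload actually looks like.)

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One operation in the generated workload.
enum class Op { Add, Modify, Remove };
struct Step { Op op; std::uint64_t key; std::uint64_t value; };

// Generate a balanced stream: every Add is eventually matched by a Remove,
// with some Modifies in between. A fixed seed keeps runs comparable.
std::vector<Step> makeWorkload(std::size_t n, std::uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::vector<Step> steps;
  std::vector<std::uint64_t> live;  // keys currently in the map
  while (steps.size() < n) {
    const unsigned choice = rng() % 100;
    if (live.empty() || choice < 40) {              // ~40% adds
      std::uint64_t key = rng();
      live.push_back(key);
      steps.push_back({Op::Add, key, rng()});
    } else if (choice < 80) {                       // ~40% modifies
      steps.push_back({Op::Modify, live[rng() % live.size()], rng()});
    } else {                                        // ~20% removes
      std::size_t i = rng() % live.size();
      steps.push_back({Op::Remove, live[i], 0});
      live[i] = live.back();
      live.pop_back();
    }
  }
  // Drain anything still live so adds and removes balance out.
  for (auto key : live) steps.push_back({Op::Remove, key, 0});
  return steps;
}

// Replay the same stream through each candidate map implementation and time it.
template <typename Map>
void replay(Map& m, const std::vector<Step>& steps) {
  for (const auto& s : steps) {
    switch (s.op) {
      case Op::Add:    m.insert({s.key, s.value}); break;
      case Op::Modify: m[s.key] = s.value;         break;
      case Op::Remove: m.erase(s.key);             break;
    }
  }
}
```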
And maybe you could even measure your actual workload
and say, typically we have more modifies
than we have anything else,
or more adds and removes,
or get some kind of
smell about the data that tells you that it looks representative. But yeah, to your point,
generative stuff... And if it's not as complicated and path dependent as a
long-keyed map where you're adding and removing, if it's just, you know, hey, I've written my own square
root routine, then of course you can throw random numbers at it, or you can throw ascending numbers at it, or you can throw all those things at it, I suppose. But so, yeah, that sort of
covers some of the tricks and problems of generating representative data. Yeah, I haven't
actually tried... I have to put a disclaimer out there, though: I haven't actually gone
and gotten a generative testing framework and, like, plugged it into a performance test. And give me the
name of one. What am I forgetting? There's, um... Well, there's QuickCheck, which is the original one,
and then there's all the QuickCheck derivatives. Sure, QuickCheck would be enough
for me to remember... I knew that's what I was thinking of, I just couldn't... Right, yeah.
Yeah, yeah. I'm trying to remember some of the specific names, but a lot of them
are QuickCheck derivatives, so it's probably JCheck in Java, I imagine, and something-check in C++, or whatever. I'm sure it
rings a bell somewhere. Anyway. Yeah, no, that's cool. So that kind of somewhat covers that. I'm sure it's a huge
topic, and, you know, you and I have got a little experience in this, but maybe not as much. But
then the thing that immediately springs to mind when I'm doing this, these kinds of performance things, particularly in my world, is that I have the luxury of knowing what kind of computer system I'm going to be deploying my code onto.
In fact, sometimes I get to choose what system it is.
I get to pick the CPU and how much cache and all that kind of stuff.
But my development workstation may not be the same as that.
And my development workstation is running a thousand other processes: all the Javas
in the world for all my IDEs, my music player; it's got Chrome with 300 tabs open, and that's taking all
the CPU in the world for reasons I don't know, or I'm in a video conference with somebody, and, like...
You're mining cryptocurrency somewhere. Yes, yeah, something like that. Yeah. Or a page that
you're visiting is mining crypto. Oh, more likely, more likely, as it happens, yeah. And so obviously it's not necessarily representative. And
certainly another sort of key piece of advice, when people talk to me about performance
testing, other than, you know, measure everything, is: measure on a system that is actually
representative of where you're going to be running your code. Because it's so easy to sit there and kind of go, oh, I picked this map, it's perfect, it does the work
really well, and then you discover that, well, it works fine on your system, but with four
times as much cache and three NUMA nodes or whatever... Right. Three NUMA nodes? I don't know, four NUMA nodes,
come on. I don't know what's up with me. That's only two... Only ever powers of two. Exactly, yes. That,
um, that it doesn't work as well, or something else is more efficient because of it. And obviously this
is very micro-optimization based, and many people won't ever be worrying about this in general, but
it's something to be aware of, right? We've all run something and gone, oh, that seems
a lot slower, and then you run it again, like, oh, it's back to normal, it's fine. And sometimes that's
an interesting thing to know, because you're like, well, why is it so variable? And the answer is like, well, yeah,
I was mining cryptocurrency at the same time,
or something else happened on my computer.
So what all can one do to help in that regard, do you think?
Oh, man.
I mean, I think a lot of this ties into
how frequently you measure and check these things.
What you would want to do, what you want to do, I feel like,
is you have it as part of your regular
continuous deployment pipeline, right?
Right.
Every time I make a change to the code,
I want to run the unit test
to make sure that the behavior is correct,
and I would want to run some kind of performance test
to make sure the performance hasn't changed.
But that second part is way harder
than, you know, when I call my square root function,
does it give me the right result? So it's really tricky. I honestly can say, like, I know some ways to do this
that don't work, but I don't really have a lot of ways to do this where I'm like, yes, this is the way
you do it, right? Any sort of test where you are using time, where you're like, oh, well, this amount
of things should run in 10 seconds or less,
you're scheduling pain-in-the-butt brittle test debugging
for the future.
We've all seen the, this shouldn't take more than 10 seconds
time out on a shared build server
that's running in three layers of Docker
and 12 different hypervisors or whatever.
Right, right, right, right, right.
So, like, those things, I mean,
maybe someone somewhere can say that that works
for them. It has never worked for me. And every once in a while I'll try it again, just to
make sure it still doesn't work. Yeah. And it still doesn't work; the world has
not moved on. Yeah. So there's that. I haven't really looked at sort of other ways to measure
you know performance like something you know we were talking about the other day
measuring like instructions
like how many instructions
did this take the last time I ran it
how many is it taking now and maybe why is it
different but I've never actually done that
so I don't know
I have a friend who uses that
approach, at least as
a starting point. I mean, there are some
simulators that you can get that you can run your code through that will give you a fairly decent approximation,
as well as actually measuring the instructions. Obviously, if you measure the actual instructions,
you've got some amount of noise from the measurement system itself. But if you're using,
effectively, an emulator to run through your program at cycle level, and it's fast enough to
actually run, you can say,
well, how many instructions got processed? How many times did you access the cache?
That kind of thing. Now, again, that simulation may not represent reality, but it's something that is
measurable and deterministic, and that is kind of a good thing to have. So that's an option. Yeah.
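(For the "actually measure the instructions" option, here's a minimal Linux sketch using the perf_event_open syscall, assuming a kernel and perf_event_paranoid setting that allow it; the deterministic alternative would be running under a simulator such as Cachegrind.)

```cpp
#include <cstdint>
#include <cstdio>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper: there is no glibc function for this syscall.
static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
  return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
  perf_event_attr attr{};
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // retired instructions
  attr.disabled = 1;
  attr.exclude_kernel = 1;  // count user-space work only
  attr.exclude_hv = 1;

  int fd = static_cast<int>(perf_event_open(&attr, 0, -1, -1, 0));
  if (fd < 0) { perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  // ... the code under test goes here ...
  volatile std::uint64_t sum = 0;
  for (int i = 0; i < 1000000; ++i) sum += i;

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  std::uint64_t instructions = 0;
  read(fd, &instructions, sizeof(instructions));
  std::printf("retired instructions: %llu\n",
              static_cast<unsigned long long>(instructions));
  close(fd);
}
```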
Another thing that I've done in this world is have a special build agent that runs
on just that one machine that's only there to service your request, or it runs at three in the
morning and you're pretty much guaranteed that nothing else is running on that agent.
And rather than erroring on it, you just graph it. And that does require human persistence to kind
of look over and say, oh, wait a second,
when did that spike up? Right. And, you know, maybe you could use some kind of anomaly detection
type stuff. Yeah, to say, hey, you know, the last four readings have been within this band,
and suddenly it's outside of it. In both directions, right? You know, sometimes the
too-good-to-be-true error is worth it.
Like, hey, suddenly this went faster.
What happened?
Oh, we upgraded the compiler.
All right, now it's just not – the compiler's too smart for us now.
It's not actually testing anything useful anymore.
Yeah, it's not testing anything useful anymore.
Yeah, I almost wonder if some of the – and, you know,
we've definitely been dealing with some of these things at work –
is sort of like comparing one test run to another
and looking for
you know like deviation
right? Right.
So it's like, you know, manually looking at it in a
graph is one thing, but if you want to get
some amount of automation around it,
like, I can imagine a continuous deployment
pipeline where, if the
performance characteristics of this are
radically different, it doesn't just automatically go to production. Someone actually has to go and bless
it and be like, okay, yeah, this is fine. Yeah, this is fine. Like, we understand that it's
going to be 10 times slower, or, like you say, 10 times faster. Right, because maybe, hopefully, the
latter. For good reason, yeah, not for the bad. We were expecting it to be 10 times faster,
we did other tests to confirm, and it was great. Awesome. Yeah. I guess the other thing about graphing that I've sort of
implied in this as well is that you can sustain higher levels of noise,
yeah, in a graphing-based thing. Because you're naturally going to say, well, it took 0.7 seconds
on average to do this thing; now it's gone to 0.78; now it's down to 0.69. And it kind
of fluctuates around, and you can kind of squint and look at it and go, well, this is within
what we'd expect. Maybe every now and then there's an anomaly, like, rerun it. I know it's not a
great answer, it's very manual, it's very labor intensive, but it allows you to see a trend, more so
than have some kind of tight bound that you set and say, well, it just should definitely take no more than 0.9 seconds
to do this operation, whatever that is.
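(A hedged sketch of that "is this reading outside the recent band?" idea: compare the newest result against the mean and standard deviation of the last few runs and flag deviations in either direction, rather than failing on a hard wall-clock threshold. The window size and the 3-sigma band are arbitrary knobs.)

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if the newest reading falls outside mean +/- k*stddev of the
// previous `window` readings - in either direction, since "suddenly faster"
// is worth a look too.
bool looksAnomalous(const std::vector<double>& history, double newest,
                    std::size_t window = 20, double k = 3.0) {
  if (history.size() < window) return false;  // not enough data to judge yet
  double sum = 0.0, sumSq = 0.0;
  for (std::size_t i = history.size() - window; i < history.size(); ++i) {
    sum += history[i];
    sumSq += history[i] * history[i];
  }
  const double mean = sum / window;
  const double variance = sumSq / window - mean * mean;
  const double stddev = std::sqrt(std::max(variance, 0.0));
  return std::abs(newest - mean) > k * stddev;
}
```

Feed it the last N nightly timings (or instruction counts) and route anomalies to a human for blessing, rather than failing the build outright.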
How much overlap have you found between the tools
that you would use to automatically check these things
and the actual crack open the toolbox, get the wrench,
get the screwdriver out for doing the optimization?
Interesting.
Because I think there's some areas where they overlap, but there's definitely some areas
where they wouldn't. Yeah. I mean, the first thing that springs to mind is, a lot of
the tools that one would use that are very, as you say, specific, the tools that you get out
of the toolbox, you know, count the number of CPU instructions retired, count the branch prediction mismatches, and things like that. Those do require a system where you have access to those things, and there
are normally layers of Dockerisms and virtualization that prevent you from actually accessing the damn
CPU's functionality, because you're sharing it not just with other processes on your operating
system but with actual other operating systems, probably, on your build agent, compared to me at my keyboard on my actual computer,
where I've got the freedom of sudo-installing
and, like, monkeying with stuff.
So there's a sort of practical issue there,
but that's not necessarily problematic.
You can configure these things.
It's just I often had to badger various folks
who are involved in the virtualization.
Can you enable this, please? I need it.
Yeah, yeah.
But that aside, I mean, first of all,
obviously the cycle counting thing that we just talked about
where you simulate it,
that obviously is a tool in the arsenal of things
that you have available for you.
That is amenable to being scripted and run,
albeit slowly, on a CI.
So that's one thing. And then, if you
have got something like perf, and it does work, then you can run some things and you could sort of count
the number of, again, broad-phase stuff, like branch prediction mismatches, and you could run
it over a representative run of your program rather than targeted tests, and kind of
at least have some heuristics. Again, you could maybe graph them. So I guess you could use them. But then once you get
down into the minutiae, nothing springs to mind as to what you could
extract. Because, I don't know, the thing about these tools, when I use them, it's like it's a
journey into the unknown. It's an exploration into the literal unknown. And again,
like the presentation I give, where I'm like, where on earth are we spending all of
our time in this simple function? And it's like, oh, it's in the locale system in C++. You're like, I
didn't even know there was one. How did this happen? Where are we? Why are we here? Right. And so trying
to sort of capture that in an automated way doesn't necessarily translate. But obviously the metrics you can get out might be the same ones. Yeah, yeah. I think it's the difference between detecting
that there's a problem and then figuring out what the actual problem is. And those are two...
Sometimes there's overlap between those two things, and it can make tooling simpler to use, but for the
most part, I feel like they're very separate activities. Right. There are some... I mean, again, the
broad phase stuff,
so the top level stuff,
if I'm literally just going to do perf stat on a thing
and just get back the things that you get back by default,
which is how many CPU cycles,
how many CPU instructions were retired,
how many were issued.
There's about a half dozen stats
and some things that are inferred from it,
and that kind of gives you some idea about it.
There's also an approach that I think Intel first gave out, and this is very Intel-centric
again, for folks who are listening who are maybe not so familiar. Okay. But, like, CPUs
in general have multiple stages in them, and there's a sort of front end stage which does
fetching of the instructions that need to be done, which branch prediction feeds into,
and all the badness of...
Badness?
Inaccurate branch prediction.
Branch misprediction, I suppose.
There's the reading of the bytes that are actually the instructions,
which is cache and memory.
There's the decoding of the instructions
and turning them into the actual thing that the CPU can execute
because nothing really executes the bytes directly anymore. So that's the front end. Once those things
have come out, they go into, like, a box of things that I know I need to do, and then there's some
out-of-order magical stuff that happens. And then, you know, maybe you've got only one divider
on your chip, and now you've done lots of divides, and now you've got contention on divides;
that's sort of in the back end. And then there's retirement, where it's like, now,
at the end of the day, all the instructions have to come out.
So there's this approach that you can kind of hierarchically zoom in
on the bit that's actually the bottleneck.
And so the first couple of levels of that
might be something you could run automatically
and note when something changes.
Like, hey, this code ran before,
and it was absolutely limited by the front end.
We couldn't get instructions out of memory
into the CPU fast
enough to sustain whatever workload you gave it. But hey, now it's suddenly switched, and
now suddenly your bottleneck is somewhere else. And that's an event where you can say, I don't know why or
what, but I can perhaps flag that up as being something that happened. I don't know how
feasible that is to actually do automatically, but it strikes me as something, like, if someone handed me a product that did that,
I'd be like, yeah, I guess I'd use that.
Just run it on my little test case and see.
But yeah, I don't know.
It's a fun thing.
Yeah.
What about visualization?
I know that we've used a few different kinds of tools for visualizing flame graphs.
And I always get the tree of calls that you can click down and sort of see what things are used.
What tools have you used for that?
The flame graph thing is just one of the best visualizations. And I've also used, when I've done a recording in Chrome, like a JavaScript thing,
Google's viewer that shows a timeline and gives you a combination of both the flame view
and also this-is-what-things-were-happening-on-what-threads-when,
which is great for debugging web pages,
but you can actually record this stuff
and convert system-level traces, either on your Android phone,
or if you use your own stuff you can output
the format, it's pretty straightforward, and then you can load it and visualize it in that. So that's
been a really, really helpful thing.
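(The "output the format yourself" part really is straightforward; here's a minimal sketch that emits Chrome's trace-event JSON, which both chrome://tracing and the Perfetto UI can load. The scoped-timer helper and the file name are made up for the example, and it's single-threaded only.)

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// One completed span: name plus start time and duration in microseconds.
struct TraceEvent { std::string name; long long ts_us; long long dur_us; };

static std::vector<TraceEvent> g_events;  // not thread-safe; fine for a sketch

// RAII helper: times a scope and records it as a trace event.
struct ScopedTrace {
  std::string name;
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  explicit ScopedTrace(std::string n) : name(std::move(n)) {}
  ~ScopedTrace() {
    auto now = std::chrono::steady_clock::now();
    auto us = [](auto d) {
      return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    g_events.push_back({name, us(start.time_since_epoch()), us(now - start)});
  }
};

// Write the events in Chrome's trace-event JSON format ("X" = complete event).
// Note: event names are not JSON-escaped here.
void writeTrace(const char* path) {
  std::FILE* f = std::fopen(path, "w");
  if (!f) return;
  std::fprintf(f, "[");
  for (std::size_t i = 0; i < g_events.size(); ++i) {
    const auto& e = g_events[i];
    std::fprintf(f, "%s{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%lld,"
                    "\"dur\":%lld,\"pid\":1,\"tid\":1}",
                 i ? "," : "", e.name.c_str(), e.ts_us, e.dur_us);
  }
  std::fprintf(f, "]\n");
  std::fclose(f);
}

int main() {
  { ScopedTrace t("do_some_work"); /* ... code being traced ... */ }
  writeTrace("trace.json");  // load in ui.perfetto.dev or chrome://tracing
}
```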
They've recently spun that out. Or, I say recently; I don't know when they did it, but I noticed recently. They used to embed the
viewer separately in, I think it was the Android project originally, but now it's been spun out as its own thing, and so at perfetto.dev you can go and see the web-based viewer. You can load
traces into that. It's pretty cool. And I think, you know, the thing that we were talking about before
we started recording this was that there are a number of tools that are now starting to target
Perfetto traces, so that you can load them and visualize them in that. And one of them was, I think, Jane Street released something where they were using the new Intel tracing stuff. You
know, we should really have done more reading about this beforehand, but I've seen
a couple of things recently which have targeted this. And certainly the newer things that are
coming in in the most recent revisions of Intel chips are allowing more and more fine-grained,
accurate, and low-overhead ways of tracing what the heck is going on in
your program,
which is super awesome.
You know,
normally one finds oneself compromising and saying,
I have to just keep running my program over and over again.
And I'm doing some kind of sample-based profiling,
and that's okay for getting,
oh, there's a lot of branch mispredictions,
and I can vaguely see the area of the code
where they're happening.
But if you really want to know
in what iteration of what loop something happened,
you need to have this new branch tracing,
and I can't think what the other new thing is now,
which is terrible,
but there are some new things in the chip.
And effectively, there's a highly compressed buffer that the CPU is streaming out
of "this is what I did", which is the coolest thing, right?
Yeah.
I mean, I used to have this.
A journal, yeah.
But I used to have this on the PlayStation 2, right?
There was this crazy... and we're running out of time, unfortunately, so we maybe have to talk about this another time, but there was this crazy piece of hardware that was like a
four-times-bigger PlayStation 2 dev kit, which were already much bigger than a PlayStation 2 ever
was. I used to use it as a foot rest, mainly because it was the right height. And it was worth tens
of thousands of dollars. But it had a foot switch, and you could play your game, and you could hit the foot
switch, and it was keeping, in a ring buffer, 16 megs' worth of this kind of trace information for
all the different components over the whole PlayStation 2. And one of the cool things was that,
yeah, you go, like, why is it taking so long? You can see the graphics being drawn. You could then see...
Because effectively the CPU only needs to tell you when it took a conditional branch
and which way it went. Because the rest of the time, if you know where it is, you can say, well,
I know what a CPU does: it follows one instruction after another. The only time I need to know what
happened is when it says, oh, I either took the branch or I didn't take the branch, and I can
reconstitute the flow then, provided I know which CPU cycle it was. And then everything else
is like, oh, I branch mispredicted. And you're like, that's great, okay, that's an event that
goes into the journal. Yeah, lots of fun. Anyway. Yes, what were we talking about? Visualization
tools. Visualization. Perfetto. By the way, magic-trace, I believe, is the name of that Jane Street
thing, and that is awesome. Awesome. On the Jane Street GitHub account. I should find the
other thing that Travis Downs tweeted about the other day, which I was super excited about, but
not apparently excited about enough to remember the name of, which was to use the same
kind of techniques, I believe, that are now available in these newer CPUs to do effectively a cycle-by-cycle
simulation of a piece of code, by running it over and over and over and over again and saying,
hey, stop after one cycle and tell me what the counters look like; stop after two cycles
and tell me what the counters look like. So you can kind of slowly run the same bit of
code over and over again and kind of say, well, now measure this, now measure that, now measure that. A
lot of the problems come from the fact that the CPU has many, many, many counters
that are available for events, interesting things that happen, but it can only count, like,
three or four of them at a time. And so you have to keep being able to rerun the same piece of code
over and over again. That's very interesting. Anyway, we're actually running out of time. I can't believe this. I've gotten excited again about something.
I mean, it's uncharacteristic of either of us to be excited about topics.
To be so animated about all of these things.
But yes, yes.
But we should try this and talk more about this some other time.
There's so much to think about.
Yeah, maybe there's a part two here.
It could be a part two.
Let's think about that.
We should think about that.
So, listener,
if there's a part two,
there will be a part two after this,
but maybe this is the whole thing.
I don't know.
Maybe we've actually peaked.
Yep.
That sounds about right.
All right.
Well, I guess,
until next time, my friend.
Until next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcript and notes at twoscomplement.org.
Contact us on Twitter at @twoscp. That's @t-w-o-s-c-p.
Theme music by Inverse Phase.