Two's Complement - Performance

Episode Date: February 3, 2022

Our most efficient podcast ever. Ben and Matt talk performance testing and optimization in fewer than 30 minutes....

Transcript
Starting point is 00:00:00 I'm Matt Godbolt, and I'm Ben Rady, and this is Two's Complement, a programming podcast. Hey Ben. Hey Matt. How are you doing? I'm doing great. Awesome. Well, I've been thinking a lot about performance recently. It's not something I've been worrying about for a while; I've been doing more functional stuff. But now we're sort of trying to think about how best to measure, track, and improve the performance of the code I'm working on. And I wondered, you know, we should just talk about it, because, frankly, that's what we do on this podcast: just talk about the things we're doing in our day jobs in some way or other, or allude to them. I have been thinking about the exact same thing, actually.
Starting point is 00:00:54 So this is good. This is good. Yeah, make it right and then make it fast. Well, the make it fast part. Right, right, right, right. Measure it, right? Measure. Well, I think that's the key thing, I think. You know, I think if there's, and I think we've mentioned this before, one of the key things is always measure, never trust your instincts,
Starting point is 00:01:13 because no matter how many years you've been doing this, and I've been doing this for nearly three decades on and off, you're always wrong. And even when you account for the fact that you know that you're always wrong, you're like, well, I know I'm always wrong. So it won't be A, it must be the B thing. And then you measure it and it's some other thing. All right, I think I've got a present.
Starting point is 00:01:31 I don't know, is this on my YouTube? I don't know. But there's a presentation I give when I go around to universities when we're trying to like encourage graduates, newly new graduate folks to apply. And I do like a deep dive on like a C++ optimization problem. And I set it up.
Starting point is 00:01:50 And the funny thing is I set it up for myself thinking, ha ha, everyone will expect A, but it will be B. And then I ran it. Of course, it was neither of those things. It was something I'd never even heard of suddenly rearing its head. I'm like, this is amazing. This is exactly what I want to do. So that's what the presentation is about.
Starting point is 00:02:03 And so always measure. I think we all know that. And, you know, profiling is one thing, but what's your general approach to performance? I mean, I think we hit on this a little bit already, but I think you've got to really start with the basics, which is, you know, don't be superstitious, right? What do you mean by that? I mean, that's intriguing, right? I mean, the superstitions, there's so many of them.
Starting point is 00:02:35 You hear them all the time. It's like, oh, you can't use this structure or this call or this function because it's too slow, right? That's the kind of stuff, right? And all of that stuff is sometimes true, but it's always context dependent. So if it was true, it was only because you got lucky, right? It's like, A is slower than B? Well, you know, there's a 50% chance you're right, that sort of thing. Right, right. So don't be superstitious, right?
Starting point is 00:03:06 And part of that, and so I think part of the reason people fall into that trap is they're like, well, but if I don't learn what things are good, what things are fast, what things are slow, which things are memory efficient, which things aren't, if I don't learn these things, how am I going to write any code without performance testing every single line of it as I write it, right? And I think that is where you get the very old, you know, gray-beard programmer wisdom of first make it right and then make it fast. Because the first thing in measuring is figuring out which parts of the code actually matter, because a lot of them just don't, right? And if you're doing sort of superstitious memory tricks in your command line parsing, arg parsing, it doesn't matter.
Starting point is 00:03:48 I'm with you. Now I get what you mean by that. I would sort of take the other side of that just a little bit. There are some things that are, in particular in C++, that you could do quote unquote wrong forever. And it doesn't really matter. But if you've written an entire application where you've passed everything as a string by value or whatever that you profile it
Starting point is 00:04:11 and it's like, why is my application slow? It's not slow anywhere in particular, it's just slow everywhere, because you've done something that's kind of harmed you in the approach you've taken, and it can be hard to unpick that after the fact, even if it's right. Right, absolutely, absolutely. So in those cases, I think the key there is you still need to measure. And something that I do a lot with these kinds of things, this is like a classic example of this. Let's say that you have, I'm going to just pick on Java for a minute. Let's say that you're...
Starting point is 00:04:38 Other languages are available for picking on. Other languages are, and this technique would work in any of those languages. But let's say that you've got some data structure in the JDK that you want to consider using, some list implementation or something like that. And you're like, I want to go and write my own version of this instead of using the JDK one, because my own version will be faster, right? Let's just say that you want to make that claim. We've all made that mistake before now. X is slow, I'll just write my own, and then very quickly you discover, oh, it's harder
Starting point is 00:05:09 than I thought. Or, yes, yes, sorry, continue. So if you're going to make that very bold claim, right, I would submit that what you need to do is build a performance test that shows the performance of the built-in thing and the performance of your new fancy thing, at the most basic level. Make sure that when you run that once, you're actually correct that it is faster in that context, but also
Starting point is 00:05:35 keep that around and have it as part of your pipeline, so that as you, again, in the case of Java, upgrade your JDK or run on a different architecture or change some Java heap setting or whatever it is, that holds true in all situations. Because guess what? The guys that wrote the JDK thought a lot about all of those different situations, and you maybe have not, right? Right.
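A minimal sketch of that kind of side-by-side performance test, assuming the Google Benchmark library (github.com/google/benchmark); std::map and std::unordered_map stand in here for "the built-in thing" and "your new fancy thing", and the fixed seed keeps the workload identical from run to run so the comparison can live in a pipeline rather than be a one-off claim.

```cpp
// Side-by-side benchmark of two implementations of the same operation.
// Assumes Google Benchmark; the two standard maps stand in for "built-in vs custom".
#include <benchmark/benchmark.h>

#include <cstdint>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

using StdMap = std::map<std::uint64_t, std::uint64_t>;
using HashMap = std::unordered_map<std::uint64_t, std::uint64_t>;

static std::vector<std::uint64_t> MakeKeys(std::size_t n) {
  std::mt19937_64 rng(42);  // fixed seed: identical keys every run, so runs are comparable
  std::vector<std::uint64_t> keys(n);
  for (auto &k : keys) k = rng();
  return keys;
}

template <typename Map>
static void BM_Lookup(benchmark::State &state) {
  const auto keys = MakeKeys(static_cast<std::size_t>(state.range(0)));
  Map m;
  for (auto k : keys) m.emplace(k, k);
  std::size_t i = 0;
  for (auto _ : state) {
    auto it = m.find(keys[i++ % keys.size()]);
    benchmark::DoNotOptimize(it);  // keep the lookup from being optimised away
  }
}

BENCHMARK_TEMPLATE(BM_Lookup, StdMap)->Range(1 << 10, 1 << 20);
BENCHMARK_TEMPLATE(BM_Lookup, HashMap)->Range(1 << 10, 1 << 20);
BENCHMARK_MAIN();
```

The results can be written out as JSON (--benchmark_out=...) and compared against the previous run, which is one way to turn "it still holds after a JDK upgrade or an architecture change" into an ongoing check.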
Starting point is 00:05:58 So writing those tests, though, that's probably the real difficult thing, right, in my experience. It is difficult. And I mean, if we're picking, you know, you picked Java. Java has some extra fun and games on top of what I would normally worry about when I'm writing benchmarks or similar tests that are trying to measure performance,
Starting point is 00:06:15 inasmuch as, you know, the JVM and the JIT are going to come along and optimize and give you even more of a hard time than your equivalent in other compiled languages. Although they have their own problems too, right? Like the number of times that one's written a benchmark around a C++ thing, only to discover that because you passed in constant values to your tests, the compiler has just gone, you know what, the answer is 12, we're done here. No code runs at all. Right. Yeah. So there are definitely challenges in this.
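That constant-folding trap is easy to reproduce, and one common workaround on GCC and Clang is the empty inline-asm "escape" trick, which is essentially the idea Google Benchmark's DoNotOptimize wraps. A sketch, where Work(), the iteration count, and the hand-rolled timing loop are all made up for illustration:

```cpp
// The "the compiler already knows the answer is 12" trap, and the empty inline-asm
// escape hatch (GCC/Clang only) that forces the measured work to actually happen.
#include <chrono>
#include <cstdio>

static int Work(int x) { return x * x + 3; }  // the "function under test"

// Pretend a mystery observer might read *p: the optimiser can no longer delete the
// computation that produced it, nor assume it still knows the value behind p.
static void Escape(void *p) { asm volatile("" : : "g"(p) : "memory"); }

int main() {
  constexpr int kIters = 10'000'000;
  int input = 3;
  Escape(&input);  // launder the input so it is no longer "the constant 3"
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    int r = Work(input + (i & 0xFF));  // varying, laundered input: can't be folded or hoisted
    Escape(&r);                        // result is "observed", so the loop body survives
  }
  const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
  std::printf("%lld ns for %d iterations\n", static_cast<long long>(ns), kIters);
}
```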
Starting point is 00:06:41 Yes, definitely. And, you know, microbenchmarks are great at showing very specific cases, but they're often not necessarily very representative of your program at large. There are macro effects sometimes. Especially if you're like, oh, I can be more cache-size efficient with my implementation, then you might discover that actually all of that goes out the window when you're running in a program where someone's memcpy-ing two gigs of data around
Starting point is 00:07:07 before they call your function, or whatever. You know, there's a lot of things to think about. So how do you even go about writing that apparently straightforward test, you know, just write the test of my list versus theirs? Right. Well, no, it's not straightforward, and that was sort of my point. If you get the easy stuff out of the way (don't be superstitious, do measure these things, don't assume that you're smarter than all the people that wrote the standard library, don't do that), now you're at the actual hard part, the meat of the problem, which is, okay, you're going to write the benchmark that shows that your super fancy implementation is faster. How do you do it in a way that's really real, right?
Starting point is 00:07:48 Fair as well, because it's like, hey, I tested it and it's faster. Good. Don't worry about it. What happens if you pass null values? Oh, yeah, don't worry about that. There's some special cases thing. Right, right, right. So yeah, so to me, that is the tricky problem, right? One of the things that I was just sort of, we were talking about this I think a month or so ago. So I was kind of wondering, it's like, I wonder if you could use, and this is of course on brand for me, but it's like I wonder if you could use some of the generative
Starting point is 00:08:15 testing frameworks to generate it. Because you were just talking about null values, constant values, stuff like that. And you can also generate random data, right? But I was kind of wondering, you know, generative testing frameworks have these things where you define, you know, a function that can create a range of properties, can create different types for you. And you can handle that stuff yourself, but maybe you could use some of those
Starting point is 00:08:39 to generate some data for performance testing. Yeah, that might be totally bonkers, I don't know. I mean, I've certainly written stuff before for, like, maps, when we've been doing stuff for market data. There's a lot of map accesses in the finance world, adding and removing and looking things up by giant 64-bit keys.
Starting point is 00:09:00 And if you're trying to do things fast, there's a hundred different ways you can skin that. And I, yeah, I've ended up generating effectively add, modify, remove loops in the generative thing. And they have to balance that for every add. There must be a remove at some point, but there could be some modifies in between. So you write this sort of, as you say, generative approach. And you're like, OK, here's 100,000 operations that I'm going to stream through. And I can randomize them with a seed so that I know they're going to be the same each time to be somewhat fair.
Starting point is 00:09:25 And then I can run it through map choice one, map choice two, map choice three, and get some idea that it's representative. And maybe you could even measure your actual workload and say, typically we have more modifies than we have anything else, or more adds and removes, or get some kind of
Starting point is 00:09:46 smell about the data that tells you that it looks representative. But yeah, to your point, generative stuff. And if it's not as complicated and path-dependent as a long-lived map where you're adding and removing, if it's just, you know, hey, I've written my own square root routine, then of course you can throw random numbers at it, or ascending numbers, or all those things at it, I suppose. But so, yeah, that sort of covers some of the tricks and problems of generating representative data.
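A hand-rolled sketch of that seeded add/modify/remove idea (not a property-testing framework, and all the names here are made up): generate one balanced operation stream from a fixed seed, then replay exactly the same stream against each candidate map.

```cpp
// Seeded add/modify/remove workload, replayed against two map implementations
// so each one sees exactly the same sequence of operations.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

enum class Op { Add, Modify, Remove };
struct Step { Op op; std::uint64_t key; };

// Balanced stream: every Add is eventually Removed, with Modifies in between.
static std::vector<Step> MakeWorkload(std::size_t n, std::uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::vector<Step> steps;
  std::vector<std::uint64_t> live;  // keys currently in the map
  while (steps.size() < n) {
    const int choice = live.empty() ? 0 : static_cast<int>(rng() % 4);
    if (choice == 0) {               // Add a fresh key
      const std::uint64_t k = rng();
      live.push_back(k);
      steps.push_back({Op::Add, k});
    } else if (choice == 3) {        // Remove a random live key
      const std::size_t idx = rng() % live.size();
      steps.push_back({Op::Remove, live[idx]});
      live[idx] = live.back();
      live.pop_back();
    } else {                         // Modify a random live key
      steps.push_back({Op::Modify, live[rng() % live.size()]});
    }
  }
  for (auto k : live) steps.push_back({Op::Remove, k});  // drain what's left
  return steps;
}

template <typename Map>
static double RunMs(const std::vector<Step> &steps) {
  Map m;
  const auto start = std::chrono::steady_clock::now();
  for (const auto &s : steps) {
    switch (s.op) {
      case Op::Add:    m[s.key] = s.key; break;
      case Op::Modify: m[s.key] += 1;    break;
      case Op::Remove: m.erase(s.key);   break;
    }
  }
  const std::chrono::duration<double, std::milli> ms = std::chrono::steady_clock::now() - start;
  return ms.count();
}

int main() {
  const auto steps = MakeWorkload(100'000, /*seed=*/1234);  // same stream every run
  std::printf("std::map:           %.2f ms\n", RunMs<std::map<std::uint64_t, std::uint64_t>>(steps));
  std::printf("std::unordered_map: %.2f ms\n", RunMs<std::unordered_map<std::uint64_t, std::uint64_t>>(steps));
}
```

The mix of operations here is arbitrary; biasing it towards whatever your measured production workload actually does is the "smell about the data" part.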
Starting point is 00:10:28 Yeah, I haven't actually tried it. I have to put a disclaimer out there, though: I haven't gone and gotten a generative testing framework and plugged it into a performance test. Give me the name of one, am I forgetting? Well, there's QuickCheck, which is the original one, and then there's all the QuickCheck derivatives. Sure, just QuickCheck would be enough for me to remember; I knew that's what I was thinking of, I couldn't... Right, yeah. I'm trying to remember some of the specific names, but a lot of them are QuickCheck derivatives, so it's probably JCheck in Java, I imagine, or C++Check or something somewhere. I'm sure it rings a bell. Anyway, yeah, no, that's cool. So that kind of somewhat covers that. I'm sure it's a huge topic, and you and I have got a little experience in this, but maybe not as much. But then there's the thing that immediately springs to mind when I'm doing these kinds of performance things, particularly in my world, where I have the luxury of knowing what kind of computer system I'm going to be deploying my code onto. In fact, sometimes I get to choose what system it is.
Starting point is 00:11:15 I get to pick the CPU and how much cache and all that kind of stuff. But my development workstation may not be the same as that. And my development workstation is running a thousand other processes: all the Javas in the world for all my IDEs, my music player, Chrome with 300 tabs open that's taking all the CPU in the world for reasons I don't know, or I'm in a video conference with somebody. And, like, you're mining cryptocurrency somewhere? Yes, yeah, something like that. Or a page that you're visiting is mining crypto. Oh, more likely, more likely, as it happens. Yeah. And so obviously it's not necessarily representative. And certainly another sort of key piece of advice, when people talk to me about performance
Starting point is 00:11:56 testing, other than, you know, measure everything, is: measure on a system that is actually representative of where you're going to be running your code. Because it's so easy to sit there and go, oh, I picked this map, it's perfect, it does the work really well, and then you discover that, well, it works fine on your system, but with four times as much cache and three NUMA nodes or whatever... Right. Three NUMA nodes? I don't know, four NUMA nodes, come on, I don't know what's up with me. Powers of two only, please. Exactly, yes. That it doesn't work as well, or something else is more efficient because of it. And obviously this is very micro-optimization based, and many people won't ever be worrying about this in general, but it's something to be aware of, right? We've all run something and gone, oh, that seems
Starting point is 00:12:38 a lot slower and then you run it again like oh it's back to normal it's fine and sometimes that's an interesting thing to know because you're like, well, why is it so variable? Another one is like, well, yeah, I was mining cryptocurrency at the same time or something else happened in my computer. So what all can one do to help in that regard, do you think? Oh, man. I mean, I think a lot of this ties into how frequently you measure and check these things.
Starting point is 00:13:02 What you would want to do, what you want to do, I feel like, is you have it as part of your regular continuous deployment pipeline, right? Right. Every time I make a change to the code, I want to run the unit test to make sure that the behavior is correct, and I would want to run some kind of performance test
Starting point is 00:13:16 to make sure the performance hasn't changed. But that second part is way harder than, you know, when I call my square root function, does it give me the right result? So it's really tricky. I honestly can say I know some ways to do this that don't work, but I don't really have a lot of ways to do this where I'm like, yes, this is the way you do it. Any sort of test where you are using time, where you're like, oh, well, this amount of things should run in 10 seconds or less, you're scheduling pain-in-the-butt brittle test debugging
Starting point is 00:13:50 for the future. We've all seen the, this shouldn't take more than 10 seconds time out on a shared build server that's running in three layers of Docker and 12 different hypervisors or whatever. Right, right, right, right, right. So, like, those things, I mean, maybe someone somewhere can say that that works
Starting point is 00:14:05 for them. It has never worked for me. And every once in a while I'll try it again, just to make sure it still doesn't work. Yeah, and it still doesn't work, the world has not moved on. Yeah. So there's that. I haven't really looked at sort of other ways to measure performance, like something we were talking about the other day: measuring instructions. Like, how many instructions did this take the last time I ran it, how many is it taking now, and maybe why is it
Starting point is 00:14:34 different? But I've never actually done that, so I don't know. I have a friend who uses that approach, at least as a starting point. I mean, there are some simulators that you can get that you can run your code through that will give you a fairly decent approximation, as well as actually measuring the instructions. Obviously, if you measure the actual instructions,
Starting point is 00:14:53 you've got some amount of noise from the measurement system itself. But if you're using effectively an emulator to run through your program at the cycle level, and it's fast enough to actually run, you can say, well, how many instructions got processed, how many times did you access the cache, that kind of thing. Now, again, that simulation may not represent reality, but it's something that is measurable and deterministic, and that is kind of a good thing to have. So that's an option.
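For actually counting instructions rather than simulating them, on Linux the counters that perf reads are available to a program directly. A sketch along the lines of the perf_event_open(2) man page example, where the busy loop is just a placeholder for the region being measured, and which may need kernel.perf_event_paranoid relaxed (or extra privileges) on locked-down or virtualised machines:

```cpp
// Linux-only sketch: count retired instructions for a region of code using the
// same hardware counters `perf stat` reads, via the raw perf_event_open(2) syscall.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // try PERF_COUNT_HW_BRANCH_MISSES too
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;

  const int fd = static_cast<int>(
      syscall(__NR_perf_event_open, &attr, /*pid=*/0, /*cpu=*/-1, /*group_fd=*/-1, /*flags=*/0));
  if (fd < 0) { std::perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

  // The region you actually care about goes here.
  volatile std::uint64_t sum = 0;
  for (std::uint64_t i = 0; i < 1'000'000; ++i) sum += i;

  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
  std::uint64_t count = 0;
  if (read(fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count))) {
    std::perror("read");
    return 1;
  }
  std::printf("instructions retired: %llu\n", static_cast<unsigned long long>(count));
  close(fd);
}
```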
Starting point is 00:15:32 Yeah. Another thing that I've done in this world is have a special build agent that runs on just one machine that's only there to service your request, or it runs at three in the morning, and you're pretty much guaranteed that nothing else is running on that agent. And rather than erroring on it, you just graph it. And that does require human persistence, to kind of look over and say, oh, wait a second, when did that spike up? Right. And, you know, maybe you could use some kind of anomaly detection type stuff. Yeah, to say, hey, the last four readings have been within this band and suddenly it's outside of it, in both directions, right? You know, sometimes the too-good-to-be-true error is worth it. Like, hey, suddenly this went faster.
Starting point is 00:16:07 What happened? Oh, we upgraded the compiler. All right, now the compiler's too smart for us; it's not actually testing anything useful anymore. Yeah, it's not testing anything useful anymore.
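A sketch of that "is it outside the band" check, flagging in both directions so the too-good-to-be-true case also gets a human look. The file name, the five-run minimum, and the three-sigma threshold are all arbitrary choices made up for the sketch:

```cpp
// Compare the newest timing against the mean of previous runs and flag it when it
// falls outside mean +/- k standard deviations, in either direction.
#include <cmath>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char **argv) {
  const char *path = argc > 1 ? argv[1] : "timings.txt";  // one seconds-per-run value per line
  std::ifstream in(path);
  std::vector<double> runs;
  for (double v; in >> v;) runs.push_back(v);
  if (runs.size() < 5) { std::puts("not enough history yet"); return 0; }

  const double latest = runs.back();
  runs.pop_back();  // history = everything except the newest run

  double mean = 0;
  for (double v : runs) mean += v;
  mean /= runs.size();
  double var = 0;
  for (double v : runs) var += (v - mean) * (v - mean);
  const double stddev = std::sqrt(var / runs.size());

  const double k = 3.0;  // how many sigmas of noise we tolerate
  if (std::fabs(latest - mean) > k * stddev) {
    std::printf("ANOMALY: %.3fs vs mean %.3fs (stddev %.3fs), needs a human\n", latest, mean, stddev);
    return 1;  // fail the pipeline stage so someone has to go and bless it
  }
  std::printf("ok: %.3fs within %.1f sigma of %.3fs\n", latest, k, mean);
}
```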
Starting point is 00:16:25 Yeah, I almost wonder, and you know, we've definitely been dealing with some of these things at work, about sort of comparing one test run to another and looking for, you know, deviation, right? Right. So manually looking at it in a graph is one thing, but if you want to get some amount of automation around it, I can imagine a continuous deployment pipeline where, if the
Starting point is 00:16:42 performance characteristics of this are radically different, it doesn't just automatically go to production. Someone actually has to go and bless it and be like, okay, yeah, this is fine, we understand that it's going to be 10 times slower, or, like you say, 10 times faster, right? Because maybe, hopefully, the latter. For good reason, yeah, not for a bad one: we were expecting it to be 10 times faster, we did other tests to confirm, and it was great. Awesome. Yeah. I guess the other thing about graphing that I've sort of implied in this as well is that you can sustain higher levels of noise
Starting point is 00:17:15 Yeah, in a graphing-based thing. Because you're naturally going to say, well, it took 0.7 seconds on average to do this thing, now it's gone to 0.78, now it's down to 0.69, and it kind of fluctuates around, and you can squint and look at it and go, well, this is within what we'd expect, and maybe every now and then there's an anomaly, like, rerun it. I know it's not a great answer, it's very manual, it's very labor intensive, but it allows you to see a trend, more so than having some kind of tight bound that you set and say, well, it should definitely take no more than 0.9 seconds to do this operation, whatever that is. How much overlap have you found between the tools
Starting point is 00:17:53 that you would use to automatically check these things, and the actual crack-open-the-toolbox, get-the-wrench, get-the-screwdriver-out tools for doing the optimization? Interesting. Because I think there are some areas where they overlap, but there are definitely some areas where they wouldn't. Yeah. I mean, the first thing that springs to mind is that a lot of the tools one would use that are, as you say, specific, the tools that you get out of the toolbox, you know, count the number of CPU instructions retired, count the branch prediction mismatches and things like that, those do require a system where you have access to those things, and there
Starting point is 00:18:31 are normally layers of Dockerisms and virtualization that prevent you from actually accessing the damn CPU's functionality, because you're sharing it not just with other processes on your operating system but probably with actual other operating systems on your build agent, compared to me at my keyboard on my actual computer, where I've got the freedom of sudo-installing and monkeying with stuff. So there's a sort of practical issue there, but that's not necessarily problematic. You can configure these things.
Starting point is 00:18:58 It's just I often had to badger various folks who are involved in the virtualization. Can you enable this, please? I need it. Yeah, yeah. But that aside, I mean, first of all, obviously the cycle counting thing that we just talked about where you simulate it, that obviously is a tool in the arsenal of things
Starting point is 00:19:17 that you have available to you. That is amenable to being scripted and run, albeit slowly, on a CI. So that's one thing. And then, if you have got something like perf and it does work, then you can run some things and count, again, broad-phase stuff like branch prediction mismatches, and you could run it over a representative run of your program rather than targeted tests, and at least have some heuristics. Again, you could maybe graph them. So I guess you could use them, but then once you get
Starting point is 00:19:48 down into the minutiae, nothing springs to mind as what you could extract. Because, I don't know, the thing about these tools, when I use them, is that it's a journey into the unknown, an exploration into the literal unknown. And again, like the presentation I give where I'm like, where on earth are we spending all of our time in this simple function? And it's like, oh, it's in the locale system in C++. You're like, I didn't even know there was one. How did this happen? Where are we? Why are we here? Right. And so trying to capture that in an automated way doesn't necessarily translate, but obviously the metrics you can get out might be the same ones. Yeah, I think it's the difference between detecting that there's a problem and then figuring out what the actual problem is, and those are two
Starting point is 00:20:35 Sometimes there's overlap between those two things, and it can make tooling simpler to use, but for the most part I feel like they're very separate activities. Right. There are some, I mean, again, the broad-phase stuff, the top-level stuff: if I'm literally just going to do perf stat on a thing and just get back the things that you get back by default, which is how many CPU cycles, how many CPU instructions were retired,
Starting point is 00:20:57 how many were issued. There's about a half dozen stats and some things that are inferred from them, and that kind of gives you some idea about it. There's also an approach that I think Intel first gave out, and this is very Intel-centric again, for folks who are listening and are maybe not so familiar. But CPUs in general have multiple stages in them, and there's a sort of front-end stage which does fetching of the instructions that need to be done, which branch prediction feeds into,
Starting point is 00:21:27 and all the badness of... Badness? Inaccurate branch prediction. Branch misprediction, I suppose. There's the reading of the bytes that are actually the instructions, which is cache and memory. There's the decoding of the instructions and turning them into the actual thing that the CPU can execute
Starting point is 00:21:43 because nothing really executes the bytes directly anymore. So that's the front end. Once those things have come out, they go into a box of things that the CPU knows it needs to do, and then there's some out-of-order magical stuff that happens, and then, you know, maybe you've got only one divider on your chip and you've done lots of divides, and now you've got contention on divides; that's sort of the back end. And then there's retirement, where, at the end of the day, all the instructions have to come out. So there's this approach where you can kind of hierarchically zoom in on the bit that's actually the bottleneck.
Starting point is 00:22:12 And so the first couple of levels of that might be something you could run automatically and note when something changes. Like, hey, this code ran before, and it was absolutely limited by the front end. We couldn't get instructions out of memory into the cpu fast enough to sustain the the whatever workload you gave it but hey now it's suddenly switched and
Starting point is 00:22:32 now suddenly your bottleneck somewhere else and that's an event you can say i don't know why or what but i can perhaps flag that up as being something that happened i don't know how how feasible that is to actually do automatically but it strikes me as something like if someone handed me a product that did that, I'd be like, yeah, I guess I'd use that. Just run it on my little test case and see. But yeah, I don't know. It's a fun thing. Yeah.
Starting point is 00:22:57 What about visualization? I know that we've used a few different kinds of tools for visualizing flame graphs, and I always get the tree of calls that you can click down into and sort of see what things are used. What tools have you used for that? The flame graph thing is just one of the best visualizations. And if you've ever done a recording in Chrome, like a JavaScript thing, Google's viewer shows a timeline and gives you a combination of both the flame view and also what was happening on what threads when, which is great for debugging web pages,
Starting point is 00:23:38 but you can actually record this stuff and convert system-level traces, either on your Android phone, or, if you use your own stuff, you can output the format, it's pretty straightforward, and then you can load it and visualize it in that. So that's been a really, really helpful thing. They've recently spun that out, or I say recently, I don't know when they did it, but I noticed recently that they used to embed the viewer, I think in the Android project originally, but now it's been spun out into its own thing. And so at perfetto.dev you can go and see the web-based viewer; you can load traces into that. It's pretty cool.
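Emitting something those viewers can load really is straightforward. This is a sketch of the Chrome/Perfetto "Trace Event" JSON format, writing complete ("X") events with microsecond timestamps that chrome://tracing or ui.perfetto.dev can open; the Scope helper and the file name are made up for the sketch:

```cpp
// Write a minimal Trace Event JSON file: one complete ("X") event per Scope,
// with microsecond timestamps, loadable in ui.perfetto.dev or chrome://tracing.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

static std::FILE *g_trace = nullptr;
static bool g_first = true;

static std::uint64_t NowUs() {
  using namespace std::chrono;
  return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

struct Scope {  // writes one complete event covering its own lifetime
  const char *name;
  std::uint64_t start = NowUs();
  explicit Scope(const char *n) : name(n) {}
  ~Scope() {
    std::fprintf(g_trace,
                 "%s\n{\"name\":\"%s\",\"ph\":\"X\",\"ts\":%llu,\"dur\":%llu,\"pid\":1,\"tid\":1}",
                 g_first ? "" : ",", name,
                 static_cast<unsigned long long>(start),
                 static_cast<unsigned long long>(NowUs() - start));
    g_first = false;
  }
};

int main() {
  g_trace = std::fopen("trace.json", "w");
  if (!g_trace) return 1;
  std::fputs("[", g_trace);
  {
    Scope outer("load_data");  // event names are arbitrary
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    {
      Scope inner("parse");
      std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
  }
  std::fputs("]\n", g_trace);
  std::fclose(g_trace);  // open trace.json in the viewer to see the nested spans
}
```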
Starting point is 00:24:14 And I think, you know, the thing that we were talking about before we started recording this was that there are a number of tools that are now starting to target Perfetto traces, so that you can load them and visualize them in that. And one of them was, I think, Jane Street released something where they were using the new Intel tracing stuff. You know, we should really have done more reading about this beforehand, but I've seen a couple of things recently which have targeted this. And certainly the newer things that are coming in the most recent revisions of Intel chips are allowing more and more fine-grained, accurate, and low-overhead ways of tracing what the heck is going on in your program,
Starting point is 00:24:53 which is super awesome. You know, normally the, the one finds oneself compromising and saying, I have to just keep running my program over and over again. And I'm doing some kind of sample-based profiling, and that's okay for getting, oh, there's a lot of branch mispredictions,
Starting point is 00:25:09 and I can vaguely see the area of the code where they're happening. But if you really want to know in what iteration of what loop something happened, you need to have this new branch tracing, and I can't think what the other new thing is now, which is terrible, but there are some new things in the chip.
Starting point is 00:25:24 And effectively, there's a highly compressed buffer that the CPU is streaming out: this is what I did, which is the coolest thing, right? Yeah. I mean, I used to have this... A journal, yeah. But I used to have this on the PlayStation 2, right? There was this crazy, and we're running out of time, unfortunately, because we maybe have to talk about this another time, but there was this crazy piece of hardware that was like a four-times-bigger PlayStation 2 dev kit, which was already much bigger than a PlayStation 2 ever
Starting point is 00:25:53 was. I used to use it as a footrest, mainly because it was the right height, and it was worth tens of thousands of dollars. But it had a foot switch, and you could play your game and hit the foot switch, and it was keeping a ring buffer with 16 megs' worth of this kind of trace information for all the different components across the whole PlayStation 2. And one of the cool things was that, yeah, you go, why is it taking so long? You can see the graphics being drawn, and you could then see, because effectively the CPU only needs to tell you when it took a conditional branch and which way it went. Because the rest of the time, if you know where it is, you can say, well, I know what a CPU does, it follows one instruction after another. The only time I need to know what
Starting point is 00:26:34 happened is when it says, oh, either I took the branch or I didn't take the branch, and I can reconstitute the flow then, provided I know which CPU cycle it was. And then everything else is like, oh, I branch mispredicted, and you're like, that's great, okay, that's an event that goes into the journal. Yeah. Lots of fun. Anyway, yes, what were we talking about? Visualization tools. Visualization, Perfetto. By the way, magic-trace, I believe, is the name of that Jane Street thing, and that is awesome. Awesome. It's on the Jane Street GitHub account. I should find the other thing that Travis Downs tweeted about the other day, which I was super excited about, but apparently not excited enough to remember the name of, which was to use the same
Starting point is 00:27:19 kind of techniques, I believe, that are now available in these newer CPUs, to do effectively a cycle-by-cycle simulation of a piece of code by running it over and over and over again and saying, hey, stop after one cycle and tell me what the counters look like; stop after two cycles and tell me what the counters look like. So you can slowly run the same bit of code over and over again and say, well, now measure this, now measure that. A lot of the problems come from the fact that the CPU has many, many counters available for events, interesting things that happen, but it can only count like three or four of them at a time, and so you have to keep being able to rerun the same piece of code
Starting point is 00:27:58 over and over again that's very interesting anyway we're actually running out of time. I can't believe this. I've gotten excited again about something. I mean, it's uncharacteristic of either of us to be excited about topics. To be so animated about all of these things. But yes, yes. But we should try this and talk more about this some other time. There's so much to think about. Yeah, maybe there's a part two here. It could be a part two.
Starting point is 00:28:23 Let's think about that. We should think about that. We should think about that. So, listener, if there's a part two here. Maybe. It could be a part two. Let's think about that. We should think about that. So, listener, if there's a part two, there will be a part two after this, but maybe this is the whole thing. I don't know.
Starting point is 00:28:30 Maybe we've actually peaked. Yep. That sounds about right. All right. Well, I guess, until next time, my friend. Until next time. You've been listening to Two's Complement, a
Starting point is 00:28:48 programming podcast by Ben Rady and Matt Godbolt. Find the show transcript and notes at twoscomplement.org. Contact us on Twitter at @twoscp. That's at T-W-O-S-C-P.
Starting point is 00:29:01 Theme music by Inverse Phase.
