Two's Complement - Measure Twice, Optimize Once

Episode Date: March 9, 2026

Ben asks a simple question about performance and Matt talks for 46 minutes. The one true use case for linked lists is revealed, and a part three is threatened....

Transcript
Starting point is 00:00:00 I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Complement, a programming podcast. Hey, Ben. Hey, Matt. So, last time we were talking about performance. And amazingly, you and I have remembered just about enough of what we were talking about before to let us continue on that subject.
Starting point is 00:00:33 So the story so far, as we were recapping during the intro: we have a relatively trivial piece of code and we were interested in making it go as fast as possible, mostly by looking at the things outside of the code itself that could be taking time. I/O and shipping things magically through kernel space or avoiding kernel space, all that good stuff. So then you said, well, what happens if it wasn't a trivial piece of code? What if it was like a lot of work? How do you make that go fast?
Starting point is 00:01:03 And I think that's where we were going to continue. So, I mean, how do you want to start this? Well, so yeah, like the previous example, it was dominated by I/O, right? Or, you know, a very simple map lookup in just RAM, and we talked about L2/L3 and that kind of stuff. Yeah, not quite adding two numbers together. You do maybe have to go out to RAM and pull some data down. But, you know, I think Redis cache was sort of like, you know, the mental model we had. Exactly. Yeah. And, you know, maybe alternating with robot servo control. But that's, I think, where we went off on a tangent. That's right. We both, yeah, unsurprisingly. Yeah. So now imagine the,
Starting point is 00:01:46 the program that I wish to make go fast, instead of being dominated by I.O., let's say, is dominated by the program itself. And I think this can be interesting in two dimensions. One is, you might have something where what you're actually looking for is throughput, right? You're just trying to process a lot of data. Or maybe you're not trying to process that much data, but the computation on that data is very expensive. And so you're trying to do that as quickly as you possibly can. And then there's also the dimension of just latency, where it's like, okay, I.O. is maybe not the dominant factor here, but the round trip time of this still matters. And so you have tradeoffs that you now need to make, whereas before we were
Starting point is 00:02:35 just kind of optimizing for, you know, let's stay in the L1 cache and let's make sure that we never miss it's like we can't this problem is too complicated for that so now we we maybe have some interesting trade-offs between i.O and performance inside the CPU to sort of concrete eyes concretize something that's a word now it is a word now something like a video game is a great example of there's a ton of stuff going on in every frame but you want it so that when you hit the fire button you aren't waiting 120 milliseconds before you actually finally see the fire you know the shot go off And, you know, there's a number of things that are sort of kernel bypassing in the way. But most of it is just down to computation and can you get something done in, say, a 60th of a second or a hundredth of a second in modern things?
Starting point is 00:03:19 And then for a throughput-based thing, we're talking maybe credit card processing. I don't know. That's still I/O-y feeling. I don't know. I'm trying to think of an example that we could talk about. I mean, you know, there's all kinds of problems with, like, text manipulation. Like, you've got lots and lots of text and you're trying to summarize it, extract some, you know, signal from it, run some NLP process on it. That's one. Right, right. And you've got tons of data to go through and you just want to get through it quickly.
Starting point is 00:03:48 But any one piece of text is not particularly important. Right. And that works. Well, so it's funny, the thing we didn't even talk about last time, or I don't remember talking about last time, is the single most important thing about any kind of performance work, which is measuring it and making sure you know what you're doing before you make any changes. Because unlike almost any other aspect of software engineering, there are so many hidden variables
Starting point is 00:04:21 and so many weird things going on inside computers or exogenous or in your own code, which is more complicated than perhaps you remember it is or there are surprising aspects of it. The only way is to establish some kind of base, truth like I have a representative amount of work and I have a harness that could run it and I can get reproducible results that are hopefully intersubjective so I can share it with my team and they can run it and they can also see that or that kind of thing before you even start making the first change or even
Starting point is 00:04:53 trying to understand the problem you just need to have a baseline right right so measure measure measure you know is the is the key there and then profiling is the next sort of of aspect of it. How do you know where you're spending time in your program? And that can be a surprising journey as you discover that actually it's not the computation. It's not the big O of n squared thing that you've got. It's the fact that unbeknownst to you, the particular function you're calling has to do a locale lookup every single time to parse a string and you didn't realize that it was just checking. I wonder if the user has switched into German locale and they're expensive use. It treats this identifier as a domain name and does a DNS lookup.
Starting point is 00:05:35 every time you call this function, right? Right, you know, yeah, yeah, exactly. Understanding the complex code that you have written and that you're interacting with. Yeah. And the surprising effects it has are key. So I realized we didn't really touch on that last time, but that would be like the first and most important thing
Starting point is 00:05:51 is measure everything. And then you kind of hit the sort of list of, you know, do I need to do this? Can I do this before? And then, you know, can I avoid it? Those kinds of things. I can't even remember that. Who was it?
Starting point is 00:06:04 Somebody we spoke to. once had a list of like the sort of like rules of optimization and sort of the zero rule is do I even need to do this at all. It's always, I'm getting ahead of myself. I'm getting, yeah, that's true.
Starting point is 00:06:17 That's true. Okay. So let me, let me, to kind of kick this conversation off, let me throw a scenario at you because everything you're saying here makes a ton of sense, right? And so like,
Starting point is 00:06:26 okay, yeah, measure, measure at least twice cut once, right? Like I'm going to measure this thing. I'm going to make sure that it is slow. I'm going to maybe have a profile that tells me that it's that it is the slow part in the context of the system. It's not that it has some, you know, slowness to it that I don't like. It's like, no, the reason that, you know, the bullet isn't traveling out of the gun fast enough is because this line of code or this method or this thing is slow.
Starting point is 00:06:54 So let's say that I'm like, all right, well, obviously what I want to do here is I want to measure it and then I want to change it. And then I want to measure it again and I want to make sure it got better, right? what do I do when I set up my little thing to measure it and I run it and I get one result and then I run it again after having made no changes and I get a totally different result? I mean despair. But be aware that that's very common, right? You know, it's benchmarking anything is incredibly difficult. And with all the variables that we've talked about, you know, the noise on your machine is perhaps a source of that, you know, making sure that there isn't an exogenous input that actually you're depending on, like you say, a DNS lookup that sometimes isn't in the cash, sometimes isn't in the cash, those kinds of things.
Starting point is 00:07:47 And then if you're on a shared machine, obviously, I guess it depends, right? Some aspects of performance are so large, like if you're measuring the throughput of a system and you're running it for like 20 minutes and you're getting results out of that, then usually the noise levels are low enough that it doesn't matter as much. But if you're doing something like, say, the collision detection for a video game, and you're like, okay, I've got a test, essentially, just like I would write a test to say, do these two things intersect, except that it's a little benchmark and I'm going to run the are-these-intersecting over and over again. And each time it takes, I don't know, 200 mics, something like that.
Starting point is 00:08:31 I run it once and it says 200 mics and I change nothing. I run it again and it says 10 mics. I'm like, hang on, this is this is bonkers. Now you're in sort of like the noise of the machine area and you are looking at things like the warmth of the cache, whether or not other processes are running at the same time as you, whether you're being kicked off the CPU, whether or not you've got slack open in the background
Starting point is 00:08:54 and someone's just posted an animating GIF and now 200% of your CPU is actually animating. a stupid cat. Those are the things that you immediately think about. If you can rule out some sort of internal differences like the DNS-related thing, or maybe there's even a cash file on the disk and whatever. And, you know, there are frameworks for running benchmarks that try to do their best to isolate your process.
Starting point is 00:09:24 They will warn you if you haven't turned off other certain aspects of your CPU. So, for example, the CPU can throttle its speed up and down for power saving reasons. So that might be a reason. You know, like the first time you ran it, it ran slowly and then it ran fast because actually the CPU woke up and was now faster, you know, the actual gigahertz clock frequency. So there's a ton of sources of noise at that level. And realistically, when I've done this kind of stuff for the more micro-optimizing stuff, it's best to try and have a dedicated resource somewhere. So we've had like CI machines before now that have been things that we can reserve and or run continuous performance monitoring on critical pieces of code and then just keep a graph of it. But it is noisy.
Starting point is 00:10:12 There are tools you can run. So I have a friend who swears by using Valgrind, which many people know as being like a memory checker for compiled languages. but there are other modes that it has. And one of the modes that it has is like to count the number of instructions executed. It kind of virtualizes the whole thing and it runs as a sort of an emulator of the very CPU you're running it on,
Starting point is 00:10:38 which is kind of how Valgrind itself works. But it gives you a deterministic how many instructions did this code path take, which is interesting. Yeah, so you're not measuring it by time anymore, right? You're not measuring it by time. And of course, that's not necessarily a perfect proxy because ultimately you want how long did it take
Starting point is 00:10:55 is your trying to optimize. But if you, for example, want to put something in CI where you could say, literally write a test that says, if this function takes more than 20 million CPU instructions, then we've probably done something dumb somewhere. Either we forgot on a Kamalai switch or someone added something huge to a function. So it's a very gross check,
Starting point is 00:11:19 but it is one that you could consider doing. And similarly, within that, there's a cash grind, which tries to estimate what a cache might look like. And if I actually somebody posted, I'm going to just go over to my other screen here, sorry for the sound changing. Someone's just posted a tool called Cash Explorer, which tries to, yeah, exactly,
Starting point is 00:11:41 tries to run a little bit of code and give you a visualization of what cash things have happened by simulating a cache. But again, we don't really know 100% what the caches are doing and AMD's caches are doing, but we can make some, you know, But those are the kinds of things that might be unstable between runs. Interesting.
Starting point is 00:12:03 So in that world, you are, and I think there's a few different things here. So you're talking about like you could have a continuously running performance test, continuously running in the sense of it runs on every change, runs in every build. Or once a night, you know, two in the morning. Yeah. Like that way you can say like, hey, no one is this is working right now. We've got all these resources that are dedicated. and there's no other noise coming from it.
Starting point is 00:12:26 Yeah. And, you know, obviously you can try running those sorts of tests on dedicated hardware. You can try running the tests by not measuring time at all and just by counting instructions, which should be relatively stable, you know, even if the cat GIF is running. And is your usual MO to then reuse those types of tests for performance optimization? Are you running the same tool that you would run in your CI process when you're like, oh, I need to optimize this method, or are you building things that are more bespoke? It depends, honestly.
Starting point is 00:13:00 There have been times when I've used the same tooling. Those micro benchmarks are a great source of things to look at. But it's really, really easy if you're not very careful. Again, it depends on which domain you're in and what's, you know, we're talking again from the kind of industries that we've been in, where we have a very, usually homogenous deployment environment. where it's like, well, we have many computers that all look the same. So they have this CPU in them and we have the luxury of targeting that CPU or,
Starting point is 00:13:30 you know, caring about this particular amount of RAM and number of CPUs and all that good stuff. Right. But typically developer workstations are not necessarily the same as the servers you're deploying to. And it's, it can be easy to fool yourself by running micro benchmarks on your local developer hardware and then be very, very sad when it doesn't pan out when you run it in production. So you really do have to be careful about. drawing too many conclusions from local stuff at the sort of scale that we're talking about with the hundreds of microseconds level.
Starting point is 00:13:59 So it does, it is antagonistic to doing like quick turnaround and sort of like TDD style. Hey, just make it go faster. You've really got to be able to run on representative data, which is hard because it's really, really easy to write a micro benchmark, which is representative on its face, but actually is, is a little bit too best case scenario. For example, C.P. CPUs have caches, we've just talked about, CPUs have branch predictors.
Starting point is 00:14:27 And so if you have a very branchy bit of code that you're running 100 million times in a tight loop with the same input data to sort of get you how fast this is going to be, A, you're hitting the same cash lines over and over again, B, the branch predictor has learned 100% of all of the branches in your benchmark and the benchmarking code around it.
Starting point is 00:14:45 And so it is a perfect representation of the best case. Now, that's an important thing to understand. like the min, this is something I see people do all sorts of distributions of these profiles. And like the min is an important point. This is like if everything lines up perfectly, then this is the theoretical best case that you can get. So the min gives you a piece of information, right? Then there's sort of like people talk about the mean. And I think that's a terrible representation of anything.
Starting point is 00:15:12 Because like there's so many tables in there usually. Right. But having an idea about what the distribution of your function looks like is important. And usually it's a horrible log normal thing that it's not very amenable to understand you. But like looking at it and kind of getting a sense of like, are most of them less than 100 bikes? Or is it just the average? Yeah. I don't even know what.
Starting point is 00:15:32 I'm not a statistician, but I do tend to look at plots and CDFs and things of that of what's coming out of the function and the test. But yeah, winding back around to the cash and the branch predictor, if you don't have representative data, then, yeah, you, you, you, you're, when you then deploy to production, then you discover that every time you call your function, it's the worst case because you haven't recently called it. The branch projector has lost, has forgotten about it, the cash is not warm,
Starting point is 00:16:00 then you're hitting the worst case of everything. And then that doesn't dovetail. You're like, well, I made this improvement, but you don't see it. So it's almost impossible to do a good microbenchmuffs. They can only be sort of representative. They can help you along the way. But you always have to run on real hardware,
Starting point is 00:16:17 and you always have to run like a real workload and measure something that you actually care about. You know, for us in our world, it's usually, you know, like the end-to-end latency of a system that we care about. You know, so you maybe say that taking, going back to our Redis example, let's say it was a bit more,
Starting point is 00:16:33 there was some bit more meat inside what it was actually doing. And you're like, well, I've optimized the hell out of this part, the bit that does a tree look up for the cash key. And then there's some complicated stuff that goes on. And then we return it. And there's this, you know, all this networking code that we've talked about, well, you could probably sit and write a really good hash map implementation and all the
Starting point is 00:16:54 benchmarks that go with it and think that you've done a great job. And then really, you still have to see it in context of in the actual Redis server, has it made a difference? Or is it actually no, well, this is great because it uses tons more RAM than before. And in my micro benchmark, that just works a treat because I've got all the cash to myself. But when I have to share the cash with the networking code that runs around it, then suddenly it's not good anymore. I realize that's an incredibly long and rambly answer to your point. No, no, that's, I mean, that's really good. And so, you know, I can go back and forth and argue one versus the other about whether
Starting point is 00:17:36 it's more important to have systems that are sort of, you know, we talk about systems that are designed to be tested, systems that are designed to be observable, you know, you you can maybe argue one dimension of observability is performance, I guess. And so you could say, like, is this system designed to be profiled, right? Like, can you measure its performance under real world conditions with real world data? And the trick, of course, there is that, you know, the more you care about performance, the more you care about not slowing down your performance by measuring your performance. And so having something that's running in the most real world environment,
Starting point is 00:18:12 which is to say the actual production environment that is measuring performance, you might be questioning, am I making my system slower by measuring how slow it is? So what are some tricks that you've used in the past to sort of break this? I don't know that it's an iron triangle, but no, that's interesting. Yeah, yeah, because we haven't really talked about that. We've been talking about sort of end-to-end measurements, be it because I'm running a micro-benchmark and calling a function
Starting point is 00:18:36 and then literally the micro-benchmarking system is doing that, like start the clock, call your function 100,000, time, stop the clock, that's how it takes. You know, that's, yeah, yeah. Or in the case of, you know, our Redis thing, it's like, well, I have, I start the clock, I send a request to Redis, I get the request back, I stop the clock, you know, that's how long a request takes. That's pretty obvious.
Starting point is 00:18:55 But yeah, if you need anything more fine grain than that, then usually some kind of instrumentation is what you want to put into your software. But to your point, somewhere along the line, that instrumentation itself takes time. And so you kind of put a Heisenberg effect of like, hey, I know exactly how long my program takes, and it's really, really long because it's measuring time to a first information, all it's doing is getting the current time
Starting point is 00:19:22 at any point in the program. And so there are lots of tricks you can do. I mean, there are definitely relatively low overhead timers that you can use. Again, if you're in the weeds of microseconds, then it's useful in a C++ context to write yourself a little RAAI-style class, which starts a clock in its constructor and stops the clock in its destructor,
Starting point is 00:19:51 and then it squirrels away the number somewhere so that you know how long something took. Maybe you can accumulate over time all of the calls that fall into this scope. And those timers can be measured using the CPU's own timestamp clock, which is a relatively low overhead measured in CPU cycles, number that you can then later on turn back into a clock number. I mean, it's also nowadays actually the Linux kernel, if we're talking about Linux, is pretty fast at getting the actual real time.
Starting point is 00:20:22 But the TSC is a pretty decent, a fast way of getting time. So that allows you accumulate blocks, and then you can at least sort of look for hotspots in that way. But then there are many tools that you can get that will give you more information, or you can write your own tools that sort of hierarchically arrange these, so you can end up with like flame graphs,
Starting point is 00:20:41 which are these time-divided, it's almost like stalagmites or stalactite-looking things. You know, flames, I suppose. That's what they look like, right? Certainly if you use the red colors. And so you can see hierarchically where time is being spent. And that's a useful thing to have. So you can say, hey, I spend all my time doing this thing.
Starting point is 00:20:59 But really in this function, it's because these three functions it calls are the things that take time. And you get that kind of view out of it. But there is a little bit of overhead there because you're measuring on the entrance and exit from those. those functions and maybe you have to write that down somewhere in like a log file or in share memory or something like that. So those are all valid ways of measuring, but they have different amounts of overhead. I was going to say there's one other sort of trick, which is that modern CPUs have
Starting point is 00:21:30 some amount of accounting that you can do within the CPU itself. And there are sort of two ways of viewing that. So a traditional profiler, which is, and by tradition, I mean, like every profile I'd used until fairly, you know, till sometime at previous company, every profile I use is a trick where you set a timer, like in the CPU timer, and then after every, you know, a thousand times a second, you just say, where the hell is the CPU right now? What program counter do I have? And then you do a little bit of work to find out what the stack is.
Starting point is 00:22:09 And then that gives you a sample that says, hey, you spend this much time in here. And then you can sort of infer if you can run your task long enough, then you get this great answer of like, well, you spend 30% of your time in Malok and 30% time in free. And you're like, well, maybe I need to think about my memory strategy. That's a great thing. And that's fantastic for throughput-based things where you have a continuous workload that you care about it all the time. You just want the whole thing to be faster. So you can put that on. It's relatively low overhead, you know, a thousand times a second.
Starting point is 00:22:39 sounds amazing and really, really often, but to a computer, that's like eternities between sample points. And so if you can run it long enough, you get great results. If you're worried about latency, though, that's not much use. You typically find that you're spending 99.99% of your time in the wait for something to do routine. And then the samples never actually land in the code that you care about. So in the reddish case, for example,
Starting point is 00:23:09 you would find that, you know, you've got to optimize your, your E-Pol routine. You're like, no, I haven't. E-Pol is just waiting for a packet to arrive. I only care about it when the packet comes in. It's like, well, that thousand times a second time and never went off while you were actually doing anything useful. Yeah. And that's where, like, the instrumentation I was just talking about coming to,
Starting point is 00:23:28 because there you get to put the instrumentation where you care about it. But modern CPUs have the ability to do, like, essentially record. Oh, that's great. So we've got a car alarm going off somewhere in the background. I know if you can hear it. I can. And I've got my noise cancelling headphones on. All right, well, sorry.
Starting point is 00:23:45 Sorry, editor Matt. You've now got something to worry about. But yeah, so the modern CPUs, Intel ones can do like a trace of execution. And so they can essentially instrument themselves. And every time a call happens or a return happens or a branch, a conditional branch is either taken or not taken, some amount of tiny information is recorded into a buffer. And then you can ask it to,
Starting point is 00:24:08 write out that buffer and then you can post process that and sort of infer what happened. And there are tooling. So Linux has Perth that can record this information. And then there's things like Magic Trace, which is like a set of user scripts over the top of it and a nice website that lets you view it in like a timeline diagram. And that's an eye open. Honestly, at that point, you've kind of got the whole world open to you and every single nanosecond is accounted for you. that's really quite something. So I forgot what you asked now.
Starting point is 00:24:44 Again, you've just, you just keep winding me up and watching me go. No, this is great. I mean, you answered my question exactly. So let's say that you do all these things and you find the part of your code that you're now fairly convinced is the reason that it's slow, right? And at the risk of having you just recount the compiler explorer origin story, now what? Well, so that's funny.
Starting point is 00:25:08 should say that. Yeah, the compiler explorer origin story is almost somewhat orthogonal, actually, in some ways. That was, yeah, a little bit. It's not what is what instructions is this code generating? Well, it was, but it was that was more like arguing over whether the compiler was smart enough to make human readable code as good as like the awkward code that you used to write because compilers weren't smart enough, right? And that's right. So we've, I mean, we're very much in the micro optimizations when we're talking about code generation. And honestly, that's the fun bit for me, but very often, it's like,
Starting point is 00:25:44 the things are doing something dumb, right? Don't be stupid on purpose. It's like the very first thing you should say. It's like, you know, I'm like, am I looking this up in a map over and over and over again when I could just look it up once and remember it. Right now, compilers are smart, but they're not necessarily smart enough to stop you from calling a whole function chain that is the look up something in the map.
Starting point is 00:26:05 It may not, you know, the compiler hasn't got perfect knowledge. It can't tell that like no one has changed that map since the last time you looked in it. And so this is the kind of thing you see. But they get these are relatively small wins sometimes. Algorithms are important, obviously. You know, if you're doing something dumb, if you're searching through a massive array of things when you could use an acceleration structure, be it at a hash map or something else, then that's important. But then at this granularity that we're sort of talking about, sometimes it is actually faster to go through a big array than it is to, like, jump randomly around in memory that would be like hash map potentially.
Starting point is 00:26:49 There's almost no case in which a linked list is the right answer. I think that's, yeah. I have one special case of where I think a link list is the right answer. I really want to hear this. When should one use a length list? So one should use a linked list when you need to keep an ordered list of things, not necessarily ordered as in like they, let's say insertion ordered list of things. So I've got a sequence of objects and I can only ever add them to the back of that list, right? but I can arbitrarily remove them out from the middle of the list.
Starting point is 00:27:34 And those, I'm given a unique identifier for that object by some external source that I have no control over. And it's associated forever with one of those objects. And I need to keep them in that sequence, right? But at any point, I can get a message saying, oh, number 7-9-2-3-4 is gone. And I need to be able to find it and remove it from that ordered sequence. but I need to keep the sequence. So you've kind of, in the traditional sense, you might have a hash map to find that object, which is great.
Starting point is 00:28:11 But if it's in the middle of a vector, like an array of memory that would be cool, you deleting it is not only really painful because you have to shuffle everyone beyond it down one, but you now have to go into the map of where everyone else is, and I kind of update them to be in the new location that they're in. So that's a pain. Having a doubly linked list is disgusting, but given a pointer to just that object itself, all you have to do is look at its next and prayer
Starting point is 00:28:46 and wire them up together and you're done. And you're done. And so that's one of those examples where the the worst case is for a single operation. So like if I say, so I mean, I'm alluding to this, this is me thinking about an order book. So in our world, finance. I was just going to say, this sounds like the priority queue for an order. Yeah.
Starting point is 00:29:12 It's a price level for, yeah, exactly. So you're told about these things and that happens. So you could imagine you've got thousands and thousands of orders on a single level. And it's important to keep them in that sequence. So you know their relative. priority. Whenever anyone trades, it's always the order at the front that trades away, which is kind of the worst case for if you did keep them in a vector. Yeah.
Starting point is 00:29:36 Right. But you could always store them backwards. And then that's a good case, right? Maybe that works. You know, that's sort of more like a stack basically. The ones at the back. Yeah, exactly. Which is not.
Starting point is 00:29:47 Yeah. And also there are things like decks, double-ended cues that have like this sort of capability of being like multiple slabs. which you can chain a link list effectively of slabs. You could chain a new link slab at the beginning to or unchain it moreover. So those are all great and cool and everything. And it's very much more cash efficient to have them laid out that way. But it does mean that the worst case is worse, right?
Starting point is 00:30:17 You know, like whatever it is, be it the front order or the back order, depending on which one you've optimized for, if you have to remove that one, then suddenly you're shuffling however many thousand other orders around potentially. So that's bad. It's probably the case that the linked-list-based version is always worse, but it's consistent: there isn't a bad case, right? Right, right. You get the same amount of worseness every time.
Starting point is 00:30:44 It's an interesting choice. And, you know, I've seen a number of book implementations in my time, and there's just different tradeoffs. It's a really interesting data structure to try and optimize and ask the question, why would you do it this way? I've just started working at a new company and I've seen another way that it can be done, which I can't talk about, but it's fascinating and I'm really, really excited by it. But, you know, that's how it is. So, you know, that's an interesting one. So anyway, the algorithm can be important.
Starting point is 00:31:12 You know, that's where I think we got onto this. And then, you know, caching, not doing work you don't have to do. So I think, yeah, we've said the questions you ask yourself are something like, do I actually need to do this at all for each thing? And oftentimes you discover, actually, why are we logging this out? Let's just take the log line out, right? Or put it behind a guard so it doesn't happen in production. It only turns on when we want that log or whatever. Especially if it's an expensive piece of work.
Starting point is 00:31:37 The classic example here is that you format a bunch of strings and then you pass them to debug log, which is not on in production. But you've already done all the work of formatting all the strings. And you're like, well, hang on a second, I don't need to do these. Another thing is, do I need to do this now? So can I cache this ahead of time? Is it just something I could do at program startup? Can I make a reasonable guess as to the values that will happen in this particular thing?
Starting point is 00:32:07 In which case, maybe I take some time at startup and I prepopulate a lookup table with all the possible values. And then I say, okay, well, it's just a lookup now at runtime. There are other tricks in that world where you can say, well, do I need a perfect answer to this now? Or can I defer updating something until later? Or can I have another thread post me back occasionally saying, hey, this is the amount of whatevers that are available? Can you be conservative and get away without doing the really complicated stuff on your hot path, and then push other stuff to other threads? Yeah, I'm just trying to think of the sort of obvious things here, you know.
Starting point is 00:32:46 So I feel like there's this almost myth of, and I have to wonder if anyone has ever come up to you with questions like this, the myth of the expert programmer that has, like, the one function that is taking all the time. And they're like, I'm going to rewrite this in assembly code, and it will be super fast because I have written it in assembly code. So, yeah. Does that actually happen? And when and why would that actually happen? So, yes.
Starting point is 00:33:21 Yes. But it's something you have to do very advisedly. I'm trying to think about how I can say this without breaking any confidences. But I'm aware of certain critical core loops in certain circumstances where compilers just aren't good at intuiting which variables are the important variables, and no amount of tagging them can convince them otherwise, and so they forever spill those things onto or off of the stack,
Starting point is 00:33:55 and in one specific case, rewriting the core loop of something, you know, and by a core loop, I mean it's a chunky piece of code, you know, pages of assembly at least. Writing that in assembly was a definite, definite win in terms of not spending half the time pushing and popping, or reading and writing to memory, which has its own issues to do with aliasing and stuff. We've grazed that before in some of the conversations we've had about the clever tricks, the register renaming that can happen if you use registers, but as soon as you put it into memory, a whole new system has to come in and it's a lot more complicated and slow.
Starting point is 00:34:28 Anyway, so, so yes. And there are also some other examples where if there are specific instructions, or sequences of instructions, that do specific things that you need to do, then sometimes cracking out the assembly is worth it. Although less often now that most of those instructions are available as intrinsics that you can call,
Starting point is 00:34:56 or library functions that have the same meaning behind the scenes, and have basically been invented to wrap the underlying hardware's capabilities. And then you can phrase things in maybe a slightly tortured way, where you have to call underscore-underscore-some-weird-thing, get some function and assign it to some weird type, and write your instructions, not instructions, write your code out longhand, where it's like underscore-underscore-add of bracket, underscore-underscore-sub of X, comma, Y, close bracket, comma, Z, that stuff, rather than the more natural phrasing. But, you know, you can get it to generate the right code.
Starting point is 00:35:31 But at that point, you're really just using the C compiler to write the assembly code that you would like it to write, and that's still better than writing the assembly. So to answer your question, yes. But it really, really has to be worth it, because more than anything else, you're trading off the ability to change and understand your code down the line. It's so fragile. It is so very, very fragile once you've written it in assembly. And you have to be so sure it's right.
Starting point is 00:36:01 So testing's important. And, you know, keeping a C version of it around that you occasionally race, with a new compiler, against your hand-optimized version. I was going to say, I feel like one of the things you give up by doing this is the ability to just upgrade your compiler and then all of a sudden it's faster, right? Right. Exactly. So I think the times that I've known that has worked, and this is not anything I've done,
Starting point is 00:36:24 for what it's worth, I haven't written a substantial amount of assembly code since the 90s. But where it has been known to work is where folks have taken the time to carefully write some assembly for something that's really, really important, and then keep a C version of it next to it, and then have, like, continual races of the C code versus the assembly code on upgrades, and also correctness checks between the two. Yeah, right. We run inputs through the C code and then we compare it against the output from
Starting point is 00:36:52 the assembly code, that kind of stuff. But yeah, it's tough, man. And I mean, another example of when you have to write assembly is if you're interacting with kernel magical stuff, where for some of the more esoteric profiling things you might have to do an instruction here and there. But yeah, usually no. Usually, in my experience, the flexibility you get of leaving it in C or C++, and having the ability
Starting point is 00:37:23 to quickly manipulate, move things around, change things around, play with compiler flags, and let essentially 40 years of other people's experience that has been poured into the heuristics of a compiler loose on my code, they're usually better than most things I can come up with. And the times that I've found I can substantially beat the compiler with assembly have been times where I have been unable to explain correctly to the compiler the unwritten constraints that I'm aware of that it is not, or vice versa.
Starting point is 00:37:55 It is being more conservative. It doesn't know that, writing through pointer X and then reading through pointer Y, those two things will never be the same address, but I do. And so if I write the code by hand, I'll read one into a register and keep it in a register the whole time. Whereas the compiler's like, well, every time you write to Y, I have to reread X again, because for all I know, Y points at X, that kind of thing. Yeah.
Starting point is 00:38:17 Gosh, that's a lot of things to talk about abstractly, 40 minutes into a podcast. Well, so I've got one, I don't know if this is, no, I'm just going to ask this question. So go ahead. We've been talking about perf, and using perf. A tool that I have also used for just answering generally the question of what is my code doing, for lots of reasons, performance optimization being one, is strace. Like, what system calls am I making? What are the arguments to those system calls?
Starting point is 00:38:50 When are you using one versus the other? When are you using perf? When are you using strace? What problems are you trying to solve when you use those tools? Yeah, that's a really good point, actually. strace is like the go-to, isn't it? If something's taking a long time and I don't have the source code to it, I'm going to strace that thing and see what the hell it's doing.
Starting point is 00:39:09 Going back to the, it's doing the DNS lookup, or it's doing the weird file system thing. Exactly. Right. Yeah. Exactly. And so if we're talking about very high performance things like we have been, then if you're doing it right, strace will show you nothing, because you shouldn't be interacting with the operating system. You know, the instant you're calling fwrite to write to a file, you've already lost, right? That's not a high-performance piece of code, right?
Starting point is 00:39:34 And that's perfectly valid. I mean, obviously, there are loads of bits of high-throughput code that are going to call fwrite. And then, you know, you do want to look at strace. So maybe that's your answer. If you're looking for latency, you won't see anything. But for throughput, it might be a really good indicator of, like, am I spending a lot of time waiting for the file write to finish happening?
Starting point is 00:39:54 So strace is very valuable for that kind of thing. There are other tools as well. So there's various eBPF-based stuff that I haven't had much personal experience with. So my knowledge goes back to SystemTap, which is, like, another similar thing, but you can hook into more parts than just system calls. You can say, like, hey, call this function, effectively in, like, a funny little scripting language, every time there's a page fault. And then I can aggregate page faults and I can see what the heck's happening there, and give
Starting point is 00:40:23 me the stack. And then I can work out, hey, wait a second, we've just allocated a massive slab of memory. And then we're getting all these page faults, what's going on? And you realize, oh, of course, although I think I've allocated a big slab of memory, the operating system has just given me a big gaping empty hole of virtual address space. And then every time I read or write to it, it decides to now fault in the 4K page. And that takes a bit of time. I'm like, no, no, no, I don't want to do this. It's a low latency thing. At the program startup, I'd like you to do all of that, please. And then I don't have to pay for it later on.
Starting point is 00:40:53 Those kinds of things. And those kinds of stuff you can find from, like, a SystemTap, or DTrace on other operating systems. Or, again, something-something eBPF. So, yeah, there are other tools available for sure. And I know that things like SystemTap and eBPF-based things can also patch function calls, both in the kernel and, I think, in your user code as well. You can put trace points in, and so you can kind of add some dynamic things where you can say, every time this function is called,
Starting point is 00:41:20 I want to know about it. I mean, like, funny, funny true story. Like, it's not the worst profiler in the world for throughput-based things to just run it in GDB and hit Control C and say, backtrace, where are you now? Okay, continue. You're sampling. You're just acting as a sampler.
Starting point is 00:41:36 Exactly. Sometimes it works. Exactly. And while we're just thinking about these things, another performance investigation technique is to just single step through your code. And if you get bored of stepping through stuff, then your CPU is taking too long. Sympathetic profiling.
Starting point is 00:41:57 How do I feel when this code runs? Do I feel fast? Does it feel bad? I mean, but no, joking aside, especially with the layers and layers of indirection that things like C++ can give you. And if it hasn't inlined everything, and the compiler hasn't heuristically determined that it's valuable to keep inlining until it nets out and says, well, actually, this is just, you know, return 2.
Starting point is 00:42:24 Then sometimes you'll find yourself stepping into functions that step into functions that step into functions, and you're like, what, where, how far down is this going? And then eventually you get to the return 2 and you're like, okay, I need to do something about this, right? This is not right. And again, I say this and people are like, oh, but in order to debug it, surely you had to do a debug build,
Starting point is 00:42:45 and debug builds, you know, aren't fast. It's like, no, we have to separate the idea of optimization, you know, the -O1, -O2, -O3 of, like, a C++ build, from leaving the debug symbols around, and knowing that you can still have a completely optimized binary that has at least somewhat useful debug information. So that although inlining has happened and code has been moved all over the shop,
Starting point is 00:43:09 and so if you actually single step through it, you'll see your poor cursor inside your source code jumping all over the place, but it still gives you some idea about what happened and where time is being spent and what things are going on. But yeah, so, you know, for me, I will always have debug symbols around
Starting point is 00:43:25 because it's like, why wouldn't I? Now, obviously, if we were shipping to external customers, then maybe we wouldn't send them. Yeah, you're leaking information, or maybe your binaries are just a little bigger that way, if you're sensitive about the space. Otherwise, why wouldn't you? You know, it can certainly slow the build down in some cases, because there's a lot of debug stuff to churn out. But there are clever linkers that can do tricks and things.
Starting point is 00:43:44 And yeah, so. But yeah, that's why strace. No, what was the question again? I love that point of, like, if you're in latency-sensitive code and strace gives you anything, well, there's your problem, right? Yeah. Like, that's great. Although, of course, you know, with multiple threads, you have to be a bit careful,
Starting point is 00:44:00 because other threads could be doing system calls, and that's fine. We didn't really talk about threads as well. There are all sorts of horrible thread-sharing-related issues that can come from that. And the fact that waiting on a mutex, or not waiting on a mutex, is sometimes an operating-system-level thing, and sometimes it's a futex, and sometimes it's sort of slightly outside of the kernel, so that things can be, you know. So it's complicated, man. I think that's the short version. There's a lot.
Starting point is 00:44:27 It's a deep topic. Well, we've gone 45 minutes on this so far. Yeah, I think there's a part three, isn't there? Part three? Maybe. I'm not going to commit to that yet, but that could possibly be a thing in the future. Yeah, we could certainly consider that.
Starting point is 00:44:41 We'll see how this goes out. Yeah. Well, we'd better stop because, yeah, I've got to edit this. And I'm lazy. I don't want to have to do more than 45 minutes. Have you any idea how awful it is listening to yourself gabble and realizing all the mistakes you made while editing? You know, there's only so much I can do in the edit to make myself sound good.
Starting point is 00:44:59 I'm trying to cut out this "um" that I said, and it's just like the worst thing in the whole world. Yeah, I'm sure our listeners have noticed that recently I've stopped cutting most of the things out, because life's too short. Yeah. Nah, it's better this way. Wholly natural. This is, yeah, our listener is sat at the table next to us in a restaurant while we're
Starting point is 00:45:19 having this kind of conversation anyway and just listening in. And that's fine by me. Yeah. It's a way to do it. All right. Should we call it there? Let's call it there, my friend. So I will see you next time.
Starting point is 00:45:29 Until next time. You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcript and notes at www.twoscomplement.org. Contact us on Mastodon. We are at twoscomplement at hachyderm.io. Our theme music is by Inverse Phase. Find out more at inversephase.com.
