Signals and Threads - Performance Engineering on Hard Mode with Andrew Hunter

Episode Date: November 28, 2023

Andrew Hunter makes code really, really fast. Before joining Jane Street, he worked for seven years at Google on multithreaded architecture, and was a tech lead for tcmalloc, Google’s world-class scalable malloc implementation. In this episode, Andrew and Ron discuss how, paradoxically, it can be easier to optimize systems at hyperscale because of the impact that even minuscule changes can have. Finding performance wins in trading systems—which operate at a smaller scale, but which have bursty, low-latency workloads—is often trickier. Andrew explains how he approaches the problem, including his favorite profiling techniques and tools for visualizing traces; the unique challenges of optimizing OCaml versus C++; and when you should and shouldn’t care about nanoseconds. They also touch on the joys of musical theater, and how to pass an interview when you’re sleep-deprived. You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
“Profiling a warehouse-scale computer”
Magic-trace
OODA loop

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack, from Jane Street. I'm Ron Minsky. It's my pleasure to introduce Andrew Hunter. Andrew is a software engineer who's worked here for the last five years, and he's currently working on our market data team. And Andrew is by inclination and background an expert performance engineer, and performance is really what we're going to talk about today. Andrew, just to start off, can you tell me a little bit about how you got into performance engineering and a little bit more about what your path was to getting into software engineering in general? I can, but I'm just going to be lying to you because the problem is I can give you all sorts of reasons why I do this, but the real reason is just like I find it really addictive. It's just hard for me not to get excited about how systems work. There's all
Starting point is 00:00:48 sorts of reasons why it's cool or complicated or fun, but I just get this like electric high when I make something faster. I guess it's not a path, right? Right. So I kind of know why you like it, but I'm curious how you got here. Like one thing that I think of as characteristic of performance engineering and actually lots of different software engineering disciplines is part of how you get really good at them is getting really interested in stuff that's objectively kind of boring. You get super psyched about the details of how CPUs work and interconnects and compilers and just all sorts of these little pieces.
Starting point is 00:01:15 There's just a lot of knowledge you need to build up over time to be really good at it. I'm always interested in what were people's paths where they built up that kind of knowledge and understanding and background that lets them really dive in deeply in the way that's necessary? Well, I think that's exactly right. I just have to care deeply about all the parts of the board. And the way I got there, there was a couple of interesting times when, like in college, for example, I was taking an operating systems class.
Starting point is 00:01:38 And I realized that the best way to study this and to learn it well was to just go into the details of the code. And it's like, whenever we're talking about a topic about virtual memory or whatever, I would go look at Linux's virtual memory implementation. And I'd see what it looked like. And I'd have more questions. And I just kept asking these questions. And I never said, well, that's out of scope. And I just kept finding these things interesting, right? And from then, I just realized that, like you say, all of these little details matter.
Starting point is 00:02:02 And if you keep caring about them and you just don't accept no for an answer, you get pushed towards the places where people really do care about this, which often means performance. And then once you start doing performance, you get that little high that I've talked about. So in what context did you first experience working in a kind of serious way on performance-sensitive systems? Grad school, at the very least, where one of the projects I worked on
Starting point is 00:02:25 was this big graph traversal system that was trying to like replicate some really weird, complicated hardware and do it in software, but like maintain reasonable levels of performance. And we just had to think really carefully about like, okay, how wide is this memory controller? How many accesses can it do in parallel?
Starting point is 00:02:39 What happens when one of them stalls? Wait, what does this even mean to have parallel memory accesses? How many cycles do each of these things take? Because we were roughly trying to replicate this really complicated chip in software, which meant you had to know exactly how would the original hardware have worked
Starting point is 00:02:52 and how did all the parts of it that you can replicate in software work? And you end up looking up all these bizarre details and you learn so much about it. Is that work that you went in thinking you would spend your time thinking about fancy mathy graph algorithms and ended up spending most of your time thinking about gritty operating system and hardware details?
Starting point is 00:03:07 A little bit. I definitely thought there was going to be a little bit more algorithmic content, but I really rapidly realized that the hard and interesting part here was in fact just, oh God, how do you keep this much stuff in flight? And the hardware has actually gotten way better or more aggressive about this sort of things over time, so I'm glad I learned that. So how did that work in grad school end up leading to you working professionally in this area?
Starting point is 00:03:28 Well, I was interning at Google at the time for the summers, and I kind of realized that I could do the same large-scale systems research that I was doing in grad school in a place that just had a lot more scale and a lot more usage of it, right? A lot of grad school research is done for the point of doing the research, whereas the proper industrial research, the coolest part is that it just shows up in production and suddenly people care about it. Yeah, and this is like a challenge for lots of different kinds of academic work where there's some part of it that really only connects and makes sense at certain scales.
Starting point is 00:03:56 And a lot of those scales are just only available inside of these very large organizations that are building enormous systems. Well, that's true. But I think even more than just the scale issue, it's the issue of what happens when this actually meets real data. And I think this isn't just true about performance. One thing I will tell, for example, the most common question I get towards the end of an internship is like,
Starting point is 00:04:13 what's going to be different when I come back as a full-timer? And what I tell them is that I was shocked the first time that I had a professional project in my first real job. And I finished it and we submitted it to the code base and it rolled into production and everyone was using it. And then a couple of weeks later, I got an IM from somebody in some other group saying, hey, we're using your system and it's
Starting point is 00:04:31 behaving weirdly in this way. What do you think about it? And my mental reaction to this was like, what do you mean? I turned it in. I got an A. You have to actually keep this going until you can hand it off to someone else or quit, right? Which is like, sounds depressing, but at the same time means you actually see what it does in reality and under fire. And then you learn how to make it even better and you get to do something even more complicated and just this virtuous cycle of optimizations that you get, or features depending on what you're working on, right? Yeah, this always strikes me when I go off and give lectures at various universities. It's really hard to teach software engineering in an academic context because there is a weird thing that all of the work writing software in a university
Starting point is 00:05:09 context is this weird kind of performance art where you're like, you create this piece of software, and then it gets graded, and poof, it vanishes like a puff of smoke. And this is just not what the real world is like. All right, so let's get back to performance engineering. One thing I'm curious about is you have a bunch of experience at Google thinking about the kind of performance engineering problems you ran into there and also thinking about it here in a number of different spots. I'm curious how those two different problems feel different to you. The difference between performance engineering at Google
Starting point is 00:05:38 and performance engineering at Jane Street, to me, is fundamentally one of leverage. The easy thing about performance engineering at a place like Google or any of the other hyperscalers is that they operate with so many machines consuming so many cycles, doing so many things that any optimization that moves the needle on how much CPU you consume is really valuable,
Starting point is 00:06:02 which means that it's very easy to find good targets. It's very easy to find things that are worth doing. And it may be very difficult to fix those problems. You have to think really carefully and you have to understand these systems. But the return on that investment means you can really support a lot of people who just sit there doing nothing else other than, how do I make memory allocation faster? How do I make serialization faster? What can I do in my compiler to just optimize code generation and compression and all these things? There's actually a really interesting paper. It's called Profiling a Warehouse Scale Computer, which looked at, okay, if you just look at all the things that a data center does for one of
Starting point is 00:06:39 these hyperscalers, the business logic is really, really, really diverse. Some things are serving videos of cats and some things are doing searches or social networking or whatever. And all of this does different stuff, but it all uses the same infrastructure. And it turns out that the infrastructure is a huge percentage. They coined the term that I like a lot, the data center tax and the 10, 15, 20% of your cycles that you spend on low-level infrastructure that everything uses. And it's not even that that infrastructure is bad or slow. It's just that that's the common link that scales, whereas fixing business logic kind of doesn't. It takes individual engineer effort on each individual piece of business logic,
Starting point is 00:07:18 but everyone's using the same compiler or one of a small set of compilers. Everyone's using just a small set of operating systems. And so you grab those lower levels of the infrastructure and you optimize them, and you can just improve everyone. Yeah, that's exactly right. And you can improve it enough by just moving things by half a percent, making logging cheaper. It just pays for itself way, way more easily than it does if you're not operating at that level of scale, which means that you get this nice cycle where you hit one hotspot and you point another profiler at the system as a whole, and you see the next hotspot, and you just get better and better at this just by doing the really
Starting point is 00:07:53 obvious thing that sits in front of your face. So I like to think of this as easy mode, not because the optimizations are easy or because the work is easy or because it doesn't take skill, but just because it's really clear what to do. It is a target-rich environment. It's a really target-rich environment. There's money falling from the sky if you make something faster. Right. And in some sense, this has to do with cost structure. This works best in an organization where a large amount of the costs are going to the actual physical hardware that you're deploying. Right. When we say that the business logic, quote unquote, doesn't matter,
Starting point is 00:08:23 what it really means is we just don't really care what you're working on. You can be serving videos of cats. You can be doing mail. You can do whatever you want. We don't really have to care. As long as you make something that everyone uses a little bit faster, it'll pay for itself.
Starting point is 00:08:35 Because the only thing you care about is the number of CPUs you buy, the amount of power you buy, and the amount of user queries you service. Those are the only three things that matter. It's not that the business logic doesn't matter. And in fact, optimizing the business logic might be the single most impactful thing you can do
Starting point is 00:08:49 to improve a given application. But it's harder. But the business logic doesn't matter to you because you are working in the bowels of the ship, fixing the things that affect everyone, and it's kind of impossible for you to, at scale, improve the business logic. So you are focused on improving the things you can improve.
Starting point is 00:09:04 That's exactly right. Great. So how is it different here? We don't have that scale. The amount of total compute that we spend is a fair bit of money, but it's not enough that making things 1% faster matters, which means that the average CPU cycle we spend is just not very important. It's kind of worthless. If you make logging faster, everyone's going to shrug and say,
Starting point is 00:09:22 okay, but that doesn't change the dial. In fact, a surprising thing is that most of our systems spend most of their CPU time intentionally doing nothing. It is just table stakes in a trading environment that you do user space polling IO. You just sit there spinning in a hard loop on the CPU, waiting for a packet to arrive. And most of the time there's nothing there. So if you point a profiler at a trading system, it's going to tell you it's spending 95%, 99% of its time doing nothing. And actually at this point, I want to push back a little bit on this narrative
Starting point is 00:09:52 because when you say most of the systems are doing nothing, it's not actually most of our systems. We actually have a really diverse set of different programs doing all sorts of different pieces. But a lot of the work that you're doing thinking about performance optimization is focused on specifically trading systems. And trading systems have exactly the character of what you're describing, which is they're sitting there consuming market data.
Starting point is 00:10:11 The whole idea, which you hear about a lot in large-scale traditional web and tech companies, of trying to build systems that have high utilization is totally crazy from our perspective. We're trying to get systems that have low latencies. They need to be able to quickly respond to things and also to be able to perform well when there are bursts of activity. So it means that most of the time when things are quiet, they need to be mostly idle. Yeah, it's definitely true that we have easy mode targets that we care a lot about at Jane Street. A really good example is historical research. If you are trying to run simulations of what some strategy might have done over the last 10 years, it turns out that's just a question of throughput. You can pile as much input data as you can on the hopper, and you see how many trades fall out the other side in the next 10 seconds.
Starting point is 00:10:52 And the faster that gets, the happier we are. You can just do the same easy mode tactics that you would on hyperscalers. But even there, the size of changes we chase is considerably larger. Yeah, you don't care about 1% because it's not the CPU you care about here, it's the user latency in some sense. It's whether or not the user gets a result in an hour or a day or a week. Just to speak for the compilers team, we totally care about a 1% improvement in code generation, but you don't care about it on its own.
Starting point is 00:11:19 You care about it in combination with a bunch of other changes because you want to build up bigger changes out of smaller ones. And if you look at a much larger organization, people are looking for compiler-level improvements that are an order of magnitude smaller than that. I sometimes have to push people about this, in fact. I sometimes have to say, oh, no, no, no. It's not that this optimization matters on its own.
Starting point is 00:11:37 It's not that this thing that I did that removes a cache line from each data structure is going to make our trading exponentially faster. It's that I'm in a long process of pulling out the slack from a system. And every time I do this, it gets a little bit better and everything is going to slowly converge to a good state. But it's hard to get the statistical power to say like any of these small changes matter. Sometimes I get pushback from people saying, well, did you measure this? I'm like, no, I didn't bother. I know it's below the noise floor, but I also know it's right. That sort of small change incrementally applied over time is really good.
Starting point is 00:12:05 But the hard part about it is you just have to have faith that you're going to get there and you have to have faith that you know that this is making an improvement or find ways you can test it in isolation. Whereas if you operate at the huge scale that some other people do, you can just look at a change. It could be five bips and you can know like, oh no, that's really real. And it's actually worth a lot of money in its own. I think, yeah, this is a problem that hits people who are thinking about performance
Starting point is 00:12:27 almost everywhere. It's kind of funny to me in that a common line of pushback I get from people who are not performance-focused people is like, well, I remember in undergrad when my professor said, well, you should never make a performance change without profiling it and knowing that it matters. And I'm like, no, no, I actually think that's wrong. If you know you are working on systems where this is important, you need to have a certain amount of self-discipline. And, you know, not where it's too costly,
Starting point is 00:12:47 it's going to make the system more dangerous or riskier or make your life worse, but make efficient choices as a matter of defaults in your brain. Right, and this is one of the reasons why I think performance engineering depends a lot on intuition and background and experience. And mechanical sympathy, knowing that you know deep down what the CPU is actually doing when you compile the code that you've got.
Starting point is 00:13:07 So let's actually stop for a second on that word, mechanical sympathy, which is a phrase I really like. Tell me what that phrase means to you. What that phrase means to me, I think a race car driver invented it, actually, is just having an innate knowledge or... Maybe not innate.
Starting point is 00:13:20 You probably weren't born with it. I don't know. Some people seem to come out of infancy just knowing these things. Did you not read books about CPU architecture to your children? I did not. What are you even doing? Lambda calculus. Oh, that tracks. I think that it's not innate, but this really unconscious knowledge of just, you know, when you look at code that this is how it structures on real systems, because different languages have very different models of how reality works,
Starting point is 00:13:43 but reality only has one model. As much as I love Lisp, the world and computers are not made of cons cells. They're made of big arrays, one big array, and some integer arithmetic that loads things from arrays. That's all a computer actually does, right? And you have to understand what is the model by which we get from that low-level thing to my high-level types with structure in them. And you have to understand what layouts mean and how this branching structure gets compiled into something that a CPU actually knows how to operate. And you can't just construct this from scratch every time you do it.
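To make the layout point concrete, here is a minimal OCaml sketch, not something from the episode: a float list is a chain of cons cells, each pointing at a separately boxed float, so summing it chases pointers around the heap, while a float array is one flat block of unboxed floats that the same arithmetic can stream through.

```ocaml
(* Illustrative sketch only: same sum, very different memory layout.
   [float list] = linked cons cells with boxed floats (pointer chasing);
   [float array] = one contiguous block of unboxed floats (cache friendly). *)

let sum_list (xs : float list) : float =
  List.fold_left ( +. ) 0.0 xs

let sum_array (xs : float array) : float =
  let total = ref 0.0 in
  Array.iter (fun x -> total := !total +. x) xs;
  !total

let () =
  let n = 1_000_000 in
  let arr = Array.init n (fun i -> float_of_int i) in
  let lst = Array.to_list arr in
  Printf.printf "list: %f, array: %f\n" (sum_list lst) (sum_array arr)
```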
Starting point is 00:14:13 You have to develop an intuition towards looking at something and knowing what that's going to be. So I asked you what the difference was between the easy mode performance optimization that you experienced at Google and this kind of harder-to-figure-out version of the story where you don't have the same kind of scale. I'd love to hear a little bit more about what is the texture of these problems? What kind of problems do you run into? What is interesting and hard about the version of the performance optimization problem you see here? Hard mode performance optimization, typically but not always, is a question of latency. And latency is a question of what is something doing at a really important time? Not what something is
Starting point is 00:14:49 doing in general, not what it does usually, but what it does when you care about it. Which means it's fundamentally a measurement problem. Because to measure general performance, what your system is doing, you point a profiler at it. You get an idea of it spending 20% of its time here and 10% of its time here and 5% of its time here. I don't care about any of those percents. I care about what was it doing for the nanosecond, the millisecond, the microsecond sometimes, or some of our worst systems, the second, that something interesting was happening. I care about what happens when it was sending an order or analyzing market data. I care only about that and I don't care about anything else. So how do I even know what it's doing at that point in time?
Starting point is 00:15:25 How do I measure this is really the key question. Got it. And maybe somewhat inherently that puts you in the opposite situation that you're in when you're looking at a very big organization where you're thinking about the low levels of the infrastructure and how to make them as fast as possible. Because if you want to know what's happening at a given point in time, you're somewhat unavoidably tied up in the business logic. You care about what is the thing that happens when the important decision is made, and what are the latencies that occur, and what are the code paths that drive that
Starting point is 00:15:52 latency. Is that a fair description? Yeah. It's not universally true. There's some really interesting cases where the infrastructure rears its ugly head in the middle of stuff you want to be doing otherwise, right? But it is generally, a large part of it is, in fact, just the business logic of how is this trading system making a decision? And you have to look at that, and that's what's happening at the interesting point of time, sort of by definition. So you talked about sampling profilers as one common tool. Can you actually just go in a little more detail of what is a sampling profiler, and how does it actually work at a low level? So there's a lot of different implementations of this, but the general shape of it is you take a system and you point a tool at it at a low level. here. And then it writes this down and it lets the program keep going. And profilers only really differ on how do they stop the world and how do they write this down. My favorite is the Linux
Starting point is 00:16:49 kernel profiler. It's called perf and it just uses a bunch of hardware features to get an interrupt at exactly the right moment in time. And then it just very quickly writes down the stack trace in this compressed format. It's very optimized. And then you take all these stack traces. A profile is really just a list of stack traces and sometimes a little bit of augmented information, but that's fundamentally the core idea. And then you present it to the user in some way that adds them up. And like I say, the key thing is it tells you, okay, 30% of the stack traces ended in the function foo.
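That "adds them up" step boils down to something like the following toy sketch (the sample data and function names are invented, and this is not how perf itself is implemented): a profile is a pile of stack traces, aggregated so you can report what fraction of samples ended in each function.

```ocaml
(* Toy sketch: aggregate sampled stack traces (outermost frame first) by their
   leaf function, so you can say "X% of samples ended in this function".
   The sample data is made up purely for illustration. *)

let samples : string list list =
  [ [ "main"; "parse"; "foo" ];
    [ "main"; "send_order"; "foo" ];
    [ "main"; "parse"; "bar" ];
    [ "main"; "foo" ] ]

let leaf_counts samples =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun stack ->
      match List.rev stack with
      | leaf :: _ ->
          let n = try Hashtbl.find tbl leaf with Not_found -> 0 in
          Hashtbl.replace tbl leaf (n + 1)
      | [] -> ())
    samples;
  tbl

let () =
  let total = float_of_int (List.length samples) in
  Hashtbl.iter
    (fun f n -> Printf.printf "%s: %.0f%% of samples\n" f (100. *. float_of_int n /. total))
    (leaf_counts samples)
```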
Starting point is 00:17:16 That's a hotspot. You're spending 30% of your time there. But there's all these different kernel counters that you can use for driving when you're doing the sampling. How does the choice of kernel counter affect the nature of the information you're getting out of the profiler? Yeah. People tend to think about sampling profiles in time, where the counter is just number of cycles that's elapsed. But one of the cool things about it, it lets you sample on L2 cache misses or branch prediction misses or any of these weird architectural events. And so you get a profile of when did these interesting things happen? And you know, each of them is costly and they probably
Starting point is 00:17:47 have some cost in cycles, but you can get much more precise measurements. And in particular, the nice thing about it is, you know, that 10%, let's say of your program is slowed down by branch prediction misses. But if you just look at the cycles, you're just going to see like, well, it's somewhere in this function. If you profile on branch misses, you will see the branch that is hard to predict. And you can actually do something about that branch. Got it. So branch mispredictions is one. What's like the next most interesting counter that you might use?
Starting point is 00:18:11 Actually, the next most interesting thing isn't even branch prediction. It isn't even a hardware counter. The next most interesting thing to profile on is the unit of memory allocation. A lot of allocators, like in fact, the one we have in NoCaml, but also various C++ ones, will let you get a profile, not out of perf, but out of like kind of done in software, that tells you where were you allocating memory the most. Because that's just a common thing that's very costly. And reducing that, it can really improve the performance of a system. Right. And this comes down to something that we see in OCaml a lot, which is when we write really,
Starting point is 00:18:39 really high performance systems, we often try to bring the amount of heap allocation we do all the way down to zero. We try. Right. It's hard to get it all the way down to zero. Something's misbehaving in a system performance-wise. A relatively common problem is there's a thing that shouldn't be allocating that is in the hot path. Yeah, that's right. And in a good optimized C++ system, it should be spending 5%, 10% of its time memory allocating. And sometimes you just have to do this. It's necessary. But maybe you're allocating twice as much as you really need to be. And you can look at a memory profile
Starting point is 00:19:08 and take a look at it. It's important to remember that profiles aren't just about time. They're about measuring the overall consumption or use of resources. Okay, so that's how Perf essentially works to get you the information that you need. But there's a fundamental trade-off
Starting point is 00:19:20 in a tool like Perf where it's sampling information. And that itself is essentially a performance optimization, right? Part of the reason that you sample rather than capturing all of the information is you want to make it fast, right? You want to not distort the performance behavior of the program by grabbing information out of it. But in some sense, there's a real tension there because you're saying, I don't want like some broad statistical sampling of the overall behavior of my program. I want to know in detail how it's behaving in the times that matter most. I think a really instructive
Starting point is 00:19:49 example of that was an optimization that I hit on last year where we had a system that was trying to send some orders, right? And it was doing it in a totally reasonable way. And it was doing it in a really cheap way. The thinking about whether or not I want to send an order was really cheap. It happened really fast. The physical sending of the order, really cheap, happened really fast. The thinking, what do you mean by the thinking? Do you mean like the business logic to decide the looking at the markets and saying, oh yeah, I want to buy, right? Or, you know, like, and then do the physical act of, you know, sending the message, right? Both of these were really cheap. If you pointed a profiler at them, even a profiler that was like magically restricted to
Starting point is 00:20:23 the times of interest would tell you, yep, 5% of your time was doing this. It's all good. That's not a hotspot. Here's the problem. The order sending was happening 200 microseconds after the thinking was. And the reason was it was being put on a low priority queue of work to do later. It was a misconfiguration of the system. It was using this kind of older API that it needed for complicated reasons that did reasonable things under most circumstances, but it assumed that network traffic you wanted to send couldn't be that latency sensitive. So it just waited to do it on an idle moment. And this was not a good thing to wait about. The profiler tells you nothing about this because I didn't care about
Starting point is 00:20:57 the overall cost. I didn't care about overall time. I cared that it happened promptly. And so fixing this again was really easy. I just switched to an eager API that didn't wait, but a profiler tells you nothing about this. So what kind of tools do you use to uncover things like that? Magic trace. So what's magic trace? Magic trace is a tool we wrote that gives you a view into what a system was doing over a short window of the past. It's retrospective in some sense. And what I mean is that any point in time, you can just yell, stop, tell me what you were doing for the last two milliseconds, three milliseconds, maybe. And you write it down. And exactly like you said earlier, this is not a profile. This is not some statistical average of
Starting point is 00:21:34 what you're doing at various times. This is the exact sequence of where your code went. It went from this function to this function to this function. And you get a different visualization of it that just shows you what things are happening over time. And in fact, exactly like you say, there's more overhead for using this. But it gives you this really direct view into what happened at an interesting time that a profiler fundamentally can't give. Traces are in some senses really better.
Starting point is 00:21:56 In fact, traces aren't restricted to magic trace. I said there's memory profiles that are really useful. Memory allocation traces are another thing that we care about a lot. We have a really good one, in fact, that gives you the literal trace of... Although a memory profiler is actually statistical also, right? That's a sampling profiler.
Starting point is 00:22:09 A memory tracing is, in fact, it's not a profiler, it's a tracer, right? Maybe it's misnamed. Actually, there's a lot of annoying sort of terminological fuzz around this. I think people are often... We are not unique in this, I'll say. Right. At least the terms I've come to like the most are people use profilers for what you might call statistical profilers,
Starting point is 00:22:23 and then people use tracing when they're talking about capturing all of the data. So a common form of tracing that shows up all over the place is there's all these nice systems for doing RPC tracing, where you write down the details of every message that goes by, and this is sort of a common thing to do if you want to debug, why was that web query slow? And some query came in and it kicked off a bunch of RPCs, it kicked off another bunch of RPCs, it kicked off another bunch of RPCs,
Starting point is 00:22:46 and you can pull up an API that lets you see the whole cascade of messages that were sent. So that's a nice example of tracing. And then we also, as you mentioned, we have a thing called memtrace, which sadly I think is actually a profiler in that it is a statistical sample and does not actually capture everything.
Starting point is 00:23:01 But it does give you a time series of events, which is a key thing that a profiler can't. That's interesting. I guess in some sense, all of these systems start by giving you a time series of events, and then it's how you process them, right? Perf is sampling at some rate and grabbing data,
Starting point is 00:23:17 and then you turn that into an overall statistical summary. But you could look at that information temporally. You just don't. And in any case, the information is sampled. I think what I just said is also true about memtrace. You get this information, just like perf, sampled randomly from the memory allocations.
Starting point is 00:23:33 Then you can interpret it statistically as being about the overall state of the heap. A key difference here is that memtrace gives you all of the information about a statistical sample of some of the allocations. It tells you when this guy was allocated and then freed. That's true. Whereas you might get a set of stacks out of perf that here are some allocations and here are some frees, but you have no guarantee it's the same thing.
Starting point is 00:23:55 This lifecycle example, it's exactly like the RPCs. A thing people frequently do is just capture traces for 1% of their RPCs, and giving the whole lifecycle of an individual one is way more interesting than 1% of the individual moments. Yeah. I mean, maybe this just highlights, it's a total terminological car crash. It's a little hard to separate it out. All of these tools are way too hard to use and very inexact, and all of them use the same terminology in different ways. Okay. So we've talked about the trace part of the name magic trace, right? And the key thing there is it's not just sampling, it's giving you the complete summary of the behavior. It's maybe worth talking a little
Starting point is 00:24:27 bit about the magic, which is how do you do this thing? You just said, oh, something interesting happened. And then you retrospectively look at what happened before and grab it. How can that be done in an efficient way? What's mechanically happening here? Magic. No, there's two parts to this, right? First is how do you get the data? And the second part is how do you decide when to take the sample? Let's take those in order. So how do you get the data? Well, I'm really just going to call it magic because I don't know how they managed to do this efficiently. But Intel has this technology.
Starting point is 00:24:54 It's called processor trace. It just keeps a ring buffer of everything the CPU does in a really compressed format. Like it uses one bit per branch or something along those lines. And it just continually writes this down. And the ring buffer is the size that it is. And it contains some amount of history. In practice, it's a couple milliseconds. And at any point in time, you snap your fingers and say, give me that ring buffer.
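As a rough software analogy of that "always recording, snapshot on demand" shape, a ring buffer of recent events might look like the sketch below. This is explicitly not how Intel Processor Trace works internally; the hardware records control flow far more compactly and cheaply, as the next exchange emphasizes.

```ocaml
(* Toy analogy only: keep overwriting a fixed-size ring of recent events, and
   when something interesting happens, snapshot the retained history. *)

type 'a ring = { buf : 'a option array; mutable next : int }

let create n = { buf = Array.make n None; next = 0 }

let record r event =
  r.buf.(r.next) <- Some event;
  r.next <- (r.next + 1) mod Array.length r.buf

(* Read back the retained history, oldest event first. *)
let snapshot r =
  let n = Array.length r.buf in
  let rec go i acc =
    if i = n then acc
    else
      let slot = r.buf.((r.next + i) mod n) in
      go (i + 1) (match slot with None -> acc | Some e -> e :: acc)
  in
  List.rev (go 0 [])
```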
Starting point is 00:25:13 Right. And the critical thing is this is a feature integrated into the hardware. Oh, yeah. We couldn't possibly implement this. The kernel couldn't possibly implement this. Like this is in the silicon. And it's a huge advantage for Intel processors. Yeah, although I really do not understand Intel's business strategy around this, which is to say they built this amazing facility in their CPUs, but they never released an open
Starting point is 00:25:35 source toolkit that makes it really easy to use. They did a lot of great work at the perf level, so perf has integrated support for processor trace. In fact, we relied heavily on that. But I think Magic Trace is the first thing that's actually a nice usable toolkit built around this. Like there's various companies that have built internal versions of this. But it seems like such a great competitive advantage. I'm surprised that Intel hasn't invested more in building nice, easy to use tools for it. Because it's a real differentiator compared to, say, AMD chips. There are a lot of performance analysis tools in the world, and there's very limited hours of the day.
Starting point is 00:26:05 I really don't feel like I know what everything in the world does. But I generally agree with you that a really underinvested thing is good, easy-to-use, obvious, idiot-proof APIs, right? And tools that just work in the obvious way you want them to. I was mentioning before how part of being a good performance engineer is building up a lot of experience and knowledge and intuition. Another part of it is just building encyclopedic knowledge of all the bizarre ins and outs of the somewhat awkward tooling for doing performance analysis. Perf is a great tool. In many ways,
Starting point is 00:26:34 it's beautifully well-engineered, but the user experience is designed for experts. And the command lines kind of work the way they do. And sometimes their meanings have evolved over time. And the flag still says the old thing, but you have to know that it has the new meaning. And there's really, I think, a lot of space for building tools that you just like turn on and hit the button. And it just does the obvious thing for you and gives you the result. It's a user interface built by experts for experts. And I think it's easy for them to forget that most of the people have not used it. Just like you say, a lot of random esoteric knowledge about what CPUs do.
Starting point is 00:27:06 And I also just have a lot of memorized little command lines that I happen to know will point at certain problems. It's just an issue of having done this a lot. And I don't know a good way of teaching this other than getting people to do the reps. And a better way would be to give them tools that just give them obvious defaults. But I haven't figured out how to do this universally. Right. But Magic Trace is one good example of a tool like that,
Starting point is 00:27:25 where the defaults have been pretty carefully worked out, the UX is really nice, and you can just use it. It's not perfect, it doesn't work in all the contexts, but it usually gives you something usable without a lot of thinking. It is exactly like all the best tools I know in that I'm frequently furious at it for not doing quite the right thing,
Starting point is 00:27:40 and that's a sign of how much I want to use it. I would just, oh, can it also do this? Can it also do this? Can it also do this? But yeah, I mostly just use it in the obvious way with the one obvious flag. It gives me what I want to know. One of the interesting things about a tool like Magic Trace
Starting point is 00:27:53 is you've told a whole narrative. When doing easy mode optimization, you care about these broad-based things. Sampling profilers are mostly the right tools. When you care about focused latency-oriented performance, then you want this kind of narrowed-in-time analysis tools, and Magic Trace is the right thing for that. But a thing that's actually struck me by seeing people using Magic Trace inside of the organization is it's often a better tool for doing easy mode-style optimization than perf is. Because
Starting point is 00:28:22 just the fact that you get all of the data, every single event in order and when it happened, with precision of just like a handful of nanoseconds, makes the results just a lot easier to interpret. I feel like when you get results from perf, there's a certain amount of thinking and interpretation: how do I infer from this kind of statistical sample of what's going on what my program was actually probably doing, and where really is the hotspot. But with magic trace, you can often just see in bright colors exactly what's happening. I've seen people just take a magic trace at a totally random point in the day and use that and be able to learn more from that than they're able to learn from looking at perf data.
Starting point is 00:29:03 you can get a bunch of extra information. I think one of the really good examples is that a profiler tells you you're spending 40% of your time in the hotspot of the send order function, right? But here's an interesting question. Is that one call to send order that's taking forever, or is that 1,000 calls each which is cheap, but why are you doing this in a tight loop? And it turns out, you know, it's really easy to make the mistake where you're calling some function in a tight loop you didn't intend to. In a profile, these two things look pretty identical. There are really esoteric tricks you can use to tease this out. But in Magic Trace, you just see, oh God, that's really obvious. There's a tight loop where it has a thousand function calls right in front of one another. That's embarrassing. We should fix it, right? You actually develop this weird intuition for looking at the shape of the trace that you get, the physical shape on your screen and the visualization. And like, oh, that weird tower,
Starting point is 00:29:48 I'm clearly calling a recursive function a thousand deep. That doesn't seem right. You get to see a lot of these things. Yeah. It makes me wonder whether or not there's more space in the set of tools that we use for turning the dial back and forth between stuff that's more about getting broad statistical samples and stuff that gives you this more detailed analysis and just more in the way of visualizations, more ways of taking the data and throwing it up in a graph or a picture that gives you more intuition about what's going on. A huge fraction of what I do in my day-to-day is visualization work. Rather than looking at them, we're trying to build better ones.
Starting point is 00:30:20 It's really important, and we have barely scratched the surface of how to visualize even the simplest profiles. Yeah, one thing I've heard you rant a lot over time is that probably the most common visualization that people use for analyzing performance is flame graphs. And flame graphs are great. They're easy to understand. They're pretty easy to use. But they also drop some important information, and you're a big advocate of Pprof, which is a tool that has a totally different way of visualizing performance data. Can you give a quick testimonial for why you think people should use Pprof more?
Starting point is 00:30:49 Yeah. Flame graphs were way better than anything that came before it, by the way. They were a revelation when they were invented, I think. And so this is one of those things you have to be really careful not to say something is bad. I just think something better has been invented, right? So it's called a flame graph because it's a linear line that has a bunch of things going up out of it that look like flames. And what this is, is that- And people often use orange and red as the color. So it really looks like flames. Exactly, right? It wouldn't look as cool if it was green, right? And so, you know, the first level of this is, you know, broken down 40%, 60%, and that 40% of your stack traces start this way and 60% start this way. And then the next level is just each of
Starting point is 00:31:20 those gets refined up. And so every stack trace corresponds to an individual peak on this mountain range. And then the width of that peak is how many stack traces looked like that. So this is good. It tells you where is your time going. And one nice property is it at a glance, it makes it really easy to intuitively see the percentages, right? Because the width of the lines as compared to the width of the overall plot gives you the percentage that that part of the flame graph is responsible. Yeah. If there's one really fat mountain that's 60% wide, you know what you're doing. It's that. Here's the problem with it. It misses a thing I like to call join points, which are points where stack traces start differently and then reach the same important thing. Because what happens is, suppose you've got 15 or 16 little peaks, none of which is that big,
Starting point is 00:32:01 and then right at the tippy top of each of them in tiny, tiny, narrow things, it's all coming calling the same function. It'd be really easy to dismiss that. You don't even notice the thing at the top, but it turns out if you add them all together, that's 40% of your time. And different visualizations can really show you everything comes together here. So how does Pprof try and deal with this? So we're now going to try to proceed to describe a representation of a directed acyclic graph over a podcast, which of all the dumb ways people have visualized directed acyclic graphs might be the worst. But what it does is it draws a little bag on your screen where each node is a function you end up calling and each arrow is a path. You can imagine that you have one node for every function you're ever in. And then for each stack trace, you just draw a line through each node in order and they go towards the things that
Starting point is 00:32:49 were called. And then you highlight each node with the total percentage of time you spent there. And you put some colors on it, like you say, and you make the arrows thicker or thinner for bigger or smaller weights. And that's the basic idea. And so if you close your eyes and imagine with me for a second, I claim that what will happen in the scenario I described is that you'll see a bunch of bushy, random, small paths at the top of your screen, and then a bunch of arrows all converge on one function. And that function now is really obviously the problem.
Starting point is 00:33:19 And then underneath, it'll also branch out to lots of different things. Yeah, that's actually a really good point because it tells you, in fact, that maybe it's not that function that is the time, it's the things it calls. But now you at least know where all this is coming from. And as a single example of this, it is the most common thing, at least if you're working in, say, C++, the function's always malloc. Oh, interesting.
Starting point is 00:33:38 Because like I said, with, you know, the business logic may be very diverse, but everything allocates memory. Because it's really easy to realize, oh, it's doing a little bit of malloc here. It's doing a little bit of malloc here. It's doing a little bit of malloc here. And I guess this ties a little bit into what you were talking about of the style of trying to look at the foundations and get the common things that everyone sees. It becomes really important to see those common things. Although I would have thought one of the ways you could deal with this just with flame graphs is there's just two different orientations. You can take your flame graph and turn it upside down. And so you can decide which do you want to
Starting point is 00:34:07 prioritize thinking about the top of the stack or the bottom of the stack. It's the bush below malloc that screws you over there. It's when the thing's in the middle that you get problematic. It's when the really key function is in the middle where everything converges through one particular, it's not a bottleneck, but you can think about it in a bottleneck in the graph, that the flame graphs really fail. And to their credit, people have built flame graph libraries where you can say, give me a flame graph that goes up and down from malloc, but you need to know to focus
Starting point is 00:34:32 on that function. I see. And so Pprof somehow has some way of essentially figuring out what are the natural join points and presenting them. I think that it outsources that problem to the graph drawing library that tries to do various heuristics for how people view graphs. It tends to work.
Starting point is 00:34:48 I've looked at flame graphs and I've looked at P-PROF visualizations. I do think P-PROF visualizations are a little bit harder to intuitively grok what's going on. So I feel like there's probably some space there yet to improve the visualization to make it a little more intuitively clear. I definitely agree. I think that just like we were saying earlier, this is one of those things that experts know, new people don't. You just kind of have to get used to staring at these for a couple days and then you get used to it.
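To make the join-point idea concrete, here is a rough sketch with invented samples (the frame names are hypothetical, and OCaml programs don't literally call malloc; read it as "the common function everything funnels through"): counting each sample against every function on its stack makes a function reached from many unrelated callers show up with a big total, even though every individual flame-graph peak containing it is tiny.

```ocaml
(* Toy sketch of a "join point": count each function once per sample whose
   stack passes through it (inclusive counting). A function reached from many
   different callers is small in every individual peak but large in this
   per-node view. Sample data is invented. *)

let samples : string list list =
  [ [ "main"; "parse_fix"; "malloc" ];
    [ "main"; "build_order"; "to_string"; "malloc" ];
    [ "main"; "log_fill"; "malloc" ];
    [ "main"; "compute_signal" ] ]

let inclusive_counts samples =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun stack ->
      (* dedupe within one stack so recursion isn't double-counted *)
      List.sort_uniq compare stack
      |> List.iter (fun f ->
             let n = try Hashtbl.find tbl f with Not_found -> 0 in
             Hashtbl.replace tbl f (n + 1)))
    samples;
  tbl

let () =
  let total = float_of_int (List.length samples) in
  Hashtbl.iter
    (fun f n ->
      Printf.printf "%-16s %.0f%% of samples\n" f (100. *. float_of_int n /. total))
    (inclusive_counts samples)
```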
Starting point is 00:35:08 But it would be nice if we didn't have to. It would be nice if it was just as obvious as the other things are. So we've talked a bunch here about performance engineering tools and measurement tools more than anything else, which I think makes sense. I think the core of performance engineering is really measurement. And we've talked about ways in which these focused tracing tools like MagicTrace, in some ways, can kind of outperform sampling tools for a lot of the problems that we run into. What are the kinds of cases where you think sampling tools are better? To me, the case I wish I could use sampling the most is once I've identified a range of interest. A thing we care about a lot when we think about latency and optimization of trading systems is tails. If your system,
Starting point is 00:35:45 99% of the time responds in 5 microseconds and then 1% of the time responds in a millisecond, that's not great because, you know, you don't always get a choice of which of those is the one you trade on, right? Right, and also there's the usual correlation of the busiest times are often the times that are best to be fast at.
Starting point is 00:36:02 That's right. And so exactly where it's bad is the case where you care the most. And Magic Trace is pretty good at finding tails because you can just do some interesting tricks and hacks to get Magic Trace to sample at a time where you're in a 99th percentile tail. Now, sometimes you look at those tails in Magic Trace and you see, oh, I stopped the world to do a major GC. I should maybe avoid that.
Starting point is 00:36:21 I need to allocate less. Or some other bizarre, weird event. Sometimes it's a strange thing that is happening. But a remarkably common pattern we see is that your tails aren't weird, aren't interesting. They're just your medians repeated over and over. And like you said, you're having a tail because the market is very busy, because it's very interesting, because you saw a rapid burst of messages. And each of them takes you a microsecond to process, but that's not the budget you have. You have 800 nanoseconds, and you're just falling behind.
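A back-of-the-envelope version of that falling-behind effect, using the illustrative numbers above rather than any real measurement: if messages in a burst arrive every a nanoseconds and each takes s > a nanoseconds to process, the backlog grows linearly, so

```latex
W_k \approx k\,(s - a)
\quad\Longrightarrow\quad
W_{10\,000} \approx 10\,000 \times (1000\,\mathrm{ns} - 800\,\mathrm{ns}) = 2\,\mathrm{ms}
```

i.e. the 10,000th packet of the burst is already a couple of milliseconds late even though every single packet was processed at the "good" median speed.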
Starting point is 00:36:49 So this is just a classic queuing theory result. It's nothing but queuing theory. Get lots of data in, you're going to have large tails when it all piles up in time. So you pull up this magic trace and you say, oh, I processed 10,000 packets. What do you do now? And sometimes you can think of obvious solutions.
Starting point is 00:37:04 Sometimes you realize, oh, I can know I'm in this case and wait and process all of these in a batch. That's a great optimization that we do all the time. But sometimes you're just, oh, wow, I just really need to make each of these packet processes 20% faster. And what I really wish you could do is take that magic trace and select a range and say, hey, show that to me in Pprof. Because it's just like the flame graph. You have all these little, little tiny chunks in the magic trace, and I really want to see them aggregated. And you have some trouble doing it.
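A minimal sketch of that batching idea, combined with the user-space spin-polling described earlier; the poll and handle_batch functions here are hypothetical stand-ins, not real APIs from the episode. The point is to drain everything that has already arrived and pay the expensive follow-up work once per burst instead of once per packet.

```ocaml
(* Sketch only: spin on the CPU waiting for packets; when some have arrived,
   drain everything already queued and process the burst as one batch.
   [poll] and [handle_batch] are hypothetical placeholders. *)

let rec drain poll acc =
  match poll () with
  | None -> List.rev acc
  | Some packet -> drain poll (packet :: acc)

let run ~poll ~handle_batch =
  while true do
    match poll () with
    | None -> ()                                   (* nothing there: keep spinning *)
    | Some first -> handle_batch (first :: drain poll [])
  done
```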
Starting point is 00:37:31 So I asked one question, and I think you answered a different one. I do that a lot. I asked the question of when do you want sampling. You answered the question of when do you want a profile view. And I totally get that, and that seems like a super natural thing and maybe like a nice feature request for the magic trace folk. Have a nice way of flipping it into the profile view. But I really want to poke at the other questions.
Starting point is 00:37:50 When is sampling itself a better technique versus this kind of exhaustive tracing that magic trace is mostly built around? One easy answer is the easy mode problems we have. Things like historical research, things like training machine learning models, things that really are throughput problems. And we do have throughput problems. And there, it's just easier to look at the sampling profiler, or you can really target it at the cache misses or whatever you want. So that's a case where we definitely want it. And maybe another thing to say about it is it is definitively cheaper. The whole point of sampling is you're not grabbing the data all the time. I guess we didn't talk about this explicitly, but Intel processor trace, you turn it on and you end up eating five to 15% or something of the performance of the
Starting point is 00:38:29 program. Like there is a material. Don't say that out loud or they'll stop letting me use it on our trading systems. I mean, it is the case that we don't just like have it turned on everywhere all the time, right? We turn it on when we want it. And that's a thing that the hyperscalers do. They just leave a sampling profiler on across the fleet, just getting, you know, 1% of things. And that gives you kind of a great sense of what the overall world is doing. And that is actually a thing I kind of wish we had. It would be less valuable for us than it would be for them. But I would love if I could just kind of look at it like a global view of hotspots. I think the best thing in the world would be like, can I get a sampled profile of all the things all of our trading systems did when they weren't
Starting point is 00:39:05 spinning idly? If I could know that, oh, overall, if I could get a little bit more return from optimizing the order sending code versus the market data parsing code, I think that would be a really valuable thing to me. So another interesting thing about the way in which we approach and think about performance is our choice of programming language, right? We are not using any of the languages that people typically use for doing this kind of stuff. We're not programming in C or C++ or Rust. We're writing all of our systems in OCaml. And that changes the structure of the work in some ways.
Starting point is 00:39:33 And I'm kind of curious how that feels from your perspective as someone who like very much comes from a C++ background and is dropped in weird functional programming land. What have you learned about how we approach these problems? What do you think about the trade-offs here? Well, the best thing about it is employment guarantee. Anyone can write fast C++, but it takes a real expert to write fast OCaml, right? You can't fire
Starting point is 00:39:52 me. Although I think that's actually totally not true. I didn't mean the part about firing you, but the point about writing fast C++, I actually think there's a kind of naive idea of, oh, anyone can write fast C++. And it's like, oh man, there's a lot of ways of writing really slow C++. And actually a lot of the things that you need to get right when designing high-performance systems are picking the right architecture, right way of distributing the job over multiple processes, figuring out how to structure the data, structure the process of pulling things in and out. There are lots of stories of people saying, oh, we'll make it faster in C or C++. And sometimes you implement it in some other language and it can be made faster still because often the design and the
Starting point is 00:40:27 details of how you build it can dominate the language choice. I think it's really easy for people who are performance obsessed like myself to just get a little too focused on, oh, I'm going to make this function faster. And maybe the better answer is, can we avoid that function being called? Can we like not listen to that data source? Can we outsource this to a different process that feeds us interesting information? The single most important thing in performance engineering, I think, is figuring out what not to do. How do you make the thing that you're actually doing as minimal as possible? That is job one. Honestly, I think one of the reasons that I really like performance optimization as a topic to focus on, I don't like writing code very much.
Starting point is 00:41:03 I'm not very productive. It takes me a long time to do good work. So I want to do the stuff that requires me to write the fewest lines of code and have the biggest impact, right? This is like one of those hypotheticals. You make a two-line change and everything gets 10% faster. The hard part was the three weeks of investigation
Starting point is 00:41:16 proving that it was going to work, right? And I think this is actually a good example of how you really have to think about the whole board. You have to think about how you're structuring the code and how you're structuring the system. How many hops is this going to go through? How many systems is it going to go through? How can you get assistance from hardware? Any of these things. And, like, micro-optimizing the fact that Clang has a better loop optimizer than GCC or the OCaml compiler, like, it's really annoying to look at the bad loop. Is that really what's killing you? No,
Starting point is 00:41:43 what's killing you is that you're looping over something that you shouldn't be looking at at all. Right, so there's a bunch of stuff you just talked about that we'd love to talk more about, in fact, the hardware stuff in particular. But I don't want to lose track of the language issue. So I'll let what I said stand, that, like, people often over-focus on the details of the language. But the language does matter,
Starting point is 00:41:58 and I think it matters in particular for performance. And I'm kind of curious what your feeling is about how that affects how we approach the work, and your own kind of interaction and engagement with it? I break it down into three categories. The first category in which OCaml presents us a challenge is what I call the most annoying but the least important. And that's what I was saying earlier about, oh, our code generation isn't as good. We're branching too much.
Starting point is 00:42:19 We have too many silly register spills. It's annoying to look at. It's really not what's killing you. I do wish it were better. And really, the limit isn't OCaml. The limit is scale. The C compiler people have been working for 30 more years than the OCaml compiler people have.
Starting point is 00:42:35 And there are probably more people working on optimizing Clang right now across the world than we have employees at Jane Street. We're never going to catch up. That's OK. It's not really what's killing you. The second category is things that are maybe more of an actual problem, and hard to deal with, but not really the key issue. Our memory model requires us to do slightly more expensive things in some cases. Like, a good example is we're a garbage-collected language. Our garbage collector inspects values at runtime.
Starting point is 00:43:08 Therefore, uninitialized data can be really problematic. And so, you know, we have to do stupid things in my brain like, oh, it's really important to null out the pointers in this array and not just leave them behind or they'll leak. Or you can't just have an uninitialized array that I promise I'll get to soon because what happens if you GC in that range? And like, I do actually think this is meaningfully costly in some scenarios, but I'm willing to put up with it in most cases.
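A minimal sketch of the kind of slot-clearing being described here, as a toy OCaml example rather than code from any real system; the array-backed stack and its dummy placeholder are invented for illustration:

```ocaml
(* A stack backed by an array, with a [dummy] value standing in for "null".
   Clearing the popped slot matters because the GC scans the backing array
   and would otherwise keep the popped element reachable until some later
   push happens to overwrite that slot. *)
type 'a t = {
  mutable len : int;
  mutable items : 'a array;
  dummy : 'a;                          (* placeholder for empty slots *)
}

let create ~dummy = { len = 0; items = Array.make 16 dummy; dummy }

let push t x =
  if t.len = Array.length t.items then begin
    (* Grow the backing array, padding the new slots with [dummy]. *)
    let bigger = Array.make (2 * t.len) t.dummy in
    Array.blit t.items 0 bigger 0 t.len;
    t.items <- bigger
  end;
  t.items.(t.len) <- x;
  t.len <- t.len + 1

let pop t =
  let x = t.items.(t.len - 1) in
  t.items.(t.len - 1) <- t.dummy;      (* don't leave the old value behind *)
  t.len <- t.len - 1;
  x
```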
Starting point is 00:43:27 There are things you can do about it. The thing that I think is most problematic for our use of a language like OCaml gets back to mechanical sympathy. And, you know, I said
Starting point is 00:43:35 that the world is not made out of the parentheses that Lisp uses, and it's also not made out of algebraic data types. OCaml's fundamental representations of the world are very boxy.
Starting point is 00:43:44 There's a lot of pointers. There's a lot of, you know, this object contains a pointer to something else where in C++ it would just be splatted right there in the middle. And there are reasons we do this. There are reasons that make it easy to write good, clean, safe code, but it is fundamentally costly, and the language, if anything, lacks some mechanical sympathy. Right. Or at least it makes it hard to express your mechanical sympathy because getting control over the low-level details is challenging. And I don't want to go too much into that.
Starting point is 00:44:09 We're actually doing a lot of work to try and make OCaml better exactly at this. But a question I'm kind of more interested in talking about with you is, how do you see us working around these limitations in the language and the code base that we have? There's a couple options here. The first is you can kind of write it the hard way because, you know, OCaml's a real language. You can write whatever you want. It's just a question of
Starting point is 00:44:30 difficulty. You know, if nothing else, I could in theory allocate a, you know, 64-gigabyte int array at the beginning of startup and then just write C in OCaml that just manipulates that as memory, right? It would work. It would never GC. It would do all the things you wanted it to. It'd just be miserable. And clearly I'm not going to do that. But given that we're a company that has a lot of people who care about programming languages, one thing we're pretty good at is DSLs. And so, you know, we have some DSLs, for example, that let you describe a layout, and we're going to embed this layout into some, you know, low-level string that doesn't know a lot, but it's still, if you glance at it the right way, type safe. Now the DSL doesn't let you write out-of-bounds accesses or anything like this.
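As a rough illustration of the shape such generated code can take (this is not the actual DSL; the message name, fields, and offsets are all invented), think of typed accessors over a raw byte buffer:

```ocaml
(* Hypothetical accessors over a raw byte buffer, roughly the kind of
   interface a layout DSL might generate.  The [Add_order] message and
   its field offsets are made up for illustration. *)
module Add_order : sig
  type t
  val of_bytes : Bytes.t -> t          (* a typed view, not a copy *)
  val order_id : t -> int64
  val price    : t -> int32
  val size     : t -> int
end = struct
  type t = Bytes.t

  let of_bytes buf = buf

  (* Reads at fixed offsets into the wire format; each read is
     bounds-checked by the standard library. *)
  let order_id buf = Bytes.get_int64_le buf 0
  let price    buf = Bytes.get_int32_le buf 8
  let size     buf = Char.code (Bytes.get buf 12)
end
```

The caller sees ordinary typed functions, but no separate parsed message object is ever built; the buffer itself is the representation.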
Starting point is 00:45:09 Right. And the DSL, you sit down and write down what's the format of a packet that you might get from the NASDAQ exchange. And then it generates some actually relatively reasonable, easy to understand interfaces that are backed by the low-level, horrible manipulation of raw memory. And so you write a DSL, you generate some code, and what you surface to the user is a relatively usable thing, but you get the physical behavior that you want with all the flattening and inlining and tight representation of data. You're hitting on a really good point,
Starting point is 00:45:35 that a lot of these originated from our need to parse formats that were given to us, right? But it turns out you can also just use them for representing your data in memory. I can build a book, a representation of the state of the market, that's just laid out flatly and packed for me. It's much less pleasant to use than real OCaml. It's difficult. And we only do this in the places that it matters, but you can do it. There's what I like to call, like, a dialect of OCaml we speak in sometimes. And, you know, we generally say it's zero-alloc OCaml.
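One toy flavor of that style, invented for illustration rather than taken from a real system: preallocate a mutable scratch value once and overwrite it on every message, so the steady-state path allocates nothing.

```ocaml
(* Toy sketch of the zero-alloc style: a single preallocated, mutable
   scratch record that is overwritten on every update, so the hot path
   never allocates.  The [level] type and the names are made up. *)
type level = {
  mutable price : int;   (* a scaled integer price, not a float *)
  mutable size  : int;
}

let scratch = { price = 0; size = 0 }

(* Called on every market-data update; mutates [scratch] in place
   instead of building a fresh record each time. *)
let on_update ~price ~size ~publish =
  scratch.price <- price;
  scratch.size  <- size;
  publish scratch
```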
Starting point is 00:46:01 And you know, the most notable thing about it is it tries to avoid touching the garbage collector. But implied in that zero-alloc dialect are also a lot of representational things. We have little weird corners of the language that are slightly less pleasant to use, but will give you more control over layout and more control over, you know, not touching the GC and using malloc instead. And it works pretty well. It's harder, but you can do it. In the same way, another thing we think about a lot is interoperability. Again, sort of out of necessity.
Starting point is 00:46:28 There are libraries we have to interact with that only work in C, so we have these little C stubs that we can call into and it's really cheap. It's not like Java. It's not like one of those languages where there's this huge, costly process for going cross-language.
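On the OCaml side, a binding like that can be a one-line external declaration; the C symbol below is invented for illustration, but the external mechanism and the [@@noalloc] attribute are real.

```ocaml
(* Hypothetical binding to a tiny C stub.  "example_read_cycle_counter"
   is a made-up symbol; the point is that [external] plus [@@noalloc]
   (a promise that the stub never allocates on the OCaml heap or raises)
   keeps the call about as cheap as a plain C function call. *)
external read_cycle_counter : unit -> int = "example_read_cycle_counter"
  [@@noalloc]

let cycles_spent f =
  let start = read_cycle_counter () in
  f ();
  read_cycle_counter () - start
```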
Starting point is 00:46:38 You just make a function call and it just works, right? Yeah, like the overhead, I think, for a function call to C at least is, I don't know, three or four nanos. And I think in Java it's like 300 or 400 nanos because the JNI is a beast for reasons I've never understood. Option two is
Starting point is 00:46:51 cry a little bit and deal with it. Like, yeah, we face a fundamental disadvantage. We're working on reducing it. I'm super excited about getting more control over the layout of OCaml types. This is, like, the biggest change to me that maybe will ever happen in the compiler: being able to write down a representation of memory that is what I want it to be, in a real OCaml type that is fun to play with. But fundamentally,
Starting point is 00:47:12 we're kind of at a disadvantage and we just have to work harder and we have to think more about, okay, we're going to have a higher cache footprint. What does this mean about our architecture? How can we get cache from other places? How can we spread out the job across more steps, more processes, pre-process this one place? It gets back to, you don't want to focus on over-optimizing this one function. You want to make your overall architecture do the right things and just inform infrastructural changes. And I think you make an important point that it's not that any of the optimizations you want to do are impossible. It's that they're a little bit more awkward than you would like them to be, and you have to do a little extra work to get them to happen.
Starting point is 00:47:43 And that means, fundamentally, that we don't always do them. And so we really do pay a cost in performance, in that the harder you make it for people to do the right thing, the less it happens. One of the hardest things to learn when you're doing this sort of work is discipline. I have to go through the code base every day and say, no, I'm not fixing that.
Starting point is 00:48:02 Yes, it's, like, offensive to me on a personal level that it's slow and it allocates and does these things, but it just doesn't matter. It's legitimately hard for me not to stop whatever I'm doing and just, like, fix this optimization that I know is sitting there. If this doesn't bother you on a fundamental physical level, I just don't understand. But you have to prioritize. You have to prioritize. There are so many more important things to be doing. So another thing I'm wondering about is how you think about the role of hardware in all of this. In some sense, if you're thinking about making things as low latency as possible,
Starting point is 00:48:29 why do we even bother with a CPU, right? You look at the basic latency of consuming and emitting a packet. And on any ordinary computer, you're going to cross twice over the PCI Express bus. It's going to cost you about 400 nanos each way. You know, that and a little bit of slop between the pieces, it's kind of hard to get under a mic, a microsecond, really, for anything where you're like, I'm going to consume some data off the network,
Starting point is 00:48:50 do something and respond to it. And on an FPGA attached to a NIC, you can write a hardware design that can turn around a packet in under 100 nanoseconds. So there's like an order of magnitude improvement that's just like physically impossible to get to with a computer architecture. And so in some sense, if all you cared about is, well, I just want the absolute lowest latency
Starting point is 00:49:09 thing possible, it's like, why are we even using CPUs at all? So how do you think about the role of hardware as it integrates into how you think about performance in the context of building these kinds of systems? It informs the architecture you choose. Because yeah, nothing's ever going to be as fast as hardware, but it's really hard to write hardware. It can't do complicated things, and even the things it can do are just exponentially harder to write. I have never in my life written Verilog, which feels like a personal sin. I am reliably informed that it is miserable and unpleasant,
Starting point is 00:49:37 and your compiler takes 24 hours to run. So we have a lot of strategies with really complicated logic, and that logic is important and it's valuable. And implementing that in hardware is, I'm just going to say, flatly impossible. You couldn't do it. And so the question becomes, what can you outsource to hardware that is easy? How do you architect your system so that you can do the really, really hyper-focused speed things in a simple, simple hardware system that only does one thing. And you feed that hardware the right way.
Starting point is 00:50:07 But the rest of the software system still needs to be fast. It has to be fast on a different scale, but it turns out there are optimizations that matter on roughly every single timescale you can imagine. We have trades at this firm that, like you say, complete in less than 100 nanoseconds
Starting point is 00:50:21 or you might not even bother. We also have trades where we send someone an email and the next day you get back a fill, right? And every level in between there, it turns out, you can do useful optimization work. And even with stuff that has no humans in the loop, we really do think about nanoseconds, microseconds, milliseconds. Depending on what you're doing and how complicated it is, you really do care about many different orders of magnitude. Yeah, there's a system that I've worked on where our real goal is to get it down from having like 50 millisecond tails to one millisecond tails.
Starting point is 00:50:49 And we celebrate it when we get there and it still does a lot of great trading. We have other systems that are doing simpler, more speed competitive things where like your software needs to be 20 microseconds or 10 microseconds or five microseconds. That's achievable. It's harder and you have to do simpler things
Starting point is 00:51:04 just like with the hardware, but it's achievable. And you care about both of these latencies. And I think another good thing to point out is that you said systems that don't interact with humans, but it turns out some of the most important latencies are in the systems that do interact with humans. I don't know about you, Ron, but when my editor freezes up for five seconds while I'm typing, I just want to put a keyboard through the window. It just drives me nuts, right? And putting aside, like, the aggravation, human-responsive systems are just really important too. Both when, like, you're actively trading by hand and you want to have good latency on the thing that's displaying the prices in front of you, that matters a lot. But also I think it matters
Starting point is 00:51:40 a lot for just your ability to adapt and improve your systems over time. I said earlier, think about historical research. That's a throughput problem, but it's also a latency problem on a human scale. A thing that will give you feedback on whether your trading idea was good in a minute is worth so much more than one that gives you an idea if it's worth anything in a day. Yeah, that's absolutely right. I think for lots of different creative endeavors, and I think trading and software engineering both count from my perspective, and also all sorts of different kinds of research, the kind of speed of that interactive loop of, like, I have an idea, I try an idea, and I get feedback on how well that idea works out. The faster you can make that loop, the more that people can experiment with new things, and the more the creative juices can
Starting point is 00:52:22 get flowing and more ideas you can create and try out and evaluate. A thing I'm obsessed with telling people about is this Air Force colonel from the 50s or the 60s. His name was John Boyd. He invented this idea called the OODA loop, O-O-D-A. I believe it's Observe, Orient, Decide, Act. It's like the four stages you go through in figuring out like, oh, I see something. I think about it.
Starting point is 00:52:41 I decide how I'm going to adjust to that. I implement the adjustment. And the faster this loop happens, the more control you have over the system and the better you can iterate. I think a great example of this outside software, oddly enough, is whiskey. I like bourbon a lot, right? And it turns out that to make good bourbon takes five, seven, 10 years, right? And so you don't get a lot of opportunity to iterate. And there are some people who are doing a really controversial thing, which is they're using technology to rapidly age spirits. And some people call this sacrilege. And you know, it's never going to be quite as good as like doing it the hard way. But on the other
Starting point is 00:53:11 hand, it lets you taste it in a month and be like, I think I'm going to change this. And they're going to get 12 iterations in the time someone else might get one. And I just think this sort of process turns out to matter a lot in software too, of being able to rapidly iterate on what you're doing and get feedback either on the quality of the trading or for that matter, on the quality of the performance. One of the things I really care about a lot is building systems that let me really quickly evaluate the performance of a system. Because there's a huge difference between, I think this is going to be faster. I did a profile. I know this is a hotspot. I made it better. Okay, I'll run it in prod for a couple of days. And okay, I've made this change. I think it's going to be
Starting point is 00:53:43 better. I'm going to run it on this really realistic test bed. I know in 10 minutes if it's better and I can change it and I can try it again. Yeah. Iteration speed matters kind of almost everywhere you are. And I really like your point about this kind of performance analysis mattering for trading systems and also mattering for like systems with human interaction. And actually, I feel like the performance mindset isn't really so different. You look at someone who's really good at thinking hard about and optimizing the performance of stuff in the browser. There's a very different instruction set.
Starting point is 00:54:12 And oh my God, is that a weird and complicated virtual machine. But a lot of the same intuitions and habits of mind and that focus on being really interested in details that are really boring. All of that really matters a lot, right? You really have to care about like all the kind of gory details of the guts of these things to do a really good job of digging in.
Starting point is 00:54:30 Yeah, and this is why I kind of wonder if there's just a mindset that can't be trained because you have to just look at this and go, what the hell are you talking about, Ron? This isn't boring. I get why you say that, but I just look at this stuff and go like, you don't have to pay me to look at this. Sorry, I take that back.
Starting point is 00:54:44 You do have to pay me to look at this. I would not do this for free. I promise. Boring in quotes. I love this stuff too and totally understand why it isn't, but it is from the outside. Like, it's a little hard to explain to your friends and family why you like this stuff. You can't even explain in words out loud what the details are that are going on, because people will fall asleep. Do you know, my dad once sat me down in college and said, are you sure you want to do this CS thing and not go into something where you can find a job, like being a lawyer? I'm the only person who disappointed his parents by not becoming an English major. Well, maybe that's a good point to end it on.
Starting point is 00:55:13 Thanks so much for joining me. This has been great. Thanks for having me on. This is a really good talk. Thank you.
