Signals and Threads - Performance Engineering on Hard Mode with Andrew Hunter
Episode Date: November 28, 2023

Andrew Hunter makes code really, really fast. Before joining Jane Street, he worked for seven years at Google on multithreaded architecture, and was a tech lead for tcmalloc, Google's world-class scalable malloc implementation. In this episode, Andrew and Ron discuss how, paradoxically, it can be easier to optimize systems at hyperscale because of the impact that even minuscule changes can have. Finding performance wins in trading systems—which operate at a smaller scale, but which have bursty, low-latency workloads—is often trickier. Andrew explains how he approaches the problem, including his favorite profiling techniques and tools for visualizing traces; the unique challenges of optimizing OCaml versus C++; and when you should and shouldn't care about nanoseconds. They also touch on the joys of musical theater, and how to pass an interview when you're sleep-deprived.

You can find the transcript for this episode on our website. Some links to topics that came up in the discussion: "Profiling a warehouse-scale computer", Magic-trace, and the OODA loop.
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street.
I'm Ron Minsky.
It's my pleasure to introduce Andrew Hunter.
Andrew is a software engineer who's worked here for the last five years, and he's currently working on our market data team.
And Andrew is by inclination and background an expert performance engineer, and performance is really what we're going to talk about today.
Andrew, just to start off, can you tell me a little bit about how you got into performance engineering and a little bit more about what your path was to getting into software engineering in general?
I can, but I'm just going to be lying to you because the problem is I can give you all sorts of reasons why I do this, but the real reason is just like I find it really addictive.
It's just hard for me not to get excited about how systems work. There's all
sorts of reasons why it's cool or complicated or fun, but I just get this like electric high
when I make something faster. I guess it's not a path, right? Right. So I kind of know why you
like it, but I'm curious how you got here. Like one thing that I think of as characteristic of
performance engineering and actually lots of different software engineering disciplines is
part of how you get really good at them
is getting really interested in stuff that's objectively kind of boring.
You get super psyched about the details of how CPUs work
and interconnects and compilers and just all sorts of these little pieces.
There's just a lot of knowledge you need to build up over time
to be really good at it.
I'm always interested in what were people's paths
where they built up that kind of knowledge and understanding and background
that lets them really dive in deeply in the way that's necessary?
Well, I think that's exactly right.
I just have to care deeply about all the parts of the board.
And the way I got there, there was a couple of interesting times when, like in college, for example, I was taking an operating systems class.
And I realized that the best way to study this and to learn it well was to just go into the details of the code. And it's like, whenever we're talking about a topic about virtual memory or whatever,
I would go look at Linux's virtual memory implementation.
And I'd see what it looked like.
And I'd have more questions.
And I just kept asking these questions.
And I never said, well, that's out of scope.
And I just kept finding these things interesting, right?
And from then, I just realized that, like you say, all of these little details matter.
And if you keep caring about them and you just don't accept no for an answer,
you get pushed towards the places where people really do care about this,
which often means performance.
And then once you start doing performance,
you get that little high that I've talked about.
So in what context did you first experience working in a kind of serious way
on performance-sensitive systems?
Grad school, at the very least, where one of the projects I worked on
was this big graph traversal system
that was trying to like replicate
some really weird, complicated hardware
and do it in software,
but like maintain reasonable levels of performance.
And we just had to think really carefully about like,
okay, how wide is this memory controller?
How many accesses can it do in parallel?
What happens when one of them stalls?
Wait, what does this even mean
to have parallel memory accesses?
How many cycles do each of these things take?
Because we were roughly trying to replicate
this really complicated chip in software,
which meant you had to know exactly
how would the original hardware have worked
and how did all the parts of it
that you can replicate in software work?
And you end up looking up all these bizarre details
and you learn so much about it.
Is that work that you went in thinking
you would spend your time thinking about
fancy mathy graph algorithms
and ended up spending most of your time thinking about gritty operating system and hardware details?
A little bit.
I definitely thought there was going to be a little bit more algorithmic content,
but I really rapidly realized that the hard and interesting part here was in fact just,
oh God, how do you keep this much stuff in flight?
And the hardware has actually gotten way better or more aggressive about this sort of thing over time,
so I'm glad I learned that.
So how did that work in grad school end up leading to you
working professionally in this area?
Well, I was interning at Google at the time for the summers,
and I kind of realized that I could do the same large-scale systems research
that I was doing in grad school in a place that just had a lot more scale
and a lot more usage of it, right?
A lot of grad school research is done for the point of doing the research,
whereas the proper industrial research, the coolest part is that it just shows up in production and suddenly
people care about it. Yeah, and this is like a challenge for lots of different kinds of academic
work where there's some part of it that really only connects and makes sense at certain scales.
And a lot of those scales are just only available inside of these very large organizations that are
building enormous systems. Well, that's true. But I think even more than just the scale issue, it's the issue of what happens when this actually meets real data.
And I think this isn't just true about performance.
One thing I will tell you, for example:
the most common question I get
towards the end of an internship is like,
what's going to be different
when I come back as a full-timer?
And what I tell them is that I was shocked
the first time that I had a professional project
in my first real job.
And I finished it and we submitted it to the code base
and it rolled into production and everyone was using it. And then a couple of weeks later,
I got an IM from somebody in some other group saying, hey, we're using your system and it's
behaving weirdly in this way. What do you think about it? And my mental reaction to this was like,
what do you mean? I turned it in. I got an A. You have to actually keep this going until you can
hand it off to someone else or quit, right? Which is like, sounds depressing, but at the same time means you actually see what it does in reality
and under fire. And then you learn how to make it even better and you get to do something even more complicated
and just this virtuous cycle of optimizations that you get, or features
depending on what you're working on, right?
Yeah, this always strikes me when I go off and give lectures at various universities.
It's really hard to teach software engineering in an academic context because there is a weird thing that all of the work writing software in a university
context is this weird kind of performance art where you're like, you create this piece of software, and then it gets graded, and poof, it vanishes like a puff of smoke. And this is just not what the real world is like.
All right, so let's get back to performance engineering. One thing I'm curious about is, you have a bunch of experience at Google
thinking about the kind of performance engineering problems you ran into there
and also thinking about it here in a number of different spots.
I'm curious how those two different problems feel different to you.
The difference between performance engineering at Google
and performance engineering at Jane Street, to me,
is fundamentally one of leverage.
The easy thing about performance engineering
at a place like Google or any of the other hyperscalers
is that they operate with so many machines
consuming so many cycles, doing so many things
that any optimization that moves the needle
on how much CPU you consume is really valuable,
which means that it's very easy to find good targets. It's
very easy to find things that are worth doing. And it may be very difficult to fix those problems.
You have to think really carefully and you have to understand these systems. But the return on
that investment means you can really support a lot of people who just sit there doing nothing else
other than, how do I make memory allocation faster? How do I make serialization faster?
What can I do in my compiler to just optimize code generation and compression and all these things?
There's actually a really interesting paper. It's called Profiling a Warehouse Scale Computer,
which looked at, okay, if you just look at all the things that a data center does for one of
these hyperscalers, the business logic is really, really, really diverse. Some things are serving
videos of cats and some things are doing searches or social networking or whatever.
And all of this does different stuff, but it all uses the same infrastructure. And it turns out
that the infrastructure is a huge percentage. They coined the term that I like a lot, the data
center tax and the 10, 15, 20% of your cycles that you spend on low-level infrastructure that
everything uses.
And it's not even that that infrastructure is bad or slow.
It's just that that's the common link that scales, whereas fixing business logic kind of doesn't. It takes individual engineer effort on each individual piece of business logic,
but everyone's using the same compiler or one of a small set of compilers.
Everyone's using just a small set of operating systems.
And so you grab those lower levels of the infrastructure and you optimize them,
and you can just improve everyone.
Yeah, that's exactly right. And you can improve it enough by just moving things by half a percent,
making logging cheaper. It just pays for itself way, way more easily than it does if you're not
operating at that level of scale, which means that you get this nice cycle where you hit one hotspot and you point another profiler at the system as a whole,
and you see the next hotspot, and you just get better and better at this just by doing the really
obvious thing that sits in front of your face. So I like to think of this as easy mode, not because
the optimizations are easy or because the work is easy or because it doesn't take skill, but just
because it's really clear what to do.
It is a target-rich environment.
It's a really target-rich environment. There's money falling from the sky if you make something faster.
Right. And in some sense, this has to do with cost structure. This works best in an organization
where a large amount of the costs are going to the actual physical hardware that you're deploying.
Right. When we say that the business logic, quote unquote, doesn't matter,
what it really means is we just don't really care what you're working on.
You can be serving videos of cats.
You can be doing mail.
You can do whatever you want.
We don't really have to care.
As long as you make something
that everyone uses a little bit faster,
it'll pay for itself.
Because the only thing you care about
is the number of CPUs you buy,
the amount of power you buy,
and the amount of user queries you service.
Those are the only three things that matter.
It's not that the business logic doesn't matter.
And in fact, optimizing the business logic
might be the single most impactful thing you can do
to improve a given application.
But it's harder.
But the business logic doesn't matter to you
because you are working in the bowels of the ship,
fixing the things that affect everyone,
and it's kind of impossible for you to, at scale,
improve the business logic.
So you are focused on improving the things you can improve.
That's exactly right.
Great. So how is it different here?
We don't have that scale.
The amount of total compute that we spend is a fair bit of money,
but it's not enough that making things 1% faster matters,
which means that the average CPU cycle we spend
is just not very important. It's kind of worthless.
If you make logging faster, everyone's going to shrug and say,
okay, but that doesn't change the dial.
In fact, a surprising thing is that most of our systems spend most of their CPU time intentionally doing nothing. It is just table stakes in a trading
environment that you do user space polling IO. You just sit there spinning in a hard loop on the
CPU, waiting for a packet to arrive. And most of the time there's nothing there. So like if you
point a profiler at a trading system, it's going to tell you it's spending 95%, 99% of its time doing nothing.
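To make the busy-polling picture concrete, here is a minimal OCaml sketch of that kind of event loop; poll_rx and handle_packet are hypothetical stand-ins for a real kernel-bypass receive call and the trading logic behind it, not anything from an actual system.

```ocaml
(* Minimal sketch of a user-space busy-poll loop. [poll_rx] and
   [handle_packet] are hypothetical stand-ins for a real NIC polling API
   and the business logic behind it. *)
let rec event_loop poll_rx handle_packet =
  (match poll_rx () with
   | Some packet -> handle_packet packet (* the rare, interesting case *)
   | None -> ()                          (* the overwhelmingly common case *));
  event_loop poll_rx handle_packet
```

Point a sampling profiler at a loop like this and nearly every sample lands in the None branch, which is exactly the 95 to 99 percent of "doing nothing" described here.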
And actually at this point,
I want to push back a little bit on this narrative
because when you say most of the systems are doing nothing,
it's not actually most of our systems.
We actually have a really diverse set of different programs
doing all sorts of different pieces.
But a lot of the work that you're doing
thinking about performance optimization
is focused on specifically trading systems.
And trading systems have exactly the character of what you're describing, which is they're sitting there consuming market data.
The whole idea, which you hear about a lot in large-scale traditional web and tech companies, of trying to build systems that have high utilization is totally crazy from our perspective.
We're trying to get systems that have low latencies.
They need to be able to quickly respond to things and also to be able to perform well when there are bursts
of activity. So it means that most of the time when things are quiet, they need to be mostly idle.
Yeah, it's definitely true that we have easy mode targets that we care a lot about at Jane Street.
A really good example is historical research. If you are trying to run simulations of what some
strategy might have done over the last 10 years, it turns out that's just a question of throughput. You can pile as much input data as
you can on the hopper, and you see how many trades fall out the other side in the next 10 seconds.
And the faster that gets, the happier we are. You can just do the same easy mode tactics that
you would on hyperscalers. But even there, the size of changes we chase is considerably
larger. Yeah, you don't care about 1% because it's not the CPU you care about here,
it's the user latency in some sense.
It's whether or not the user gets a result in an hour or a day or a week.
Just to speak on the compiler's team,
we totally care about a 1% improvement in code generation,
but you don't care about it on its own.
You care about it in combination with a bunch of other changes
because you want to build up bigger changes out of smaller ones.
And if you look at a much larger organization,
people are looking for compiler-level improvements
that are in order of magnitude smaller than that.
I sometimes have to push people about this, in fact.
I sometimes have to say, oh, no, no, no.
It's not that this optimization matters on its own.
It's not that this thing that I did that removes a cache line
from each data structure is going to make our trading exponentially faster.
It's that I'm in a long process of pulling out the slack from a system. And every time I do this, it gets a
little bit better and everything is going to slowly converge to a good state. But it's hard
to get the statistical power to say like any of these small changes matter. Sometimes I get
pushback from people saying, well, did you measure this? I'm like, no, I didn't bother. I know it's
below the noise floor, but I also know it's right. That sort of small change incrementally applied
over time is really good.
But the hard part about it is you just have to have faith that you're going to get there
and you have to have faith that you know that this is making an improvement
or find ways you can test it in isolation.
Whereas if you operate at the huge scale that some other people do,
you can just look at a change.
It could be five bips and you can know like, oh no, that's really real.
And it's actually worth a lot of money in its own.
I think, yeah, this is a problem that hits people who are thinking about performance
almost everywhere.
It's kind of funny to me in that a common line of pushback I get from people who are
not performance-focused people is like, well, I remember in undergrad when my professor said,
well, you should never make a performance change without profiling it and knowing that it matters.
And I'm like, no, no, I actually think that's wrong.
If you know you are working on systems where this is important,
you need to have a certain amount of self-discipline.
And, you know, not where it's too costly,
or where it's going to make the system more dangerous or riskier
or make your life worse,
but make efficient choices as a matter of defaults in your brain.
Right, and this is one of the reasons why I think performance engineering
depends a lot on intuition and background and experience.
And mechanical sympathy, knowing that you know deep down
what the CPU is actually doing when you compile the code that you've got. So, Lelaki, stop for a second on that word And mechanical sympathy. Knowing that you know deep down what the CPU is actually doing
when you compile the code that you've got.
So, Lelaki, stop for a second on that word,
mechanical sympathy,
which is a phrase I really like.
Tell me what that phrase means to you.
What that phrase means to me,
I think a race car driver invented it, actually,
is just having an innate knowledge or...
Maybe not innate.
You probably weren't born with it.
I don't know.
Some people seem to come out of infancy
just knowing these things. Did you not read books
about CPU architecture to your children? I did not. What are you even doing?
Lambda calculus. Oh, that tracks. I think that it's not innate, but this really unconscious
knowledge of just, you know, when you look at code, that this is how it's structured on real systems,
because different languages have very different models of how reality works,
but reality only has one model. As much as I love Lisp, the world and computers are not made of
cons cells. They're made of big arrays, one big array, and some integer arithmetic that loads
things from arrays. That's all a computer actually does, right? And you have to understand what is
the model by which we get from that low-level thing to my high-level types with structure in
them. And you have to understand what layouts mean
and how this branching structure gets compiled
into something that a CPU actually knows how to operate.
And you can't just construct this from scratch every time you do it.
You have to develop an intuition towards looking at something
and knowing what that's going to be.
So I asked you what the difference was between the easy mode performance optimization
that you experienced at Google and this kind of harder-to-figure-out version of the story where you don't have the
same kind of scale. I'd love to hear a little bit more about what is the texture of these problems?
What kind of problems do you run into? What is interesting and hard about the version of the
performance optimization problem you see here? Hard mode performance optimization, typically but
not always, is a question of latency. And latency is a question of what is something doing at a really important time? Not what something is
doing in general, not what it does usually, but what it does when you care about it. Which means
it's fundamentally a measurement problem. Because to measure general performance, what your system
is doing, you point a profiler at it. You get an idea of it spending 20% of its time here and 10%
of its time here and 5% of its time here. I don't care about any of those percents. I care about what
was it doing for the nanosecond, the millisecond, the microsecond sometimes, or some of our worst
systems, the second, that something interesting was happening. I care about what happens when it
was sending an order or analyzing market data. I care only about that and I don't care about
anything else. So how do I even know what it's doing at that point in time?
How do I measure this is really the key question.
Got it.
And maybe somewhat inherently that puts you in the opposite situation that you're in when
you're looking at a very big organization where you're thinking about the low levels
of the infrastructure and how to make them as fast as possible.
Because if you want to know what's happening at a given point in time, you're somewhat
unavoidably tied up in the business logic. You care about what is the thing that happens when the important decision
is made, and what are the latencies that occur, and what are the code paths that drive that
latency. Is that a fair description? Yeah. It's not universally true. There's some really
interesting cases where the infrastructure rears its ugly head in the middle of stuff you want to
be doing otherwise, right? But it is generally, a large part of it is, in fact, just the business
logic of how is this trading system making a decision? And you have to look at that, and that's what's
happening at the interesting point of time, sort of by definition. So you talked about sampling
profilers as one common tool. Can you actually just go in a little more detail of what is a
sampling profiler, and how does it actually work at a low level? So there's a lot of different
implementations of this, but the general shape of it is you take a system and you point a tool at it, and every so often that tool stops the world and asks, what is the program doing right here? And then it writes this down and it lets the program keep going. And profilers only really differ on how do they stop the world and how do they write this down. My favorite is the Linux kernel profiler. It's called perf and it just uses a bunch of hardware features to get an interrupt
at exactly the right moment in time. And then it just very quickly writes down the stack trace in
this compressed format. It's very optimized. And then you take all these stack traces. A profile
is really just a list of stack traces and sometimes a little bit of augmented information,
but that's fundamentally the core idea.
And then you present it to the user in some way that adds them up.
And like I say, the key thing is it tells you,
okay, 30% of the stack traces ended in the function foo.
That's a hotspot. You're spending 30% of your time there.
But there's all these different kernel counters that you can use
for driving when you're doing the sampling.
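As a rough illustration of the sampling mechanism just described, here is a deliberately naive, in-process sketch in OCaml; real profilers like perf do this in the kernel, driven by hardware counters, but the shape of the data is the same.

```ocaml
(* Naive sketch of sampling: a timer signal interrupts the program
   periodically and we record the current call stack. A profile is, at its
   core, just a big pile of stack traces. Requires the unix library. *)
let samples : Printexc.raw_backtrace list ref = ref []

let start_sampling () =
  Sys.set_signal Sys.sigprof
    (Sys.Signal_handle (fun _ ->
      samples := Printexc.get_callstack 64 :: !samples));
  (* Deliver SIGPROF roughly every millisecond of CPU time consumed. *)
  ignore
    (Unix.setitimer Unix.ITIMER_PROF
       { Unix.it_interval = 0.001; it_value = 0.001 })
```

Aggregating those stacks afterwards, by whatever counter drove the sampling, is what turns them into the "30% of your time is in foo" style of summary.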
How does the choice of kernel counter affect the nature of the information you're getting out of the profiler?
Yeah. People tend to think about sampling profilers in terms of time, where the counter is just the number of cycles that's elapsed. But one of the cool things about it is it lets you sample on L2 cache
misses or branch prediction misses or any of these weird architectural events. And so you get a
profile of when did these interesting things happen? And you know, each of them is costly and they probably
have some cost in cycles, but you can get much more precise measurements. And in particular,
the nice thing about it is, you know, that 10%, let's say of your program is slowed down by branch
prediction misses. But if you just look at the cycles, you're just going to see like, well,
it's somewhere in this function. If you profile on branch misses, you will see the branch that
is hard to predict. And you can actually do something about that branch.
Got it.
So branch mispredictions is one.
What's like the next most interesting counter that you might use?
Actually, the next most interesting thing isn't even branch prediction.
It isn't even a hardware counter.
The next most interesting thing to profile on is the unit of memory allocation.
A lot of allocators, like in fact, the one we have in OCaml, but also various C++ ones, will let you get a profile, not out of perf, but kind of done in software, that tells you where
were you allocating memory the most. Because that's just a common thing that's very costly.
And reducing that, it can really improve the performance of a system.
Right. And this comes down to something that we see in OCaml a lot, which is when we write really,
really high performance systems, we often try to bring the amount of heap allocation we do all the
way down to zero. We try. Right, it's hard to get it all the way down to zero. When something's misbehaving in a system performance-wise, a relatively common problem is there's a thing that shouldn't be
allocating that is in the hot path. Yeah, that's right. And in a good optimized C++ system,
it should be spending 5%, 10% of its time memory allocating. And sometimes you just have to do
this. It's necessary. But maybe you're allocating twice as much
as you really need to be.
And you can look at a memory profile
and take a look at it.
It's important to remember that profiles
aren't just about time.
They're about measuring the overall consumption
or use of resources.
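As a concrete, made-up illustration of the kind of change an allocation profile tends to point you at in OCaml: the names below are invented, and because a record whose fields are all floats is stored flat, the in-place version does no heap allocation at all.

```ocaml
(* Illustrative only; not from any real trading system. *)
type quote_stats = { mutable mid : float; mutable spread : float }

(* Allocating version: builds a fresh record on every call, which shows up
   in an allocation profile if it sits on a hot path. *)
let stats_alloc ~bid ~ask = { mid = (bid +. ask) /. 2.; spread = ask -. bid }

(* Non-allocating version: reuses a record created once at startup. Because
   [quote_stats] is an all-float record, these field writes don't box. *)
let stats_in_place t ~bid ~ask =
  t.mid <- (bid +. ask) /. 2.;
  t.spread <- ask -. bid
```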
Okay, so that's how Perf essentially works
to get you the information that you need.
But there's a fundamental trade-off
in a tool like Perf
where it's sampling information.
And that itself is essentially a
performance optimization, right? Part of the reason that you sample rather than capturing all of the
information is you want to make it fast, right? You want to not distort the performance behavior
of the program by grabbing information out of it. But in some sense, there's a real tension there
because you're saying, I don't want like some broad statistical sampling of the overall behavior
of my program. I want to know in detail how it's behaving in the times that matter most. I think a really instructive
example of that was an optimization that I hit on last year where we had a system that was trying to
send some orders, right? And it was doing it in a totally reasonable way. And it was doing it in a
really cheap way. The thinking about whether or not I want to send an order was really cheap.
It happened really fast. The physical sending of the order, really cheap, happened
really fast. The thinking, what do you mean by the thinking? Do you mean like the business logic
to decide, like looking at the markets and saying, oh yeah, I want to buy, right? Or, you know, like,
and then do the physical act of, you know, sending the message, right? Both of these were really
cheap. If you pointed a profiler at them, even a profiler that was like magically restricted to
the times of interest would tell you, yep, 5% of your time was doing this. It's all good. That's not a hotspot.
Here's the problem. The order sending was happening 200 microseconds after the thinking was.
And the reason was it was being put on a low priority queue of work to do later.
It was a misconfiguration of the system. It was using this kind of older API
that it needed for complicated reasons that did
reasonable things under most circumstances, but it assumed that network traffic you wanted to send
couldn't be that latency sensitive. So it just waited to do it on an idle moment. And this was
not a good thing to wait about. The profiler tells you nothing about this because I didn't care about
the overall cost. I didn't care about overall time. I cared that it happened promptly. And so
fixing this again was really easy. I just switched to an eager API that didn't wait, but a profiler tells you nothing about this. So what kind of tools do you use
to uncover things like that? Magic trace. So what's magic trace? Magic trace is a tool we
wrote that gives you a view into what a system was doing over a short window of the past.
It's retrospective in some sense. And what I mean is that at any point in time, you can just
yell, stop, tell me what you were
doing for the last two milliseconds, three milliseconds, maybe. And you write it down.
And exactly like you said earlier, this is not a profile. This is not some statistical average of
what you're doing at various times. This is the exact sequence of where your code went. It went
from this function to this function to this function. And you get a different visualization
of it that just shows you what things are happening over time. And in fact, exactly like you say,
there's more overhead for using this.
But it gives you this really direct view
into what happened at an interesting time
that a profiler fundamentally can't give.
Traces are in some senses really better.
In fact, traces aren't restricted to magic trace.
I said there's memory profiles that are really useful.
Memory allocation traces are another thing
that we care about a lot.
We have a really good one, in fact,
that gives you the literal trace of...
Although a memory profiler is actually statistical also, right?
That's a sampling profiler.
A memory tracing is, in fact, it's not a profiler, it's a tracer, right?
Maybe it's misnamed.
Actually, there's a lot of annoying sort of terminological fuzz around this.
I think people are often...
We are not unique in this, I'll say.
Right.
At least the terms I've come to like the most are people use profilers
for what you might call statistical profilers,
and then people use tracing when they're talking about capturing all of the data.
So a common form of tracing that shows up all over the place
is there's all these nice systems for doing RPC tracing,
where you write down the details of every message that goes by,
and this is sort of a common thing to do if you want to debug,
why was that web query slow?
And some query came in and it kicked off a bunch of RPCs,
it kicked off another bunch of RPCs, it kicked off another bunch of RPCs,
and you can pull up an API that lets you see
the whole cascade of messages that were sent.
So that's a nice example of tracing.
And then we also, as you mentioned,
we have a thing called memtrace,
which sadly I think is actually a profiler
in that it is a statistical sample
and does not actually capture everything.
But it does give you a time series of events,
which is a key thing that a profiler can't.
That's interesting.
I guess in some sense,
all of these systems start by giving you
a time series of events,
and then it's how you process them, right?
Perf is sampling at some rate and grabbing data,
and then you turn that into an overall statistical summary.
But you could look at that information temporally.
You just don't.
And in any case, the information is sampled.
I think what I just said is also true about memtrace.
You get this information, just like perf,
sampled randomly from the memory allocations.
Then you can interpret it statistically as being about the overall state of the heap.
A key difference here is that memtrace
gives you all of the information
about a statistical sample of some of the allocations.
It tells you when this guy was allocated and then freed.
That's true.
Whereas you might get a set of stacks out of perf that here are some allocations and here are some frees,
but you have no guarantee it's the same thing.
This lifecycle example, it's exactly like the RPCs.
A thing people frequently do is just capture traces for 1% of their RPCs
and getting the whole lifecycle of an individual one is way more interesting than 1%
of the individual moments. Yeah. I mean, maybe this just highlights,
it's a total terminological car crash. It's a little hard to separate it out.
All of these tools are way too hard to use and very inexact, and all of them use the same
terminology in different ways. Okay. So we've talked about the trace part of the name magic
trace, right? And the key thing there is it's not just sampling, it's giving you the complete summary of the behavior. It's maybe worth talking a little
bit about the magic, which is how do you do this thing? You just said, oh, something interesting
happened. And then you retrospectively look at what happened before and grab it. How can that
be done in an efficient way? What's mechanically happening here? Magic. No, there's two parts to
this, right? First is how do you get the data? And the second part is how do you decide when to take the sample?
Let's take those in order.
So how do you get the data?
Well, I'm really just going to call it magic because I don't know how they managed to do this efficiently.
But Intel has this technology.
It's called processor trace.
It just keeps a ring buffer of everything the CPU does in a really compressed format.
Like it uses one bit per branch or something along those lines.
And it just continually writes this down.
And the ring buffer is the size that it is.
And it contains some amount of history.
In practice, it's a couple milliseconds.
And at any point in time, you snap your fingers and say, give me that ring buffer.
Right.
And the critical thing is this is a feature integrated into the hardware.
Oh, yeah.
We couldn't possibly implement this.
The kernel couldn't possibly implement this.
Like this is in the silicon.
And it's a huge advantage for Intel processors. Yeah, although I really do not understand Intel's business strategy around this,
which is to say they built this amazing facility in their CPUs, but they never released an open
source toolkit that makes it really easy to use. They did a lot of great work at the perf level,
so perf has integrated support for processor trace. In fact, we relied heavily on that.
But I think Magic Trace is the first thing that's actually a nice usable toolkit built around this.
Like there's various companies that have built internal versions of this.
But it seems like such a great competitive advantage.
I'm surprised that Intel hasn't invested more in building nice, easy to use tools for it.
Because it's a real differentiator compared to, say, AMD chips.
There are a lot of performance analysis tools in the world, and there's very limited hours of the day.
I really don't feel like I know what everything in the world does.
But I generally agree with you that a really underinvested thing
is good, easy-to-use, obvious, idiot-proof APIs, right?
And tools that just work in the obvious way you want them to.
I was mentioning before how part of being a good performance engineer
is building up a lot of experience and knowledge and intuition.
Another part of it is just building encyclopedic knowledge of all the bizarre ins and outs of the
somewhat awkward tooling for doing performance analysis. Perf is a great tool. In many ways,
it's beautifully well-engineered, but the user experience is designed for experts.
And the command lines kind of work the way they do. And sometimes their meanings have evolved
over time. And the flag still says the old thing, but you have to know that it has the new meaning.
And there's really, I think, a lot of space for building tools that you just like turn on and hit the button.
And it just does the obvious thing for you and gives you the result.
It's a user interface built by experts for experts.
And I think it's easy for them to forget that most of the people have not used it.
Just like you say, a lot of random esoteric knowledge about what CPUs do.
And I also just have a lot of memorized little command lines that I happen to know will point
at certain problems.
It's just an issue of having done this a lot.
And I don't know a good way of teaching this other than getting people to do the reps.
And a better way would be to give them tools that just give them obvious defaults.
But I haven't figured out how to do this universally.
Right.
But Magic Trace is one good example of a tool like that,
where the defaults have been pretty carefully worked out,
the UX is really nice, and you can just use it.
It's not perfect, it doesn't work in all the contexts,
but it usually gives you something usable
without a lot of thinking.
It is exactly like all the best tools I know
in that I'm frequently furious at it
for not doing quite the right thing,
and that's a sign of how much I want to use it.
I would just, oh, can it also do this?
Can it also do this?
Can it also do this?
But yeah, I mostly just use it in the obvious way
with the one obvious flag.
It gives me what I want to know.
One of the interesting things about a tool like Magic Trace
is you've told a whole narrative.
When doing easy mode optimization,
you care about these broad-based things.
Sampling profilers are mostly the right tools.
When you care about focused latency-oriented performance,
then you want this kind of narrowed-in-time analysis tools, and Magic Trace is the right thing for that.
But a thing that's actually struck me by seeing people using Magic Trace inside of the organization
is it's often a better tool for doing easy mode-style optimization than perf is. Because
just the fact that you get all of the data, every single event, in order, and when it happened, with precision of just like a handful of nanoseconds, makes the results just a lot easier to interpret. I feel like when you get results from perf, there's a certain amount of thinking and interpretation: how do I infer from this kind of statistical sample what's going on, what was my program actually probably doing, and where really is the hotspot? But with magic trace, you can often just see in bright colors
exactly what's happening. I've seen people just take a magic trace at a totally random point in
the day and use that and be able to learn more from that than they're able to learn
from looking at perf data. Yeah, this is not universally true, but it's very frequent that
you can get a bunch of extra information.
I think one of the really good examples is that a profiler tells you you're spending 40% of your time in the hotspot of the send order function, right?
But here's an interesting question.
Is that one call to send order that's taking forever, or is that 1,000 calls, each of which is cheap, but why are you doing this in a tight loop?
And it turns out, you know, it's really easy to make the mistake where you're calling some function in a tight loop you didn't intend to. In a profile, these two things look pretty identical.
There are really esoteric tricks you can use to tease this out. But in Magic Trace, you just see,
oh God, that's really obvious. There's a tight loop where it has a thousand function calls right in front of one another. That's embarrassing. We should fix it, right? You actually develop
this weird intuition for looking at the shape of the trace that you get, the physical shape on your screen and the visualization. And like, oh, that weird tower,
I'm clearly calling a recursive function a thousand deep. That doesn't seem right.
You get to see a lot of these things. Yeah. It makes me wonder whether or not
there's more space in the set of tools that we use for turning the dial back and forth between
stuff that's more about getting broad statistical samples and stuff that gives you this more detailed analysis and just more in the way of visualizations,
more ways of taking the data and throwing it up in a graph or a picture that gives you
more intuition about what's going on.
A huge fraction of what I do in my day-to-day is visualization work.
Rather than looking at them, we're trying to build better ones.
It's really important, and we have barely scratched the surface of how to visualize
even the simplest profiles.
Yeah, one thing I've heard you rant a lot over time is that probably the most common visualization that people use for analyzing performance is flame graphs.
And flame graphs are great.
They're easy to understand.
They're pretty easy to use.
But they also drop some important information, and you're a big advocate of Pprof, which is a tool that has a totally different way of visualizing performance
data. Can you give a quick testimonial for why you think people should use Pprof more?
Yeah. Flame graphs were way better than anything that came before it, by the way. They were a
revelation when they were invented, I think. And so this is one of those things you have to be
really careful not to say something is bad. I just think something better has been invented, right?
So it's called a flame graph because it's a linear line that has a bunch of things going up out of it
that look like flames. And what this is, is that- And people often use orange and red as the
color. So it really looks like flames. Exactly, right? It wouldn't look as cool if it was green,
right? And so, you know, the first level of this is, you know, broken down 40%, 60%, and that 40%
of your stack traces start this way and 60% start this way. And then the next level is just each of
those gets refined up. And so every stack trace corresponds to an individual peak on this mountain range. And then the width of that peak is how many stack traces looked like
that. So this is good. It tells you where is your time going. And one nice property is it at a glance,
it makes it really easy to intuitively see the percentages, right? Because the width of the lines
as compared to the width of the overall plot gives you the percentage that that part of the flame
graph is responsible. Yeah. If there's one really fat mountain that's 60% wide, you know what you're doing. It's that.
Here's the problem with it. It misses a thing I like to call join points,
which are points where stack traces start differently and then reach the same important
thing. Because what happens is, suppose you've got 15 or 16 little peaks, none of which is that big,
and then right at the tippy top of each of them, in tiny, tiny, narrow things, it's all calling the same function. It'd be really easy to dismiss that.
You don't even notice the thing at the top, but it turns out if you add them all together,
that's 40% of your time. And different visualizations can really show you everything
comes together here. So how does Pprof try and deal with this?
So we're now going to try to proceed to describe a representation of a directed acyclic graph over a podcast, which of all the dumb ways people have visualized directed acyclic graphs might be the worst.
But what it does is it draws a little DAG on your screen where each node is a function you end up calling and each arrow is a path.
You can imagine that you have one node for every function you're ever in.
And then for each stack trace, you just draw a line through each node in order and they go towards the things that
were called. And then you highlight each node with the total percentage of time you spent there. And
you put some colors on it, like you say, and you make the arrows thicker or thinner for bigger or
smaller weights. And that's the basic idea. And so if you close your eyes and imagine with me for a second,
I claim that what will happen in the scenario I described
is that you'll see a bunch of bushy, random, small paths
at the top of your screen,
and then a bunch of arrows all converge on one function.
And that function now is really obviously the problem.
And then underneath, it'll also branch out to lots of different things.
Yeah, that's actually a really good point, because it tells you, in fact, that maybe it's not that function that is the time, it's the things it calls.
But now you at least know where all this is coming from.
And as a single example of this, it is the most common thing,
at least if you're working in, say, C++, the function's always malloc.
Oh, interesting.
Because like I said, with, you know, the business logic may be very diverse,
but everything allocates memory.
Because it's really easy to realize, oh, it's doing a little bit of malloc here. It's doing a little bit of malloc here. It's doing a little bit of malloc here. And I guess this ties a little bit into what you were talking about of
the style of trying to look at the foundations and get the common things that everyone sees.
It becomes really important to see those common things. Although I would have thought one of the
ways you could deal with this just with flame graphs is there's just two different orientations.
You can take your flame graph and turn it upside down. And so you can decide which do you want to
prioritize thinking about the top of the stack or the bottom of the stack. It's the bush below
malloc that screws you over there. It's when the thing's in the middle that you get problematic.
It's when the really key function is in the middle where everything converges through one particular,
it's not a bottleneck, but you can think about it in a bottleneck in the graph,
that the flame graphs really fail.
And to their credit, people have built flame graph libraries
where you can say, give me a flame graph that goes up and down
from malloc, but you need to know to focus
on that function.
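To make the join-point idea concrete, here is a toy OCaml sketch, not taken from any real tool, that aggregates sampled stacks by how many of them pass through each function anywhere, which is essentially the per-node number a Pprof-style graph draws.

```ocaml
(* Toy "join point" detector: stacks are just lists of frame names, and we
   count, for each function, how many stacks contain it at all, not only how
   many end in it. Many small distinct stacks that all route through one
   function then show up as one big number. *)
let inclusive_counts (stacks : string list list) : (string, int) Hashtbl.t =
  let counts = Hashtbl.create 64 in
  List.iter
    (fun stack ->
      (* Dedupe frames so a recursive function isn't counted twice per stack. *)
      List.iter
        (fun frame ->
          let n = try Hashtbl.find counts frame with Not_found -> 0 in
          Hashtbl.replace counts frame (n + 1))
        (List.sort_uniq compare stack))
    stacks;
  counts
```

If sixteen different stacks each end in a narrow sliver but all of them contain malloc, malloc's inclusive count is the sum of all of them, which is the 40 percent that a flame graph makes easy to miss.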
I see. And so Pprof somehow has some way of essentially
figuring out what are the natural join points
and presenting them.
I think that it outsources that problem to the graph drawing library
that tries to do various heuristics for how people
view graphs.
It tends to work.
I've looked at flame graphs and I've looked at Pprof visualizations. I do think Pprof visualizations are a little bit harder to intuitively grok what's going on.
So I feel like there's probably some space there yet
to improve the visualization to make it a little more intuitively clear.
I definitely agree.
I think that just like we were saying earlier,
this is one of those things that experts know, new people don't.
You just kind of have to get used to staring at these for a couple days and then you get used to it.
But it would be nice if we didn't have to.
It would be nice if it was just as obvious as the other things are.
So we've talked a bunch here about performance engineering tools and measurement tools more than anything else, which I think makes sense.
I think the core of performance engineering is really measurement. And we've talked about ways in which these focused tracing tools like MagicTrace,
in some ways, can kind of outperform sampling tools for a lot of the problems that we run into.
What are the kinds of cases where you think sampling tools are better?
To me, the case I wish I could use sampling the most is once I've identified a range of interest.
A thing we care about a lot when we think about latency and optimization of trading systems is tails. If your system,
99% of the time responds in
5 microseconds and then 1% of the
time responds in a millisecond,
that's not great because, you know,
you don't always get a choice of which of those is the one
you trade on, right? Right, and also there's
the usual correlation of the busiest times
are often the times that are best to be fast at.
That's right. And so exactly
where it's bad is the case where you care the most.
And Magic Trace is pretty good at finding tails
because you can just do some interesting tricks and hacks
to get Magic Trace to sample at a time where you're in a 99 percentile tail.
Now, sometimes you look at those tails in Magic Trace and you see,
oh, I stopped the world to do a major GC.
I should maybe avoid that.
I need to allocate less.
Or some other bizarre, weird event.
Sometimes it's a strange thing that is happening.
But a remarkably common pattern we see is that your tails aren't weird, aren't interesting.
They're just your medians repeated over and over.
And like you said, you're having a tail because the market is very busy, because it's very interesting, because you saw a rapid burst of messages.
And each of them takes you a microsecond to process, but that's not the budget you have.
You have 800 nanoseconds, and you're just falling behind.
So this is just a classic queuing theory result.
It's nothing but queuing theory.
Get lots of data in, you're going to have large tails
when it all piles up in time.
So you pull up this magic trace and you say,
oh, I processed 10,000 packets.
What do you do now?
And sometimes you can think of obvious solutions.
Sometimes you realize,
oh, I can know I'm in this case and wait and process all of these in a batch. That's a great
optimization that we do all the time. But sometimes you're just, oh, wow, I just really need to make
each of these packet processes 20% faster. And what I really wish you could do is take that magic
trace and select a range and say, hey, show that to me in Pprof. Because it's just like the flame graph.
You have all these little, little tiny chunks in the magic trace,
and I really want to see them aggregated.
And you have some trouble doing it.
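The batching idea mentioned above, sketched very roughly in OCaml; poll_rx, apply_update and decide are hypothetical names, not a real API.

```ocaml
(* Rough sketch of batching under a burst: drain everything already queued,
   apply each update cheaply, then run the expensive decision logic once on
   the combined state instead of once per packet. *)
let drain_and_decide ~poll_rx ~apply_update ~decide book =
  let rec drain () =
    match poll_rx () with
    | Some packet ->
      apply_update book packet;
      drain ()
    | None -> ()
  in
  drain ();
  decide book
```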
So I asked one question, and I think you answered a different one.
I do that a lot.
I asked the question of when do you want sampling.
You answered the question of when do you want a profile view.
And I totally get that, and that seems like a super natural thing
and maybe like a nice feature request for the magic trace folk.
Have a nice way of flipping it into the profile view.
But I really want to poke at the other questions.
When is sampling itself a better technique versus this kind of exhaustive tracing that magic trace is mostly built around?
One easy answer is the easy mode problems we have.
Things like historical research, things like training machine learning models, things that are really are throughput problems.
And we do have throughput problems.
And there, it's just easier to look at the sampling profile or you can really target it at the cache misses or whatever you want.
So that's a case where we definitely want it.
And maybe another thing to say about it is it is definitively cheaper.
The whole point of sampling is you're not grabbing the data all the time. I guess we didn't talk about this explicitly, but Intel processor trace, you turn it on and you end up eating five to 15% or something of the performance of the
program. Like there is a material cost. Don't say that out loud or they'll stop letting me use it on our
trading systems. I mean, it is the case that we don't just like have it turned on everywhere all
the time, right? We turn it on when we want it. And that's a thing that the hyperscalers do.
They just leave a sampling profiler on across the fleet, just getting, you know, 1% of things. And that gives you kind of a great sense of what
the overall world is doing. And that is actually a thing I kind of wish we had. It would be less
valuable for us than it would be for them. But I would love if I could just kind of look at it like
a global view of hotspots. I think the best thing in the world would be like, can I get a sampled
profile of all the things all of our trading systems did when they weren't
spinning idly? If I could know that, oh, overall, if I could get a little bit more return from
optimizing the order sending code versus the market data parsing code, I think that would
be a really valuable thing to me. So another interesting thing about the way in which we
approach and think about performance is our choice of programming language, right? We are not using
any of the languages that people typically use for doing this kind of stuff.
We're not programming in C or C++ or Rust.
We're writing all of our systems in OCaml.
And that changes the structure of the work in some ways.
And I'm kind of curious how that feels
from your perspective as someone who like
very much comes from a C++ background
and is dropped in weird functional programming land.
What have you learned about how we approach these problems?
What do you think about the trade-offs here?
Well, the best thing about it is employment guarantee.
Anyone can write fast C++, but it takes a real expert to write fast OCaml, right? You can't fire
me. Although I think that's actually totally not true. I didn't mean the part about firing you,
but the point about writing fast C++, I actually think there's a kind of naive idea of, oh,
anyone can write fast C++. And it's like, oh man, there's a lot of ways of writing really slow C++. And actually a lot of the things that you need to
get right when designing high-performance systems are picking the right architecture,
right way of distributing the job over multiple processes, figuring out how to structure the data,
structure the process of pulling things in and out. There are lots of stories of people saying,
oh, we'll make it faster in C or C++. And sometimes you implement it in some other
language and it can be made faster still because often the design and the
details of how you build it can dominate the language choice. I think it's really easy for
people who are performance obsessed like myself to just get a little too focused on, oh, I'm going
to make this function faster. And maybe the better answer is, can we avoid that function being called?
Can we like not listen to that data source? Can we outsource this to a different
process that feeds us interesting information? The single most important thing in performance
engineering, I think, is figuring out what not to do. How do you make the thing that you're
actually doing as minimal as possible? That is job one. Honestly, I think one of the reasons that I
really like performance optimization as a topic to focus on, I don't like writing code very much.
I'm not very productive. It takes me a long time to do good work.
So I want to do the stuff that requires me
to write the fewest lines of code
and have the biggest impact, right?
This is like one of those hypotheticals.
You make a two-line change
and everything gets 10% faster.
The hard part was the three weeks of investigation
proving that it was going to work, right?
And I think this is actually a good example
of like you really have to think about the whole board.
You have to think about how you're structuring the code
and how you're structuring the system. How many hops is this going to go through? How many systems is it going
to go through? How can you get assistance from hardware? Any of these things. And like
micro-optimizing the fact that Clang has a better loop optimizer than GCC or the OCaml compiler,
like it's really annoying to look at the bad loop. Is that really what's killing you? No,
what's killing you is that you're looping over something that you shouldn't be looking at at all.
Right, so there's a bunch of stuff you just talked about
that we'd love to talk about more about,
in fact, the hardware stuff in particular.
But I don't want to lose track of the language issue.
So I stand by what I said that, like,
people often over-focus on the details of the language.
But the language does matter,
and I think it matters in particular for performance.
And I'm kind of curious what's your feeling about
how that affects how we approach the work
and your own kind of interaction and engagement with it?
I break it down into three categories.
The first category in which OCaml provides us a challenge is, I call it the most annoying but the least important.
And that's what I was saying earlier about, oh, our code generation isn't as good.
We're branching too much.
We have too many silly register spills.
It's annoying to look at.
It's really not what's killing you.
I would wish it was better.
And really, the limit isn't OCaml.
The limit is scale.
The C compiler people have been working for 30 more years than the OCaml compiler people
have.
And there's more people working on optimizing Clang right now across the world than we have
probably employees at Jane Street.
We're never going to catch up.
That's OK.
It's not really what's killing you. The second category is things that are just like maybe more of an actual problem, and hard to deal with, but not really the key issue. Our memory model requires us to do slightly more expensive things in some cases. Like a good example is we're a garbage-collected language. Our garbage collector inspects values at runtime.
Therefore, uninitialized data can be really problematic.
And so, you know, we have to do stupid things, in my brain, like, oh, it's really important to null out the pointers in this array and not just leave them behind or they'll leak.
Or you can't just have an uninitialized array that I promise I'll get to soon
because what happens if you GC in that range?
And like, I do actually think this is meaningfully costly in some scenarios, but I'm willing to put up with it in most cases. There are things you can do about it.
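(To make that concrete, here is a minimal, hypothetical sketch, not code from the episode: a small stack backed by an array, where popping has to overwrite the slot or the GC will keep the popped value alive.)

    (* Hypothetical sketch: a fixed-capacity stack backed by an array. Because
       the OCaml GC scans the array at runtime, a slot that still holds a
       pointer keeps that value alive even after it has logically been removed. *)
    module Bounded_stack = struct
      type 'a t =
        { slots : 'a option array
        ; mutable len : int
        }

      let create capacity = { slots = Array.make capacity None; len = 0 }

      let push t x =
        t.slots.(t.len) <- Some x;
        t.len <- t.len + 1

      let pop t =
        t.len <- t.len - 1;
        let x = t.slots.(t.len) in
        (* "Null out" the slot: if we only decremented [len], the GC would still
           see the old pointer and the value would leak until overwritten. *)
        t.slots.(t.len) <- None;
        x
    end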
The thing that I think
is most problematic
for our use of a language
like OCaml
gets back to mechanical sympathy.
And, you know, I said
that the world is not made
out of parentheses
that Lisp uses,
and it's also not made out
of algebraic data types.
OCaml's fundamental
representations of the world
are very boxy.
There's a lot of pointers. There's a lot of, you know, this object contains a pointer to something else, where in C++ it would just be splatted right there in the middle. And there are reasons we do this. There are reasons that make it easy to write good, clean, safe code, but it is fundamentally costly, and the language, if anything, lacks some mechanical sympathy.
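(A small, hypothetical illustration of what "boxy" means in practice; the type names are invented and this is not code from the episode.)

    (* Hypothetical illustration of boxing. In [order], the [level] field is a
       pointer to a separately allocated pair, so reading the price is an extra
       indirection and a likely cache miss. In C++ the pair's fields would
       typically be laid out inline in the struct. *)
    type order =
      { id : int            (* an immediate int, stored inline in the record *)
      ; level : int * int   (* a pointer to a separate two-field block *)
      }

    (* Flattening by hand removes the indirection and the extra allocation,
       at the cost of a slightly clunkier interface. *)
    type flat_order =
      { id : int
      ; price : int         (* e.g. a fixed-point price, stored inline *)
      ; size : int
      }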
Right. Or at least it makes it hard to express your mechanical sympathy
because getting control over the low-level details is challenging.
And I don't want to go too much into that.
We're actually doing a lot of work to try and make OCaml better exactly at this.
But a question I'm kind of more interested in talking about with you is,
how do you see us working around these limitations in the language
and the code base that we have?
There's a couple options here.
The first is you can kind of write it the hard way because, you know, OCaml's a real language. You can write whatever you want. It's just a question of difficulty. You know, if nothing else, I could in theory allocate a 64 gigabyte int array at the beginning of startup and then just write C in OCaml that just manipulates
that as memory, right? It would work. It would never GC. It would do all the things you wanted
it to. It'd just be miserable. And clearly I'm not going to do that. But given that we're a company that has a lot of
people who care about programming languages, one thing we're pretty good at is DSLs. And so, you
know, we have some DSLs, for example, that let you like describe a layout and we're going to embed
this layout into some, you know, low level string that doesn't know a lot, but it's still, if you
glance at it the right way, type safe. Now the DSL doesn't let you write out-of-bounds accesses or anything like this.
Right. And the DSL, you sit down and write down what's the format of a packet that you might get
from the NASDAQ exchange. And then it generates some actually relatively reasonable, easy to
understand interfaces that are backed by the low-level, horrible manipulation of raw memory.
And so you write a DSL, you generate some code,
and what you surface to the user is a relatively usable thing,
but you get the physical behavior that you want
with all the flattening and inlining and tight representation of data.
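(A rough, hand-written sketch of the shape of code such a DSL might generate; the message, field names, and offsets are invented for illustration and are not the real exchange format or the real generated code.)

    (* Hypothetical sketch: typed accessors over a raw byte buffer. A "message"
       is just a window into the bytes; reading a field allocates nothing. *)
    type buf = (int, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

    let byte (buf : buf) pos = Bigarray.Array1.get buf pos

    (* Big-endian unsigned reads straight out of the buffer. *)
    let get_u16 buf pos = (byte buf pos lsl 8) lor byte buf (pos + 1)
    let get_u32 buf pos = (get_u16 buf pos lsl 16) lor get_u16 buf (pos + 2)

    module Add_order = struct
      type t = { buf : buf; pos : int }

      (* Invented offsets for an imaginary "add order" message. *)
      let stock_locate t = get_u16 t.buf (t.pos + 1)
      let shares t = get_u32 t.buf (t.pos + 11)
      let price t = get_u32 t.buf (t.pos + 23)
    end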
You're hitting on a really good point,
that a lot of these originated from our need to parse formats
that were given to us, right?
But it turns out you can also just use them
for representing your data in memory.
I can build a book, a representation of the state of the market that's just laid out
flatly and packed for me. It's much less pleasant to use than real OCaml. It's difficult. And we
only do this in the places that it matters, but you can do it. There's what I like to call like a dialect of OCaml we speak in sometimes. We generally say it's zero-alloc OCaml. And you know, the most notable thing about it is it tries to avoid touching the garbage collector. But implied in that zero-alloc dialect is also a lot of representational things.
We have little weird corners of the language that are slightly less pleasant to use, but will give you more control over layout and more control over, you know, not touching the GC and using malloc instead.
And it works pretty well.
It's harder, but you can do it.
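(A minimal sketch of the flavor of that zero-alloc style, with invented names; nothing here is Jane Street code.)

    (* Hypothetical sketch of the zero-alloc style: the hot path mutates a
       record that was allocated once at startup, instead of building fresh
       values on every update, so the GC has nothing to do on this path. *)
    type top_of_book =
      { mutable bid_price : int  (* fixed-point, to avoid boxed floats *)
      ; mutable bid_size : int
      ; mutable ask_price : int
      ; mutable ask_size : int
      }

    (* Allocated once, up front. *)
    let book = { bid_price = 0; bid_size = 0; ask_price = 0; ask_size = 0 }

    (* Updating in place allocates nothing. *)
    let on_bid ~price ~size =
      book.bid_price <- price;
      book.bid_size <- size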
In the same way, another thing we think about a lot
is interoperability.
Again, sort of out of necessity.
There are libraries we have to interact with
that only work in C,
so we have these little C stubs that we can call into
and it's really cheap.
It's not like Java.
It's not like one of those languages
where there's this huge, costly process
for going cross-language.
You just make a function call and it just works, right?
Yeah, like the overhead, I think,
for a function call to C at least is,
I don't know, three or four nanos.
And I think in Java it's like
300 or 400 nanos because the JNI is a
beast for reasons I've never understood.
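(For flavor, a hypothetical sketch of what calling into C looks like on the OCaml side; the function names are invented, and the C stub itself is just an ordinary C function compiled into the program.)

    (* Hypothetical binding to a C function that reads the CPU's timestamp
       counter. [@@noalloc] promises the stub neither allocates on the OCaml
       heap nor raises, so the compiler can skip the usual GC bookkeeping and
       the call is close to a plain function call. *)
    external rdtsc : unit -> (int64 [@unboxed])
      = "example_rdtsc_bytecode" "example_rdtsc_native"
      [@@noalloc]

    (* On the C side (sketch):
         int64_t example_rdtsc_native(void) { return (int64_t) __rdtsc(); }
       plus a boxing wrapper for bytecode builds. *)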
Option two is
cry a little bit and deal with it. Like, yeah,
we face a fundamental disadvantage.
We're working on reducing it. I'm super excited
about getting more control over the layout
of OCaml types. This is like the biggest change
to me that maybe will ever happen in the compiler
is being able to write down a representation of memory that is what I want it to be
in a real OCaml type that is fun to play with. But fundamentally,
we're kind of at a disadvantage and we just have to work harder and we have to think more about, okay, we're going to have a higher
cache footprint. What does this mean about our architecture? How can we get cache from other places?
How can we spread out the job across more steps, more processes,
pre-process this one place?
It gets back to, you don't want to focus on over-optimizing this one function.
You want to make your overall architecture do the right things and let that inform infrastructural changes.
And I think you make an important point that it's not that any of the optimizations you want to do are impossible.
It's that they're a little bit more awkward than you would like them to be, and you have to do a little extra work to get them to happen.
And that means fundamentally that we don't always do them.
And so we really do pay a cost in performance, in that the harder you make it for people to do the right thing, the less it happens.
One of the hardest things to learn
when you're doing this sort of work is discipline.
I have to go through the code base every day
and say, no, I'm not fixing that.
Yes, it's like offensive to me on a personal level
that it's slow and it allocates and does these things, but it just doesn't matter. It's legitimately
hard for me not to stop whatever I'm doing and just like fix this optimization that I know is
sitting there. If this doesn't bother you on a fundamental physical level, I just don't understand.
But you have to prioritize.
You have to prioritize. There's so much more important things to be doing.
So another thing I'm wondering about is how you think about the role of hardware in all of this.
In some sense, if you're thinking about making things as low latency as possible,
why do we even bother with a CPU, right?
You look at the basic latency of consuming and emitting a packet.
And on any ordinary computer, you're going to cross twice over the PCI Express bus.
It's going to cost you about 400 nanos each way.
You know, that and a little bit of slop between the pieces,
it's kind of hard to get under a mic, really,
for anything where you're like,
I'm going to consume some data off the network,
do something and respond to it.
And on an FPGA attached to a NIC,
you can write a hardware design
that can turn around a packet in under 100 nanoseconds.
So there's like an order of magnitude improvement
that's just like physically impossible to get to
with a computer architecture.
And so in some sense, if all you cared about is, well, I just want the absolute lowest latency
thing possible, it's like, why are we even using CPUs at all? So how do you think about the role
of hardware as integrating and how you think about performance in the context of building
these kinds of systems? It informs the architecture you choose. Because yeah, nothing's ever going to
be as fast as hardware, but it's really hard to write hardware. It can't do complicated things,
and even the things it can do are just exponentially harder to write.
I have never in my life written Verilog,
which feels like a personal sin.
I am reliably informed that it is miserable and unpleasant,
and your compiler takes 24 hours to run.
So we have a lot of strategies with really complicated logic,
and that logic is
important and it's valuable. And implementing that in hardware is, I'm just going to say,
flatly impossible. You couldn't do it. And so the question becomes, what can you outsource
to hardware that is easy? How do you architect your system so that you can do the really, really
hyper-focused speed things in a simple, simple hardware system that only does one thing?
And you feed that hardware the right way.
But the rest of the software system
still needs to be fast.
It has to be fast on a different scale,
but it turns out there's optimizations
that matter on roughly every single
timescale you can imagine.
We have trades at this firm that, like you say,
complete in less than 100 nanoseconds
or you might not even bother.
We also have trades where we send someone an email and the next day you get back a fill, right? And every level in
between there turns out you can do useful optimization work. And even with stuff that
has no humans in the loop, we really do think about nanoseconds, microseconds, milliseconds,
depending on what you're doing and how complicated it is, you really do care about many different
orders of magnitude. Yeah, there's a system that I've worked on where our real goal is to get it down
from having like 50 millisecond tails
to one millisecond tails.
And we celebrate it when we get there
and it still does a lot of great trading.
We have other systems that are doing simpler,
more speed competitive things
where like your software needs to be 20 microseconds
or 10 microseconds or five microseconds.
That's achievable.
It's harder and you have to do simpler things
just like with the hardware, but it's achievable. And you care about both of these
latencies. And I think another good thing to point out is that you said systems that don't interact
with humans, but it turns out some of the most important latencies are in the systems that do interact with humans. I don't know about you, Ron, but when my editor
freezes up for five seconds while I'm typing, I just want to put a keyboard through the window. It just drives me nuts, right? And putting aside like the aggravation,
human-responsive systems are just really important too. Both when, you know, you're actively trading by hand and you want to, like, have good latency on the thing that's displaying the prices in front of you, that matters a lot. But also I think it matters
a lot for just your ability to adapt and improve your systems over time.
I said earlier, think about historical research.
That's a throughput problem, but it's also a latency problem on a human scale.
A thing that will give you feedback on whether your trading idea was good in a minute is worth so much more than one that gives you an idea if it's worth anything in a day.
Yeah, that's absolutely right. I think for lots of different creative endeavors, and I think trading and software engineering both count from my perspective, and also all sorts of
different kinds of research, the kind of speed of that interactive loop of like, I have an idea,
I try an idea, and I get feedback on how well that idea works out. The faster you can make that loop,
the more that people can experiment with new things, and the more the creative juices can
get flowing and more ideas you can create and try out and evaluate.
A thing I'm obsessed with telling people about is this Air Force colonel from the 50s or
the 60s.
His name was John Boyd.
He invented this idea called the OODA loop, O-O-D-A.
I believe it's Observe, Orient, Decide, Act.
It's like the four stages you go through in figuring out like, oh, I see something.
I think about it.
I decide how I'm going to adjust to that.
I implement the adjustment.
And the faster this loop happens, the more control you have over the system and the better you can
iterate. I think a great example of this outside software, oddly enough, is whiskey. I like bourbon
a lot, right? And it turns out that to make good bourbon takes five, seven, 10 years, right?
And so you don't get a lot of opportunity to iterate. And there are some people who are doing
a really controversial thing, which is they're using technology to rapidly age spirits. And some people call this sacrilege.
And you know, it's never going to be quite as good as like doing it the hard way. But on the other
hand, it lets you taste it in a month and be like, I think I'm going to change this. And they're
going to get 12 iterations in the time someone else might get one. And I just think this sort
of process turns out to matter a lot in software too, of being able to rapidly iterate on what
you're doing and get feedback either on the quality of the trading or for that matter, on the quality of the performance.
One of the things I really care about a lot is building systems that let me really quickly
evaluate the performance of a system. Because there's a huge difference between, I think this
is going to be faster. I did a profile. I know this is a hotspot. I made it better. Okay, I'll
run it in prod for a couple of days. And okay, I've made this change. I think it's going to be
better. I'm going to run it on this really realistic test bed. I know in 10 minutes if it's better
and I can change it and I can try it again. Yeah. Iteration speed matters kind of almost
everywhere you are. And I really like your point about this kind of performance analysis mattering
for trading systems and also mattering for like systems with human interaction. And actually,
I feel like the performance mindset isn't really so different. You look at someone who's really good
at thinking hard about and optimizing the performance
of stuff in the browser.
There's a very different instruction set.
And oh my God, is that a weird
and complicated virtual machine.
But a lot of the same intuitions and habits of mind
and that focus on being really interested
in details that are really boring.
All of that really matters a lot, right?
You really have to care about like all the kind of gory details
of the guts of these things to do a really good job of digging in.
Yeah, and this is why I kind of wonder
if there's just a mindset that can't be trained
because you have to just look at this and go,
what the hell are you talking about, Ron?
This isn't boring.
I get why you say that, but I just look at this stuff and go like,
you don't have to pay me to look at this.
Sorry, I take that back.
You do have to pay me to look at this. I would not do this for free. I promise.
Boring in quotes. I love this stuff too and totally understand why it isn't, but it is from the outside. Like it's a little hard to explain to your friends and family why you like this stuff.
They're like, you can't even explain in words out loud what the details are that are going on, because people will fall asleep. Do you know, my dad once sat me down in college and said,
are you sure you want to do this CS thing and not go into something where you can find a job like being a lawyer?
I'm the only person who disappointed his parents by not becoming an English major.
Well, maybe that's a good point to end it on.
Thanks so much for joining me.
This has been great.
Thanks for having me on.
This is a really good talk. Thank you.