Signals and Threads - Multicast and the markets with Brian Nigito
Episode Date: September 23, 2020
Electronic exchanges like Nasdaq need to handle a staggering number of transactions every second. To keep up, they rely on two deceptively simple-sounding concepts: single-threaded programs and multicast networking. In this episode, Ron speaks with Brian Nigito, a 20-year industry veteran who helped build some of the earliest electronic exchanges, about the tradeoffs that led to the architecture we have today, and how modern exchanges use these straightforward building blocks to achieve blindingly fast performance at scale. You can find the transcript for this episode along with links to things we discussed on our website.
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky.
Today, I'm going to have a conversation with Brian Nigito, essentially about the technological underpinnings of the financial markets and some of the ways in which those underpinnings differ from what you might expect if you're used to things like the open internet and the way in which cloud infrastructures work. And we're going to talk
about a lot of things, but there's going to be a lot of focus on networking and some of the
technologies at that level, including things like IP multicast. And Brian Nigito is a great person
to have this conversation with because he has a deep and long history with the financial markets.
He's worked in the markets for 20 years. Some of the time he spent working at the exchange level where he did a lot of the foundational work
that led to the modern exchange architectures that we know today. And he's also worked on the
side of various different trading firms. And for the last eight years, Brian's been working here
at Jane Street and his work here has covered a lot of different areas. But today he spends a lot
of time thinking about high performance, low latency, and especially network level stuff.
So let's dive in.
I think one thing that I'm very sensitive to is a lot of the people who are listening
don't know a ton about the financial markets and how they work.
Just to get started, Brian, can you give a fairly basic explanation of what an exchange
is?
I think when you hear about an exchange, you can think of lots of different kinds of marketplaces. But when we talk about an exchange,
we're talking about a formal securities exchange. And these are the exchanges that the SEC
regulates, and they meet all of the rules necessary to allow people to trade in securities.
So when we use that loosely, yeah, it's pretty different than your average flea market,
supposed to be anyway.
That's obviously a function which, once upon a time, was done with physical people in the
same location, right?
Those got moved into more formal, more organized exchanges with more electronic support.
And then eventually, there's this kind of transformation that's happened essentially
over the last 20 years, where the human element has changed an enormous amount.
Now, humans are obviously deeply involved in what's going on, but the humans are almost
all outside of the exchange.
And the exchange itself has become essentially a kind of purely electronic medium.
Yeah, it's a really interesting story because you have examples of communications technologies and electronic trading going back to late 60s,
but probably more mid 70s. I'm being a little loose with dates. So it was kind of always present,
but the rule set was not designed to force people to operate at the kinds of timescales
that electronic systems would cause you to operate at.
It was rather forgiving.
So if somebody on the floor didn't want to deal with an electronic exchange, the electronic
exchange had to wait.
And over the past 10 to 15 years, that's kind of flipped.
And so generally we favor always accessible electronic quotations.
To step back a little bit, the exchanges are the places for people to meet and trade,
as you said, to advertise their prices and for people to transact with each other.
Other than people who are buying and selling, what are the other people who interact at the
exchange level? What are the other kind of entities that get hooked in there?
So you have obviously the entities who either in their own capacity or on behalf of other people
are transacting securities, but then you have financial institutions that are clearing and
guaranteeing those trades, providing some of the capital or leverage to the participants who are
trading. They obviously want to know what's going on there. You have other exchanges because the
rule set requires the exchanges to respect each other's quotations. In this odd way, there's a web
where the exchanges are customers of each other. And you may also have various kinds of
market data providers. So those quotes that reflect the activity on the exchange are eventually making
their way all the way down to what you might see scrolling on the bottom of the television
or your brokerage screen or financial news website, et cetera. I guess they even make it
all the way
down to the printed page when the Wall Street Journal prints transaction prices.
So what does this look like at a more systems level? What are the messages that the different
participants are sending back and forth? The most primitive sorts of things are you
have orders or instructions. There are other platforms where we have quotes and we may use
that loosely, but we'll just say orders. And an order would just say that I would like to buy or sell, let's say a specific stock.
And I'd like to do so at no worse than this price and for no more than this quantity.
That may mean I could get filled at a slightly better price than that. I could get filled for less than that.
I could get filled not at all.
And that order could basically check the book immediately and then come right back to me
if there's nothing to be done.
Or it can rest there for some non-zero amount of time where it could advertise and other
people may see it and choose to interact with it.
And then obviously I can withdraw that interest or cancel it.
So when we talked about orders or cancels, those go hand in hand. And finally, there's execution messages where if you and I
intersect on our interest, I want to buy, you want to sell or vice versa, then the exchange
is going to generate an execution to you and to me saying that that happened and the terms of that
trade. And I guess one of the key properties here is that you have a fairly simple core set of messages. There's this basic data structure at the heart
of it called the book, which is the set of orders that have not been satisfied. And then people can
send messages to add orders and remove orders. And then if two orders cross, if there are two
orders that are compatible where a trade can go up, then an execution occurs and the information
flows out. A fairly simple core machine at the heart of it.
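To make that mechanic concrete, here is a minimal Python sketch of the kind of price-time priority book and matching step being described. All names and message shapes are invented for illustration; a real matching engine handles far more (order types, priority rules, self-match prevention, and so on).

```python
# Minimal sketch of the "book": resting orders on two sides, matched in
# price-time priority. Field names here are invented for illustration.
from collections import namedtuple

Order = namedtuple("Order", "order_id side price qty")  # side is "buy" or "sell"

class Book:
    def __init__(self):
        self.buys = []   # resting buy orders, best (highest) price first
        self.sells = []  # resting sell orders, best (lowest) price first

    def add(self, order):
        """Cross the incoming order against the book, then rest any remainder."""
        executions = []
        resting = self.sells if order.side == "buy" else self.buys
        crosses = (lambda p: order.price >= p) if order.side == "buy" else (lambda p: order.price <= p)
        qty = order.qty
        while qty > 0 and resting and crosses(resting[0].price):
            top = resting[0]
            traded = min(qty, top.qty)
            executions.append((order.order_id, top.order_id, top.price, traded))
            qty -= traded
            if traded == top.qty:
                resting.pop(0)
            else:
                resting[0] = top._replace(qty=top.qty - traded)
        if qty > 0:  # unfilled remainder rests on the book, advertised to others
            side = self.buys if order.side == "buy" else self.sells
            side.append(order._replace(qty=qty))
            side.sort(key=lambda o: -o.price if order.side == "buy" else o.price)
        return executions

    def cancel(self, order_id):
        self.buys = [o for o in self.buys if o.order_id != order_id]
        self.sells = [o for o in self.sells if o.order_id != order_id]
```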
But then lots of different players who want different subsets of information for different
purposes.
There are people who are themselves trading, want to see, of course, their own activity
and all the detail about that.
And they also want to see what we call market data, the kind of public anonymized version
of the trading activity.
So you can see what are the prices that are out there
that are advertised for you to go and transact against. And so in the end, you need to build
a machine that's capable of running this core engine, doing it at sufficient speed,
doing it reliably enough. Maybe a thing that's not apparent if you haven't thought about it is
there's a disturbing, dizzying amount of money at stake. And oh my God, you do not want
to lose track of the transactions, right? If you say like, oh, you guys did this trade and then
you forget about it and don't report it or you report to one side or not the other, terrible
things happen. So reliability is a key thing. Yeah. And I think to go back, there's lots of
different consumers, lots of different participants. And I think the key word there is there's lots of
competing participants. So one thing you didn't mention in there is
disseminating all that information fairly. So trying to get it to everybody at the same time
is a real challenge and one that participants are studying very, very carefully and looking
for any advantage they can technologically within the rule set, et cetera. So that extra
layer of competition
sort of makes the problem a little more complicated and a little more challenging.
And this fairness issue is one that you've seen from the inside working
on early exchange infrastructure at Island and at Instinet, which eventually became
the technology that NASDAQ is built on. Early on, you guys built an infrastructure that I think
didn't have all of the fairness guarantees that modern exchanges have today.
Can you say more about how that actually plays out in practice?
When working on the island system, it was very close originally to sort of, I guess,
fair in that you had the same physical machines, you had an underlying delivery mechanism,
which we'll talk about, that was very fair at getting it to those individual machines.
And then you were sending copies of orders or instructions after going through one application
to everyone.
So you were all passing through about the exact same amount of work and about the exact
same number of devices, but it was actually very inefficient.
We were using thousands of machines that were mostly idle.
So once we started trying to handle multiple clients on a single machine, it exposed sort of
some obvious and silly problems. The naive implementation where people would connect,
we would collect all of those connections. And then when we had a message, we would send them
on the connections serially,
often in the order in which people connected. Well, that immediately led to thousands of
messages per second before the exchange opened where somebody tried to be the very first
connection in that line. So then you start sort of round robining. So you start from one and then
the next time around you start from two, et cetera, et cetera, to try to randomize this.
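As a toy illustration of the fairness problem just described: writing serially over the connections in a fixed order always favors whoever is first in line, so one workaround is to rotate the starting index on every message. This is only a sketch with invented connection objects, not how any particular exchange gateway worked.

```python
# Toy sketch: sending serially in connection order always favors whoever
# connected first, so rotate the starting index instead.
def send_serially(connections, message):
    for conn in connections:          # connection #1 always hears the news first
        conn.send(message)

class RotatingSender:
    def __init__(self, connections):
        self.connections = connections
        self.start = 0

    def send(self, message):
        n = len(self.connections)
        for i in range(n):
            self.connections[(self.start + i) % n].send(message)
        self.start = (self.start + 1) % n   # next message starts one slot later
```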
And then you had people who were connecting to as many different ports as they could and
taking the fastest of each one.
And so these incentives are very, very strong.
And we'd like to use machines to their fullest, but to literally provide each participant
their own unique machine for each connection starts to get ridiculous as well.
So where did that lead you? How did you end up resolving that problem?
A lot of these were TCP protocols. In those days, we actually had a decent number of people
connecting over the open internet. I don't think we provided trading services directly
over the open internet, but we did actually provide market data that way.
And TCP is probably your only reasonable option over something like the internet.
But once you started moving towards co-location and towards more private networks where
people's machines were in the same data center and really only two or three network devices away
from the publishing machine, it became a lot more feasible to start using different forms
of networking, unreliable networking, UDP, and that leads you
to something called multicast, where rather than you sending multiple copies of the message to
N people, you send one copy that you allow the network infrastructure to copy and deliver
electrically and much more deterministically and quickly.
For someone who's less familiar with the low-level networking story,
just give a quick tour of the different options you have
in terms of how you want to get a packet of data
from one computer to another.
The internet and ethernet protocols
are generally a series of layers.
And at the lowest layer,
we have an unreliable best effort service
to deliver a packet's worth of data.
And it's sort of a one-shot thing,
more or less point-to-point from this machine to some destination address.
And then we build services on top of that that make it reliable by sequencing the data,
attaching sequence numbers so we know the original order that was intended,
and having a system of retransmissions, measuring the average round trip
time, probabilistically guessing whether packets are lost, et cetera, et cetera. So that all gets
built up into a fairly complex protocol that most of the internet uses, TCP, maybe not all,
and there are some people pushing for future extensions to that. But by and large, I'd say
that the vast majority of reliable, in-order connected data over the internet is sent via TCP.
And TCP assumes that there's one sender and one receiver.
And it has unique sequence numbers for each of those connections.
So I really can't show the same data to multiple participants.
I actually have to write a unique copy to each participant. UDP is a much lighter layer on top of the underlying
raw transport, still unreliable, but with a little bit of routing information. And that protocol
has some features where you can say, I want to direct this to a specific participant
or a virtual participant, which the network could interpret as a group. Machines can join that group
and then that same message can be delivered by network hardware to all the interested parties.
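For readers less familiar with the API side of this, here is a rough sketch of joining and sending to a UDP multicast group with the standard sockets interface in Python. The group address and port are placeholders, and real market data feeds layer their own protocols on top of this.

```python
# Sketch of joining and receiving a UDP multicast group with the standard
# sockets API. The group address and port here are placeholders.
import socket
import struct

GROUP, PORT = "239.1.2.3", 5000   # hypothetical multicast group

# Receiver: join the group; the network delivers one copy per interested host.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv_sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
recv_sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# Sender: one send; the switches replicate it to every member of the group.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
send_sock.sendto(b"one copy, delivered to everyone", (GROUP, PORT))

data, addr = recv_sock.recvfrom(65535)   # each joined receiver gets the datagram
```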
One of the key features there, which I think is maybe not obvious, why would I prefer
multicast over unicast? Why is it better for me to send one copy to the switch that then sends a
bunch of copies to a bunch of
different recipients versus me just sending a bunch of individual copies on my own? What's
the advantage of baking this into the actual switch infrastructure?
I mean, the switches are very fast and deterministic about how they do that.
And because of, I think, their usage in the industry, they've gotten faster and more
deterministic. So they can just electrically repeat those bits simultaneously
to all, you know, 48 ports or whatever that switch might have. And that's just going to be much
faster and more regular than you trying to do it on a general purpose server where you might be
writing multiple copies, writing to multiple places. It's
just, you really can't compare the two. One of the key advantages of using switches is that
the switches are doing the copying in specialized hardware, which is just fundamentally faster than
you can do on your own machine. And also there's a distributed component of this, right? When you
make available a multicast stream, there's this distributed algorithm that runs on the switches
where it learns essentially
what we call the multicast tree, which is at each layer, each switch knows to what other switches
it needs to forward packets. And then those switches know which part they need to forward
packets to. And so that gives you the ability to kind of distribute the job of doing the copying.
So if you have like 12 recipients in some distant network, you can send one to the local switch.
And then the final
copying happens at the last layer at the place where it's most efficient. So that's like the
fundamental magic trick that multicast is providing for you. I mean, as the networks get simpler,
the very first versions we were using weren't even using multicast. We were using something
called broadcast, which just basically said anything you get, I want you to repeat everywhere.
It's funny because you could imagine that you could certainly overwhelm a network that way.
And a large part of the uncertainty and the variation that comes from TCP
are these self-learning algorithms that are very concerned about network health.
And so when we would work with Linux kernel maintainers and stuff like that,
and have questions about variability that we saw, then they would say, well, you shouldn't be using TCP. If you care
about latency, you shouldn't be using TCP. TCP makes trade-offs all the time for network health
and so on and so forth. And for the internet, that is absolutely necessary. And if you really
have these super tight requirements and you really want them to get there fast, and you have a
controlled network with very little packet loss and very few layers between participants, you
should be using UDP. And they were probably right. We mostly do nowadays for this stuff,
but it took a while to get there. And they were right in a way that was kind of totally
unactionable, which is to say there are a bunch of standardized protocols about how you communicate
when you're sending orders. In fact, another thing to say about the trading world is,
if you step back and look at how the protocols we use are broken up, there are two kinds of primary
data flows that a trading firm encounters, at least when we're talking to an exchange.
There is the order flow connection, where when we send our specific orders and our specific
cancels and see the specific responses to those, and that is almost always done on a
TCP connection.
And then there is the receipt of market data, and that's where you're sending the data that
everyone needs to see exactly the same anonymized stream of data, and that's almost always done
through multicast.
So there is part of the data which is done via UDP in the way the Linux kernel developers
would recommend.
And there's part of the data flow that's still to this day done under TCP.
And I think the difference is we no longer use the open internet in the way that we once
did, right?
I think there's been this transformation where instead of sending things up to the trunk
and having things routed around the big bucket of the open internet,
trading firms will typically have lots in the way of co-location sites where they will put some of their servers very near to the switches that the exchange manages. And they will have what we call
cross-connects, right? We will connect our switch to their switch and then bridge between the two
networks and deliver multicast across these kind of local area networks that are very tightly controlled, that have very low rates of message loss. So in some sense,
we're running these things over a very different network environment than the one that most of the
world uses. Yeah, a couple of interesting observations to that. It means that co-location
makes competition between professional participants more fair. It enables us to use
these kinds of technologies. Whereas without co-location, you have less control over how
people are reaching you and you end up with probably more variation between participants.
I think it's also worth saying that a lot of things we're talking about are a little bit
skewed towards US equities and equities generally. There's lots of other trading
protocols that are a little bit more bilateral. There isn't like a single price that everybody
observes. In currencies, often people show different prices to different people. There's
RFQ workflows and fixed income and somewhat in equities and ETFs. But by and large, probably the
vast majority of the messages generated look a bit like this, where there's shared public market data that's anonymized, but viewable by everyone. And then the private stream,
as you say, of your specific transactions and your specific involvement.
I think from a perspective of what was going on 15 years ago, I feel like the obvious feeling was,
well, yeah, the equity markets are becoming more electronic and more uniform and operate in this way where there's
kind of central open exchange and not much in the way of bilateral trading relationships.
And surely this is the way the future and everything else is going to become like this.
No, actually the world is way more complicated than that. And currencies and fixed income and
various other parts of the world just have not become that same thing.
Yeah. And I think that's partly because those products are just legitimately different and the participants have different needs.
Sometimes it's because the equity markets happen so, I think, relatively rapidly. A lot of the
transformation happened there. And so other markets that were a little bit behind saw the playbook,
they saw how it changed and they positioned and controlled some of that change to maintain
their current business models, et cetera. I wanted to go back to one thing was you said
we mostly use TCP. And it's interesting because there were attempts, I know of at least one off
the top of my head, probably there are more, to use UDP for order entry. Specifically,
somebody had a protocol called UFO, UDP for orders. There wasn't a ton of uptake because look, if you're a trading firm
connecting to 60 exchanges and 59 of them require you to be really good at managing a TCP connection
and one of them offers a unique UDP way, that's great, but that's one out of 60. And so I kind
of have to be good at the other thing anyway. So there just wasn't as much adoption because
there's just enough critical mass and momentum that the industry kind of hovers around a certain set
of conventions. And the place where you see other kinds of technologies really taking hold are where
there's a much bigger advantage to using them, right? I think when distributing market data,
it's just kind of obviously almost grotesquely wasteful to send things in unicast form where you send one message
per recipient. And so multicast is a huge simplifier. It makes the overall architecture
simpler, more efficient, fairer. There's a big win that really got adopted pretty broadly.
We've kind of touched on half of the reason people use multicast, right? Which is, I think,
one of the core things I'm kind of interested in this whole story is why is trading so weird in this way, right? Multicast was like, when I was a graduate
student many years ago, multicast was going to be a big thing. It was going to be the way in the
internet that we delivered video to everyone, totally dead. Multicast on the open internet
doesn't work. Multicast in the cloud basically doesn't work. But multicast in trading environments is
a dominant technology. And one of the reasons I think it's a dominant technology is because
it turns out there are a small number of videos that we all want to watch at the same time.
Unlike Netflix, where everybody watches a different thing, we actually want in the
trading world to all see what's going on on NASDAQ and ARCA and NYSE and Cboe and so on and so forth live in real time. We're all stuck to the same cathode ray tube.
But there's a whole different way that people use multicast that has less to do with that,
which is that multicast is used as a kind of internal coordination tool for building certain
kinds of highly performant, highly scalable infrastructure.
What is the role that multicast plays on the inside of exchanges and also on the inside of lots of firms trading infrastructure? The exchange, the primary thing it's doing is
determining the order of the events that are happening. And then the exchange wants to
disseminate that information to as many participants as possible. So certain parts of this don't parallelize very well. The sequence has to pretty much be done in one place
for the same security. So you ended up where you were trying to funnel a lot of traffic down into
one place and then report those results back. In that one place, you wanted to do as little work
as possible so that you could be fast and deterministic. And then you were spreading that work out into lots of other applications that were
sort of following along and provided value-added information, value-added services, and reporting
what was happening in whatever their specific protocol was.
So the same execution that tells you that you bought the security you were interested
in can also tell your clearing firm, maybe in a slightly different form, can tell the general public via market data
that's anonymized and takes your name off of it, et cetera, et cetera.
And let me try and just kind of sharpen the point you're making here, because I think it's an
interesting fact about how this architecture all comes together, which the kind of move you're
talking about making is taking this very specific and particularistic problem of like, we want to manage a book of open orders on an exchange and distribute this
and that kind of data and turning it into a fairly abstract CS problem of transaction
processing.
You're saying like, look, there's all these things that people want to do.
The actual logic and data structure at the core of this thing is not incredibly complicated.
So what we want to do is just to simplify all of the work around it,
we're just going to have a system whose primary job is taking the events, the request to add
orders and cancel and so forth, and choosing an ordering and then distributing that ordering
to all the different players on the system so that they can do the concrete computations that
need to be done to figure out what are the actual executions that happen, what are the things that need to be reported to regulators, what needs to
be reported on the public market data. And then multicast becomes essentially the core fabric
that you use for doing this, right? You have one machine that sits in the middle,
you can call it the matching engine, but you could also reasonably just call it a sequencer,
because its primary role is getting all the different requests and then publishing them back out in a well-defined order.
Worth noting that multicast gives you part of the story, but not all of the story because
it gives you getting messages out to everyone, but it misses two components.
It doesn't get it to them reliably, meaning messages can get lost.
And it doesn't necessarily get them to each participant in order.
Essentially,
the sequencer kind of puts a counter on each message. So you can see it's like, oh, I got
message one, two, four, three. Well, okay, I got to reorder them and interpret them as one, two,
three, four. And then also that ordering lets you detect when you lose messages. And then you have
another set of servers out there whose job is to retransmit. When things are lost, they can fill the gaps.
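A rough sketch of that receiver-side logic, with an invented retransmission-request hook standing in for whatever the gap-fill servers actually speak:

```python
# Sketch of a receiver using the sequencer's per-message counter to reorder
# out-of-order delivery and to detect gaps. The retransmit request here is
# an invented placeholder.
class SequencedReceiver:
    def __init__(self, request_retransmit):
        self.expected = 1                 # next sequence number we want to apply
        self.pending = {}                 # seq -> message, held until in order
        self.request_retransmit = request_retransmit

    def on_packet(self, seq, message):
        if seq < self.expected:
            return []                     # duplicate or already-applied message
        self.pending[seq] = message
        if seq > self.expected and self.expected not in self.pending:
            # We can see a gap: ask the retransmission servers to fill it.
            self.request_retransmit(self.expected, seq - 1)
        ready = []
        while self.expected in self.pending:
            ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return ready                      # messages now deliverable in order
```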
And now this is a sort of specialized supercomputer architecture, which gives you this very
specialized bus for building what you might call state machine style applications.
Right. And I will say, I think I'm aware of a number of exchanges that actually do have a model
where they actually have just a sequencer piece that does no matching, that really just determines the order. And then some of these sidecar pieces are
the ones that are actually determining whether matches do indeed happen, and then sequencing
them back, reporting them back, et cetera, et cetera. So there's definitely examples of that.
A couple other points. So yeah, the gap filling and recovery has been a problem that I think is
covered by other protocols. There are reliable
multicast RFCs and protocols out there. And everywhere I've been, when we've looked at them,
we've run into the problem that they have the ability for receivers to slow or stop publication.
And in those cases, if you scale up to having thousands of participants,
there's sort of somebody somewhere who always has a problem.
So using any of these general purpose, reliable multicast protocols
never seemed to quite fit any of the problems that we had.
And I think because of the lack of use for the other reasons you mentioned,
they were generally not super robust compared to what we had to build ourselves.
And so we ended up doing exactly that where we added sequencing and the ability to retransmit missed messages
in various specialized ways. It's also worth noting that you get some domain-specific benefits
that I think also can generalize where if you've missed a sufficient amount of data,
I guess you can always replay everything from the beginning, but it sort of turns out that
if you know your domain really well and you can compress that data down to some fixed amount of
state, you can have an application that starts after 80% of the day is complete and be immediately
online because you can give him just a smaller
subset of the state. And a general purpose protocol like TCP, where you'd have to sort of
replay any missed data, has a number of problems in trading. That can be buffered there for sort
of arbitrarily long, and it assumes you still want it to get there. And it's buffering it byte by
byte. Whereas if you say, oh, I'd like to place an order, oh, I'd like to cancel an order,
oh, I'd like to place an order, oh, I'd like to cancel an order.
If all of those are sitting in your buffers,
the ideal thing to do would be,
well, if you know the domain,
they cancel each other before even going out
if they're waiting in the buffer and you said nothing.
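A minimal sketch of that coalescing idea, with invented message shapes: if an order and its cancel are both still sitting in the local outbound queue, neither needs to go on the wire.

```python
# If a new order and its cancel are both still queued locally, they can
# annihilate before ever being sent. Message shapes are invented.
def coalesce_outbound(queue):
    """queue is a list of ("new", order_id) / ("cancel", order_id) tuples."""
    cancelled = {oid for kind, oid in queue if kind == "cancel"}
    unsent_new = {oid for kind, oid in queue if kind == "new"}
    drop = cancelled & unsent_new          # the order never left; drop both messages
    return [(kind, oid) for kind, oid in queue if oid not in drop]

# e.g. [("new", 1), ("cancel", 1), ("new", 2)] collapses to [("new", 2)]
```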
So when we design those protocols ourselves,
optimized for this specific domain,
we can pick up a little bit more efficiency when we do it. This is, in fact, in some ways, a general story about optimizing
many different kinds of systems, essentially specialization, understanding the value system
of your domain and being able to optimize for those values. I think the thing you were just
saying about not waiting for receivers, that's in some sense part of the way in which people
approach the business of trading. The people who are participating in trading care about the latency and responsiveness
of their systems. People who are running exchanges, who are disseminating data, care about getting
data out quickly and fairly, but they care more about getting data to almost everyone in a clean
way than they do about making sure that everyone can keep up. So you'd much rather just kind of
pile forward obliviously and keep on pushing the data out.
And then if people are behind, well, you know, they need to think about how to engineer their
systems differently.
So they're going to be able to keep up.
And you worry about like the bulk of the herd, but not about everyone in the herd.
You know, the stragglers in the herd, well, you know, they can catch up and get retransmissions
later and they're going to be slower, but we're not going to slow down.
Understanding what's important to the applications means things can be massively simplified. A huge step you can take in any
technical design is figuring out what are the parts of the problem you don't have to solve.
I think it's also worth saying that the problem is somewhat exacerbated by fragmentation. We've
said it's important for people to determine the order of events, but you also need to report it
back to them quickly and reliably quickly, deterministically
quickly, because that translates directly into better prices.
If I told you that you could submit an order and it would be live for the next six or eight
hours, you're going to enter probably a much more conservative price.
And let's say I'm actually acting as an agent for you.
I'm routing your order to one of these other 14 exchanges. Well, I may want to check one and then go on to the next one. And the faster and more
reliable it is for me to check this one, the more frequently I'll do so. If I think there's a good
chance that the order will get held up there, well, that opportunity cost, I may miss other
places. So this is all kind of a rambling way of saying that
speed and determinism translate directly into better prices when you have markets competing
like this. People often don't appreciate some of the reasons that people care about performance
in exactly the way that you're kind of highlighting. Just to kind of give another
example in the same vein, like this fragmentation story, you might want to put out bids and offers at all of the different exchanges where someone might want to transact,
right? There's a bunch of different marketplaces. You want to show up on all of them. You might
think, oh, I'm willing to buy or sell a thousand shares of the security and I'm happy to do it
anywhere, but you might not be happy to do it everywhere. There's like a missing abstraction
in the market. So they want to be able to express something like, I would be willing to buy the security
at any one of these places, but they can't do it.
So they try and simulate that abstraction
by being efficient, by being fast.
So they'll put out their orders
on lots of different exchanges.
And then when they trade on one of them,
they'll say, okay, I'm no longer interested,
so they'll pull their orders from the others.
And they're now worried about other professionals
who are also very fast,
who try to route quickly and in parallel to all the different places and take all of the
liquidity that shows up all at once. There's this dynamic that the speed and determinism
of the markets now becomes something that essentially affects the trade-offs between
different professional participants in the market. Yeah, that's right.
Another thing I kind of want to talk about for a second is what are some of the trade-offs that you walk into when you start building systems
on multicast? I remember a bunch of years ago, you were in the guts of systems like Island and
Instinet and NASDAQ and Chi-X and all of that building this infrastructure before you came to
Jane Street. I was on the other side and at the time, Jane Street, I think, understood much less
about this part of the system.
And I remember the first time we heard a description from NASDAQ about how their system worked, and I basically didn't believe them, kind of for two reasons.
One reason is it seemed impossible. The way NASDAQ works is every single transaction on the entire exchange goes through a single
machine on a single core. And on that core is running a more or less ordinary Java program that
crosses every single transaction. And that single machine was the matching engine, the sequencer.
And I didn't really know how you could make it go fast enough for this to work. There was essentially a bunch of optimization techniques that I felt like at the time we just
didn't understand well enough. And also, it just seemed perverse. What was the point?
Why to go to all that trouble? Maybe you could do it, but why?
Well, a couple of things. First, I want to say on all the systems you mentioned,
I like to think I did some good engineering work, but I was certainly a part of many excellent teams
and worked with just a tremendous bunch of
people over the years. But yes, from a performance perspective, you said, well, the fewer processes
I have, the simpler the system is. And it gives you some superpowers there where you just don't
have to worry about splitting things up in various ways. There were certainly some benefits to adding complexity, but a lot of
that came about as hardware itself started to change. And that should provide probably the
baseline for optimization. I think you want to understand the hardware and the machines you're
using, the machines that are available, the hardware that's available to you deeply. And
you want to basically model out what the theoretical bounds
are. And then when you look at what you're doing in software and you look at the kind of performance
you're getting, if you can't really explain where that is relative to what's capable, you're leaving
some performance on the floor. And so we were trying very, very hard to understand what the
machine could theoretically do and
really utilize it to its fullest.
Part of what you're saying is that instead of thinking about having systems where you
fan out and distribute and break up the work into pieces, you stop and you think, if we
can just optimize to the point where we can handle the entire problem in a single core,
a bunch of things get simpler, right? We're just going to
keep everything going through this one stream. There's a lot of work that goes into making
things uniformly fast enough for this to make sense, but it simplifies the overall architecture
in a dramatic way. It definitely does. And it's been pretty powerful. I mean, not every exchange
operates exactly on these principles. There's certainly lots of unique variations that people have put out there, but I do think
that it is pretty ubiquitous.
And certainly the idea that exchanges want some kind of multicast functionality, I think
is universal at this stage.
I'm sure there maybe is an exception here or there, but amongst high performance exchanges
with professional participants like this, I think it's pretty universal.
And when you're talking about publication of market data, we can see that directly,
since we're actually subscribing to multicast in order to receive the data ourselves.
But their internal infrastructure often depends on multicast as well, right?
True. Although, you know, I'm not as familiar with like the crypto side of the world, but
since a lot of that is happening
over the open internet, UDP is probably not one of the options.
And so you have people using more web sockets and JSON APIs and things like that.
But it is kind of the exception that proves the rule, right?
Because of that focus on the open internet and everything, you've got a totally different
set of tools. It highlights the fact that the technical choices are in some sense conditional on
the background of the people building it. There's like two sides of the question we were just
talking about. There's the question of what's the advantage of doing all this performance
engineering? And the other question is, how do you do it? How do you go about it?
It has moved around over the years. Many years ago,
I remember that we had interrupt-driven I/O. Packets would come into the network card where
they would essentially wait for some period of time. And if it had waited there long enough or
enough data had accumulated, then the network card would request an interrupt for the CPU to
come back and service the network card. And so how
frequently should we allow interrupts? If we allow them essentially anytime a packet arrives,
that'll be way too much CPU overhead. And so there were trade-offs of throughput and latency.
But once you end up with the sheer number of cores that we do nowadays, we can essentially
do away with interrupts and just wait for the network card by polling and checking,
do you have data? Do you have data? Do you have data?
And the APIs have shifted a bit away from general purpose sockets.
The sockets APIs require lots of copies of the data.
There's like an ownership change there.
When you read, you give the API a buffer that you own.
The data is filled in from the network and then given back to you.
So this basically implies a copy on all of the data that comes in. And if you start to look at,
say, a 25 gigabit networking, that means you basically have to copy 25 gigabits a second to
do anything at a baseline. And the alternative is you try to reduce those copies as much as possible.
And you have the network card just delivering data into memory, the application polling, waiting for that data to change,
seeing the change, showing it to the application logic, and then telling the network card he's
free to overwrite that data. You're done with it. And when you get down to that level, you really
are getting very close to the raw performance of what the machine is capable of. So eliminating the copies and the unnecessary work in the system, that's certainly one.
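Here is a Python-flavored sketch of that polling, copy-avoiding loop. Every name in it is invented; real kernel-bypass frameworks expose the ring of receive slots quite differently, but the shape of the loop is the point.

```python
# Sketch of the polling / copy-avoidance idea: the NIC writes frames into
# pre-agreed memory slots, the application spins until a slot is marked
# ready, processes the frame in place, and hands the slot back. Every name
# here is invented.
class Slot:
    def __init__(self):
        self.ready = False    # set by the NIC when a frame has been written
        self.frame = b""      # the frame data, written in place by the NIC

def poll_loop(ring, handle_frame):
    index = 0
    while True:
        slot = ring[index % len(ring)]
        if not slot.ready:        # busy-poll: "do you have data? do you have data?"
            continue              # no interrupt, no sleep, just ask again
        handle_frame(slot.frame)  # read the frame right where the NIC wrote it
        slot.ready = False        # tell the card it's free to overwrite that slot
        index += 1
```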
Trying to make your service times for every packet and every event as reliable and deterministic
as possible so that you have very smooth sort of behavior when you queue, you don't end
up having to do that everywhere.
The critical path tends to be
pretty small when it's all said and done. I think one of the guys who had built the Island system
really kind of had the attitude that if any piece of the system is so complicated that you can't
rewrite it correctly and perfectly in a weekend, it's wrong. And so I think that, you know,
probably the average length of an application there was, you know, 2000 lines, something like that. And the whole exchange probably was maybe four or
five applications stitched together. Sad to say, I think we do not follow that rule in our
engineering. I think we could not rewrite all of our components in a weekend. I'm afraid.
The world has gotten more complicated, but it's, it's not a bad goal, you know,
to often ask people. And I think it's consistent with
reliability and performance to constantly ask yourself, yes, but can it be simpler?
We want it to be as simple as possible. No simpler, but as simple as possible. And it really
is a mark of you deeply understanding the problem when you can get it down to something that
seems trivial to the next person who looks at it. It's a little depressing
because you kill yourself to get to that point. And then the next person that sees it is like,
oh, well, that makes sense. That seems obvious. What did you spend all your time on? And you're
like, if only you knew what was on the cutting room floor. One thing that strikes me about this
conversation is that just talking with you about software is pretty different than lots of the
other software engineers that I talk to because you almost immediately in talking about these questions, go to the question
of what does the machine do and how do you map the way you're thinking about the program onto
the actual physical hardware? Can you just talk for a minute about like the kind of role of
mechanical sympathy, which is a term I know you like for this in the process of designing the
software where you really care about performance?
It's a term that comes from racing, about how drivers with mechanical sympathy who really had a deep understanding of the car itself
were better drivers in some way. And I think that that translates to performance in that
if you have some appreciation for the physical and mechanical aspects, just the next layer of
abstraction in how our computers are built, you can design solutions that are really much closer to the edge of what
they're capable of. And it helps you a lot, I think, in terms of thinking about performance
when you know where those bounds are. So I think what's important there is it gives you a yardstick.
Without that, without knowing what the machine is capable of, you can't quickly back of the
envelope say, does the system even hang together? Can this work at all? If you
don't know what the machine is capable of, you can't even answer that question. And then when
you look at where you're at, you say, well, how far am I from optimal, right? Without knowing
what the tools you have are capable of, I just don't know how you answer that question and,
you know, when you stop digging, so to speak. Or if you're observing that the market,
be it from a competitive perspective or just the demands of the customer are much higher than what
you think is possible, well, you've probably got the wrong architecture. You've probably got the
wrong hardware. It's kind of hard for me to not consider that. As a practical matter, as a software
engineer, how do you get a good feel
for that? I feel like lots of software engineers, in some sense, operate most of the time at an
incredible remove from the hardware they work on. There's the programming language they're in,
and that compiles down to whatever representation, and maybe it's a dynamic programming language,
and maybe it's a static one, and there's like several different layers of infrastructure and frameworks they're using.
And there's the operating system.
And they don't even know what hardware they're running on.
They're running on virtualized hardware
and lots of different environments.
For lots of software engineers,
a kind of concrete and detailed understanding
of what the hardware can do feels kind of unachievable.
How do you go about building intuition,
trying to understand better
what the actual machine is capable of?
Well, I think, so you're separating programmers into people who get a lot of things done
and people like myself. I think, is that fair?
Seems fair, yes.
No, it's a good question. I think part of it is interest. And I think you really need
to construct a lot of experiments. And you have to have a decent amount of curiosity. And you
have to be blessed with either a problem that demands it or the freedom to be curious and to
dig. Because you are going to waste some time constructing experiments.
And your judgment initially is probably not going to be great. The machines nowadays are
getting more and more complicated. They're trying to guess and anticipate a lot of what your
programs do. So very simple sorts of benchmarks, simple sorts of experiments don't actually give you the insight you think
you're getting from them. And so I do think it is a hard thing to develop, but certainly a good
understanding of computer architecture or grounding in computer architecture helps. And then there are
now a decent number of tools that give you this visibility. But you do have to develop an intuition for
what are the key experiments? What are the kinds of things that are measurable? Do they correlate
with what I'm trying to discover, et cetera, et cetera. And I think it requires a lot of work
of staying current with the technology and following the industry solutions as well as what's happening in the
industry generally of computing technology. You got to kind of love it, right? Got to spend enough
time to develop the right kind of intuition and judgment to pick your spots when you do your
experiments. I think in lots of cases, people approach problems with a kind of, in some sense, fuzzy notion
of scalability.
There are some problems where if you're like, no, actually, I can write this one piece,
it admits simpler solutions some of the time than it does if you try and make it scalable
in a general way.
You can make a thing that is scalable, but the question of being scalable isn't the same
as being efficient.
So when you think about scalability
and think about performance, it's useful to think about it in concrete numerical terms,
and in terms that are at least dimly aware of what the machine is capable of.
I think it's actually easy to get programmers to focus on this sort of thing. If you just stop
hardware people from innovating, they will have no choice, right? So many programming paradigms
and layers of complexity have been empowered by the good work of hardware folks who have
continued to provide us with increasing amounts of power. And if that stops, and it does seem
like in a couple of key areas that is slowing, I don't know about stopping, but certainly slowing,
then yeah, people will pay a lot more attention to efficiency.
So this is maybe a good transition to talking about some of the work that you do now, right?
You, these days, spend a bunch of your time thinking about a lot of the kind of lowest
level work that we do.
And some of that has to do with building abstractions over network operations that give us the ability
to more simply and more efficiently do the kind of things that we want to do. And part of it has to do with hardware. So I'm wondering
if you could just talk for a minute about the role that you think custom hardware plays in
trading infrastructures and some of the work that we've done around that.
Jane Street has always had a large and diversified business. And for lots of our business, it's just not super relevant.
But in the areas where message rates and competitiveness are a little extreme,
it becomes a lot more efficient for us to take some of these programmable pieces of hardware and really specialize for our domain. And that can mean,
you know, like a network card is actually very good at filtering for multicast data. It can
compare these addresses bit by bit. But there's really nothing that stops us from going deeper
into the data and filtering based on content, looking for specific securities, things like that. And there aren't a
lot of general purpose solutions out there to do that at hardware speeds, but we can get programmable
network cards, custom pieces of hardware, where we can stitch together solutions ourselves.
And I think that's going to become increasingly relevant and maybe even necessary as we start to
move up in terms of data rates. I think earlier I mentioned
that we have, I didn't get the exact number, maybe there's 12 now going up to something like 15, 16,
17 different US equity exchanges. If each one of those can provide us data at something close to
10 gigabits per second, and the rule set requires that we consolidate and aggregate all that
information in one place, well, we have
something of a fundamental mismatch if we only have 10 gigabit network cards, right? So for us
to do that quickly and reliably in a relatively flat architecture, we're going to need some magic.
And the closest thing I think we have to magic is some of the custom hardware.
This feels to me like the evolution of the multicast story,
which is if you step back for a moment, you can think of the use of multicast in these systems
as a way of using specialized hardware to solve problems that are associated with trading. But
in this case, it's specialized networking hardware. So it's general purpose at the level
of networking, but it's not like a general purpose programming framework for doing
all sorts of things. It's, you know, specialized to copying packets efficiently. Is there anything
else at the level of switching and networking worth talking about? Yeah, I think that it's
funny. I don't know if I've ever come across these like layer one cross point devices outside of
our industry. I think certainly some use them maybe in the cybersecurity field, but within our
industry, there's been a couple of pioneering folks that have built devices that allow us to,
with no switching or intermediate analysis of the packet, just merely replicate things
electrically everywhere according to a fixed set of instructions. And it turns out that that
actually covers a tremendous number of our use cases when we're distributing things like market
data. So the more traditional, very general switch will take in the packet, look at it,
think about it, look up in some memory where it should go, and then route it to the next spot.
That got sped up with slightly more specialized switches based on concepts from
InfiniBand that would do what was known as cut through. They would look at the early part of
the packet, begin to make a routing decision while the rest of the bytes were coming in,
start setting up that flow, send that data out, and then forward the rest of the bytes as they
arrive. Those were maybe an order of magnitude or even two faster than the first generation.
Well, these devices that actually do no work whatsoever, but just mechanically,
electrically replicate this data, they're another order of magnitude or two faster than that.
So maybe a store and forward switch, the first kind I was describing, I don't know,
maybe that was seven to 10 microseconds. A cut-through switch, looking at part of the packet and moving it forward, maybe that's
300 to 500 nanoseconds.
And now these switches, these layer one cross points, maybe they're more like 3 to 5 nanoseconds
themselves.
And so now we can take the same packet and make it available in maybe hundreds of machines
with two layers of switching like that.
And that's, we're talking about, you know, a low single to double digit number of nanoseconds in
terms of overhead from the network itself. I think it's an interesting point in general that
having incredibly fast networking changes your feelings about what kind of things need to be
put on a single box and what kind of things can be distributed across boxes, right? Computer scientists like to solve problems by adding layers of indirection.
The increasing availability of very cheap layers of indirection suddenly means that you can do
certain kinds of distribution of your computation that otherwise wouldn't be easy and natural to
do. What do the latencies look like inside of a single computer versus between computers these days?
It's starting to vary quite a bit, especially with folks like AMD having slightly different
structure than Intel. But it's true that moving between cores is starting to get fairly close to
what we can do with individual network cards. I mean, to throw out some numbers that somebody
will then probably correct me on, I think maybe that's something on the order of 100 nanoseconds. It's not that different when we're going across the PCI bus and going through a highly optimized network card. That might be something like, you know, 300 to 600 nanoseconds. And this is, you know, one way to get the data in. But it is not unreasonable for the sorts of servers that we work with to get frames all the
way up to the user space into the application to do very little work on, but then turn around
and get that out in something less than a microsecond. Moving, you know, context switching,
things like that in the OS can start to be on the order of a microsecond or two.
Yeah. And I think the thing that's shocking and counterintuitive about that is the quoted number for going through an L1 crosspoint switch versus going over the PCI
Express bus. We're talking 300 nanos to go across the PCI Express bus and two orders of magnitude
faster to go in and out of a crosspoint switch. Well, you got to add the wires in though. The
wires start to... Yeah, yeah. The physical wiring starts to matter. The wiring absolutely starts to matter. And by the way, in some ways to go back to like the kind
of mechanical sympathy point, when you think about the machines, we're not just talking about the
computers. We're also talking about the networking fabric and things like that. I think an aspect
of the performance of things that people often don't think about is serialization delay. Can
you explain like what serialization delay is and how it plays into that story? We've been talking about networking at
specific speeds. I can send one gigabit per second. I can send 10 gigabits per second,
25 gigabits per second, 40 gigabits per second, et cetera. I can't take data in at 10 gigabit and
send it out at 25 gigabit. I have to have the data continuously available. I have to buffer enough and wait for
enough to come in before I start sending because I can't underflow. I can't run out of data to
deliver. Similarly, if I'm taking data in at 10 gig and trying to send it out at one gig,
I can't really do this bit for bit. I've kind of got to queue some up and I've got to wait.
The lowest latency is happening at the same speeds where you can do that.
And certainly the L1 cross points
are operating at such a low level,
as far as I understand,
that certainly no speed conversions
are happening at the latencies that I described.
And just to kind of clarify the terminology,
by serialization delay,
just like you were making this point that,
oh yeah, when you're in at 10 gig and out at 25,
it's like, well, you can't pause or anything, right? You have to have all of the data available at the
high rate, which means you have to queue it up. When you send out a packet, it kind of has to be
emitted in real time from beginning to end at a particular fixed rate. And that means there's a
translation between how big the packet is and temporally how long it takes to get emitted onto the wire.
There's a kind of electrically determined space-to-time conversion that's there.
And so it means if you have a store-and-forward switch, and you have, say, a full, what's called an MTU, which is like the maximal transmission unit of an Ethernet switch, which is typically, you know, 1500 bytes-ish, that just takes a fixed amount of time.
Like on a 10 gig network, what does that translation look like?
I think it roughly works out to something like a nanosecond per byte.
And I think this comes back to the thing we were talking about in the beginning and a
little bit of appreciation for multicast.
So imagine I have 600 customers and I have one network card and I would like to write
a message to all 600.
Well, let's say the message is a thousand bytes. Okay. So that's about a microsecond per.
So the last person in line is going to be, you know, 600 microseconds at a minimum behind
the first person in line. Whereas with multicast, if I can send one copy of that and have the switch replicate that in parallel, one of these layer one cross points, I'm getting that to everybody in something close to a microsecond.
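The arithmetic behind those numbers, assuming a 10 gigabit link: a byte takes roughly 0.8 nanoseconds to put on the wire (hence the "about a nanosecond per byte" rule of thumb), a 1,000-byte message takes close to a microsecond, and 600 serial unicast copies keep the last recipient waiting several hundred microseconds.

```python
# Worked numbers for the serialization delay and fan-out points above,
# assuming a 10 gigabit per second link.
LINK_BITS_PER_SEC = 10e9

def wire_time_ns(num_bytes, bits_per_sec=LINK_BITS_PER_SEC):
    """Time to serialize num_bytes onto the wire, in nanoseconds."""
    return num_bytes * 8 / bits_per_sec * 1e9

print(wire_time_ns(1))     # 0.8 ns -- "about a nanosecond per byte"
print(wire_time_ns(1500))  # ~1200 ns for a full Ethernet MTU
print(wire_time_ns(1000))  # ~800 ns, close to a microsecond per message

# Unicast fan-out: 600 serial copies from one card vs. one multicast copy.
last_in_line_us = 600 * wire_time_ns(1000) / 1000
print(last_in_line_us)     # ~480 microseconds before the last copy even leaves
```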
And that affects latency, but it also affects throughput. If it takes you a half a millisecond of wire time to just get the packets out the door, well, you could do at most 2,000 messages per
second over that network card. And that's that, right? Again, this goes back to there are real
physical limits imposed by the hardware that it can be as clever as you want, but there's just
a limit to how much stuff you can emit over that one wire. And that's a hard constraint that's worth understanding.
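To make that arithmetic concrete, here is a small sketch of the unicast-versus-multicast fan-out numbers from the example above (600 customers, 1,000-byte messages, one 10 gig network card, using the same rough nanosecond-per-byte figure; the in-conversation numbers are rounded a bit differently, and real numbers would include headers and switch latency):

```c
#include <stdio.h>

int main(void) {
    const double msg_bytes = 1000.0;
    const double link_gbps = 10.0;
    const int    customers = 600;

    /* Wire time per message copy: ~0.8 us at 10 Gb/s (rounded to "about a microsecond" above). */
    double per_msg_us = msg_bytes * 8.0 / link_gbps / 1000.0;

    /* Unicast: copies go out one after another, so the last customer
     * waits behind everyone else's copy. */
    double last_in_line_us = per_msg_us * customers;

    /* The same wire time caps throughput: one NIC can only emit so many
     * full fan-outs per second. */
    double fanouts_per_sec = 1e6 / last_in_line_us;

    printf("wire time per copy        : %6.2f us\n", per_msg_us);
    printf("last unicast customer lag : %6.0f us\n", last_in_line_us);
    printf("max full fan-outs per sec : %6.0f\n", fanouts_per_sec);

    /* Multicast: send one copy and let the switch replicate it in parallel,
     * so everyone sees it in something close to one message's wire time. */
    return 0;
}
```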
Multicast is a story of the technology that
could: incredibly successful in this niche. There are other bits of networking technology
that have a more complicated story. And I'm in particular thinking about
things like InfiniBand and RoCE. What is RDMA? What is InfiniBand?
InfiniBand is a networking technology that was very ahead of its time.
I think it's still used in supercomputing areas. And a lot of high-performance Ethernet has
begged, borrowed, and stolen ideas from InfiniBand. InfiniBand provided things like
reliable delivery at the hardware layer. They had APIs that allowed for zero copy IO,
and they had the concept of remote direct memory access.
So direct memory access is something that like peripherals,
devices on your computer can use to sort of move memory around
without involving the CPU.
The CPU doesn't have to stop what it's doing
and copy a little bit over here, from here to
there, from here to there. The device itself can say, okay, that memory over there, I just want
you to put this data right there. And remote DMA extends that concept and says, I'd like to take
this data and I'd like to put it on your machine over there in memory without your CPU being
involved. And this is obviously powerful,
but requires different APIs to interact with.
A number of the places I've been at used InfiniBand,
some very much in production,
some a little more experimentally.
And there are some bumps in the road there.
You know, InfiniBand had some of this problem
where by default,
it essentially had some flow control in hardware,
meaning that it was concerned about network bandwidth and could slow down the sender.
So we'd have servers that didn't seem to be doing anything, but their network cards were
sort of oversubscribed. They had more multicast groups than they could realistically sort of
filter. And so they were pushing back on the sender. And so when we scaled it up to big infrastructures, we'd have market data slow down
and it was very difficult to figure out why and to track down who was slowing that down.
So the Ethernet model of like best effort and sort of fail fast and throw things away quickly
is in some cases a little bit easier to get your
head around and to debug. You mentioned that when we talk about multicast, one of the key issues
with multicast is it's not reliable. We don't worry about dealing with people who can't keep
up, right? People who can't keep up, fall behind and have like a separate path to recover and
that's that. And you just mentioned that InfiniBand had a notion of reliability and reliability is a
two-edged sword, right? The way you make things reliable is in part by constraining what can be done.
And so the pushback on senders of data is kind of part and parcel of these reliability guarantees,
I'm assuming. Is that the right way of thinking about it?
Yeah, I think that's a good way to think about it. But certainly the visibility and the
debuggability could have been improved as well. And you mentioned RoCE. I never worked with it personally, but it was a way to
sort of extend Ethernet to support the RDMA concept from InfiniBand. But I believe it still involved some proprietary technology. So it was a little bit of like the embrace and extend
approach applied to Ethernet. So when
you look at the kinds of custom hardware that was being developed, I think there were sort of more
interesting things happening in the commodity world than RoCE. We've spent a lot of time
talking about the value of customizing and doing just exactly the right thing and understanding
the hardware. I guess the Ethernet versus InfiniBand story is in some sense about the
value of not customizing, of using the commodity thing.
There is a strong lesson there.
I mean, I had a couple of instances over my career where I was very surprised at the power
of commodity technologies.
I was at a place that did telecommunications equipment, and they were doing special purpose
devices for processing phone calls, phone number recognition, what number did you press,
sorts of menus.
And these had very special cards with digital signal processors and algorithms to do all
of this detection, some basic voice recognition.
And this is in the 90s.
And these were complex devices.
And it turned out that somebody in the research office in California built a pure software
version of the API that could use like a $14 card that was like sufficient to be able to generate
ring voltages and could emulate like 80% of the product line in software. And when I saw that,
I was like,
I'm not really sure I want to work on custom hardware. I don't know that I want to sort of swim upstream against the relentless advance of x86 hardware and commodity vendors, like
just the price performance. It's, you know, you've got a million people helping you,
whereas in the other direction, you've got basically yourself. And it took a lot to get me convinced to consider some alternative things.
But I do think that trends around the way processors and memory latency are improving
certainly make it clear that, I mean, just, you know, looking at things like deep learning and
GPUs, like it's pretty clear that we're starting to see some gains from specializing again, even though
I'd say the first 10 or 15 years of my career, it was pretty clear that commodity hardware
was relentless.
And it's worth saying, I think in some sense, the question of what is commodity hardware
shifts over time.
Like I think a standard joke in the networking world is
always bet on Ethernet. You have no idea what it is, but the thing that's called Ethernet
is the thing that's going to win. And I think that has played out over multiple generations
of networking hardware, whereas you see it stealing lots of ideas from other places,
InfiniBand and whatever, but there is the chosen commodity thing and learning how to use that and how to identify what that thing is going to be is valuable. The work that we're doing now in
custom hardware is also still sensitive to the fact that FPGAs themselves are a new kind of
commodity hardware. It's not the case that we actually have to go out and get a big collection of chips fabricated on one of those awesome reflective disks. We get to use a certain kind of commodity hardware where lots of big manufacturers are actually getting better and better at producing bigger, more powerful, and easier to use versions of these systems.
Is there anything else that you see
coming down the line in the world of networking
that you think is gonna be increasingly relevant
and important, say, over the next two to five years? I think what we're going to see is
a little bit more of the things we've been doing already, standardized and more common.
So this sort of like user space polling and that form of IO, I think you're seeing some of that
start to hit Linux with io_uring. So these
are very, very, very similar models to what we've been already doing with a lot of our own cards,
but now they're going to become a bit more standardized. You're going to see more IO
devices meet that design, and then you're going to see more efficient zero copy polling sorts of
things come down the line.
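For readers who haven't seen io_uring, here is a minimal sketch of its submission-queue/completion-queue model, using liburing and a plain file read for simplicity (network sockets, registered buffers, and polled modes follow the same submit-then-reap pattern; this assumes liburing is installed and is an illustration, not anything discussed on the episode):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>          /* link with -luring */

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {   /* 8-entry submission queue */
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);     /* any readable file will do */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[256];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  /* grab a submission entry */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);    /* describe the read */
    io_uring_submit(&ring);                              /* hand it to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                      /* reap the completion */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```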
You know, some of the newer networking technologies like 25 gig, I do think is going to have a decent amount of applicability. It is waiting for things like an
L1 crosspoint. And it is not always a clear net win. Some of the latency has gone up as we've
gone to these higher signaling rates. You can be overcome with large quantities of data. The gain
in serialization delay will overcome some of the
baseline latency if the data gets big enough. But it's complicated.
Can you say why it is? Why do switches that run at faster rates sometimes have higher latency
than switches that are running at lower rates?
I believe that that's... I mean, part of it is decisions by the vendor where they're sort of finding the right market for
the mix of features and the sensitivity to latency.
I do think that we are at the mercy, so to speak, of some of the major buyers of hardware,
which is probably cloud providers.
That's just an enormous market.
And so I do think that the requirements hew a little closer to that than
they do for our specific industry. So we've got to contend with that. As the signaling rates go up,
and again, I'm no expert here, but I think that you start to have to rely more on error correction
and forward error correction is built into 25 gig and eats up a decent amount of latency if you have
runs of any length. So that's also a thing that
we have to contend with and an added complexity. So I think it's going to be important. I think
it's going to be something that does come to our industry and maybe quickly. I think at this point,
there's a decent amount of 25 gig outside of the finance industry, but not quite as much
in the trading space. And once you start to see a little bit of it, you'll see a lot very quickly.
All right. Well, thanks a lot. This has been super fun. I've really enjoyed kind of walking
through some of the history and some of the kind of low-level details of how this all works.
I think we do this basically all the time, you and I. It's just kind of like now we're doing it
for somebody else.
You can find links to more information about some of the topics we discussed,
as well as a full transcript of the episode on signalsandthreads.com.
And while you're at it, please rate us and review us on Apple Podcasts.
Thanks for joining us and see you next week.