Signals and Threads - State Machine Replication, and Why You Should Care with Doug Patti
Episode Date: April 20, 2022

Doug Patti is a developer in Jane Street's Client-Facing Tech team, where he works on a system called Concord that undergirds Jane Street's client offerings. In this episode, Doug and Ron discuss how Concord, which has state-machine replication as its core abstraction, helps Jane Street achieve the reliability, scalability, and speed that the client business demands. They'll also discuss Doug's involvement in building a successor system called Aria, which is designed to deliver those same benefits to a much wider audience. You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:
- Jane Street's client-facing trading platforms
- A Signals and Threads episode on market data and multicast, which discusses some of the history of state-machine replication in the markets
- The FIX protocol
- UDP multicast
- Reliable multicast
- Kafka
Transcript
Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street.
I'm Ron Minsky.
It's my pleasure to introduce Doug Patti.
Doug has worked here at Jane Street for about five years.
And in that time, he's worked on a bunch of interesting systems, in particular distributed systems.
That's mostly what I want to talk about today: the design and evolution of those systems. But Doug, just to get us started,
why don't you tell us a little bit about where you came from and where you worked before you
joined us at Jane Street? Sure, and thanks for having me, Ron. Before Jane Street, I worked at
a few smaller web startups, and my primary day-to-day was writing Node.js and Ruby and
building little web frameworks and APIs.
So it was quite a departure from the things that I'm working on today.
So I know a bunch about what you work on, but what's different about the character of the
work that you do here versus that? Well, I think that it's in a way we kind of just use
completely different technologies. For example, we very rarely come across databases in our day-to-day
here at Jane Street. There are some teams that work with databases, of course, but
I personally have almost never interacted with them. And that was a core part of being a web
developer: there has to be a database. I think another big divergence in my world is that a lot
of the Jane Street systems are based around a trading day. And so things kind of start up in the morning and some stuff happens and then it shuts down at the end of the day.
And you have time to go do maintenance because there's nothing going on.
Whereas with web services, you're constantly thinking about how to do continuous rollouts over the course of the day so that there's no downtime.
Yep. Although that's a thing that used to be more true here than it is now.
And we are more and more building services
that run 24 hours a day and in fact, seven days a week.
So the trading day over time has become more continuous.
But for sure, there's still a lot of that going on
in many of the systems we build.
Okay, so let's talk a little bit about the system.
So the first thing you came and worked on
when you joined here was a system called Concord.
Can you tell us a little bit more
about what Concord was for, what business problem you were working on and the team that worked with Concord
was doing? Yeah, definitely. So Concord predates me. When I joined, I think Concord had already
been around for at least four years or so. At the time, we were entering into a new business
where instead of doing the normal Jane Street thing, which is like we go to an
exchange and we trade there, we're getting into relationships, like some bilateral relationships
with different clients, different firms, where we trade with them directly. And so this brings
on a whole new swath of responsibilities, and we built Concord to manage those things.
Right. And I guess maybe it's worth saying from a business level,
the thing we're doing in those businesses is not so different from our on-exchange activity.
In both cases, we're primarily liquidity providers.
Other people want to trade, and we try and be the best price for that trade.
But in these bilateral relationships, we have an arrangement with some other kind of financial institution
that wants to reach out to us and see what liquidity we have available.
That's one of the kind of relationships we might have. And suddenly we're a service provider. That's the difference
is we are not just providing the trading, but also the platform on which the interaction happens.
That's totally right. Being a service provider is where all this responsibility comes from.
We have this obligation to be reliable, to be available. We need to have good uptime.
We need to be fast in turnaround.
When we get a message from a client, we need to be able to respond really quickly. And we also
need to maintain some sort of an audit trail, more so than in our traditional trading systems
for regulatory reasons. So I understand this point about being a service provider and how
that kind of changes your obligations. But can you maybe sketch out almost at the network level,
what are the services that we are providing to customers that connect to us? Yeah, maybe I can walk through
what it looks like from end to end, perhaps. Is that what you're asking? Yeah, totally. Yeah,
okay. So you can start out with we are running some server and a client is going to connect into
us, into one of our ports. A client's going to send us an
order. And the order flows through our system and tries to match against one of the orders that we
have from one of our desks, our trading desks. And if there is a match, then we produce an execution
and we send a copy of that execution back to the client. We send a copy to the desk and we send a
copy to a regulatory reporting
agency. And that's a very simplistic version of what we do.
And in some sense, that's not so different from the interaction you have on an exchange, right?
When we want to provide liquidity on exchange, we post orders on exchange that express what we're
willing to trade. And then people who want to trade come in and send orders, and then maybe
they cross one of the orders that we have, and an execution goes up.
But the difference is we're playing two roles here.
We're both running the exchange-like infrastructure, and we're also the same organization that's actually putting in the orders that are providing liquidity to clients who might want to come in and connect.
Yep, that's exactly right. And in fact, I think the original design for Concord was taken from inspiration from what other exchanges use in the real world today.
Got it. So you said before that this puts pressure in terms of reliability. Can you say a little bit more about how that plays out?
Yeah. If we go back to that example of a client sends an order, we have to be ready for the possibility of some sort of a network loss. So the client might get disconnected. If the client reconnects,
they really need to be able to see like, hey, did the order actually get sent? And if so,
did it get filled? And these are important pieces of information that we need to hold on to.
So it's not just what happens if the network goes down, but what happens if one of our entire
machines crashes and we suddenly lost the piece of state, the program that was running that was accepting those connections?
We need to make sure that when we come back online, we have the right answers for the client.
Right. So the scenario is a client connects, they send an order, something terrible happens,
and they're disconnected. And then when they do connect again, they would really like to know
whether they traded or not. And so there's that kind of re-synchronization process that you have
to get right. That's right. That's really important. So how does that work? There are
multiple levels at which you can think about this, but let's talk about just the client and the
server talking to each other. We use a protocol called FIX. It's a pretty standard exchange
protocol in the financial world.
And in FIX, there is this notion of a sequence number that is attached to every single message.
You start at one, you expect it to go up by one, and you never expect it to go down or skip a sequence or something like that.
I should say this is not unlike TCP, where in TCP you have sequence numbers based on
the bytes and making sure that
packets arrive in the same order. But the difference is that this persists over the course of the day.
So that means if you lose connection, and you reconnect, what you would end up doing is you'd
log on and say, hey, I want to start from message 45. And we might say, oh, wait a minute, we don't
know about 45. Can you send us 45?
And all of a sudden there's this protocol for exchanging missed messages in that sense
based on sequence number.
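To make that sequence-number bookkeeping concrete, here is a minimal OCaml sketch of the idea; it is illustrative only, not Jane Street's code and not a real FIX engine. Each side tracks the next sequence number it expects, and a gap, whether noticed at logon or mid-session, turns into a resend request.

    (* Illustrative sketch of per-session sequence tracking. *)
    type session =
      { mutable next_expected_incoming : int  (* next seq we expect to receive *)
      ; mutable next_outgoing : int           (* seq we'll stamp on our next send *)
      }

    type action =
      | Deliver                       (* in order: hand the message to the app *)
      | Request_resend of int * int   (* we missed [lo, hi); ask for them again *)
      | Ignore_duplicate              (* already seen; drop it *)

    let on_receive session ~seq =
      if seq = session.next_expected_incoming then begin
        session.next_expected_incoming <- seq + 1;
        Deliver
      end
      else if seq > session.next_expected_incoming then
        Request_resend (session.next_expected_incoming, seq)
      else Ignore_duplicate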
Right, so TCP gives you an illusion of reliability on top of an unreliable network.
Your base network sends unreliable datagrams.
You might lose any packet.
TCP does a bunch of sequence number stuff to synthesize a kind of reliable connection on top of that unreliable one. But it's limited in that it's just for the scope of
the connection. You lose this connection for any reason, and certainly you lose one host or the
other, and you're dead, right? There are no guarantees provided by TCP that let you be
certain about what information did and didn't get through. And so we're providing an extra layer of
reliability on top of that where we can recover.
And we can even recover when the server that's on the receiving point itself crashes, right? So it may have lost any of its local state. And still we have to be able to recover because we need to
be able to make sure that clients know what they did, even if one of our boxes crashes. It's kind of
not an acceptable customer service thing for you to be like, oh, we don't really know if you traded.
Like maybe that's an awkward conversation to have with a client.
Cool.
Are there any other ways in which the constraints of the business that we're doing affect the kind of technology that we have to build?
One of the things that I said was that we need to be very available and we need to have uptime.
And what that means is in the case that a box crashes, like, yes, it's a
very rare event, and hopefully we can bring it back up. But we also need to be able to move
quickly. And sometimes that means moving all of our things off of that box and onto a new box,
not exactly instantly, but on the order of a few minutes instead of a few hours, potentially to
debug an issue with a box. And so we have the flexibility of kind of moving processes around in this distributed
system pretty freely.
Got it.
All right.
So there are requirements around reliability and durability of transactions.
Like when you've told a client about something that happened, you'd better remember it and
you better be able to resynchronize them and know all the stuff about their state that
you can recover even when you lose components in your system. What about scalability? Scalability is a big thing because
we sometimes have days where trading is busy on the order of magnitudes larger than a normal day.
And so we need to be able to handle all of that traffic at the same time like we would on a normal
day. And so we have to design the system such that we can be efficient,
not only in the common case, but even under a lot of load.
Got it. Is all of the load you're dealing with just the messages that are coming in from
clients who want to trade? What else drives the traffic on the system?
No, the system is actually full of lots of other data as well. For example, for regulatory reasons, we need to
have all of the market data from all the public exchanges in our system at the same time as all
of the orders for all the clients. And that alone is a huge amount of data. And especially on a busy
day, that's going to go up significantly. So we end up with these very, very hot paths in our code
where you need to be able to process
certain messages way more quickly than others.
Got it.
You've told us a bunch about what's the problem that's being solved.
Can you give us a sketch of the system that was built to do it?
What's the basic architectural story of Concord?
Yeah.
We've already alluded to messages a little bit in the past.
What we've built is a replicated state machine is kind of the
fancy term for it. And when we say state machine, when I say state machine, I mean not state machine
in the finite-state-machine sense, the kind of thing you would learn in CS with boxes and arrows and,
you know, things changing from one state to another. We're talking more just think of a module that encapsulates some
state, and that module has the ability to process messages one at a time. And each time it processes
a message, it updates its state. It might perform an action, but that's it. That's the state machine.
It's a very contained, encapsulated piece of software.
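For readers who want the shape of that in code, here is a minimal sketch of the state-machine interface Doug is describing, with hypothetical names rather than Concord's actual API:

    (* Illustrative only: a module that encapsulates some state and processes
       one message at a time. *)
    module type State_machine = sig
      type t        (* the encapsulated state *)
      type message  (* the messages it consumes *)

      val create : unit -> t

      (* [process] must be synchronous and deterministic: no network calls,
         no disk reads, no wall-clock lookups inside the handler. *)
      val process : t -> message -> unit
    end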
In some sense, that sounds very generic. Like, I write programs,
they consume data, they update their state. How does the state machine approach differ from just
any way I might write some kind of program that processes messages off the network?
One of the things that we do is we process messages immediately. We don't do any sort of
like, hey, let me go out and send a network request, or let me block for a second while I
talk to something on the disk.
It's really important that a single message both just updates your state and it does it
deterministically. And this is really important because it means that if you have the same piece
of code and you run the same messages through it, you can get to the same state each time.
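That determinism is what makes recovery a simple replay. Below is an illustrative OCaml sketch using a made-up toy state machine, not anything from Concord; folding the day's log through it always lands in the same state.

    module Position = struct
      type t = { mutable shares : int }
      type message = Bought of int | Sold of int

      let create () = { shares = 0 }

      let process t = function
        | Bought n -> t.shares <- t.shares + n
        | Sold n -> t.shares <- t.shares - n
    end

    (* Replaying the same log in the same order always yields the same state. *)
    let recover (log : Position.message list) : Position.t =
      let t = Position.create () in
      List.iter (Position.process t) log;
      t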
Okay, so why does that ability to rerun the same state machine
in multiple places and get the same result, why is that so important to the architecture?
Going back to that example, when a client sends us an order, we need to hold on to it and make sure that even if we crash, we're able to recover it later and tell them about the order
and some potential execution that happened. And so we can do this by leveraging these state machines
because when you crash and you come back online, one of the things that Concord does is it forces
you to read all of the messages that happened in the system from the beginning of the day in the
same order. And what that means is that you will
end up as an exact copy of what you were when you crashed, and you will have all the information that
you need to know. Got it. So there are really two things going on here. One is the individual
applications are structured as these synchronous deterministic state machines, so that they have
a well-defined behavior upon receiving the sequence of transactions.
And somewhere there has to be some mechanism for delivering a reliable and ordered sequence
of transactions to everyone who starts up, right? That's like the other leg upon which
this whole thing stands. Yep, that's right. We call that sequence of transactions the transaction
log. And there are kind of two questions around the transaction log. One is,
like you mentioned, how do you disseminate it? How do you make sure that in this distributed
system with hundreds of processes, everything is seeing the same messages in the same order?
The other question is, how do you append to the transaction log? Because obviously,
you have to be able to add new messages. Got it. So how does that work in Concord?
Yeah. So for the first part, when it
comes to dissemination, what we do is we use something called UDP multicast. And if you're
not familiar, multicast is a kind of protocol where you can take a single UDP packet, send it
to a special network address, and the networking devices, like switches, will then take it and make copies of it and send it to all the other switches or devices that are subscribed to that special address.
It's lossy because it's UDP, but it's a way of taking one packet and suddenly turning it into 50.
And you don't have to think about it at all from your application level.
The actual work of doing the copying is done by the switches. And importantly, the switches have specialized hardware for doing this copying so that they can do it in parallel basically at the same time
down through all of the ports that they need to send that packet. They can just do it in this
very, very efficient way and at very high speed. And weirdly, nobody else uses this.
IP multicast was like a big deal historically. When I was a graduate student, so long ago
I'm embarrassed to admit it, it was supposed to be the thing that was going to cause video transmission over the internet to
happen. But it turned out nobody uses it in the outside world; it's extremely common in the finance world, though.
finance world. It is. One of the drawbacks of it is that it's much harder to maintain over larger
and larger networks. When you're dealing with like a nice enclosed, like, hey, Concord is just
a single cabinet of hardware.
We actually get a lot of the benefits of it without any of the drawbacks.
One of the drawbacks, though, that we do have to think about is this is UDP.
And it means it's unreliable.
Packets can get dropped.
And so we need to build some layer of additional functionality to help prevent that. UDP multicast is efficient, but it's unreliable,
meaning you can lose data and messages can get reordered.
And we need to claw that back.
So how do you claw back those reliability properties?
Right.
So this kind of goes back to what we were talking about earlier
when we talked about the fix protocol and sequence numbers.
We actually just use sequence numbers inside of our own messaging as well. So
each packet has a sequence number attached to it. And that means that if you were to say receive
packet 45, and then 47, and then 48, you would very quickly realize that you may have missed 46.
Maybe it's still in flight, maybe it hasn't gotten there yet. Or maybe it will never get there. And so
what you can do
then is you can go out to a separate service and contact it and say, hey, I'm missing packet 46.
Can you fill me in? And this is a thing that we do all the time. And it actually works
quite seamlessly, even when drops are very common, which might be the case due to a slow process.
Right. So an important thing about these kind of unreliable protocols is you might lose data because something is wrong with the network,
but as any good network engineer will tell you, it's almost never the network. Just like, you know,
you're starting to work in a new programming language, you think, oh, maybe it's a compiler
bug. It's like, no, it's not a compiler bug, it's your bug. And if you're losing packets,
it's probably not because the network can't keep up, it's probably because you, the damn program you wrote, can't keep up,
and you're dropping packets for that reason.
But again, you can always recover them the same way,
which is there's some retransmission service.
So really, you've described two things here, right?
There are sequence numbers, which let you know that you're ordering,
and there's a retransmission server,
which lets you recover data that might have been lost.
But these both sound kind of magical.
Where do the sequence numbers come from,
and how do you get this magical retransmission service that always knows all the things that your poor application missed?
Right. And before we talk about that, I think it might be more interesting to start from how we append to the transaction log in the first place. Every process in the system has the ability to send these packets out into the multicast
void where they say, I want to add this message to the transaction log.
And there is a single app that we call the sequencer.
And that sequencer is responsible for taking packets from that multicast group and re-emitting them on a different multicast
group. When it does that, it also stamps the packets with a timestamp and it stamps them with
their sequence number and it sends them all out to all of the things listening. And so you just
have this unordered mess of messages coming in and a single canonical ordering of messages coming out
of the sequencer.
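The sequencer's job can be sketched in a few lines of OCaml; recv and publish here are hypothetical stand-ins for the inbound and outbound multicast groups, not Concord's actual interfaces.

    (* Illustrative sketch: stamp each message with a timestamp and the next
       sequence number, then re-emit it. Requires the unix library. *)
    type sequenced = { seq : int; time : float; payload : string }

    let run_sequencer ~(recv : unit -> string) ~(publish : sequenced -> unit) =
      let next_seq = ref 1 in
      while true do
        let payload = recv () in
        publish { seq = !next_seq; time = Unix.gettimeofday (); payload };
        incr next_seq
      done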
Got it. So in some sense, you get ordering in the stupidest possible way, which is you just
pick a single point in space and you shove all the messages through that one point,
and you have a counter there that increments every time it sees a message,
and that's how you get the sequence numbers.
Yep, that's how it is. And when it comes to retransmitters, they're just normal processes,
just like all the other processes. They listen to the sequencer. They write down everything that they hear.
Sometimes they might lose information. It's very possible. And so retransmitters will try to talk
to other retransmitters or in a worst case scenario, talk to the sequencer because there's
a chance that the sequencer tried to send a packet that never made it to anyone.
And so you eventually have to go back to the sequencer and say, hold on a second,
we missed 46. Everyone missed 46. Please help fill us in.
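A retransmitter in this picture can be sketched as a process that records every sequenced packet it hears and answers gap-fill requests; the sketch below is illustrative, with hypothetical names, not Concord's actual code.

    type retransmitter = { heard : (int, string) Hashtbl.t }

    let create () = { heard = Hashtbl.create 1024 }

    (* Called for every sequenced packet we manage to receive. *)
    let on_sequenced_packet t ~seq ~payload = Hashtbl.replace t.heard seq payload

    (* Called when some process says "I missed sequence number [seq]". Returning
       [None] is the cue to escalate: ask another retransmitter, or ultimately
       the sequencer itself, since it may be that nobody received the packet. *)
    let fill_request t ~seq = Hashtbl.find_opt t.heard seq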
Got it. And then what happens if you lose the sequencer?
Well, in the case of Concord, everything stops. Everything grinds to a halt. There are no new
messages. You can't really do things since everything that you
do is based around messages. And fortunately, we have something that is a backup sequencer running.
And what we do is we intentionally want to make that a manual process because we want to make
sure we can inspect the system and feel confident because we're talking about trading, we're talking
about money. We would much rather have a few seconds of downtime while we collect our thoughts and make sure we're in a
healthy state. What we can do is we can send a message through a different channel to this backup
sequencer, and it will be promoted into the new sequencer and take over responsibilities from
there. Right. And I feel like this is the point at which we could like fork the conversation
and have a whole separate conversation about how this compares to all sorts of fancy distributed systems algorithms like Paxos and Raft and blah, blah, blah for dealing with the fact that the sequencer might fail.
But let's not do that conversation because that's a long and complicated and totally different one.
What I am curious about is this design is very simple, right?
Like how are we going to drive the system?
We're going to have a single stream of transactions.
How are we going to do it? We're going to have one box that is just responsible for stamping all the
transactions going through and rebroadcasting them to everyone, and then we're going to build a bunch
of apps that do whatever they do by listening to the stream of transactions. It also seems kind of
perverse, in the sense that you said scalability was important, but it sounds like there's a single
choke point for the entire system: everything has to go through this one process that has to do the timestamping on every message.
So how does that actually work out from a scalability perspective?
Surprisingly well, actually.
We're limited by the speed of the network, which is like 10 gigabits per second.
We're limited by the program that is trying to pull packets off and put them on a
different interface. And we're also limited to some degree by disk writing, because we have to
be able to write things to disk in case we need to recall them later, because we can't store
everything in memory. So those are a few limiting factors. But in practice, this fits perfectly well
for Concord. We have seen, in total, a transaction log from 9:30 to 4:00, roughly a trading day, that was over 400 gigabytes, which is pretty large.
We also see a baseline message rate during the trading day of probably over 100,000 per second.
And some of those messages are smaller than others, but you still see like, you know, maybe even three to five times that on
a busier day. Right. And those messages actually pack a lot of punch, right? Part of the way that
you optimize the system like this is you try and have a fairly concise encoding of the data
that you actually stick on the network. So you can have pretty high transaction rates
with fairly small number of bytes per transaction. And in fact, you put more energy into optimizing
the things that come up the most. So market data inputs are pretty concise and, you know, you've put less energy
into messages that happen 10 times a day. You don't think very much at all about how much space
they take. Exactly. On the kind of exchange side where you see a very similar architecture,
the kind of peak transaction rates, like there they're trying to solve a larger problem.
And the peak transaction rates for that version of this kind of system is on the order of several million transactions a second. I think of this as a kind
of case where people look at a design like this and think, oh, there's one choke point, so it's
not scalable. But that's a kind of fuzzy-headed approach to thinking about scalability. Like
your problem, kind of whatever problem you have, has some finite size, you know, even if it's
web scale. And whatever design you come up with, it's going to scale a certain distance and not farther. And what I've always liked about this class of designs is
you think in advance, like, wait, no, how much data is there really? What are the transaction
rates we need to do? And then if you can scope your problem to fit within that, you have an
enormous amount of flexibility and you can get really good performance out of that, both in
terms of the throughput of the data you can get into kind of one coherent system
and latency too, right?
Do you know like roughly what the latency is
in and out of the sequencer
that you guys currently have in Concord?
Yeah, we measure somewhere between,
I want to say eight and 12 microseconds
as a round trip time from an app.
A single process wants to sequence a message
and the sequencer sends it back to the app.
Right, which is like within an order of magnitude of the theoretical limits of a software solution
here.
It takes like about 400 nanos to go one way over the PCI Express bus.
It's kind of hard to do better than like one to one and a half mics for something like
this.
And so the 8 to 12 is like within a factor of 10 of the best that you can do in theory.
Yep.
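As a rough editorial back-of-envelope, and not an official figure, the numbers mentioned above (a 400-gigabyte log over a roughly 6.5-hour trading day, a baseline of 100,000+ messages per second, and a 10-gigabit network) imply an average message size of a couple hundred bytes and a wire-level ceiling in the millions of messages per second:

    (* All inputs are approximate figures quoted in the conversation. *)
    let seconds_in_trading_day = 6.5 *. 3600.                      (* 9:30 to 4:00 *)
    let messages_per_day = 100_000. *. seconds_in_trading_day      (* ~2.3e9 *)
    let avg_bytes_per_message = 400e9 /. messages_per_day          (* ~170 bytes *)
    let wire_bytes_per_second = 10e9 /. 8.                         (* 10 gigabit/s *)
    let wire_limit_msgs_per_second =
      wire_bytes_per_second /. avg_bytes_per_message
    (* ~7 million messages per second, which lines up with the "several million
       transactions a second" peak rates mentioned for exchange-scale versions
       of this design. *)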
Cool. Okay, so that tells us a little bit about this kind of scaling question.
What else is good about this design? How does this play out in practice?
The thing that we've talked about the most so far is how Concord acts as a transport mechanism.
And I think there's probably more that we can say about that, but I think it's not just a transport mechanism.
It's also a framework. It's a framework for building applications
and thinking in a different style
when you're writing these applications
so that they work efficiently
with the giant stream of data that you have to consume.
I think one of the things that's really nice about this
is that our testability story
is really, really good in Concord.
And that's because we have the ability
to take these synchronous state machines
that we've talked about
and put messages in them one at a time
in an exact order that we can control.
And we can really simulate race conditions
to an amazing degree
and take all sorts of nearly impossible things
and make them reproduce perfectly.
The way that we actually have this
nice testability, one of the reasons why it's so nice, I mentioned it was like a framework for
building applications. And the framework itself forces you to write your application in an inverted
style. And what that means is your application doesn't have the ability to say, hey, I want to
send a message. I want to propose this message. Instead, your application has to
process messages and make decisions about what it might want to send when something asks it what it
wants to send. So you're no longer pushing messages out of your system. You are just waiting around
for something to say, hey, what's your next message? Here's my next message. And this seems
like a kind of weird design at first. But
when it comes down to the testability story, what it does is it really forces you to put more of
your state in your state. In the former example of I want to send a message, you might have some
code that looks like, okay, well, I send the message, then I wait for it, then I'm going to
send another message. But what you're actually doing is you're taking a very subtle piece of state there, the
fact that you want to send a second message, and you're hiding it in a way that's not observable
anymore. And so by inverting control, you have no choice but to put all of your state down into
some structure that you can then poke at. And from the testing perspective, you can say, all right, we've seen one message happen.
Let's investigate what every other app
wants to send at this point.
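The inverted style Doug describes can be sketched as an interface where the application is polled for its next outbound message instead of sending directly; the names here are made up for illustration, not Concord's framework.

    module type Inverted_app = sig
      type t
      type message

      (* Update state in response to a sequenced message. No sending here. *)
      val process : t -> message -> unit

      (* The framework asks: "what do you want to send next, if anything?"
         Because pending sends live in [t], tests can inspect them at any
         point between messages. *)
      val next_outbound : t -> message option
    end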
So when you talk about inversion of control here,
this goes back to a very old idea
where in the mid-70s,
if you were writing an application on a Unix system
that needed to handle multiple different connections,
there's a system call called select that you would call.
You'd call select and it would tell you
which file descriptors were ready and had data to handle.
And you'd say, oh, I'm going to like handle the data
as far as I can handle it without doing anything
that requires me to block and wait.
And I'll just keep on calling select over and over.
And so it's inverted in the sense that
rather than expressing your code as
I want to do one damn thing after another,
like writing down it in terms of threads of control,
you essentially write it in terms of event handlers
saying like, when the thing happens,
here's how I respond.
That's a really good way to put it.
There's some point again,
you know, back in the Pleistocene era
where I was a graduate student,
where there was this relatively new idea of threads
that was really cool,
where instead of doing all of this
like weird inversion of control,
we can live in a much better world
where we have lots of different threads,
which you can just express one damn thing after another in a straightforward way.
And then there's a minor problem that the threads had to communicate with each other
with mutexes and locks. A minor problem, yes. Right, it's a minor issue. And it turns out,
actually, that latter thing is kind of terrible. And like a long disastrous story unfolds from all
of that. But it sounds like what you're saying is the Concord design leans very heavily in the,
no, we're going to be event driven in the way that we think about writing concurrent programs. Yeah, that's totally
right. We don't take advantage of using multiple threads because we still have to process every
message one at a time, and we have to fully process a message before we can start moving
on to the next message. Right. And your point about how this forces you to pull your state
into your state, it's a funny thing to say because like, where else would your state be?
But the other place where your state can be is it can be in this kind of hidden encapsulated state
inside of the thread that you're running, right? If you're writing programs that have lots of
threads, then there's all this kind of hard to track state of your program that essentially has
to do with what's the program counter and the stack and stuff for each of the threads that
you're running in parallel. And we're saying, no, no, no, we're going to all make that explicit in states
that's always accessible.
So in some sense, any question you want to ask
about the state of a particular process,
you can always ask.
Whereas in threaded computations,
there are these intermediate points
between when the threads communicate with each other,
where these questions become really hard to ask
in the ordinary way, in a synchronous way. We reflect this in regular data structures.
So we can always ask any question that we want to ask about the state of the process
and what's currently going on.
Yeah, that's a good summary of what this whole story is about.
The testability story, I talked about race conditions and how they are easy to simulate.
And that is purely because you are able to control precisely what goes into the system
and when to pause to start looking to examine the system.
Okay. So I can see why this is nice from a kind of testing perspective.
How does it do in terms of composability,
in terms of making it easy to build complicated systems out of little ones, right?
That's one of the most important things about a programming framework
is how easy is it to build up bigger and more complicated things
out of smaller pieces that you can reason about individually? How does that story work for Concord?
So this is actually one of the nice things that I also really like about Concord is that if you
take a single process and look at it, it's not just a state machine. It's actually a state machine
that's composed of smaller state machines who themselves might be composed of smaller state
machines. What that means is that you can kind of draw your boundaries in lots of different places. You can take a single program that is
maybe performing too slowly and split it into two smaller ones. Or you can take a swath of programs
that all have similar kind of needs in terms of data and stitch them together. And it lets you
play with the performance profiles of them in a really nice way. So is scaling by breaking up things into smaller pieces,
how does the Concord architecture make that easier?
In some sense, I can always say in any program I want,
you can say, well, yeah, I could break up into smaller pieces
and have them run on different processes.
What about Concord simplifies that?
So the thing is, if you have two state machines
that are completely deterministic,
you know that they're going to see all the same messages
in the same order.
You actually don't need to have two processes that talk to each other. In fact,
we forbid that from happening except over the stream, over the transaction log. These processes,
if they want to know about each other's states, all they do is simply include the same state machines in both of the processes. All of a sudden, you know what the other is thinking
because you know you've both processed the same messages in a deterministic way. That's why this becomes more composable or
decomposable because you're just running the same code in different places. Rather than communicate
with a different process in the system, you just think the same way as that process. You look at
the same set of inputs, you come to the same conclusions, and you necessarily
consistently know the same things because you're running the same little processes underneath.
That's exactly right.
Yeah.
It's neat.
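One way to picture that composition: two different applications embed the same sub-state-machine and feed it the same sequenced stream, so they agree without ever talking to each other. The modules below are made-up toys, not real Concord components.

    module Book = struct
      type t = { mutable open_orders : int }
      type message = Order_added | Order_done

      let create () = { open_orders = 0 }
      let process t = function
        | Order_added -> t.open_orders <- t.open_orders + 1
        | Order_done -> t.open_orders <- t.open_orders - 1
    end

    module Matcher = struct
      type t = { book : Book.t (* ... plus matcher-specific state ... *) }
      let create () = { book = Book.create () }
      let process t msg = Book.process t.book msg
    end

    module Risk_checker = struct
      type t = { book : Book.t (* ... plus risk-specific state ... *) }
      let create () = { book = Book.create () }
      let process t msg = Book.process t.book msg
    end
    (* After both have consumed the same transaction log, their embedded [Book]s
       agree, by determinism, with no process-to-process communication. *)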
Okay.
So that's a lot of lovely things you've said about Concord.
What's not so great about the system?
So one of the things that we've talked about a few times now is performance and the performance
story and how things are pretty fast,
but they also have to be fast. And I think that this is actually a really tricky thing and
something that we have to keep tweaking and tuning over time. Because when it comes to performance,
every single process needs to see every single message, even though they might not do something
with every message, the vast majority of them need to. If you're really slow at handling a message, your program is going to fall behind and you're
not going to be reading the stream at the rate that you need to. And so we have to find ways to
improve the performance, which usually boils down to being smart about our choice of data structures,
the way that we write code. We actually try to write OCaml in a way that's pretty
allocation-free so that we don't have to run the garbage collector as much as we normally would.
Let's say you have a process that's trying to do too much. You might want to split that into
two processes. And all of a sudden, now they have more bandwidth in terms of CPU power.
So when we're talking about this basic Concord design, we talked about the fact that the
sequencer has to see every single message. But how do you get any kind of scaling and parallelism
if every single process has to process every single message? If we all have to do the same
work everywhere, then aren't we kind of in trouble? We certainly would be if we had to actually see
every single message. The way that this works is Concord has a giant message type
that is the protocol for each of these messages. And the message type is actually a tree of
different message types. And so what you can do as a program is you can very first thing look at
the message type and say, oh, this is for a completely different subsystem of Concord and
I don't need to look at it and throw it away. And when you find the messages that you want,
now you filter down to those
and you process them through your state machines
like you normally would.
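That filtering step can be sketched as a quick match on the outermost constructor of a variant tree, discarding anything meant for another subsystem before doing real work; the constructors below are invented for illustration.

    type market_data = { symbol : string; price : float }
    type client_order = { client : string; qty : int }

    type message =
      | Market_data of market_data
      | Client_order of client_order
      | Admin of string

    let process_order _order = (* the real work for our subsystem *) ()

    let handle = function
      | Market_data _ | Admin _ -> ()   (* not for us: drop as cheaply as possible *)
      | Client_order order -> process_order order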
Got it.
So the critical performance thing is that
you don't necessarily have to be efficient at processing
the actual messages,
but you definitely have to be an efficient discarter
of messages that you don't care about.
Because otherwise you're going to spend a bunch of time
thinking about things
that you don't really mean to be thinking about. And that's just going to slow everybody down.
Okay, so one of the downsides of Concord is you have to spend a lot of time thinking about
performance. What else? So I mentioned that we have this message type, and you have to be really
careful about it. Because if you change the message type in a certain way, like, sure,
over time, you're going to want to change your messages. That's a normal thing. But let's say you make a change to a message type, and all of a sudden you have an issue
in production on the same day.
Maybe it's completely unrelated to the change that you made.
What that means is that if you try to roll back, all of a sudden you're not going to
be able to understand the messages that were committed to that transaction log earlier
in the day.
You have to be really careful about like, when do I release
a change that just touches messages? And when do I release a change that adds logic that is more
dangerous? So it sounds like the versioning story for Concord as it is right now is in some sense,
very simple and very brittle. There's essentially morally one type that determines that entire tree
of messages. And then you change that one type, and then it's a new and incompatible
type. And there's no story for having messages that are like understandable across different
versions. So that when you roll, if you're going to roll a change to the message type,
you want that to be a very low-risk roll, because those rolls are hard to roll back.
That's right. When you change the world, you have to change the whole world at once. And in fact,
our very first message comes with a little digest such that if an app reads
it and doesn't know the digest, it will just say, I'm not even going to try, and crash.
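The digest check can be sketched like this; the use of the stdlib Digest module and a textual protocol description are stand-ins for illustration, not Concord's actual scheme.

    (* The first message of the day carries a fingerprint of the message
       protocol; an app built against a different protocol refuses to proceed
       rather than misinterpret the log. *)
    let expected_digest ~protocol_description = Digest.string protocol_description

    let check_first_message ~my_protocol_description ~digest_in_log =
      let mine = expected_digest ~protocol_description:my_protocol_description in
      if not (Digest.equal mine digest_in_log) then
        failwith "message protocol mismatch: refusing to read this transaction log"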
Got it.
Do you think this versioning problem is a detail of how Concord was built or a fundamental
aspect of this whole approach?
Well, I think that in a lot of ways, you have protocols and stability.
You need to think about like, what is your story for upgrading and downgrading, if that is possible.
It's a natural part of a lot of things we do, especially at Jane Street.
And we've gotten really good at building out frameworks for thinking about stability.
But at the same time, I feel like it's more of a problem in Concord because of that transaction log and because of the message persistence. It makes it so that as soon as something is written down, you really have to
commit to being able to read it back in that same form. So there's a kind of forward compatibility
requirement. Any message that's written has to be understandable in the future. So it's like
you're maybe not tied to this very simple and brittle, well, there's just one type and that's
that. You could have programs that are capable of understanding multiple versions at the same time.
But essentially, the constraint of reading a single transaction log in an unintermediated way
does put some extra constraints. A thing that we'll often do in point-to-point protocols is
we'll have a negotiation phase. You'll have two processes, and process one knows how to speak
versions A, B, and C, and process two knows how to speak B, C, and D.
They can decide which language to speak between them.
When you're sending stuff over a broadcast network, it's like there's no negotiation.
Like somebody is broadcasting them as fast as they can, and all the consumers just have to understand.
So there is something fundamental at the model level that's constraining here.
Yeah, that's right. You can tell I've been in Concord for a while when I forget that things like negotiation phases even exist in the rest of the world.
Right. So instead of lots of different teams independently rolling new apps and unrolling new apps, and going forward and backwards and trying out new things in an uncoordinated way, the system rolls as a unit. And at a minimum, this versioning lock requires you to
do this. Yes, that's right. Is there anything else about how the system works that requires this kind
of team orientation for how you build and deploy Concord applications? Yeah, actually, there's a
lot. When I mentioned that there were multiple Concord instances at
Jane Street, what that actually means is that each of them have their own transaction log,
their own sequencer, their own retransmitters for filling in gaps when a message is dropped.
They have their own config management for the entire system, which can get pretty complicated
because it is a hard thing to configure. And they
have their own cabinets of hardware, including all the physical boxes that you need to run the
processes and specialized networking hardware to do all sorts of things. And so every time you want
to make a brand new Concord instance, you need to do all of those things. You need to tick all those
boxes. The overhead there is massive. And so we actually have been shying away from creating new Concord instances because it's just so expensive just
from a managerial standpoint. Got it. So the essential complaint here is a certain kind of
heavyweight nature of rolling out Concord to a new thing. You want to roll out a new one and
there's a bunch of stuff and maybe it kind of makes sense at the given scale. The amount of
work you have to do configuring it and setting up is proportionate to a large complex app and pays back enough dividends that you're happy to do it.
But if you want to write some small, simple thing on top of Concord, it's basically a disaster.
You could not contemplate doing it.
So this segues nicely into what you're working on now.
There's a new system that's been being worked on in the last few years in the Concord team called ARIA, which is meant to address a bunch of these problems.
Maybe you can give us a thumbnail sketch of what is it that ARIA is trying to achieve.
Yeah.
So a lot like Concord, ARIA is both a transport and a framework. And we like the framework so much that we are basically forcing it onto
people that want to write things on ARIA, because we really just think it's like the right way to
write these kinds of programs. And so ARIA itself is, you can actually kind of think about it as
just like a Concord instance off in the corner in its own little hardware cabinet. And it's managed
by the ARIA team. And if you want to write an application
that uses ARIA as a transport, then you don't need to do much. You just need to get your own
box, you put your process on it, and you talk to our core infrastructure through a special proxy
layer. So the idea of ARIA is ARIA is trying to be a general purpose version of Concord.
Oh, yeah.
Right.
So anyone can use it for like whatever their little application is.
So we talked about all the ways in which Concord does not work well as a general purpose infrastructure.
What are the features that you've added to ARIA to make it suitable for people broadly to just pick it up and use it? So first of all, the amount of overhead and management is much smaller.
There is no configuration story for individual users now because we do all the configuration
of the core system.
And so if a user needs to configure their app, yeah, they go and do that.
But they don't have to worry about us in any way.
That ultimately lowers the management overhead drastically.
So if I have my simple toy app that I wanted to get working on ARIA, there's actually a
really short time from I read the ARIA docs to I have something that's sequencing messages on a transaction log
right now. And that's a really useful thing for people getting into the system for the first time.
Got it. So step number one, shared infrastructure and shared administration.
Yep.
What else?
One of the things that we've talked about is performance and how hard performance is.
In Concord, you have to read every message that every process has sequenced.
And this is obviously just not going to scale in a multi-user world.
If there is one application that someone builds that sequences hundreds of thousands of messages,
and then someone else has their tiny toy app that only sequences a handful,
you don't want to have to read 100,000 plus a handful in order to do your normal job.
And so what we do is at this proxy layer that I alluded to, we have a filtering mechanism.
So you can say, I'm actually just interested in this small swath of the world.
So please give me just those messages.
Don't bother me with all of the other 100,000 massive swarm of market data messages.
It sounds like making a system like ARIA work well still requires a lot of performance engineering.
But the performance engineering is at the level of the sequencer and of the proxy applications.
And user code only needs to be as fast as that user code has stuff that it needs to process.
If you pick various streams of data to consume and you want to read those messages,
well, you have to be able to keep up with those messages,
but you don't have to keep up with everything else.
That's right.
And scaling the core is actually much easier than scaling an application
because when we get a message from the stream,
we don't need to look at its payload at all because we actually can't understand its payload.
It's just an opaque bag of bytes to us. We just have to process the message and get those bytes out or
persist them onto disk or whatever we need to do with them. So the biggest bottleneck there is just
copying bytes from one place to another. Got it. So this is another kind of segmentation of the
work, right? There's like some outer envelope of the basic structure of the message that Aria
needs to understand. And presumably it has to have enough information to do things like filtering,
right? Because it can't look at the parts of the message it doesn't understand and filter based on
those. And then the whole question of what the data is and how it's versioned is fobbed off on
the clients. The clients have to figure out how they want to think about their own message types
and ARIA kind of doesn't intrude there. And the core system is just, you know, blindly ferrying
their bytes around.
That's exactly right.
So what is the information that ARIA uses for filtering? How is that structured?
We have this thing that we call a topic. And a topic is a hierarchical namespace where you are writing data to some part in that namespace. And you can subscribe to data from some subtree of that namespace. And so you can
kind of separate your data into separate streams and later on have a single consumer that is
joining all those streams together. And the important thing about these topics is that
even among topics, there is a single canonical ordering that everyone gets to see.
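A hierarchical topic and a subtree subscription can be sketched as a path-prefix check; the representation below is made up for illustration, not ARIA's actual one.

    type topic = string list   (* e.g. ["market-data"; "us"; "equities"] *)

    (* [matches ~subscription topic] is true when [subscription] is a prefix of
       [topic], i.e. the topic lives in the subscribed subtree. *)
    let rec matches ~subscription topic =
      match subscription, topic with
      | [], _ -> true
      | _ :: _, [] -> false
      | s :: sub_rest, t :: topic_rest ->
        String.equal s t && matches ~subscription:sub_rest topic_rest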
Why is that ordering so important?
So the ordering is actually important
if you ever want to interface with another team.
So let's say that I have a team
that's providing market data on ARIA,
and all they're doing is they're putting
the market data out there on a special topic.
If I have an app that wants to do something
a little bit more complicated and transactional,
it might need to rely on market data
for something like regulatory reporting and knowing that, oh, at the time that this order happened, the market data was this.
And it's really important for reproducibility. Let's say something crashes and you bring it back
up. It sees those messages in the same order and it doesn't change the ordering of the market data
and order messages. And this is, I think, in contrast, when someone hears the description of this system,
it's like, oh, we have messages and we shoot them around
and we have topics and we can filter.
It starts to sound a lot like things like Kafka.
I think this difference about ordering is pretty substantial.
In the way Kafka works, as I understand it,
there's some fundamental notion of a shard,
which is some subset of your data,
and then I think topics can have collections of shards and you can subscribe to Kafka topics, but you're only guaranteed
reproducible ordering within a shard. And you can make sure that your topic is only one shard,
which limits its scalability. But at that point, at least you're guaranteed of an ordering within
a topic. But between topics, you're never guaranteed of ordering. So if all of your
state machines are single topic state machines, that's in some sense
fine because every state machine you build will be decoded the same way as long as that state
machine is just subscribing to data from one topic. But the moment you want to combine multiple
topics together and have multiple state machines and then be able to make decisions based on
concurrently, like at the same time, what do these different state machines say? Decisions like that
become non-reproducible unless you have that total ordering. When things are
totally ordered, well, you can replicate even things that are making decisions across topics,
but without that total order, it becomes a lot more chaotic and hard to predict what's going
to happen if you have two different applications running at the same time. And actually, even worse,
if you do a replay, you can sometimes have extremely weird and unfair orderings, right?
You know, you're listening to two Kafka topics and one of them has a thousand messages and one of them has a million messages.
You can imagine that the thousand message topic is going to finish replaying back to you way faster than the million message topic.
So stuff is going to be super skewed on replay and your replay from scratch is going to behave very, very differently than your live
system did. Yeah, it really all goes back to that point of determinism and how central that is to
the ARIA and Concord story. Okay, so far what do we have? We have a proxy layer, which insulates
applications from messages they don't care about. We have this hierarchical space of topics into
which people can organize their data. Is there anything
else interesting about the ARIA design? Other pieces that distinguish it from how Concord works?
So one of the things that we also kind of talked about was the messaging protocol and how Concord's
got this one giant message protocol that everything needs to adhere to. But in ARIA, like we mentioned,
it's kind of up to the user. And so each user might use a
different protocol for a different topic. So you now have lots of different protocols. And in fact,
you can even get a negotiation-like system by, say, writing out one protocol of data and then
writing out a newer version of it and then letting another subscriber choose which one they want to do. So if I don't know how to read V3 yet, I can fall back to V2 and be happy for the day.
Got it. Although in that case, do you have a guarantee that you get things in the same order
when you consume data? If what we're doing is we're taking the data that's in different versions
and publishing it on different topics, are those necessarily going to be published in the same
order? What kind of ordering guarantees do you get in that environment? No, you're right. There
are no ordering guarantees in that sense. You can do your best to publish them one after the other
and get them as close to each other as possible. But it's kind of a thing, it's a trade-off that
you'd have to make if you were trying to do some sort of an upgrade, or were trying to help
service older clients that haven't upgraded yet.
Right. So I guess you can make sure that your messages are published in the same order within the topic.
If I have one message and I want to publish it at versions V1, 2, and 3, I could say publish first at V1 and then at V2 and then at V3 and then do the next message then at V1 and then at V2 and then at V3.
So within topic, people will get them in the same order. But if I have one app that consumes v1 and concurrently some other totally unrelated topic for some other state
machine and reads those together in a different one that reads v2 and that same other state
machine, those can be interleaved in different ways. But I guess I can still reliably replicate
things that are consuming the same versions. I don't totally lose the ability to replicate.
Some noise is added between applications that are using different versions of the data.
You're kind of hitting on a point that's really subtle. And we talked about how message versioning
was so hard to get right. But there's actually a whole second class of message versioning that we
didn't talk about, which is the state machines have code that interprets the
messages. And that itself is like a form of versioning. If you have a state machine that
reads a bunch of messages and then crashes, and then let's say you roll a different binary that
interprets them differently and start it up, it's going to read the same messages and end up at a
different state. And so there is actually like a subtle version attached to the source code itself.
Right. And Concord is kind of rolled as one big application that is many different processes,
but it's in some ways less a distributed system and more a supercomputer where you'd run a single
program on it. And here in the context of ARIA, you're talking about some complicated zoo of many
different programs. And so now you open up the somewhat new question of subtly different state
machines that are consuming the same stream of transactions, but maybe reading them differently.
And all of a sudden, the guarantees we have about the perfection of the replication between different copies is much harder to think about.
That's right.
The story there is that, yes, a lot of the burden will fall on the users using the system to make sure that they're configured correctly.
And not just configured correctly, but that people are making changes in a reasonable way,
that they're making changes that maybe even if they change the state machine, how it runs,
they're not changing its semantics in ways that are deeply incompatible.
Yeah, exactly.
So a thing that you haven't talked at all about is permissioning. If you have one enormous soup of topics,
there's a big hierarchy and like anyone can publish anywhere, it seems like awful things
are going to happen. So what's the permissioning story here? Well, we give clients the ability to
write to certain topics or certain trees of topics. And within those trees of topics,
they're allowed to just create their own topics. And so they can kind of create a whole segment of ARIA
where they can read and write to freely. But this is still managed to some degree by the ARIA team.
We still want to partition up the topics in a reasonable way. It's not like everyone can just
create whatever topics they want. Got it. So part of the central administration of the system
is essentially saying users X, Y, and Z, you can use this subtree. Users A, B,
and C, you can use that other subtree and preventing users from stomping on each other that way.
Yeah, that's a simple way of putting it.
Do you think differently about permissioning reading and writing to a given topic?
Well, actually, it's a little interesting because reading is a lot harder to get right.
I talked about these proxy layers.
One of the ways that we let you interface with these proxies is with TCP.
So you can connect to the proxy
and just get a stream of all the messages over TCP.
And it's really simple.
However, there's a second way of doing it
and that's UDP multicast, bringing multicast back.
So we have multicast both inside the core of ARIA and outside at the proxy layer. And the reason why that's interesting is because
you can't really put permissioning on multicast packets. When you're sending them out into the
network, there's nothing stopping someone from listening to an address and receiving packets
on that address. The only thing that you can really do is you can make it hard for them to
figure out what is that address that they're trying to listen to. I mean, I guess technically
you can do at the network layer segmentation of which hosts can this multicast be routed to and
which hosts can that multicast be routed to. But it's not easy to hit those kinds of controls from
the application layer, I would think. That's right. And that is kind of like a management
nightmare in terms of switch rules and trying to keep
things in sync.
And like you said, the ARIA team doesn't really have a lot of ability to dictate what the
networking team does and configures.
So is the current model that you have that like you have write permissioning, but anybody
can read anything?
That is explicitly what we have right now.
But we are actually working on adding read permissioning in a sort of best-effort, trust-based way, which is to say,
if you are using our libraries to connect to ARIA,
which you probably are,
we will still validate that you are able to read the data
before we even start hooking up your listeners to the stream.
Right, I think even if you don't have any security concerns about the confidentiality of the data
that's in ARIA, you still might want this kind of permissioning essentially for abstraction reasons.
Put it another way, some of the topics to which you sequence data are, I imagine, ones you think of as the internal implementation of your system, and you freely version them and change them and think about them with respect to how you roll your own
applications and don't worry about outside users. And maybe there are some topics of data that you
publish as a kind of external service that you want people to use. And I would imagine you would
dearly like people to not accidentally subscribe to your implementation because you're going to
walk in one morning and roll your system and the whole firm is going to be broken because people
went and depended on a thing that they shouldn't have depended on.
That's exactly right.
And we have those things internal to ARIA too.
We actually keep a lot of the metadata about the current session of ARIA on ARIA.
It's really quite nice because we get all of the niceties of ARIA from being able to
inspect the system at any point in time.
And we don't want other people to be able to read
and depend on those things for that exact reason.
I mean, this points to another really nice thing about these kinds of systems that I think is not well understood by people who haven't played around with them before. You sort of think, oh, there are these messages and we put the data onto the system.
Whatever the core transactions are, I don't know, if you're running a trading system, maybe you put orders and market data and whatever. If you're Twitter, maybe you use this kind of system when you put tweets on there or something. But the other kind of thing
you put on there is configuration data, right? And it's nice to have this place where you can
put dynamically changing configuration and have a way of thinking consistently about exactly when
does the new config apply and when does the old config apply. And the fact that you just integrate
it into the ordinary data stream gives you this lovely model for talking about configuration
in a way that's fully dynamic and also fully consistent across all the consumers of that data.
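A minimal sketch of that pattern, assuming invented message and state types, and using a write-permission grant as the piece of config, which happens to be the example that comes up next:

```ocaml
(* Hypothetical sketch: configuration changes are just another kind of message on
   the stream, so every consumer applies the new config at exactly the same point
   in the sequence. *)

type message =
  | Order of { user : string; topic : string; details : string }
  | Grant_write of { user : string; topic : string }

type state =
  { writers : (string * string) list   (* (user, topic) pairs allowed to write *)
  ; accepted : int                     (* orders accepted so far *)
  }

let apply state = function
  | Grant_write { user; topic } ->
    { state with writers = (user, topic) :: state.writers }
  | Order { user; topic; details = _ } ->
    if List.mem (user, topic) state.writers
    then { state with accepted = state.accepted + 1 }
    else state  (* no permission at this point in the sequence: drop it *)

let replay messages =
  List.fold_left apply { writers = []; accepted = 0 } messages

let () =
  let before = [ Order { user = "alice"; topic = "t/orders"; details = "d" } ] in
  let after =
    [ Grant_write { user = "alice"; topic = "t/orders" }
    ; Order { user = "alice"; topic = "t/orders"; details = "d" }
    ]
  in
  (* Every consumer folding the same sequence sees the grant take effect at the
     same message number, so there is no ambiguity about when the config applies. *)
  assert ((replay before).accepted = 0);
  assert ((replay (before @ after)).accepted = 1)
```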
Yeah. The very simple example there is, let's say in the middle of the day, we want to give someone
write abilities on a new topic. How do you do that? Well, you just send an ARIA message that
says they're now allowed to write to this topic. One thing that strikes me as interesting about all this is there's a lot of work in various parts of the world about this kind of transaction log.
We're talking about transaction logs as if it's in some sense kind of a new and interesting idea, but it's the oldest idea in computer science.
Like there's a paper from the mid-70s that introduces this notion of distributed transaction logs, and all sorts of things are built on it. Standard
database management systems use very similar techniques. And, you know, 30 years of distributed
systems research, and then the entire blockchain world has rediscovered the notion of distributed
ledgers and Byzantine fault tolerance. This idea just kind of keeps on being invented and reinvented
over and over. But the thing about the ARIA work that strikes me as interesting and new is that it's not so much about the substrate, how you get the ledger to exist in the first place, but about how you use it.
How do you build an open heterogeneous system with multiple different clients who are working together?
I feel like I've seen less work in that space.
I guess in some sense the whole smart contract thing is a species of the same kind of work.
That's a way of writing general purpose programs and sticking them onto this kind of distributed ledger-based system.
But ARIA is very different from that, and it's kind of a whole different way and a more software engineering-oriented way of thinking about this.
One of the things that's been interesting in following along with all of this work is just seeing the working out of these idioms.
And I don't know, I feel like idioms like this in computer science are super powerful
and easy to underestimate.
I don't know who came up with the idea of a process
when designing early operating systems,
but that was a good idea.
And it had a lot of legs.
I feel like there's a lot of interesting ideas
in how you guys are building out ARIA
as a thing that makes it easy for people
to build new complex applications.
It has kind of a similar feel to me. So along those lines, like, how's it going? I know there's a system that's been
in progress. You're not the only person who's worked on it. There's been a whole team of people
working on like ARIA core technology and it's starting to be used. Like, how is that going?
Like, who's using it and what's the process like of it kind of getting out into real use in
Jane Street? Yeah. So we have a system.
It's in production.
We have deployments in several regions now.
And it's actually being used by teams at Jane Street
in a production manner.
So I would say so far, so good.
Things are looking good.
We specifically don't have too many people using it because when we get a new client, there are always going to be some new requirements, and we want to make sure that we're able to address them in a
relatively straightforward way. And a lot of the decisions that we make, we think really, really
hard about because the decisions that we make now are going to be ones that we'll have to live with
for a long time. And so when we're thinking about new features, we're constantly trying to figure
out ways in which we can do them in the best possible way.
On one hand, we are actually missing a bunch of features that would be really nice.
This has kind of worked out just fine for our current clients,
but we know that there are upcoming ones where we're going to need some new features.
Can you give me examples of features that you know you need, but they're not yet in production?
A small example of this is rate limiting.
It's a thing that you might expect would be there from day one, perhaps, but it's not.
And so a single client can, say, get into a bad state where they're now sending the same message over and over and over again.
And we will happily inject those into the ARIA stream as fast as possible. This is not necessarily great because
it can actually hurt the performance of the system for other people too. There's lots of isolation,
but there's still limited numbers of processes and what they can do. And so there is some sharing
and overlap. Focusing on that for a second, if you have one client who's injecting an arbitrary
amount of data into the system, that data is not going to hit the end client who's subscribing to a different topic.
But all of the internal infrastructure, the sequencers and the republishers and whatever, all of those interior pieces, the proxies that are sitting there, they have to deal with all the data.
And so if you have one injector who's running as fast as it can, it is going to slow down all of those pieces.
Even if the actual end client isn't exactly directly implicated,
the path by which they receive data is. That's right. And not to mention, we are still currently
in a world where we're writing everything, every single byte that we get down to disk,
because we have no idea. Maybe someone has to start up and say, please fill me in from the
beginning of the day. We're going to have to go to disk to get that. And so there's actually just
also limited disk space. If every client was hitting us with max
throughput as fast as possible, we might be concerned about disk. So why aren't you more
worried about rate limiting? That all sounds kind of terrible. It is. Yeah, that's definitely the
first thing on our list right now to work on. It just happens to be a relatively easier problem
compared to another thing that we've been thinking about a lot, which is snapshots.
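Before the conversation turns to snapshots, here is a rough sketch of one conventional way to do the kind of per-client rate limiting Doug describes, a token bucket; this is a generic illustration rather than ARIA's actual design:

```ocaml
(* Hypothetical sketch: a token bucket per client. Each client accrues capacity at
   [rate] messages per second up to [burst]; a message is admitted only if a token
   is available, otherwise the proxy would push back on that client. *)

type bucket = { mutable tokens : float; mutable last : float }

let make () = { tokens = 0.; last = 0. }

let admit ~rate ~burst ~now bucket =
  (* Refill according to elapsed time, capped at the burst size. *)
  let elapsed = now -. bucket.last in
  bucket.tokens <- min burst (bucket.tokens +. (elapsed *. rate));
  bucket.last <- now;
  if bucket.tokens >= 1. then begin
    bucket.tokens <- bucket.tokens -. 1.;
    true   (* admit: inject the message into the stream *)
  end
  else false (* over the limit: push back instead of sequencing it *)

let () =
  let b = make () in
  (* With rate = 2 msg/s and burst = 4, a client hammering at t = 1.0s gets two
     messages through and is then told to slow down. *)
  [ 1; 2; 3; 4 ]
  |> List.map (fun _ -> admit ~rate:2. ~burst:4. ~now:1.0 b)
  |> List.iter (fun ok -> print_endline (if ok then "admitted" else "rejected"))
```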
Snapshots tie in a lot with the thing that I actually just said, which is, let's say
your app starts up and it's the middle of the day.
It needs to receive messages.
The only thing that it can do right now is start at the beginning of the day and read
every single message on its set of topics that it needs to read up to the current tip
of the stream.
And it's maybe worth saying that whole phrase, from the beginning of the day,
is kind of a weird thing to say, like, which day, whose day, when does Hong Kong's day start? When
does the day start in Sydney? There's different notions of day. And I think this just kind of
reflects the thing you talked about before, about how this technology has grown out of a trading
environment where there's a regional
concept of which trading region you're in and that there's a notion of a day's begin and end.
And like that actually is messy and complicated and not nearly as clean as you might naively hope.
It worked much better 15 years ago than it does today. But still, that's a real thing that lots
of our systems have is this kind of notion of a daily boundary. Yeah. And to even expand on that, well, first, the thing I was
going to say was that obviously that might be really expensive. It might take a lot of time
to replay all those messages. And we want to prevent that. Even going further than that,
we are trying to think about a single session as not just a day, but we want to expand it to be a whole week. So
like, wow, it's like seven times what we're doing already. And so it would be really bad if an app
that starts up on Sunday and an app that starts up on Friday night had completely different
performance characteristics because one of them has to replay, you know, maybe hundreds of gigabytes
worth of data before it can even do anything. And so this is where we introduce snapshots.
Snapshots being, hey, here's a chunk of the state at a given point in time.
You don't need to read messages older than this.
Just load up the state and start from there.
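A rough sketch of what that looks like from a reader's point of view, with invented types; the key property is that recovering from a snapshot taken at sequence number N, plus the messages after N, must land you in the same state as a full replay:

```ocaml
(* Hypothetical sketch: a reader either folds every message from the start of the
   session, or loads a snapshot taken at sequence number [as_of] and folds only
   the messages after it. Both paths must land on the same state. *)

type message = { seq : int; payload : string }
type state = { last_seq : int; count : int }

let initial = { last_seq = 0; count = 0 }

let apply state msg = { last_seq = msg.seq; count = state.count + 1 }

(* Full replay: general-purpose but potentially very expensive. *)
let recover_from_scratch messages = List.fold_left apply initial messages

(* Snapshot recovery: skip everything at or before the snapshot point. *)
let recover_from_snapshot ~snapshot ~as_of messages =
  messages
  |> List.filter (fun m -> m.seq > as_of)
  |> List.fold_left apply snapshot

let () =
  let messages = List.init 10 (fun i -> { seq = i + 1; payload = "m" }) in
  let full = recover_from_scratch messages in
  let snapshot = { last_seq = 6; count = 6 } in
  let fast = recover_from_snapshot ~snapshot ~as_of:6 messages in
  assert (full = fast)
```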
And snapshots, from a finance world perspective,
this is a familiar concept because that's what happens for market data feeds, right?
If you go to any exchange, they have two things.
They have a reliable multicast stream, very similar to the reliable multicast that we use inside something
like Concord or ARIA. They have a gap filler, which is actually part of reliable multicast,
and they have a snapshot service. The gap filler is just the thing we talked about as a retransmitter,
where you go and request a packet you missed. And then the snapshot service is this other thing,
which gives you some summary of the state, which is meant to be enough for you to go forward from there and compute.
In the context of a market data feed, that snapshot usually consists of the set of orders that are currently open and maybe the set of executions that have happened on the day because typically there are way more orders than there are executions.
So you don't want to replay to someone all of the orders if you can get away with it.
But all of the executions is not so bad necessarily.
Yeah, you're touching on how it's actually not as simple as I just described it originally.
Because you might need different snapshot techniques for different services.
Some of them need to retain more data than others.
And some of them you might want to say, hey, this single topic has two different ways of being interpreted.
You can interpret it in a
way that you need a really large snapshot in order to continue reading it, or you can interpret it
in a way that's simpler and more straightforward, like the market data example. Right, and even
within market data, you could imagine, let's say I have an application whose goal is to print out
the total number of open orders at any moment in time, right? If you have that application and it
gets as a snapshot the current state of the book and the set of open orders,
that's cool.
It can replay, and then from the moment it gets the snapshot,
it knows how many open orders there are.
And every further update it gets to the book,
it can update that view of the book
and therefore update its count, and everything's cool.
But if you have an application that needs to count
the number of orders that happened on the day,
it's just dead in the water with that snapshot.
That snapshot does not let it recover its state
because just knowing the current set of open orders doesn't tell you how many orders
have been sent in a given day. And it's not like there can't exist a snapshot that would solve the
problem. You would just need a different one. So there's a way in which replaying from the
beginning of time is a general purpose recovery mechanism. It doesn't matter what you're doing.
As long as what you're doing is a function of the stream of data, you can do it on a replay just as you did it live.
Whereas snapshots are kind of by their nature to some degree application specific.
You can do some kinds of computations with some kinds of snapshots and others you can't.
Yeah, I think that's exactly right.
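To make Ron's example concrete, here is a hedged sketch with made-up order events: a snapshot of the currently open orders is enough for the "how many are open" state machine to resume, but it cannot recover the "how many orders happened today" state machine, which would need a different snapshot or a full replay:

```ocaml
(* Hypothetical sketch of why snapshots are application-specific. *)

type event = New_order of int | Cancel of int  (* payload is an order id *)

(* Transition function for state machine A: the set of currently open orders. *)
let apply_open open_ids = function
  | New_order id -> id :: open_ids
  | Cancel id -> List.filter (fun id' -> id' <> id) open_ids

(* Transition function for state machine B: how many orders were sent today. *)
let apply_count n = function
  | New_order _ -> n + 1
  | Cancel _ -> n

(* An "open orders" snapshot (just the list of open ids) is enough for machine A
   to resume mid-stream: fold the later events on top of the snapshot. *)
let resume_open ~snapshot later_events =
  List.fold_left apply_open snapshot later_events

(* Machine B cannot be recovered from that same snapshot: knowing which orders
   are open says nothing about how many were ever sent. It needs a different
   snapshot (say, the running count itself) or a full replay from the start. *)

let () =
  let later = [ New_order 7; Cancel 3 ] in
  let open_after = resume_open ~snapshot:[ 3; 5 ] later in
  assert (List.length open_after = 2)  (* orders 5 and 7 remain open *)
```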
What does that mean for providing a snapshot service for ARIA?
Well, this goes back to the thing where we have to think very hard and carefully about it. One of the ways that we are approaching the problem is to give clients the ability to
build their own snapshot servers. So you can say, I'm a client, and I know that I'm using these
topics. And hey, I'm also using these other topics from a different team, I'm going to put them all
together into one snapshot server that
can compute a snapshot that is tailored to my use case. And by doing that, by empowering clients to
build these servers out, you kind of solve the problem at the user level.
Got it. But what's interesting is that the thing we described the exchange as doing
is kind of the opposite of that. And it's demonstrably very useful, right? Because I feel
like the market data story highlights
that snapshots are application-specific,
but they're not completely application-specific.
There are actually lots of different state machines
that you can reasonably build
on top of the kind of snapshot that exchanges provide.
So you can do a lot of things, but not everything.
So the thing you're describing now
is going all the way in the direction
of the application-specific ideas.
Like we are going to give users just the full ability to customize the kind of snapshot they want,
and they can remember for restarting whatever it is they want to remember.
Each application is on its own.
And I guess you can imagine something in between where you allow someone who's providing a general-purpose system
to also provide a kind of general-purpose snapshot.
Yeah, exactly.
To me, this feels a lot like composing state
machines, where if I'm building a service that's using another topic, I might use their state
machine to consume that topic. In fact, I must use their state machine to consume that topic.
In the same way, if I'm building a snapshot server, I will probably actually use some components that
that team has released to compute the snapshot. But I might actually, you know,
embellish it in some way, like add some details to it, or maybe compact it in ways that I can
get away with to make it simpler. There's lots of degrees of freedom here. And I think composability
is your answer to them. Right. And even if the only snapshot mechanism you give people
at the ARIA level is this kind of build-your-own-snapshot-service-for-your-application approach, you can just use the ordinary modularity tools, right? You can have a library that somebody
wrote for their particular stream that provides you most of the code that you need for computing
their part of the snapshot. So it's not like you're totally lost on a reusability perspective,
even if you decide to build completely per-application snapshot servers.
Exactly.
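A small sketch of that composability point, with invented team and topic names: a client's snapshot server can be assembled from snapshot functions that the owning teams publish for their own topics, plus whatever application-specific pieces the client layers on top:

```ocaml
(* Hypothetical sketch: each team exposes a snapshot function for its own topics,
   and a client composes them into one snapshot tailored to its application. *)

type message = { topic : string; payload : string }

(* Imagine these come from libraries released by the teams that own each topic. *)
let team_a_snapshot messages =
  List.filter (fun m -> m.topic = "team-a/orders") messages |> List.length

let team_b_snapshot messages =
  List.fold_left
    (fun last m -> if m.topic = "team-b/prices" then Some m.payload else last)
    None messages

(* The client's combined snapshot: reuse both components plus its own field. *)
type my_snapshot =
  { order_count : int
  ; last_price : string option
  ; note : string
  }

let my_snapshot_server messages =
  { order_count = team_a_snapshot messages
  ; last_price = team_b_snapshot messages
  ; note = "anything application-specific can be layered on here"
  }
```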
You said before that there's a small number of teams that are using ARIA right now.
Can you say more about what teams these are and what kind of things they're building?
Yeah. So Concord and ARIA are part of a department at Jane Street called Client-Facing Tech.
And a lot of the other current clients of ARIA are also in this same department.
And as the name alludes to, we write programs that interface with clients directly. And what
we end up dealing with a lot is something called high touch flow, which means that unlike our kind
of traditional Jane Street systems, where it's just a lot of computers making decisions about
what to do, these are systems that actually involve a lot
of human interaction. And so it actually gets a lot more complicated because even though we can
automate some things, sometimes we just have to kick it back to a human to look at it and examine
it and decide what to do. And so that's the kind of tooling and infrastructure that we're building
around handling new types of orders and being able to automate them as much as we can.
Right, and I guess one of the advantages of that
is just the fact that people are kind of organizationally close
makes it easier to kind of coordinate and prioritize together
while you're still in the phase of building out
the core functionality of ARIA itself.
That's right, yeah.
Some of these people have also worked on Concord,
so they are pulling their prior experience
into the new ARIA world.
And I guess the other thing that's funny about ARIA is that it really is a different way of
programming and it takes some time to get used to, which, to say a straightforward thing, is in some sense just a misfeature, right? It's great to learn things, but also having a system that requires people to learn stuff is kind of bad on the face of it. So I think you think, and I think we probably think, that it's worth learning. It's a valuable thing to learn, but it's also
hard. It's a different way of thinking about designing your system.
And by starting with people who are near at hand,
it's easier to slowly acculturate them
and teach them how to approach the problem
in the way that you think it needs to be approached.
Have you thought much about how the process goes
of evangelizing something like ARIA
as you try and reach out to the firm more broadly?
Like, Jane Street's kind of big now.
We have like 1,700 people,
hundreds and hundreds of engineers across a bunch of different teams.
How do you think about the process of teaching people
a new way of approaching the architecture of their systems
when the people are more spread out
and people like you don't know and talk to every day?
Yeah, I think one of the things
that we've learned the most about Concord
is that there are patterns that arise.
Concord has,
I mentioned, hundreds of different processes running, and also several dozen of them are
completely different types of applications. And what this means is that over time, we've kind of
converged on a few patterns. For example, we have something that we call the injector pattern.
It's a kind of app that just consumes some external data and puts it into a
format that it can put onto the stream and make sure the stream is up to date with the external
world. It's a very simple thing once you actually sit down and get a chance to do it. And so really
what I want to do is I want to take these patterns and encapsulate them in documentation and examples
and maybe even a sort of, let's call it a boot camp,
where in order to learn how to do things,
you have to implement your own injector.
And with the extensive testing that we can build around it,
you get immediate feedback over whether it was working or not.
So for something like this kind of simple injector,
which the way you describe it sounds very simple,
what are the kind of mistakes that you think someone who comes to ARIA
not having really learned about it and doesn't have
the advantage of the examples or the bootcamp or the documentation, what kind of things are they
going to get wrong when they try and build an application that does this injecting?
Going back to the inversion of control thing that we talked about way earlier on,
the applications themselves do not send messages. They instead wait for something to say,
what is the next message you want to send?
And then they propose it.
However, there's kind of something subtle there, which is there's a chance that they
just keep proposing the same message over and over.
At some point, you should be updating your state in such a way that you change your mind
about what you want to propose.
So like, let's say I want to sequence message A. Once A is sequenced,
I probably don't want to sequence it again.
I probably want to sequence B.
But there are certain classes of bugs
where you actually don't update your state correctly.
And now you're just stuck sequencing A
over and over and over again.
And that will just continue to infinity.
By the way, I always think of this way
of styling an application
as writing your application as a problem solver, right?
So the application is doing stuff, interacting with the outside world, and it decides that there's a problem with the world.
The problem is it knows about some order that the stream doesn't yet know about.
And then that means it wants to tell the stream about that.
And so when asked what message you want to send, it says, oh, here's a message that will tell you that this order existed in the outside world.
And then, as you said, once you receive the message that said that thing has occurred, well, now you don't want to send it anymore because the problem's been
resolved. And what's nice about that is you don't care who resolved the problem, right? You might
have resolved it. Somebody else might have resolved it. Maybe you have a whole fleet of
things that are actively going off and trying to resolve these problems. And you get to be agnostic
to who the solver is. And it means if you crash and come back and whatever, and the problem's
solved, you don't care. You don't see the problem there anymore. But yeah, the problem with this system
is like the hypochondriac who thinks there's always a problem. Whatever's going on,
there's constantly a problem. It's constantly trying to send a message saying resolve a problem
that isn't really a problem anymore. And that's how you get the infinite loop bug.
Yeah, that's a great way to put it.
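Here is a hedged sketch of the shape being described, with invented types: the injector never sends directly, it only answers the framework's question of what it would like sequenced next, and it must update its state when it sees its own message come back on the stream, or it will propose the same thing forever:

```ocaml
(* Hypothetical sketch of an injector as a "problem solver": it proposes a message
   only while the stream is missing something it knows about, and stops once it
   observes that the problem has been resolved (by anyone). *)

module Injector = struct
  type t =
    { mutable known_orders : string list   (* seen in the outside world *)
    ; mutable on_stream : string list      (* already sequenced *)
    }

  let create () = { known_orders = []; on_stream = [] }

  (* External data arrives: note the new problem. *)
  let saw_external t order = t.known_orders <- order :: t.known_orders

  (* Inversion of control: the framework asks what, if anything, to send next. *)
  let propose t =
    List.find_opt (fun o -> not (List.mem o t.on_stream)) t.known_orders

  (* Every sequenced message comes back through here, including our own. Failing
     to update state here is exactly the "sequence A forever" bug. *)
  let on_sequenced t order = t.on_stream <- order :: t.on_stream
end

let () =
  let t = Injector.create () in
  Injector.saw_external t "order-1";
  (match Injector.propose t with
   | Some o -> print_endline ("propose " ^ o)
   | None -> ());
  Injector.on_sequenced t "order-1";
  assert (Injector.propose t = None)  (* problem solved; nothing left to propose *)
```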
One of the things you can do there is you can create documentation to explain things to people.
Is there also just some notion of trying to bake in abstractions that help people build these
applications in ways that avoid some of the common pitfalls? Yeah, totally. I'm a big fan of
abstractions myself. And so an example of an abstraction that we built was that you can take
a pipe, which is a kind of core library data structure that lets you push things onto one
end of the pipe and then lazily pull them off. And you can just attach that into an ARIA app as
a state machine where pulling them off means injecting them into the ARIA stream, sequencing
them. And then suddenly you can use an interface that you're familiar with, and it's going to do the right thing of only sending each thing once.
Got it.
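The real abstraction Doug describes is presumably built on Jane Street's Async pipes; the toy stand-in below uses a plain queue just to show the shape: application code pushes messages onto one end, and the sequencing side pulls each one off exactly once when asked what to propose (glossing over retries and acknowledgement, which a real implementation would need):

```ocaml
(* Hypothetical sketch: wrap a queue as an ARIA-style proposer, so application code
   just pushes messages and the framework drains them, sequencing each one once. *)

module Pipe_injector = struct
  type 'a t = { pending : 'a Queue.t }

  let create () = { pending = Queue.create () }

  (* Application side: familiar push interface. *)
  let push t msg = Queue.push msg t.pending

  (* Framework side: asked for the next thing to sequence; popping means each
     message can only ever be proposed once. *)
  let propose t =
    if Queue.is_empty t.pending then None else Some (Queue.pop t.pending)
end

let () =
  let t = Pipe_injector.create () in
  Pipe_injector.push t "hello";
  Pipe_injector.push t "world";
  let rec drain () =
    match Pipe_injector.propose t with
    | Some msg -> print_endline ("sequencing " ^ msg); drain ()
    | None -> ()
  in
  drain ()
```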
So we're constantly looking at code that people are writing
and looking for places that we can build abstractions around the patterns
or anti-patterns that they are doing.
In terms of how broadly it's used at Jane Street,
what are you hoping to see out of ARIA a year from now?
If you look back a year from now, what will you count as a successful rollout of the system?
Well, I would love to see more users on it. I'm assuming that we're going to see at some point in the near future, a user that has a much higher bandwidth requirement
than other users. So starting to push the system to its limits and forcing us to get a little bit
more clever with how we shuffle those bytes around internally in order to both be fair to all of the clients that are
using it and scale up past the point that we need to, to satisfy everything that's going on.
So I think some of our biggest concerns are going to be making sure we can scale
the system. Right. And in fact, there's a lot of headroom there, right? There are many things that
we know that we could do to make ARIA scale up by quite a bit. That's right. Yes, we have lots of ideas. It's a matter of implementing them at the right time,
as in before they're needed direly.
But not too far before, so you have time to build other features for people too.
Exactly. Yes, it's all a balancing act and no one really knows what they're doing.
All right. Well, maybe that's a good point to end it on, a note of clear humility.
Thanks a lot, Doug. This has been really fun.
Thanks again for having me.
You'll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com.
Thanks for joining us, and see you next time.
Signals and Threads