Signals and Threads - State Machine Replication, and Why You Should Care with Doug Patti

Episode Date: April 20, 2022

Doug Patti is a developer in Jane Street's Client-Facing Tech team, where he works on a system called Concord that undergirds Jane Street's client offerings. In this episode, Doug and Ron discuss how Concord, which has state-machine replication as its core abstraction, helps Jane Street achieve the reliability, scalability, and speed that the client business demands. They'll also discuss Doug's involvement in building a successor system called Aria, which is designed to deliver those same benefits to a much wider audience. You can find the transcript for this episode on our website.

Some links to topics that came up in the discussion:

- Jane Street's client-facing trading platforms
- A Signals and Threads episode on market data and multicast, which discusses some of the history of state-machine replication in the markets
- The FIX protocol
- UDP multicast
- Reliable multicast
- Kafka

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. It's my pleasure to introduce Doug Patti. Doug has worked here at Jane Street for about five years. And in that time, he's worked on a bunch of interesting systems, really distributed systems. That's mostly what I want to talk about today: the design and evolution of those systems. But Doug, just to get us started, why don't you tell us a little bit about where you came from and where you worked before you joined us at Jane Street? Sure, and thanks for having me, Ron. Before Jane Street, I worked at
Starting point is 00:00:37 a few smaller web startups, and my primary day-to-day was writing Node.js and Ruby and building little web frameworks and APIs. So it was quite a departure from the things that I'm working on today. So I know a bunch about what you work on, but what's different about the character of the work that you do here versus that? Well, in a way, we kind of just use completely different technologies. For example, we very rarely come across databases in our day-to-day here at Jane Street. There are some teams that work with databases, of course, but I personally have almost never interacted with them. And that was a core part of being a web
Starting point is 00:01:13 developer: there has to be a database. I think another big divergence in my world is that a lot of the Jane Street systems are based around a trading day. And so things kind of start up in the morning, some stuff happens, and then it shuts down at the end of the day. And you have time to go do maintenance because there's nothing going on. Whereas with web services, you're constantly thinking about how to do continuous rollouts over the course of the day so that there's no downtime. Yep. Although that's a thing that used to be more true here than it is now. And we are more and more building services that run 24 hours a day and, in fact, seven days a week. So the trading day over time has become more continuous.
Starting point is 00:01:52 But for sure, there's still a lot of that going on in many of the systems we build. Okay, so let's talk a little bit about the system. So the first thing you came and worked on when you joined here was a system called Concord. Can you tell us a little bit more about what Concord was for, what business problem you were working on and the team that worked with Concord was doing? Yeah, definitely. So Concord predates me. When I joined, I think Concord had already
Starting point is 00:02:14 been around for at least four years or so. At the time, we were entering into a new business where instead of doing the normal Jane Street thing, which is like we go to an exchange and we trade there, we're getting into relationships, like some bilateral relationships with different clients, different firms, where we trade with them directly. And so this brings on a whole new swath of responsibilities, and we built Concord to manage those things. Right. And I guess maybe it's worth saying from a business level, the thing we're doing in those businesses is not so different from our on-exchange activity. In both cases, we're primarily liquidity providers.
Starting point is 00:02:51 Other people want to trade, and we try and be the best price for that trade. But in these bilateral relationships, we have an arrangement with some other kind of financial institution that wants to reach out to us and see what liquidity we have available. That's one of the kinds of relationships we might have. And suddenly we're a service provider. That's the difference: we are not just providing the trading, but also the platform on which the interaction happens. That's totally right. Being a service provider is where all this responsibility comes from. We have this obligation to be reliable, to be available. We need to have good uptime. We need to be fast in turnaround.
Starting point is 00:03:26 When we get a message from a client, we need to be able to respond really quickly. And we also need to maintain some sort of an audit trail, more so than in our traditional trading systems, for regulatory reasons. So I understand this point about being a service provider and how that kind of changes your obligations. But can you maybe sketch out, almost at the network level, what are the services that we are providing to customers that connect to us? Yeah, maybe I can walk through what it looks like from end to end, perhaps. Is that what you're asking? Yeah, totally. Yeah, okay. So you can start out with: we are running some server, and a client is going to connect in to one of our ports. The client's going to send us an
Starting point is 00:04:05 order. And the order flows through our system and tries to match against one of the orders that we have from one of our desks, our trading desks. And if there is a match, then we produce an execution and we send a copy of that execution back to the client. We send a copy to the desk and we send a copy to a regulatory reporting agency. And that's a very simplistic version of what we do. And in some sense, that's not so different from the interaction you have on an exchange, right? When we want to provide liquidity on exchange, we post orders on exchange that express what we're willing to trade. And then people who want to trade come in and send orders, and then maybe
Starting point is 00:04:43 they cross one of the orders that we have, and an execution goes up. But the difference is we're playing two roles here. We're both running the exchange-like infrastructure, and we're also the same organization that's actually putting in the orders that are providing liquidity to clients who might want to come in and connect. Yep, that's exactly right. And in fact, I think the original design for Concord was inspired by what other exchanges use in the real world today. Got it. So you said before that this puts pressure in terms of reliability. Can you say a little bit more about how that plays out? Yeah. If we go back to that example of a client sending an order, we have to be ready for the possibility of some sort of a network loss. So the client might get disconnected. If the client reconnects, they really need to be able to see, like, hey, did the order actually get sent? And if so, did it get filled? And these are important pieces of information that we need to hold on to.
Starting point is 00:05:38 So it's not just what happens if the network goes down, but what happens if one of our entire machines crashes and we suddenly lost the piece of state, the program that was running that was accepting those connections? We need to make sure that when we come back online, we have the right answers for the client. Right. So the scenario is a client connects, they send an order, something terrible happens, and they're disconnected. And then when they do connect again, they would really like to know whether they traded or not. And so there's that kind of re-synchronization process that you have to get right. That's right. That's really important. So how does that work? There are multiple levels at which you can think about this, but let's talk about just the client and the
Starting point is 00:06:18 server talking to each other. We use a protocol called FIX. It's a pretty standard exchange protocol in the financial world. And in FIX, there is this notion of a sequence number that is attached to every single message. You start at one, you expect it to go up by one, and you never expect it to go down or skip a sequence or something like that. I should say this is not unlike TCP, where in TCP you have sequence numbers based on the bytes and making sure that packets arrive in the same order. But the difference is that this persists over the course of the day. So that means if you lose connection, and you reconnect, what you would end up doing is you'd
Starting point is 00:06:56 log on and say, hey, I want to start from message 45. And we might say, oh, wait a minute, we don't know about 45. Can you send us 45? And all of a sudden there's this protocol for exchanging missed messages in that sense based on sequence number. Right, so TCP gives you an illusion of reliability on top of an unreliable network. Your base network sends unreliable datagrams. You might lose any packet. TCP does a bunch of sequence number stuff to synthesize a kind of reliable connection on top of that unreliable one. But it's limited in that it's just for the scope of
Starting point is 00:07:29 the connection. You lose this connection for any reason, and certainly if you lose one host or the other, you're dead, right? There are no guarantees provided by TCP that let you be certain about what information did and didn't get through. And so we're providing an extra layer of reliability on top of that, where we can recover. And we can even recover when the server on the receiving end itself crashes, right? So it may have lost all of its local state. And still we have to be able to recover, because we need to be able to make sure that clients know what they did, even if one of our boxes crashes. It's kind of not an acceptable customer service thing for you to be like, oh, we don't really know if you traded. Maybe that's an awkward conversation to have with a client.
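A minimal sketch of the gap-detection logic described above, with invented names and types rather than anything from a real FIX engine; the key point is that the expected sequence number outlives any one TCP connection, which is what makes recovery after a reconnect possible:

```ocaml
(* Illustrative sketch of FIX-style session sequencing (invented names,
   not a real FIX engine). Each side tracks the next inbound sequence
   number it expects; a gap triggers a resend request for the range. *)
type action =
  | Process                      (* in order: hand to the application *)
  | Resend_request of int * int  (* we missed messages [lo, hi]       *)
  | Ignore_duplicate             (* already seen: drop it             *)

let on_inbound ~expected ~seq : action =
  if seq = expected then Process
  else if seq > expected then Resend_request (expected, seq - 1)
  else Ignore_duplicate
```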
Starting point is 00:08:09 Cool. Are there any other ways in which the constraints of the business that we're doing affect the kind of technology that we have to build? One of the things that I said was that we need to be very available and we need to have uptime. And what that means is in the case that a box crashes, like, yes, it's a very rare event, and hopefully we can bring it back up. But we also need to be able to move quickly. And sometimes that means moving all of our things off of that box and onto a new box, not exactly instantly, but on the order of a few minutes instead of a few hours, potentially to debug an issue with a box. And so we have the flexibility of kind of moving processes around in this distributed
Starting point is 00:08:49 system pretty freely. Got it. All right. So there are requirements around reliability and durability of transactions. Like when you've told a client about something that happened, you'd better remember it and you better be able to resynchronize them and know all the stuff about their state that you can recover even when you lose components in your system. What about scalability? Scalability is a big thing because we sometimes have days where trading is busy on the order of magnitudes larger than a normal day.
Starting point is 00:09:15 And so we need to be able to handle all of that traffic at the same time like we would on a normal day. And so we have to design the system such that we can be efficient, not only in the common case, but even under a lot of load. Got it. Is all of the load what you're dealing with just the messages that are coming in from clients who want to trade? What else drives the traffic on the system? No, the system is actually full of lots of other data as well. For example, for regulatory reasons, we need to have all of the market data from all the public exchanges in our system at the same time as all of the orders for all the clients. And that alone is a huge amount of data. And especially on a busy
Starting point is 00:09:57 day, that's going to go up significantly. So we end up with these very, very hot paths in our code where you need to be able to process certain messages way more quickly than others. Got it. You've told us a bunch about what's the problem that's being solved. Can you give us a sketch of the system that was built to do it? What's the basic architectural story of Concord? Yeah.
Starting point is 00:10:19 We've already alluded to messages a little bit in the past. What we've built is a replicated state machine is kind of the fancy term for it. And when we say state machine, when I say state machine, I mean not state machine in the finite state machine, kind of something you would learn in CS with boxes and arrows and, you know, things changing from one state to another. We're talking more just think of a module that encapsulates some state, and that module has the ability to process messages one at a time. And each time it processes a message, it updates its state. It might perform an action, but that's it. That's the state machine. It's a very contained, encapsulated piece of software.
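For concreteness, here's a minimal sketch of that shape, with invented names rather than Concord's actual interface; the essential property is that the step function is pure, so rebuilding state is just a fold over the message log:

```ocaml
(* Minimal sketch of the state-machine shape described here (invented
   names, not Concord's real API). The step function is deterministic:
   no I/O, no clocks, no randomness. *)
module type State_machine = sig
  type t
  type message

  val init : t

  (* Process exactly one message, returning the updated state. *)
  val step : t -> message -> t
end

(* Because [step] is deterministic, recovering after a crash is just a
   fold over the day's messages, in order, from the beginning. *)
let replay (type t) (type m)
    (module M : State_machine with type t = t and type message = m)
    (log : m list) : t =
  List.fold_left M.step M.init log
```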
Starting point is 00:11:01 In some sense, that sounds very generic. Like, I write programs, they consume data, they update their state. How does the state machine approach differ from just any way I might write some kind of program that processes messages off the network? One of the things that we do is we process messages immediately. We don't do any sort of like, hey, let me go out and send a network request, or let me block for a second while I talk to something on the disk. It's really important that a single message both just updates your state and it does it deterministically. And this is really important because it means that if you have the same piece
Starting point is 00:11:35 of code and you run the same messages through it, you can get to the same state each time. Okay, so why does that ability to rerun the same state machine in multiple places and get the same result, why is that so important to the architecture? Well, going back to the earlier example: a client sends us an order, and we have to hold on to it and make sure that even if we crash, we're able to recover it later and tell them about the order and some potential execution that happened. And so we can do this by leveraging these state machines, because when you crash and you come back online, one of the things that Concord does is it forces you to read all of the messages that happened in the system from the beginning of the day in the same order. And what that means is that you will end up as an exact copy of what you were when you crashed, and you will have all the information that
Starting point is 00:12:31 you need to know. Got it. So there are really two things going on here. One is the individual applications are structured as these synchronous deterministic state machines, so that they have a well-defined behavior upon receiving the sequence of transactions. And somewhere there has to be some mechanism for delivering a reliable and ordered sequence of transactions to everyone who starts up, right? That's like the other leg upon which this whole thing stands. Yep, that's right. We call that sequence of transactions the transaction log. And there are kind of two questions around the transaction log. One is, like you mentioned, how do you disseminate it? How do you make sure that in this distributed
Starting point is 00:13:10 system with hundreds of processes, everything is seeing the same messages in the same order? The other question is, how do you append to the transaction log? Because obviously, you have to be able to add new messages. Got it. So how does that work in Concord? Yeah. So for the first part, when it comes to dissemination, what we do is we use something called UDP multicast. And if you're not familiar, multicast is a kind of protocol where you can take a single UDP packet, send it to a special network address, and the networking devices like switch, will then take it and make copies of it and send it to all the other switches or devices that are subscribed to that special address. It's lossy because it's UDP, but it's a way of taking one packet and suddenly turning it into 50.
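For a flavor of what the send side looks like from user space, here's a minimal sketch using OCaml's standard Unix module; the group address and port are made up. (Receiving would additionally require joining the group, which needs a socket option the standard Unix module doesn't expose, so only the send side is shown.)

```ocaml
(* Sketch: sending one UDP datagram to a multicast group. The switches,
   not the sender, fan it out to every subscriber. Address and port are
   invented for illustration. *)
let () =
  let sock = Unix.socket Unix.PF_INET Unix.SOCK_DGRAM 0 in
  let group = Unix.ADDR_INET (Unix.inet_addr_of_string "239.1.2.3", 15000) in
  let msg = Bytes.of_string "one packet in, fifty packets out" in
  ignore (Unix.sendto sock msg 0 (Bytes.length msg) [] group);
  Unix.close sock
```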
Starting point is 00:13:56 And you don't have to think about it at all from your application level. The actual work of doing the copying is done by the switches. And importantly, the switches have specialized hardware for doing this copying, so that they can do it in parallel, basically at the same time, down through all of the ports that need to send that packet. They can just do it in this very, very efficient way and at very high speed. And weirdly, nobody else uses this. IP multicast was a big deal historically. When I was a graduate student, so long ago that I'm embarrassed to admit it, it was supposed to be the thing that was going to make video transmission over the internet happen. But it turned out nobody uses it in the outside world, though it's extremely common in the finance world. It is. One of the drawbacks of it is that it's much harder to maintain over larger
Starting point is 00:14:38 and larger networks. When you're dealing with a nice enclosed network, like, hey, Concord is just a single cabinet of hardware, we actually get a lot of the benefits of it without any of the drawbacks. One of the drawbacks, though, that we do have to think about is that this is UDP. And that means it's unreliable. Packets can get dropped. And so we need to build some layer of additional functionality to help prevent that. UDP multicast is efficient, but it's unreliable, meaning you can lose data and messages can get reordered.
Starting point is 00:15:11 And we need to claw that back. So how do you claw back those reliability properties? Right. So this kind of goes back to what we were talking about earlier when we talked about the fix protocol and sequence numbers. We actually just use sequence numbers inside of our own messaging as well. So each packet has a sequence number attached to it. And that means that if you were to say receive packet 45, and then 47, and then 48, you would very quickly realize that you may have missed 46.
Starting point is 00:15:39 Maybe it's still in flight, maybe it hasn't gotten there yet. Or maybe it will never get there. And so what you can do then is you can go out to a separate service and contact it and say, hey, I'm missing packet 46. Can you fill me in? And this is a thing that we do all the time. And it actually works quite seamlessly, even when drops are very common, which might be the case due to a slow process. Right. So an important thing about these kinds of unreliable protocols is you might lose data because something is wrong with the network, but as any good network engineer will tell you, it's almost never the network. Just like, you know, when you're starting to work in a new programming language, you think, oh, maybe it's a compiler
Starting point is 00:16:14 bug. It's like, no, it's not a compiler bug, it's your bug. And if you're losing packets, it's probably not because the network can't keep up, it's probably because you, the damn program you wrote, can't keep up, and you're dropping packets for that reason. But again, you can always recover them the same way, which is there's some retransmission service. So really, you've described two things here, right? There are sequence numbers, which let you know the ordering,
Starting point is 00:16:36 and there's a retransmission server, which lets you recover data that might have been lost. But these both sound kind of magical. Where do the sequence numbers come from, and how do you get this magical retransmission service that always knows all the things that your poor application missed? Right. And before we talk about that, I think it might be more interesting to start from how we append to the transaction log in the first place. Every process in the system has the ability to send these packets out into the multicast void where they say, I want to add this message to the transaction log. And there is a single app that we call the sequencer.
Starting point is 00:17:18 And that sequencer is responsible for taking packets from that multicast group and re-emitting them on a different multicast group. When it does that, it also stamps the packets with a timestamp and it stamps them with their sequence number and it sends them all out to all of the things listening. And so you just have this unordered mess of messages coming in and a single canonical ordering of messages coming out of the sequencer. Got it. So in some sense, you get ordering in the stupidest possible way, which is you just pick a single point in space and you shove all the messages through that one point, and you have a counter there that increments every time it sees a message,
Starting point is 00:17:57 and that's how you get the sequence numbers. Yep, that's how it is. And when it comes to retransmitters, they're just normal processes, just like all the other processes. They listen to the sequencer. They write down everything that they hear. Sometimes they might lose information. It's very possible. And so retransmitters will try to talk to other retransmitters or in a worst case scenario, talk to the sequencer because there's a chance that the sequencer tried to send a packet that never made it to anyone. And so you eventually have to go back to the sequencer and say, hold on a second, we missed 46. Everyone missed 46. Please help fill us in.
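Putting those pieces together, a hedged sketch of the sequencer loop and the subscriber-side gap check; every name here is invented, and this is an illustration of the idea rather than Concord's implementation:

```ocaml
(* The sequencer is a single counter at a single point in space: it pulls
   unordered proposals in and emits one canonical, stamped ordering. *)
type 'a stamped = { seq : int; timestamp : float; payload : 'a }

let run_sequencer ~recv ~broadcast =
  let next = ref 1 in
  while true do
    let payload = recv () in
    broadcast { seq = !next; timestamp = Unix.gettimeofday (); payload };
    incr next
  done

(* Subscriber side: seeing 45 and then 47 means 46 went missing, so ask
   a retransmitter for the gap; anything older is a duplicate to drop. *)
let on_packet ~expected ~request_retransmit packet =
  if packet.seq > !expected then
    request_retransmit ~missing_from:!expected ~missing_to:(packet.seq - 1);
  if packet.seq >= !expected then expected := packet.seq + 1
```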
Starting point is 00:18:32 Got it. And then what happens if you lose the sequencer? Well, in the case of Concord, everything stops. Everything grinds to a halt. There are no new messages. You can't really do things since everything that you do is based around messages. And fortunately, we have something that is a backup sequencer running. And what we do is we intentionally want to make that a manual process because we want to make sure we can inspect the system and feel confident because we're talking about trading, we're talking about money. We would much rather have a few seconds of downtime while we collect our thoughts and make sure we're in a healthy state. What we can do is we can send a message through a different channel to this backup
Starting point is 00:19:13 sequencer, and it will be promoted into the new sequencer and take over responsibilities from there. Right. And I feel like this is the point at which we could fork the conversation and have a whole separate conversation about how this compares to all sorts of fancy distributed systems algorithms like Paxos and Raft and blah, blah, blah for dealing with the fact that the sequencer might fail. But let's not have that conversation, because that's a long and complicated and totally different one. What I am curious about is that this design is very simple, right? How are we going to drive the system? We're going to have a single stream of transactions. How are we going to do it? We're going to have one box that is just responsible for stamping all the
Starting point is 00:19:47 transactions going through and rebroadcasting them to everyone. And then we're going to build a bunch of apps that do whatever they do by listening to the stream of transactions. It also seems kind of perverse in the sense that you said scalability was important, but it sounds like there's a single choke point for the entire system: everything has to go through this one process that has to do the timestamping on every message. So how does that actually work out from a scalability perspective? Surprisingly well, actually. We're limited by the speed of the network, which is like 10 gigabits per second. We're limited by the program that is trying to pull packets off and put them on a
Starting point is 00:20:26 different interface. And we're also limited to some degree by disk writing, because we have to be able to write things to disk in case we need to recall them later, because we can't store everything in memory. So those are a few limiting factors. But in practice, this fits perfectly well for Concord. We have seen in total a transaction log from 9:30 to 4, roughly a trading day, that was over 400 gigabytes, which is pretty large. We also see a baseline message rate during the trading day of over 100,000 messages per second. And some of those messages are smaller than others, but you still see, you know, maybe even three to five times that on a busier day. Right. And those messages actually pack a lot of punch, right? Part of the way that you optimize a system like this is you try and have a fairly concise encoding of the data
Starting point is 00:21:15 that you actually stick on the network. So you can have pretty high transaction rates with a fairly small number of bytes per transaction. And in fact, you put more energy into optimizing the things that come up the most. So market data inputs are pretty concise, and, you know, you put less energy into messages that happen 10 times a day. You don't think very much at all about how much space they take. Exactly. On the exchange side, where you see a very similar architecture, they're trying to solve a larger problem, and the peak transaction rates for that version of this kind of system are on the order of several million transactions a second. I think of this as a kind of case where people look at a design like this and think, oh, there's one choke point, so it's
Starting point is 00:21:52 not scalable. But that's a kind of fuzzy-headed approach to thinking about scalability. Like your problem, kind of whatever problem you have, has some finite size, you know, even if it's web scale. And whatever design you come up with, it's going to scale a certain distance and not farther. And what I've always liked about this class of designs is you think in advance, like, wait, no, how much data is there really? What are the transaction rates we need to do? And then if you can scope your problem to fit within that, you have an enormous amount of flexibility and you can get really good performance out of that, both in terms of the throughput of the data you can get into kind of one coherent system and latency too, right?
Starting point is 00:22:27 Do you know like roughly what the latency is in and out of the sequencer that you guys currently have in Concord? Yeah, we measure somewhere between, I want to say eight and 12 microseconds as a round trip time from an app. A single process wants to sequence a message and the sequencer sends it back to the app.
Starting point is 00:22:45 Right, which is like within an order of magnitude of the theoretical limits of a software solution here. It takes like about 400 nanos to go one way over the PCI Express bus. It's kind of hard to do better than like one to one and a half mics for something like this. And so the 8 to 12 is like within a factor of 10 of the best that you can do in theory. Yep. Cool. Okay, so that tells us a little bit about this kind of scaling question.
Starting point is 00:23:08 What else is good about this design? How does this play out in practice? The thing that we've talked about the most so far is how Concord acts as a transport mechanism. And I think there's probably more that we can say about that, but I think it's not just a transport mechanism. It's also a framework. It's a framework for building applications and thinking in a different style when you're writing these applications so that they work efficiently with the giant stream of data that you have to consume.
Starting point is 00:23:35 I think one of the things that's really nice about this is that our testability story is really, really good in Concord. And that's because we have the ability to take these synchronous state machines that we've talked about and put messages in them one at a time in an exact order that we can control.
Starting point is 00:23:53 And we can really simulate race conditions to an amazing degree and take all sorts of nearly impossible things and make them reproduce perfectly. The way that we actually have this nice testability, one of the reasons why it's so nice, I mentioned it was like a framework for building applications. And the framework itself forces you to write your application in an inverted style. And what that means is your application doesn't have the ability to say, hey, I want to
Starting point is 00:24:20 send a message. I want to propose this message. Instead, your application has to process messages and make decisions about what it might want to send when something asks it what it wants to send. So you're no longer pushing messages out of your system. You are just waiting around for something to say, hey, what's your next message? Here's my next message. And this seems like a kind of weird design at first. But when it comes down to the testability story, what it does is it really forces you to put more of your state in your state. In the former example of I want to send a message, you might have some code that looks like, okay, well, I send the message, then I wait for it, then I'm going to
Starting point is 00:25:03 send another message. But what you're actually doing is you're taking a very subtle piece of state there, the fact that you want to send a second message, and you're hiding it in a way that's not observable anymore. And so by inverting control, you have no choice but to put all of your state down into some structure that you can then poke at. And from the testing perspective, you can say, all right, we've seen one message happen. Let's investigate what every other app wants to send at this point. So when you talk about inversion of control here, this goes back to a very old idea
Starting point is 00:25:34 where in the mid-70s, if you were writing an application on a Unix system that needed to handle multiple different connections, there's a system call called select that you would call. You'd call select and it would tell you which file descriptors were ready and had data to handle. And you'd say, oh, I'm going to like handle the data as far as I can handle it without doing anything
Starting point is 00:25:50 that requires me to block and wait. And I'll just keep on calling select over and over. And so it's inverted in the sense that rather than expressing your code as I want to do one damn thing after another, like writing it down in terms of threads of control, you essentially write it in terms of event handlers saying, like, when the thing happens,
Starting point is 00:26:06 here's how I respond. That's a really good way to put it. There's some point again, you know, back in the Pleistocene era when I was a graduate student, where there was this relatively new idea of threads that was really cool, where instead of doing all of this
Starting point is 00:26:19 like weird inversion of control, we could live in a much better world where we have lots of different threads, where you can just express one damn thing after another in a straightforward way. And then there's a minor problem that the threads have to communicate with each other with mutexes and locks. A minor problem, yes. Right, it's a minor issue. And it turns out, actually, that latter thing is kind of terrible. And like a long, disastrous story unfolds from all of that. But it sounds like what you're saying is the Concord design leans very heavily in the
Starting point is 00:26:42 "no, we're going to be event-driven" direction in the way that we think about writing concurrent programs. Yeah, that's totally right. We don't take advantage of using multiple threads, because we still have to process every message one at a time, and we have to fully process a message before we can start moving on to the next message. Right. And your point about how this forces you to pull your state into your state, it's a funny thing to say because, like, where else would your state be? But the other place where your state can be is in this kind of hidden, encapsulated state inside of the thread that you're running, right? If you're writing programs that have lots of threads, then there's all this kind of hard-to-track state of your program that essentially has
Starting point is 00:27:17 to do with what's the program counter and the stack and stuff for each of the threads that you're running in parallel. And we're saying, no, no, no, we're going to make all of that explicit in state that's always accessible. So in some sense, any question you want to ask about the state of a particular process, you can always ask. Whereas in threaded computations, there are these intermediate points
Starting point is 00:27:36 between when the threads communicate with each other, where these questions become really hard to ask. Here we, in the ordinary way, in a synchronous way, reflect this in regular data structures. So we can always ask any question that we want to ask about the state of the process and what's currently going on. Yeah, that's a good summary of what this whole story is about. The testability story, I talked about race conditions and how they are easy to simulate.
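To make the inverted style concrete, here's a toy sketch with invented names (not the real framework's API); the point is that the intent to send lives in ordinary state that a test can inspect between any two messages:

```ocaml
(* Toy sketch of the inverted style. Processing never "sends"; it only
   updates state, including an outbox the framework later asks about. *)
module Echo = struct
  type message = Input of string | Ack of string

  type t = { seen : int; outbox : message list }

  let init = { seen = 0; outbox = [] }

  let on_message t = function
    | Input s -> { seen = t.seen + 1; outbox = t.outbox @ [ Ack s ] }
    | Ack _ -> t

  (* The framework (or a test) asks: what's your next message, if any? *)
  let next_outgoing t =
    match t.outbox with
    | [] -> None
    | m :: rest -> Some (m, { t with outbox = rest })
end

(* A test feeds messages one at a time, in an exact order, and can pause
   to inspect the state at every intermediate point. *)
let () =
  let t = Echo.on_message Echo.init (Echo.Input "order-1") in
  assert (Echo.next_outgoing t <> None)
```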
Starting point is 00:27:59 And that is purely because you are able to control precisely what goes into the system and when to pause to start looking to examine the system. Okay. So I can see why this is nice from a kind of testing perspective. How does it do in terms of composability, in terms of making it easy to build complicated systems out of little ones, right? That's one of the most important things about a programming framework is how easy is it to build up bigger and more complicated things out of smaller pieces that you can reason about individually? How does that story work for Concord?
Starting point is 00:28:27 So this is actually one of the nice things that I also really like about Concord is that if you take a single process and look at it, it's not just a state machine. It's actually a state machine that's composed of smaller state machines who themselves might be composed of smaller state machines. What that means is that you can kind of draw your boundaries in lots of different places. You can take a single program that is maybe performing too slowly and split it into two smaller ones. Or you can take a swath of programs that all have similar kind of needs in terms of data and stitch them together. And it lets you play with the performance profiles of them in a really nice way. So is scaling by breaking up things into smaller pieces, how does the Concord architecture make that easier?
Starting point is 00:29:09 In some sense, I can always say in any program I want, you can say, well, yeah, I could break up into smaller pieces and have them run on different processes. What about Concord simplifies that? So the thing is, if you have two state machines that are completely deterministic, you know that they're going to see all the same messages in the same order.
Starting point is 00:29:28 You actually don't need to have two processes that talk to each other. In fact, we forbid that from happening except over the stream, over the transaction log. These processes, if they want to know about each other's states, all they do is simply include the same state machines in both of the processes. All of a sudden, you know what the other is thinking, because you know you've both processed the same messages in a deterministic way. That's why this becomes more composable, or decomposable: because you're just running the same code in different places. Rather than communicate with a different process in the system, you just think the same way as that process. You look at the same set of inputs, you come to the same conclusions, and you necessarily
Starting point is 00:30:06 consistently know the same things because you're running the same little processes underneath. That's exactly right. Yeah. It's neat. Okay. So that's a lot of lovely things you've said about Concord. What's not so great about the system? So one of the things that we've talked about a few times now is performance and the performance
Starting point is 00:30:23 story and how things are pretty fast, but they also have to be fast. And I think that this is actually a really tricky thing and something that we have to keep tweaking and tuning over time. Because when it comes to performance, every single process needs to see every single message, even though they might not do something with every message, the vast majority of them need to. If you're really slow at handling a message, your program is going to fall behind and you're not going to be reading the stream at the rate that you need to. And so we have to find ways to improve the performance, which usually boils down to being smart about our choice of data structures, the way that we write code. We actually try to write OCaml in a way that's pretty
Starting point is 00:31:05 allocation-free, so that we don't have to run the garbage collector as much as we normally would. Let's say you have a process that's trying to do too much. You might want to split that into two processes. And all of a sudden, now they have more bandwidth in terms of CPU power. So when we're talking about this basic Concord design, we talked about the fact that the sequencer has to see every single message. But how do you get any kind of scaling and parallelism if every single process has to process every single message? If we all have to do the same work everywhere, then aren't we kind of in trouble? We certainly would be if we had to actually see every single message. The way that this works is Concord has a giant message type
Starting point is 00:31:45 that is the protocol for each of these messages. And the message type is actually a tree of different message types. And so what you can do as a program is you can very first thing look at the message type and say, oh, this is for a completely different subsystem of Concord and I don't need to look at it and throw it away. And when you find the messages that you want, now you filter down to those and you process them through your state machines like you normally would. Got it.
Starting point is 00:32:10 So the critical performance thing is that you don't necessarily have to be efficient at processing the actual messages, but you definitely have to be an efficient discarder of messages that you don't care about. Because otherwise you're going to spend a bunch of time thinking about things that you don't really mean to be thinking about. And that's just going to slow everybody down.
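An invented miniature of that message tree, just to illustrate the dispatch pattern (none of these types are Concord's real ones):

```ocaml
(* The top-level variant is what lets a process discard whole
   subsystems' traffic without ever decoding the payload. *)
type market_data = Quote of { symbol : string; price : float } | Trade
type client_order = New_order | Cancel | Replace

type message =
  | Market_data of market_data
  | Client_order of client_order
  | Admin

(* An order-handling app: the first arm is the hot path that throws
   away the market-data firehose it doesn't care about. *)
let step ~handle_order state = function
  | Market_data _ -> state  (* efficient discard: never look inside *)
  | Admin -> state
  | Client_order o -> handle_order state o
```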
Starting point is 00:32:27 Okay, so one of the downsides of Concord is you have to spend a lot of time thinking about performance. What else? So I mentioned that we have this message type, and you have to be really careful about it. Because if you change the message type in a certain way, like, sure, over time, you're going to want to change your messages. That's a normal thing. But let's say you make a change to a message type, and all of a sudden you have an issue in production on the same day. Maybe it's completely unrelated to the change that you made. What that means is that if you try to roll back, all of a sudden you're not going to be able to understand the messages that were committed to that transaction log earlier
Starting point is 00:33:01 in the day. You have to be really careful about like, when do I release a change that just touches messages? And when do I release a change that adds logic that is more dangerous? So it sounds like the versioning story for Concord as it is right now is in some sense, very simple and very brittle. There's essentially morally one type that determines that entire tree of messages. And then you change that one type, and then it's a new and incompatible type. And there's no story for having messages that are like understandable across different versions. So that when you roll, if you're going to roll a change to the message type,
Starting point is 00:33:34 you want that to be a very low-risk roll, because those rolls are hard to roll back. That's right. When you change the world, you have to change the whole world at once. And in fact, our very first message comes with a little digest, such that if an app reads it and doesn't know the digest, it will just say, I'm not even going to try, and crash. Got it. Do you think this versioning problem is a detail of how Concord was built or a fundamental aspect of this whole approach? Well, I think that in a lot of ways, you have protocols and stability.
Starting point is 00:34:04 You need to think about like, what is your story for upgrading and downgrading, if that is possible. It's a natural part of a lot of things we do, especially at Jane Street. And we've gotten really good at building out frameworks for thinking about stability. But at the same time, I feel like it's more of a problem in Concord because of that transaction log and because of the message persistence. It makes it so that as soon as something is written down, you really have to commit to being able to read it back in that same form. So there's a kind of forward compatibility requirement. Any message that's written has to be understandable in the future. So it's like you're maybe not tied to this very simple and brittle, well, there's just one type and that's that. You could have programs that are capable of understanding multiple versions at the same time.
Starting point is 00:34:47 But essentially, the constraint of reading a single transaction log in an unintermediated way does put some extra constraints. A thing that we'll often do in point-to-point protocols is we'll have a negotiation phase. You'll have two processes, and process one knows how to speak versions A, B, and C, and process two knows how to speak B, C, and D. They can decide which language to speak between them. When you're sending stuff over a broadcast network, it's like there's no negotiation. Like somebody is broadcasting them as fast as they can, and all the consumers just have to understand. So there is something fundamental at the model level that's constraining here.
Starting point is 00:35:20 Yeah, that's right. You can tell I've been in Concord for a while when I forget that things like negotiation phases even exist in the rest of the world. So instead of individual teams rolling out new apps and unrolling new apps, going forward and backward, and trying out new things in an uncoordinated way, the system rolls as a unit. And at a minimum, this versioning lock requires you to do this. Yes, that's right. Is there anything else about how the system works that requires this kind of team orientation for how you build and deploy Concord applications? Yeah, actually, there's a lot. When I mentioned that there were multiple Concord instances at Jane Street, what that actually means is that each of them has its own transaction log, its own sequencer, its own retransmitters for filling in gaps when a message is dropped. They have their own config management for the entire system, which can get pretty complicated because it is a hard thing to configure. And they
Starting point is 00:36:25 have their own cabinets of hardware, including all the physical boxes that you need to run the processes and specialized networking hardware to do all sorts of things. And so every time you want to make a brand new Concord instance, you need to do all of those things. You need to tick all those boxes. The overhead there is massive. And so we actually have been shying away from creating new Concord instances because it's just so expensive just from a managerial standpoint. Got it. So the essential complaint here is a certain kind of heavyweight nature of rolling out Concord to a new thing. You want to roll out a new one and there's a bunch of stuff and maybe it kind of makes sense at the given scale. The amount of work you have to do configuring it and setting up is proportionate to a large complex app and pays back enough dividends that you're happy to do it.
Starting point is 00:37:09 But if you want to write some small, simple thing on top of Concord, it's basically a disaster. You could not contemplate doing it. So this segues nicely into what you're working on now. There's a new system that's been in the works over the last few years in the Concord team, called ARIA, which is meant to address a bunch of these problems. Maybe you can give us a thumbnail sketch of what it is that ARIA is trying to achieve. Yeah. So a lot like Concord, ARIA is both a transport and a framework, and we believe in the framework so much that we are basically forcing it onto people that want to write things on ARIA, because we really just think it's like the right way to
Starting point is 00:37:50 write these kinds of programs. And so ARIA itself is, you can actually kind of think about it as just like a Concord instance off in the corner in its own little hardware cabinet. And it's managed by the ARIA team. And if you want to write an application that uses ARIA as a transport, then you don't need to do much. You just need to get your own box, you put your process on it, and you talk to our core infrastructure through a special proxy layer. So the idea of ARIA is that ARIA is trying to be a general-purpose version of Concord. Oh, yeah. Right.
Starting point is 00:38:27 Something anyone can use for, like, whatever their little application is. So we talked about all the ways in which Concord does not work well as a general-purpose infrastructure. What are the features that you've added to ARIA to make it suitable for people broadly to just pick it up and use it? So first of all, the amount of overhead and management is much smaller. There is no configuration story for individual users now, because we do all the configuration of the core system. And so if a user needs to configure their app, yeah, they go and do that. But they don't have to worry about us in any way. That ultimately lowers the management overhead drastically.
Starting point is 00:38:59 So if I have my simple toy app that I wanted to get working on ARIA, there's actually a really short time from I read the ARIA docs to I have something that's sequencing messages on a transaction log right now. And that's a really useful thing for people getting into the system for the first time. Got it. So step number one, shared infrastructure and shared administration. Yep. What else? One of the things that we've talked about is performance and how hard performance is. In Concord, you have to read every message that every process has sequenced.
Starting point is 00:39:29 And this is obviously just not going to scale in a multi-user world. If there is one application that someone builds that sequences hundreds of thousands of messages, and then someone else has their tiny toy app that only sequences a handful, you don't want to have to read 100,000 plus a handful in order to do your normal job. And so what we do is at this proxy layer that I alluded to, we have a filtering mechanism. So you can say, I'm actually just interested in this small swath of the world. So please give me just those messages. Don't bother me with all of the other 100,000 massive swarm of market data messages.
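A small sketch of that subtree filtering, with an invented topic representation rather than ARIA's actual one:

```ocaml
(* A topic is a path in a hierarchical namespace; subscribing to a
   prefix matches every topic underneath it. Names are illustrative. *)
type topic = string list  (* e.g. ["md"; "us"; "equities"; "AAPL"] *)

let rec is_prefix prefix topic =
  match prefix, topic with
  | [], _ -> true
  | p :: ps, t :: ts -> String.equal p t && is_prefix ps ts
  | _ :: _, [] -> false

(* The proxy forwards a message only if some subscription covers it. *)
let wants ~subscriptions topic =
  List.exists (fun sub -> is_prefix sub topic) subscriptions
```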
Starting point is 00:40:07 It sounds like making a system like ARIA work well still requires a lot of performance engineering. But the performance engineering is at the level of the sequencer and of the proxy applications. And user code only needs to be as fast as that user code has stuff that it needs to process. If you pick various streams of data to consume and you want to read those messages, well, you have to be able to keep up with those messages, but you don't have to keep up with everything else. That's right. And scaling the core is actually much easier than scaling an application
Starting point is 00:40:36 because when we get a message from the stream, we don't need to look at its payload at all, because we actually can't understand its payload. It's just an opaque bag of bytes to us. We just have to process the message and get those bytes out, or persist them onto disk, or whatever we need to do with them. So the biggest bottleneck there is just copying bytes from one place to another. Got it. So this is another kind of segmentation of the work, right? There's like some outer envelope of the basic structure of the message that ARIA needs to understand. And presumably it has to have enough information to do things like filtering, right? Because it can't look at the parts of the message it doesn't understand and filter based on
Starting point is 00:41:11 those. And then the whole question of what the data is and how it's versioned is fobbed off on the clients. The clients have to figure out how they want to think about their own message types, and ARIA kind of doesn't intrude there. And the core system is just, you know, blindly ferrying their bytes around. That's exactly right. So what is the information that ARIA uses for filtering? How is that structured? We have this thing that we call a topic. And a topic is a hierarchical namespace where you are writing data to some part in that namespace. And you can subscribe to data from some subtree of that namespace. And so you can kind of separate your data into separate streams and later on have a single consumer that is
Starting point is 00:41:53 joining all those streams together. And the important thing about these topics is that even among topics, there is a single canonical ordering that everyone gets to see. Why is that ordering so important? So the ordering is actually important if you ever want to interface with another team. So let's say that I have a team that's providing market data on ARIA, and all they're doing is they're putting
Starting point is 00:42:13 the market data out there on a special topic. If I have an app that wants to do something a little bit more complicated and transactional, it might need to rely on market data for something like regulatory reporting, and knowing that, oh, at the time that this order happened, the market data was this. And it's really important for reproducibility. Let's say something crashes and you bring it back up. It sees those messages in the same order, and it doesn't change the ordering of the market data
Starting point is 00:42:40 and order messages. And this is, I think, in contrast, when someone hears the description of this system, it's like, oh, we have messages and we shoot them around and we have topics and we can filter. It starts to sound a lot like things like Kafka. I think this difference about ordering is pretty substantial. In the way Kafka works, as I understand it, there's some fundamental notion of a shard, which is some subset of your data,
Starting point is 00:43:03 and then I think topics can have collections of shards and you can subscribe to Kafka topics, but you're only guaranteed reproducible ordering within a shard. And you can make sure that your topic is only one shard, which limits its scalability. But at that point, at least you're guaranteed of an ordering within a topic. But between topics, you're never guaranteed of ordering. So if all of your state machines are single topic state machines, that's in some sense fine because every state machine you build will be decoded the same way as long as that state machine is just subscribing to data from one topic. But the moment you want to combine multiple topics together and have multiple state machines and then be able to make decisions based on
Starting point is 00:43:39 concurrently, like at the same time, what do these different state machines say? Decisions like that become non-reproducible unless you have that total ordering. When things are totally ordered, well, you can replicate even things that are making decisions across topics, but without that total order, it becomes a lot more chaotic and hard to predict what's going to happen if you have two different applications running at the same time. And actually, even worse, if you do a replay, you can sometimes have extremely weird and unfair orderings, right? You know, you're listening to two Kafka topics and one of them has a thousand messages and one of them has a million messages. You can imagine that the thousand message topic is going to finish replaying back to you way faster than the million message topic.
Starting point is 00:44:18 So stuff is going to be super skewed on replay and your replay from scratch is going to behave very, very differently than your live system did. Yeah, it really all goes back to that point of determinism and how central that is to the ARIA and Concord story. Okay, so far what do we have? We have a proxy layer, which insulates applications from messages they don't care about. We have this hierarchical space of topics into which people can organize their data. Is there anything else interesting about the ARIA design? Other pieces that distinguish it from how Concord works? So one of the things that we also kind of talked about was the messaging protocol and how Concord's got this one giant message protocol that everything needs to adhere to. But in ARIA, like we mentioned,
Starting point is 00:45:02 it's kind of up to the user. And so each user might use a different protocol for a different topic. So you now have lots of different protocols. And in fact, you can even get a negotiation-like system by, say, writing out one protocol of data and then writing out a newer version of it and then letting another subscriber choose which one they want to do. So if I don't know how to read V3 yet, I can fall back to V2 and be happy for the day. Got it. Although in that case, do you have a guarantee that you get things in the same order when you consume data? If what we're doing is we're taking the data that's in different versions and publishing it on different topics, are those necessarily going to be published in the same order? What kind of ordering guarantees do you get in that environment? No, you're right. There
Starting point is 00:45:48 are no ordering guarantees in that sense. You can do your best to publish them one after the other and get them as close to each other as possible. But it's a trade-off that you'd have to make if you were trying to do some sort of an upgrade, or were trying to help service older clients that haven't upgraded yet.
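As a tiny sketch of that fallback decision, with invented names: given the versions a publisher writes (one topic per version) and the versions a subscriber can decode, pick the newest one in common:

```ocaml
(* e.g. best_common_version ~published:[1; 2; 3] ~supported:[1; 2]
   evaluates to Some 2, so this subscriber reads the V2 topic. *)
let best_common_version ~published ~supported =
  List.fold_left
    (fun best v ->
      if List.mem v supported && best < Some v then Some v else best)
    None published
```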
Starting point is 00:46:35 Right. So I guess you can make sure that your messages are published in the same order within a topic. If I have one message and I want to publish it at versions V1, V2, and V3, I could publish the first message at V1, then V2, then V3, and then the next message at V1, then V2, then V3. So within a topic, people will get them in the same order. But if I have one app that consumes V1 together with some other totally unrelated topic for some other state machine, and a different app that reads V2 and that same other state
Starting point is 00:47:15 interprets them differently and start it up, it's going to read the same messages and end up at a different state. And so there is actually like a subtle version attached to the source code itself. Right. And Concord is kind of rolled as one big application that is many different processes, but it's in some ways less a distributed system and more a supercomputer where you'd run a single program on it. And here in the context of ARIA, you're talking about some complicated zoo of many different programs. And so now you open up the somewhat new question of subtly different state machines that are consuming the same stream of transactions, but maybe reading them differently. And all of a sudden, the guarantees we have about the perfection of the replication between different copies is much harder to think about.
Starting point is 00:47:56 That's right. The story there is that, yes, a lot of the burden will fall on the users using the system to make sure that they're configured correctly. And not just configured correctly. And not just configured correctly, but that people are making changes in a reasonable way, that they're making changes that maybe even if they change the state machine, how it runs, they're not changing its semantics in ways that are deeply incompatible. Yeah, exactly. So a thing that you haven't talked at all about is permissioning. If you have one enormous soup of topics,
Starting point is 00:48:25 there's a big hierarchy and like anyone can publish anywhere, it seems like awful things are going to happen. So what's the permissioning story here? Well, we give clients the ability to write to certain topics or certain trees of topics. And within those trees of topics, they're allowed to just create their own topics. And so they can kind of create a whole segment of ARIA where they can read and write to freely. But this is still managed to some degree by the ARIA team. We still want to partition up the topics in a reasonable way. It's not like everyone can just create whatever topics they want. Got it. So part of the central administration of the system is essentially saying users X, Y, and Z, you can use this subtree. Users A, B,
Starting point is 00:49:09 and C, you can use that other subtree and preventing users from stomping on each other that way. Yeah, that's a simple way of putting it. Do you think differently about permissioning reading and writing to a given topic? Well, actually, it's a little interesting because reading is a lot harder to get right. One of the ways that we actually allow, I talked about these proxy right. One of the ways that we actually allow, I talked about these proxy layers. One of the ways that we let you interface with these proxies is with TCP. So you can connect to the proxy
Starting point is 00:49:34 and just get a stream of all the messages over TCP. And it's really simple. However, there's a second way of doing it and that's UDP multicast, bringing multicast back. So we have multicast both inside the core of ARIA and outside at the proxy layer. And the reason and that's UDP multicast, bringing multicast back. So we have multicast both inside the core of ARIA and outside at the proxy layer. And the reason why that's interesting is because you can't really put permissioning on multicast packets. When you're sending them out into the network, there's nothing stopping someone from listening to an address and receiving packets
Starting point is 00:50:00 on that address. The only thing that you can really do is you can make it hard for them to figure out what is that address that they're trying to listen to. I mean, I guess technically you can do at the network layer segmentation of which hosts can this multicast be routed to and which hosts can that multicast be routed to. But it's not easy to hit those kind of controls from the application layer, I would think. That's right. And that is kind of like a management nightmare in terms of switch rules and trying to keep things in sync. And like you said, the ARIA team doesn't really have a lot of ability to dictate what the
Starting point is 00:50:32 networking team does and configures. So is the current model that you have that like you have write permissioning, but anybody can read anything? That is explicitly what we have right now. But we are actually working on adding read permissioning in a sort of like, it is kind of a best effort, trust-based case, which is to say, if you are using our libraries to connect to ARIA,
Starting point is 00:50:55 which you probably are, we will still validate that you are able to read the data before we even start hooking up your listeners to the stream. Right, I think even if you don't have any security concerns about the confidentiality of the data that's in ARIA, you still might want this kind of permissioning essentially for abstraction reasons. Put it another way, some of the topics to which you sequence data is, I imagine you think about as the internal implementation of your system, and you freely version it and change it and think about it in respect to how you roll your own
Starting point is 00:51:25 applications and don't worry about outside users. And maybe there are some topics of data that you publish as a kind of external service that you want people to use. And I would imagine you would dearly like people to not accidentally subscribe to your implementation because you're going to walk in one morning and roll your system and the whole firm is going to be broken because people went and depended on a thing that they shouldn't have depended on. That's exactly right. And we have those things internal to ARIA too. We actually keep a lot of the metadata about the current session of ARIA on ARIA.
Starting point is 00:51:56 It's really quite nice because we get all of the niceties of ARIA from being able to inspect the system at any point in time. And we don't want other people to be able to read and depend on those things for that exact reason. I mean, this points to another really nice thing about these kind of systems I think is not well understood by people who haven't played around with them before is you sort of think of, oh, there are these messages
Starting point is 00:52:14 and we put the data onto the system. Whatever the core transactions, I don't know if you're running a trading system, maybe you put orders and market data and whatever. If you're a Twitter, maybe you use this kind of system when you put tweets on there or something. But the other kind of thing you put on there is configuration data, right? And it's nice to have this place where you can put dynamically changing configuration and have a way of thinking consistently about exactly when
Starting point is 00:52:36 does the new config apply and when does the old config apply. And the fact that you just integrate it into the ordinary data stream gives you this lovely model for talking about configuration in a way that's fully dynamic and also fully consistent across all the consumers of that data. Yeah. The very simple example there is, let's say in the middle of the day, we want to give someone write abilities on a new topic. How do you do that? Well, you just send an ARIA message that says they're now allowed to write to this topic. One thing that strikes me as interesting about all this is there's a lot of work in various parts of the world about this kind of transaction log. We're talking about transaction logs as if it's in some sense kind of a new and interesting idea, but it's the oldest idea in computer science. Like there's a paper from the mid-70s that introduces this notion of distributed transaction logs, and all sorts of things are built on it. Standard
Starting point is 00:53:25 database management systems use very similar techniques. And, you know, 30 years of distributed systems research, and then the entire blockchain world has rediscovered the notion of distributed ledgers and Byzantine fault tolerance. This idea just kind of keeps on being invented and reinvented over and over. But the thing about the ARIA work that strikes me as interesting and new is there's a bunch of work about not so much the substrate of how do you get the ledger to exist in the first place, but how do you use it? How do you build an open heterogeneous system with multiple different clients who are working together? I feel like I've seen less work in that space. I guess in some sense the whole smart contract thing is a species of the same kind of work.
Starting point is 00:54:05 That's a way of writing general purpose programs and sticking them onto this kind of distributed ledger-based system. But ARIA is very different from that, and it's kind of a whole different way and a more software engineering-oriented way of thinking about this. One of the things that's been interesting in following along with all of this work is just seeing the working out of these idioms. And I don't know, I feel like idioms like this in computer science are super powerful and easy to underestimate. I don't know who came up with the idea of a process when designing early operating systems, but that was a good idea.
Starting point is 00:54:34 And it had a lot of legs. I feel like there's a lot of interesting ideas in how you guys are building out ARIA as a thing that makes it easy for people to build new complex applications. It has kind of a similar feel to me. So along those lines, like, how's it going? I know there's a system that's been in progress. You're not the only person who's worked on it. There's been a whole team of people working on like ARIA core technology and it's starting to be used. Like, how is that going?
Starting point is 00:54:57 Like, who's using it and what's the process like of it kind of getting out into real use in Jamestreet? Yeah. so we have a system. It's in production. We have deployments in several regions now. And it's actually being used by teams at Jane Street in a production manner. So I would say so far, so good. Things are looking good.
Starting point is 00:55:18 We specifically don't have too many people using it because we want to make sure that when we get a new client, there's always going to be some new requirements. And we want to make sure that we're able to address them in a relatively straightforward way. And a lot of the decisions that we make, we think really, really hard about because the decisions that we make now are going to be ones that we'll have to live with for a long time. And so when we're thinking about new features, we're constantly trying to figure out ways in which we can do them in the best possible way. On one hand, we are actually missing a bunch of features that would be really nice.
Starting point is 00:55:50 This has kind of worked out just fine for our current clients, but we know that there are upcoming ones where we're going to need some new features. Can you give me examples of features that you know you need, but they're not yet in production? A small example of this is rate limiting. It's a thing that you might expect would be there from day one, perhaps, but it's not. And so a single client can, say, get into a bad state where they're now sending the same message over and over and over again. And we will happily inject those into the ARIA stream as fast as possible. This is not necessarily great because it can actually hurt the performance of the system for other people too. There's lots of isolation,
Starting point is 00:56:32 but there's still limited numbers of processes and what they can do. And so there is some sharing and overlap. Focusing on that for a second, if you have one client who's injecting an arbitrary amount of data into the system, that data is not going to hit the end client who's subscribing to a different topic. But all of the internal infrastructure, the sequencers and the republishers and whatever, all of those interior pieces, the proxies that are sitting there, they have to deal with all the data. And so if you have one injector who's running as fast as it can, it is going to slow down all of those pieces. Even if the actual end client isn't exactly directly implicated, the path by which they receive data is. That's right. And not to mention, we are still currently in a world where we're writing everything, every single byte that we get down to disk,
Starting point is 00:57:15 because we have no idea. Maybe someone has to start up and say, please fill me in from the beginning of the day. We're going to have to go to disk to get that. And so there's actually just also limited disk space. If every client was hitting us with max throughput as fast as possible, we might be concerned about disk. So why aren't you more worried about rate limiting? That all sounds kind of terrible. It is. Yeah, that's definitely the first thing on our list right now to work on. It just happens to be a relatively easier problem compared to another thing that we've been thinking about a lot, which is snapshots. Snapshots tie in a lot with the thing that I actually just said, which is, let's say
Starting point is 00:57:50 your app starts up and it's the middle of the day. It needs to receive messages. The only thing that it can do right now is start at the beginning of the day and read every single message on its set of topics that it needs to read up to the current tip of the stream. And it's maybe worth saying that whole phrase, from the beginning of the day, is kind of a weird thing to say, like, which day, whose day, when does Hong Kong's day start? When does the day start in Sydney? There's different notions of day. And I think this just kind of
Starting point is 00:58:18 reflects the thing you talked about before, about how this technology has grown out of a trading environment where there's a regional concept of which trading region you're in and that there's a notion of a day's begin and end. And like that actually is messy and complicated and not nearly as clean as you might naively hope. It worked much better 15 years ago than it does today. But still, that's a real thing that lots of our systems have is this kind of notion of a daily boundary. Yeah. And to even expand on that, well, first, the thing I was going to say was that obviously that might be really expensive. It might take a lot of time to replay all those messages. And we want to prevent that. Even going further than that,
Starting point is 00:58:59 we are trying to think about a single session as not just a day, but we want to expand it to be a whole week. So like, wow, it's like seven times what we're doing already. And so it would be really bad if an app that starts up on Sunday and an app that starts up on Friday night had completely different performance characteristics because one of them has to replay, you know, maybe hundreds of gigabytes worth of data before it can even do anything. And so this is where we introduce snapshots. Snapshots being, hey, here's a chunk of the state at a given point in time. You don't need to read messages older than this. Just load up the state and start from there.
Starting point is 00:59:35 And snapshots, from a finance world perspective, this is a familiar concept because that's what happens for market data feeds, right? If you go to any exchange, they have two things. They have a reliable multicast stream, very similar to the reliable multicast that we use inside something like Concord or ARIA. They have a gap filler, which is actually part of reliable multicast, and they have a snapshot service. The gap filler is just the thing we talked about as a retransmitter, where you go and request a packet you missed. And then the snapshot service is this other thing, which gives you some summary of the state, which is meant to be enough for you to go forward from there and compute.
Starting point is 01:00:07 In the context of a market data feed, that snapshot usually consists of the set of orders that are currently open and maybe the set of executions that have happened on the day because typically there are way more orders than there are executions. So you don't want to replay to someone all of the orders if you can get away with it. But all of the executions is not so bad necessarily. Yeah, you're touching on how it's actually not as simple as I just described it originally. Because you might need different snapshot techniques for different services. Some of them need to retain more data than others. And some of them you might want to say, hey, this single topic has two different ways of being interpreted. You can interpret it in a
Starting point is 01:00:45 way that you need a really large snapshot in order to continue reading it, or you can interpret it in a way that's more simple and straightforward, like the market data example. Right, and even within market data, you could imagine, let's say I have an application whose goal is to print out the total number of open orders at any moment in time, right? If you have that application and it gets as a snapshot the current state of the book and the set of open orders, that's cool. It can replay, and then from the moment it gets the snapshot, it knows how many open orders there are.
Starting point is 01:01:11 And every further update it gets to the book, it can update that view of the book and therefore update its count, and everything's cool. But if you have an application that needs to count the number of orders that happened on the day, it's just dead in the water with that snapshot. That snapshot does not let it recover its state because just knowing the current set of open orders doesn't tell you how many orders
Starting point is 01:01:27 have been sent in a given day. And it's not like there can't exist a snapshot that would solve the problem. You would just need a different one. So there's a way in which replaying from the beginning of time is a general purpose recovery mechanism. It doesn't matter what you're doing. As long as what you're doing is a function of the stream of data, you can do it on a replay just as you did it live. Whereas snapshots are kind of by their nature to some degree application specific. You can do some kinds of computations with some kinds of snapshots and others you can't. Yeah, I think that's exactly right. What does that mean for providing a snapshot service for ARIA?
Starting point is 01:02:01 Well, this goes back to the thing we have to think very hard and carefully about it. One of the ways that we are approaching the problem is to give clients the ability to build their own snapshot servers. So you can say, I'm a client, and I know that I'm using these topics. And hey, I'm also using these other topics from a different team, I'm going to put them all together into one snapshot server that can compute a snapshot that is tailored to my use case. And by doing that, by empowering clients to build these servers out, you kind of solve the problem at the user level. Got it. But what's interesting is that the thing that we described the exchange is doing is kind of the opposite of that. And it's demonstrably very useful, right? Because I feel
Starting point is 01:02:44 like the market data story highlights that snapshots are application-specific, but they're not completely application-specific. There are actually lots of different state machines that you can reasonably build on top of the kind of snapshot that exchanges provide. So you can do a lot of things, but not everything. So the thing you're describing now
Starting point is 01:03:01 is going all the way direction of the application-specific ideas. Like we are going to give users just the full ability to customize the kind of snapshot they want, and they can remember for restarting whatever it is they want to remember. Each application is on its own. And I guess you can imagine something in between where you allow someone who's providing a general-purpose system to also provide a kind of general-purpose snapshot. Yeah, exactly.
Starting point is 01:03:23 To me, this feels a lot like composing state machines, how if I'm building a service that's using another topic, I might use their state machine to consume that topic. In fact, I must use their state machine to consume that topic. In the same way, if I'm building a snapshot server, I will probably actually use some components that that team has released to compute the snapshot. But I might actually, you know, embellish it in some way, like add some details to it, or maybe compact it in ways that I can get away with to make it simpler. There's lots of degrees of freedom here. And I think composability is your answer to them. Right. And even if the only snapshot mechanism you give people
Starting point is 01:04:00 at the ARIA level is this kind of for your application, you build your own snapshot service. You can just use the ordinary modularity tools, right? You can have a library that somebody wrote for their particular stream that provides you most of the code that you need for computing their part of the snapshot. So it's not like you're totally lost on a reusability perspective, even if you decide to build completely per-application snapshot servers. Exactly. You said before that there's a small number of teams that are using ARIA right now. Can you say more about what teams these are and what kind of things they're building?
Starting point is 01:04:34 Yeah. So Concord and ARIA are part of a department at Jane Street called Client-Facing Tech. And a lot of the other current clients of Aria are also in this same department. And as the name alludes to, we write programs that interface with clients directly. And what we end up dealing with a lot is something called high touch flow, which means that unlike our kind of traditional Jane Street systems, where it's just a lot of computers making decisions about what to do, these are systems that actually involve a lot of human interaction. And so it actually gets a lot more complicated because even though we can automate some things, sometimes we just have to kick it back to a human to look at it and examine
Starting point is 01:05:15 it and decide what to do. And so that's the kind of tooling and infrastructure that we're building around handling new types of orders and being able to automate them in as much of a way as we can. Right, and I guess one of the advantages of that is just the fact that people are kind of organizationally close makes it easier to kind of coordinate and prioritize together while you're still in the phase of building out the core functionality of ARIA itself. That's right, yeah.
Starting point is 01:05:38 Some of these people have also worked on Concord, so they are pulling their prior experience into the new ARIA world. And I guess the other thing that's funny about ARIA is that it really is a different way of programming and it takes some time to get used to, which to say a straightforward thing is in some sense, just a misfeature, right? It's great to learn things, but also having a system that requires people to learn stuff is kind of bad on the face of it. So I think you think, and I think we probably think that like, it's worth learning it. It's a valuable thing to learn, but it's also
Starting point is 01:06:04 hard. It's a different way of thinking about designing your system. And by starting with people who are near at hand, it's easier to slowly acculturate them and teach them how to approach the problem in the way that you think it needs to be approached. Have you thought much about what the process goes like of evangelizing something like ARIA as you try and reach out to the firm more broadly?
Starting point is 01:06:22 Like, Jane Street's kind of big now. We have like 1,700 people, hundreds and hundreds of engineers across's kind of big now. We have like 1,700 people, hundreds and hundreds of engineers across a bunch of different teams. How do you think about the process of teaching people a new way of approaching the architecture of their systems when the people are more spread out and people like you don't know and talk to every day?
Starting point is 01:06:38 Yeah, I think one of the things that we've learned the most about Concord is that there are patterns that arise. Concord has, I mentioned, hundreds of different processes running, and also several dozens of them are completely different types of applications. And what this means is that over time, we've kind of converged on a few patterns. For example, we have something that we call the injector pattern. It's a kind of app that just consumes some external data and puts it into a
Starting point is 01:07:06 format that it can put onto the stream and make sure the stream is up to date with the external world. It's a very simple thing once you actually sit down and get a chance to do it. And so really what I want to do is I want to take these patterns and encapsulate them in documentation and examples and maybe even a sort of, let's call it a boot camp, where in order to learn how to do things, you have to implement your own injector. And with the extensive testing that we can build around it, you get immediate feedback over whether it was working or not.
Starting point is 01:07:38 So for something like this kind of simple injector, which the way you describe it sounds very simple, what are the kind of mistakes that you think someone who comes to ARIA not having really learned about it and doesn't have the advantage of the examples or the bootcamp or the documentation, what kind of things are they going to get wrong when they try and build an application that does this injecting? Going back to the inversion of control thing that we talked about way earlier on, the applications themselves do not send messages. They instead wait for something to say,
Starting point is 01:08:04 what is the next message you want to send? And then they propose it. However, there's kind of something subtle there, which is there's a chance that they just keep proposing the same message over and over. At some point, you should be updating your state in such a way that you change your mind about what you want to propose. So like, let's say I want to sequence message A. Once A is sequenced, I probably don't want to sequence it again.
Starting point is 01:08:27 I probably want to sequence B. But there are certain classes of bugs where you actually don't update your state correctly. And now you're just stuck sequencing A over and over and over again. And that will just continue to infinity. By the way, I always think of this way of styling an application
Starting point is 01:08:42 as writing your application as a problem solver, right? So the application is doing stuff, interacting with the outside world, and it decides that there's a problem with the world. The problem is it knows about some order that the stream doesn't yet know about. And then that means it wants to tell the stream about that. And so when asked what message you want to send, it says, oh, here's a message that will tell you that this order existed in the outside world. And then, as you said, once you receive the message that said that thing has occurred, well, now you don't want to send it anymore because the problem's been resolved. And what's nice about that is you don't care who resolved the problem, right? You might have resolved it. Somebody else might have resolved it. Maybe you have a whole fleet of
Starting point is 01:09:15 things that are actively going off and trying to resolve these problems. And you get to be agnostic to who the solver is. And it means if you crash and come back and whatever, and the problem's solved, you don't care. You don't see the problem there anymore. But yeah, the problem with this system is like the hypochondriac who thinks there's always a problem. Whatever's going on, there's constantly a problem. It's constantly trying to send a message saying resolve a problem that isn't really a problem anymore. And that's how you get the infinite loop bug. Yeah, that's a great way to put it. One of the things you can do there is you can create documentation to explain things to people.
Starting point is 01:09:44 Is there also just some notion of trying to bake in abstractions that help people build these applications in ways that avoid some of the common pitfalls? Yeah, totally. I'm a big fan of abstractions myself. And so an example of an abstraction that we built was that you can take a pipe, which is a kind of core library data structure that lets you push things onto one end of the pipe and then lazily pull them off. And you can just attach that into an ARIA app as a state machine where pulling them off means injecting them into the ARIA stream, sequencing them. And that's kind of like you suddenly can use an interface that you're familiar with, and it's going to do the right thing of only sending each thing once.
Starting point is 01:10:25 Got it. So we're constantly looking at code that people are writing and looking for places that we can build abstractions around the patterns or anti-patterns that they are doing. In terms of how broadly it's used at Jane Street, what are you hoping to see out of ARIA a year from now? If you look back a year from now, what will you count as a successful rollout of the system? Well, I would love to see more users on it. I'm assuming that we're going to see at some point in the near future, a user that has a much higher bandwidth requirement
Starting point is 01:10:53 than other users. So starting to push the system to its limits and forcing us to get a little bit more clever with how we shuffle those bytes around internally in order to both be fair to all of the clients that are using it and scale up past the point that we need to, to satisfy everything that's going on. So I think that's some of our biggest concerns are going to be those of making sure we can scale the system. Right. And in fact, there's a lot of headroom there, right? There are many things that we know that we could do to make ARIA scale up by quite a bit. That's right. Yes, we have lots of ideas. It's a matter of implementing them at the right time, as in before they're needed direly. But not too far before, so you have time to build other features for people too.
Starting point is 01:11:35 Exactly. Yes, it's all a balancing act and no one really knows what they're doing. All right. Well, maybe that's a good point to end it on, a note of clear humility. Thanks a lot, Doug. This has been really fun. Thanks again for having me. You'll find a complete transcript of the episode, along with links to some of the things that we discussed, at signalsandthreads.com. Thanks for joining us, and see you next time. Signals and Threads
