Signals and Threads - Solving Puzzles in Production with Liora Friedberg

Episode Date: October 7, 2024

Liora Friedberg is a Production Engineer at Jane Street with a background in economics and computer science. In this episode, Liora and Ron discuss how production engineering blends high-stakes puzzle-solving with thoughtful software engineering, as the people doing support build tools to make that support less necessary. They also discuss how Jane Street uses both tabletop simulation and hands-on exercises to train Production Engineers; what skills effective Production Engineers have in common; and how to create a culture where people aren’t blamed for making costly mistakes.

You can find the transcript for this episode on our website. Some links to topics that came up in the discussion:

- More about production engineering at Jane Street, including how to apply.
- Notes on site reliability engineering in the wider world.
- Alarm fatigue and desensitization.
- Jane Street's 1950s-era serialization format of choice.
- Some games that Streeters have used for training people to respond to incidents.

Transcript
Starting point is 00:00:00 Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. All right, it is my pleasure to introduce Liora Friedberg. Liora is a production engineer, and she's worked here for the last five years in that role. Liora, welcome to the podcast. Thank you for having me. So just to kick things off, maybe you could tell us a little bit more about what is production engineering at Jane Street. So production engineering is a role at Jane Street. It is a flavor of engineering that focuses on the production layer of our systems, which is a pretty big statement. And I can definitely break that down. But I will say the motivation here is that Jane Street is writing software that trades billions of dollars a day. And so it's
Starting point is 00:00:45 important that that software behaves as we expect in production, right? And if it doesn't, we want people to notice right away and to address what's coming up. Production engineers have support as a first class part of their role. So when we are on support, we are the first line of defense for our team. And we are responding to any issues that arise in our systems during the day, whether that be from an alert or from a human raising some behavior that they observed to us. And we are really tackling those issues right away. And I guess I will say as a clarifying bit here that that is not really the same thing as kind of being on call overnight or on the weekends. This is really like during the trading day, you are present and responding to issues that
Starting point is 00:01:27 are popping up live. And then, of course, software engineers do this type of work, too. So the lines are a bit blurry, but roughly, I would just say this is a first class part of your role as a production engineer. So that is one big chunk. And then the other chunk of work as a production engineer is longer term work to make your response to these issues better in the first place and also make it less likely that you even need to respond. Sometimes production engineers do work that looks very similar to that of a software engineer.
Starting point is 00:01:56 So say you might like build an OCaml application that helps users self-service some requests that they currently come to your team for. Some production engineers, they might have roles that look pretty different from a software engineer, and maybe they're spending a lot of their time off support thinking about processes and how we can respond to massive issues in a more efficient, effective way.
Starting point is 00:02:17 So off-rotation work is much more varied depending on the engineer's interests and skillset and the team that they've been placed on. But they all share that overarching goal of making our support story and our production story better. This sounds similar in spirit to the site reliability engineer role that Google popularized over time, and similar production engineering roles in other places, where the core thing that it's organized around is the live support of the systems. But it's not just the activity of doing the support.
Starting point is 00:02:45 It's also various kinds of project work around making that support work well. So what does the split look like? What amount of people's time is spent sitting and actively thinking about the day-to-day support? And how much time is spent doing these projects that make the support world better? Yeah, it's going to vary a bit by team, but I would say roughly between a quarter and a third of your time, you'll actually be on rotation for your team. And then for the rest of your time, you'll be doing that longer term work. And I'm a little curious about how you think about this fitting in with software engineering, which you mentioned, it's not a sharp line between the two. And certainly software engineers here also do various kinds of support. So what does the difference here boil down to?
Starting point is 00:03:26 Yeah, that's a good question. And I think it is, again, a little bit in some ways challenging to answer because there are production engineers who look eerily similar to a software engineer. And then there are production engineers where that difference is much starker. But I guess I would say, roughly, software engineers typically tend to be experts in few systems. And they're going to know right down to the depths those systems really well. And typically, production engineers will have a really strong working mental model of a broader set of systems and how all those systems fit together. And I think it's the same way that
Starting point is 00:04:03 we have other types of engineers and roles embedded on the same team. You might have a team with software engineers, production engineers, a PM, a UX designer, etc. Everyone is tackling that same team goal with a bit of a different perspective on it. And I think that that is really how you get that excellent product in the end. So can you tell us a little bit more about what your path into production engineering was like? Yeah. So I studied computer science and economics in college. And after college, I think looking back, it's obvious what path I was going to take. But at the time, I truly could not decide and wanted to try everything. So I went into consulting,
Starting point is 00:04:41 as many do after college. And I think the problems were actually very interesting, but I wasn't motivated by the problems themselves. And I think I also just wanted a bit more of a work-life balance. And for various reasons, it didn't feel like the right fit. And I already knew about Jane Street because I had taken an OCaml class in college. And my TA, who I thought was really smart and cool, had gone on to Jane Street. And so I thought maybe I should apply there too. Shout out to Mayer. I think he's still here. Indeed. But so I applied and Jane Street reached out to me and said, Hey, you applied to software engineering, but we actually think you might be more interested in production engineering. And this is just because I had a bit more of an interdisciplinary background than kind of the typical CS grad applicant.
Starting point is 00:05:33 And I said, what is that? And they kind of explained it to me as I'm doing to you now. And I thought it sounded interesting. So I went for it. And that was all about five years ago. And you started in this role. You've been doing it for a long time. You enjoy it.
Starting point is 00:05:46 What is it that you find appealing about production engineering? Why do you like it so much? So I guess I can say why I like support. I think that is kind of, to me, the real differentiating factor. Although I do enjoy my longer-term work as well. But I think, for me, when you're on support, it's kind of like you're on a puzzle hunt. You put your detective hat on, if you will, and you're sleuthing around and trying to build a story and find the answer to this unsolved puzzle.
Starting point is 00:06:11 And you might have to look at a bunch of different places and gather evidence and form a hypothesis. Sometimes you have the aha moment when you discover what's going on, and that feels really good. And it's kind of just brain teasers all day. And that can be really fun. Also, this might sound a little mushy, but you're just helping people all day, which can feel really good, right? Like people at Jane Street are nice and they're going to be really grateful that you helped make their day go better by solving this problem that was clearly enough of an issue that they messaged you about.
Starting point is 00:06:37 And then at the end of the day, maybe you helped 15 people, right? That just is really rewarding in kind of a short-term way. I don't know. That doesn't sound too mushy. I think the human aspect of getting to like interact with a lot of people, understand their problems, that's exciting. There's this other point you're making about debugging itself, but not just debugging of, you know, a single program gone wrong, but debugging of a large and complicated system with lots of different components, lots of different human and organizational aspects to it. Sometimes you're debugging our system. Sometimes you are debugging the systems that we are interacting with, of clearing firms and exchanges and all sorts of places. So there's a lot of richness
Starting point is 00:07:12 to the kind of problems that you run into when things go horribly wrong. Yeah, definitely. So can we make this all a little more concrete? Like what's an example of a system that you've worked on and supported? And maybe we can talk about some of the kinds of things that have gone wrong. Yeah, definitely. So I sit on a team that owns applications that process the firm's trading activity. So they ingest the firm's trading activity, parse it, normalize it, group it, and then ship it off downstream. So trades you can think of as going through a pipeline of systems that generally live on my team. And those are the applications that I've been supporting for years.
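(To make the shape of that pipeline a little more concrete, here is a minimal, hypothetical OCaml sketch of the ingest, parse, normalize, group, and ship steps described above. The types, field names, and the publish function are invented for illustration and use only the OCaml standard library; this is not Jane Street's actual code.)

```ocaml
(* Hypothetical, heavily simplified post-trade pipeline: ingest raw trade
   records, parse them, normalize them into a canonical form, group them,
   and ship each group downstream. All names here are illustrative. *)

type raw_trade = { payload : string }

type trade = {
  symbol : string;
  counterparty : string;
  settlement_system : string;
  quantity : int;
}

(* Parse a raw record of the form "SYMBOL,COUNTERPARTY,SETTLEMENT,QTY". *)
let parse (raw : raw_trade) : trade option =
  match String.split_on_char ',' raw.payload with
  | [ symbol; counterparty; settlement_system; qty ] ->
    (match int_of_string_opt qty with
     | Some quantity -> Some { symbol; counterparty; settlement_system; quantity }
     | None -> None)
  | _ -> None

(* Normalize into one consistent representation for downstream consumers. *)
let normalize (t : trade) : trade =
  { t with
    symbol = String.uppercase_ascii t.symbol;
    counterparty = String.trim t.counterparty }

(* Group trades by some key, e.g. the settlement system they must go to. *)
let group_by_settlement (trades : trade list) : (string * trade list) list =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun t ->
       let existing =
         Option.value (Hashtbl.find_opt tbl t.settlement_system) ~default:[]
       in
       Hashtbl.replace tbl t.settlement_system (t :: existing))
    trades;
  Hashtbl.fold (fun key ts acc -> (key, List.rev ts) :: acc) tbl []

(* Stand-in for shipping a group downstream: a database, a Kafka topic, a bank. *)
let publish ~(destination : string) (trades : trade list) =
  Printf.printf "sending %d trades to %s\n" (List.length trades) destination

let run_pipeline (raws : raw_trade list) =
  raws
  |> List.filter_map parse
  |> List.map normalize
  |> group_by_settlement
  |> List.iter (fun (destination, trades) -> publish ~destination trades)
```

The real systems are of course far richer, but the parse, normalize, group, and publish shape is the part being described here.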
Starting point is 00:07:47 Maybe it's worth talking about what is the system for? Like I get the idea of it ingests our trading activity and processes it, normalizes it, tries to put it into a regular representation and then ships it off to other things. But why? What are those other things for? What are we achieving with this whole system? So concretely, where do they go after they are processed by our pipeline? They're going to go in a giant database of all of the firm's trading activity over time. They're also going to go to a Kafka topic that people can subscribe to, to read the firm's trading activity. They're going to go to our banks and each of these are going to have a different purpose, right? So for example, sending to our banks, that's really important
Starting point is 00:08:24 because unless we do that, our banks aren't going to know what to make happen in the real world. And so that is really critical. Or if you think of writing them to downstream systems, those downstream systems are going to want to process the firm's trading activity in a really consistent format. They're going to want one central source for all of their calculations. And what systems will care about that, right? Good examples of this are things like: people have all sorts of live monitoring and tracking
Starting point is 00:08:48 of trading. Traders on the trading desk want to see the activity scroll by for the given trading system or want up-to-date calculations of the current profit or loss of a given trading strategy. And the way they get that live representation of what's going on with the trading we're doing, it's by subscribing to these upstream systems that collect, normalize, fix any problems with the data, and distribute it on to clients. So it's really connected to the beating heart of the trading work that we do. So that seems like a pretty important system. What are the kinds of issues that you run into when supporting a system like that? Yeah, so something we get more routinely is a new type
Starting point is 00:09:24 of trading activity hitting our system that we haven't seen before. So for me, I think of a type of trading as the collection of all the fields that that trade has. So I mean, the date the trade was booked on, the counterparty it's with, the settlement system on the trade, and a bunch of other finance-y words that are attached to the trade that can take on different values. And so when I say a new type of trading enters our system, I mean, a trade with that same collection of fields and values has not appeared before in our systems. And so then when our system is looking at the trade and maybe, say, trying to match it up against some configuration files,
Starting point is 00:10:00 there's no path for that trade through the system because that collection of fields has not been seen before. And the reason it might not have been configured is there are concrete decisions that have to be made that haven't been made. Like each new settlement system that you mentioned, a settlement system, these are like the back-end systems that involve the rendezvous point between different people who are trading the same security and the way in which the shares flow from one person to another. When someone trips into some new kind of trading, there's an actual human process of thinking about, well, actually, how do we need to handle this particular case? Yeah. And it's this fun collaboration between business and tech, because of course, the actual decisions that we're making that you're describing, my team is not going to be typically best placed to make those calls because we just lack some of that business context that the amazing operations team will have. And so they will give us the information that we will try to translate into the technical
Starting point is 00:10:48 language that our system will understand. Although we are doing work to try to make that self-service. In some ways, you have maybe less of the context about that than people on the operations team, but also a lot more than most other people do. I sort of feel like people who are in this production role learn an enormous amount about the nooks and crannies of the financial system and how the business actually fits together. In some sense, I think that's some of what's exciting about the role is you really get to think not just about the technological piece, like that's really key, but also about how this all wires up and connects to trading.
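(As a toy illustration of what a "new type of trading" means in this discussion: a trade's type is just the combination of a few of its fields, and a combination that doesn't appear in the configuration has no path through the system. The OCaml sketch below is hypothetical; the field names, the config representation, and the error message are invented for illustration.)

```ocaml
(* Hypothetical sketch: a "type of trading" as the combination of a few
   fields, checked against a configured set of known combinations. *)

type trade_kind = {
  counterparty : string;
  settlement_system : string;
  product : string;
}

type route = { destination : string }

(* Stand-in for the configuration files mentioned above: a mapping from
   known field combinations to where those trades should be routed. *)
let config : (trade_kind * route) list =
  [ ( { counterparty = "BROKER_A"; settlement_system = "SETTLE_X"; product = "equity" },
      { destination = "equity_pipeline_x" } );
    ( { counterparty = "BROKER_B"; settlement_system = "SETTLE_Y"; product = "equity" },
      { destination = "equity_pipeline_y" } );
  ]

(* A trade whose field combination isn't configured has no path through the
   system; in practice this is where an alert would fire and a human, with
   help from the operations team, would decide how it should be handled. *)
let route_for (kind : trade_kind) : (route, string) result =
  match List.assoc_opt kind config with
  | Some route -> Ok route
  | None ->
    Error
      (Printf.sprintf
         "unconfigured trade type: counterparty=%s settlement=%s product=%s"
         kind.counterparty kind.settlement_system kind.product)
```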
Starting point is 00:11:19 Yeah, definitely. And I think there is this just breadth and variety in what you learn. I mean, that to me is another thing that I enjoy about the role because, I mean, maybe it was obvious by not being able to pick an industry to go in at the beginning, but I think I like to see and touch everything. And when you come in, you're not quite sure what's going to pop up and what you'll learn about that day. And so there is this variety that is consistently present in your life as a production engineer. Okay.
Starting point is 00:11:43 So one new thing that can happen, as you said, there's a new trade flow that shows up, a new collection of attributes that says, ah, we actually don't know how to handle this and we need to figure it out. And you're going and talking to legal and compliance and operations people and the traders and trying to put all that together. What other kinds of things can go wrong? So I think that that is a pretty common example. I guess to get some sense of the other side of things, when it's less common or more extreme, you might have something that we call an incident at Jane Street, which is basically a big issue that might impact multiple teams, something you might be somewhat stressed about not being resolved and have all hands on deck
Starting point is 00:12:20 to address. And that will come up too, as it will for almost every team at some point. So I think, for example, this year, we had a case where a human had booked some trades and had manually modified the trade identifiers. And then when they reached our system, we actually tried to raise errors. And then the system crashed on the error creation, which is kind of funny. And so then our system is down. And that's not great because trades cannot get booked downstream. And so we had to go find the cause of the crash, get it back up. Thankfully, we already had a tool that let you remove trades in real time from our stream of trades. So we did that to the malformed trades and then reflected on everything that had happened. But that is kind of a more drastic example of trade data flowing through our system that our system doesn't
Starting point is 00:13:12 expect or doesn't like that we then have to handle on support. And how do those different incidents differ in terms of the time intensity of the work that you're doing? So for my team, we do generally have time to dig into issues and get to the root cause and talk to human beings about what should happen. And then once in a while, you will have something like the latter example where something falls over and it's really closer to crunch time in that case. But there are teams where that is much more common. So for example, our order engines team, which is a team that owns applications that send trades to the exchange, super roughly. I'm sure you can give a better definition, but
Starting point is 00:13:52 that team has support that is quite urgent and they might be losing, I don't know, a million dollars in 10 minutes. I just made up that number, so take it with a grain of salt, but I don't think it's crazy. And so there, it's really high energy, fast paced, adrenaline pumping. And some people find that thrilling and thrive in that environment. And that support is much more urgent than the typical case that I see on support. It's worth saying, most of the time when you're like, oh, you know, we are losing money every minute this isn't fixed, it's overwhelmingly opportunity cost, i.e. there is trading we could
Starting point is 00:14:25 be doing and something is down, so we can't be doing that trading and we are losing the opportunity of being able to make the money of doing those trades. In the order engines world, which is this intermediate piece that translates our internal language of how we want to talk about orders and stuff within our system, and then translates those to whatever the exchange or broker or venue language is, that's fundamental connective tissue, and it's in the live flow of trading. And so if that thing is down, you can't trade right now. Whereas the stuff that you're talking about is more post-trade. It's about what we do after the fact. And there we have time to adjust and fix and solve problems
Starting point is 00:15:01 at a longer timescale because it doesn't immediately stop us from doing the trading. It's just getting in the way of the booking stuff that's going to need to happen, say, by end of day. Yeah, exactly. Although I will say there are monitoring things that are like pretty critical. So the stuff that kind of falls in between those two things, if those systems are totally down, then we are totally down. We cannot trade without monitoring. That's just not safe. Yeah, I actually used to work on our monitoring software closer to when I joined Jane Street. I learned a lot because that system is very redundant and robust. And the review process is very intense and the rollout process is pretty intense, as it should be, because if something
Starting point is 00:15:39 goes wrong, it can really impact our ability to trade. Yeah, so maybe we should dig into that a little bit. I think one of the things I'm really interested in is what are the technical foundations of doing support in a good way? What are the tools that we need to build in order to be effective at support? And you've worked on some of those. Maybe you can tell us a little bit about the work you've done in that context as part of the project side of your work. Yeah, so for the team that I just referenced, I joined that team somewhat early on, maybe about a year or two in, because I was using that software every day, as were my teammates. And we had strong opinions about the features we wanted to see. And of course, we are not the only team
Starting point is 00:16:18 using that software. It's used firm-wide. So we decided to dedicate our time and put me as an engineer on that team to advocate and work on the features that we wanted. So, for example, I really wanted the ability to be able to snooze pages from this system. And so that is something that I added to our monitoring software. And now still to this day, I will snooze pages that I get. Using the code that you yourself wrote. Yes, it feels great. And just to say, there's actually lots of different kinds of monitoring software we have out there. The particular system we're talking about, I think, is called Oculus,
Starting point is 00:16:52 and it's very specifically alerting software. So this is like, you have a whole bunch of systems that are doing stuff. It detects that something has gone wrong, and it raises a discrete alert, which needs to be brought to the attention of some particular set of humans. And it's like a workflow system at that point. Something goes wrong, there's a view that you have and something pops up in your view that tells you the thing that has gone wrong. And now you have various operations that you can do to handle those. This alerting and the resulting workflow management is like really critical to the whole support role because part of what you're doing is paying attention to lots of things and being able to handle and not lose track of those
Starting point is 00:17:31 things that go wrong in the middle of a potentially hectic time where lots of things are potentially going wrong and there are lots of things that need your attention and there's a lot of people who are collaborating together to work on the support and management of that system. Yeah. And I think you do see this pattern where support team members are often super users of some technology. For example, Oculus, the one that we're talking about. That same team put out a tool that lets you auto-act on pages that come up. And we actually had some production engineers make our own version on top of that with state so that it was more expressive and you had the ability to take more actions automatically. And that's something that most teams probably don't need, but production
Starting point is 00:18:15 engineering teams really find valuable. Right. And I think one of the lovely things about the fact that a lot of the production engineers, not all of them, but a lot of them spend a lot of time doing software engineering is it means the tools that they're using most intensely are tools that they can shape to their own needs, right? So you're not merely stuck as the victim, just using the system in whatever way it happens to work right now. When you find things that are wrong, you can dive in and make those things better. Yeah. And you also might be the only team that wants a tool. So sometimes there will be a tool that you can build off of at the firm level, but sometimes you will come up with it from scratch, right? So I think as an example, we have been talking as production engineers about the health
Starting point is 00:18:55 of our rotation. What percentage of alerts are we actually actioning? And what percentage are we sending onwards or snoozing, or what percentage close themselves? And what percentage of alerts that I'm seeing are from this system versus that system? And all these questions are well and good to ask, but it's way better if you have technology that helps you answer that question. So there were a few production engineers this year who built a system that analyzes your team's alerting history and tries to answer those questions for you so that you can take action and improve the health of your rotation more actively. So having tools that let you statistically analyze the historical behavior of alerting systems and find anti-patterns and things that are going wrong and where you can focus on improving things sounds great. I'm curious a little more concretely, what are the
Starting point is 00:19:38 kinds of things that you find when you apply a tool like that? What are the anti-patterns that show up in people's alerting workflows? I think you kind of get desensitized sometimes when you're on support. So I would say one example is that often you'll find that there's just a lot of flickering going on in your view, where pages will open and close, and you did nothing to resolve them; they close themselves. And probably that is taking up mental space because you saw the thing raise and close itself and you never even needed to see it. And probably sometimes you start to look into it before it just closes itself. And so that's an example of a case where we should go improve the call site of the alert and make it such that it does not show up on your screen at all. So maybe that means there's some threshold you need to make longer, or maybe there's something fundamentally wrong with the alert itself.
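(Here is a small, hypothetical OCaml sketch of the kind of analysis such a rotation-health tool might do over a team's alerting history: measure the fraction of pages that closed themselves with no human action, and rank alert names by how often they flicker. The record fields, the notion of an "action," and the threshold are all invented for illustration.)

```ocaml
(* Hypothetical alert-history analysis: flag "transient" pages (opened and
   closed with no human action) and rank alert names by how often they
   flicker, to decide which call sites are worth fixing first. *)

type page = {
  alert_name : string;
  opened_at : float;          (* unix timestamps, for illustration *)
  closed_at : float;
  human_action_taken : bool;  (* owned, forwarded, snoozed, ... *)
}

let is_transient ~(max_duration : float) (p : page) =
  (not p.human_action_taken) && p.closed_at -. p.opened_at <= max_duration

(* Fraction of pages that were transient: a rough noise indicator. *)
let transient_fraction ~max_duration (history : page list) =
  match history with
  | [] -> 0.
  | _ ->
    let transient = List.filter (is_transient ~max_duration) history in
    float_of_int (List.length transient) /. float_of_int (List.length history)

(* Per-alert flicker counts, sorted descending: the alerts most worth
   revisiting, whether by lengthening a threshold or rethinking the alert. *)
let flicker_counts ~max_duration (history : page list) =
  let counts = Hashtbl.create 16 in
  List.iter
    (fun p ->
       if is_transient ~max_duration p then begin
         let n = Option.value (Hashtbl.find_opt counts p.alert_name) ~default:0 in
         Hashtbl.replace counts p.alert_name (n + 1)
       end)
    history;
  Hashtbl.fold (fun name n acc -> (name, n) :: acc) counts []
  |> List.sort (fun (_, a) (_, b) -> compare b a)
```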
Starting point is 00:20:26 What do you mean by the call site of the alert? So there's some code that is literally calling a function in our monitoring library that is raising this thing on your screen. And it's possible that that spot in the code needs to be refactored such that that alert is not causing noise on your screen. Right. And I guess this points to an architectural point about the way in which these alerting systems work, which is you could imagine, and in fact, we have this kind of system too. You could imagine that like individual systems just export bland facts about the world. And then outside of the systems, you have things that monitor those facts, look at metrics exported by the system and decide when something is in a bad state. But that's not the only thing we have. We also have systems that themselves have
Starting point is 00:21:08 their own internal notion, separate from any large-scale alerting things of like, oh, something bad has happened. They see some behavior, you get down some series of conditionals inside of your code and like, ooh, I never wanted to get here, this is uncomfortable. And then like you raise an alert. And I think that's a common case that I see even now. I think this kind of more metrics-based alerting thing is something that we are actually growing more now. And the historical approach has much more been you have the internal state of maybe some trading system and it sees a weird thing and it flags that particular weird thing. So when you want to go to fix a thing, you often have to go many steps back from the alerting system all the way back into the core application that is the one that is doing the activity and that found the bad condition. Yes, although I recently saw that
Starting point is 00:21:49 they added a feature to our software where you can jump from a page to the call site itself in our code base, which is pretty exciting. Oh, that's cool. So you could like hit a button and it'll just bring up the source for the particular thing that raised the alert. Yeah. Oh, that's awesome. One problem you're talking about here is desensitization, right? You get too many alerts thrown at you. After a while, you just can't see them anymore. And then people aren't reacting. How does desensitization show up in the statistics that you gather? So, I mean, I think that would show up as a higher proportion of alerts that are transient, which is what we call them. So they close themselves without anyone taking action on them. We'll have recordings of what actions people took on a page, right? Did they own it? Did they send it elsewhere? Did they remove it from their view entirely? If so, for how long? And maybe you'll have an alert
Starting point is 00:22:37 that opened and closed within a short time frame with no one doing anything to it. And probably that's an example of an alert that should be looked at. So one indicator of noise is stuff that flickers. Are there other things that are markers of noise? Yeah, so stuff that flickers is one, but I think if you're routinely, say, snoozing an alert, that is probably an example of the alert not behaving quite as you intended. So snoozing can be a really powerful tool,
Starting point is 00:23:02 but I remember when we were adding the ability to snooze to Oculus, we had conversations and we were thinking to ourselves, is this enabling bad behavior? And we ultimately decided, no, you should trust Jane Streeters to use snoozing with thought and care. But we were wondering, you know, if people can just snooze alerts, will they not care as much about addressing the problem with the alert itself? And so I think any action that is not owning and then resolving could conceivably be noise. If you are sending it to someone else, it should probably go to that person in the first place. If you are snoozing it, then why are you snoozing it? Should it have raised later? All those types of questions can come up if the action isn't the pretty straightforward one of owning and resolving. Got it. That all makes sense. Another bad behavior in the context of alerting systems that I often worry about is the temptation to win the video game. There's a bunch of alerts that pop up and then there are buttons you can press
Starting point is 00:23:55 to resolve those alerts in one way or another. And one of the things that sometimes happens is people end up pushing the buttons to resolve the alerts and clear their screen of problems without maybe engaging their brain 100% in between and actually understanding and fixing the underlying problem. I'm curious, to what degree have you seen that video game dynamic going on, and how do you notice it, and what do you do about it? I will say, for me, I think what even makes a good production engineer in the first place is a healthy dose of self-reflection for every issue that pops up. The goal really should not be to clear it from your screen after every issue. There are so many questions you can ask yourself to try to fight against this tendency of just close the alert. Right. You should ask yourself, did it need to raise in the first place at all?
Starting point is 00:24:43 Could I have mitigated the impact for the next time it raises? Could I make the alert clear for the next person who has to look at it? Could I automate part of the solution? So one of the techniques you mentioned in there, which is close to my heart, is this automation. You see lots of things are going wrong. You can build automation to resolve, filter, make the alerts more actionable, and get rid of a lot of that noise. And that's an important, necessary part of having a good alerting system. You need people to be constantly curating and combing over the list of things that happen. You want a very high signal-to-noise ratio.
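(As a sketch of what that kind of automation might look like, here is a hypothetical OCaml auto-triage rule with one simple guard against over-silencing: anything that doesn't match a known-benign pattern, or that fires suspiciously often, still goes to a human. The rule shape and names are invented for illustration and are not a real Jane Street tool.)

```ocaml
(* Hypothetical auto-triage: auto-resolve alerts that match a known-benign
   pattern, record the reason, and escalate to a human once a pattern has
   fired suspiciously often in a day. Everything here is illustrative. *)

type alert = { name : string; message : string }

type disposition =
  | Auto_resolve of string   (* reason, recorded so it can be reviewed later *)
  | Escalate_to_human

type rule = {
  matches : alert -> bool;
  reason : string;
  max_per_day : int;         (* guard against silencing a real problem *)
}

let rules =
  [ { matches = (fun a -> a.name = "stale-heartbeat" && String.length a.message < 200);
      reason = "known benign: heartbeat recovers on its own";
      max_per_day = 20 };
  ]

let fired_today : (string, int) Hashtbl.t = Hashtbl.create 16

let triage (a : alert) : disposition =
  match List.find_opt (fun r -> r.matches a) rules with
  | None -> Escalate_to_human
  | Some rule ->
    let count = 1 + Option.value (Hashtbl.find_opt fired_today a.name) ~default:0 in
    Hashtbl.replace fired_today a.name count;
    if count > rule.max_per_day then Escalate_to_human
    else Auto_resolve rule.reason
```

Even in a sketch this small, most of the interesting decisions live in the guards and in keeping a reviewable record of what was auto-resolved, which is exactly the over-silencing worry that comes up next.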
Starting point is 00:25:16 When an alert goes off, you want it to be as meaningful as possible so humans are trained to actually care about what the alerts say and to respond to them. There's a danger on the other side of that too. If you have a lot of powerful automation tools, those themselves can get pretty complicated and pretty hairy and sometimes can contain their own bugs where you've tried to get rid of a lot of noise. And in the end, in the process of getting rid of a lot of noise,
Starting point is 00:25:38 have silenced a lot of real things that you shouldn't in the end have silenced. And the scary thing about that is it's really hard to notice when the mistake is nothing happened. You've over-silenced things. So I'm curious how you feel about the role of automation and what do you do to try and build systems that make it possible for people to automate and clean things up
Starting point is 00:25:58 without encouraging them to over-automate in a way that creates its own risks? Yeah, that's a really good question. And I'm not sure I have an amazing answer. I think at Jane Street, we generally try to expose a lot, at least to the expert user. So applications will often expose their internal state via command line tools. And so they should be relatively easy. Maybe should is the wrong word, but they are often relatively easy to hack on or build some automation around because Jane Streeters will have exposed some
Starting point is 00:26:32 helpful commands for you to interact with the state of their system. And I think that is great. And then I think the question is, how do you make sure you're not doing that too much? And I don't think I have a good answer. I guess code review is one thing. Hopefully someone is reading the changes you've made and agrees that they are sane. But I think also triage and prioritization is something that production engineering teams talk a lot about. I think, for example, on my team, at the end of every support day, people write a summary of all the things that they worked on during that day, all the things that popped up live. And that will often generate a discussion in thread about things that we could improve. And then I think it's really survival of the fittest there, where the
Starting point is 00:27:15 ideas that have the most traction and get people the most excited are the ones that will be picked up. I think part of what I'm hearing you say here is that you need to take the stuff that you build, that's automation around the support role, and take that seriously as software and take the correctness of that software seriously. In some ways, it's easy to think, oh, this is just like the monitoring stuff. Getting it wrong isn't that important, but it's actually really important, right? If you get it wrong in the direction of under-filtering, then in the end it gets filtered out anyway, because the person who's watching can't pay attention to all of it and just stops paying attention. And if you over
Starting point is 00:27:49 filter, well, now you hide important information from the people who need to see it. Even this stuff is surprisingly correctness critical and you have to take it as its own pretty serious engineering goal. So we've talked a lot about one important phase of support, which is the discovery and managing of alerts, things that go wrong. That's not the only thing you need to do in support. A thing you were mentioning before is this debugging and digging and trying to understand and analyze and explore.
Starting point is 00:28:13 I think that kind of work rewards things in the space of debugging and introspection tools. And I'm curious, what are the tools there that you think are important? Are there any interesting things that we've built that have helped make that part of the support work easier? We're definitely making strides in this area now. For example, to be honest, it's pretty recent that the firm has had this big push towards CLM or centralized log management. That is very helpful in terms of combing through
Starting point is 00:28:41 events that have happened. And that is actually something that I have just seen come about in the past, maybe two years. We kind of were in our happy world, SSHing onto production boxes and reading log files and then just parsing them or reading them by hand. And we are kind of upgrading now. So I think that's one big area where things are happening. I think people often will write their own tools to parse data.
Starting point is 00:29:07 So a lot of our data at Jane Street is in S expressions or SEXPs. And we store a lot of data that way. And that means that people kind of have shared tools around the firm for how to pull data nicely out of those expressions. All of this stuff you're talking about highlights the way in which Jane Street is off on its own, slightly independent technical universe from the rest of the world. If you look at our internal tooling, you will find it simultaneously dramatically better and dramatically worse than the thing you might expect from some big tech infrastructure. And also just in some ways, weird and different. Other people use JSON and protobufs and we use S expressions, which is a file format
Starting point is 00:29:47 basically from the mid-50s involving lots of heavily parenthesized expressions. The thing from Lisp, if you happen to know what that is. Oh, but I've grown to love them. I like them too. And in many ways,
Starting point is 00:29:59 we've built a system where people are much more connected to and using the particular hardware and particular systems. And the fact that we actually like SSH into particular boxes and look at the log files on those boxes in some ways is just kind of a holdover from those origins. But over time, that's not actually how we want things to work. And so we've done, as you said, much more building of centralized systems for bringing together data. Log data is one of them, but not the only one. In fact, there's now a whole observability team, which has built a
Starting point is 00:30:31 ton of different tools, like, you know, distributed tracing tools and all sorts of observability stuff that lets you quickly and efficiently aggregate data from systems and show it together. There are some kinds of observability tools that we've had for a long time that you don't find in other places. Like we have extremely detailed, super high-resolution packet captures with shockingly accurate timestamps of everything that comes in and out of a large swath of our trading systems
Starting point is 00:30:59 so that we can like down to a handful of nanoseconds see exactly when things happened and put complicated traces of things together. That's the kind of tool that you don't see in lots of other contexts. But a lot of the more standard observability tools are things we're really only leaning into in the last handful of years. So I'm curious, as those tools have landed, have they made a big difference to how support works?
Starting point is 00:31:20 Yeah, so I think there's definitely a bit of inertia here when it comes to change. And I think it's this interesting combination where engineers hear about these things and are really excited. And I absolutely have seen a lot of them being taken on board. But like I said before, everyone has so much always on their stack that they want to work on. And migrating to a new library is something that will be somewhere on that stack, but it may or may not be at the top. Yeah, it's always a tough thing that when you come up with a new and better way of doing something, how do you deal with the process of migrating things over? And I think there's that last tail of things to migrate that can take a long time. And that's not great, right? Because it, in complicated ways, increases the complexity and risk of what's going on. Because there's the new standard shiny way of doing it. And when you're in the world where most things have been moved over into the new way, and then there's like the handful of things in the old way, and people have a little forgotten how the old way works. So I do think it's important to actually make time on teams
Starting point is 00:32:15 to clean up this kind of technical debt and migrate things over maybe a little more eagerly than people might do naturally yeah and jane Jane Street has a giant monorepo. And so often you will go look at someone else's code. And I mean, certainly production engineers, we'll often go read the code of the applications that are creating pages that come our way. And so being able to jump into a code base that you haven't seen before and understand all the tools that it's using and what's going on is just really important. And that's true also for software engineers who are also often going to be jumping into code bases in our big monorepo that they haven't explored before. And so that's another benefit on top of the benefit that these libraries are just better than the old way in most cases. And what goes into that kind of
Starting point is 00:32:59 production engineering role? And what's a little more on the texture of the work if you're not doing very much in the way of writing code? What is the kind of project work you do in that context? I can definitely give you an example of that type of work. I probably can't come up with an amazing summary of it just because I'm not in that part of the world. But I can say we keep going back to order engines. I have a friend on order engines. And something he might do when he's off support is he might work on a new type of trading that we're doing. And he might go talk to the desk about the trading they want to start doing. He might read a spec about that type of trading. He might then go write some code to make it happen. He might talk to downstream clients of
Starting point is 00:33:36 his systems to make sure they're ready for it and kind of be that glue between all of these systems. So that is an example of something he might do. And I think that kind of thing does connect to the support role more than it might seem, in the sense that there's all this operational work where you're trying to understand the trading that's coming up, understanding the systems on the other side. That first day when you try and do something, as you said, there are often failures that you then have to respond to and support, and understanding maybe the specifics of that particular flow, but also just understanding in general, what is the process of connecting to and getting set up with a new counterparty? And what are all the kind of little corners and details of how those systems work and
Starting point is 00:34:16 hook together? That I think very much connects to the kind of things that go wrong and the kind of things you need to understand when you're doing support. Even though it's a different kind of operational work, that's not just about machining down the support role, I do think there's a lot of synergy between those two kinds of thinking. Yeah, totally. So we just talked a bunch about the tooling that makes support better, but along the way, you pointed out in a number of ways the importance of culture. How do you
Starting point is 00:34:39 build a good culture around support and around safety? And I'm curious, what do you think are the important ingredients of the culture that we bring to support and to safely supporting the systems that we have here? Yeah, I think a big part of Jane Street culture in general that I have noticed, and maybe even someone has said this already on your podcast because it's pretty pervasive, but it's that you should just be totally comfortable making mistakes. And if you make a mistake, you should say it. And I think if you are making a mistake that's going to impact the production environment, that's okay. Humans do that. The important thing is that you raise it to someone around you urgently so that we can mitigate the impact and resolve it. And I think this shows up in our postmortems of incidents after they take place.
Starting point is 00:35:25 There's really not a blaming culture here, right? It's just people describing what happened so we can learn from it. Right. People are going to make mistakes. And sometimes it's true that when someone makes a mistake, the right response is like, oh, I made a mistake. I need to think about how to do that better personally. But most of the time, certainly when the rate of mistakes is high, the thing that you need to think about is how do I change the system?
Starting point is 00:35:47 How do I make it so that mistakes are less likely or that even when mistakes happen, the blast radius of the mistake is reduced? So you talked about postmortems. What is a postmortem? When do we write a postmortem? What are these things for? Where do they go? What's the story here? Roughly after an incident, we basically do some
Starting point is 00:36:05 reflection and writing about what happened. We'll sometimes write in a pretty detailed way, you know, with timestamps, the sequence of events, and we'll write down what led to it, what caused it, how it got resolved. And then we'll really have a big chunk of the postmortem that is dedicated to how can we do better. It's all well and good to reflect, but you really want to come away with concrete actionable items that you can do. You know, you might have some process takeaways like, oh, I should have reached out to impacted users sooner. And that's great. But I think technical takeaways are often a result of a postmortem. And how do you help people actually write these things effectively? To say maybe an obvious thing, even though we try really hard and I think largely succeed in having a culture where people are encouraged to admit their mistakes, it's awkward.
Starting point is 00:36:52 It is hard for people to sit down and say, yeah, here's the thing that went wrong. And here are the mistakes that I made. I think a thing that's actually unusually hard to handle is when you're the person who's writing down the postmortem, we ideally try and get the person who made a lot of the mistakes themselves. It was their hand that did the thing that caused the bad thing. That's the person who you want to give the opportunity to do the explanation. But sometimes a lot of people have made different mistakes in different places. I think in a well-engineered system, when things go wrong, it's often because there's like a long
Starting point is 00:37:23 parlay that was hit. A number of different things failed, and that's why we got into a bad state. And so you need to talk about other people's failures, which is especially awkward. What are the things that you do to try and help new people to the organization get through this process, learn how to write a good postmortem? How do we help spread this culture of this? Yeah. So I think some of it they're going to pick up by reading other postmortems and experiencing other incidents. It's pretty rare that someone will join, right, and then right away be involved in a large incident where it was their mistake that was, you know, heavily the cause. And to be honest, if that happens, probably their
Starting point is 00:38:03 mentor should have avoided that situation. I think like probably by the time they're in this situation, they have seen enough things go wrong, seen enough people admit that they have made mistakes, seen enough people still be respected and not shamed for that. And I kind of think it's just a matter of time, but you do want to be intentional with the way you talk about it. So if something did come up where a new person was involved, I think it would be really important that someone pulls them aside and make sure they're comfortable with everything going on and that they feel okay. And it's up to their teammates to instill that culture and make sure that everyone is talking about it in the right thoughtful way.
Starting point is 00:38:42 One of the tools that I've noticed coming up a lot that's meant to help people write good postmortems, about which I have complicated mixed feelings. The template? Templates, yeah, right. Where we have templates of what is the shape of a postmortem? What are the set of things you write in a postmortem? There's a place for the timeline of an event and a place to write down, you were kind of echoing this, you know, here are the things that went well, here are the things that went badly, here are the takeaways. And one of the things I worry about is that sometimes people like take the template as the thing they need to do and they go in and fill
Starting point is 00:39:14 in the template. And the thing you most want to have people do when they're writing a postmortem is to stop and think and be like, yeah, yeah, there's a lot of detail, but like big picture, how scary was this? What really went wrong? What are the deep lessons that we should learn from this? And I totally see the value of giving people structure that they can walk through. You know, the old five paragraph essay, you give people some kind of structure
Starting point is 00:39:38 that shapes what they're doing. But at the same time, there's something hard about getting people to like, take a deep breath, look wide, think about how does this matter to the overall organization and pull out those big picture lessons. And I sometimes feel there's just tension there where you give people a lot of structure and they end up focusing on the structure and spend less time leaning back and thinking, wait, no, no, what is actually important to take away from this? And I'm curious whether you've seen that. And if you have thoughts about how do you get people to grow at this harder task of doing the synthesis of like, no, no, no, what's
Starting point is 00:40:09 really going on here? Yeah, I definitely have observed that. My personal opinion is that the template can be really helpful for people who have never written one before, because it can be kind of intimidating. This big thing went wrong, now go write about it and reflect on it. And having a bit of structure to guide you when you're starting off can make it much more approachable. So I do think there is a role for the template, but I definitely agree with you that it can be restrictive. And I think once people are kind of in the flow of writing postmortems and are a bit more used to it and know what to expect, they're not going to get writer's block sitting down. I think removing the template and giving them a bit more space is totally reasonable. Then your question is like, how do
Starting point is 00:40:48 you get them from the template onwards? And how do you get them to think about this big picture framing? And I think the answer to a lot of these questions is that hopefully the production engineers around them are instilling this in them. I mean, to be honest, postmortems are not the only way we reflect and take action after an incident. In reality, you're going to have team meetings about this, and you're going to get a lot of people in a room who are talking about what could have gone better, or maybe they're just in the row talking about it. But there's going to be a real conversation where a lot of these big picture questions are coming up. And I think it's really important to have those discussions. And the postmortem is a very helpful
Starting point is 00:41:23 tool in reflecting, but it, I think, should not be the only tool in your arsenal. Yeah, it makes sense. Another thing that I often worry about when it comes to support is the problem of how do you train people? And especially, how do you train people over time as the underlying systems get better?
Starting point is 00:41:42 I think about this a lot because early on, stuff was on fire all the time. Every time the markets got busy, you know, things would break and klaxons would go off and there were all sorts of alerts. And you had lots of opportunities in some sense to learn
Starting point is 00:41:55 from production incidents because there were production incidents all the time. And we've gotten a lot better. And there are still places that are like new things that we've done and we're still working it out and the error rate is high, but there are lots of places where we've done a really good job of machining things down and getting the ordinary
Starting point is 00:42:11 experience to be quite reliable. But when things do go wrong, you want the experience of knowing how to debug things. How do you square that circle? How do you continue to train people to be good production engineers in an environment where the reliability of those systems is trending up over time? Yeah, I think different teams have come up with different solutions to this question. And the question of how to train people on incidents is a really hard one. I see some creative solutions out there. So I know one popular method of training is what we call the incident simulation.
Starting point is 00:42:45 And it's kind of a choose your own style adventure through a simulated incident. And this is all happening in conversation, in discussion. It's not on a machine. But the trainer is going to present to you some scenario that you're in where something's going wrong and you are going to step through it. It's kind of like D&D. And you're going to step through it and pick your path. And then they will tell you, OK, you've taken this step. Here's the situation now. And you will walk through the incident and talk about how to resolve it, what updates you
Starting point is 00:43:14 would give stakeholders, how you would mitigate it, what you would be thinking about, all of those things. That is one approach. And I think that gamifying approach has proved pretty useful. I know other teams that actually use some Nintendo Switch video games as training. So if you know the games Overcooked or Keep Talking and Nobody Explodes, those are both fun team-based games where you're actually communicating a fair amount under pressure. It manufactures a bit of a stressful situation and you're talking to your teammates and it's fun, but also it does simulate how to keep a level head and think clearly and communicate to people under pressure. So we have definitely had to come up with creative ways because like you said, bad incidents are probably just coming up less frequently. There's at least some teams that I've seen build ways of intentionally breaking non-production version of the system.
Starting point is 00:44:06 So they have some test version of the system with lots of the components deployed and operating in a kind of simulated mode. And we'll do the kind of wargaming thing, but a wargaming where you actually get to use your keyboard and computer to dig into the actual behavior of the system. There's a big investment of doing that. You have to both have this whole parallel system that looks close enough to production that it's meaningful to kind of bump around inside of it. And then people have to design fun and creative ways of breaking the system to create an interesting experience for new people to go in and try and debug it. In some ways, that seems kind of
Starting point is 00:44:39 ideal. I don't know how often people do that, how widespread that kind of thing is. I've seen a couple of examples of people doing that. Yeah, I haven't seen it be as widespread as I might like. I agree it is the ideal scenario. I can think of off the top of my head one team that I know has built that. But I think, like you said, it's just a pretty high investment cost. And so I think people have tried to steer away from that where possible. Or maybe that's too strong, but they have put it off in favor of trainings that have a very low investment cost.
Starting point is 00:45:09 I do know that we had a simulation, kind of like how you're describing where a real thing breaks and you go investigate it. That was used firm wide, or at least in my part of the firm for many years. And it wasn't really a system that you had to know about ahead of time. It would break. It's called training wheels if you know it. Someone wrote it maybe 10 years ago or something like that. This was before we had production engineers.
Starting point is 00:45:30 Any engineer could try their hand at fixing this system with a small code base and with a pretty manageable file system presence that you could kind of just go explore off the cuff. So I do know that things like this have popped up over the years, but I don't think it's something that you're going to find on every team. So beyond simulating outages and simulating working together and communicating effectively there, what other things go into training people to be effective production engineers? I think that comes with time on the job. So all software engineers go through OKML bootcamp when they join.
Starting point is 00:46:03 We have production engineers do that and then also go through a production bootcamp. And you can see a similar pattern with the classes that we have software engineers take in their first year, where production engineers take them and also take a production class. But then beyond that, it's just going to be so team specific, right? You want to be a strong debugger. You want to remain calm. You want to be careful. You want to communicate well. But the actual support you're doing is going to have such a different shape and color depending on your team
Starting point is 00:46:30 that a lot of it is team driven rather than firm driven, where someone is going to sit with you and literally do support with you for weeks and be teaching you a ton of context about the systems. And hopefully every time you handle a support issue, they'll be providing active feedback on what you can do better. So effectively, a lot of this stuff comes through in essentially an apprenticeship model. You are sitting next to people who have done this for longer
Starting point is 00:46:53 and they're showing you the ropes and over time you absorb by osmosis the things that you need to learn to do it effectively. Yeah, and some teams do have much stricter training models. For example, the team I mentioned earlier, Order Engines, has pretty high-stakes support. I think they have a stricter training model that people follow when they're joining the team compared to a team that has a much lower support load. So how does this all filter down to the recruiting level? I know that you're involved in some recruiting stuff both on the software engineer and the production engineer side.
Starting point is 00:47:25 Yeah. So I guess if you go back to my story, right, I was applying to software engineering and Jane Street raised production engineering to me as something I might be interested in. And that pattern does pop up a fair amount because certainly, at least at the college level, students don't know about production engineering. And so they are going to default to software engineering, which is totally reasonable. And it's also kind of just hard to tell what a
Starting point is 00:47:56 student might be interested in because you only have so much data. So I think often if we think someone might be interested in production engineering, we'll propose it to them, see how they feel about it, have a conversation. We have interviews that are production-engineering focused so someone can even try it out and see if they find it fun or not. We do legitimately get feedback that some of our production engineering interviews are pretty fun. Maybe people are just being nice, but I think they are legitimately fun if you're into this puzzle-solving business. So part of this is us reaching out to people or trying to identify people who will be interested in it. But I think there's also this lateral pool of candidates who, like you mentioned at the
Starting point is 00:48:34 beginning, have done something similar at another tech company and so are somewhat in this space already. And then those people will be a bit more opinionated about what type of work they would want to do and whether this would be a good fit for them. Can you identify any of what you think of as the personal qualities of someone who is likely to be an excellent production engineer? I think there are lots of people we hire who would be kind of terrible, unhappy, ineffective production engineers and some people who are great at it. What do you think distinguishes the people who are really well suited to the role? So I think strong communication is really important for production engineers. And communication, I think, is sometimes treated as a throwaway skill or something
Starting point is 00:49:11 like that. But it's so key because production engineers are really the glue between a lot of teams. They are going to be speaking to people who have very different mental models of all the data and systems in play. You know, like I said, when we're talking to operations, we might speak a different language about a trade, right? I think the debugging skill is, of course, important. And I think that's a great example of how all of these skills are obviously also important for software engineering and other engineering disciplines. But I think
Starting point is 00:49:39 especially important for production engineering because you're doing an extra high level of investigation and debugging. I would say carefulness, just because you're interacting with production systems. And it's important that you are taking extra care and thought and aren't going to do something crazy; you're going to think it through, and you want to be pretty level-headed. And I guess that would be the last quality I would mention, which is just remaining calm, even in a stressful situation, which might not, to be fair, help you on every team because some teams don't need that. But typically, no matter what team you're on, a lot of stuff might come up at once. You might feel a little bit overwhelmed and
Starting point is 00:50:14 that's okay. You just need to keep a level head, remain calm, not panic. And that is a certain type of person. And that is totally not everyone. And some people would hate that. And some people, like me, don't mind that situation at all. And some people love that. And then they go to the order engines team, right? So you get this big spectrum, but I think across all these teams, those are the qualities that kind of stand out. Having just a personal enjoyment of putting out fires. I think some people find it unpleasant and hard, and some people find it really energizing. I think I actually found it really energizing. I don't do as much of it as I once did, but there's something exciting about an emergency. Not exciting enough that you would create one when you didn't need one,
Starting point is 00:50:51 but when it's there, there's something joyful about it. Yeah, and I think it kind of ups the reward factor because then once you're done solving it, you feel extra excited about having tackled this really big issue. Do you do anything in the interviewing process as a way of trying to get at these qualities? The one that seems like it maybe should be possible to get at is the debugging skill. What do you guys do that's different for evaluating production engineers than you might do for software engineers? So we do ask candidates software engineering questions, but we also have production-focused questions that we will ask them. And for example, one of these questions puts you in an environment with some code, with some command line tools,
Starting point is 00:51:34 with some data. And we tell you, hey, here's the thing that went wrong. We want to figure out what happened. And it is their job to go piece together that story. In fact, that specific interview question is based on a real thing that my team had to do on support a lot in the past before we eventually built a website that helps people do it for themselves. But we used to have that all the time on support. So it really is kind of having the candidate go through an action that we used to take ourselves on support a lot and the thought process that we used to go through ourselves a lot. So we do try to get at it in as realistic a way as possible. Obviously, it's a bit of a synthetic situation. There's no way around that. But I think if you
Starting point is 00:52:13 talk to a candidate about it and hear how they're brainstorming about solving this problem and what steps they want to take to get there and what their reasoning is behind it, you can kind of get at a lot of these skills. But you do have to ask a lot of questions, in the sense that you don't want to take a back seat. Not that any interviewer should, but you certainly can't just look at the code at the end and form a really strong opinion. You really need to observe them as they get to the answer throughout the whole process to have a strong opinion of the candidate. I do think that's something that's generally true about interviewing. Very early on when we started doing interviewing,
Starting point is 00:52:49 I think we had the incorrect mental frame of like, oh, we'll give people problems and we'll see how well they solve the problems. But I think in reality, you learn much more by seeing what it's like to work together and seeing the process they go through. You talked a lot about the feeling of solving puzzles. And in some sense, we solve puzzles for a living, but we don't really solve puzzles for a living. There are puzzles that are part of it, but it's much more collaborative and connected than that.
Starting point is 00:53:11 And just seeing what it's like working with a person and how their brain works and how those gears turn seems much more important. There are production engineers who do very little coding. And, like I said, there's that spectrum. So we also want to make sure that we have interviews that really can help those people shine and identify people who would be good for that type of role. All right. Well, thank you very much for joining me. This has been a lot of fun.
Starting point is 00:53:34 Yeah, thank you for having me. You'll find a complete transcript of the episode along with links to some of the things that we discussed at signalsandthreads.com. Thanks for joining us and see you next time.
