Software Misadventures - Lorin Hochstein - On how Netflix learns from incidents, software as socio-technical systems, writing persuasively and more - #13

Episode Date: August 14, 2021

With 5+ years of experience building resilient systems at Netflix scale, Lorin joins the show to chat about his favorite incident story, the path that led him to doing chaos engineering (and later... away from it), and advocating for a dedicated analyst to talk to people after an incident. Throughout the conversation, Lorin shares his philosophy and tips on how to learn from incidents, what engineers can gain from writing better, and why some metrics may not be as useful as you think.

Transcript
Starting point is 00:00:00 There was an all-hands meeting going on, and most of the people on my team were sitting in the situation room. And the situation room happens to have all these monitors around it with the dashboards. We don't usually use them for anything, right? Because no one, like, it's just cycling through those dashboards. And he sees these, like, those, like, errors. Like, wait, those are bad, right? Now, I'm not in that room. So I'm on call.
Starting point is 00:00:20 I'm not in that room. I'm like, no, I'm not going to be, like, remote. Like, I want to go actually to where the all-hands meeting is. So I'm in the all-hands meeting. They're watching the all-hands meeting from the Situation Room. My teammate sees this. He pages the on-call, who's me. So they're watching. They're watching me stand up. They're watching me look at my phone, stand up, and then come back to the Situation Room. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps experts to hear their stories from the trenches about how software breaks in production.
Starting point is 00:00:53 We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect, and wished there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders. Hello everyone, this is Guang. Our guest for this episode is Lorin Hochstein.
Starting point is 00:01:29 Lorin is a senior software engineer at Netflix, where he worked on chaos and reliability engineering and is now on the managed delivery team. In this chat, we hear about how Lorin got into chaos engineering, his favorite incident story, what engineers can gain from writing better, and why some metrics might not be as useful as you think. Please enjoy this lively and educational conversation with Lorin Hochstein. Awesome. Well, Lorin, welcome to the show. Hi, I'm happy to be here. Awesome. So starting from kind of the beginning,
Starting point is 00:02:12 So you got a PhD from the University of Maryland in computer science. And then after that, you actually went to academia and spent some time working as a professor before coming to the dark side and moving into industry. What was that transition like? Yeah, I really thought I wanted to be a professor. It's funny: my dad's a professor and my mom is an elementary school teacher, and when I was a kid, I was like, no, I'm not doing it.
Starting point is 00:02:41 I'm not going to be an educator. I'm not going to be a professor. I'm going to do my own thing. I got a degree in computer engineering. I got a job right out of school. I was working for a little startup in Montreal. And then I was like, wait a minute, I actually do want to be a professor. And so after 18 months, I moved to the US and I went to grad school, got my PhD, applied to 40 different schools to get a job as a professor. I got two interviews, two offers, and I took one. And I was not happy there. I was very, very stressed out when I got that job.
Starting point is 00:03:19 I was worried. I think this would be good advice for my parents in terms of how they could have convinced me to go to medical school. So wait, what changed? Why did you change your mind and decide, hey, maybe a professorship is actually what I want? Yeah, I really like school. I really like learning stuff. I liked research; I was just interested in it. I started out my PhD program in electrical engineering, and then I switched to computer science because I really liked signal processing. I just love that stuff. I really enjoyed college. I mean, it's really weird, but when I retire, I'm probably going to spend time just going back to school and just taking classes
Starting point is 00:04:00 because I just really like that. And so when you really like you know academic stuff academia is a natural place to think you want to spend your time uh and that's that's what sort of flipped me i see i see what made it stressful was it the the teaching aspect or like having to like do the grants you know the more uh stuff? Sure. The grants was probably a thing that was most stressful to me. Teaching was stressful. I mean, I'm an introvert and you have to stand up there and, you know, do a lecture all the time, but you get used to that in academia. Like you can't even get through grad school if you can't, you know, get up in front of people and speak. I learned how to stay like, you know, one lecture ahead of the students when I first created classes,
Starting point is 00:04:46 which was interesting. But one of the challenges, I like education, although I'm not a big fan of lecturing. So what was frustrating to me was the grant stuff. It's very hard to get grants. The accept rate for the NSF, at least at the time, was like 4%. I was not in a particularly hot area. My research area is empirical software engineering. So it's like actually studying software engineers as they do work. And I was stressed out all the time about it. And I couldn't spend as much time on like teaching stuff as I wanted to, because I had to spend time on research stuff. And I just wasn't like enjoying the work because I was constantly worried that like, I wouldn't get grants, which would mean I wouldn't get tenure, which would
Starting point is 00:05:28 mean that I would like basically be fired from my job and have to go somewhere else after six years. Um, I had just had a, my first child was born right when I moved to Nebraska. We didn't know anyone out there. My wife and I were like, we're like city, we're coasters, right? So we were both in the East Coast, now on the West Coast. Midwest is not like culturally, it's not a good fit for us. We didn't know anybody. So like, there's a conflict. It's never one thing, right? So there's like a confluence of different factors. And in the end, I was like, well, like, what am I doing here if I don't really like this work? Like it's, I'm fortunate enough to be in a field where there actually is industry work. If I had gotten a PhD and like, so my brother has a PhD in philosophy, and he's a philosophy professor now.
Starting point is 00:06:14 And there's not that much else you can do with a PhD in philosophy. But at least in software, you can go out into industry. And so at some point, I'm just like, what am I doing here? I'm not happy. And I started to look into other options and I slowly transitioned from academia into industry. Interesting. Like, did you have to like do the whole,
Starting point is 00:06:36 like get the how to crack the coding interview book or was that, cause you already had like a really solid, you already had a very solid background, right? So, but what was that prep like? I don't know if the book was out at that point. So we're talking about 2008. I don't know if the CAC and coding interview book had been published yet.
Starting point is 00:06:55 I applied to Google. I could not get through the tech screen. I remember in a Google doc doing programming and doing terribly. I didn't do that much coding when I was a professor and I certainly, uh, was not practiced at doing like live coding kind of stuff. Um, and so the first job I got out of when I left, um, the university, university, Nebraska Lincoln, I went to a research lab and, and so there's this whole world of these research lab things that people don't know about unless they're in academia, where they're sort of attached to universities, but the people who work there are not professors,
Starting point is 00:07:30 they're not faculty, they're staff, and they're typically funded through the research grants or contracts that that research lab is getting money for. So when I was a PhD student, my advisor had a project funded that was paying for my PhD. And there was a whole bunch of other organizations that had funding. It was from DARPA, and there was a whole bunch of other people involved. And one of these organizations was called ISI. It's still called ISI. The Information Sciences Institute.
Starting point is 00:07:57 It's a research lab out of University of Southern California. Most people have not heard of it, but that's where DNS was invented. So they had a big role in the early internet. I never heard of it until I was a PhD student. Then I was looking around. I was like, oh, I heard of this ISI thing. Why don't I apply there? So I applied there. Because it was more of an academic-y place that hires PhDs, I did not have to go through a code screen. It was more like an academic job interview where you go and give a talk, right? A job talk. That's the difference, right? Academic. So academia, I would say the work much more stressful, but the interviewing much easier because by the time, by the time you get to the interview process, your odds of getting in are very good. It's basically just, if there's
Starting point is 00:08:40 someone else in the pipeline, who's a little better than you, that's it. Whereas in industry, it's like, we don't know whether you're any good or not, because we don't have any bar in advance. Maybe if we hired you from another fang company, but even then, even when we hire people from fang companies, we still make them jump through these hoops. So it's much less stressful
Starting point is 00:08:57 to apply to an academic-y place. And so I did a job talk, I got a job there. My title, I love this, was computer scientist. So I have a PhD in computer science. I have worked literally as a computer scientist. I still could not tell you what computer science is, like how to define it, but I'm very well credentialed to talk about it. And so I worked there for a while. And then my boss and Skip Level, they left to form a startup. And I left and joined them.
Starting point is 00:09:28 And since they knew me already, I did not have to text screen. Right? I just got there. And then, I don't know if you want me to keep going. But basically, then I was working there. It was a startup. And I was working there during the government shutdown that happened. I don't know if you remember that, but this was in D.C. a while ago.
Starting point is 00:09:49 We were funded through government contracts and my paycheck started getting later and later and my wife was like, you know, we got to eat here. And so I left and I got a job, ironically, through my brother-in-law who was a VP at SendGrid at the time. So I was working, he was running SendGrid Labs, which was like a little piece of SendGrid that was doing new product development.
Starting point is 00:10:14 So I moved to Rhode Island. And I did have to go through a tech screen. I did have to go through the regular process to get a job there. And I lucked out because the problem I practiced the night before that person happened to ask me in the tech screen. I told him like, it was like the shuffle problem. I've never seen that one, right? Like, how do you do a perfect shuffle of a deck of cards? And there's a there's a well known algorithm. And I'm like, I told him like, you know, I've just practiced this. Are you sure you want me to do this? He said, Yeah, okay. So, you know, nice.
Starting point is 00:10:42 And I just blew the problem away because I had just done it. Did any of the process of interviewing ever make you feel that, oh, maybe I should go back to academia and not go to tech? Because, like, you have to go through all these hoops to just go through the interview? You know, I found the interview frustrating. I still do. I still am, like, not a big fan of the way tech interviews are done, but I like the work of industry more. So like, I really don't like how to get in there, but I really enjoy the, one of the things I like a lot more in industry is the team aspect.
Starting point is 00:11:16 Like when you're in academia, you're sort of like an entrepreneur. You know, you have some people who report to you grad students, but really you're out to make a name for yourself. And so you're on your own. The incentives are all for you to do individual work. Whereas in industry, you're always working on some kind of product together with a team. It's not about you as an individual. It's about you're working together for something, to build something.
Starting point is 00:11:39 And I like that. I work much better in that kind of environment. So while I still am very academically minded, I run the paper reading group at Netflix, I still like the day-to-day work in industry a lot more. Nice, nice. And then that's sort of, I guess, right after that, that's how you got into Netflix?
Starting point is 00:11:59 Or what drew you to Netflix? Yeah. Well, I was at Ten Grid Labs, and we'd actually moved to Rhode Island for family reasons, and then those reasons were my wife is from Rhode Island, and then those reasons were no longer relevant. There was nothing keeping us there anymore. My wife actually does not really like living in Rhode Island,
Starting point is 00:12:21 even though she's from there, or maybe because she's from there. I was fine with it, but I had no particular attachment to Rhode Island. Kids are getting a little bit older. I'd never worked in Silicon Valley. And I said, you know, this is really our last chance to try something like that. This was before the days you could work remotely for any company like you can now. And so I just basically applied to all websites, to different FANG companies. I didn't have any contacts, really, at any of them. And so I applied to all the Fangs. I failed the tech screens at Facebook and LinkedIn. I got an offer from Amazon. Google kind of screwed up the process with me. I got through the tech
Starting point is 00:12:59 screen, but they never managed to schedule an on-site because the recruiter wasn't that good, I think. And Netflix had an opening on the Chaos team. And I thought, like, that's really cool. managed to schedule an onsite because the recruiter wasn't that good i think um and net and netflix um had an opening on the chaos team and i thought like that's really cool i'd heard of chaos monkey uh and that sounded really exciting to me and i i got an offer there and i i didn't even want to wait to see what would happen with google i was just like yeah like amazon i had heard like you know they use blood to power the machine so i I was a little afraid of going to Amazon. And so, yeah. So a couple episodes back, we actually spoke with Tammy from Gremlin.
Starting point is 00:13:39 And we were actually talking about how cool the Netflix Chaos Con, I think that was the demo for, at AWS reInvent one year. And so, yeah, I think that's been an inspiration to a lot of us. And I was curious, for you guys, y'all have been doing resilience engineering for a long time at Netflix. How has that changed, evolved over time? Yeah. Well, it started out with the Chaos Monkey stuff. I mean, it's been going on for a really long time, right? So I can talk about what happened since I've been there. So when I got there about six years ago, Chaos Monkey was there. The person who had developed initially had left, and so another team had owned it
Starting point is 00:14:26 and wasn't crazy about owning this legacy thing that had been updated through open source, but we weren't really using some of the open source stuff in it. The conk stuff is interesting. That's the traffic failover, which now got broken out into a separate team. Demand Engineering runs all that stuff now. It used to be like Traffic and Chaos were together. And then there was some more sophisticated fault injection stuff that was developed by the guy who started Gremlin, Colton Andrus, built in a lot of
Starting point is 00:14:55 smarter fault injection between microservices and Netflix. So basically all the libraries had these all the apps had these libraries linked in had these, all the apps had these libraries linked in that you could say, okay, fail this call between, you know, one service and another. Uh, and so they would do experiments. So what happens if we fail? So there are some non-critical services in Netflix, like the, that are like, they add value to the, to the experience, customer experience, but they're not critical to be
Starting point is 00:15:21 able to watch a movie. So for example, the, the bookmark service that remembers where you were when you were last watching something, like if that fails, you would not expect to prevent you from watching. You just expect it for, you know, a reasonable thing would just start at the beginning again. Right. And so with this failure injection framework, we're able to, you know, so surgically fail calls for either one particular device or for, I think, a percentage. And then when I got there, they started building an experimentation framework on top of that that was more sophisticated. Because it was really tricky to do those experiments because if you only injected a little bit of failure, you couldn't really see at the metrics whether you were actually having impact to customers or not. We would basically do a chaos
Starting point is 00:16:05 experiment and look to see, are people still able to watch? We have a metric called SPS, which is a rate of how often people are pressing the start button and they're actually successfully able to stream a video. That's our KPI, our top level metric. If that drops,
Starting point is 00:16:21 things are bad. It sort of goes in a sine wave ish kind of thing throughout the course of the day because people watch more in the evening than in the morning but like it doesn't move dramatically quickly um so if it drops like that's our alerts fire and so doing this cast experiments if we would inject some failure if it was like a lot that's really bad but like a little bit it was really hard to tell and so we built so when i was there they built a system called CHAP, which is an experimentation platform where you could affect like a very specific subset of traffic, like a small fraction
Starting point is 00:16:51 and watch what happens to only those users. And so it was much more fine grained. We had graphs just on the people who are being affected. We could use sort of traditional like experimentation models where you have like a treatment and a control. I built like an auto stop system. So if it seemed like it, like it would usually would run for like a treatment and a control. I built like an auto stop system. So if it seemed like it would like it would usually would run for like a while, I don't know, like 30 minutes or something. But you know, we could detect in a couple minutes,
Starting point is 00:17:11 if there's like a big drop in the in the treatment group, just stop it. So there's a lot of sophistication went into doing that kind of building that kind of experimental platform. So we could be a little more sort of surgical um but it turned out that like okay you could run an experiment and you fail this thing or you added latency here and there's some impact well now what right like well what's the problem like we we don't we don't know we can't tell you that and like is it worth chasing up on that right so there's this whole problem of like fault fault localization which is another whole area which was not what we were built for, right? And that's like the observability, you know, movement, I would say that sort of blossomed
Starting point is 00:17:50 over the past few years is really focused on that now, right? It's on fault local, like, where's the problem when something goes wrong? But we didn't have great fault localization tools. And so it was hard to get uptake, we tried to do some automatic stuff, right? So we're gonna run a bunch of automatic experiments. But then still like, okay, something went wrong. What do we do with that? Uh, and so we, we built some cool tooling, uh, but it ended up being difficult to use that sort of automated tooling in practice because it required a lot of, of interpretation of the data, um, to an investigation to see whether there was actually a problem or not. And then you're like, there's so many different experiments you could run.
Starting point is 00:18:26 So, you know, I left that team. Man, it's been about four years now. It was the chaos team. Now it's called the resilience team. And they're doing some other things now. They're doing some more, like they're not just doing fault injection. They're using that framework to do other things like canary analysis. They're doing some more sophisticated stuff there.
Starting point is 00:18:43 They use it for load testing. So they've sort of built on it to do things other than failure testing. So it sort of evolved over time to do different kinds of infrastructure experimentation rather than just break things and just inject failure and see whether we can handle it. The Chaos Monkey stuff is still there, but it doesn't catch that many things because we now know how to build these services. Every so often, someone will do something accidentally stateful that shouldn't be, but most of the time, it's just fine. I'm curious, even when you're doing the sub-sampling specific types of traffic in order to do the injection, was that difficult? Because like, you would need to kind of have the domain expertise to kind of know that, okay, for this type of errors that we want to inject, it only
Starting point is 00:19:29 affects, you know, this sort of traffic. Was that a thing or it was pretty straightforward? Well, the nice thing about Netflix is we have enough traffic that like, even if we do a small, like, let's say I'm going to fail calls to the, you know, the foo service here. And I'm like randomly sampling at the edge, right? And I don't know if they're actually going to call it. Maybe no calls will go from the people who are tagged. So we're like, we're tagging, you know, in a way that we remember who we tagged, you know, a bunch of people to say, okay, you're the ones we're going to inject failure on. They might not end up going there. But usually we have enough traffic, like we have, there's so much traffic coming in, going to all these services that we get enough that we get, you know, statistically significant
Starting point is 00:20:07 results. So just because there's just so much stuff coming in, we can be kind of random, like we can random samples, like small amounts, and it ends up being like a decent count of traffic that we can get a decent, we can often get a pretty decent signal. And we have a knob, we can turn it up, it up like oh there's like nothing coming in here um to this service in this group so we can we can try to you know increase the number of people we're sampling um that's pretty neat like one thing as you were talking about some of the challenges uh with just doing chaos engineering or resilience engineering even at netflix where i think this idea well well, I don't
Starting point is 00:20:46 know who popularized the idea, but like Netflix has been doing this for a while. It's been a few years where Chaos Monkey, I think it was 2013, 12, sometime around that. And even so, like you mentioned, if you don't have the fault localization tools or the observability tools, it's hard to know what to do with the data that you found out or even dig in and say, okay, there's the problem. That's what we need to fix. There are various orgs today who see this pattern in the industry now where like chaos engineering, for instance, is a thing that people want to do a lot more of. But from an engineering perspective, it sounds really cool to do. But from a business standpoint, the question comes back to, so how does it add value? Like, if you did this, then how are you going to identify what to fix? And if you do that, how is it going to help the
Starting point is 00:21:33 business baseline? Because do we build stuff? Or do we break stuff and figure out how to improve it? So like, for folks who are still trying to think about doing chaos engineering, do you have any ideas or perspective on how they can go about thinking about this? Like whether it makes sense to invest right now or build some tooling first? Yeah. The funny thing is I'm kind of in the other camp at this point. So when I was on the chaos team, what would happen is I would look over at like what was happening like with the actual incidents and I would say wow like that's a lot more interesting like the real stuff that's happening that's more interesting than the synthetic stuff where we you know do a little thing and like we
Starting point is 00:22:14 don't know what like the big things like those like there's a lot there and like every time you look at every time I looked at those I was like wow I've like learned so much about what's going on in the system uh and and you know one of the reasons the team name changed from chaos to resilience is that like resilience engineering is traditionally more about like, if you, you know, look at like the academic field is more about like studying, you know, real incidents. And so I wanted to do that. And so I was like, well, why don't we, why don't we, you know, change the focus of the team to spend time doing that? And then I end up leaving the team and moving over to the incident management team. So I don't want to dump on chaos engineering.
Starting point is 00:22:55 I would say that I think you should invest more of your time looking at your real incidents. I think you will learn about i i found it has been a better roi because like it's just like it's a lot of engineering effort to build up the tooling around it and then built learning how to i mean there's some value in learning how to like think about through the experiments and stuff but i i don't know if the roi is there to be honest with you and like i i it's one of the reasons i don't not in that space anymore i'm like it's it's a cool thing i'm glad we tried it but like the real incidents are where it's at. That's the stuff where you're going to learn stuff, and they're going to keep coming.
Starting point is 00:23:31 You don't have to make them happen. All right. All right. Moving on to the next topic. So you were then transitioned to the core team, which I believe stands for core operations and reliability engineering. So that seems a lot more sort of broader and more horizontal in terms of scope than other SRE teams. Like, how did that work?
Starting point is 00:23:55 Yeah, the Chaos slash resilience team and the core team were both in the same org. They were, like, managed by the same director. So it actually was the same org they were like managed by the same director um so it was and actually it was the same that there was there were two manager gaps it was actually the same manner like the director was managing both teams um and i looked over and i was just sort of like i want to do incident analysis and it doesn't really you know it's not really within the scope of my current team like the the core team is so the core team is a centralized incident management team i think i don't know what it originally stand for every so often they change what the act what within the scope of my current team, the core team is a centralized incident management team. I don't know what it originally stands for.
Starting point is 00:24:27 Every so often they change what the acronym means. I think it's like critical operations and reliability engineering is what it is today. I don't know what it first stood for. But they're a centralized incident management team that is also responsible for the follow-up, the incident analysis. And I wanted to do the incident analysis so i moved over there but like the the sres that are on call are the ones that that do the analysis at least at the time so i had to you know become like an sre and go on call like i'm not an sre i i i guess i have like an sre skill set at this point
Starting point is 00:25:02 in my career uh that's never been my title. But I had to. So I actually kept my title. I was still I was like the only senior software engineer on that team when I moved over. There are all the other ones were SREs, senior SREs. But I had to do the same kind of work. So I moved over to that team. So that team doesn't own any services of their own.
Starting point is 00:25:21 Right. So Netflix is like you build it, you run it so each each of the teams that develops the microservice is responsible for operating it but the core team sort of owns the top level like alerts and dashboards so if like if all of a sudden like you know a lot of people in Canada are not able to you know stream video on their smart TVs like the alert will go out like a core alert will go off and they sort of do coordination and incident management, right. And they will, they'll page people in there. They're pretty adept at like, you know, tracking through dashboards to figure out, okay,
Starting point is 00:25:51 what's the source of this particular, you know, stream of errors that's coming back. They do handle communications, you know, with PR and the rest of the organization. And so that's, that's what they do is you have a bunch of people who are sitting around sort of ready to do that do that sort of work. That makes sense. On the core side, it's interesting how the team does incident management analysis centrally. So from an incident management perspective, I think different companies have different different they call them different teams but there is a
Starting point is 00:26:26 team where you'll have like an incident commander for instance who'll page and page people out kind of make sure the summary goes out to the execs and the stakeholders what's interesting is one part of the role that you mentioned is the incident analysis as well so since the core team doesn't own a lot of the services that they're built by other teams, what role and how do they play that role in the incident analysis aspect when working with these multiple teams? Well, it's changed a little bit over time, I would say. Historically, they've been responsible for memorialization of the incident. So we use Jira to track these things.
Starting point is 00:27:05 Same with LinkedIn. I mean, it checks the boxes, right? Like it does what you need it to do. People have a love-hate relationship, but it works out. Yeah. They're responsible for writing up that ticket. And often that description will be relatively brief, not a huge amount of detail. They're enough so that someone on the core team who's looking can look and see what the graphs look like, and so they can get some experience with the particular failure mode. And typically, the incident commander will organize the IR meeting. So there's an incident review meeting. They'll bring the people involved together.
Starting point is 00:27:43 And then often the teams that own individual services will do write-ups, right? So there'll be like an open Q&A doc where people will add questions and answers. Individual teams might do some write-ups either on their own or contribute to a larger one. So that's typically how it was done. What I wanted to do was I wanted to go in and like talk to a whole bunch of people who were involved and figure out like, okay, like how did the system get into that state? When the incident was happening, like, like, how did you find out? What did you see? What were you thinking? Right? Like that's, they had not historically been doing that. And they had also had only been looking at incidents that were like customer impacting, right? If something happened that was like not a customer impact, I think maybe I would see an email go from some other team that
Starting point is 00:28:29 said, hey, you know, our internal service had some outage here at this point in time, right? It affected internal customers, but not. And I was like, what happened there? That seems interesting. Like, I want to know, right? The first one that I saw was like, I think a system that like figures out like what ads to show, like, you know, Netflix doesn't show ads on our service, but we, you know, buy ads to show on the internet. And then I think there's some system that figures out what ads to show. Netflix doesn't show ads on our service, but we buy ads to show on the internet. And I think there's some service that figures out what ads to show people. And it had failed, and so there was a fallback. So there was no even impact. But I was like, that's really interesting there.
Starting point is 00:28:54 What happened? And I talked to them, and it was really fascinating to see all the different things that had to happen for that failure mode to happen. But that was not in the scope of of core at the time uh and so like i just wanted to dig in and like no one else was doing it uh and then like so i moved over and my uh colleague at the time nora jones who was on the the core the chaos resilience team she moved over with me like a month later to core she was doing a master's degree at uh loon university they have like a safety science uh degree so she was really into it too um so the two of us moved over um and we you know tried to
Starting point is 00:29:32 do some some deeper analysis stuff and we ended up like opening up new positions so we hired uh jay paul reed is on our is on that team now he's like a his like official title i think is applied resilience engineer right and he does not he's not on call like official title i think is applied resilience engineer right and he does not he's not on call like he he is he is able to right that's it that's um i really like that um he's able to spend all his time doing the analysis he does he's not responsible for being on call uh and like if you read the the literature they're like if you're involved in the incident you're really not supposed to be the one doing the the follow-up because each person will have a different perspective and your perspective is going to change the way you look at it.
Starting point is 00:30:10 And you need to get each individual person's perspective to really learn as much as you can about the incident. Wow, that is so fascinating. And it's fascinating because it's unique, at least from my personal experience i mean at linkedin and some of the other folks that i've spoken to many times the person who's on caller let's say was involved in the incident itself is writing up like for the lack of a better word some people call it post-mortem docs or the incident reports for instance they organize like the timeline and all of these data points but it's so interesting which what you said that the person involved shouldn't be the one necessarily doing the analysis later on
Starting point is 00:30:52 because different people have different perspectives so i'm curious like what kind of things you found in those reports when they got written by people who weren't directly involved in the incident or is that how it worked yeah what would happen is you would see the world from different people's points of view um to describe what happened in the in the incident right and and so that was really interesting so like historically what happens is when you like the the incident with the world of the incident would start when the first alert would go off, would be how the traditional write-ups would happen. But then, like, when you started talking to people, you would find out that, like, sort of the seeds had been planted sometimes, like, months or years before, right? And you see how that change made sense at the time, and this other change happens over here, which they didn't know about, and these things interact.
Starting point is 00:31:40 And so you, or you find out that, that like this person didn't know something that somebody else knew at the time or or the opposite i mean one of the things that i love about this stuff is um watching how experts resolve incidents right we talk a lot about okay like we don't want to blame people for their mistakes but like incidents are an opportunity for people to like really do well like operating at a very very difficult situations right and there are people like in every organization i mean netflix is a great place. I love the people there, but like every organization, there are some really, really good people. And the opportunity to like watch them work is amazing. And if you can capture like a story of what they saw and like how they dealt
Starting point is 00:32:18 with the problem in the moment, like you can learn from that in a way that's sort of similar to how you can learn from your own experience. And that's one thing I've really wanted to capture in these sorts of write-ups. That's so true. We had Ryan Underwood on the show in one of the first few episodes. And he was on the same team as me. He's moved on from LinkedIn. But he was one of those engineers who if there is, let's say, for example for example an incident or a bug people would gather around his desk just to shoulder surf and see what he was seeing and the commands that he was using and we've asked him to record his like
Starting point is 00:32:55 team work sessions or screen sessions and they are still an artifact shared across the team saying okay this is how you debug a disk issue when you have disk pressure on a containerized environment. So I can completely relate to what you're saying. But at times, we also found this challenging to do many times because it's like, okay, someone's dealing with an incident or they're going through all these things. And the first response they have is,
Starting point is 00:33:19 can I just go fix it? And many times, like people go back and do these write-ups where they'll describe exactly what they saw capture all these artifacts so that rest of the teams can learn but then time gets in the way uh priorities gets and get in the way so i'm curious to get your thoughts on uh how how did your teams at netflix go about like prioritizing this sort of work where it's like, let's capture the thing and learn from it instead of not reflecting on it all the time. Yeah, that's why I'm an advocate of
Starting point is 00:33:52 actually spending the resources to have someone who's dedicated, whose job it is to go and talk to people, right? Because it's one thing to ask you to go and write up your own experience. That's going to take you a lot of time, right? And like you said, so the one universal that we all face is that none of us have enough time, right? We are all limited in our resources. And it doesn't matter like what kind of company you work at. It doesn't matter where you are in the org chart. Like no one has enough time to do the work they have to do, right? And so like, like learning is the first, reflecting is the first thing to go because you can always cut that out right um so you need to have like some upfront investment so that's why i but like to go and sit and talk with someone for like half an hour or an hour after an incident well like that's part
Starting point is 00:34:33 of the culture anyways right like we're expected to like talk often there's like a you know we typically have like a meeting like a big you know ir meeting but like i actually find i get a lot more out of the one-on-ones right and so the one-on-ones will be much more expensive for the person doing all those, but for the people on the other side who are talking about their experience, it's, it's a meeting, right? You've got meeting. I mean, I don't know what your calendar looks like, but I suspect that like, you know, the size of LinkedIn. Yeah. Right. And so like one more meeting, it's annoying, but like, it's not that expensive. Uh, the write-ups are still hard and you have to get better at them over time.
Starting point is 00:35:05 I mean, even people who do this full-time, even me doing it, I struggled. I got better over time. But that's why I think there's a real advantage in having people dedicated to this, because that is their full-time job. They have the cycles devoted to it. So you don't have this problem of, well, I would love to do this, but like, I just have too many other competing priorities. It's just not the most important thing for my team right now. So, so I can't.
Starting point is 00:35:34 Makes sense. So in this case, like, is it the role for the applied resilience engineer that you mentioned who would go and talk to these teams and gather their perspectives and then like, kind of either provide a qualitative analysis of okay this is what this is kind of a pattern that we're seeing over and over again and some of the incidents happening across the site yeah so you know two things there one of them is the you know going and talking to people and doing you know about specific incidents and there's always little things happening all the time that you know even with like a full-time resilience engineering analyst like paul can't even look at everything right like there's always little things happening all the time that you know even with like a full-time resilience engineering analyst like paul can't even look at everything right like there's just
Starting point is 00:36:08 way too much but like the great thing about incidents is if you miss one there's another one right like like uh so so part of it is looking at like individual ones and the other part is like you said like sort of aggregating right so it's another thing that paul does is he has like a risk radar meeting where he actually sort of crowdsources and asks people to provide like what are what are some risks that you're seeing that that we haven't been talking about and then he has a meeting and they can post he pulls from you know the incident write-ups that he's done or other people have done and also just solicits feedback so like that's one of the challenges is how do you aggregate these sort of like weak signals of risks that you're getting from different places uh and that's one of the roles that he plays.
Starting point is 00:36:46 That's so cool. You might have written some incident write-ups or maybe a lot more depending upon how many incidents. Do you have any advice for folks on how to go about writing incident reports? I mean, even though not many companies would have like a full-time person to gather this feedback, but if someone's writing this incident report themselves, the elements that they could borrow to kind of make those more valuable. Yeah, I would say that the first step, right? If you're like, okay, I want to write better incident reports, but I don't know what to do. The thing that I think is probably easiest and most valuable is talk one-on-one with the people that were involved and just write separate narratives from each of their perspectives about what they saw about the world.
Starting point is 00:37:30 The hard part is taking those and combining them together and picking out themes and contributors. That's hard stuff. It takes a lot of time. But those narratives alone, you're going to get a ton of stuff out of there um and so just capture just like and i i found that like once i started to do that like i would interview someone i would write an individual narrative for each of them and then i would do the combining like even if you just stopped at the individual and i didn't i didn't actually release the individual narratives because i would i would sort of do the synthesis right up where i would like sort of switch back and forth within
Starting point is 00:38:01 the different perspectives over time right to get to get like a one story going. Even if you didn't get to that point, if you just had those narratives, like there's just a ton of stuff that's going to come out. And one of the challenges is like whenever you try to abstract, okay, like what's interesting here, right? Different things will be interesting to different people, right? Someone who just got hired at your company reading a narrative is going to take away something completely different than someone who's been there for like six years. And you can't know in advance. And so like, I can take away the things that I learned that I didn't know before and abstract those out, but I'm going to be throwing away useful stuff to someone else. But the narrative is the narrative. Like that's just what happened to the person, what they saw,
Starting point is 00:38:41 what they were thinking. And I like to go back reasonably far to how the system got into that state. The story of what were the code changes or config changes and what was the motivation for them. And so you have all of these individual stories and they're great. And I just love them.
Starting point is 00:39:00 And people love to read stories. They don't like reading bullets, but they love stories, right? Like this whole podcast, right? It's predicated on the idea of my understanding that like people love to hear stories about incidents. Oh yeah, that's true. So talking about incident stories, you've seen a lot at Netflix.
Starting point is 00:39:19 Are there any specific stories you could share with us? Yeah, I'll tell you about one that stands out. And I'm going to break the rule of not doing an analysis of your own incident. Because I was the incident responder for this one on core. And so I didn't even do a deep write-up. But I did a big, crazy timeline on it. So this is the 100% tracing incident that i was involved in so netflix is a uh we have a microservice architecture and there's like a an edge service like the front door the service is called zool it's open source it's on it's on github um
Starting point is 00:39:56 and so all the requests come into zool and then zool will tag some of them to trace right and we have a distributed tracing thing so we know who's talking to who. We trace normally one out of every 1,000 requests. So 0.1% of requests get tagged and traced. And that trace data gets pushed into our stream processing system. It goes on a Kafka queue. It gets consumed later on. And you can visualize it and so on. So one day, one of the engineers on the Zool team accidentally was debugging some kind of tracing issue.
Starting point is 00:40:27 So he had it on his local machine running Zool. He had turned the tracing on to 100%. He had switched to another development task, and he didn't realize he'd committed that change, the 100% change. And so he pushed it up with the other change. So he has now pushed up a change where 100% of requests, so like a thousand times more than we normally handle, are going to get sampled. So that goes out,
Starting point is 00:40:51 goes to Canary. So Canary only takes a fraction of traffic. And so 100% of that fraction of traffic is getting sampled. The downstream services are fine. They can handle that additional tracing.
Starting point is 00:41:10 There are some alerts that fire fire on the the stream processing side but they're like they're not critical they're like they're starting to fall behind because like it's a lot more data coming in than normal but they're like not like business critical paging things or email alerts and stuff and this is happens like a one in the morning um then um it gets promoted to from canary so we have like a staging sort of thing right so first it's canary and then like we run three different amazon regions right that goes to us east one first so like 9 30 in the morning us this gets deployed to us east one because the canary looks good right and like the zool itself is fine right like it's it's always processing 100 it's always dealing with all the requests right right? It just happens to be tagging. It goes, and now all of a sudden in US East 1, 100% of requests are being tagged and processed all along the thing, right? So now all of a sudden,
Starting point is 00:41:56 one of our error signals spike. Now, it didn't set off an alert yet, but someone on my team happened to see it because there was an all-hands meeting going on, and most but uh someone on my team happened to see it because um there was an all hands meeting going on and most of the people on my team were sitting in the situation room and the situation room happens to have all these monitors around it with the dashboards we don't usually use them for anything right because no one like it's just cycling through those dashboards and he sees these like those like errors like wait those are bad right now i'm not in that so i'm on call i'm not in that room i'm like no i'm not going to be like remote like errors like wait those are bad right now i'm not in that so so i'm on call i'm not in that room i'm like no i'm not going to be like remote like i want to go to
Starting point is 00:42:30 actually to where the the all hands meeting is right so i'm in the all hands meeting they're watching the all hands meeting from the situation room that my teammate sees this page is on call who's me so they're watching they're watching me stand up they're watching me look at my phone stand up and then come back to the situation room. And so what's happened is a bunch of services are like, the CPU is too high. We're getting weird errors. It's kind of weirdly distributed throughout the service. But there's no obvious change that's happened.
Starting point is 00:43:00 There's not one thing you can point to and say, okay, that's the problem. Now, so at the same time, in parallel, the observability team is, like, investigating why their alert went off. Like, why are we behind in processing these traces? And then the team that owns the stream processing infrastructure, right, which they're built on top of, their alerts also go off. And they're trying to figure this out. And so the observability guy asks the engineer on Zool who made the change, like, hey, did anything change at 1 a.m.? He says, no, nothing changed. The reason he says that is because Zool, the way they push is that they have a scheduled push at 4.
Starting point is 00:43:41 Their canary goes out at 4 p.m., and he pushed before that. He didn't know that at the time someone had manually started that pipeline and so his deployment got queued so his deployment didn't start until 1 in the morning and he didn't know that. So he's like, no, I pushed
Starting point is 00:43:58 yesterday at 4. Our canaries go out at 4. 1 o'clock? That's not us. That's not my change. So we don't know what's going on we're getting like tcp retransmit errors some people are saying oh there's got to be a networking problem let's bring in the networking people right some people are saying no it's not just because tcp retransmits are up that doesn't mean it's a networking issue um so networking guys come in he's like i don't think it's a networking problem. So you mentioned the Chaos Kong, right? So Netflix is able to fail over.
Starting point is 00:44:28 So I'm like, I don't know if it was my call in the end, but I was the core on call. So, okay, let's fail out. Let's fail out of U.S. East 1. So we move all the traffic out of U.S. East 1 to the other two regions, U.S. West 2 and EU. 30 minutes later, the deployment finishes in US-West and it goes to US-West too, right? That's the plant pipeline still running. And then I'm like, oh my god, it followed us, right?
Starting point is 00:44:52 The problem has followed us to this new region, right? And it just was a coincidence that that's the order in which the stage was happening. Now, in parallel, the stream processing guys are just investigating their why are we getting so much more, so much more, you know, traffic in our Kafka queues. And then another person on my team sees that in the, you know, the stream
Starting point is 00:45:16 processing infrastructure channel, they report like, hey, you know, we're having an issue, right? Like they don't, they don't show up in the core channel, because they're not really critical stuff, right? They're like operational data, like it doesn't, if're having an issue, right? Like they don't show up in the core channel because they're not really critical stuff, right? They're like operational data. Like if the Kafka stuff breaks, it doesn't affect streaming. But one of the core members sees that. He's like, hey, when did you start to see that happen, right? Like what was the timing behind that? And oh, that happened at the same time as we started to see it.
Starting point is 00:45:36 And then those streams come together, right? And then we realize what's happened. And the guy on the observability team, you know, turns down the tracing and everything goes back to normal, right? And so I love this story so much because it has all these combinations of, like, things people didn't know and then things people did know because they saw certain things, right? And so much of incidents are like that, are like these huge systems. They're all – there's so many interactions. Like, no one has the whole picture. And we're always going to be missing some, some critical piece.
Starting point is 00:46:08 And, but some people are going to get good at looking and saying, Hey, that's important, right? Like, Hey, those errors are, they were a real thing. Hey, like that, that issue that this, you know, stream processing team is having, like maybe that's connected, right? Like, uh, and, and so I, you know, I really love that, that, that incident. That, that, that, that's an amazing story. And wow, it sounds, it sounds like a tough day to be on call when you are facing with such a, when you're faced with such an incident. Go ahead. Yeah, I was going to say, so I thought going in that, like, it would be much more stressful
Starting point is 00:46:39 if like all of a sudden streaming, you know, video watching drops to zero, right? But that's not the stressful part. The stressful ones are the ones that take like hours and hours. You don't know what the problem is. It's sort of something's maybe wrong and then it goes away. And like, is everything okay? Can we like walk away now? Like, we don't know like if we're in a good state, cause we don't know why we were in
Starting point is 00:47:00 a bad state and okay, we're fine. And then it recurs like four hours later, man, I can tell you like cassandra clusters you know getting too hot like oh my god uh interactions between client libraries and and like the way the data is distributed and it's just oh those are the ones that are a nightmare like saturday all day oh those are a nightmare oh yeah i think whenever an incident happen happens and it suddenly goes away, I've seen some people like, oh, it went away. But I see other half of the room goes like, wait a second. We don't know why it went away. We don't know when it's going to come back again.
Starting point is 00:47:37 And oh, it's Friday. And you know it will come back. And you know it will come back, but you just don't know when. That's interesting so uh this this part that you mentioned where people have different perspectives and different information that they're seeing and they're making decisions accordingly um and it's fascinating when you combine all those views together um like you you know exactly why something happened and what happened and eventually how you can fix it uh You write, you have an amazing blog, surfingcomplexity.com.
Starting point is 00:48:09 If people haven't read it, I would highly recommend. Your fascination with incidents, at least I could gather your fascination with incidents from your blog itself, because you have so many posts talking about incidents in general, but not just from a software or just a technical standpoint. There is this perspective of humans too. So I was curious,
Starting point is 00:48:30 how did you develop that perspective where you're not just thinking about like, oh, this is software and why it broke, but all the human aspects involved? Yeah, I've been interested in the human aspects of software engineering in general for a while. I mean, like I said a little bit, like that's like what my PhD research was about, like software engineering and like studying people. But in terms of like humans as being part of the system, like the operational system, that was more like I would say that John Allspaugh like dragged me into that world.
Starting point is 00:49:03 Like, I'm sure you've seen many references to his stuff. Eventually I was like, all right, John, you're constantly posting about this stuff, I'm going to start reading about it. Right. I started to read about it and I got completely sucked into this world. So a lot of that sort of thinking is around resilience engineering and cognitive systems
Starting point is 00:49:25 engineering. There's this sort of history of research into this area. And so the cognitive systems engineering people, they talk about a joint cognitive system, a system that's composed of both the technical stuff, the software in our world, and the people, and that's all one single system. You'll often hear people talk about a socio-technical system, right? So the system is not just the software, it's the software and the people together, and they're doing something together in a way, right? And that just, over time, got under my skin, and I just couldn't stop seeing it anymore. And so now I always see the world that way. And a lot of people don't. Our technical culture separates that out, right? There's the technical stuff, right? Like, you give me the
Starting point is 00:50:10 requirements and the constraints and I will build that thing for you and we'll make trade-offs, right? But technically, you know, I think it's just cause and effect, and everyone thinks that way in technical worlds. Not just software people, all the engineers do. But to step back and see the people involved as part of the system, you just see things that you wouldn't have seen before, even though you're seeing the same world, right? But it's just from a different point of view. And there's just so much there, right? It gives me so much stuff to write about. Because people can connect, because they've seen it personally, but they just might not have thought about it from that point of view.
Starting point is 00:50:45 Yeah, that makes sense. And to that, I think one of your recent posts was about what identifies as, or what can you put down as, a root cause? And there were a few things which, when I first read them, I remember just sending it to Guang right away. I was like, well, we're going to have so much fun in this conversation. Like, one of the things that you mentioned was prioritization, for instance, where, like you mentioned earlier, the seed for an incident was planted much earlier in the process, where we knew it was a ticking bomb, we just didn't prioritize it. Or the other one was power dynamics, for instance, which is fascinating. Is your study of socio-technical systems, is that where this perspective comes from? Yeah, absolutely. So the prioritization one I love,
Starting point is 00:51:31 because I cannot tell you how many incidents I have seen where one of the contributing factors was some kind of vulnerability in the system that people knew about and were working on, but they had not completed that work yet. We knew this was a problem. We were going to address it. Or maybe in a few months it would have been released or something, but we had other priorities. And the problem is you don't know the next thing that's going to break you. You always have more work that you can possibly do than time to do it in. And so you don't know what thing is going to bite you.
Starting point is 00:52:04 And so that happens again and again and again, but it's very easy to say, well, why didn't you fix that? Well, like, I had a million other things. Well, I'm going to give you some more work to do. Right. So that's that. The power dynamics I started to think about more and more because I read more and more about sociological studies of technical organizations. And I found those really, really interesting, because of things like who gets to define what the problem is that we're trying to solve, right? Like, what are the problems in this doc? Or, like I said, who gets the headcount, right?
Starting point is 00:52:39 Like, power dynamics, or politics, sounds negative, right? But we live in, we function in, social organizations, right? This is how things get done, and it's just the reality. It's just as constraining as, I don't know, big-O notation or the CAP theorem, right? This is just how people interact. We negotiate, we have our own perspectives, right? Different people in different parts of the stack, they see different things, they care about different things, and they're going to advocate for the things they see from their perspectives. And that's just how people get stuff done in groups. And there's this
Starting point is 00:53:17 belief that there's the technical stuff, which is the good stuff that we should be doing, and then there's the politics stuff, which is like, my company is broken. And these things happen because of politics, not because of technical decisions. But what you call politics, that's engineering work too. You just don't see it that way because you don't realize how human beings actually get work done together in organizations. I got pushback on power dynamics from people, like, no, it's not power dynamics, they should have just asked for more, they should have been able to justify more headcount. And I'm like, well, every team can justify more headcount. There's only so many resources, right? This is why I talked about rhetoric, right? Like, you need to
Starting point is 00:53:56 be able to like figure out how to advocate for the things you think are important. Right. And you need to be able to do it in technical language, like not like people hear rhetoric and they think like, oh, I'm trying to like sell you something, right? Like it's negative. It's like, I'm trying to fool you. I'm trying to like manipulate you. But I mean rhetoric in the sense of like,
Starting point is 00:54:17 I am trying to like persuade you of something. Like I have a position and I want to be able to communicate it to you in a way that you'll believe me. And the way we do that in our field is we write technical documents where they read as if, oh, this is the way it works. This is the way the world is. Even though you've chosen the problems to describe and how you're going to solve them, that's a political thing you've done by describing these are problems I'm mentioning and you're not mentioning these other ones. That goes into input into people in upper management.
Starting point is 00:54:45 And they're going to use that to make decisions about things like headcount and what to invest in. And if you see that, then you're going to be more successful in an organization and getting big stuff done that you want to do. That makes so much sense. So there is so much to unpack here. I have so many questions to ask, but I know we're running out of time.
Starting point is 00:55:07 So I'll ask you this. Rhetoric: I saw this tweet where you mentioned that engineering and CS degrees should have a course in rhetoric, which, if I think about it, makes so much sense for all the reasons you just mentioned, for people who, say for instance, are still trying to figure out, how do I convince people of an idea that I have? And it could be, here's a technical design, or here's
Starting point is 00:55:32 an idea or an initiative that I want to propose. I don't think it's a level playing field, because I think for engineers, at least when they're starting out, technical skills are a lot more emphasized as opposed to some of these, for lack of a better word, soft skills. So how do you think people can develop this rhetoric, or just write persuasively, or even speak in meetings when they have these discussions? Yeah, I'm fortunate that I have an academic background where you kind of have to speak and write a lot. I mean, we're forced to. You can't get through a PhD without writing a lot of technical stuff to try to convince people of things, and going and giving talks. So I could say very flippantly, like, get a PhD, but I would not
Starting point is 00:56:18 actually recommend that for someone unless they really, really want to do that. Because that is not cost effective. You're going to make a lot more money with those four to six years working instead. You know, I would say it depends, right? In a smaller company, it's not as important. In a startup, for example, a lot of it is really spoken. Everyone gets on the same page by talking, and you can get things done just by talking to your teammates, right? And that's how people do things on their own teams. It's only when things get bigger, in larger organizations, that you
Starting point is 00:56:51 have to rely much more on writing, right? Startups are not memo-driven cultures in the way that large organizations are. So if you want to work at a large organization, you want to succeed and advance that way. One thing I would recommend is, you know, reading. So first, read a lot. I just read a lot of nonfiction. I really enjoy reading nonfiction, so it's easy for me to recommend. So I'm constantly reading something.
Starting point is 00:57:19 And like that's one way to get better at something, at least appreciating when it's good. Right. Is to read a lot. The writing is interesting because it's really hard to write if you're not interested in something. And so the challenge with practicing writing, like I've never been able to do journaling, for example. I just don't know what to write when I have a journal. Yeah, I empathize with that. My wife disagrees, but I empathize with that.
Starting point is 00:57:42 I try over and over again. On the other hand, when I read a lot of the resilience engineering stuff, there's all these different ideas that I get exposed to that I think I understand, I'm not 100% sure. And so I use my blog as sort of thinking out loud thing. My blog, those ideas are not super sophisticated. It's always very short, a couple of minutes of reading. And I use it to think out loud to see like, do I really understand this? If I do, then I can describe it in my own words, right? And so you need a volume of ideas that you care enough about that you want to write something about. And so once you've got that, once you're exposing yourself to ideas that you personally think are
Starting point is 00:58:24 interesting enough to write about, then you just have to work up the courage to write in public about it. You can write in a journal for that stuff. I feel like blogs, people have permission to just – similar to podcasts. They're not super polished. We talk off the cuff. I'm sure I've said ridiculous things. But people don't really care that much. It's ephemeral. And so, and like Twitter, you know, I've used Twitter for that,
Starting point is 00:58:49 but it's not a good idea to use Twitter to have half-baked thoughts. It's just not the right environment. It's too easy to do, but it's not the right environment for that. But blogging is great for that. And so sometimes, you know, I have no new ideas and I won't write anything for weeks, and sometimes I'll write like two posts in a day. Another thing I do,
Starting point is 00:59:10 so I'm a big fan of, uh, what's called like ubiquitous capture where like I carry index cards around with me. And if I have an idea, I, you know, write it down and go over it later.
Starting point is 00:59:19 That's from Getting Things Done. So I'm a big Getting Things Done fan. There's another book I really like, which I read many years ago and then bought again recently.
Starting point is 00:59:32 I actually have it on my bookshelf here. It's called Weinberg on Writing, The Fieldstone Method. It's by a guy named Gerald Weinberg. He wrote a book called The Psychology of Computer Programming back in the 70s or 80s. But he's written, I don't know, 30 books or something. And he has a great description of how to collect these ideas as you go so that you can accumulate these sorts of things. And I find his book, it's my favorite book on how to become a better writer, but just how to come up with ideas for stuff to write about. Because if you don't have anything that you care about,
Starting point is 01:00:06 then it doesn't matter. You're not really going to be able to write, I don't think. Yeah, that's really good advice. Do you have any other books that you would recommend to folks, either in writing or something that you've just read and you thought at least folks in this industry could just learn from?
Starting point is 01:00:24 Yeah, two books that I've read recently have had a big influence on me. One is called Designing Engineers by Louis Bucciarelli, I think his last name is. It was written by an MIT professor of engineering who did an ethnographic study of engineers. He went into a bunch of different engineering companies
Starting point is 01:00:44 and just like watched them do design work. And it's a really great book because it had a lot of influence on me and the role of the social process in design. And he goes and sits there and he sees they're arguing. There's this meeting where this engineer wants to, okay, here's the possible solutions
Starting point is 01:01:00 to this technical problem we have. Now we're going to go through this process to figure out how to rank them. And then they all start arguing: no, what should we actually care about? You know, we haven't even defined the problem really yet. And he's like, oh, that meeting was a total waste. But actually, the research was like, well, no, people didn't have a shared understanding of the problem, and you were developing that as you were talking. So it was seen as a total waste by the guy who
Starting point is 01:01:24 ran the meeting, who wanted to rank-order the 14 possible solutions. So I love that book, because whether or not you think that software engineers are engineers, when you read that stuff and he says naming is part of design and naming is really hard, you're like, well, that sounds pretty familiar to me. And he's talking about mechanical engineers. The other one I read recently is called Mission Improbable, by an author named Lee Clarke. He wrote about disaster recovery documents for situations that no one could actually know how to recover from, so things like, what happens after a nuclear war? What do people do in that situation?
Starting point is 01:02:05 Or what happens after a nuclear meltdown or, like, a huge oil spill? And its conclusion is that, well, we don't actually know. We don't have experience with these things. But they're written as if they know. So these documents have symbolic value, to convince people that we understand this problem, don't worry about it. And that has influenced me a lot on the rhetorical point, because it's written in technical language by experts: this is what would happen, this is how we would evacuate Long Island if there
Starting point is 01:02:34 was, you know, a nuclear meltdown there at the power plant. And just that notion of power dynamics being cloaked in technical writing had a lot of influence on me. And that's one of the reasons why you keep seeing me talk about this sort of stuff. I see. We will definitely link to both of these books in the show notes. What I find super interesting about this conversation is that I feel like, when I think about these sorts of things, a lot of engineers tend to focus on just the technical aspect, but then there's this whole gap between that and the results they actually see. And I think a lot of times it's very comforting to just focus on the technical stuff, because that's what you know well, and it's very easy to measure, right? It's like, oh,
Starting point is 01:03:20 I built out this many features, you know? But what you really like to focus on are the things that are harder to measure. And a lot of times, right, I hear complaints from my friends who are in engineering who are like, yo, why is this not working out? Like, I've done all this stuff. But what they're not looking at is that gap, right?
Starting point is 01:03:48 And I think, kind of translating from what you just talked about in terms of how you look at incidents and writing and rhetoric, translating that to metrics: so you wrote this article on preferring qualitative signals over quantitative. I thought that was a super interesting read. If you can just provide a little bit of context, like a TLDR on the post, that'd be super great.
Starting point is 01:04:12 Yeah, so man, we love metrics, don't we? And the reason- More numbers. More numbers. And to be fair to management, right? Like our systems like as an organization, they're really big. And even if you're a director or VP and you can spend all your time doing one-on-ones with people and you're still not going to see everything that's going on. And metrics make the system intelligible.
Starting point is 01:04:38 I can understand. I can manage that system because I have this number and I can see it going up or down, even though this is a huge thing with all sorts of different people. So I understand the appeal of metrics, because it makes the management problem tractable. So that's attractive. And we engineers like metrics. I mean, I'll tell you, I use metrics for alerting on my system. I'm not doing qualitative analysis there. I mean, I do logs, I do qualitative, but in that case I can put good metrics in. But if
Starting point is 01:05:09 you want to actually get as much as you can from the system, a couple of things. One is that metrics necessarily are throwing out a lot of information, right? That's what they do. We collapse the world into a few things, which means you're going to miss a lot of stuff, right? And I'm always worried about all the stuff that we are missing, because the stuff that the organization doesn't see, that's where the danger is. That's the stuff that doesn't get the resources. That's the stuff where, if you're not measuring burnout and your people start to burn out, but you're measuring attrition, like, you
Starting point is 01:05:45 know, you're too late, right? You do not want to get people into that death spiral. So I am always worried about the signals that are not measured with metrics. No matter how many metrics you add, for me, I'm going to be like, well, what are you not seeing, right? And the other thing that I personally find very frustrating about metrics: whenever you're dealing with metrics, so, like, on the core team, you have to measure, when does the incident start and stop, right? And what category does it go into, right? That time that you spend, like, okay, the metric dropped to here at this minute and then to there at this minute,
Starting point is 01:06:18 where should we say it started? There's no value to the organization in the time you spend picking which minute it started at, or deciding, okay, is this a latent bug or is this a recent one? It's an interaction of two things. Is it because of, I don't know, this external service? Right. The time you spend figuring out what bucket to put something in, that is all waste, right? You're not learning anything. That's all opportunity cost. But if you are doing qualitative analysis, if you are learning more about the system, you're asking, how did we get here? How did this happen, right? There's always an opportunity
Starting point is 01:06:55 for you to see something, to learn something about the system that you didn't know before. Now, maybe you won't, maybe you'll spend a lot of time and you won't get anything, right? It's possible. I mean, I have always learned something from all the qualitative analysis I've done on incidents. And one of the things you get better at is, given you have a limited amount of time, you've got to pick the stuff that seems most promising, right, based on your intuition and your hints. But I never feel like my time is wasted when I'm doing qualitative analysis. I'm always learning more about a system. And every incident and surprise is an indication there's something about your system that you didn't know that was really important, right?
Starting point is 01:07:38 But, like, metrics, like, the collection of the metrics itself, that work, there's no learning. And the insight that you get from the metrics, I have always found that any insight you can get from quantitative metrics, you can get more insight from qualitative with the same amount of work, right? So that's like my, that's my, my like claim, right? So that's why I think qualitative is better than quantitative is that any, any effort you spend on quantitative metrics, I think you would get like more insight. If that's what you care about, you get more insight. If you spent that time doing qualitative instead, and you're not going to annoy your engineers at like, you know, having them spend time, you know, doing arbitrary bucketing. And then you also don't have the horrible, you know, I'm not even talking about like the effect where like people like, you know,
Starting point is 01:08:15 do things to make the metrics look good. That's another problem already, right? And you hear this in different places, where they're chasing metrics instead of doing the work. My favorite metric is how many times someone had to make a decision between doing what they felt was right and doing what would increase the metric, right? That's a metric I would like to see, because every time that happens, that's bad, and it should be zero, and it never is. But with qualitative, you don't have that problem, because it's all based on judgment. And talking about judgment, I feel like, in theory,
Starting point is 01:08:52 I definitely agree with what you said, but I feel like it's hard to apply that to every company. So do you think the Netflix culture of freedom and responsibility is very much a necessary condition for this to thrive, such that you're giving individual engineers the trust to do the right thing here? Yeah, absolutely. I really think that you need to have... And this is one of the, I don't know, tenets of resilience engineering: the people at what they call the sharp end, the people who
Starting point is 01:09:28 are like actually like on the front lines, like they need to have initiative. They need to be able to make decisions in the moment about what to do. Right. And so you have to be able to trust their judgment because if you like centralize that, if you have to like, it has to go up the chain to get it and then it has to come back down. It's, you're going to be too slow and they're going to miss things. Right. And, and so I would say it's not even sufficient, but it's necessary. And one of the things I love about Netflix is that the engineers do have a lot of initiative and autonomy. And the guidelines are use your best judgment. What do you think is best for Netflix?
Starting point is 01:09:58 That's what you should do. Now, different people on different teams will disagree on what's best. And so there are lots of cat-herding challenges when we don't have mandates on things. But I really do think that people having initiative, and you letting people use their best judgment about what to do, is super important to be able to use these sorts of approaches. And maybe a follow-up that might not have a great answer, but I was still curious to get your thoughts on: for someone, say an IC working at a company where maybe the culture is not as strong in that regard, where there are a lot of hoops, are there any actionable tips that you have? They could either try to drive the change in terms of culture, but if that's too hard, anything they can do to make a small change for the better?
Starting point is 01:10:51 Yeah, I've never worked in a company like that. I can say what I would try to do. I don't know if it would be successful or not. But what I would try to do is sort of guerrilla style, like, on the side do some lead-by-example stuff. So one thing that I try to do on my team is write up when we have operational surprises, whether I'm involved or not. I often don't have time, so it doesn't always happen. But it's to say, look, I'm doing this sort of work, and here's the value, right? And you either see the value or you don't. And some people won't see the value.
Starting point is 01:11:27 And so you sort of have to infect people. You've got to get people higher up to see value in this work, so you've got to do a little bit of it and have them see it, and then maybe they'll either go along or not. That might not work. In many cases, it probably won't. But it's one thing to sort of say, look, you know, we really should do this, and it's another thing to actually try
Starting point is 01:11:49 to carve out some time, do it yourself, show them, and sort of create demand for it by trying to do it on the side yourself. Now, I was doing stuff on the side and then I was like, no, I don't want to do it on the side anymore. But on the other hand, I was able to make space for that. I think that work that I did helped me make space for moving on to the core team and doing that. But I had a sponsor. I had someone on that team who was an IC who became the manager who just really believed in that work.
Starting point is 01:12:19 And I think part of it was he saw the work I was doing. But part of it was that he was, I'll say, a humanist; he saw the human parts. And so a huge factor in whether this succeeds or fails is your management chain. So there's organizational culture, but in big companies the culture is also very, very local, right? Like individual teams. I can tell you, having been on three teams at Netflix, individual team cultures are very different. And so, you know, you can sort of either change your team or change your team, you know?
Starting point is 01:12:48 Damn, that's a good quote. I like it. I don't think it's actually mine, but, like, you can sort of convince people or not. My friends, in this case, I think the one that you referred to or the initiative that you took about talking about the near misses, I think Oops is the project that you named it. Okay, that's a good name.
Starting point is 01:13:11 With the qualitative analysis and something like the Oops project, where people are talking about some of the near misses that happened, which didn't result in an incident, but someone caught it one way or the other, someone knew the system very well, or there was a signal that told them it was going to blow up.
Starting point is 01:13:27 And I'm going to try and tie it back to the recent post we discussed earlier, which was about some of the interesting root causes that could be mentioned on incidents. Have you been able to convince the core team or other teams at Netflix to use any one of those as a root cause in an incident, where it's like a prioritization becomes one? So, I mean, it depends on who's doing the incident right.
Starting point is 01:13:53 I mean, we don't have like a canned list of, you know, I mean, we don't call them root causes, right, contributing factors is the term that we use. I have never seen power dynamics as a contributing factor. I don't think I could even put it there. I don't think I could put it there, right? So I don't think prioritization... What I have put is this work was being done.
Starting point is 01:14:20 So the work that would have prevented this, if it had gotten done, was still in progress. So I might not have labeled it prioritization, but at least I try to describe it. So that goes a little bit to bucketing: I try to say, here's a description of what it was. The hard part then becomes, how do you aggregate across a bunch of different ones, right? So you have a bunch of different incident write-ups, okay, what do I do with that? You can't just hand management, here's thirty 20-page write-ups, now figure out where to invest in reducing risk. That's terrible. They're going to say, go away, why are you wasting my time? So doing that thematic analysis, though, is hard. I was doing this with a teammate,
Starting point is 01:15:05 Ryan Kitchens, who's still on the core team, and then COVID hit and it just fell down. But we had categories, themes, like "somebody didn't know something," which is one of my favorite ones, right? There's some piece of context that's not shared between two different people in the organization, and that's a common contributor to incidents. But I think it's sort of the next step, and it's just very difficult to do, because, you know, who gets to pick those categories, how do you... But being able to do that... Yeah, I would not put power dynamics there myself.
Starting point is 01:15:48 And what do you even do with that, right? You could put that on every incident, in a sense. It's a fact of life. Maybe that would help ICs see the value of the role of things like docs and stuff. No, that makes sense. So you were on the core team. Thanks for being there. Thanks for running Netflix for us. You have two happy customers here, and your streaming viewership went up, I'm assuming significantly, due to the circumstances today with COVID and all,
Starting point is 01:16:20 but you've moved on to a different team now towards the managed delivery team. Before we close, do you want to share what you are doing on to a different team now towards the managed delivery team. Before we close, do you want to share what you are doing on the managed delivery team? Yeah, I'm really a traditional software engineer on the managed delivery team. So I don't have any, you know, particularly, you know, interesting role there. Like I'm just one of the software engineers because, because I'm more operationally minded on the team. I,
Starting point is 01:16:47 you know, do a little bit more stuff with the alerts and dashboards. I run the, we have like a "this week in managed delivery operations" meeting, where we talk about, what are the interesting things that happened this week in operations? And,
Starting point is 01:16:58 how did you figure out what the problem was? You could think of it as a very lightweight version of incident write-ups, where we're just talking in a meeting if there's something interesting that happened. So I'm kind of trying to upskill everyone on the team on the operations stuff and make that sort of work more visible. But I switched on to that team because I thought that the problem space of increasing the automation in the delivery system was really interesting.
Starting point is 01:17:28 And I can imagine all sorts of crazy things going wrong with automation. I don't know if I'm an automation skeptic, but I definitely appreciate that as you increase automation and as you add safety stuff, the failure modes get weirder. And one of the trends we're seeing at Netflix is that we're trying to provide higher-level abstractions to the app owners, which means that they are going to be, at some level, responsible for less operations work, and that scares me a little bit. And so I want to figure out, how do we build automation in a way that surfaces some of the operations stuff to them, to make it more visible, so that we
Starting point is 01:18:11 don't just end up with a magic black box where something goes wrong and nobody knows what the heck is going on, right? That's always the danger: when it works, it's great, and when it doesn't, I don't know how this thing works, I don't know what it's doing. It's been eight minutes. Why isn't this thing deploying yet? Restart. It didn't work again. Right? So I just thought it was a really interesting space to be in. Makes sense.
Starting point is 01:18:33 And by the way, this managed delivery, is this more like on software delivery or this is more content delivery? Software delivery. So this is deployment. So you can think of this as like a replacement for like a traditional pipeline-based system where you're describing, here are my environments. I've got a testing environment, a staging, and a prod. And to go from test to staging, these tests have to run.
Starting point is 01:18:54 And then to go from here to here, you have to do a manual judgment. Or you could say, OK, I want to pin back to this version, or I want to mark this version as bad, and it'll figure out automatically which one to pin back to. So, traditionally, pipelines, you can think of them as sort of the equivalent of bash scripts for how deployments go, and here it's more like a real software system that understands the domain, more about how the code flows through the different environments. So, I mean, we'll see. It's sort of an experimenty thing. Yeah, sounds fun.
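To make the "environments plus constraints" idea above concrete, here is a minimal, hypothetical sketch in Python. It is not Netflix's actual managed delivery tooling or its API; the Environment, DeliveryConfig, promote, and mark_bad names are invented purely to illustrate promotion constraints and automatic pin-back to the last known-good version.

```python
# Hypothetical sketch only: illustrates promotion constraints and pin-back,
# not the real managed delivery implementation or its API.
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Environment:
    name: str
    # Every constraint must pass before a version may be promoted into this
    # environment (e.g. "tests passed", "a human approved it").
    constraints: list[Callable[[str], bool]] = field(default_factory=list)
    current_version: Optional[str] = None


@dataclass
class DeliveryConfig:
    environments: list[Environment]
    bad_versions: set[str] = field(default_factory=set)
    history: list[str] = field(default_factory=list)  # versions seen, oldest first

    def promote(self, version: str, env: Environment) -> bool:
        """Deploy `version` into `env` if it isn't marked bad and all constraints pass."""
        if version in self.bad_versions:
            return False
        if all(check(version) for check in env.constraints):
            env.current_version = version
            if version not in self.history:
                self.history.append(version)
            return True
        return False

    def mark_bad(self, version: str) -> None:
        """Mark a version bad and pin any environment running it back to the last good one."""
        self.bad_versions.add(version)
        for env in self.environments:
            if env.current_version == version:
                env.current_version = self._last_good_before(version)

    def _last_good_before(self, version: str) -> Optional[str]:
        """Most recent earlier version that hasn't been marked bad."""
        idx = self.history.index(version) if version in self.history else len(self.history)
        for candidate in reversed(self.history[:idx]):
            if candidate not in self.bad_versions:
                return candidate
        return None


# Usage: three environments; staging requires tests, prod also requires a manual judgment.
def tests_passed(version: str) -> bool:
    return True                          # pretend CI already ran for every version


def manually_approved(version: str) -> bool:
    return version == "v2"               # a human only signed off on v2


test = Environment("test")
staging = Environment("staging", constraints=[tests_passed])
prod = Environment("prod", constraints=[tests_passed, manually_approved])
config = DeliveryConfig([test, staging, prod])

for v in ("v1", "v2", "v3"):
    for env in (test, staging, prod):
        config.promote(v, env)

config.mark_bad("v3")                    # v3 turns out to be bad...
print(staging.current_version)           # ...staging automatically pins back to "v2"
print(prod.current_version)              # prod stayed on "v2" (v3 never had approval)
```

The point of the sketch is the declarative shape Lorin describes: you state the environments and the conditions for moving between them, and the system, rather than a hand-rolled pipeline script, decides what to deploy or pin back.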
Starting point is 01:19:28 Yeah. So we're coming to the close. And this is one question we like to ask everyone. And we would love to get your answer for this one. Like, what was a tool that you recently discovered and liked? I've become a really big fan of Roam Research. So I don't know if people are familiar with this. If you look at it, you're like, oh, it's like an outliner and a wiki.
Starting point is 01:19:52 Big deal. That's not very interesting. But it's that combination of a wiki with backlinks, right? Okay, it's an outliner and a wiki with backlinks. So I read this How to Take Smart Notes book, I don't know if that book's sort of been going around, about Zettelkasten. There's this whole world of people who accumulate ideas on index cards and stuff and use them. But basically it's a tool that I find very useful for organizing my
Starting point is 01:20:21 thoughts. I really like outliners, and the outliner-wiki combination just works really, really well. There's this entire subculture of people who are super into it. I don't use the extensions, but they put all sorts of extensions in there. But I really, really like Roam Research, and I've stuck with it in a way that I have not stuck with other tools for keeping track of things, and so I would recommend people check it out. That's pretty neat. It also creates these graphs and connections based on all the ideas you put in, right? It does that. I don't really use that very often. It's kind of a novelty, I think. But it does do that. Yeah. And it's R-O-A-M, Roam Research. Yeah. Cool, cool, cool. At first,
Starting point is 01:21:03 I thought Rome, the city, and I thought it was going to be like an art thing. I was like, oh, sorry. Sorry for being offensive. Please continue. Sorry. It's okay. So, Lorin, is there anything else you would like to share with our listeners today?
Starting point is 01:21:20 Yeah. So if anyone's sort of interested in the human aspects of learning from incidents and learning more about them, there's this great website called Learning from Incidents in Software. It's at learningfromincidents.io. It's a community of people who are trying to move away from, you know, sort of simple, metric-heavy approaches to incidents to more like the narrative, and how do we learn more about what happened? So I definitely encourage you to check it out. Awesome. Yeah, we'll certainly link to it
Starting point is 01:21:55 in the show notes as well. It was learningfromincidents.io. Yeah. Awesome. By the way, I remember going through this website and there were some familiar names on the about page, I think: John Allspaw, Nora Jones, yourself, and a few others. Yep. Yeah, so, Richard Cook, Jessica DeVita, Will Gallego, Ryan Kitchens, Laura Maguire.
Starting point is 01:22:26 Those are all people in this community. That's awesome. We'll certainly recommend all our listeners to check it out. So, Lorin, before we let you go, one last question. I said that the other one was going to be the last one. But you have on your shirt, should I deploy on a Friday at 5pm?
Starting point is 01:22:42 What's the answer to that? We got to answer that. So have you seen my address? Oh, there's a flowchart. Yeah, it's a flowchart. For our audience who are not seeing this, I think the safe answer is no. No.
Starting point is 01:22:59 And, you know, I wear this shirt every single Friday. And I will say, you know, everyone loves it. So people love to fight over this on the internet, right? But everyone who has seen my shirt at Netflix loves my shirt. Even people who deploy on Friday afternoons, especially people who... you know, I have deployed Friday afternoon wearing this shirt. So I just, I think it's hilarious. I just love it.
Starting point is 01:23:25 That's a good shirt. I think we all need to get one. This has been awesome, Lorin. Thank you so much for taking the time with us. Yeah, I had a lot of fun. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
