The Changelog: Software Development, Open Source - Learning from incidents (Interview)

Episode Date: February 4, 2022

This week we're joined by Nora Jones, founder and CEO at Jeli, where they help teams gain insight and learnings from incidents. Back in December, Nora shared her thoughts in a Changelog post titled "'Incident' shouldn't be a four-letter word", which got a lot of attention from our readers. Today we're talking with Nora about all things incidents: the learning and growth they represent for teams, why teams should focus on learning from incidents in the first place, their Howie guide to post-incident investigations, and why the next emerging role is the Incident Analyst. She also shares a few book recommendations, which we've linked up in the show notes.

Transcript
Starting point is 00:00:00 What's going on? Welcome back. This is the Changelog. We feature the hackers, the leaders, and the innovators of the software world. If you're new to the channel, subscribe at changelog.fm. And the galaxy-brain move is to subscribe to our master feed. Check it out at changelog.com slash master. Today, we're joined by Nora Jones, founder and CEO at Jeli, where they help teams gain insight and learnings from incidents. A few months back in December, Nora shared her thoughts in a Changelog post titled "'Incident' Shouldn't Be a Four-Letter Word", and that got a lot of attention from our readers. So we invited Nora on the show to discuss all things incidents, the learning and the
Starting point is 00:00:38 growth they represent for teams, why teams should focus on learning from incidents in the first place, Jeli's Howie Guide to post-incident investigations, and why the next emerging role is an incident analyst. And she also shared a few book recommendations, which we've linked up in the show notes. Big thanks to our friends and our partners at Fastly. Bandwidth for the Changelog is provided by Fastly. You can check them out at fastly.com. This episode is brought to you by our friends at InfluxData. Act in time. Build on InfluxDB.
Starting point is 00:01:19 This is the platform developers use to build time series applications. Today I'm joined by Barbara Nelson, VP of Application Engineering. Barbara, we're working together to share some behind-the-scenes there at InfluxData. And one of the things you've shared time and time again is this idea of meeting developers where they are. What do you mean by that? It's really important to us that we're not expecting developers to make wholesale changes to their product or application to try and leverage the power of our platform. So it's a mindset, both in terms of the capabilities of what we deliver and how we deliver them. So why do we have the client API in 12 different languages? Because we're
Starting point is 00:01:55 meeting developers where they are in 12 different languages. We're not going to tell them, if you want to use our platform, you have to use Python. If you're using C Sharp, you use our platform in C Sharp. That mindset of meet the developers where they are means we sometimes end up building multiple versions of the same thing, but for different target audiences. So a lot of the capabilities that we expose in our web UI, we also expose in our VS Code plugin. Some developers are spending all their time in VS Code, so they want a solution that works where they are today. Okay, you heard it here first. Meet developers where they are. That's the mindset of InfluxData. How they build, how they ship,
Starting point is 00:02:37 how they think about DX in terms of you using this platform to build your next time series application. Bring your preferred languages. InfluxDB integrates with 13 client libraries: C Sharp, Go, Ruby, of course JavaScript, and so many more. Learn more and get started today at influxdata.com slash changelog. Again, influxdata.com slash changelog. So we are joined by Nora Jones, founder and CEO of Jeli. Nora, thanks for coming on the show.
Starting point is 00:03:28 Thanks for having me, Jared. Happy to have you. First heard of Jeli on our episode with Brittany Dionigi. She's the director of platform engineering at Articulate, and she mentioned you all during that episode on learning-focused engineering. I hadn't really heard about incident analysis as a formalized thing until then. And it just turns out that a colleague of yours, Daniela Hurtado, was listening to that episode.
Starting point is 00:03:54 In fact, I think she was a Brittany fan, which brought her to the show, and then she started consuming other episodes, and now she's a loyal listener. She was very excited to hear Jeli mentioned on that episode. And so she reached out, shout out to Daniela, and hooked us up. And that's kind of how we got here. Since then, you've published a post on our blog called "'Incident' Shouldn't Be a Four-Letter Word". And oftentimes it turns out it is, but it doesn't have to be. So all that to say, welcome. And tell us about Jeli and what y'all are up to.
Starting point is 00:04:26 Yeah, absolutely. So like you mentioned, we just published the "How We Got Here" guide. Jeli is an incident analysis focused company, like you said. And really what that means is taking incidents and looking at them under a lens that allows us to learn as much as we can from them. A big focus of ours is helping people learn like the experts of incidents, so that they can improve themselves as engineers, improve themselves as colleagues, and improve how the company acts as a system overall. And we really take a socio-technical lens with that. What that means is everything from figuring out how people work together, to figuring out how people learn, to figuring out how people teach others. And the tooling that we make looks through
Starting point is 00:05:09 or aggregates your coordination capabilities in Slack. On top of that, it aggregates PagerDuty, and we aggregate some organizational information as well, to kind of give you this full lens to learn as much as possible about your incident so that you can improve in the future. Now, I know that your history is in chaos engineering. You're kind of one of the pioneers of chaos engineering at Netflix and this whole resilience engineering community, which is like this little sub-tribe of the software community of which Jeli is a part
Starting point is 00:05:43 of. I know it's bigger than just Jeli. There's more people doing these things. Tell us a little bit of your backstory, how you got from there to here, and why Jeli is a thing today. Yeah, for sure. I started getting really interested in this when I was doing chaos engineering at Netflix. So chaos engineering is a form of experimenting
Starting point is 00:06:03 on production systems, creating turbulent conditions intentionally to see how the systems react, with the hypothesis being that they will react fine, which usually isn't the case, because we are always ever learning about our systems, and we're learning that they don't actually react fine to some things. So I got really into chaos engineering at Netflix. We were building pretty awesome systems around that. But I was trying to get people to think about what kind of chaos experiments to run. And I noticed that a lot of developers, especially developers working on features and such, didn't quite know how to get into that mindset. They didn't quite know how to come up with a failure scenario to explore, a failure scenario to run. And so I wanted to give them something that they might actually know, which is reflecting on past incidents. You know, every developer can tell you a really bad incident that they were a part of, or a really bad bug that they had to go into. And I was helping them identify patterns behind them.
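The experiment loop Nora describes, inject a turbulent condition, hypothesize the system still behaves acceptably, then check, can be sketched in a few lines. This is purely illustrative: the service, the 50 ms timeout budget, the injected latency, and the fallback behavior are all invented for the sketch, not taken from any real chaos tooling.

```python
import time

def call_service(latency_injection=0.0):
    """Simulated downstream call; latency_injection is the chaos variable."""
    time.sleep(latency_injection)
    return "ok"

def request_with_timeout(timeout=0.05, chaos_latency=0.0):
    """Steady-state behavior we care about: a request either succeeds
    quickly or degrades gracefully to a fallback instead of hanging."""
    start = time.monotonic()
    result = call_service(latency_injection=chaos_latency)
    elapsed = time.monotonic() - start
    if elapsed > timeout:
        return "fallback"  # degrade gracefully rather than block the caller
    return result

def run_experiment(trials=20, chaos_latency=0.1):
    """Hypothesis: even with injected latency, every request still returns
    something bounded (the real result or the fallback), never an error."""
    return [request_with_timeout(chaos_latency=chaos_latency)
            for _ in range(trials)]

# With 100 ms of injected latency against a 50 ms budget,
# the fallback path fires on every trial.
print(run_experiment(trials=5, chaos_latency=0.1))
```

The point of the exercise is the hypothesis: if the fallback never fires, or the system errors instead of falling back, you have learned something about how the system actually reacts, which is exactly the mindset shift Nora was coaching feature developers toward.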
Starting point is 00:06:58 And doing that and surfacing some of those patterns kind of allowed them to get into that mindset a little bit differently. But from doing so, I learned that there were so many more capabilities behind looking at those past incidents than just creating different chaos experiments. Those capabilities were actually teaching engineers more about their systems, teaching them to learn like the experts of those systems, and just really unearthing different capabilities. And so I really think of the incidents themselves as kind of like a mirror reflecting back. And it's a catalyst for learning about your system, because it's something that you didn't expect to happen. It's a natural list that gets born, too. It's like, these things happened.
Starting point is 00:07:40 Right. Chaos ensued, heads rolled. I don't know. You never know what happens, right? But it's a natural list that forms. And I think before, maybe before sort of, in quotes, "incident management", this list didn't naturally form. It was just sort of ad hoc. There was maybe a postmortem. I guess if it was big enough, there's a postmortem, that kind of thing. This postmortem tends to be the documentation that the public sees. Right. But how did the team conjure this list of what to look back on, as you said, to reflect on? It didn't naturally form. And now, kind of with the way the world's working with incident management, it's a bit more formalized. Yeah. And like you said, a lot of the times with the goal of a postmortem, you kind of end up losing sight of actually learning from the incident, which should be the ultimate goal. And I think, you know, some people are like, oh no, I've learned everything I can. But actually taking the time to write a really good report, and talk to your colleagues and interview them about the incident, and talk together about the incident, illuminates so much
Starting point is 00:08:39 more than you might expect, versus just kind of creating a public write-up for your customers, or a public write-up, you know, for the rest of the internet. And so I think as an industry, we've kind of lost sight of why we're doing incident reviews, why we're doing postmortems, and we're leaving a lot of ROI on the table for something that was already a really expensive and probably emotionally expensive event. Yeah, I'm glad you bring up the emotional side of it, because there sure is that aspect of this. We've had an idea of doing a podcast all about postmortems, like, you know, large incidents that happen online. Maybe you have to wait a certain amount of time, but like, bring
Starting point is 00:09:14 on the engineers who triaged that, or, you know, lived through it, and have them tell the tale. And we've thought about that, but I always think, like, do people really want to come on a show and talk about their fails? A lot of it can be foibles, it can be mistakes. You know, Adam joked about heads rolling, and that doesn't really happen literally, but it happens in the sense that a lot of times that postmortem is about who's at fault. And that can be emotionally damaging. It can be career damaging. These things are sensitive subjects, right? They can be sensitive subjects, yeah. And a lot of it is organizationally dependent, too.
Starting point is 00:09:51 I mean, so I worked on a lot of this at Netflix, but then I went full-time to Slack after Netflix to be an incident analyst. And it was right around the time that they were IPOing. And you can imagine that emotions are kind of high during such a big event. And after being at Slack for a little while, I really got the urge and the push to start Jeli. Like, oh, whoa, this wasn't just a Netflix thing I was seeing. It's happening at Slack. It's happening at other organizations throughout the industry too, which I learned about through the Learning From Incidents Slack community.
Starting point is 00:10:31 And since being at Jeli, we're getting to see it everywhere. And every organization handles incidents a little bit differently. They're very high stakes at some; they're a bit lower stakes at others. There are organizations that are hesitant to even call incidents yet, because they're still trying to find their fit as a product in the market. It's different everywhere. And yeah, like you said, some people might view it as career damaging, and our whole focus as a tool and as a company is to help companies not view it like that. You know, it's not any one person's fault. It's not any one piece of technology's fault. It's a product of the system that you created at your company. And how can you have a psychologically safe environment to talk about that, in a way that everyone can become better engineers, better coworkers? What was it specifically that you saw between Netflix and then Slack to be
Starting point is 00:11:16 like, okay, I need to do something about this? Like, can you give me a list? What were the specifics of, like, I'm done seeing this happen here and there, it's at scale, others have this problem? Give me a hit list of what was like, okay, I've got to start Jeli. Yeah. I think one of the biggest things was that my full-time job at Slack was doing chaos engineering and incident analysis. I was full-time analyzing some of their incidents. And I would write these very large reports,
Starting point is 00:11:45 disseminate them to the rest of the organization. So a lot of the focus was on good writing. A lot of the focus was on talking to people, but it wasn't going to be helpful if I was the one writing all of those reports. And so I was trying to teach other people how to effectively review incidents, how to effectively analyze incidents. And like, you know, you were telling me before, Jared, like, how do you incentivize engineers to do that? Like, why would they care about that? You know, don't they have other stuff to work on? And you're absolutely right. And I think my biggest selling point to them was, you will be a better engineer by studying an incident that you weren't a part of, or by studying a different part of the Slack system, you're going to get to know so much about how Slack works. And you're going to have this knowledge that other people
Starting point is 00:12:29 don't have. You're going to know where to look. You're going to know who to page. And the more people that you do that with, the better the company is going to run overall. And so that was my selling point. But it wasn't until I was working with someone that we trained on this... This particular person had been trying to get promoted to staff engineer for a very long time, and he was kind of stalled out at the senior engineer level. And he did some of this training. And it wasn't like he got promoted to staff directly because he did the training, but he started learning about the system in ways that
Starting point is 00:13:05 no one else knew. And right afterwards, he was like, okay, Nora, I actually get it now. Like, I feel like I'm seeing in color. And he just became like this ultimate resource. He really escalated his career there. He became someone that just knew a lot about the systems. And it looks like a superpower when you look at it from the outside, but really, he had been studying incidents. He had been studying the ways that the system had failed and gone wrong, and doing it in a very, very deep manner. Yeah. And so that was really amazing to see. And that really gave me the push. Like, I want to create more of this in the industry. I want to enable engineers to learn more.
Starting point is 00:13:45 I want to help companies be better at growing and teaching their engineers. Because I really do think, as an industry, we're not that great at it. We kind of incentivize people to stay a couple of years and then leave and go get another job. And, you know, engineers kind of feel like they're done learning at a certain job at some point. And I don't think that necessarily has to be the case. I think it'll be better for the companies. I think it's going to be better for the industry.
Starting point is 00:14:08 I think it's going to be better for us as consumers of these products if people kind of stick around, really get a feel for their systems, and really become a part of them. I like how this is a win-win-win kind of scenario, where there's a lot of winning happening whenever you really manage incidents well. So I imagine even very small incidents become better to document, because there's always learning in there. But it's the company that wins. It's the technology that wins. Then consumers win, because it's better. Yeah.
Starting point is 00:14:36 In the end, it's a better-engineered piece of software. Right. And then the engineers themselves, and those involved in the systems, building them and maintaining them, win, because they now understand them better, and they become more promotable, smarter in many more ways. And potentially, if there were incentives to stay beyond two years, maybe there's even an options bonus, or just something that says, hey, stay here a little longer and retain that
Starting point is 00:14:56 domain knowledge, and dig deeper, go vertical versus horizontal. Yeah, absolutely. I think in a lot of incidents, too... I mean, you can probably name some developer at your company that just seems to know more than everyone else. They're the person that you want when you have a horrible incident and you're just not quite sure what's going on, the person you just cannot get to, which I think is the wrong approach. And I think a lot of companies kind of bump those people up like that, too. But really, these people are just really good at learning. And there's this book called Minding the Weather. So a lot of our focus at Jeli is on cognitive interviewing, and knowing how to ask questions in ways that elicit answers that allow you, as the interviewer, to learn more. It's a skill that you have to learn: how to ask those questions in ways that experts can teach you how they're doing what they're doing, because a lot of the time they don't know. But anyway, this book called Minding the Weather, it's about expert weather
Starting point is 00:15:57 forecasters and how they do what they do. And a lot of the focus of the book is trying to teach people not how to be experts, but how to learn like experts. And flipping the script like that just makes a huge difference. And you're able to make more and more of those people. It's really cool to see. So I had the opportunity, early on in my career, to sit next to one of those people at this organization.
Starting point is 00:16:22 And he was one of these guys who had just internalized the entire system, right? From networking, to the database, to the way that they had one enterprise application that was custom, how that worked. Domain expert. He understood it was grain trading; he understood all that stuff. He wrote down nothing. Well, he had yellow sticky notes all over his desk for different things. I don't know, I would never look at his sticky notes, but he was not going to document anything. Yeah. And he's the guy who would just know what was wrong. It was almost like he intuited what had to happen. And I sat near him and observed this for a little while, just thinking, this person's amazing.
Starting point is 00:17:00 And he is, he is very good at what he does. And then I finally got to where I started asking him questions. Like, I'd be like, how did you know that? You know? And I started to prod, and I started to pull, and I learned so much about the system, about how to debug and troubleshoot things that he skips over, because he understands. You know, he's seven steps in and I'm still on step zero. And it would be amazing if everything that was in his head, or at least some sort of synthesized version of what's in his head, could be accessible to me async, without having to sit next to him and prod him and take up both of our time. You know, and today we don't even sit in the same office as our colleagues, right? So I see the value. I see how easy it is to silo this domain expertise, the systems expertise, that one person who you
Starting point is 00:17:47 always have to call, because they'll know what's wrong, right? And then all that knowledge just lives in these little silos of their brains, where if that were shared in a repository... just so much value. So much value. Yeah. And I mean, what you were doing, Jared, was interviewing him. You were doing cognitive-style interviews, and you were asking him how. And that language shift, that's really important. Getting them to show you, like, hey, you said this thing in channel. You said, hey, I think it's the database. Can you tell me how you got to that point? And actually have them pull up their laptop, show you what they were looking at. Those are context pieces that they're not putting in channel. Like you said, Jared, they're not writing them down. They might be on a sticky note, or in their head, but they're not writing down how they're thinking
Starting point is 00:18:47 through that. They're just like, I think it's the database, and everyone else is kind of like, he's just magic, she's just magic. Like, they just know these things, you know? And it's really not the case. Experts can be built. Yeah. So one thing that happens with me at the end of an incident, whether big or small, is I've just exerted all of this energy and time and probably stress, and I've fixed it, or somebody's fixed it. And then it's like, all this work that I was supposed to be doing... because, you know, an incident's never planned, unless you have a Chaos Monkey, I guess, going on. But for most of us, we don't have that kind of resiliency. We don't have these things planned. And so your actual work is piling up behind you. And you spent six hours chasing down some stray index in a database column somewhere
Starting point is 00:19:39 that's causing the server to spike, or whatever it was. The last thing you want to do is, like, now let's document this for future generations. Isn't that why these things oftentimes go unwritten, undocumented, unanalyzed? Because we've got to get back to our jobs. You've got to get back to your job, and you're tired. You're like, I don't want to think about this thing anymore. Let's put it to rest. Yeah, you're tired, you're emotionally exhausted.
Starting point is 00:20:01 Like, let's, it might've, you're tired, you're emotionally exhausted. But one of the cool things about incidents is you're doing whatever you can to stop the bleeding. And so a lot of times rules and procedures go out the window. And it's one of the only moments where you can see the reality of how your system works versus how you think it works. Like, oh, Adam's supposed to own this.
Starting point is 00:20:24 Jared's supposed to own this. We're supposed to go through these particular steps to get into this database. And that's not actually what happens. It's like, you know, maybe you called Stacey in, and Stacey hasn't been on that team in six years. So things like that happen. And by not actually documenting how it went down, you're bound to repeat a lot of those same things, and you're bound to keep that knowledge only in Stacey's head. But I get why the people that participate in the incident don't want to do it, because again, they're tired, they're stressed. They just went through this really traumatic thing and they don't want to dig it back up. And that's also part of why we recommend having someone that wasn't
Starting point is 00:21:06 involved in the incident do the incident review: document everything, talk to these experts, and try to think of it as conversations rather than interviews. I would sometimes go get coffee with someone in the kitchen at Slack and ask them about the incident. And a lot of the time, you can make it so comfortable for these people that they're actually really glad they have someone to talk to about it. You're there to be this trusted force, right? You're not there to document all the things that they did wrong. You're there to treat them like the expert they were, because they were an expert in some part of this incident. And your job as the interviewer is to unearth that expertise and write it down.
Starting point is 00:21:44 And so I'm asking you, Adam, like, you know, how did you know to do this at this particular time? Or, wow, this was two in the morning your time, you must have been a little bit tired. And slowly you start to get them to open up. I mean, y'all are podcast hosts, right? You're good at this, too. You are good at making the guests feel comfortable. You're good at asking them questions. You're good at getting them to share their expertise. And that's a lot of what incident interviewing is, too. And so eventually, people get kind of excited to tell you. They're like, wow, no one's asked me all these questions.
Starting point is 00:22:15 Like, let me tell you. Yeah. No one cares. No one cares. Somebody finally cares. Somebody cares. Someone finally cares about this database I've been screaming about for five years. Like, please get it changed.
Starting point is 00:22:27 And so you end up being this trusted person that is going to write down their story in a way that gets people to listen. Because they're tired. You know, they don't want to die on this hill anymore for this particular incident, even though it's been bothering them for a while. Yeah. How did you surface the third party, I guess? Because that's what I would call that person, a third party, the person who's in charge of documenting. Is that just something that surfaced? Is this something that you surfaced? Is this a chaos engineering incident thing? How did this surface, to kind of get a third party?
Starting point is 00:23:01 So a lot of what I learned about incident analysis is from other industries. There's a whole subset of research, dating back to Three Mile Island, about how to analyze incidents and accidents appropriately. And we do get lucky in software sometimes, in that our incidents don't always make the news, and people haven't died, or certain safety things haven't happened. That does happen in other industries, and it can be really hard to productively surface what was actually happening in the event, because people are trying to save their asses. They're trying not to go to jail. Yeah. A lot of these other industries, they're required to hire a third-party person, because
Starting point is 00:23:46 the people in the event can't see it clearly; they have such a nuanced view of what happened. But with a third party, you end up getting more information. So that was adopted from other safety-critical industries. Myself, John Allspaw, and a lot of the members of the Learning From Incidents community have been really studying outside of the software industry and trying to bring this stuff into the software industry. Yeah, I mean, software doesn't always feel safety-critical, but it does run the world. You know, the AWS outage, the Facebook outage, those impacted the world in a lot of strange, bizarre ways. This episode is brought to you by our friends at Sentry. Build better software faster. Diagnose, fix, and optimize the performance of your code.
Starting point is 00:24:55 Over 1 million developers and 68,000 organizations already use Sentry. That number includes us. Here's the absolute easiest way to try Sentry right now. You don't have to do anything. Just go to try.sentry-demo.com. That is an open sandbox with data that refreshes every time you refresh, or every 10 minutes, something like that. But long story short, that's the easiest way to try Sentry right now. No installation whatsoever.
Starting point is 00:25:23 That dashboard is the exact dashboard we see every time we log into Sentry. And of course, our listeners get a deal. They get the team plan for free for three months. All you've got to do is go to sentry.io and use the code changelog when you sign up. Again, sentry.io, and use the code changelog. When you mentioned going to jail, it reminded me of the movie Sully. Yes. Tom Hanks.
Starting point is 00:26:07 And that movie is just a really good visual example. I mean, I actually attribute a lot of learning even to movies. It was actually on a podcast at OSCON where I said I learned something about something from The Rock in some random movie. I remember that, Jared. It was kind of a funny thing. I remember that. It was years ago. I do bring up movies often for learning processes. But that kind of reminds me of Sully, this whole landing, and they couldn't figure it out. And they ran all the simulations. They
Starting point is 00:26:27 went through every single way, and they were, you know, investigated to the nth degree. And these guys could have lost their jobs, but, you know, they actually did a really good job of landing that plane in New York City, into the Hudson. And it was like, how did that even happen? But that kind of reminds me of that. It's like, do incidents have to be that bad? Probably not. But that's probably an extreme example. Like, okay, maybe a Log4j kind of thing could be out there. How did this happen? This is an industry-wide sort of security incident. Well, you know, I studied in the Lund course on human factors and system safety. It's a bunch of professionals from other industries. We had professionals in there from maritime, from medicine. And they have all these regulations around their accidents. And I'm in there with software, and I don't have any.
Starting point is 00:27:16 And so they're all like, how do you get to do all these cool things? Like, I have to use, you know, XYZ. And there was one day in class where we actually talked about Captain Sully with the guest lecturer that day, because a lot of these industries are required to use runbooks. And not only are they required to use runbooks in incidents, they're required to follow them to a T. And if they deviate off the runbooks, they risk facing jail time. But, you know, runbooks are only good until you develop expertise. And then, like you said, Jared, after that, you don't need something written down on a sticky note.
Starting point is 00:27:52 I mean, the guy that you were sitting next to that you referenced earlier, he wasn't using runbooks a lot of the time. He was maybe leveraging the runbook, but also combining it with what he knew about the system. And that happens in other industries, too. And that's what happened with Captain Sully. He deviated off of what he was supposed to do. And the guest lecturer put Tom Hanks on a slide, and he goes, who wouldn't want to be played by Tom Hanks in a movie? He's America's sweetheart. But he was like, the only reason this guy was played by Tom Hanks in a movie
Starting point is 00:28:24 is because no one died. He was like, if Sully had deviated from the runbook like he did and someone died, he would be in jail, even if he did the right thing. And so that's when the guest lecturer started talking about the Costa Concordia case, which was an Italian cruise ship that struck an underwater rock, capsized, and sank in the waters off Tuscany, and 32 people died. And what happened was this captain... like, you know, there are a lot more nuances to the story, but he basically kind of did what Captain Sully did. He deviated from the runbook.
Starting point is 00:28:58 And his deviation made it so that fewer people died. But 32 people still died, and this guy is in jail. Wow. And you know, if you look him up, and I encourage you to do that after the episode, he's shamed. They're like, you know, we saw the captain drinking, we saw him dancing with girls all night. They're really shaming this guy, even though he did have a lot of expertise in the moment, and his decisions made it so that the accident was not as bad as it could have been. And Captain Sully, he kind of got a bit lucky with his. It was a good movie for sure. I mean, good story and a good movie.
Starting point is 00:29:35 It's a really good story and a good movie. But the reason I bring that up is, like, in software, we do try to focus on our runbooks a lot. And that ends up being an action item after every incident: update this runbook, update X, Y, Z, why didn't you follow the runbook? And the other folks in my class are like, you don't have to use runbooks. Like, why are you doing this? We have to follow these checklists or else we go to jail. You don't have to do this in the software industry. Like, why don't you trust your expertise and your cognition a little bit more? And I don't think we quite know how as an industry. And so, yeah, again, that's some of the stuff that Jeli is trying to help with as a product
Starting point is 00:30:11 and as a company is unearthing that. Maybe our inclination towards using the runbook, even though it's not regulated or required, is because we spend our days telling computers what to do. Yeah. And when we do that, it's step one and then step two. And then if this condition, step three, but otherwise step four. And so we kind of live in an imperative, step-by-step, you know, world. And that's how we think.
Starting point is 00:30:37 And so if you can give me a script, that would be great. So we kind of think in scripts and in runbooks and maybe don't rely enough on our expertise or the, you know, our problem solving skills. I don't know. It's an interesting thought experiment. Yeah, it is interesting. And like runbooks serve an amazing purpose, like when you're learning, but then you pass this threshold where you're going to always be smarter than the runbook. And it's, you know, you kind of need to work with it jointly rather than like over-reliance on it. And I think that's a really great insight, Jared.
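Jared's point about imperative, step-by-step thinking can be made literal: a runbook really is data plus a tiny interpreter. A minimal sketch in Python (the scenario, the step names, and the replica-lag threshold are all invented for illustration, not taken from any real runbook):

```python
# Hypothetical sketch: a runbook as a list of steps, some guarded by conditions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    action: str                                    # what the responder should do
    when: Optional[Callable[[dict], bool]] = None  # optional guard on incident context

# A toy "database failover" runbook: plain steps plus one conditional branch.
RUNBOOK = [
    Step("Page the on-call DBA"),
    Step("Check replica lag on the standby"),
    Step("Promote the standby", when=lambda ctx: ctx["replica_lag_s"] < 5),
    Step("Wait for replication to catch up", when=lambda ctx: ctx["replica_lag_s"] >= 5),
    Step("Notify the affected customer"),
]

def run(runbook, ctx):
    """Return the actions that apply given the current incident context."""
    return [s.action for s in runbook if s.when is None or s.when(ctx)]

print(run(RUNBOOK, {"replica_lag_s": 2}))
```

Nora's threshold point also falls out of this sketch: the moment a responder knows something the `when` guards don't encode, the human's mental model is already ahead of the script, and the checklist becomes a teammate rather than an authority.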
Starting point is 00:31:10 Like we do think in steps, we're programmers, like we're trying to tell the computer what to do exactly. And we're upset when it doesn't work in the exact sequence. So we find the need to update it and update it and like allow ourselves to over-rely on it when really we need to work with it. We need to be its teammate almost. It might seem obvious what runbooks are, but what exactly is a runbook? Is it automated in most cases?
Starting point is 00:31:38 Is it truly an if-then, kick off scripts, move things around, back up data kind of things? Or is it sort of, you know, kick off a Slack message, kick off a Zoom meeting, you know, alert these people, like at what part does the runbook, you know, what exactly is a runbook in the industry's terms? And then maybe product-wise, it probably gets specific, but, you know, what's the general terminology for a runbook? What does it do? Yeah, a runbook is a step-by-step list of what to do when a certain situation comes up. And if you're not experiencing that situation a lot, it can be really helpful because you are not working out that muscle all the time. So you don't have this expertise in it. And so that's kind of that threshold that I was talking about.
Starting point is 00:32:19 You need a runbook in that particular case. But if you are working on this database all the time, like you're going to be smarter than the runbook because it's just not going to be updating with your mental model all the time. And I think there's some companies out there that claim to have these like automatic runbooks and like, you know, they're continuing to update them automatically, but it's not going to be as powerful as the human expert's mind ever. But yeah, it's a step-by-step list of procedures that you're supposed to follow when a certain thing happens. And it can be, it can have anything in it. It can say, give a call to this executive. It can say, you know, make sure this customer knows. It can say before, you know, after you have called this customer, make sure you do X, Y, Z in the database. Yeah, it's just a
Starting point is 00:33:06 list of procedures and reminders that I think people constantly get tripped up on or might forget if they're not written down. But our whole thing at Jeli is to write down a lot of your incidents and write down how people think. And if you're constantly writing those things down, you don't need to leverage the runbooks as much, because you're actually documenting expertise. And not only are you documenting it; when you're taking that in, when you're reading it, you're inherently building more expertise too. And so it's not like you'll never need runbooks. It's just, there doesn't need to be an over-reliance on them and an over-reliance on them being up to date all the time. So when you are advocating for this process at Jeli or with potential customers or users of your platform, where do you find that the buy-in has to come from?
Starting point is 00:33:56 Is it the teams or the developers thinking, oh yeah, we could really benefit from this? Or is it more like at a C-level, we're going to do incident analysis and we're going to have a corpus of data. Where does the buy-in usually come from? Or what are you seeing out there? I would say that there absolutely needs to be trust all the way from the top, that that is a good thing to do.
Starting point is 00:34:19 It just depends on how much the CEO and the executives trust their engineers and trust that their people know how to learn and have enough insights on the business to learn some of those things. It really depends on how transparent the company in question is. And so I've seen it come from everywhere. Sometimes we do have to convince folks in the C-suite. Sometimes an engineer or engineering managers have free rein and capability to drive this through. And I think those are the best organizations, because those engineers are trusted.
Starting point is 00:34:50 Those engineers are trusted to learn more and to teach more. But it absolutely needs to be supported. Like, I've seen some organizations where, you know, it's claimed to be supported, but it is kind of lip service a little bit. And, you know, the engineers are then pulled into other things and don't have time to actually study what happened. And, you know, then an incident is going to happen again in the future. And they're just going to keep repeating the same cycle over and over and over again. Yeah, it really depends, but it always needs support from the top. And not only that, the top needs to understand the value of incident analysis to their company, like why they should allow something like that to take place. And it really benefits them
Starting point is 00:35:33 overall. Like, the more you're doing stuff like this, the faster you're going to be at getting features out to your customers, and the more you're going to be collaborating together in a really beneficial way. It's a really powerful thing. And some of our customers we've been working with for over a year, and just seeing the changes their organizations have made even in that small amount of time is really, really cool. Like, just really transforming into this learning organization: collaborating better together, knowing who to put on certain projects, knowing how to organize themselves better. It's just really powerful.
Starting point is 00:36:08 And you can, it's a big competitive advantage, honestly, when you figure that out and your competitors haven't. There's probably a lot of people not doing it. So it seems like it could be, like certain competitive advantages become table stakes at a certain point. Like it's no longer an advantage. It's like, you need to be doing this because everybody is, but it seems like, and I don't
Starting point is 00:36:28 know, so you can tell me if I'm right or wrong on this, but it seems like probably not that many organizations have adopted this as a formalized practice. Is that fair? Yeah. Yeah. We're seeing it start to uptick. I mean, that's a big thing that we're honestly trying to change, is making it apparent. A couple of the customers that we've worked with, we started working with them last year, started working with a few engineers, and they're now hiring full-time incident analysts, which is pretty cool. Like, they're seeing the value of hiring this person that is a bit of, like, a distributed systems cultural anthropologist, just continuing to tell the story and continuing to build this corpus, like you said. Yeah. So right now it's kind of on the
Starting point is 00:37:09 bleeding edge. It's very nascent, but as it picks up, people will be required to get on board because they're just going to be moving slower than their competitors. It's going to help with the morale, though, too, right? Learning morale. Yeah. I mean, if it has the possibility to be promoted, I mean, that's going to be a good reason for a team to want to adopt a tool like this. When you look at the corpus, what is the corpus? What is the artifact of an incident? What does it look like to consume it as an outsider completely, as a newcomer to the team? How do you consume past incidents? What are they like?
Starting point is 00:37:45 It is interesting. Like, before folks adopt some of these practices, or, like, you know, when they're doing postmortems just to kind of get them done, they're usually not very well written, or they vary drastically in quality, depending on who's writing them, depending what other things they have going on in their org, and they're just trying to get it out the door. A really good quality incident review write-up contains, like, a narrative section on how the thing happened. And this is actually something that Jeli published in our Howie Guide, something that we open sourced recently: how to make this a really good write-up that people want to read and participate in. There's diagrams in it, like, documenting how
Starting point is 00:38:25 your system works, how it's functioning. You have talked to people that participated in this incident, and they are kind of informing the write-up too. It's not just done in a vacuum by one person. It's referencing different things. It's referencing GitHub links. It's referencing how folks talk to each other in Slack or how folks talk to each other in Zoom. It's referencing maybe the Jira ticket from three years ago that was latent that kind of led to this. It's mentioning, like, the business reasons that this incident came to be, or this particular change happened, or something like that. It's really just taking people on a journey with you. And so a lot of that involves good writing skills too. Like, no one wants to sit there and read a dry, boilerplate postmortem that's, you know,
Starting point is 00:39:10 a page and that you're not actually going to learn from, especially if you own the system too. You're like, yeah, I already know all this. And so it contains things that people maybe didn't know, so that it hooks their interest. And I think a really good way to discover that too is seeing: are people reading these? Are people commenting on them? Are they asking you questions on them? Are they DMing you? To quote from your Howie Guide, under "Distribute the findings" it says: your work shouldn't be completed to be filed. It should be completed so it can be read and shared across the business, even after the learning review has taken place and the corrective actions have been taken. Which is kind of interesting,
Starting point is 00:39:49 because, like, it's by design meant to be shared, you know? And when you write something, you always have to take the reader in mind. And so the reader really is everyone from the CEO to QA to support to sales, potentially even. I mean, it's an organizational learning opportunity. Yeah, it should be everyone, exactly. And a lot of the time it just ends up being a room full of engineers, and that's a huge miss. You're not going to actually improve, because it's not the engineer's fault. Like, you know, in some scale incidents, maybe the big focus is on SRE, but a lot of it needs to be distributed. It's like, how did the SRE work with marketing?
Starting point is 00:40:28 How did the SRE work with understanding the business? Like, yeah, you're absolutely right. And it's. I'm just quoting you back. So you're right. I'm just quoting you all back. She agrees with herself. You know, on the note of blame, though, I do appreciate the glossary where you talk about early on in the Howie Guide, you talk about being blame aware.
Starting point is 00:40:49 So I think that's important to note because I think earlier on in part one of the show, we kind of talked about this vulnerability and being blamed and there's heads rolling and things like that. Like it's a bad thing. Back to the four-letter word, like this incident or incidents in general are a four-letter word or why they shouldn't be. But under the glossary, you say for blame aware, you say an evolution from blameless in quotes. We recognize that everyone works with constraints. You mentioned before, Adam, you might have been up at two in the morning. So obviously you're going to be tired unless we loaded it at two in the morning. And then you say, and sometimes those don't appear until after an incident.
Starting point is 00:41:24 So the interview process reveals circumstances. Jared may be a stellar, A-plus rockstar, put whatever adjectives you want on his ability to engineer (keep going, the best ever, you know), but at two in the morning, maybe not so much, you know? So you have to take those things into context. And so these things don't appear until after an incident, and you say, we acknowledge our tendency to blame and name names, and move past it in order to be productive. I think it's important in this process because you're not trying to place blame, like in the Sully situation, where the circumstances are more, you know, more mortal really, like more possible that somebody could die or be hurt forever. Yeah,
Starting point is 00:42:03 maybe that's the case with software. Like, if you're Uber and your software goes bad, maybe somebody doesn't get picked up when they're stranded or in a situation and they need to be, so that situation could be dire. Absolutely. But to be productive and move past it and to learn, we have to sort of, like, remove all that blame game, is how I read that. Yeah. And a lot of that goes up to the top of organizations too. It's like, what are the CEOs and the executives doing to create that kind of learning culture, to make it safe to name names? And yeah. And you learn a lot about people doing these interviews too, like you were mentioning before, like things that you might not learn if you hadn't spoken to them. And you're learning more about your organization and how people are working. And having that information really helps your system too. It can help
Starting point is 00:42:50 your tenure. It can help people feel heard. It can help people feel collaborative. Well, it might have been revealed, like, why was Jared up at two in the morning, right? Why was he on pager duty four times in a row? Why was he... okay, so in this scenario, sure, he was here and this error occurred because of, you know, Jared's inability to be normal Jared. But why was he up at two in the morning and on this duty four times in a row? What's wrong with our cycles? Why aren't we cycling this enough? Why is this duty not shared more? How can we get him an apprentice and a backup?
Starting point is 00:43:19 You know, these are things, I mean, you start to reveal things about the organizational properties and how you promote potentially even and how you hire more people to help folks not be in these situations. It does really reveal a lot. And that's if you care, right? It comes down to caring about improving. Yeah. And there's, I mean, I see a lot of organizations where they're like, yeah, Jared's a superhuman. He's always up doing all these things. He knows everything.
Starting point is 00:43:44 Look at Jared just smiling. This is my favorite episode. The audience can't see this, but Jared is just grinning over here because we're just laying it on him, compliment after compliment. They're just talking about me like I'm not here. Jared's amazing.
Starting point is 00:43:54 He's this great superhuman man. And like, so I mean, your org might tell you, hey, Jared, we need you to stop doing some of this stuff and we need you to start documenting everything you know. Because they might realize that it's a bit of a problem that you're a knowledge silo. But the thing is, like, when a Jared or someone else is an expert in something, they are inherently bad at knowing what they have expertise in. They do not know how to write down what they know, because they just know it. Good point. They have no idea what other people don't know. And so I see like orgs and the Jareds in that situation get really stressed out.
Starting point is 00:44:31 And that's another reason why it's good to train other people in how to learn like experts, rather than having the experts learn how to teach. It's much better the other way. You get a lot more value out of it. What's going on, friends? This episode is brought to you by WorkOS. WorkOS is a platform that gives developers a set of building blocks for quickly adding enterprise-ready features to their applications. Add single sign-on with Okta, Azure, and more. Sync users from any SCIM directory.
Starting point is 00:45:23 HRIS integration with BambooHR, Rippling, and more. Audit trails. Free Google and Microsoft OAuth. Free Magic Link sign-in. WorkOS is designed for developers and offers a single, elegant interface. It abstracts dozens of enterprise integrations.
Starting point is 00:45:39 This means you're up and running 10 times faster so you can focus on building unique features for users. Instead of debugging legacy protocols and fragmented IT systems, you get RESTful endpoints, JSON responses, normalized objects, real-time webhooks, a developer dashboard, and framework-native SDKs. And even if your team is not focused on enterprise right now, you can still leverage WorkOS so you're not turning enterprise away. Learn more and get started at WorkOS.com. They have simple pay-as-you-grow pricing that scales with your usage and your needs. No credit card required.
Starting point is 00:46:16 Again, WorkOS.com. They also have an awesome podcast called Crossing the Enterprise Chasm, and that is hosted by Michael Grinich, the founder of WorkOS. Check it out at workos.com slash podcast. And by our friends at Subspace. Subspace is a network as a service that helps developers accelerate real-time applications for hundreds of millions of users worldwide. Their mission is to deliver real-time connectivity from anywhere to anywhere. And the standard internet wasn't built for the way the world works now. Reliability has always been the main priority, but in a remote workforce environment and in an era of real-time applications, developers not only need reliability, but they also need speed. When every millisecond counts, Subspace gives you the fastest, most reliable network to route your traffic through. But the question is, how does Subspace do it? They developed a fiber optic backbone
Starting point is 00:47:06 in hundreds of cities, plus AI that weather-maps the internet in real time. This gives their network the power to find the best paths and pull traffic through them in real time. It's like a private carpool lane and GPS, but for dynamic internet traffic. And it all works via a global IP proxy that takes just minutes to set up using a
Starting point is 00:47:25 simple API. No client-side installation is required. And if that sounds easy, that's because it is. Learn more and get started for free at subspace.com slash changelog. Again, subspace.com slash changelog. So as we talk to you, and as we talk to Britney about this and her emphasis on engineering learning and using incidents for learning, I totally am sold on the benefits. I think it's pretty obvious through this conversation. I'm wondering if it's kind of a hard sell to corporations, because the benefits are, after a conversation, somewhat obvious, but they're not all that quantitative. It seems like this is very much in the realm of qualitative things, where it's like, yeah, there are, but show me numbers, show me,
Starting point is 00:48:26 I can reduce my error counts by this much, or I can buy your product for X and it's going to save me Y, and Y is greater than X. And so it's an obvious purchase. Have you had a hard time getting buy-in from folks, or how's that going? And is it as I think it is, where it's kind of hard to quantify how this process benefits a corporation? Yeah, a lot of orgs index on the quantitative, like you said, because it's easier to see, it's easier to measure. And so I think, especially as orgs start to feel they're having a lot of incidents, they start counting things. They start counting how many incidents they're having. They start counting, like, what days they're having them.
Starting point is 00:49:05 They start counting how quickly they resolved them. They start, you know, all those things, because it gives them a sense of control. But really, I think especially in times of desperation, orgs can over-index on them, and it can actually be pretty detrimental to their learning. And so when you over-index on them, you are kind of tanking the learning. And so those numbers are not going to get any better. And people are just going to get more upset over them. And you're going to end up recreating the culture that you were trying to actually improve. And so one of my favorite equations is: performance improvement is a combination of error reduction plus insight
Starting point is 00:49:45 generation. And it's from cognitive psychologist Gary Klein. He studies organizations, like what makes them successful. He studies how experts think, and he put it in this book called Seeing What Others Don't. And it's about a lot of what we've spoken about today. But performance improvement can't happen with just one of those sides of the equation. It can't just happen with error reduction. It can't just happen with insight generation. It needs to be a combination of the two, right? It needs to be a combination of the quantitative and the qualitative. I mean, it goes back to, like, KPIs and OKRs. Like, yeah, it's great to have those numbers. But if you don't have a story behind them,
Starting point is 00:50:25 if you don't have context behind them, they're meaningless and harmful. And it's the same thing with error counts and such. And so you do have to get creative with how to quantify some of this learning. Maybe like, hey, how many people are reading the report? How many people are voluntarily showing up to this incident review meeting? How many people are talking in this incident review meeting? Like, are we getting people from sales? Are we getting people from marketing to come? Or is it still just engineers? There are little things you can do to track that improvement journey too. You can track, like, if people are interested or if they're giving you feedback. There's a lot of
Starting point is 00:51:01 things you can track, but I'd really index on, like, the collaboration and the learning opportunities, or if people are having those moments where they're like, oh, I didn't know it worked that way. That is, like, the best ROI you can get from one of those meetings. So there are ways to quantify it and track if you're doing it well. It's just not as clear cut. Yeah, aha moments. Seems like aha moments, yeah. For me, it's about pop culture, and I'll explain. Please do. I kind of feel like this is the joke that Jeff Foxworthy shares a lot: you might be a redneck if. And it's almost like, turn it differently: you might need incident management if, like, the last time you had an incident, people left. Or the last time you had an incident, someone was promoted because they were actually learning. How do they do that?
Starting point is 00:51:45 Yeah. You might need incident management and employ it in your organization. If the last incident meeting you had had two people in it and one person talked, you know what I mean? Some sort of like chart, you might need this if, and I know it's pop culture.
Starting point is 00:52:00 I'm going back to Sully: Sully is pop culture. And then Jeff Foxworthy is pop culture. Does Jeff Foxworthy still count as pop culture? In the 90s, pop culture. Yeah, it's still out there on the internet. Sure, it was popular and it was part of culture. So there you go. But you might be a redneck if, or you might need incident management if.
Starting point is 00:52:18 Yeah, there you go. Because then if you can answer those questions, it's like, hey, if this is a result of some of these things happening in your organization, then you could probably benefit. And if you've had this, then that's a direct sign of empathy, because somebody felt that pain. Somebody, you know, lost their job, was demoted, left. I mean, you might leave an organization if they're not managing incidents properly. If you're not learning from mistakes, why stay? Right. Right. Exactly. Yeah. And people are going to get stalled out and not feel great at the organization, which, like, yeah, this whole thing has business benefits to it. And I think that's what a lot of executives will need to realize is the business benefits
Starting point is 00:52:56 behind enabling and allowing something like this to take place. If the SREs are on the front line of this, though, is it the SREs and those who are surrounding the SREs, their adjacents, are they the ones coming to Jeli and saying, hey, I need this thing? Or are they already doing what Jeli does behind the scenes with bespoke internal software, and Jeli just sort of makes them not have to manage it anymore? Like, how does Jeli become a useful tool for, I guess, teams at large? Yeah. So what Jeli does is we help you ingest and collect the data from the conversation that you had during the incident. So the conversation data, like how you talked to each other, who you pulled in, who you paged, that points to a lot of the areas in your system that need technical explanation. Like, wow, why did we need Adam
Starting point is 00:53:43 here? Adam wasn't on call. He was on call for this incident yesterday. This incident looks like X, Y, Z. And so really what we're trying to get people to see is like the power of studying that chat transcript after it takes place. And we're giving you tools to study and to analyze and to mark up and annotate that chat transcript.
Starting point is 00:54:03 But not only that, you're aggregating it with that PagerDuty information, with that information from your HR systems, to really get a full, comprehensive view. So I haven't seen any internal tools at a lot of companies do this. Like, most organizations are writing something up in a Google Doc or a Notion page. But the thing is, those tools are not meant for incident analysis. They're not meant for learning. And so what Jeli does is give you this playground to learn, and all those tools at your disposal, so that you're getting the most out of it. And you're bumping up the quality of your review, but you're also kind of saving time too. I almost feel like it's akin to past scars for me, which was adopting Scrum slash Agile.
Starting point is 00:54:44 And you needed somebody to sort of guide you. Do you ever get into a situation where you need to employ a guide for the orgs that are hiring and buying your software? Do they often need somebody as like a concierge service or something like that to say, okay, you're doing it right. Somebody to kind of hold the hand of the process or is it pretty self-serve? It's a good question. It all depends on how much you want to invest in it. We strive to make the tool itself really self-serve. So you don't need that. We're trying to guide you through, like within the application itself, it doesn't assume that you've ever done any sort of incident analysis before. It kind of assumes that you've probably just written something up like in a Google doc or you haven't really thought about this space as much. So we really
Starting point is 00:55:29 aim to make the tool self-serve. However, we do have a few customers that are really invested in getting good at this. And those are the customers that are actually hiring full-time incident analysts. Whereas some organizations, like, they'll just have people do it as part of their job, and they're an SRE, they're a software engineer. So I think with those organizations, yeah, they're really trying to build it up and really trying to make it a thing. But it doesn't necessarily need a human guide. It needs champions, for sure. Is this the next big job in an org like this? I mean, incident analyst. Yeah. Yeah, I think so. I mean, having done the job a little bit and, you know, getting to work with customers that are really amazing at it, it's fantastic to see. Like, um, so like Indeed
Starting point is 00:56:18 the company, is creating jobs like this. Xero is doing that too. It's really cool to see, and it's really cool to see how it impacts the organization. So I do think it's probably a job to expect, and it does require, like, an engineering technical background, because you need to be able to ask the nitty-gritty questions
Starting point is 00:56:39 and you need to be able to talk to experts of these systems in ways that, like, y'all can understand each other. What did you call it earlier, distributed systems cultural anthropologist, or something like that? Yeah. It's a little long. I mean, how cool would that be as your title?
Starting point is 00:56:53 You know, that would be so cool. I mean, it does sound like a fun job. I mean, I guess as people that talk to others often as part of what we do, maybe Adam and I are predisposed to such a role. So maybe it's exciting for me, but other engineers are like, I never want to talk to a human. That's certainly an ethos out there.
Starting point is 00:57:12 You can call them a DSCA. I'm a DSCA. What is that? I'm a DSCA. Distributed Systems Cultural Anthropologist. That's a mouthful. It sounds super official. Yeah.
Starting point is 00:57:22 And DSCA, with an acronym like that, you've got to get paid pretty good money, I think. But you can put a comma and then DSCA after your name. It's like... Put a comma in. Yeah. That's right. Because now, you know...
Starting point is 00:57:32 There probably will be a whole little cottage industry of consultants popping up and selling things around this whenever there's new opportunity. There's gold in them hills. These things happen. The interview process sounds like it's crucial to this. And I know the Howie Guide obviously has, you know, a step-by-step guide through doing an interview and how to do it and stuff. Are these usually recorded? Are there usually, you know, audio artifacts? I highly recommend recording them if you can.
Starting point is 00:58:01 But the thing is, as the interviewer, or the cultural anthropologist of sorts, you need to really assure the person you're interviewing that what they say will be confidential, because you really want them to just fully share their experience. So part of it is knowing how to ask questions and unearth expertise, but the other part is having empathy and being a trusted source. And then you have to aggregate the stories of all the people you interviewed, because none of them have the full picture, and try to tell the full picture within all their stories.
Starting point is 00:58:38 And so you're giving them kind of an outlet to share. I do recommend they are recorded, but some people aren't comfortable with that. So I always ask the person I'm interviewing: if it's okay, I'd like to record this, just for my eyes only, so that I can pay attention to you right now and then look for things afterwards that you might have told me. But I always tell them, if you want to cut at any point, just go like this. I have had people do that before. I've had interviews where people have gotten very upset because they've been talking about a particular thing for so long and no one's quite listening to them. But that's data in itself,
Starting point is 00:59:15 right? And you have to figure out how to take that data and help make them look really good. So you just pointed out a particular attribute of this job that I wasn't necessarily considering, and I think maybe having some sort of specification on it is important, because you want that person to gain the confidence and the trust of the people they're talking to. How do they do that? You can give me that promise, but then, well, that's just your word to me. The employer may say, no, no, all of your sources are revealed to me, kind of thing. It's almost like a journalist: I can't reveal my source. How do we get to a point
Starting point is 00:59:57 where we can actually make that a legitimate attribute for this position? Because at some point you may lose that trust, and you might break that fourth wall, so to speak, and share things you probably regret. Basically, yeah. I mean, I've definitely had folks share things that they probably don't want to have said out loud, and it's really your role to keep it in confidence. Otherwise, as soon as you break that confidence, people are going to stop telling you stuff, and these incident reviews are not going to be that good anymore. Right. They won't trust the process anymore.
Starting point is 01:00:30 They don't trust the process. I mean, I've told people I'm training on this, it can feel a little bit like therapy, and that's okay, because they've honestly just had something stressful happen to them, and you are their outlet to vent a little bit too. The goal of the interviewer is not to just have a vent sesh or have them be super upset the whole time. The goal is to unearth what they have expertise on. An expert interviewer will know how to take some of that and say, wow, that must have been really hard for you, and kind of steer the conversation in productive ways too. But yeah, I'm sure something could be said
Starting point is 01:01:10 at some point that you kind of need to share. And ultimately that should be avoided, because otherwise it kind of ruins the program. And the organization and the people at the top need to be bought into it too, and not be asking the person what people are saying. It makes me think of HR in a way, because with an HR organization, you think, well, this team is for me. Yeah. Not in most cases, right? Maybe in some cases they are, maybe you personally made a friend or whatever, but really HR is there for the corporation, to help the corporation with its resources, which tend to be humans. In some cases, I guess we do have robots and software, but I wouldn't want this role to end up in the same light as HR.
Starting point is 01:01:53 Because if you want this to be productive and be a learning process, we have to protect that role in a way. So if this does become the next big job, how do we enable it to be a trusted resource? I mean, part of it is understanding the whole point of it. And the point is not to collect all the gossip or be an ear to the ground for the CTO. It's to help people learn. And so having the executives keep that in mind too: you are going to ruin this if you're asking those kinds of questions. I've actually been in orgs where I've interviewed the CTO after an incident. So they're not excluded from participating
Starting point is 01:02:30 in this either, because they are part of the system too. And so I think that can actually be helpful. And I've... And did you work for the CTO at the time, or were you a third party? No, I worked for the CTO at the time. Was there a strange power dynamic in that circumstance? It's interesting. I mean, obviously, inherently, there was. But no, in that particular room, in that particular moment, I was the interviewer and the CTO was a person that participated in the incident. Yeah. And we both fell into those roles. But yeah, I think that's a really important thing too: it's not just about interviewing the engineers, it's about interviewing everyone in the system. And that sometimes might involve someone in the C-suite.
Starting point is 01:03:10 Wow. It's like an investigator, right? It's like, I'm a detective now. Exactly. I was going to say, it's starting to sound like internal affairs or something. Yeah. That's true. Like IA, yeah. So depending on how you do it, you could be despised by your colleagues if you're not good at it. And that's the thing: you have to be really likable. So when I talk to orgs that are trying to roll out these programs,
Starting point is 01:03:34 I tell them you don't have to have this particular role to use Jeli. Jeli is meant to be used by any engineer in any organization. This is just if you want to take things to the next level. But I always tell people, you need to be likable. You need to be likable by your colleagues. You can't just be writing things down that they're saying; you need to be trusted by them. You need to have a good reputation too, which is why I usually recommend hiring someone internally to do this, someone that has been a good engineer, because
Starting point is 01:04:06 they've already built that trust with their colleagues. So for somebody who would like to get started as a distributed systems cultural anthropologist, or an incident analyst if you're more into brevity, we have the Howie guide linked up, of course, so everybody listening can reference that in the show notes. But in terms of anything else, in terms of "this is interesting to me as a career path, or for my org," obviously there's jeli.io. But where do you send people? I know there's a community of folks. This is not just yourself leading this charge; there's a community around this whole movement, right? Yeah, there's for sure a community. I started the Learning From Incidents community in early 2019 with a few of my Netflix colleagues.
Starting point is 01:04:51 There are a few other folks running it too, and it's grown to over 300 people at this point. We try to keep it intentionally small because we want people to be comfortable participating and sharing their incidents and their orgs. But it's across several different companies that are trying this and rolling it out. And we've open sourced a lot of the learnings from that community. You can go to learningfromincidents.io and see write-ups from folks all throughout the industry trying some of this stuff in their organizations. I'd say, as I get to the end of a show like this, I'm just thankful that someone like you and
Starting point is 01:05:23 others out there that care about this are pouring so much into it, and then, I guess, being so fed up slash inspired to literally found a company. Two and a half years ago, and you're still here, so maybe you're successful, maybe you're not; we haven't really gotten into the success of the company. But I'm so thankful that you're out there fighting this fight, and that there could be a cottage industry of a brand new role for us to move into, or for others to move into, because it's just good for the progress and entropy of how things work. So, totally, there are a lot of references at the bottom of the Howie guide, so I would point the listener to those references. We'll link it up, obviously. Nora, anything else, anything in closing, anything that you want to
Starting point is 01:06:03 share with our listeners before we call this show a show? There are a lot of ways to get good at this, and there's a lot of material out there, and it can be overwhelming. The whole reason for founding Jeli, and the whole reason for doing what we do, is to disseminate this more and make it more easily accessible, to give people a starting point without poring through all this material, all these academic papers and such. So yeah, that's kind of my closing thing, but it's been really great hanging out with y'all. It has been fun.
Starting point is 01:06:33 Nora, thank you so much for your time. We appreciate you. Yeah, thank you. Take care. Bye. That's it. This show is done. Thank you for tuning in.
Starting point is 01:06:43 What do you think about incidents? Are you scared? Are you excited? Are you learning? How does your team handle incidents? Do you have processes in place? Let us know in the comments. And if you haven't yet, check out Changelog++.
Starting point is 01:06:54 Support us directly and make the ads disappear. Learn more at changelog.com slash plus plus. Thanks again to our friends and partners at Fastly for having our back. Check them out at fastly.com. Also, thanks to Breakmaster Cylinder for making all of our awesome beats. And thank you to you for listening. If you enjoyed the show, do us a favor: tell a friend.
Starting point is 01:07:12 That's the best way you can help us. Word of mouth is by far the best way for shows like ours to grow. And, of course, the Galaxy Brain move is to subscribe to the master feed and get all our podcasts in one single feed. Check it out at changelog.com slash master. That's it. This show's done. Thanks for tuning in. We'll see you next week. Game on.
