Command Line Heroes - Fail Better: Embracing Failure
Episode Date: October 23, 2018
Failure is the heartbeat of discovery. We stumble a lot trying new things. The trick is to give up on failing fast. Instead, fail better. This episode looks at how tech embraces failure. Approaching failure with curiosity and openness is part of our process. Jennifer Petoff shares how Google has built a culture of learning and improvement from failure. With a shift in perspective, Jessica Rudder shows how embracing mistakes can lead to unexpected successes. And Jen Krieger explains how agile frameworks help us plan for failure. Failure doesn't have to be the end. It can be a step to something greater. If you want to learn more about open source culture and how we can all change the culture around failing, check out some of the blog features waiting for you at redhat.com/commandlineheroes.
Transcript
Stop me if you've heard this one.
Two engineers are compiling their code.
The newcomer raises his hands and shouts,
Whoa, my code compiled!
The veteran narrows her eyes and mutters,
Hmm, my code compiled.
If you've been in the coding game a little while,
something changes when you think about failure.
Things that used to look like
impossible problems begin to look like healthy parts of a larger solution. The stuff you used
to call failure begins to look like success in disguise. You expect your code to not compile.
You expect to play and experiment all along the way, fiddling, revising, refactoring.
I'm Saron Yitbarek, and this is Command Line Heroes,
an original podcast from Red Hat.
That whole fail-fast mantra, let's be honest, it often gets used as a way to try and shortcut things towards success.
But what if, instead of telling each other to hurry up and fail fast, we encourage each other to actually fail better?
Season two of Command Line Heroes is all about the lived experience of working in development.
What it really feels like and how it really pans out when we're living on the command line.
And that's why we're devoting a whole episode to dealing with failure.
Because it's those moments that push us to adapt.
The stuff we call failure, it's the heartbeat of evolution.
And open source developers are embracing that evolution. Of course, that's a lot easier said than done.
Imagine this. A brand new sonnet from the man himself, Shakespeare, gets discovered.
There's a huge rush of interest online. Everybody's Googling.
But then, this one little design flaw leads to something called file descriptor exhaustion.
That creates a cascading failure.
Suddenly, you've got all that traffic moving across fewer and fewer servers. Pretty
soon, Google's Shakespeare search has crashed, and it stays crashed for over an hour. Now,
you've lost 1.2 billion search queries. It's a tragedy of Shakespearean proportions,
all playing out while site reliability engineers are scrambling to catch up.
Okay, hate to break it to you, the Shakespearean incident isn't real. In fact, it's part of a
series of disaster scenarios in a book called Site Reliability Engineering. And one of the big
lessons from that book is that you've got to look beyond the disaster itself. Here's what I mean.
In the Shakespeare case, the query of death gets resolved when that laser beam of traffic
gets pushed onto a single sacrificial cluster that buys the team enough time to add more capacity.
But you can't stop there.
As bad as that issue was,
resolving it isn't where the real focus should be
because failure doesn't have to end in suffering.
Failure can lead to learning.
Hi, I'm Jennifer Petoff.
Jennifer works over at Google.
She's a senior program manager for their SRE team and leads Google's global SRE education program. And she's also one of the authors of that book, the one that describes the Shakespeare scenario.
For Jennifer, failure is how things get better. But only if you have a culture where mistakes and surprises are embraced.
So take the Shakespeare snafu again.
There is a straightforward solution.
Load shedding can save you from that cascading failure.
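To make that idea concrete, here is a minimal, hypothetical Python sketch of load shedding. It is not Google's implementation; the capacity limit and handler names are illustrative assumptions. The point is simply that a server which rejects excess requests outright fails cheaply, instead of queuing work until it exhausts its resources and drags the rest of the fleet down with it.

```python
# Minimal load-shedding sketch (illustrative only, not Google's actual code).
# Once the server is handling as much as it safely can, it rejects new
# requests immediately rather than letting them pile up until the process
# runs out of file descriptors or memory and triggers a cascading failure.

import threading

MAX_IN_FLIGHT = 100          # assumed capacity limit for this example
_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Raised when the server sheds a request instead of accepting it."""

def handle(request):
    """Process one request, shedding load if we are already at capacity."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Fail fast and cheap: a quick "try again later" beats a slow
            # collapse that spreads across the fleet.
            raise Overloaded("server at capacity, please retry")
        _in_flight += 1
    try:
        return do_real_work(request)   # placeholder for the actual handler
    finally:
        with _lock:
            _in_flight -= 1

def do_real_work(request):
    # Stand-in for whatever the service actually does with a request.
    return f"handled {request}"
```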
But the real work starts after things are back to normal.
The real work is in the post-mortem.
After the incident is resolved, a post-mortem would be created. Every incident at Google is
required to have a post-mortem and corresponding action items to prevent, but also to more
effectively detect and mitigate similar incidents or whole classes of issues in the future.
That's a key distinction right there. Not just solving for
this particular incident, but seeing what the incident tells you about a class of issues.
Postmortems, really effective ones, don't just tell you what went wrong yesterday.
They give you insights about the work you're doing today and about what you're planning for
the future. That broader kind of thinking instills a respect
for all those accidents and failures,
makes them a vital part of everyday work life.
So a really good postmortem addresses
more than just the single issue at hand.
It addresses the whole class of issues.
And the postmortems focus on what went well,
what went wrong, where we got lucky, and what prioritized action we can take to make sure this doesn't happen again.
If you don't take action, history is destined to repeat itself.
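As a rough illustration of the structure Jennifer describes, a blameless postmortem record might be captured in something like the following Python sketch. This is not Google's actual template; the field names are assumptions based on what she lists: what went well, what went wrong, where we got lucky, and prioritized action items.

```python
# Hypothetical sketch of a blameless postmortem record. Field names are
# illustrative, mirroring the structure described in the episode.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str      # e.g. "Add load shedding to the search frontend"
    owner: str            # a team or role, never a person to blame
    priority: str         # e.g. "P0", "P1"

@dataclass
class Postmortem:
    incident: str                       # short summary of what happened
    impact: str                         # e.g. "1.2 billion queries lost"
    what_went_well: List[str] = field(default_factory=list)
    what_went_wrong: List[str] = field(default_factory=list)
    where_we_got_lucky: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)
    # Deliberately no "who is at fault" field: the postmortem is blameless.
```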
At Google, there's a focus on blameless postmortems, and that makes all the difference.
If nobody's to blame when something goes wrong, then everybody can dig into errors in an honest way and really
learn from them without covering tracks or arguing. Those blameless postmortems have become
a key part of the culture at Google. And the result is a workplace where failure isn't something to be
afraid of. It's normalized. How does Google look at failure? 100% uptime is an impossible goal.
Like you're kidding yourself if you think that's achievable.
So failure is going to happen.
It's just a matter of when and how.
Failure is celebrated at Google.
So it's something we can learn from.
And postmortems are shared widely among teams to make sure that the things that are learned are widely available.
Failure is inevitable, but you never want to fail the same way twice.
To err is human, but to err repeatedly is just something that would be better avoided.
It's so interesting hearing the way Jennifer talks about failures
because it's like she's leaning into those mistakes.
Like when things go wrong, it means you've arrived at a place you can actually mine for value.
You deal with the situation in real time, but then afterwards you take the time to write up what happened so that others can learn from it.
With any incident, you pay the price when it happens, and you're not recouping any of that cost if you don't write up a postmortem and actually learn from that experience.
And I think that's a critical lesson.
We believe very strongly here at Google in a blameless culture.
You don't gain anything by pointing fingers at people, and that just incents people to cover up failure, which is going to happen regardless.
It's so important here to remember something Jennifer said earlier, that error-free work is a fantasy.
There will always be things that go wrong.
What it comes down to is a shift in thinking.
We can put away that idea that there is a single definable end goal, where everything will finally go the way we imagined.
There is no single there that we're trying to get to.
And it turns out, that's a hugely powerful and positive thing. Google's push for embracing failure makes a lot of sense.
Super practical.
But I wanted to know, is this just lip service?
Like, do we have some concrete examples of failure actually making things better?
Or is it all just a way to make ourselves feel better when we're hitting compile for the 200th time?
Turns out,
there's someone who can answer that. My name is Jessica Rudder. I'm a software engineer at GitHub.
Jessica has seen her share of failure over at GitHub. It's a failure arena in one sense.
And along the way, she's collected some stories about times when failure was the doorway to massive success.
Like this one.
So there was a game development company that was working on a brand new game in the 90s.
Essentially, it was a racing game.
But their sort of twist on it was that it was going to be street racing. So as the racers are racing through the streets, they're not only racing each other, but there are also NPCs that are cop cars that are chasing them
down. And if a cop car catches you, it's supposed to pull you over and then you lose the race. So
they get this code all wired up and they start running it. And what they discovered is that they completely calibrated the algorithm wrong.
And instead of the cop cars chasing the players' vehicles,
they would just come screaming out of side streets and slam right into them.
So it was just a total mess.
And instead of freaking out, they thought,
let's go ahead and see how people like it,
and that way we know what to tweak about the algorithm. So they sent it over to the playtesters,
and what they found was that the playtesters had way more fun running away from the cops and trying
to dodge being captured by these like rogue violent cop cars than they ever had with just the racing game itself.
And it was so much fun, in fact,
that the development team shifted the entire concept
that they were building the game around.
Can you guess where this is going?
And that's how we ended up with Grand Theft Auto.
I mean, it's literally the best-selling video game franchise of all time.
And the whole reason it exists is because when they failed to get the algorithm right,
they thought, well, let's try it out.
Let's see what we've got and let's see what we can learn from it.
Sort of amazing, right?
But here's the trick.
The Grand Theft Auto team had to stay receptive when they were hit with a failure.
They had to stay curious.
So if those developers hadn't been open-minded about it and decided to see what they could learn from this mistake,
we would never have had Grand Theft Auto.
We would have had some boring, run-of-the-mill street race game.
Sticking with the game theme for a minute, something similar happened when Silent Hill
was being produced. This was a huge AAA game, big-time production. But they had serious problems
with pop-up. Parts of the landscape weren't being processed fast enough, so all of a sudden,
you get a wall or a bit of road popping up out of nowhere. This was a deal-breaker problem,
and they were late in their development cycle. So what do they do? Scrap the game entirely?
Throw their hands up? Or embrace the problem itself? What they did was fill the world with a very dense, eerie fog. Because fog,
as it turns out, is really easy for the processors to render and not get any kind of delays. But
additionally, fog prevents you from seeing things at a distance. So in reality, those buildings are
still popping in, but you can't see it anymore because the fog is blocking your view.
So when they do come into view, they're already rendered
and it looks like they're coming out of the fog instead. The fog became so well loved that it's
basically considered another character in the Silent Hill franchise. It makes the gameplay way
scarier by limiting the player's vision.
And even when the processors got so fast that they didn't need to cover up those pop-ups anymore,
they kept the fog. You cannot have a Silent Hill game without fog. And all that fog was
doing initially was covering up a mistake. I love it. They saved a major development by
embracing their failure instead of running from it.
And that rule about not fearing failure applies to little individual things too,
not just company-wide decisions.
Looking failure calmly in the face is how we get better.
Bit by bit.
A lot of times people get too much into their own head and they think a failure means I'm bad at X.
It's not, oh, this code is broken and I don't know how to fix it yet.
It's, I don't know how to write JavaScript.
And you are never going to learn what you need to learn by saying, I don't know how to write JavaScript.
But if you can identify, oh, I don't know how to make this loop work in JavaScript, then you have something that you can Google and you can find that answer.
And it just works perfectly.
I mean, you're still going to struggle, but you're going to struggle a whole lot less.
So our mistakes nudge us toward bigger things. Those experiments, those fails, those heroic attempts, they make up most of the journey,
whether you're a new developer or the head of a major studio.
And nowhere is that more true
than in the open-source communities I've come to know and love.
Failure can be a beautiful thing in open-source.
And that's where our story goes next.
We saw earlier how failing well can lead to happy surprises, things we didn't even know we wanted to try. And at its best, open source development culture hits that mark.
It makes failure okay.
To understand how that willingness to fail gets baked into open source development, I got chatting with Jen Krieger.
She's Red Hat's chief agile architect.
We talked about attitudes toward failure in open source and how those attitudes shape what's possible.
Take a listen.
I want to touch on this mantra, I feel like that's probably a good way to put it: the fail fast and
break things, which, you know, is a big rallying cry, almost, I feel like, for our community.
What are your thoughts on that? I have a lot of thoughts on that.
I thought you might. Fail fast, fail forward, fail quickly, all those things. So to put that
into context, in the early days of my career, I worked in a company where there was no room for
failure. If you did something wrong and you brought down the one application, there was really no way, no room really, for anybody to do anything wrong.
And that just really wraps people around the axle. I think that whole idea of failing fast led us into almost like a cultural movement, if you would, that then spawned into that wonderful word agile,
into the wonderful word DevOps.
When I look at those words, all I'm seeing is that
we're simply asking teams to do a series of very small experiments
that help them course correct.
It's about, oh, you've made a choice and that's actually a
positive thing. You might take a risky decision and then you win because you've made the right
decision. Or the other side, which is you've made the wrong decision and you understand now that
that wasn't the right direction to go in. Yeah, that makes sense. So when you think about fail
fast and break things as being this movement, it feels like there's still some
structure, some best practices in how to fail, how to do that the right way. What are some of the
best practices, some of the principles around failing in a way that is good in the end?
I always like to tell engineers that they need to break the build
as early and as often as possible.
If they're breaking their build and they're aware that they've broken the build,
they have the opportunity in the moment to actually fix it.
And it's all wrapped around that concept of feedback loops
and ensuring that the feedback loops that you're getting
on the work that you're
doing are as small as possible. And so in open source development, I submit a patch and somebody
says, I'm not going to accept your patch for these nine reasons, or I think your patch is great,
move forward. Or you might be submitting a patch and having a bot tell you that it's failed because
it hasn't built properly. There's all sorts of different types of feedback. And then in open source development, you might also have longer feedback
loops where you say, I want to design this new functionality, but I'm not entirely sure what all
the rules should be. Can somebody help me design that? And so you go into this long process where
you're having long and detailed conversations where folks are participating and coming up with
the best idea.
And so there's all sorts of different feedback loops that can help you accomplish that.
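To make the short-feedback-loop idea concrete, here is a minimal, hypothetical pre-merge check in Python, a stand-in for the kind of bot Jen mentions that tells you a patch has failed before a human ever reviews it. The commands and project layout are assumptions, not any particular project's setup.

```python
# Hypothetical pre-merge check: a small, fast feedback loop run on every
# patch so a broken build is caught in minutes, not after the change has
# been merged and shipped.

import subprocess
import sys

# Assumed commands for an imaginary project; substitute your own steps.
CHECKS = [
    ("build", ["make", "build"]),
    ("unit tests", ["make", "test"]),
    ("lint", ["make", "lint"]),
]

def run_checks() -> int:
    for name, cmd in CHECKS:
        print(f"running {name}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast: report the first broken step and stop, so the
            # author gets feedback while the change is still fresh.
            print(f"FAILED: {name}. Fix it now, while the context is fresh.")
            return result.returncode
    print("all checks passed; ready for human review")
    return 0

if __name__ == "__main__":
    sys.exit(run_checks())
```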
Jen figures those feedback loops can look different for every company.
They're customizable, and people can make them work in a hundred different ways.
But the point is, she's not even calling them failures or mistakes.
She's just calling them feedback loops.
It's an organic system, such a healthy way of thinking about the whole process.
Meanwhile, there's one attitude toward those little glitches that has the exact opposite effect.
There are things that organizations do that are just flat out the wrong thing to do. Having your leadership team, or the organization at a very high level,
shame people for doing something wrong, or instill fear in relation to performance results.
And that looks like: if you don't do a good job, you won't get a bonus,
or if you don't do a good job, I'm going to put you on a performance plan.
Those are the types of things that create hostility.
What she's describing there is a failure fail.
A failure to embrace what failure can be.
And she's echoing Jennifer Petoff's attitude too, right?
That idea about blame-free postmortems we heard about at the top of the episode?
You know? Yeah, that's interesting. It's like if we are a little bit more
strict around how we work together, or maybe just more mindful, more purposeful in how we work
together, we will be almost forced to be better at our own failure. Yes, and there's companies out there that have learned this already.
They've learned it a long time ago.
Toyota is a perfect example of a company that embraces this concept
of continuous learning and improvement in a way that I rarely see at companies.
There's just this idea that anyone at any point
can point out something that isn't working properly.
It doesn't matter who they are, what level of the company they're in.
It's just understood in their culture that that's okay.
And that environment of continuous learning and improvement, I would say, would be one of those leading practices,
the things that I would expect a company to do to be able to accommodate failure and to allow it to occur. If you're asking
questions about why things aren't going well, instead of pointing fingers or trying to hide
things or blaming others for things not going well, it creates an entirely different situation.
It changes the conversation. It sounds like that creates maybe a different way that teams work within a company, within a tech team. Tell me a little bit
more about that. How has it changed the way developers see their roles and how they interact
with other people in the company? My early days of working with engineers
pretty much looked like the engineers all sat in a small area. They all talked to one another. They never really
interacted with any of the business people. They never really understood any of their incoming
requirements. And we spent an awful lot of time really focused on what they needed to be successful
and not necessarily what the business needed to actually get their work done. And so it was much more of a, I am an engineer,
what do I need in order to code this piece of functionality?
What I observe today in pretty much every team that I work with,
the conversation has shifted significantly to not,
what do I need as an engineer to get my job done,
but what does the customer or what does the user need to actually
feel like this piece of functionality that I'm making is going to be successful for them? How
are they using the product? What can I do to make it easier for them? A lot of those conversations
have changed. And I think that's why companies are doing better today on delivering technology that makes sense. I will also say that the faster
we get at releasing, the easier it is for us to know whether or not our assumptions and our
decisions are actually coming true. So if we make an assumption about what a user might want
before, we were having to wait like a year to two years to really find out whether or not that was actually true.
Now, if you look at the model of an Amazon or Netflix, they're releasing their assumptions about what their customers want like hundreds of times a day. And the response they get from folks
using their applications will tell them whether or not they're doing what it is the users need them to be doing.
Yeah, and it sounds like it requires more cooperation because even, you know,
the piece of advice you gave earlier about build, break the build, break it often, you know, that
kind of requires the engineering team or the developers to be more in step with DevOps,
right, in order for them to break it
and to see what that looks like
to do those releases early
and to do them often.
It sounds like it requires more cooperation
between the two.
Yeah, and it's always amusing to somebody
who has that title Agile coach
or in my case, chief Agile architect
because the original intent of the Agile manifesto was to
get folks to think about those things differently. We are uncovering better ways of developing
software by doing it and helping others do it. It is really the core heart and foundation of what
Agile was supposed to do. And so if you fast forward the 10, 15 plus years to the arrival of DevOps and the insistence that we have continuous integration and deployment, we have monitoring, we start thinking differently about throwing code over the wall, all that stuff is really what we were supposed to be thinking back when we originally started talking about Agile.
Absolutely.
So regardless of how people implement this idea of failure,
I think that we can both agree that the acceptance of failure,
the normalizing of failure is just a part of the process,
something that we need to do, something that happens, that we can manage, that we can maybe do the right way, quote-unquote,
is a good thing.
It has done some good for open source.
Tell me about some of the benefits of having this new movement, this new culture of accepting failure as part of the process.
Watching teams go from being really in a situation where they're fearful of what might happen
to a place in which they can try to experiment and try to grow
and try to figure out what might be the right answer is really great to see.
It's like they blossom.
Their morale improves.
They actually realize that they can own what it is that they're doing.
They can make decisions for themselves.
They don't have to wait for somebody to make the decision for them.
Failure as freedom.
I love it.
Jen Krieger is Red Hat's chief agile architect.
Not all open source projects reach the fame and success of big ones like Rails or Django or Kubernetes.
In fact, most don't. Most are smaller projects with just a single contributor, niche projects
that solve little problems that a small group of developers face. Or they've been abandoned
and haven't been touched in ages. But they still have value.
In fact, a lot of those projects are still hugely useful,
getting recycled, upcycled, cannibalized by other projects.
And others simply inspire us,
teach us by their very instructive wrongness.
Because failure, in a healthy, open-source arena,
gives you something better than a win. It gives you insight. And here's something else. Despite all those dead ends, the number of
open source projects is doubling about every year. Despite all the risky attempts and Hail Marys,
our community is thriving. And it turns out, we're not thriving despite our failures,
we're thriving because of them. Next episode, how security changes in a DevOps world.
Constant deployment means security is working its way into every stage of development. And that
is changing the way we work. Meantime, if you want to learn more about open source culture and how we can all change the culture around failing, check out the free resources waiting for you at redhat.com slash command line heroes.
Command Line Heroes is an original podcast from Red Hat.
Listen for free on Apple Podcasts, Google Podcasts, or wherever
you do your thing. I'm Saron Yitbarek. Until next time, keep on coding.
Hi, I'm Mike Ferris, Chief Strategy Officer and longtime Red Hatter. I love thinking about what happens next with generative AI.
But here's the thing.
Foundation models alone don't add up to an AI strategy.
And why is that?
Well, first, models aren't one-size-fits-all.
You have to fine-tune or augment these models with your own data,
and then you have to serve them for your own use case.
Second, one-and-done isn't how AI works.
You've got to make it easier for data
scientists, app developers, and ops teams to iterate together. And third, AI workloads demand
the ability to dynamically scale access to compute resources. You need a consistent platform,
whether you build and serve these models on-premise, or in the cloud, or at the edge.
This is complex stuff, and Red Hat OpenShift AI is here to help. Head to redhat.com to see how.