Command Line Heroes - Fail Better: Embracing Failure
Episode Date: October 23, 2018
Failure is the heartbeat of discovery. We stumble a lot trying new things. The trick is to give up on failing fast. Instead, fail better. This episode looks at how tech embraces failure. Approaching failure with curiosity and openness is part of our process. Jennifer Petoff shares how Google has built a culture of learning and improvement from failure. With a shift in perspective, Jessica Rudder shows how embracing mistakes can lead to unexpected successes. And Jen Krieger explains how agile frameworks help us plan for failure. Failure doesn't have to be the end. It can be a step to something greater. If you want to learn more about open source culture and how we can all change the culture around failing, check out some of the blog features waiting for you at redhat.com/commandlineheroes.
Transcript
Stop me if you've heard this one.
Two engineers are compiling their code.
The newcomer raises his hands and shouts,
Whoa, my code compiled!
The veteran narrows her eyes and mutters,
Hmm, my code compiled.
If you've been in the coding game a little while,
something changes when you think about failure.
Things that used to look like
impossible problems begin to look like healthy parts of a larger solution. The stuff you used
to call failure begins to look like success in disguise. You expect your code to not compile.
You expect to play and experiment all along the way, fiddling, revising, refactoring.
I'm Saron Yitbarek, and this is Command Line Heroes,
an original podcast from Red Hat.
That whole fail-fast mantra, let's be honest, it often gets used as a way to try and shortcut things towards success.
But what if, instead of telling each other to hurry up and fail fast, we encourage each other to actually fail better?
Season two of Command Line Heroes is all about the lived experience of working in development.
What it really feels like and how it really pans out when we're living on the command line.
And that's why we're devoting a whole episode to dealing with failure.
Because it's those moments that push us to adapt.
The stuff we call failure, it's the heartbeat of evolution.
And open source developers are embracing that evolution. Of course, that's a lot easier said than done.
Imagine this. A brand new sonnet from the man himself, Shakespeare, gets discovered.
There's a huge rush of interest online. Everybody's Googling.
But then, this one little design flaw leads to something called file descriptor exhaustion.
That creates a cascading failure.
Suddenly, you've got all that traffic moving across fewer and fewer servers. Pretty
soon, Google's Shakespeare search has crashed, and it stays crashed for over an hour. Now,
you've lost 1.2 billion search queries. It's a tragedy of Shakespearean proportions,
all playing out while site reliability engineers are scrambling to catch up.
Okay, hate to break it to you, the Shakespearean incident isn't real. In fact, it's part of a
series of disaster scenarios in a book called Site Reliability Engineering. And one of the big
lessons from that book is that you've got to look beyond the disaster itself. Here's what I mean.
In the Shakespeare case, the query of death gets resolved when that laser beam of traffic
gets pushed onto a single sacrificial cluster that buys the team enough time to add more capacity.
But you can't stop there.
As bad as that issue was,
resolving it isn't where the real focus should be
because failure doesn't have to end in suffering.
Failure can lead to learning.
Hi, I'm Jennifer Petoff.
Jennifer works over at Google.
She's a senior program manager for their SRE team and leads Google's global SRE education program. And she's also one of the authors of that book, the one that describes the Shakespeare scenario.
For Jennifer, failure is how things get better. But only if you have a culture where mistakes and surprises are embraced.
So take the Shakespeare snafu again.
There is a straightforward solution.
Load shedding can save you from that cascading failure.
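To make that idea concrete, here is a minimal, hypothetical Python sketch of load shedding. It is not Google's implementation; the capacity limit and handler names are illustrative assumptions. The point is simply that a server which rejects excess requests outright fails cheaply, instead of queuing work until it exhausts its resources and drags the rest of the fleet down with it.

```python
# Minimal load-shedding sketch (illustrative only, not Google's actual code).
# Once the server is handling as much as it safely can, it rejects new
# requests immediately rather than letting them pile up until the process
# runs out of file descriptors or memory and triggers a cascading failure.

import threading

MAX_IN_FLIGHT = 100          # assumed capacity limit for this example
_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Raised when the server sheds a request instead of accepting it."""

def handle(request):
    """Process one request, shedding load if we are already at capacity."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            # Fail fast and cheap: a quick "try again later" beats a slow
            # collapse that spreads across the fleet.
            raise Overloaded("server at capacity, please retry")
        _in_flight += 1
    try:
        return do_real_work(request)   # placeholder for the actual handler
    finally:
        with _lock:
            _in_flight -= 1

def do_real_work(request):
    # Stand-in for whatever the service actually does with a request.
    return f"handled {request}"
```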
But the real work starts after things are back to normal.
The real work is in the post-mortem.
After the incident is resolved, a post-mortem would be created. Every incident at Google is
required to have a post-mortem and corresponding action items to prevent, but also to more
effectively detect and mitigate similar incidents or whole classes of issues in the future.
That's a key distinction right there. Not just solving for
this particular incident, but seeing what the incident tells you about a class of issues.
Postmortems, really effective ones, don't just tell you what went wrong yesterday.
They give you insights about the work you're doing today and about what you're planning for
the future. That broader kind of thinking instills a respect
for all those accidents and failures,
makes them a vital part of everyday work life.
So a really good postmortem addresses
more than just the single issue at hand.
It addresses the whole class of issues.
And the postmortems focus on what went well,
what went wrong, where we got lucky, and what prioritized action we can take to make sure this doesn't happen again.
If you don't take action, history is destined to repeat itself.
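As a rough illustration of the structure Jennifer describes, a blameless postmortem record might be captured in something like the following Python sketch. This is not Google's actual template; the field names are assumptions based on what she lists: what went well, what went wrong, where we got lucky, and prioritized action items.

```python
# Hypothetical sketch of a blameless postmortem record. Field names are
# illustrative, mirroring the structure described in the episode.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionItem:
    description: str      # e.g. "Add load shedding to the search frontend"
    owner: str            # a team or role, never a person to blame
    priority: str         # e.g. "P0", "P1"

@dataclass
class Postmortem:
    incident: str                       # short summary of what happened
    impact: str                         # e.g. "1.2 billion queries lost"
    what_went_well: List[str] = field(default_factory=list)
    what_went_wrong: List[str] = field(default_factory=list)
    where_we_got_lucky: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)
    # Deliberately no "who is at fault" field: the postmortem is blameless.
```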
At Google, there's a focus on blameless postmortems, and that makes all the difference.
If nobody's to blame when something goes wrong, then everybody can dig into errors in an honest way and really
learn from them without covering tracks or arguing. Those blameless postmortems have become
a key part of the culture at Google. And the result is a workplace where failure isn't something to be
afraid of. It's normalized. How does Google look at failure? 100% uptime is an impossible goal.
Like you're kidding yourself if you think that's achievable.
So failure is going to happen.
It's just a matter of when and how.
Failure is celebrated at Google.
So it's something we can learn from.
And postmortems are shared widely among teams to make sure that the things that are learned are widely available.
Failure is inevitable, but you never want to fail the same way twice.
To err is human, but to err repeatedly is just something that would be better avoided.
It's so interesting hearing the way Jennifer talks about failures
because it's like she's leaning into those mistakes.
Like when things go wrong, it means you've arrived at a place you can actually mine for value.
You deal with the situation in real time, but then afterwards you take the time to write up what happened so that others can learn from it.
With any incident, you pay the price when it happens, and you're not recouping any of that cost if you don't write up a postmortem and actually learn from that experience.
And I think that's a critical lesson.
We believe very strongly here at Google in a blameless culture.
You don't gain anything by pointing fingers at people, and that just incents people to cover up failure, which is going to happen regardless.
It's so important here to remember something Jennifer said earlier, that error-free work is a fantasy.
There will always be things that go wrong.
What it comes down to is a shift in thinking.
We can put away that idea that there is a single definable end goal, where everything will finally go the way we imagined.
There is no single there that we're trying to get to.
And it turns out, that's a hugely powerful and positive thing. Google's push for embracing failure makes a lot of sense.
Super practical.
But I wanted to know, is this just lip service?
Like, do we have some concrete examples of failure actually making things better?
Or is it all just a way to make ourselves feel better when we're hitting compile for the 200th time?
Turns out,
there's someone who can answer that. My name is Jessica Rudder. I'm a software engineer at GitHub.
Jessica has seen her share of failure over at GitHub. It's a failure arena in one sense.
And along the way, she's collected some stories about times when failure was the doorway to massive success.
Like this one.
So there was a game development company that was working on a brand new game in the 90s.
Essentially, it was a racing game.
But their sort of twist on it was that it was going to be street racing. So as the racers are racing through the streets, they're not only racing each other, but there are also NPCs that are cop cars that are chasing them
down. And if a cop car catches you, it's supposed to pull you over and then you lose the race. So
they get this code all wired up and they start running it. And what they discovered is that they completely calibrated the algorithm wrong.
And instead of the cop cars chasing the players' vehicles,
they would just come screaming out of side streets and slam right into them.
So it was just a total mess.
And instead of freaking out, they thought,
let's go ahead and see how people like it,
and that way we know what to tweak about the algorithm. So they sent it over to the playtesters,
and what they found was that the playtesters had way more fun running away from the cops and trying
to dodge being captured by these like rogue violent cop cars than they ever had with just the racing game itself.
And it was so much fun, in fact,
that the development team shifted the entire concept
that they were building the game around.
Can you guess where this is going?
And that's how we ended up with Grand Theft Auto.
I mean, it's literally the best-selling video game franchise of all time.
And the whole reason it exists is because when they failed to get the algorithm right,
they thought, well, let's try it out.
Let's see what we've got and let's see what we can learn from it.
Sort of amazing, right?
But here's the trick.
The Grand Theft Auto team had to stay receptive when they were hit with a failure.
They had to stay curious.
So if those developers hadn't been open-minded about it and decided to see what they could learn from this mistake,
we would never have had Grand Theft Auto.
We would have had some boring, run-of-the-mill street race game.
Sticking with the game theme for a minute, something similar happened when Silent Hill
was being produced. This was a huge AAA game, big-time production. But they had serious problems
with pop-up. Parts of the landscape weren't being processed fast enough, so all of a sudden,
you get a wall or a bit of road popping up out of nowhere. This was a deal-breaker problem,
and they were late in their development cycle. So what do they do? Scrap the game entirely?
Throw their hands up? Or embrace the problem itself? What they did was fill the world with a very dense, eerie fog. Because fog,
as it turns out, is really easy for the processors to render and not get any kind of delays. But
additionally, fog prevents you from seeing things at a distance. So in reality, those buildings are
still popping in, but you can't see it anymore because the fog is blocking your view.
So when they do come into view, they're already rendered
and it looks like they're coming out of the fog instead. The fog became so well loved that it's
basically considered another character in the Silent Hill franchise. It makes the gameplay way
scarier by limiting the player's vision.
And even when the processors got so fast that they didn't need to cover up those pop-ups anymore,
they kept the fog. You cannot have a Silent Hill game without fog. And all that fog was
doing initially was covering up a mistake. I love it. They saved a major development by
embracing their failure instead of running from it.
And that rule about not fearing failure applies to little individual things too,
not just company-wide decisions.
Looking failure calmly in the face is how we get better.
Bit by bit.
A lot of times people get too much into their own head and they think a failure means I'm bad at X.
It's not, oh, this code is broken and I don't know how to fix it yet.
It's, I don't know how to write JavaScript.
And you are never going to learn what you need to learn by saying, I don't know how to write JavaScript.
But if you can identify, oh, I don't know how to make this loop work in JavaScript, then you have something that you can Google and you can find that answer.
And it just works perfectly.
I mean, you're still going to struggle, but you're going to struggle a whole lot less.
So our mistakes nudge us toward bigger things. Those experiments, those fails, those heroic attempts, they make up most of the journey,
whether you're a new developer or the head of a major studio.
And nowhere is that more true
than in the open-source communities I've come to know and love.
Failure can be a beautiful thing in open-source.
And that's where our story goes next.
We saw earlier how failing well can lead to happy surprises, things we didn't even know we wanted to try. And at its best, open source development culture hits that mark.
It makes failure okay.
To understand how that willingness to fail gets baked into open source development, I got chatting with Jen Krieger.
She's Red Hat's chief agile architect.
We talked about attitudes toward failure in open source and how those attitudes shape what's possible.
Take a listen.
I want to touch on this mantra, I feel like that's probably a good way to put it: the fail fast and
break things, which, you know, is a big rallying cry, almost, I feel like, for our community.
What are your thoughts on that? I have a lot of thoughts on that.
I thought you might. Fail fast, fail forward, fail quickly, all those things. So to put that
into context, in the early days of my career, I worked in a company where there was no room for
failure. If you did something wrong and you brought down the one application, there was really no way, no room really, for anybody to do anything wrong.
And that just really wraps people around the axle. I think that whole idea of failing fast led us into almost like a cultural movement, if you would, that then spawned into that wonderful word agile,
into the wonderful word DevOps.
When I look at those words, all I'm seeing is that
we're simply asking teams to do a series of very small experiments
that help them course correct.
It's about, oh, you've made a choice and that's actually a
positive thing. You might take a risky decision and then you win because you've made the right
decision. Or the other side, which is you've made the wrong decision and you understand now that
that wasn't the right direction to go in. Yeah, that makes sense. So when you think about fail
fast and break things as being this movement, it feels like there's still some
structure, some best practices in how to fail, how to do that the right way. What are some of the
best practices, some of the principles around failing in a way that is good in the end?
I always like to tell engineers that they need to break the build
as early and as often as possible.
If they're breaking their build and they're aware that they've broken the build,
they have the opportunity in the moment to actually fix it.
And it's all wrapped around that concept of feedback loops
and ensuring that the feedback loops that you're getting
on the work that you're
doing are as small as possible. And so in open source development, I submit a patch and somebody
says, I'm not going to accept your patch for these nine reasons, or I think your patch is great,
move forward. Or you might be submitting a patch and having a bot tell you that it's failed because
it hasn't built properly. There's all sorts of different types of feedback. And then in open source development, you might also have longer feedback
loops where you say, I want to design this new functionality, but I'm not entirely sure what all
the rules should be. Can somebody help me design that? And so you go into this long process where
you're having long and detailed conversations where folks are participating and coming up with
the best idea.
And so there's all sorts of different feedback loops that can help you accomplish that.
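To make the short-feedback-loop idea concrete, here is a minimal, hypothetical pre-merge check in Python, a stand-in for the kind of bot Jen mentions that tells you a patch has failed before a human ever reviews it. The commands and project layout are assumptions, not any particular project's setup.

```python
# Hypothetical pre-merge check: a small, fast feedback loop run on every
# patch so a broken build is caught in minutes, not after the change has
# been merged and shipped.

import subprocess
import sys

# Assumed commands for an imaginary project; substitute your own steps.
CHECKS = [
    ("build", ["make", "build"]),
    ("unit tests", ["make", "test"]),
    ("lint", ["make", "lint"]),
]

def run_checks() -> int:
    for name, cmd in CHECKS:
        print(f"running {name}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Fail fast: report the first broken step and stop, so the
            # author gets feedback while the change is still fresh.
            print(f"FAILED: {name}. Fix it now, while the context is fresh.")
            return result.returncode
    print("all checks passed; ready for human review")
    return 0

if __name__ == "__main__":
    sys.exit(run_checks())
```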
Jen figures those feedback loops can look different for every company.
They're customizable, and people can make them work in a hundred different ways.
But the point is, she's not even calling them failures or mistakes.
She's just calling them feedback loops.
It's an organic system, such a healthy way of thinking about the whole process.
Meanwhile, there's one attitude toward those little glitches that has the exact opposite effect.
There are things that organizations do that are just flat out the wrong thing to do. Having your leadership team, or the organization at a very high level,
shame people for doing something wrong, or instill fear in relation to performance results.
And that looks like: if you don't do a good job, you won't get a bonus,
or if you don't do a good job, I'm going to put you on a performance plan.
Those are the types of things that create hostility.
What she's describing there is a failure fail.
A failure to embrace what failure can be.
And she's echoing Jennifer Petoff's attitude too, right?
That idea about blame-free postmortems we heard about at the top of the episode?
You know? Yeah, that's interesting. It's like if we are a little bit more
strict around how we work together, or maybe just more mindful, more purposeful in how we work
together, we will be almost forced to be better at our own failure. Yes, and there's companies out there that have learned this already.
They've learned it a long time ago.
Toyota is a perfect example of a company that embraces this concept
of continuous learning and improvement in a way that I rarely see at companies.
There's just this idea that anyone at any point
can point out something that isn't working properly.
It doesn't matter who they are, what level of the company they're in.
It's just understood in their culture that that's okay.
And that environment of continuous learning and improvement, I would say, would be one of those leading practices,
the things that I would expect a company to do to be able to accommodate failure and to allow it to occur. If you're asking
questions about why things aren't going well, instead of pointing fingers or trying to hide
things or blaming others for things not going well, it creates an entirely different situation.
It changes the conversation. It sounds like that creates maybe a different way that teams work within a company, within a tech team. Tell me a little bit
more about that. How has it changed the way developers see their roles and how they interact
with other people in the company? My early days of working with engineers
pretty much looked like the engineers all sat in a small area. They all talked to one another. They never really
interacted with any of the business people. They never really understood any of their incoming
requirements. And we spent an awful lot of time really focused on what they needed to be successful
and not necessarily what the business needed to actually get their work done. And so it was much more of a, I am an engineer,
what do I need in order to code this piece of functionality?
What I observe today in pretty much every team that I work with,
the conversation has shifted significantly to not,
what do I need as an engineer to get my job done,
but what does the customer or what does the user need to actually
feel like this piece of functionality that I'm making is going to be successful for them? How
are they using the product? What can I do to make it easier for them? A lot of those conversations
have changed. And I think that's why companies are doing better today on delivering technology that makes sense. I will also say that the faster
we get at releasing, the easier it is for us to know whether or not our assumptions and our
decisions are actually coming true. So if we make an assumption about what a user might want
before, we were having to wait like a year to two years to really find out whether or not that was actually true.
Now, if you look at the model of an Amazon or Netflix, they're releasing their assumptions about what their customers want like hundreds of times a day. And the response they get from folks
using their applications will tell them whether or not they're doing what it is the users need them to be doing.
Yeah, and it sounds like it requires more cooperation because even, you know,
the piece of advice you gave earlier about build, break the build, break it often, you know, that
kind of requires the engineering team or the developers to be more in step with DevOps,
right, in order for them to break it
and to see what that looks like
to do those releases early
and to do them often.
It sounds like it requires more cooperation
between the two.
Yeah, and it's always amusing to somebody
who has that title Agile coach
or in my case, chief Agile architect
because the original intent of the Agile manifesto was to
get folks to think about those things differently. We are uncovering better ways of developing
software by doing it and helping others do it. It is really the core heart and foundation of what
Agile was supposed to do. And so if you fast forward the 10, 15 plus years to the arrival of DevOps and the insistence that we have continuous integration and deployment, we have monitoring, we start thinking differently about throwing code over the wall, all that stuff is really what we were supposed to be thinking back when we originally started talking about Agile.
Absolutely.
So regardless of how people implement this idea of failure,
I think that we can both agree that the acceptance of failure,
the normalizing of failure is just a part of the process,
something that we need to do, something that happens, that we can manage, that we can maybe do the right way, quote-unquote,
is a good thing.
It has done some good for open source.
Tell me about some of the benefits of having this new movement, this new culture of accepting failure as part of the process.
Watching teams go from being really in a situation where they're fearful of what might happen
to a place in which they can try to experiment and try to grow
and try to figure out what might be the right answer is really great to see.
It's like they blossom.
Their morale improves.
They actually realize that they can own what it is that they're doing.
They can make decisions for themselves.
They don't have to wait for somebody to make the decision for them.
Failure as freedom.
I love it.
Jen Krieger is Red Hat's chief agile architect.
Not all open source projects reach the fame and success of big ones like Rails or Django or Kubernetes.
In fact, most don't. Most are smaller projects with just a single contributor, niche projects
that solve little problems that a small group of developers face. Or they've been abandoned
and haven't been touched in ages. But they still have value.
In fact, a lot of those projects are still hugely useful,
getting recycled, upcycled, cannibalized by other projects.
And others simply inspire us,
teach us by their very instructive wrongness.
Because failure, in a healthy, open-source arena,
gives you something better than a win. It gives you insight. And here's something else. Despite all those dead ends, the number of
open source projects is doubling about every year. Despite all the risky attempts and Hail Marys,
our community is thriving. And it turns out, we're not thriving despite our failures,
we're thriving because of them. Next episode, how security changes in a DevOps world.
Constant deployment means security is working its way into every stage of development. And that
is changing the way we work. Meantime, if you want to learn more about open source culture and how we can all change the culture around failing, check out the free resources waiting for you at redhat.com slash command line heroes.
Command Line Heroes is an original podcast from Red Hat.
Listen for free on Apple Podcasts, Google Podcasts, or wherever
you do your thing. I'm Saron Yitbarek. Until next time, keep on coding.
Hi, I'm Mike Ferris, Chief Strategy Officer and longtime Red Hatter. I love thinking about what happens next with generative AI.
But here's the thing.
Foundation models alone don't add up to an AI strategy.
And why is that?
Well, first, models aren't one-size-fits-all.
You have to fine-tune or augment these models with your own data,
and then you have to serve them for your own use case.
Second, one-and-done isn't how AI works.
You've got to make it easier for data
scientists, app developers, and ops teams to iterate together. And third, AI workloads demand
the ability to dynamically scale access to compute resources. You need a consistent platform,
whether you build and serve these models on-premise, or in the cloud, or at the edge.
This is complex stuff, and Red Hat OpenShift AI is here to help. Head to redhat.com to see how.