PurePerformance - 018 DevOps Stories, Practices and Outlooks with Gene Kim: Part 1

Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time of Pure Performance. I'm not sure which episode this is because we had some technical glitches in the past few recordings. So we'll see what number this is when it comes out. Anyhow, regardless of that, we have another awesome, super crazy show today. Andy, how are you doing today? I'm pretty good.

Starting point is 00:00:47 I made it back to Boston. I spent the day in lovely Texas yesterday where it was very warm. Then I came back last night at around midnight. It was like two degrees. I mean Celsius, not Fahrenheit. I don't want to freak you out. But now the sun is out, and I believe it's around 10 degrees Celsius, which is about 50, I would say. So I think fall is definitely here. Winter is coming,

Starting point is 00:01:11 my friends. It's around the corner. Oh, yes. And winter is coming and we have one of the, we'll just get right to it. We have one of the walkers from beyond the wall here today. None other than the one and only Gene Kim. Ladies and gentlemen, Gene Kim, everybody. Hello, Gene. Hello, Brian and Andy. Hey, thanks for having me on.

Starting point is 00:01:35 Hi. Thanks for making it. Brian, this is actually, I mean, I like the reference, obviously, to Game of Thrones. This was not scripted, I believe, right? This just came out. No, nothing on this show is scripted, as listeners can probably tell. Hey, Gene. Gene, thanks for making it.

Starting point is 00:02:00 And I know that I would assume that most people that listen to the show probably have heard your name or at least read some of your books. But maybe for those unfortunate people that didn't yet get the pleasure to meet you, either in person or through your writing and your work, you may want to do a quick introduction on who you are and why people want to talk to you when it is about DevOps. Sure. I've had the privilege of studying high-performing technology organizations since 1999, and so that's the journey that started for me back when I was the CTO and founder of a company called Tripwire. And, you know, one of the biggest surprises was how that journey took me straight into the heart of the DevOps movement, which I think is urgent and important. So in 2013, along with some fellow co-authors, we released a book called The Phoenix

Starting point is 00:02:41 Project, a novel about IT, DevOps, and helping your organization win. And in 2016, just about two and a half weeks ago, we released the DevOps Handbook with Jez Humble, with Patrick Debois, with John Willis. And so I'm so excited that book is finally out after almost five and a half years in the making. So long promised and finally delivered. And I'm just so thrilled with how the book came out. Yeah, that's quite the lead time, huh? Five and a half years. I thought DevOps is all about improving lead time.

Starting point is 00:03:14 How did it take that long? Yeah, in fact, some people may remember that we were actually talking about the DevOps handbook before the Phoenix project came out. So it was our original intent to just write a quick book in terms of how to do DevOps and principles and patterns. And I mean, it just became clear to me back in 2012 and 2013 that I personally didn't know enough to finish the book. And so Phoenix Project actually came out first. And boy, it is impossible for me to overstate just how much I've learned in the last three years, you know, even since the Phoenix Project came out. And so, so much of that learning went into the book. And one of the things I'm particularly pleased about is that there's 48 case studies

Starting point is 00:04:00 in the DevOps Handbook, including, you know, not just the Googles, Amazons, and Facebooks of the world, but also large complex organizations that have been around for decades or even centuries, organizations like Target, Nordstrom, Raytheon, U.S. Citizenship and Immigration Services, Capital One, Nationwide Insurance. So it's exciting to me to be able to capture those stories that have come out of the last couple of years. I think it's been a privilege to help memorialize them. Yeah, definitely. You know, I read a few of the case studies in there and read The Phoenix Project, and the one thing I wanted to say to you before we kind of got into things was, you know, I read a lot of science fiction and horror, and I love science fiction and

Starting point is 00:04:43 horror movies um but between some of those case studies and the Phoenix project, you and your coauthors are the only people who have consistently given me both waking and sleeping nightmares. It's, it's reliving that, you know, those, those lives we've been in, uh, it, it's quite something. I mean, and I, I don't say that as in like, oh, don't read it. It's just amazing how common that experience is and how easy it is to recognize and write about it. It's wonderful. Well, in fact, um, you know, if, uh, you know, you should have read some of the earlier drafts of the Phoenix project.

Starting point is 00:05:18 I had actually sent out an early draft to my friend Charlie Betz, you know, who was part of Best Buy. He was part of Wells Fargo. He was, I think, a head of enterprise architecture for a good chunk of Wells Fargo. And his comment to me was, this book is compelling but not in the least bit enjoyable. I have a day job. Why would I want to read this? And my first reaction was like, oh, my gosh, this is terrible.

Starting point is 00:05:49 No one's going to want to read this book. And then it hit me that, you know, the goal is not to put every moral injustice that we've ever seen into the book. You know, we actually took out half the bad stuff that happened to Bill and his team in The Phoenix Project, and, you know, that's actually when we introduced the laptop for comic relief. You know, to try to modulate things. But, yeah, the earlier drafts were, I think, sort of like the slasher flick of IT operations and IT in general. So, Gene, one thing that's also, I believe, coming up next week, obviously this is going to be in the past when this thing airs, but next week you have

Starting point is 00:06:25 your DevOps Enterprise Summit in San Francisco. I guess there will be a lot of additional great stories from companies that, I believe, you don't call the unicorns. I think you call them the horses, right? Yeah, exactly. I mean, I think the reason for the DevOps Enterprise Summit was that three years ago, I think it was well known that DevOps was going to be the dominant pattern of working inside of the Googles, Amazons, Facebooks, Etsys and so forth, the unicorns. But our area of interest, especially for the DevOps Handbook, was how are other organizations adopting DevOps principles and patterns, and how are they different? And that was really the vision behind the DevOps Enterprise Summit. In fact, the first year was no unicorns allowed, only horses.

Starting point is 00:07:24 And so, in fact, so many of the case studies for the DevOps Handbook actually came out of the DevOps Enterprise Summit, just describing how organizations like Disney and Target are doing radically different things, often in the face of an extremely powerful orthodoxy of how IT is supposed to be managed. In fact, many of these stories actually originated when IT was almost entirely outsourced. And so the story of how they created world-class engineering cultures, often bringing engineering talent back into the organization, I think it's just such a heroic journey. And it's been an honor to help chronicle their journeys, and so many of those made it into the DevOps Handbook. Now, if you look at all these stories from these horses, as you call them, and also, like, last week or two weeks ago, we two were on a webinar, a Dynatrace webinar. We talked about some of your stories. Then I also brought the Dynatrace story. And the Dynatrace transformation story was pushed from the top down because our CTO and founder, he said, we need to change things.

Starting point is 00:08:28 We need to build a new generation APM product. We need to do it in a different way. So we actually set aside an engineering team that basically could figure out how to develop the next generation product in a new way. But this is – I'm not sure if this happens in all organizations. So the question that I always get asked, and I want to pass this on to you, what's the most successful path of a company transforming? Is it a grassroots initiative of engineering teams that just say, we need to change something because we're so frustrated the way it works right now and then with these successes bubble it up or is it only successful when it comes right away

Starting point is 00:09:10 from the top down? What do you see out there from these stories that you heard? Yeah, in fact, those are exactly the questions that we set out to answer when we did the DevOps Enterprise Summit, because essentially, you know, we were writing a prescriptive book that explains how one implements DevOps in large, complex organizations where that is not the prevailing way of working. And so I think there's a couple elements to the answer. One is where does DevOps get initiated from? And that's one of the questions that we really wanted a better answer to. One of the big surprises was that the top title among the speakers was director of operations, followed by chief architect, followed by director of development. And I thought that was really interesting because I think the common narrative is that DevOps is often being driven by frustrated development groups who just want an excuse to go straight to the cloud, right, because they're so frustrated with operations. But the prevailing narrative is that it's actually coming from operations,

Starting point is 00:10:09 saying that we need a better way to serve our development stakeholders, and so we have to create next-generation platforms and shared services and so forth to maximize their productivity. So I thought that was super interesting. The second observation is about the titles of the presenters. I had mentioned director of operations and director of development. It's usually a second or third line manager who's driving this transformation. I think the reason seems to be that usually it's someone at about that level who sees the business problem that needs to be solved, but is also close enough to the work that they can see where there are better ways of achieving the goal. So, you know, senior leadership is necessary, but often they're not the people who

Starting point is 00:11:04 are going to the Velocity conferences or, you know, seeing what Google and Facebook are doing and then bringing that back. So, you know, I think that kind of phase one is you have successes being shown by, you know, the second, third line managers. And then the second part is that they've created such amazing wins that they get senior leadership on board. And then often they're getting promoted. And often they're being put into a role where they're being asked to help elevate the state of the practice for the entire organization. That involves often hundreds, thousands, maybe even over 10,000 engineers across the organization. And do you think – I mean it makes a lot of sense that operations is feeling the pain

Starting point is 00:11:47 most because basically these are the ones that keep the systems alive. But now the pressure from business comes in to move faster. And I think the number one metric that people still try to improve, obviously, is lead time and getting value, real value to the user out faster. And obviously operations is then the one that feels the pain the most because if we are cranking up the speed of stuff that comes from development but that doesn't have either the right quality, they use different tools and there's the parties involved,

Starting point is 00:12:16 then operations is the one that in the end has the pager and needs to fix the problems and do the firefighting, right? I mean, is that where the main problem really comes from then, like kind of the thinking behind it? And it's so funny because I think that is really the – I think the perspective I brought along, and I think that was really kind of the intent behind the reason why the protagonist in the Phoenix Project was the head of operations. I think that feeling that the ops is viewed as the second-class citizen compared to development. They're given – development is given more budget, given more positional power and so forth. Development is viewed as strategic. Operations is tactical. But I think one of my learnings as we did the DevOps Handbook, and this especially came from Jez Humble, is I think we can paint all the reasons why ops suffers when DevOps principles and patterns aren't in place.

Starting point is 00:13:15 But I think one of the things I learned and now feel very acutely is life is equally sucky for development as well. I mean, it's a terrible situation when you have these urgent, date-driven projects and almost all our time is spent waiting, right? And we're writing code, but we actually never get to see how it actually runs in a production environment. We never get to see if customers actually get value because the project team is dismantled and then shuffled around, you know, for the next project. And so as developers, we never actually get to see, you know, the downstream effects of our work. And we never actually get to see the reward of all the hard work that we've put in. So, you know, with that perspective, I think it's almost like the bad marriage where there's this incredible sense of symmetry that both sides are suffering equally.

Starting point is 00:14:08 And the business constituencies are equally – life is not so good for them either. I think there's just a marvelous – isn't that interesting that DevOps helps elevate the outcomes for dev, test, operations, information security, almost all of them equally in a very wonderfully symmetric fashion? I think one thing because you just told the story about the developer and it's not that easy for them either. I remember at least two stories from your book now. I think it was the Google story and then another one later in the book where you actually said developers were frightened to actually go into what's the ops area. They were frightened to actually deploy something in production because they were fearing that something is breaking. And I know this is a little stretch now, but I think this is also something that obviously DevOps tries to solve is showing both sides how tough the other side's job is.

Starting point is 00:15:12 And then in a collaborative way, figure out how as a team they can more safely deploy better-quality software faster. I mean that's the kind of thing. But I just remember – I'm not sure what the other story was, but you had it in there and it's like, you know, developers were afraid of pushing the deploy button because they knew something was going to break. And then this realization actually then brought them to say, hey, if we would have tests that actually give me the safety net,

Starting point is 00:15:42 if we would have an automated pipeline that automatically executes these tests for me, if we have telemetry or measures and monitoring in place so that I can immediately see my impact, then this will take away the pain and then allow everybody to be more confident in the stuff that we do, right? Oh, absolutely. In fact, gosh, one of the things I love about the DevOps handbook is the index. I use it all the time. One of the neatest case studies I just really love in the DevOps Handbook is the birth of the automated testing culture at Google.

Starting point is 00:16:14 And so what was interesting to me was that I asked so many people and friends at Google, where did this amazing testing culture come from? And everyone said, I don't know. It was here, you know, when I started. And it was actually Jez Humble who said, I know who you need to talk to. It's Mike Bland. And so Mike Bland, who's now part of the 18F organization, you know, he told the story of the Google web server team at Google back in 2005, when deploying to Google.com was a very dangerous, dangerous thing. In fact, it was viewed as the dumping ground of every team's poorly tested code. And because it was never tested with the other services, often it would result in erroneous searches, slow searches,

Starting point is 00:17:01 sometimes outages. But there was this beautiful quote that Mike Bland said. He said, fear became the mind killer. Fear stopped new team members from changing things because they didn't understand the system. But fear also stopped experienced people from changing things because they understood it all too well, which I just love. And I also love the Dune quote, right, Frank Herbert? Yeah, fear is the mind-killer.

Starting point is 00:17:26 So, you know, thus began this kind of incredible transformation where no code was accepted unless it also had automated testing. And it rapidly became the exemplar within all of Google in terms of its ability to have short lead times and high success rates. And thus began a three-year journey to bring that state of the practice to the entire Google organization to the point where they're now doing 5,500 code commits per day. Almost every major property is deploying at least once per day. Google searches are deploying three times a week. But just to see how these things happen. The other story that I love from the book is from Facebook. In 2009, in the ops organization, they had this rule where no one can have a laptop open unless there is a live site issue.

Starting point is 00:18:19 And they had this one meeting where no one could actually close their laptops. So they then started this practice of having developers being put on pager rotation: developers, dev managers, architects. And for the first time, developers could actually see the downstream effects of the changes they were making, of decisions they were making. And that was a key part of the transformation. Yeah. You know, I wanted to go back to, you mentioned all the Google releases. And when you look at, I think it's Amazon, if you calculate it, they're doing about like 1.5 releases per second, which is just, you know, kind of astronomical. And, you know, we talked about unicorns and horses. And in some ways, those still sound a little bit like unicorns. How much do you think that super hyper release cycle is just to get bragging rights versus, you know, what would a realistic company or organization, what should they be

Starting point is 00:19:19 focusing on as their end goal? Yeah, that's a great question. I've never heard it worded that way, but I think you really get to the crux of what's important. So one of my big learnings is, you know, typically we think about DevOps as these amazing outcomes of, you know, fast flow, you know, fast lead times, as Andy mentioned, from, you know, development to production. So customers are getting value while preserving world-class reliability, security, and stability. But I think there's another dimension that's important, which is the ability for small teams to independently develop, test, and deploy value to customers. And so I think what that deploys per day metric is hiding is an even more important

Starting point is 00:20:02 metric of how many deploys per day per developer. And in the ideal, right, we want developers to be able to safely test and deploy and, you know, improve lives of customers to be able to do that independently as opposed to having to coordinate and synchronize with hundreds or even thousands of other developers. And when you have to do that, you know, that means everyone's shackled together to one big release. And so I think the ideal is that every developer should be able to independently develop, test, and deploy. So that could mean multiple deployments per developer per day. So, you know, and it means that,

Starting point is 00:20:47 it doesn't mean that they're releasing features multiple times a day. What it means is they can safely promote their code in a production environment without causing chaos and disruption, right? Like is too much of our common experience. Right, but this is me looking at it with my cynical hat on, saying,

Starting point is 00:21:13 do you ever think there's a case, because everyone wants bragging rights in a way and it seems to be, you know, the thing to do is have as many releases as possible. I mean, do you think that there are cases at all where people might just be saying, oh, we're going to add an extra space in the config file just so we can claim another release, or am I just being too cynical about it? You know, I think that the valid business need is that we should be able to make changes when our organization and when our customers need it the most. And so if it means that you do a white space deploy and you can do it safely, okay, great. But, you know, if we have something like Heartbleed and we need to roll out, you know, changes en masse, you know, we should be able to do that as quickly as possible. And I think that's a valid reason to make changes. And so that necessitates architecture changes and requires certain technical practices. And I think the other thing that brings out is that we do want developers to be merging

Starting point is 00:22:10 and deploying as part of their daily work, right? Because the opposite of that is that we only do it at the very end of the project. And that's what ends up with these catastrophic problems of performance problems found only at the end and information security problems found only at the end when there's almost no ability to actually change our outcomes at that point. So I think it's a critical capability and it could be abused, but, you know, I can say that the lack of that capability, you know, is awful and almost preordains those bad outcomes that we see in The Phoenix Project.

Starting point is 00:22:46 And I think just one positive note to that as well: even though they may just change a white space, the fact that they're then going through the exercise of the deployment keeps the deployment methodology and the process fresh in their minds, which I think is also one of the things that you mentioned in your book. The more often we do it, the more natural it becomes. And then a deployment basically becomes a non-event. It's just like, hey, it's another deployment. Maybe sometimes you just use it for practicing. Yeah, I love that. In fact, one of the patterns in the book that we talk

Starting point is 00:23:18 about is, you know, the famous Netflix Chaos Monkey, right, where they routinely inject failures into the production environment. But there's a more pedestrian version that I think is just as dazzling, and that's the notion of a scheduled reboot, right? There was a story that was told of someone saying, hey, it's not acceptable that we're afraid to patch our systems, to apply security patches, because we're afraid of whether the systems will come back up. So the countermeasure was we're going to schedule a reboot at a random time sometime during this week. And thus the systems teams then had some time to prepare and make sure their systems were at least resilient enough to be rebooted on a semi-scheduled basis.

Starting point is 00:24:10 And what was interesting is the mean time to have the service restored went down by 75 percent just by practicing rebooting. So I just – I love that because I think, you know, coming from the server administration space, I think a lot of us are proud of, you know, systems that haven't been rebooted for years, but often that creates a situation where we are just afraid to change our systems because we don't have any confidence that they'll actually come back up again. Yeah. I've heard that every time I come across someone with a mainframe, they're like, well, we're not going to touch that. That just runs and we don't even breathe on it. rebooted in a decade, but they're not even sure they have confidence they can actually build the code again, because the compilers and the libraries and the versions are so, you know,

Starting point is 00:25:10 they're 10, 15 years out of date. Hey, talking about the mainframe, it's an interesting segue, because we talk to a lot of our customers that obviously, you know, still have mainframe components running in finance and healthcare, and we provide them with the monitoring of the mainframe components. And do you see – I sometimes get the pushback and they say, well, but DevOps doesn't work for mainframe. And I said, well, why not? It doesn't mean that you need to deploy changes into the mainframe all the time, but I think it's then more about building new use cases on top of the mainframe using some of the new technologies that we have out there, like end-user facing apps that eventually obviously touch the mainframe in the back. But I see sometimes a lot of resistance. They're saying, oh, no, we're not going to touch anything and we want to be careful. So the feedback then sometimes is, well, we are not a company that is fit for DevOps.

Starting point is 00:26:12 Is this something you see too and what would be your response to that? Oh, yeah. That definitely comes up. A highlight for a lot of people last year was a presentation from Rosalind Radcliffe from IBM talking about how her mission in life is to elevate the state of the practice for the mainframe development community to use all of the technical practices that we associate with DevOps. So it's proactive production monitoring, automated unit testing. They talked about how you can use mainframe emulators on commodity servers so that you don't have to pay for test cycles. It's just a big eye-opener. In fact, one of the presentations I'm really looking forward to next week at DevOps Enterprise is by Richard Jackson from Walmart and Rosalind Radcliffe from IBM talking about how the mainframe team saved the day

Starting point is 00:27:13 by essentially providing a caching service for their Retail Link inventory management system. So in their rollout, they ended up with horrendous performance problems. And essentially, they built something like Memcached or Redis from scratch on the mainframe. And quite literally in a month, saved the project. And so it was just this beautiful story about how – oh, and I guess what was really great about the story was that the lead developers for the application didn't want to use it because they didn't want to use anything on the mainframe. They were actually forced to do it. They threw everything they had at it to see if they could break it.

Starting point is 00:27:56 And three years later, that caching service has had like 21 billion transactions. They have 30 customers within the Walmart enterprise. It's just this wonderful story. And I guess one of the morals is, as much as we talk about inclusiveness in the DevOps community, we can still be real jerks to mainframe people. Well, I mean, it's a great story, and you also brought up some technical resolutions, right? You said, well, if you don't have a mainframe in your CI/CD environment, you can still use emulators, some type of virtualization, where you can then still, on a build-to-build basis, test your code and then eventually at least optimize and build a nice automated pipeline until the very end, and then you

Starting point is 00:28:42 always need to figure out how often you really want to update everything into the mainframe system itself in production. But everything until that point can be fully automated, can be very fast, can be, you know, fast lead times until the point when you then really deploy it into production. And I think that's a good hint. I think a lot of people that I talk to, or some people, they are still – they kind of are aware of virtualization, whether it's service virtualization, whether it is just mocking away things, mocking away interfaces to basically allow other teams to just move faster, basically removing you and your component as the bottleneck. That's kind of the way I see it. Yeah. In fact, there's a wonderful quote from Scott Prugh from CSG, the largest bill printing company in the U.S.

Starting point is 00:29:33 And he said, we've adopted a philosophy that rejects bimodal IT because every one of our customers deserves speed and quality. And this means that we need technical excellence, whether it's a team supporting a 30-year-old mainframe application, a Java application, or a mobile app, right? Because I think what he's alluding to is that, you know, as we all need to go faster to achieve business goals these days, you know, essentially that affects the mainframe application, right? At some point you hit those systems of record as well. And so it means that no one's exempt from this, not even the mainframe teams. Yeah. And maybe the last story to that,

Starting point is 00:30:11 I know, Brian, I should let you ask another question, but one last story. I was fortunate enough to travel to Latin America a couple of weeks ago, and I met several banks. And also, I mean, these obviously all are using the mainframe in the back, but now they have these digital transformation teams,

Starting point is 00:30:24 these innovation teams, and they're basically building new stack apps using Angular, React, running on Node.js, and then in the cloud. The use cases that they built on new technology in the cloud allow them to actually fulfill the business needs of developing new features faster, but also without really negatively impacting the back end, because the back-end APIs that they call, they're the same. And they're just getting the records out, updating the transactions. One thing that they ran into, though, and this is where we help them, is when you have a front-end developer that just calls an API to the back end and he doesn't really care, or have to care, about whether this is a Java API, a Node API, or a mainframe API. They just call the API. So what some of these customers ran into was bad development practice in the front end.

Starting point is 00:31:25 They were pounding the backend mainframe. And so the cost of the mainframe basically skyrocketed. And like what you say, I think the Walmart example might be similar, with a caching layer. They were just pulling the same data multiple times per transaction instead of caching it. And then one of the resolutions was to actually put a caching layer in the middle in order to not pound this very, very scarce resource. And with our tracing technology that goes from mobile into mainframe, we could make them aware of it, because these architects were not aware of it. And that was the interesting thing.

Starting point is 00:31:59 I love it. And just one kind of color commentary to that. What I love about the mainframe story from Walmart is that they used the mainframe as the caching service to protect the back-end systems that couldn't keep up. It's a very ironic, unexpected use of the mainframe. Well, I just wanted to say, we're at about a half hour, so let's wrap up this episode, unless there was any other idea somebody wanted to put in before. But if not, we can... Yeah, I want to just, you know, maybe to wrap this up: thanks, Gene, for the initial comments on how you see companies going about starting DevOps, like where it starts, where it originates. I like the example where you said

Starting point is 00:32:49 it typically comes from the operations side where they feel most of the pressure. It's not the top, top management, nor is it, let's say, the people that are really down in the weeds every day. It's kind of in the middle management somewhere that feel the pain, know what's going on upstairs, but also know what is going on downstairs

Starting point is 00:33:06 and how this all impacts them. And then they become the champions and then promote it up. So I think that was a nice way. And I can just encourage everyone to read your books if they haven't done so yet. Thank you so much.

Starting point is 00:33:23 And we're giving away a 140-page excerpt of the DevOps Handbook, so we'll make those instructions available to everybody, anyone who wants one. Yeah, we'll put that up on the podcast page. And if anybody is not following Gene on Twitter, he has one of the "real" Twitter handles, so he's @RealGeneKim. So congratulations for becoming that official. Someday maybe Andy and I will have a "Real" in front of our names. Thank you so much, Andy.

Starting point is 00:33:56 And thank you so much, Brian. Yes, and we'll be right back with another episode. And we'll talk some more with Gene Kim. And stay tuned, everyone. Thank you.
