PurePerformance - SRE for the non-unicorns (aka Enterprises) with James Brookbank

Episode Date: December 5, 2022

You have a CISO (Chief Information Security Officer) but no CRO (Chief Reliability Officer)? You blame people if systems crash? You scale your people at the rate you scale your infrastructure? If you answer any of those questions with YES, then you should tune into this podcast, as you probably struggle with adopting Site Reliability Engineering (SRE) in your organization.

James Brookbank, Cloud Solutions Architect, dealt with resiliency topics in a large enterprise prior to joining Google. In our conversation he shares the advice he gives enterprises to convert the excitement about SRE into actual implementation. James gives some good guidance on which projects are good, and which are not so good, to start with. He gives practical examples of what it means to change your company culture and why there doesn't have to be an SRE for every service.

In our call we discussed the SRE in Enterprise talk at DevOpsDays Boston and SREcon EMEA as well as their recent book. Here are all the relevant links:
James Brookbank on LinkedIn: https://www.linkedin.com/in/jamesbrookbank/
SREcon EMEA Slides: https://www.usenix.org/system/files/srecon22_slides_mcghee.pdf
DevOpsDays Boston 2022 Session Recording: https://www.youtube.com/watch?v=__e7b25QOHc
Enterprise Roadmap to SRE Book: https://sre.google/resources/practices-and-processes/enterprise-roadmap-to-sre/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody, and welcome to another episode of Pure Performance. Do you know why? I have no clue. Every day, every podcast is special for me. Well, it's the one and only podcast that we're recording today, so that makes it special. Yeah,
Starting point is 00:00:51 I got nothing else. I literally thought about that for about five minutes before joining the call, and that's the best I could come up with today.
Starting point is 00:00:57 Was that the only thing on your calendar today? Is that what you're saying? It's the only podcast on the calendar today.
Starting point is 00:01:02 Okay. How many podcasts... what else do you do besides Pure Performance? Pick my nose. Oh, yeah, pick my nose. Hey, it's the election day special, too.
Starting point is 00:01:14 Over here in the States, it's election day and everywhere you look, it's just going to be people talking and speculating and speculating. Oh, maybe he's up
Starting point is 00:01:20 two points, maybe down two. Oh, frustrating today to get through because it's just all speculation. Do you think it's a lot about metrics and who in the end
Starting point is 00:01:30 gets closest to the goal? You know what? Yes, but I will say, unlike performance engineers and people in SRE and observability, it's about metrics, but it's about trying to fill the time, making up stories to fit them for
Starting point is 00:01:45 entertainment purposes, where I think people in our field are really doing hardcore work, trying to change and make the world better. And there's, you know, a special group of people who've come up recently in that realm, building out over the last 10 years, who've really been making an impact and a difference for everybody's lives. Even if you don't know it, you're on the receiving end of great performance because of the hard work and dedication of these people. Who might they be, Andy?
Starting point is 00:02:11 Who might they be? Maybe we have somebody on the line today that could fill us in. Okay, let's stop with this. It was a good one, but yeah. Welcome to the show, James. Thank you so much for being here. I'm sorry that you had to endure
Starting point is 00:02:26 the last three, four minutes where we tried to be funny, but we always tried. And eventually, in a couple of years from now, we will be funny, hopefully. Yeah, I just saw a scowl on his face the whole time. It's just good learning. We're not failing.
Starting point is 00:02:39 We're just learning fast. Yeah, exactly. Hey, James, thanks for being here. The two of us, we met at DevOpsDays Boston, where I saw you co-present with Steve McGee, who has been on our show in the past. And some may know you, some may not know you. So for those that don't know you yet,
Starting point is 00:02:58 could you quickly give an introduction, who you are, what you do? For sure. So I'm James Brookbank. I look after a team of cloud solution architects at Google. I'm not talking on behalf of Google today, though. Just don't tweet my boss and say, like, hey, I said something. It's not that kind of talk.
Starting point is 00:03:16 I'm the other half of the Steve McGee partnership in that Steve has spent 15 plus years as an SRE at Google and knows how all the unicorns are made. I've spent 20 odd years in enterprises like banks, large companies with sort of difficult scale problems, but really not tech startups, basically. And so one of the things I try to do working at Google now is bring that expertise and talk about things like reliability. We talk a lot about sort of SRE, but really it's the wider reliability concerns with cloud customers and, you know, try and move the state of the art forward in this area, not just for the tech companies, right? For everyone. Yeah. And I think that's also what I enjoyed so much about when you were on stage in Boston, you actually said, right, not everybody's looking up to Google,
Starting point is 00:04:11 but as you said, they are one of the unicorns, but not everybody is like Google. And that could also explain a little bit why people get really excited about what they hear, but then they're falling short on actually delivering, right? How can we apply what we learn from the Googles and the Facebooks and the Amazons of the world? What can we learn? But then people are excited and then the excitement stops somewhere and nothing materializes. And you have your presentation, SRE in the
Starting point is 00:04:38 Enterprise. There's also a great book. I think you actually have it available for free, if I'm not mistaken. You can download it from the SRE Google website and that's the best bit about it. We made it very short. It's 50 pages. It's designed to be executive friendly, so you can read it on
Starting point is 00:04:57 the plane, you can give it to your boss and be like, hey, you don't have to read the full SRE book. You can later, but maybe start with like the cliff notes and just sort of try some of these things first, basically. So we deliberately try to make that accessible. It's not like all of the books we publish. It's not a hard and fast set of rules.
Starting point is 00:05:18 It's just lessons that perhaps we've learned and customers have learned in these spaces, and we're just trying to share those things. Now, we will add the links to the podcast description. So folks, if you're listening to this, the link is going to be there, and then enjoy the 50-page read. I've enjoyed it twice already on two plane rides, which was nice. I have a question,
Starting point is 00:05:47 a couple of questions actually for you because we've been... Please. Yeah, thank you. We've been talking about DevOps and SRE and SLOs for several years now, right? We had people like Gene Kim on a couple of years ago
Starting point is 00:06:04 when kind of DevOps became really popular. And then we had on the SRE side where we must have had at least 10, 15 different people that are SREs. Just the last episode was with Diana. She is an SRE lead at a large bank in Canada
Starting point is 00:06:20 and kind of her journey into SREs. What I would like to know from you is what advice do you give when you talk with these enterprises, when they come to you, what do you tell them to actually really apply SREs and what are the things they should apply? What maybe they should not apply? What are some of the best practices?
Starting point is 00:06:43 Yeah. And there's obviously a lot to this. So, I think the first step is often people want a single lever or a silver bullet that they can do. They're like, you know what, I really need to do this. It's very important for my business. Tell me the one thing that I can do to make this happen. And I think what we found is that's not going to happen. It's the same with a lot of these areas, the same with DevOps, Agile. You can't really do just one thing and expect quite serious changes. And we wouldn't do that in any other part of our lives. We wouldn't expect that to just be like, well, I just changed this one thing, but now I've got outsized performance. I think, though, the other side is we
Starting point is 00:07:25 do, I think, know a relatively small number of things that make high-impact changes in this space. So, you know, one of the things that we talk about quite heavily in the book is we perhaps overestimated how much of SRE and reliability concerns were technical and how much were not, how much were part of the social and team structures of organizations, of the culture of organizations. And I think we found a lot of overlap between the DevOps initiatives and SRE initiatives where we were saying,
Starting point is 00:07:57 if you do have a generative culture, if you do have an ability to raise incidents safely, you have psychological safety within your company, you're highly likely to form learning loops. And when you have incidents, they're going to be learning opportunities, you'll make things better,
Starting point is 00:08:12 and you'll have this sort of positive cycle of reinforcement for your operations teams and generally for your company. If you're not doing that, like buying tools or implementing practices on their own, it's not that they're not effective, but they're not providing that sort of 10x that people are looking for. They're really fighting against the grain of a lot of these things. Just to talk about the culture piece as well, I think one of the things that people often get frustrated about is, could you tell us the culture? Is it ping pong tables or foosball?
Starting point is 00:08:50 Maybe we put the free food in. That's the culture you're talking about. And it's not. And we do go into some detail on this in the book and some of the talks we've had where Google has published guidance, but you can look at, say, the Westrum model of organizational culture and say,
Starting point is 00:09:06 actually, culture means things like psychological safety, like dependability. If you do those things, you have those capabilities, you're much more likely to get better SRE outcomes and DevOps outcomes. And you'll probably get, like, the free food as a result of that. Like, it will be emergent from that culture.
Starting point is 00:09:26 All these things will come from doing those things correctly as opposed to the other way around, where you're like, if I get enough foosball tables in, suddenly the developers will be doing this. Flip that the other way around. And also don't overestimate how simple some of these cultural changes are in terms of just paying attention to incidents
Starting point is 00:09:46 and using retrospectives as learning opportunities instead of saying, well, you know, we've got so many incidents and I guess there's nothing we can do about it. Like that's really the most important place to start, like build those learning loops and start small. So there's more, but I guess that's the key one that seems to drive like the best behaviors. I'm imagining
Starting point is 00:10:05 an office full of desks that are actually foosball tables. So you work on every single desk being a foosball table. That'd be, yeah, come work here. It's great. If we could do that, if that was the thing, then we would have done it. If we knew that one thing,
Starting point is 00:10:21 one lever we could pull that would make it happen. It's not as simple as that, but it's easier than it sounds. The culture changes aren't expensive, but they are challenging. I'm just looking at your slides, I think from SREcon EMEA, that you've just given as well. A couple of weeks ago, yeah, we talked about some of these items. Yeah. And what I find fascinating, when I read the book, in the book, you make a comment and kind of say, you know, the history of kind of software engineering a little bit where
Starting point is 00:10:57 now in the world that we live in, reliability also becomes a differentiator, obviously, right? I think you cannot always compete maybe with the best user experience, but you can compete with having the best reliability. You're always on. Like Google. Google is known for the Google search is always there.
Starting point is 00:11:20 And that's a differentiator. Now, on the flip side of this, looking at slide number five or six, it's a great point. It says reliability is not always the most important thing. I think you also need to be then very careful on where
Starting point is 00:11:35 I guess it makes sense to invest in these concepts and where it does not yet maybe make sense. Is there maybe some points that you can give on when is it a good time to really focus on better reliability and maybe at some
Starting point is 00:11:52 point it doesn't make sense right now? I think it's a crucial piece because one of the things that I think Google has made fairly public, but perhaps not everyone has seen, is that we do not have SRE for every service. Like that's a deliberate decision inside Google that
Starting point is 00:12:14 many smaller services that are non-critical do not require SREs. They benefit from the SRE ecosystem. They benefit from our sort of DiRT testing, like disaster recovery testing, that kind of stuff. But we don't have it for every service. And we deliberately try and focus our SRE efforts on the very high reliability services. So, you know, sometimes when, you know, I have this conversation with customers or other third parties and say, like,
Starting point is 00:12:41 what are you trying to do with SRE? And they're like, oh, every service, we've assigned some SREs. I'm like, you can do that, but that's not really how it's done, say, at Google or other companies who are following this model. And do we have, like, a nice easy decision matrix of which services should be SRE supported? It's very fluid. Like it's not a well-described process, but it kind of falls into like three main categories. And we did talk about this a little bit at SREcon EMEA. So ask yourself if it is a product differentiator for your service.
Starting point is 00:13:21 Like if you're not sure, it probably isn't. Like if your service can be down for a few hours and, you know, it's not really expensive to do so, have a think about this. Like sometimes we see systems like leave booking, you know, where people are like, hey, our leave booking system's down. That's probably okay for a few hours. Like people can book holiday, like, or just let their boss know. Like you don't need to have this critical five-nines service for it. Pick your battles in that. Is there an existential risk for your service? Sometimes we speak to banks, and I've had this experience with banks I've worked at, where they're like, this service absolutely must operate. It has to complete something by the end of the day, or we'll be speaking to a regulator. If you have a critical
Starting point is 00:14:00 service like that, you'll immediately have access to funding and SREs, right? Like people will understand the existential nature of those services. So this helps you like identify those critical things. And I think the final one is scale. And one of the key reasons that Google did SRE and where it becomes, I think, very important is when you're trying to scale,
Starting point is 00:14:22 if humans can't manage the infrastructure involved, you'll immediately start adopting SRE practices. You'll kind of need that capability in these spaces. And even if reliability isn't the primary concern, just the cost of it will become like a, things like capacity planning will start becoming a software problem and not just a
Starting point is 00:14:45 sort of traditional operations problem. So if you had to pick those, it would be those three things that probably are the main indicators. Your mileage may vary, I think, is the final bit to that, though. So that all makes sense? Yeah, it does for me, and especially maybe on the last one. It's an interesting point because, just to reiterate, if I understand this correctly, if you say you have a system that needs to scale because let's say you're building this cool next generation Facebook
Starting point is 00:15:12 or whatever it is, and it all of a sudden starts to get very popular. And in order to scale, you have two options. You either hire as many new people as you need to scale up your hardware or you actually start figuring out
Starting point is 00:15:25 how you can automate everything around the operational aspect and the resiliency aspect, because you cannot just, you know, scale with manpower as you need to scale your software. Did I get this right? This is it. And again, the concept we talk about is the pyramids of reliability, which again is in the book. So you can look at that in more detail. But I guess the concept is we've seen reliability when I was younger. I'm a little bit older now. And certainly when I first started working in IT, I was looking after old school Unix machines, like Sun Microsystems.
Starting point is 00:16:05 Some people will remember those days. And you could just get bigger ones. It wasn't really, I mean, it was a money problem, but you could just go buy an E25 and you would just get 100 boards and all the RAM you wanted. It just cost you a lot of money, but you could buy a larger server. So we did that as an industry for a while. And then eventually we ran out of vertical scaling.
Starting point is 00:16:28 And so this model, while still effective in some scenarios, creates problems for internet-facing services where there are millions or billions of users. And immediately you start thinking about how we scale out. And that scale-out mode of operation is where SRE kind of shines and where reliability concerns become software predominantly as opposed to operations and sort of stacking reliability on top of stronger hardware. So I think sometimes people say, well, which one do I need to choose? And it's really finding the right way for you and your company.
Starting point is 00:17:07 But if you start mixing and matching those models, if you cross the streams of the reliability pyramids and you start doing scale-out on expensive hardware or scale-up on unreliable hardware, that's when it gets very, very difficult to manage your reliability concerns. And you will then start thinking about some of these things, right? Like you will start looking at these concerns of your scale-out architectures with some of the techniques that we've seen work in this space.
Starting point is 00:17:40 Does that make sense? It does. And I remember in Boston, and I'm sure you had the same question also in, by the way, where was the SRECon in EMEA? It was in Amsterdam. In Amsterdam, yeah. It's been a long time since I went. The last time I was in Amsterdam was 2011.
Starting point is 00:18:00 I did the Amsterdam Marathon. So that feels like a long time ago now. You did the marathon? It's still a very strong place. I did 26 miles, which was a very long way. What was the, can I ask what time did you have? Five and a half hours. Five and a half hours, yeah.
Starting point is 00:18:15 But in making it, I have never done a marathon. Which is the most important part. Andy, well, I'm curious, if he said 12 hours, what would you have said? I would have said, I have never finished a marathon, so he achieved something I've never achieved. And I think just getting through it is amazing. Nice answer. This is part of that culture.
Starting point is 00:18:34 This is that part of that culture you're talking about, right? Yeah. Is being accepting and safe. Yeah, exactly. Great example, Andy. I tried to put you on the spot and I failed as a wiseass. What am I going to do? I'm the problem here.
Starting point is 00:18:48 So Amsterdam and Boston, I remember in Boston, because you talked about the pyramid, I think in Boston you actually asked the question, I think it was something like, can you build a 99.99% service on a 99.9% infrastructure. Did you ask the same question in Amsterdam as well?
Starting point is 00:19:09 We did. And again, I think the complexity of this is that sometimes what's intuitive for people is not perhaps realistic. So sometimes people are like, well, how on earth can you build something more reliable on something less reliable? And we're like, well, the classic is RAID arrays.
Starting point is 00:19:30 If you haven't built a RAID array, you're using one. You may not know it. But a RAID array takes your disks, which are very unreliable. I used to run a storage team. We lost a lot of disks. There's two types of hard disks, dead or dying. And so you don't have a choice there. You have to make a more reliable storage system on top of that. And that's what we mean when we say that you're gluing together
Starting point is 00:19:58 unreliable infrastructure components to make a reliable service. That's exactly what RAID does. Now, does that mean that every service needs to look like that or every part of infrastructure needs to look like that? That's not really what we're saying, that you can always do this or you should always do this. We're saying it's possible. And I think sometimes people don't want to look through that from a reliability perspective
Starting point is 00:20:23 because it creates software and people complexity. But it is certainly possible. So I can encourage everyone to check out. I mean, again, we will also, if it's okay for you, to link to the SREcon material, the slides. Absolutely. Everything from the SREcon stuff is open access. So that talk will be published, I think, in a few weeks. But again, the slides are up and
Starting point is 00:20:54 these aren't designed to sort of, I think, trick people in terms of saying like, oh, well, you said I can't build this service in this way. I think what we're just trying to explore is the choices you make in your architecture often have trade-offs like benefits and concerns, but you can make active choices around reliability with all of the other good things, with cost and scale and people. All those things can factor in. You've just got a few choices to make and these have impacts in that.
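As a rough illustration of the "99.99% service on 99.9% infrastructure" question and the RAID analogy discussed above, here is a minimal Python sketch. It is not taken from the talk or the book. It assumes replicas fail independently and that failover between them is perfect, which real systems rarely achieve, and the function names are hypothetical.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that is up as long as at least one replica is up.

    Assumption: replicas fail independently and failover is instant and flawless.
    """
    if not 0.0 < component_availability <= 1.0:
        raise ValueError("component_availability must be in (0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


def replicas_needed(component_availability: float, target: float) -> int:
    """Smallest number of independent replicas whose combination meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    count = 1
    while combined_availability(component_availability, count) < target:
        count += 1
    return count


if __name__ == "__main__":
    # Two 99.9% components in parallel already exceed a 99.99% target
    # under the independence assumption stated above.
    print(replicas_needed(0.999, 0.9999))
    print(combined_availability(0.999, 2))
```

Under these idealized assumptions, two 99.9% components are enough to clear 99.99%. In practice, correlated failures and the software that does the "gluing" eat into that margin, which is the software and people complexity mentioned in the conversation.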
Starting point is 00:21:31 And for a lot of teams, especially in the operations space, this seems like old news. This is something you're like, well, everyone knows this. Of course, this is obvious how you build these services. For a lot of developers, this isn't an interesting area. Why should it be? So they often don't know this until it bites
Starting point is 00:21:52 them. And I'm not here to shame them into that. I'm here to explain it. Here's how it works. Here's why it works. Here's how we do it like this. So I think that's the important bit. It's just understanding some of these models and making them as simple to understand as possible. And I think there's nothing wrong about repeating things that we think should be
Starting point is 00:22:13 common knowledge. I mean, Brian, we talked about this a lot, right? We've been talking about performance problem patterns over the last five years since we've done the podcast. We always bring up
Starting point is 00:22:23 the same patterns because they're still out there, like the N plus one query problem, which is our all-time favorite. Our old favorite, yeah. Yeah, it is what it is. And I think, I mean, one of the reasons is, you said developers might not be interested in,
Starting point is 00:22:37 but we have to understand, and you said it earlier, we're all not getting younger, but we are getting older. There's new generations coming into our field. Some of them might not be trained engineers. Some of them may now with COVID make a career change, right? And they are obviously on a fast track trying to get into our industry. And how should we expect from somebody that gets a couple of months, maybe online education or however they get educated,
Starting point is 00:23:01 how should they know everything we've learned? I don't know. I was in a specialized high school for five years on software engineering. What I learned in five years, I don't expect people to learn in two months on YouTube. And then obviously with the years and years of experience that you and Brian and I have in the field, we obviously know things
Starting point is 00:23:21 and we should never take it for granted that everybody knows it. That's why it's so important that we do podcasts like this, where even simple things that we take for granted and we think everybody should know might not be known. Yeah, for sure. And there's always, I guess, an XKCD for it. My favorite one on this is the Diet Coke and Mentos one. So the idea is, for everything you think everyone knows, there's millions of people born in the U.S. every year, so there's thousands of people for whom it's like their first day of knowing about it. Like someone listening to this podcast is like, what do you mean, Diet Coke and Mentos? It's amazing. Go, go to the store, go,
Starting point is 00:24:03 go start that science experiment. Don't shame people for not knowing about it. Go enjoy it. Go find out about those things. So if you don't know what this is,
Starting point is 00:24:12 I'm going to say, if you don't know what this is, go get a bottle of Coke, go into the fancy room in your house with, like, the white couch and all that,
Starting point is 00:24:20 put the bottle of Coke on the floor there, and drop some Mentos in. Your parents will love you for it. That's going to be really cool. And they'll say, wow, you're a budding scientist. So how could we be mad? Right, exactly.
Starting point is 00:24:32 Sorry, Andy, you were saying. That's perfect. I was going down the line because you say go into a store and just try it out there. And you probably will never be allowed back into that store. I would say like there's a lawyer cat moment here or I should, you know, be like, right, this is this advice. You can follow it your own. Find someone who knows about this and get them to help you with this journey.
Starting point is 00:24:58 But I think the idea is very sound. Often as a tech industry, I think sometimes we do have like very prescriptive ideas about how things should operate, and we don't realize how much of that is not always obvious or consumable. And the more we can make things simple and consumable, reduce the cognitive load... You know, we've gone through the full stack phase, right? Like it sounds great, like being a full stack developer and knowing everything about everything, but it's not proven possible.
Starting point is 00:25:28 Like the idea that we can have someone who's, I think this was Charity who said this, like, if you're not updating the website and designing the computer chips, are you even a full stack engineer?
Starting point is 00:25:38 It's like, well, how can you do all these things? Like, it's not feasible. So we need to provide this guidance as part of our platform, as part of our team capabilities and do that in
Starting point is 00:25:49 as simple a way as possible. Make these things more accessible. I think that's where a lot of that culture comes in, right? Because if you're not, you're not going to be a full stack engineer probably. So you have to have that culture where the people who do know the hardware side or the operations side are sharing
Starting point is 00:26:07 and accepted for sharing. Same to the developers and back and forth and knowledge is being shared and common. We talk about repeating the same mistakes over and over again. And it's amazing how many times when we're talking to a client they've picked some technology to run their important stack on. It's like, well, why'd you pick that? Well, we were told to move to Kubernetes. All right, well, did you figure out how you're going to observe it?
Starting point is 00:26:32 Did you figure out how you're going to maintain it? Did you figure out how you're going to... Like, no, we just moved. And with the idea being, you make these decisions based on requirements, needs, and the ability to do all the things you need to do, including considering SRE, considering security, considering all the different elements, and then
Starting point is 00:26:50 say, what model matches our needs? And then that's the one you pick. But it's still, this is one of the things I think, Andy, just like the N plus one query problem, that struggle's going to go on and on and on and on forever. So these ideas of sharing these concepts over and over and over again are key to getting past that.
Starting point is 00:27:08 Well, never getting past it, but just key to keeping it on people's minds. So that less and less people do that over time. It's not a struggle. Hey, now we want to start SRE. Great, we're starting way behind now because we have all these other problems as opposed to we're in a good spot to now introduce. Speaking of that, though, I wanted to ask, you know, when moving to SRE, you know, you had those three points made earlier. First project, you know, our first foray into SRE,
Starting point is 00:27:37 we're going to put together the team, we're going to start experimenting and working it out. What makes a good candidate for an application? Or is there any general guidance on where you should start? Obviously, if you're doing cloud transformation, you're not going to start with your critical app. You're going to experiment with something lesser. What's a good advice for trying it? Yeah, and I think this isn't a checklist problem
Starting point is 00:28:04 or checklist solution, I guess. There's no sort of, hey, follow these boxes and it will tell you the one. But I do think that there is some pretty clear guidance that we've tried to give in the book, which is don't start with the most important thing in your company. If there's an existential thing where, if it goes down, everything stops running and, you know, that's the end of your company, maybe don't practice on that first. Like you can, but we've seen people struggle with practicing on those things. So you build up capabilities in anything, and the same way you build up SRE capabilities. So regardless of how good people are in terms of their SRE background
Starting point is 00:28:46 or knowledge, if they come into a new environment, they'll start thinking like, well, how does it work in this context? How do I understand this environment? And then this will take you some time and give it time is also a good, I think, idea on this. So the concept of let's parachute some people in, give them three months, and we're expecting much higher reliability, that's not a safe place to start. So starting somewhere where you can give yourself a good runway and say, hey, I really need more reliability for the service, and let's take a year or so to try and make that improvement. So we've talked as Google that it takes teams inside Google to move a level of reliability, to move like another nine, you know, it can take them years for services. And, you know, they're often pretty good at it. So if you're starting from scratch, like,
Starting point is 00:29:37 don't set yourself up to fail by picking the most important thing that's super critical. Don't pick your leave booking system either. Don't pick something that's non-critical that no one really worries about if it's down. Do that sort of medium-sized service first. Try and make sure it ticks those boxes, that there is some level of reliability as a differentiator, that it has some scale to it. That will get you most value in these spaces. And be prepared for that learning journey to be just like the DevOps folks, cyclical. You'll try stuff, you'll learn,
Starting point is 00:30:14 you'll build a bit more of a platform, you'll build some capabilities around that. And then the next one will be slightly easier. And if you've built the learning of the first one into the second one, each one gets easier and easier. Your platform grows and you build up those capabilities for your company and services. And that gives you that sort of positive reinforcement loop. You're going to get setbacks. And this is the other side as well. I think we talk about sometimes
Starting point is 00:30:40 people do ask for SRE implementations without failures or they're like, I like taking risks, but I don't really want to take risks for this. You can't avoid that. In your production environment, things are happening. There's no such thing as a risk-free production environment. The risks are occurring irrespective of how you view that. And sometimes I think being realistic about risk-taking is these things are happening anyway.
Starting point is 00:31:10 Like if you're in security, I think you have the same problems. People are trying to attack your systems on the internet. That's a reality to it. So you're not coming at it from a blank slate. You're saying, what's my current state? Can I make it better? Can I make improvements in there? How do I do the smallest amount of things that would make improvements, learn and then iterate in this? So I think those are the key things.
Starting point is 00:31:35 There's probably a couple of other factors that help. If you're not doing cloud, and I understand I work for a cloud provider, so I'm not going to push cloud, but I am going to tell you that SREs will struggle to do scale-out automation if you don't have scale-out automated infrastructure. So whether it's cloud, like a public cloud or a private cloud,
Starting point is 00:31:58 or whatever it is, if you don't have scale-out automated infrastructure, your SREs are going to start building that, because how else do they do scale-out automation? So be prepared that you have a bit of a cloud journey. Whether public, private, whichever vendor, that's not what I'm saying, but be prepared that those scale-out pieces will need to happen or they'll need to be built. And then I think the other one is, what's the path dependence that you have in some of these spaces?
Starting point is 00:32:26 Do you have an existing DevOps journey? Have you done a lot of cool stuff in that space? We think you'll find it easier if you've made a lot of progress in your DevOps journey to do things like SRE. There's a lot of overlap. We think reliability is a key concern
Starting point is 00:32:40 for the DevOps space. You can read some of the DORA reports on this. It's some evidence-based views of this. If you're not doing that, that's not necessarily a problem, but be prepared, you might have to do some of those. Silo breaking, for instance, might be more challenging when you haven't done that already, basically. So there's a couple of other factors in there. It always sounds like a lot, but there's really four or five key things there. Fantastic. Thank you.
Starting point is 00:33:09 James, I got two or three other questions that I would like to ask. The first one that I keep getting more and more from organizations, especially nowadays, that we all know that the economic climate may not be as nice anymore as it used to be. And so people are asking, so what's the ROI on all this? What's the ROI on actually moving to the cloud?
Starting point is 00:33:31 What's the ROI on setting up a new SRE team? What's the ROI on DevOps? Do you have an answer for this? Like, how do you argue for that? Yeah. So, you know, we talk about this deliberately as part of the book, and we talked about it at DevOpsDays Boston. So, one of the primary reasons for doing SRE at Google was cost. Steve, who was working at Google in those early years, talked about it from a practical
Starting point is 00:34:01 perspective. There's no way that you can scale by just adding humans to server administration. So we shouldn't shy away from the idea that SRE is very effective at cost reduction at scale. That's one of the key reasons we still do it. So sublinear scaling is a core part of the SRE approach. You don't have to do that. Google doesn't mandate how you do these kind of things, and that's for a good reason. But if you're looking for cost savings, we think sublinear scaling is a great cost saving. The number of people that you need to operate grows more slowly than your company does. That's our plan. And it's an open one. We're not secret about it. We think sublinear scaling is one of the key areas.
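A minimal sketch of what "sublinear scaling" means in practice, assuming you have two snapshots of operational load (for example, servers or services) and of operations headcount. The function names and the example numbers are hypothetical and are not taken from the episode or the book; the check simply compares the two growth ratios.

```python
def growth_ratio(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("the 'before' value must be positive")
    return after / before


def scales_sublinearly(load_before: float, load_after: float,
                       staff_before: float, staff_after: float) -> bool:
    """True when operations headcount grew more slowly than the load it supports."""
    staff_growth = growth_ratio(staff_before, staff_after)
    load_growth = growth_ratio(load_before, load_after)
    return staff_growth < load_growth


if __name__ == "__main__":
    # Hypothetical numbers: load quadrupled while the team grew by 60 percent.
    print(scales_sublinearly(1000, 4000, 10, 16))
    # Hypothetical numbers: load doubled and so did the team (linear scaling).
    print(scales_sublinearly(1000, 2000, 10, 20))
```

A comparison like this only describes the trend between two points in time; it says nothing about whether the reliability targets were met while the team stayed small.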
Starting point is 00:34:48 Where that gets problematic, and I think where we've tried to help people understand this, is often there are global benefits to that reliability in terms of your company. So things like reputational damage. If something goes wrong and your website is down, there's reputational damage. That's not necessarily a shock to anyone. But the disadvantages, the negatives of that accrue
Starting point is 00:35:11 to your whole company. The benefits may be hard to understand in an ROI calculation. And especially if you're running a cost center where you're saying like, well, hey, reliability costs us this much. Trying to make that local global trade-off in terms of your budgeting and understanding of how this works, that's where most of the problems occur. Like, if you can keep the global view of SRE and reliability matched up with the spend, you'll find this much, much easier. And we know this from customers who sort of keep that in mind. That may require an executive sponsor to keep those things balanced.
Starting point is 00:35:53 Someone who is concerned about that global impact and cost. The closer you try and keep that local optimization of cost, the harder it gets for reliability as well as SRE. So don't shy away from that cost angle, but just be super cognizant that you might not get those cost saves in like a very narrow local space. They have to be done as an ROI calculation
Starting point is 00:36:17 for your whole business. And that's what I think a lot of companies have done. So we know that you can do cost savings. Keep those factors in mind. That makes sense. It makes a lot of sense. And I can also just point people again to the chapters in the book
Starting point is 00:36:33 and also in your presentation, as you mentioned. I know I'm pointing on my side here. It doesn't make sense for the people that are listening to this. They don't know you're pointing. Exactly. People visualize
Starting point is 00:36:45 I point to my other screen. I can feel the enthusiasm though which is the most important part. Yeah, exactly. Cost optimization
Starting point is 00:36:53 is global, not local. SRE is a strategic investment in long-term operational efficiencies. There's a lot of interesting pieces
Starting point is 00:37:02 in the slides. The bit we find interesting as well is often I have this conversation with people. I'm like, do you have executive sponsorship? And they're like, yes, of course. Reliability is a huge concern for us. I'm like, fantastic. Who's your chief reliability officer?
Starting point is 00:37:14 Who's the executive who looks after it? And they're like, we don't need one. There's no need for one because it's so important. And I'm like, you have a CISO, right? Because security is important, but you have a CISO. And they're like, well, of course we have a CISO. Hang on. If you have a CISO, why don't you have a chief reliability officer?
Starting point is 00:37:33 That's often an awkward question. You know, the person who could wrap up all of your reliability costs and calculations and ROI, just like your CISO does in your security space. That's often a person who exists, but often they don't have the recognition or the title. Very few people seem to be called chief reliability officers. We haven't really seen that in the wild, but we've seen plenty of people who are an executive in charge of reliability. And we're like, fantastic. This is the person to talk about your costs and benefits with at that global level. We've had that person in Google for a long time. And we've never really changed that model. So we have no other way of measuring this. So that's just our opinion, really.
Starting point is 00:38:26 But I guess the companies where we've seen it be successful do tend to have executive sponsorship. And when that executive sponsorship kind of wanes, that's also when the reliability and some of these cost calculations just get deprioritized. So that's a good indicator that you can make changes in this space with prioritization at senior levels of your company. It does, however, indicate there might be a bit of a ceiling on your grassroots SRE efforts, like a grass ceiling, for want of a better word, where you will need some sponsorship to do things.
Starting point is 00:39:01 And I just think that's the same as security, right? Like if you don't have that in security, you always struggle to get good security outcomes. So I don't see where reliability is any different. That was the same pattern we saw for DevOps too. The people who were successfully transitioning had executive buy-in, and those who didn't had a long uphill battle. Some fell apart, some would do a grassroots effort and then catch somebody's attention. But if you start with that executive buy-in, that was our approach. Our CTO said, you know, get a release out in, what was it, 10 minutes or 30 minutes?
Starting point is 00:39:36 Yeah, we had a fast lane. We needed to be able to react to any problem within an hour and get the fix out, and prior to that it took us the classical two-week sprint. And then that was the idea. That was years ago. But it's that enablement. It's finding the budget. It's keeping track and helping justify the budget. Because that's one thing everyone's learning now,
Starting point is 00:39:57 as Andy brought up, is that we got the budget for everything. Now it's accountability time. So if you have someone who knows how to pull in that accountability, which, you know, oh, great, I want to be the CRO. I wouldn't know how to look at budgets and put all that stuff together. So you need someone
Starting point is 00:40:11 who knows that stuff to do it. So, yeah. Great point. Fantastic point there. These things are also, they seem like they're not technical concerns, but I think we do encourage people to not look at their jobs through this narrow lens of like, well, this is my technical element.
Starting point is 00:40:29 We're all here for business benefit. We're all here because we want our companies to be successful. We want these good outcomes for the overall business. And that's, I guess, in modern DevOps terms, right? I don't want to version out DevOps, but where DevOps started and I guess where it is today, we talk a lot more about business outcomes and we shouldn't shy away from that.
Starting point is 00:40:50 We should embrace that. It doesn't mean your executives should be telling you exactly what to do in terms of deployments, but they should be giving you those goals which are well aligned with these outcomes. And we shouldn't shy away from that. Well, yeah, I mean, even just simply, what are you building reliability for?
Starting point is 00:41:07 Your customers. That's the whole point, right? So that's, yes, I have this technical job, but at the end of the day, it's all to serve whoever my end customer might be. And if I'm not keeping that in mind... We're always doing it for a reason. Yeah, for sure.
Starting point is 00:41:22 And then maybe also to add one more thing, and again, I'm looking on the slides here. In terms of the ROI, you were quoting the State of DevOps report where it says, reliability is a force multiplier. Teams that excel at reliability engineering are 1.8 times more likely to meet
Starting point is 00:41:40 or exceed organizational goals. I guess that's a nice slide to put in front of your... You know, there's a reason. I'll swiftly plug the State of DevOps report. Like, I know it's produced, you know, Google owns DORA, so there's a conflict of interest there. But, you know, I also think you can look at other DevOps reports, right? Like, they're trying to tell you similar things.
Starting point is 00:42:02 We do think that based on the data that we have, there is a strong correlation between software performance and business outcomes. That's why we try and do these things. That's why we want companies to do these things. We want those better business outcomes. What we found when we're asking people about reliability in this space is that if you have a strong correlation between both,
Starting point is 00:42:24 and you can look at the report itself for more details, but if you're just doing really strong software development, that on its own isn't always a good indicator because sometimes you might be a bit of a feature factory. You might be pushing out releases and they're causing problems and they're not really working well from an operations perspective. The flip side is true. If you have a reliable system,
Starting point is 00:42:48 but it can't ever get updated, that's not helping your business necessarily move forward, right? So the combination of the two seems to be the secret sauce. And I guess this isn't necessarily new news for some of the DevOps teams, but I think it's worth repeating for many teams.
Starting point is 00:43:06 Try and get that balance where you are getting the right amount of releases to the right amount of reliability concerns. And if you do that, companies that do that seem to have much higher performance. So there's a strong correlation in those spaces. To write this down. It's awesome.
Starting point is 00:43:27 So, you know, people often ask, like, where's the money? And I'm like, well, here it is. You want higher business outcomes. This is it. This is why we try and do this. And here's the piece. I think what gets difficult
Starting point is 00:43:39 in complex adaptive systems is people often want to isolate and change one part of it, and that's not how, you know, complex systems work. Complex systems require you to look at the current state, the path dependence, like where a company has come from, where it's going towards, and then try and influence and nudge that in different directions. So it's not as simple as just saying like, okay, we need twice as many releases. That's one change in the wider ecosystem,
Starting point is 00:44:08 but it won't necessarily have the same outcome each time because, again, it depends on where you've come from. So trying to say, hey, we do want faster change, but we want it at the right pace of the reliability levels, that's, I think, an important thing when it comes to changing the system.
Starting point is 00:44:24 It's not about just sort of doing one or the other. Yeah, this is also one of the things that I mentioned, but the speed of delivery could also be, if you just measure yourself based on how fast you can deliver, you may even have a negative impact on the people that are on the receiving end. Your users might be also overwhelmed with changing again. I remember, Brian, in the early days when we were kind of speeding up from, remember, the two
Starting point is 00:44:51 releases per year to then two major releases every other week, we had a hard time to actually keep up with educating our internal folks on how to sell the new, because our whole product development is just one piece of the whole end-to-end chain.
Starting point is 00:45:07 Then you have support, then you have sales engineering, you have your customers, so you obviously, it all needs to work in the right balance. And I think that's the nice way. We need to balance everything out. But you had buttons moving. Where do I go to turn that off? Oh, they moved that.
Starting point is 00:45:26 But in the video, the video shows it here. But the video is two weeks old. It was a good growing pain to have, though, because we adjusted from it. James, I have one final question for you because this is something that
Starting point is 00:45:42 comes up a lot in discussions as well: the topic of SLOs, service level objectives. And I also, I borrowed one of your quotes into one of my texts that I'm about to present in the meeting
Starting point is 00:45:56 after this recording. He cried. I credit you. Yeah, I have a screenshot of your book. But I say the power of the SLO is driving transformations. And in the book you say,
Starting point is 00:46:10 focus on the user and all else will follow. And I like this a lot because when I started talking about SLOs in the very beginning, I think I had a wrong approach of SLOs because I thought I need to put SLOs on every single service. I need to define at least three to five or maybe 10 SLOs.
Starting point is 00:46:31 But in the end, this doesn't really make sense. You need to define the SLOs that are aligned with your business objectives, right? Like how many users do you want to have in the system? What should be the availability, your critical end user journeys or whatever you want to call them. And then I think the rest
Starting point is 00:46:45 will follow because then you will do everything to keep those top-level SLOs in check which means you are ending up with business success and then you'll figure out what it means, what you need to do in order to support these top-level SLOs.
Starting point is 00:47:02 My question to you now is, I see you're nodding. People don't see that you're nodding, but you're kind of nodding. When people ask you, you get approached by a lot of organizations. What do you tell them in terms of where do you start
Starting point is 00:47:20 with SLOs? Again, it is a very common, I think, question in this space. I don't think there's a nice prescriptive answer. I would say the most common theme I would normally do is start where you are. Like often people were like, well, you know, tell me what I need to do. And I'm like, start where you are. Like wherever you are right now, what do you have? What's easy to measure? What's the SLIs that you might have? What are the things in place that you have today? Start there. Instead of thinking what would be
Starting point is 00:47:56 the perfect outcome for my service in this hypothetical space, you probably won't know the right SLOs until you've explored this space. Like it's inherently a complex system. So trying to sort of design them up front, trying to say like, hey, if I just got enough SLOs, like this would somehow be okay. That often doesn't really help people in this space. And so thinking of all of these things, SLOs, SLIs, error budgets, as just tools in your toolbox for this space, I think is much more important than the individual metrics. Like the things that you're trying to sort of say are critical, like will change over time.
Starting point is 00:48:38 They'll change depending on your business. And using them, like I say, like different tools in your toolbox, and starting with, okay, what do we have now and how do I then increment it, is almost always, I think, where we start. If you think of these tools as mechanisms for communication across silos, that's also, I think, very important. So an error budget is a way of us saying, hey, I have an understanding of how much reliability I need in my service, and I've agreed that with other parts of my business. The exact number in the error budget isn't the most important part of it. It's the fact that you've made that agreement, and there's a shared understanding of what you want. So an SLO is a shared understanding between yourself, other parts of your business, and your customers of what they can normally expect, what they kind of want.
Starting point is 00:49:35 The fact that you have that shared understanding is more important than the exact thing itself. Please don't go on Twitter, folks, and be like, aha, James said SLOs aren't important. That's not what I think we're trying to say. What we're saying is the right SLOs for you will be custom, and you will often not be able to well prepare them in advance. But once you start building them and using that feedback, that negotiation of these proxy metrics that you can use will help you turn some very speculative ideas into data. They will start helping you use these tools to make better data-driven decisions
Starting point is 00:50:17 and that will then help you make better choices for your business. It doesn't mean you just have to do these things because we said so. That's not what we're trying to say. They're just tools in the toolbox, basically. It's a complex topic, but does that kind of make sense? It does make sense.
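As a rough illustration of the error-budget idea discussed above, here is a minimal sketch that turns an availability SLO into allowed downtime and reports how much of that budget is left. It assumes a simple time-based availability SLO over a rolling window; the function names, the 30-day window, and the example figures are assumptions made for illustration, not something prescribed in the episode or the book.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(slo_target: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical agreement: 99.9% availability over 30 days.
    print(round(error_budget_minutes(0.999, 30), 2))
    # Hypothetical month with 21.6 minutes of downtime so far.
    print(round(budget_remaining_fraction(0.999, 30, 21.6), 2))
```

Adding "another nine" (going from 99.9% to 99.99%) shrinks the budget by a factor of ten, which is one way to see why that step can take a long time. The number itself matters less than the fact that it has been agreed with the rest of the business.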
Starting point is 00:50:37 Thank you so much. And as Brian and I always say, the great thing about doing this podcast is because we always learn a lot and I mean hopefully our listeners learn a lot as well but I always want to say thank you for spending the time with us and giving us your insights
Starting point is 00:50:54 because that's the great thing about the communities that we live in because we are all open, we share even though we work for certain organizations and we may want to keep certain things as a secret but we don't do this. We just share openly because in the end, we all get better.
Starting point is 00:51:10 It's really the most important part is that continuous learning. Exactly. Awesome. I think we consumed a lot of your time. Thank you so much. I really hope I get a chance again to meet you at one of the upcoming talks.
Starting point is 00:51:24 Do you have any talks coming up in the next months where people may be able to see you live? I don't, but if people want to see Steve McGhee, my collaborator in this space, he's presenting this Thursday at All Day DevOps. And we'll be talking about the reliability mapping that we've been trying to do. So come to All Day DevOps, it's a virtual conference, and learn a bit about reliability mapping. That's good. Awesome.
Starting point is 00:51:51 Yeah, I don't think reliability is going away, so we'll be talking about this. We're just going to stop caring about performance altogether. When we're robots, I still need it. Anyway, yeah, and I got nothing else to add. I really appreciate your time, James. Very educational and informative.
Starting point is 00:52:10 So thanks for spending the time with us and thanks for spending the time with our listeners and thanks to our listeners for being there. Thanks, folks. Much appreciated.
Starting point is 00:52:20 Yeah, we hope everyone has a fantastic time and I don't know, doing whatever. I'm a mess today, I guess. I can't even think straight. It's still only, what? It's just turning 10 for me.
Starting point is 00:52:30 Anyhow, thank you, everybody. We'll see you next time. And yeah, bye. Bye-bye. Thank you. See you. Thank you. Bye.
