PurePerformance - SRE for the non-unicorns (aka Enterprises) with James Brookbank
Episode Date: December 5, 2022
You have a CISO (Chief Information Security Officer) but no CRO (Chief Reliability Officer)? You blame people if systems crash? You scale your people at the rate you scale your infrastructure? If you answer any of those questions with YES, then you should tune into this podcast, as you probably struggle to adopt Site Reliability Engineering (SRE) in your organization.
James Brookbank, Cloud Solutions Architect, dealt with resiliency topics in a large enterprise prior to joining Google. In our conversation he shares the advice he gives enterprises to convert the excitement about SRE into actual implementation. James gave some good guidance on which projects are good (and not so good) to start with. He gives practical examples of what it means to change your company culture and why there doesn't have to be an SRE for every service.
In our call we discussed the SRE in the Enterprise talk at DevOpsDays Boston and SREcon EMEA as well as their recent book. Here are all the relevant links:
James Brookbank on LinkedIn: https://www.linkedin.com/in/jamesbrookbank/
SREcon EMEA Slides: https://www.usenix.org/system/files/srecon22_slides_mcghee.pdf
DevOpsDays Boston 2022 Session Recording: https://www.youtube.com/watch?v=__e7b25QOHc
Enterprise Roadmap to SRE Book: https://sre.google/resources/practices-and-processes/enterprise-roadmap-to-sre/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody, and welcome to another episode of Pure Performance. Do you know why?
I have no clue.
Every day, every podcast is special for me.
Well, it's the one and only podcast that we're recording today, so that makes it special.
Yeah. I got nothing else. I literally thought about that for about five minutes before joining the call, and that's the best I could come up with today.
Was that the only thing on your calendar today? Is that what you're saying?
It's the only podcast on the calendar today.
Okay. How many podcasts... what else do you do besides Pure Performance?
Pick my nose.
Oh, yeah, pick my nose. Hey, it's the election day special, too. Over here in the States, it's election day, and everywhere you look, it's just going to be people talking and speculating and speculating. Oh, maybe he's up two points, maybe down two. Oh, frustrating today to get through, because it's just all speculation.
Do you think it's a lot about metrics and who in the end gets closest to the goal?
You know what?
Yes, but I will say,
unlike performance engineers and people
in SRE and observability,
it's about metrics, but it's about
trying to fill the time, making up stories
to fit them for
entertainment purposes, where I think people in our field are really doing hardcore work, trying to change and make the world better. And there's, you know, there's a special group of people who've come up recently in that realm, within the last, you know, the last 10 years, who've really been making an impact and a difference for everybody's lives. Even if you don't know it, you're on the receiving end
of great performance
because of the hard work and dedication of these
people. Who might they be, Andy?
Who might they be? Maybe we have somebody on the line
today that could fill us in.
Okay, let's stop with this.
It was a good one, but yeah.
Welcome to the
show, James. Thank you
so much for being here.
I'm sorry that you had to endure
the last three, four minutes
where we tried to be funny,
but we always tried.
And eventually, in a couple of years from now,
we will be funny, hopefully.
Yeah, I just saw a scowl on his face the whole time.
It's just good learning.
We're not failing.
We're just learning fast.
Yeah, exactly.
Hey, James, thanks for being here.
The two of us, we met at DevOps Days Boston
where I saw you co-present with Steve McGhee
who has been on our show in the past.
And some may know you, some may not know you.
So for those that don't know you yet,
could you quickly give an introduction,
who you are, what you do?
For sure.
So I'm James Brookbank.
I look after a team of cloud solution architects at Google.
I'm not talking on behalf of Google today, though.
Just don't tweet my boss and say, like, hey, I said something.
It's not that kind of talk.
I'm the other half of the Steve McGhee partnership, in that Steve has spent 15-plus years as an SRE at Google and knows how all the unicorns are made.
I've spent 20 odd years in enterprises like banks, large companies with sort of difficult
scale problems, but really not tech startups, basically.
And so one of the things I try to do working at Google now is bring that expertise and talk about things like reliability.
We talk a lot about sort of SRE, but really it's the wider reliability concerns with cloud customers and, you know, try and move the state of the art forward in this area, not just for the tech companies, right? For everyone. Yeah. And I think that's also what I enjoyed so much about
when you were on stage in Boston,
you actually said, right,
not everybody's looking up to Google,
but as you said, they are one of the unicorns,
but not everybody is like Google.
And that could also explain a little bit
why people get really excited about what they hear,
but then they're falling short on actually delivering, right?
How can we apply what we learn from the Googles
and the Facebooks and the Amazons of the world? What can we learn? But then people are excited and then the excitement
stops somewhere and nothing materializes. And you have your presentation, SRE in the
Enterprise. There's also a great book. I think you actually have it available for free,
if I'm not mistaken. You can download it
from the SRE Google website
and that's the best
bit about it. We made it very
short. It's 50 pages.
It's designed to be executive
friendly, so you can read it on
the plane, you can give it to your boss
and be like, hey, you don't have to read
the full SRE book. You can later,
but maybe start with like the cliff notes
and just sort of try some of these things first, basically.
So we deliberately try to make that accessible.
It's not like all of the books we publish.
It's not a hard and fast set of rules.
It's not designed to tell you you must do this
and then you'll be SRE.
It's just lessons that perhaps we've learned
and customers have learned in these spaces, and we're just trying to share those things.
Now, we will add the links to the podcast description. So folks, if you're listening to this, then the link is going to be there, and then enjoy the 50-page read. I've enjoyed it twice already, on two plane rides, which was nice.
I have a question,
a couple of questions actually for you because we've been...
Please.
Yeah, thank you.
We've been talking about DevOps
and SRE and SLOs
for several years now, right?
We had people like Gene Kim on
a couple of years ago
when kind of DevOps became really
popular. And then we had
on the SRE side where we
must have had at least 10, 15 different
people that are SREs. Just the last episode
was with Diana. She
is an SRE lead at
a large bank in Canada
and kind of her journey into
SREs.
What I would like to know from you is: what advice do you give when you talk with these enterprises, when they come to you? What do you tell them to actually, really apply SRE, and what are the things they should apply?
What maybe they should not apply?
What are some of the best practices?
Yeah.
And there's obviously a lot to this.
So, I think the first step is often people want a single lever or a silver bullet that they can do. They're like, you know what, I really need to do this. It's very important for my business.
Tell me the one thing that I can do to make this happen. And I think what we found is that's not
going to happen. It's the same with a lot of these
areas, the same with DevOps, Agile. You can't really do just one thing and expect quite serious
changes. And we wouldn't do that in any other part of our lives. We wouldn't expect that to just be
like, well, I just changed this one thing, but now I have outsized performance. I think, though, the other side is we do, I think, know a relatively small number of things that make high-impact changes in this space. So, you know, one of the things that we talk about quite heavily in the book is we perhaps overestimated how much of SRE and reliability concerns were technical and how much were not, how much were part of the social
and team structures of organizations,
of the culture of organizations.
And I think we found a lot of overlap
between the DevOps initiatives
and SRE initiatives where we were saying,
if you do have a generative culture,
if you do have an ability
to raise incidents safely,
you have psychological safety within your company,
you're highly likely to form learning loops.
And when you have incidents,
they're going to be learning opportunities,
you'll make things better,
and you'll have this sort of positive cycle
of reinforcement for your operations teams
and generally for your company.
If you're not doing that,
like buying tools or implementing practices on their own, it's not that they're not effective, but they're not providing that sort of 10x that people are looking for.
They're really fighting against the grain of a lot of these things.
Just to talk about the culture piece as well, I think one of the things that people often get frustrated about is, could you tell us the culture?
Is it ping pong tables or foosball?
Maybe we put the free food in.
That's the culture you're talking about.
And it's not.
And we do go into some detail on this in the book
and some of the talks we've had
where Google has published guidance,
but you can look at, say,
the Westrum model of culture and say,
actually, culture means things like psychological safety,
like dependability.
If you do those things, you have those capabilities,
you're much more likely to get better SRE outcomes
and DevOps outcomes.
And you'll probably get, like, the free food
as a result of that.
Like, it will be emergent from that culture.
All these things will come from doing those things correctly
as opposed to the other way around,
where you're like, if I get enough foosball tables in,
suddenly the developers will be doing this.
Flip that the other way around.
And also don't overestimate
how simple some of these cultural changes are
in terms of just paying attention to incidents
and using retrospectives as learning opportunities
instead of saying, well, you know, we've got so many incidents
and I guess there's nothing we can do about it.
Like that's really the most important place to start,
like build those learning loops and start small.
So there's more, but I guess that's the key one
that seems to drive like the best behaviors.
I'm imagining
an office full of desks that
are actually foosball tables.
So you work on every single desk being
a foosball table. That'd be, yeah,
come work here. It's great.
If we could do that, if that was the thing,
then we would have done it.
If we knew that one thing,
one level we could pull that would make it happen.
It's not as simple as that, but it's easier than it sounds.
The culture changes aren't expensive, but they are challenging.
I'm just looking at your slides, I think from SREcon EMEA, that you've just given as well.
A couple of weeks ago, yeah, we talked about some of these items.
Yeah.
And what I find fascinating, when I read the book, in the book, you make a comment and
kind of say, you know, the history of kind of software engineering a little bit where
now in the world that we live in, reliability also becomes a differentiator, obviously,
right?
I think you cannot always compete maybe
with the best user experience,
but you can compete with having the best reliability.
You're always on.
Like Google.
Google is known for the Google search is always there.
And that's a differentiator.
Now, on the flip side of this, looking at slide number five or six,
it's a great point. It says reliability
is not always the most
important thing. I think you also need
to be then very careful on where
I guess it makes sense to invest
in these concepts and where it does not yet
maybe make sense.
Are there maybe some
points that you can
give on when is it a good time
to really focus on
better reliability and maybe at some
point it doesn't make
sense right now.
I think it's a crucial piece because
one of the things that I think Google has made
fairly public
but perhaps not everyone has sort of, you know, seen this,
is that we do not have SRE for every service. Like that's a deliberate decision inside Google that
many smaller services that are non-critical do not require SREs. They benefit from the SRE
ecosystem. They benefit from our sort of DiRT testing, like disaster recovery testing, that kind of stuff.
But we don't have it for every service.
And we deliberately try and focus our SRE efforts
on the very high reliability services.
So, you know, sometimes when, you know,
I have this conversation with customers
or other third parties and say, like,
what are you trying to do with SRE?
And they're like, oh, every service, we've assigned some SREs. I'm like, you can do that, but that's not really how it's done, say, at Google or other companies who are following this model. And do we have, like, a nice, easy matrix decision of which services should be SRE-supported? It's very fluid.
Like it's not a well-described process,
but it kind of falls into like three main categories.
And we did talk about this a little bit at SREcon EMEA.
So ask yourself if it is a product differentiator for your service.
Like if you're not sure, it probably isn't.
Like if your service can be down for a few hours and, you know, it's not really expensive to do so, have a think about this. Sometimes we see systems like leave booking, you know, people are like, hey, the leave booking system is down. That's probably okay for a few hours. People can book their holiday or just let their boss know. You don't need to have this critical five-nines service for it. Pick your battles in that. Is there an existential
risk for your service? Sometimes we speak to banks, and I've had this experience with banks
I've worked at, where they're like, this service absolutely must operate. It has to complete
something by the end of the day, or we'll be speaking to a regulator. If you have a critical
service like that, you'll immediately have access to funding and SREs, right?
Like people will understand the existential nature
of those services.
So this helps you like identify those critical things.
And I think the final one is scale.
And one of the key reasons that Google did SRE
and where it becomes, I think, very important
is when you're trying to scale,
if humans can't manage the infrastructure involved,
you'll immediately start adopting SRE practices.
You'll kind of need that capability in these spaces.
And even if reliability isn't the primary concern,
just the cost of it will become like a,
things like capacity planning
will start becoming a software problem
and not just a
sort of traditional operations problem. So if you had to pick those, it would be those three things that probably are the main indicators. Your mileage may vary, I think, is the final bit to that, though. So that all makes sense?
Yeah, it does for me, and especially maybe on the last one. It's an interesting point, because just to reiterate, if I understand this correctly,
if you say you have a system
that needs to scale
because let's say you're building
this cool next generation Facebook
or whatever it is,
and it all of a sudden starts
to get very popular.
And in order to scale,
you have two options.
You either hire as many new people
as you need to scale up your hardware
or you actually start figuring out
how you can automate everything around the operational aspect and the resiliency aspect, because you cannot just, you know, scale with manpower as you need to scale your software. Did I get this right?
This is it. And again, the concept we talk
about is the pyramids of reliability, which again is in the book.
So you can look at that in more detail.
But I guess the concept is we've seen reliability when I was younger.
I'm a little bit older now.
And certainly when I first started working in IT, I was looking after old-school Unix machines, like Sun Microsystems.
Some people will remember those days.
And you could just get bigger ones.
It wasn't really, I mean, it was a money problem,
but you could just go buy an E25
and you would just get 100 boards and all the RAM you wanted.
It just cost you a lot of money, but you could buy a larger server.
So we did that as an industry for a while.
And then eventually we ran out of vertical scaling.
And so this model, while still effective in some scenarios,
creates problems for internet-facing services
where there are millions or billions of users.
And immediately you start thinking about how we scale out.
And that scale-out mode of operation is where SRE kind of shines
and where reliability concerns become software predominantly as opposed to operations and sort of stacking reliability on top of stronger hardware.
So I think sometimes people say, well, which one do I need to choose?
And it's really finding the right way for you and your company.
But if you start mixing and matching those models,
if you cross the streams of the reliability pyramids
and you start doing scale-out on expensive hardware
or scale-up on unreliable hardware,
that's when it gets very, very difficult
to manage your reliability concerns.
And you will then start thinking about some of these things, right?
Like you will start looking at these concerns of your scale-out architectures with some of the techniques that we've seen work in this space.
Does that make sense?
It does.
And I remember in Boston, and I'm sure you had the same question also in,
by the way, where was SREcon EMEA?
It was in Amsterdam.
In Amsterdam, yeah.
It's been a long time since I went.
The last time I was in Amsterdam was 2011.
I did the Amsterdam Marathon.
So that feels like a long time ago now.
You did the marathon?
It's still a very strong place.
I did 26 miles, which was a very long way.
What was the, can I ask what time did you have?
Five and a half hours.
Five and a half hours, yeah.
But you made it. I have never done a marathon.
Which is the most important part.
Andy, well, I'm curious, if he said 12 hours,
what would you have said?
I would have said, I have never finished a marathon, so he achieved something I've never achieved.
And I think just getting through it is amazing.
Nice answer.
This is part of that culture you're talking about, right?
Yeah.
Is being accepting and safe.
Yeah, exactly.
Great example, Andy.
I tried to put you on the spot and I failed as a wiseass.
What am I going to do?
I'm the problem here.
So Amsterdam and Boston,
I remember in Boston,
because you talked about the pyramid,
I think in Boston you actually asked the question,
I think it was something like,
can you build a 99.99% service
on a 99.9% infrastructure?
Did you ask the same question in Amsterdam as well?
We did.
And again, I think the complexity of this
is that sometimes what's intuitive for people
is not perhaps realistic.
So sometimes people are like,
well, how on earth can you build something more reliable
on something less reliable?
And we're like, well, the classic is RAID arrays.
If you haven't built a RAID array, you're using one.
You may not know it.
But a RAID array takes your disks, which are very unreliable.
I used to run a storage team.
We lost a lot of disks.
There's two types of hard disks,
dead or dying. And so you don't have a choice there. You have to make a more reliable storage
system on top of that. And that's what we mean when we say that you're gluing together
unreliable infrastructure components to make a reliable service. That's exactly what RAID does.
Now, does that mean that every service needs to look like that
or every part of infrastructure needs to look like that?
That's not really what we're saying,
that you can always do this or you should always do this.
We're saying it's possible.
And I think sometimes people don't want to look through that
from a reliability perspective
because it creates software and people complexity.
But it is certainly possible.
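A rough illustration of the RAID point above, assuming independent failures (which real systems often violate): composing a more reliable whole out of less reliable parts is just availability arithmetic. This is a minimal Python sketch, not anything from the episode or from Google.

```python
# Minimal sketch: availability of N redundant replicas, assuming
# independent failures (real systems often have correlated failures).

def parallel_availability(component_availability: float, replicas: int) -> float:
    """System is up as long as at least one replica is up."""
    failure_prob = 1.0 - component_availability
    return 1.0 - failure_prob ** replicas

single = 0.999  # a "three nines" component
print(f"{parallel_availability(single, 1):.6f}")  # 0.999000
print(f"{parallel_availability(single, 2):.6f}")  # 0.999999 -> roughly six nines
```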
So I can only encourage everyone to check it out. I mean, again, we will also, if it's okay with you, link to the SREcon material, the slides.
Absolutely.
Everything from the SREcon stuff is open access.
So that talk will be published, I think, in a few weeks. But again, the slides are up and
these aren't designed to sort of, I think, trick people in terms of saying like,
oh, well, you said I can't build this service in this way. I think what we're just trying to explore is the choices you make in your architecture
often have trade-offs like benefits and concerns,
but you can make active choices around reliability with all of the other good things,
with cost and scale and people.
All those things can factor in.
You've just got a few choices to make
and these have impacts in that.
And for a lot of teams,
especially in the operations space,
this seems like old news.
This is something you're like,
well, everyone knows this.
Of course, this is obvious how you build these services.
For a lot of developers, this isn't an interesting area.
Why should it be? So they often don't know this until it bites
them. And I'm not here to shame them into that.
I'm here to explain it. Here's how it works. Here's why it works. Here's how we
do it like this. So I think that's the important bit. It's just understanding some of these models
and making them as simple
to understand as possible.
And I think there's nothing wrong
about repeating things
that we think should be
common knowledge.
I mean, Brian,
we talked about this a lot, right?
We've been talking about
performance problem patterns
over the last five years
since we've done the podcast.
We always bring up
the same patterns
because they're still out there,
like the N plus one query problem,
which is our all-time favorite.
Our old favorite, yeah.
Yeah, it is what it is.
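Since the N+1 query problem keeps coming up on the show, here is a small hypothetical sketch of what it looks like and how a single batched query avoids it. The table names and queries are invented for illustration and aren't from any guest or product.

```python
# Hypothetical N+1 example using sqlite3 from the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT);
""")

def items_n_plus_one():
    # 1 query for the orders, then 1 extra query per order: N+1 round trips.
    orders = conn.execute("SELECT id FROM orders").fetchall()
    return {
        oid: conn.execute(
            "SELECT sku FROM order_items WHERE order_id = ?", (oid,)
        ).fetchall()
        for (oid,) in orders
    }

def items_batched():
    # One join returns the same data in a single round trip.
    rows = conn.execute("""
        SELECT o.id, i.sku
        FROM orders o LEFT JOIN order_items i ON i.order_id = o.id
    """).fetchall()
    grouped = {}
    for oid, sku in rows:
        grouped.setdefault(oid, []).append(sku)
    return grouped
```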
And I think, I mean, one of the reasons is,
you said developers might not be interested in,
but we have to understand, and you said it earlier,
we're all not getting younger, but we are getting older.
There's new generations coming into our field.
Some of them might not be trained engineers.
Some of them may now with COVID make a career change, right?
And they are obviously on a fast track trying to get into our industry.
And how can we expect from somebody that gets a couple of months of maybe online education, or however they get educated, how should they know everything we've learned? I don't know.
I was in a specialized high school
for five years on software engineering.
What I learned in five years,
I don't expect people to learn in two months on YouTube.
And then obviously with the years and years of experience
that you and Brian and I have in the field,
we obviously know things
and we should never take it for granted
that everybody knows it.
That's why it's so important that we do podcasts like this where even simple things that we take for granted
and we think everybody should know might not be known.
Yeah, for sure. And there's always, I guess, an xkcd for it. My favorite one for this is the Diet Coke and Mentos one. So the idea is, for everything you think everyone knows, there are millions of people born, like, in the U.S. every year, so there's thousands of people for whom this is their first day of knowing about it. Someone listening to this podcast is like, what do you mean, Diet Coke and Mentos? It's amazing. Go to the store,
go start that science experiment
don't shame people
for not knowing about it
go enjoy it
go find out about
those things
So if you don't know what this is, I'm going to say,
go get a bottle of coke
go into the fancy room
in your house
with like the white couch
and all that
put the bottle of coke
on the floor there
and drop some Mentos in
your parents will love you for it.
That's going to be really cool.
And they'll say, wow, you're a budding scientist.
So how could we be mad?
Right, exactly.
Sorry, Andy, you were saying.
That's perfect.
I was going down that line, because you said go to the store and just try it out there. And you'll probably never be allowed back into that store.
I would say there's a lawyer-cat moment here where I should, you know, be like, right, this is just advice. You can follow it on your own.
Find someone who knows about this and get them to help you with this journey.
But I think the idea is very sound.
Often as a tech industry, I think sometimes we do have like very prescriptive ideas about how
things should operate, and we don't realize how much of that is not always obvious or consumable. And the more we can make things simple and consumable, reduce the cognitive load... You know, we've gone through the full-stack phase, right? Like, it sounds great being a
full stack developer and knowing everything
about everything,
but it's not proven possible.
Like the idea that we can have
someone who's,
I think this was Charity
who said this, like,
if you're not updating
the website and designing
the computer chips,
are you even a full stack engineer?
It's like, well,
how can you do all these things?
Like, it's not feasible.
So we need to provide
this guidance
as part of our platform, as part
of our team capabilities
and do that in
as simple a way as possible.
Make these things more
accessible. I think that's where
a lot of that culture comes in, right? Because
if you're not, you're not going to be a full
stack engineer probably.
So you have to have that culture where
the people who do know the hardware side or the operations side are sharing
and accepted for sharing. Same to the developers and back and forth
and knowledge is being shared and common. We talk about repeating the same
mistakes over and over again. And it's amazing
how many times when we're talking to a client
they've picked some technology to run their important stack on.
It's like, well, why'd you pick that?
Well, we were told to move to Kubernetes.
All right, well, did you figure out how you're going to observe it?
Did you figure out how you're going to maintain it?
Did you figure out how you're going to...
Like, no, we just moved.
And with the idea being,
you make these decisions based on requirements, needs,
and the ability to do all the things you need to do, including
considering SRE, considering security,
considering all the different elements, and then
says, what model matches our needs?
And then that's the one you pick.
But it's still,
this is one of the things I think, Andy,
just like the N plus one query problem,
that struggle's going to go on and on and on
and on forever. So these ideas
of sharing these concepts over and over and over again are key to getting past that.
Well, never getting past it, but just key to keeping it on people's minds.
So that less and less people do that over time.
It's not a struggle.
Hey, now we want to start SRE.
Great, we're starting way behind now because we have all these other problems as opposed to we're in a good spot to now introduce.
Speaking of that, though, I wanted to ask, you know, when moving to SRE,
you know, you had those three points made earlier.
First project, you know, our first foray into SRE,
we're going to put together the team,
we're going to start experimenting and working it out.
What makes a good candidate for an application?
Or is there any general guidance on where you should start?
Obviously, if you're doing cloud transformation,
you're not going to start with your critical app.
You're going to experiment with something lesser.
What's good advice for trying it?
Yeah, and I think this isn't a checklist problem
or checklist solution, I guess.
There's no sort of, hey, follow these boxes and it will tell you the one.
But I do think that there is some pretty clear guidance that we've tried to give in the book, which is don't start with the most important thing in your company.
If there's an existential thing where, if it goes down, everything stops running and, you know, that's the end of your company, maybe don't practice on that first. Like, you can, but we've seen people struggle with that, with practicing on those things. You build up capabilities in anything the same way you build up SRE capabilities. So regardless of how good people
are in terms of their SRE background
or knowledge, if they come into a new environment, they'll start thinking like, well, how does it
work in this context? How do I understand this environment? And then this will take you some
time and give it time is also a good, I think, idea on this. So the concept of let's parachute
some people in, give them three months, and we're expecting much higher reliability, that's not a safe place to start.
So starting somewhere where you can give yourself a good runway and say, hey, I really need more reliability for the service, and let's take a year or so to try and make that improvement.
So we've said, as Google, that for teams inside Google to move a level of reliability, to move like another nine, you know, it can take them years for services.
And, you know, they're often pretty good at it. So if you're starting from scratch, like,
don't set yourself up to fail by picking the most important thing that's super critical.
Don't pick your leave booking system either. Don't
pick something that's non-critical that no one really worries about if it's down. Do
that sort of medium-sized service first. Try and make sure it ticks those boxes, that there
is some level of reliability as a differentiator, that it has some scale to it. That will get
you most value in these spaces.
And be prepared for that learning journey to be just like the DevOps folks, cyclical.
You'll try stuff, you'll learn,
you'll build a bit more of a platform,
you'll build some capabilities around that.
And then the next one will be slightly easier.
And if you've built the learning of the first one into the second one,
each one gets easier and easier. Your platform grows and you build up those capabilities for
your company and services. And that gives you that sort of positive reinforcement loop.
You're going to get setbacks. And this is the other side as well. I think we talk about sometimes
people do ask for SRE implementations without failures, or they're like,
I like taking risks, but I don't really want to take risks for this.
You can't avoid that.
In your production environment, things are happening.
There's no such thing as a risk-free production environment.
The risks are occurring regardless of how you view them.
And sometimes I think being realistic about risk-taking is
these things are happening anyway.
Like if you're in security, I think you have the same problems.
People are trying to attack your systems on the internet.
That's a reality to it.
So you're not coming from it from a blank slate.
You're saying, what's my current state?
Can I make it better?
Can I make improvements in there? How do I do the smallest amount of things that would make
improvements, learn and then iterate in this? So I think those are the key things.
There's probably a couple of other factors that help. If you're not doing cloud, and I understand
I work for a cloud provider, so I'm not going to push cloud, but I am going to tell you that
SREs will struggle to do scale-out
on automated infrastructure if you
don't have scale-out automated infrastructure.
So
whether it's cloud, like
a public cloud or this private cloud,
but whatever it is, if you're not having scale-out
automated infrastructure,
your SREs are going to start building that
because how else do
they do scale-out automation? So be prepared that you have a bit of a cloud journey, whether public, private, whichever vendor, that's not what I'm saying, but be prepared that those scale-out
pieces will need to happen or they'll need to be built. And then I think the other one is,
what's the path dependence that you have in some of these spaces?
Do you have an existing DevOps journey?
Have you done a lot of cool stuff in that space?
We think you'll find it easier
if you've made a lot of progress
in your DevOps journey
to do things like SRE.
There's a lot of overlap.
We think reliability is a key concern
for the DevOps space.
You can read some of the Dora reports in this.
It's some evidence-based
views of this. If you're not doing that, that's not necessarily a problem, but be prepared,
you might have to do some of those. Silo breaking, for instance, might be more challenging when you
haven't done that already, basically. So there's a couple of other factors in there. It always sounds like a lot, but there's really four or five key things there.
Fantastic. Thank you.
James, I got two or three other questions
that I would like to ask.
The first one that I keep getting more and more
from organizations, especially nowadays,
that we all know that the economic climate
may not be as nice anymore as it used to be.
And so people are asking, so what's the ROI on all this?
What's the ROI on actually moving to the cloud?
What's the ROI on setting up a new SRE team?
What's the ROI on DevOps?
Do you have an answer for this?
Like, how do you argue for that?
Yeah.
So, you know, we talk about this deliberately, part of the book, and we talked about it at
DevOpsDays Boston. So, one of the primary reasons for doing SRE at Google was cost.
Steve, who was working at Google in those early years, talked about it from a practical
perspective. There's no way that you can scale by just adding humans to server
administration. So we shouldn't shy away from the idea that SRE is very effective at cost reduction
at scale. That's one of the key reasons we still do it. So sublinear scaling is a core part of the
SRE approach. You don't have to do that. Google doesn't mandate how you do these kind of things,
and that's for a good reason. But if you're looking for cost savings, we think sublinear
scaling is a great cost saving. The number of people you need to operate grows more slowly than the growth
rate of your company. That's our plan. And it's an open one. We're not secret about it.
We think sublinear scaling is one of the key areas.
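To make "sublinear scaling" concrete, here is a toy comparison with made-up numbers (not anything Google publishes): operator headcount when every batch of servers needs another admin, versus when automation keeps human effort growing roughly with the log of the fleet.

```python
# Toy comparison of linear vs. sublinear operations scaling (illustrative numbers).
import math

def linear_ops(servers: int, servers_per_admin: int = 100) -> int:
    # Traditional model: headcount grows in lockstep with the fleet.
    return math.ceil(servers / servers_per_admin)

def sublinear_ops(servers: int) -> int:
    # Automated model: effort grows roughly with the log of fleet size.
    return max(1, math.ceil(math.log2(servers)))

for fleet in (100, 1_000, 10_000, 100_000):
    print(f"{fleet:>7} servers -> linear: {linear_ops(fleet):>5}, sublinear: {sublinear_ops(fleet):>3}")
```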
Where that gets problematic,
and I think where we've tried to help people understand this,
is often there are global benefits to that reliability
in terms of your company.
So things like reputational damage.
If something goes wrong and your website is down,
there's reputational damage.
That's not necessarily a shock to anyone. But the disadvantages, the negatives of that accrue
to your whole company. The benefits may be hard to understand in an ROI calculation.
And especially if you're running a cost center where you're saying like, well, hey,
reliability costs us this much.
Trying to make that local global trade-off in terms of your budgeting and understanding of how this works, that's where most of the problems occur.
Like, if you can keep the global view of SRE and reliability matched up with the spend, you'll find this much, much easier.
And we know this from customers who sort of keep that in mind.
That may require an executive sponsor
to keep those things balanced.
Someone who is concerned about that global impact and cost.
The closer you try and keep that local optimization of cost,
the harder it gets for reliability as well as SRE.
So don't shy away from that cost angle,
but just be super cognizant
that you might not get those cost saves
in like a very narrow local space.
They have to be done as an ROI calculation
for your whole business.
And that's what I think a lot of companies have done.
So we know that you can do cost savings.
Keep those factors in mind.
That makes sense.
It makes a lot of sense.
And I can also just point people again
to the chapters in the book
and also in your presentation,
as you mentioned.
I know I'm pointing on my side here.
It doesn't make sense for the people
that are listening to this.
They don't know you're pointing.
Exactly.
People visualize
I point to my
other screen.
I can feel the
enthusiasm though
which is the most
important part.
Yeah, exactly.
Cost optimization
is global,
not local.
SRE is a
strategic investment
in long-term
operational efficiencies.
There's a lot of
interesting pieces
in the slides.
The bit we find
interesting as well is often I have this conversation with people.
I'm like, do you have executive sponsorship?
And they're like, yes, of course.
Reliability is a huge concern for us.
I'm like, fantastic.
Who's your chief reliability officer?
Who's the executive who looks after it?
And they're like, we don't need one.
There's no need for one because it's so important.
And I'm like, you have a CISO, right?
Because security is important, you have a CISO.
And they're like, well, of course we have a CISO.
Hang on.
If you have a CISO, why don't you have a chief reliability officer?
That's often an awkward question.
You know, the person who could wrap up all of your reliability costs and calculations and ROI, just like your CISO does in your security space.
That's often a person who exists, but often they don't have the recognition or the title.
Very few people seem to be called chief reliability officers. We haven't really seen that in the wild,
but we've seen plenty of people who are an executive in charge of reliability. And we're
like, fantastic. This is the person to talk about your costs and benefits with at that global level. We've had that person in Google in terms of
for a long time. And we've never really changed that model. So we have no other way of measuring
this. So that's just our opinion, really.
But I guess the companies where we've seen it be successful
do tend to have executive sponsorship.
And when that executive sponsorship kind of wanes,
that's also when the reliability
and some of these cost calculations just get deprioritized.
So that's a good indicator that you can make changes in this space
with prioritization at senior levels of your company.
It does, however, indicate there might be a bit of a ceiling on your grassroots SRE efforts, like a grass ceiling, for want of a better word, where you will need some sponsorship to do things.
And I just think that's the same as security, right?
Like if you don't have that in security, you always struggle to get good security outcomes.
So I don't see where reliability is any different.
That was the same pattern we saw for DevOps too.
The people who were successfully transitioning had executive buy-in
and those who didn't had a long uphill battle. Some fell apart, some would do a grassroots effort and then catch somebody's attention. But if you start with that executive buy-in... that was our approach. Our CTO said, you know, get a release out in, what was it, 10 minutes or 30 minutes?
Yeah, we had a fast lane. We needed to be able to react to any problem within an hour and get the fix out. And prior to that, it took us the classical two-week sprint.
And then that was the idea.
That was years ago.
But it's that enablement.
It's finding the budget.
It's keeping track and helping justify the budget.
Because that's one thing everyone's learning now,
as Andy brought up,
is that we got the budget for everything.
Now it's accountability time.
So if you have someone who knows
how to pull in that accountability,
which, you know,
oh, great, I want to be the CRO.
I wouldn't know how to look at budgets and put all that stuff together. So you need someone
who knows that stuff to do it.
So, yeah.
Great point. Fantastic point there.
These things are also, they seem
like they're not technical concerns,
but I think we do
encourage people to not look at their jobs through this narrow lens of like,
well, this is my technical element.
We're all here for business benefit.
We're all here because we want our companies to be successful.
We want these good outcomes for the overall business.
And that's, I guess, in modern DevOps terms, right?
I don't want to version out DevOps,
but where DevOps started and I guess where it is today,
we talk a lot more about business outcomes
and we shouldn't shy away from that.
We should embrace that.
It doesn't mean your executives should be telling you
exactly what to do in terms of deployments,
but they should be giving you those goals
which are well aligned with these outcomes.
And we shouldn't shy away from that.
Well, yeah, I mean, even just simply,
what are you building reliability for?
Your customers.
That's the whole point, right?
So that's, yes, I have this technical job,
but at the end of the day,
it's all to serve whoever my end customer might be.
And if I'm not keeping that in mind...
We're always doing it for a reason.
Yeah, for sure.
And then maybe also to add one more thing,
and again, I'm looking on the slides here.
In terms of the ROI,
you were quoting the State of DevOps report
where it says,
reliability is a force multiplier.
Teams that excel at reliability engineering
are 1.8 times more likely to meet
or exceed organizational goals.
I guess that's a nice slide to put in front of your...
You know, there's a reason.
I'll swiftly plug the State of DevOps report. Like, I know it's produced... you know, Google owns DORA, so there's a conflict of interest there. But, you know, I also think you can look at other DevOps reports, right?
Like, they're trying to tell you similar things.
We do think that based on the data that we have,
there is a strong correlation between software performance
and business outcomes.
That's why we try and do these things.
That's why we want companies to do these things.
We want those better business outcomes.
What we found when we're asking people about reliability in this space
is that if you have a strong correlation between both,
and you can look at the report itself for more details,
but if you're just doing really strong software development,
that on its own isn't always a good indicator
because sometimes you might be a bit of a feature factory.
You might be pushing out releases and they're causing problems
and they're not really working well from an operations perspective.
The flip side is true.
If you have a reliable system,
but it can't ever get updated,
that's not helping your business
necessarily move forward, right?
So the combination of the two
seems to be the secret sauce.
And I guess this isn't necessarily new news
for some of the DevOps teams,
but I think it's worth repeating for many teams.
Try and get that balance
where you are getting the right amount of releases
to the right amount of reliability concerns.
And if you do that,
companies that do that seem to have much higher performance.
So there's a strong correlation in those spaces.
I need to write this down.
It's awesome.
So, you know, people often ask,
like, where's the money?
And I'm like, well, here it is.
You want higher business outcomes.
This is it.
This is why we try and do this.
And here's the piece.
I think what gets difficult
in complex adaptive systems
is people often want to isolate
and change one part of it, and that's not how, you know, complex systems work. Complex systems require you to look at the current state, the path dependence, like where a company has come from, where it's going towards, and then try and influence and nudge that in different directions. So it's not as simple as just saying, okay, we need twice as many releases. That's
one change in the wider ecosystem,
but it won't necessarily
have the same outcome each time because, again,
it depends on where you've come from.
So trying to say, hey, we do want
faster change, but we want it
at the right pace of the reliability
levels, that's, I think, an
important thing when it comes to changing the system.
It's not about just sort of doing one or the other.
Yeah, this is also one of the things that I mentioned, but the speed of delivery could
also be, if you just measure yourself based on how fast you can deliver, you may even
have a negative impact on the people that are on the receiving end.
Your users might be also overwhelmed with changing again.
I remember, Brian, in the early days
when we were kind of speeding up from,
remember, the two
releases per year to
then two major releases every
other week, we had a hard time actually keeping up with educating our internal folks on how to sell what's new, because our whole
product development
is just one piece of the whole end-to-end chain.
Then you have support, then you have sales engineering,
you have your customers, so you obviously,
it all needs to work in the right balance.
And I think that's the nice way.
We need to balance everything out.
But you had buttons moving.
Where do I go to turn that off?
Oh, they moved that.
But the video shows it here. But the video is two weeks old.
it was a good growing pain to have though
because it was
because we adjusted from it
James, I have one final question for you, because this is something that comes up a lot in discussions as well.
The topic of SLO,
service level objectives.
And I also,
I borrowed one of your quotes into one of my texts
that I'm about to present
in the meeting
after this recording.
He cried.
I credit you.
Yeah, I have a screenshot
of your book.
But I say the power of the SLO
is driving transformations.
And in the book you say,
focus on the user and all else will follow.
And I like this a lot because
when I started talking about SLOs in the very beginning,
I think I had a wrong approach of SLOs
because I thought I need to put SLOs
on every single service.
I need to define at least three to five
or maybe 10 SLOs.
But in the end, this doesn't really make sense.
You need to define the SLOs
that are aligned with your business objectives, right?
Like how many users do you want to have in the system?
What should be the availability,
your critical end user journeys
or whatever you want to call them.
And then I think the rest
will follow because then you will do
everything to keep those top-level
SLOs in check
which means you are ending up
with business success and then you'll figure
out what it means, what you need
to do in order to support these
top-level SLOs.
My question to you now
is, I see you're nodding.
People don't see that you're nodding, but you're kind of nodding.
When people ask
you, you get approached
by a lot of organizations.
What do you
tell them in terms of where do you start
with SLOs?
Again, it is a very common, I think, question in this space.
I don't think there's a nice prescriptive answer.
I would say the most common theme I would normally do is start where you are.
Like often people were like, well, you know, tell me what I need to do.
And I'm like, start where you are.
Like wherever you are right now, what do you have? What's easy to measure? What's the SLIs that you might have?
What are the things in place that you have today? Start there. Instead of thinking what would be
the perfect outcome for my service in this hypothetical space, you probably won't know
the right SLOs until you've explored this space.
Like it's inherently a complex system.
So trying to sort of design them up front, trying to say like, hey, if I just got enough
SLOs, like this would somehow be okay.
That often doesn't really help people in this space.
And so thinking of all of these things, SLOs, SLIs, error budgets, as just tools in your toolbox for this space, I think is much more important than the individual metrics.
Like the things that you're trying to sort of say are critical, like will change over time.
They'll change depending on your business. And using them, like I say, like different tools in your toolbox, and starting with, okay, what do we have now and how do I then increment it, is almost always, I think, where we start. If you think of these tools as mechanisms for communication across silos, that's also, I think, very important. So an error budget is a way of us
saying, hey, I have an understanding of how much reliability I need in my service,
and I've agreed that with other parts of my business. The exact number in the error budget
isn't the most important part of it. It's the fact that you've made that agreement,
and there's a shared understanding of what you want. So an SLO is a shared understanding between yourself, other parts of your business, and your customers of what they can normally expect, what they kind of want.
The fact that you have that shared understanding is more important than the exact thing itself.
Please don't go on Twitter, folks, and be like, aha, James said SLOs aren't important.
That's not what I think we're trying to say. What we're saying is the right SLOs for you
will be custom, and you will often not be able to prepare them well in advance.
But once you start building them and using that feedback, that negotiation of these proxy metrics that you can use will help you
turn some very speculative ideas into data.
They will start helping you use these tools
to make better data-driven decisions
and that will then help you make
better choices for your business.
It doesn't mean you just have to do these things
because we said so.
That's not what we're trying to say.
They're just tools in the toolbox, basically.
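As a small worked example of the error-budget arithmetic behind the communication tool James describes, here is a sketch assuming a 99.9% availability SLO over a 28-day window; the numbers are illustrative, not a recommendation.

```python
# Illustrative error-budget arithmetic for an availability SLO.

def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of allowed unavailability for the SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes

def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 28) -> float:
    """Fraction of the error budget still unspent (negative means blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget

slo = 0.999  # assumed three-nines availability target
print(f"{error_budget_minutes(slo):.1f} minutes of budget")   # ~40.3
print(f"{budget_remaining(slo, 10):.0%} of budget remaining")  # ~75%
```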
It's a complex topic, but does that kind of make sense?
It does make sense.
Thank you so much.
And as Brian and I always say,
the great thing about doing this podcast
is because we always learn a lot and
I mean hopefully our listeners learn a lot
as well but I always
want to say thank you for
spending the time with us and giving us your insights
because that's the great thing
about the communities that we live in because
we are all open, we share even
though we work for certain
organizations and we may want to keep certain things
as a secret but we don't do this.
We just share openly because in the end,
we all get better.
It's really the most important part
is that continuous learning.
Exactly.
Awesome.
I think we consumed a lot of your time.
Thank you so much.
I really hope I get a chance again
to meet you at one of the upcoming talks.
Do you have any talks coming
up in the next months where people may be able to see you live? I don't, but if people want to
see Steve McGhee, my collaborator in this space, he's presenting this Thursday at All Day DevOps.
And we'll be talking about the reliability mapping that we've been trying to do.
So come to All Day DevOps, it's a virtual conference,
and learn a bit about reliability mapping.
That's good.
Awesome.
Yeah, I don't think reliability is going away,
so we'll be talking about this.
We're just going to stop caring about performance
altogether.
When we're robots, I still need it.
Anyway, yeah, and I got nothing else to add.
I really appreciate your time, James.
Very educational and informative.
So thanks for spending
the time with us
and thanks for spending
the time with our listeners
and thanks to our listeners
for being there.
Thanks, folks.
Much appreciated.
Yeah, we hope everyone
has a fantastic time
and I don't know,
doing whatever.
I'm a mess today, I guess.
I can't even think straight.
It's still only, what?
It's just turning 10 for me.
Anyhow, thank you, everybody.
We'll see you next time.
And yeah, bye.
Bye-bye.
Thank you.
See you.
Thank you.
Bye.