PurePerformance - Why SREs are not your new Sys Admins with Hilliary Lipsig
Episode Date: May 16, 2022
"The most significant body of my SRE work is architectural reviews, disaster and failover planning, and help with SLIs and SLOs of applications that would like to become SRE supported."
This statement comes from Hilliary Lipsig, Principal SRE at Red Hat, as her introduction to what the role of an SRE should be. Hilliary and her teams help organizations get their applications cloud-native ready, so that the operational aspect of keeping a system up and running, and within error budgets, can be handled by an SRE team. Listen in to this episode and learn about the key advice she has for every organization that wants to build and operate resilient systems, and understand why every suggestion she makes has to be, and always will be, evidence-based!
In the talk we mentioned a couple of tools and practices. Here are the links:
Hilliary on LinkedIn: https://www.linkedin.com/in/hilliary-lipsig-a5935245/
KubeLinter: https://docs.kubelinter.io
Listen to the talk "Helm and Back Again: an SRE Guide to Choosing" from DevConf.cz: https://www.youtube.com/watch?v=HQuK6txYS3g
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have my fantastic co-host Andy Grabner with me.
Hi Andy and happy Performance Awareness Day.
Is it Performance Awareness Day?
Yes, at least the way we write the date in the United States.
What's the date if you just do the month and day?
Well, it's 5-5.
Oh, crap. I'm wrong.
It's just 5-0-3.
It was two days ago.
I was so excited. I had this all set
for this. We'll keep this in anyway because it
shows how stupid I am. 5-0-3
is performance awareness day.
So, May 3rd. We're making that a new holiday.
Yeah, well, we could always...
What's 5-0-5? Isn't that an error? I'm sure it's some HTTP status code that means something.
Maybe it's the Cinco de Mayo error code: if you drink too much, not sangria, maybe a Corona, today, then you feel sick tomorrow. But maybe not from corona, but from Corona.
Yeah. And very quickly on Cinco de Mayo, you know, people love putting up decorations for holidays, and it's getting really crazy now. I saw my first Cinco de Mayo lawn decorations and I was like, okay, that's getting a little out of hand now, especially since it's got nothing to do with the United States, really. Anyhow, just a big drinking holiday. Speaking of drinking holidays, I'm terrible
at segues, Andy. I think so too. Let me take this over. Speaking about resiliency, because I think
if you drink a lot, you want to be resilient against the alcohol. But resiliency not only
applies to drinking and then sobering up, but it also applies to software. And we've been talking
a lot about software engineering and site reliability engineering lately.
And we have an amazing guest today who actually tells us a little bit more about how to not
just throw like new software over the wall.
And then the SREs are taking over to just handle everything and keeping it running and
monitoring it.
But what we can actually do as organizations, or at least, as I think she will tell us,
what good SRE teams should do to make sure
that the apps that they are getting to run
are actually adhering to certain standards.
And I think I'm not doing a good enough job to explain this.
That's why I want to welcome Hilliary on the stage.
Hilliary, how are you?
Hi, I'm well. You two are a delight of terrible dad jokes. I just love every second of that.
I had to go on mute so that people wouldn't hear me cackling like a witch while you guys were doing that.
So just, you know, fantastic intro and thank you so much for inviting me on here today. By way of introduction,
my name is Hilliary Lipsig. I am a Principal Site Reliability Engineer at Red Hat and a
global SRE team lead. So at Red Hat, we have several different SRE teams, all of them around
managed OpenShift, so OpenShift dedicated and managed services running on OpenShift dedicated.
And that's where I sit.
So I'm on the managed services.
These are software as a service offerings that Red Hat has.
And they are backed by site reliability or SRE support.
So basically, if I understand this correctly,
we talked about Kubernetes a lot in the past, about OpenShift.
I think people understand that this is the de facto platform
of the future where we will run most of our cloud-native workloads.
And you as Red Hat, you provide OpenShift as a service.
So that means I assume there's a lot of critical apps
running on these platforms.
And you want to make sure that the underlying platform itself
really runs stable because otherwise your customers would not be happy with you. Yeah, exactly. So it's all about
how OpenShift runs and performs and how apps run on OpenShift and interact with OpenShift.
And a lot of that, that's where the architectural reviews come in. So kind of like you said,
so for the services, site reliability engineers,
or SREs, which is my group, we actually have, we call them the tenants, and they are software development engineering teams that would like to offer their services as managed services,
meaning full SRE support plus full CEE, you know, customer experience engineer support.
And so we onboard these tenants, and part of
that onboarding process is
a comprehensive architectural review,
where we basically provide them with our standards.
This is what makes an app observable.
This is what makes an app supportable.
This is how we know that we can support you with confidence.
Then we help them and coach them to, you know, reach these standards.
So, you know, some of the things, when I'm talking to a team, and a lot of these,
there's like—
That was awesome. For people listening: we heard the dogs, but now we saw a cat running through
the picture and jumping around. I think we will probably continue once she's back, because she just rushed out.
Yeah, sometimes it'd be good to do these on video, but oh well.
So actually the dogs are outside.
My office is against my back wall of my house.
There's a back door.
The dogs are out there.
And what happened is they are roughhousing.
And that is just dog playing noises.
I have a Shiba Inu, two Siberian Huskies, and a German Shepherd.
So it sounds like they are fighting to the death.
That's actually just what they sound like when they're playing.
Those were all happy noises.
But they slammed into the door.
So the cat went from the floor to the bookshelf right quick,
just in case.
That was amazing.
I don't care if we leave that in. That was, that was hilarious.
No, we should, we should leave it in. That's awesome. Come on.
Oh gosh. But what was I saying? Right. Architectural reviews and standards.
My dogs don't meet any of them. They're a hot mess. They're not sustainable at all. But right. So there's several things that
we ask our tenants about in addition to some of the lower level into the weeds pieces. There are
some higher level places that we start. And so some of the things that we initially look at is,
of course, well, does your offering even run on OpenShift?
The usual answer is yes,
we've done that much. Great.
We usually actually also have a requirement that their offering,
their managed service offering is a level four or above operator.
For folks who are not familiar with an operator, it's a continuously running piece of software; it'll usually be a pod in OpenShift. What it is doing is managing its operands, so it extends the statefulness that Kubernetes provides you all the way through to the application layer: making sure that the application is highly available and always has the requisite number of pods to meet that, making sure that all the secrets exist and
they've rotated on time. Anything that an operator watches is called an operand. And there are levels
to the operators, and a level four means that it must be highly observable. So it must do several
day-one and day-two actions. So it'll deploy and upgrade your code,
and then it's also managing the code, or the application rather, to a degree. An operator
impersonates a person. And if you've heard me talk about operators before, I'll tell you that
a good operator requires a good story. So you need to know what the person who would be operating
your software by hand would
be doing. And that is the standard that we as SREs coach our tenants through. So we look at,
you know, what is your upgrade story? How are you going to accomplish them? Have you thought through
keeping your application highly available during those? You know, is it doing rolling restarts?
Is it upgrading things one at a time, versus just destroying everything and recreating it,
which would cause outages? In general, is your application highly available, meaning it has at
least three replicas of everything that's running?
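For readers who want a mental model of "an operator impersonates a person," here is a minimal sketch in Python of the reconcile idea: continuously driving actual state toward desired state. Real OpenShift operators are written in Go against the Kubernetes API; the fields and helper functions below are hypothetical stand-ins, purely illustrative.

```python
# Hypothetical desired state, the kind of thing a custom resource's spec declares.
DESIRED = {"replicas": 3, "version": "1.4.2", "secret_max_age_days": 30}


def observe_actual_state():
    """Stand-in for reading live cluster state from the Kubernetes API."""
    return {"replicas": 2, "version": "1.4.1", "secret_age_days": 45}


def reconcile(desired, actual):
    """One pass of 'what would a human operating this by hand do right now?'"""
    actions = []
    if actual["replicas"] < desired["replicas"]:
        actions.append(f"scale up to {desired['replicas']} replicas to stay highly available")
    if actual["version"] != desired["version"]:
        actions.append(f"roll out upgrade to {desired['version']} one pod at a time")
    if actual["secret_age_days"] > desired["secret_max_age_days"]:
        actions.append("rotate the expiring secret")
    return actions


if __name__ == "__main__":
    # A real operator watches its operands and loops forever; a single pass is shown here.
    for action in reconcile(DESIRED, observe_actual_state()):
        print("operator action:", action)
```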
And then we also start looking more in terms of continuity. So if you're running an application and you're pulling in open source upstream repositories
and tools, what's your story around that?
Do you have people who are maintainers
of that open source upstream tool?
How are you going to guarantee bug fixes
for the customer experience?
How are you going to guarantee the response
to critical vulnerability alerts?
These are the types of things that we're walking tenants through.
How are you going to disaster plan?
We're actually kicking off a new effort that is just called the Disaster Planning Office Hours.
And it's all RCA-based, so root cause analysis.
We are presenting on a few really catastrophic incidents where data was lost to help get our tenants thinking through these scenarios.
And then is it observable?
We do a really comprehensive SLO, SLI, and SLA service definition review of these applications to make sure that they're thinking through, you know, what do we want to measure to know we're doing well?
How are we going to measure to know we're doing well?
And then lastly, what do we actually want SRE to do? And that's usually the hardest
question for a tenant team to answer. They're just like, well, I just want SRE to keep it alive.
But there's more to it. What does that mean to you? In a situation where we have a catastrophic outage and we can restore service at the expense of data, which do you want me to choose?
Which is more important to you, service uptime or data continuity?
These are decisions tenants have had to make.
And we've had to coach them through these decisions and, you know, sometimes learn the hard way.
And so everything that we learn the hard way gets scaled back up and becomes a permanent standard.
You must tell us, you know, this by this time. And we run them through phases.
So you have to be this tall to ride in order to go into staging.
And then you must be this tall to ride in order to start offering in production.
You must be this tall to ride in order for us to assume the pager from you.
So and we're with them every step of the way, coaching them,
helping them, advising them,
and helping them to do things like write
their SOPs or standard operating procedures,
helping them to think through their testing and
making recommendations on how to improve their upstream.
I would say my only open-source contributions
have been recommendations for how to fix the upstream.
My name's not on that,
but that is my open source contribution.
And that is typically the types of things
reliability engineers are doing
is we are holistically looking at your offering
and helping you make it better from head to toe.
Wow.
To recap, I need to recap a little bit
and just ask that I understand this completely.
So if I'm company X, I have a certain service offering.
In order for me to run my software on your managed OpenShift, you are expecting a certain level of quality.
I think you call it operator level four, at least, that defines things like, is my system observable?
Is it highly available?
Do you have to have certain architectural settings in place?
And you're basically coaching them to get to that level
so that you then actually would take ownership of it,
that you can then actually run it for them.
Just to recap quickly, because I think this is fascinating.
Because you are, not only do you provide a managed service
offering where they can run their software on your systems,
but you're also in that process helping them to make
their systems actually resilient,
to make them observable.
And now let me ask another question.
When you talk through them, or when you talk with them about, do you have runbooks or do
you have like, if then, what do you do in case of a certain problem?
Do you then also try to automate the remediation of these situations?
Like if they have something written and this is how to do it manually right now, do you
try to automate the remediation steps that are described kind of in
that runbook in an automated way in the future? Or do you still see just refining the steps in
a written way and then you as an SRE team would then take over and say, okay, we have everything
written and we are very happy with this. What do you also expect that most of the remediation actions need to be automated?
It's a mix. Most of that is actually the automation gets put
into a single source place.
A lot of the stuff that I am
supporting actually will run on a Red Hat customer's OpenShift cluster.
So not even Red Hat-owned infrastructure, right?
That's the customer cluster.
We don't run anything there.
We don't have to because they are paying for those resources.
And we have to respect that and keep our footprint as small as possible.
So there are other places where we store our automated remediation. Then when we have to access
the customer's infrastructure to resolve an issue,
then that's when we trigger our automated remediation.
Of course, that access has several layers of
security and so forth that we have to go through,
we have to prove our identity.
Everything we do is logged and auditable.
It has very strict RBAC, so my role-based access is very, very strict. It's all, you know, highly secure, very tight, because of course protecting people's information is extremely critical to Red Hat.
So there's a degree of automation which we expect. Some of our managed
services offerings are run on Red Hat-owned infrastructure, actually. And we can do more
there since we're footing the bill for those resources. And so there is a lot of automation,
and there are some things that must be done by hand. And anything that requires access to a
certain degree of information must be done by hand because I also then have to go
and justify why I did that for security and compliance reporting.
So it's a mix, and we get our tenant engineers
to help us write the automated remediation,
which might be a bash script, it might be a Python script.
I mean, I think with anything, it's pretty much scripting.
So it's pretty easy for anybody to contribute to.
But there are some things
that you don't really think about, right?
When you're writing a script to resolve an alert
and you're saying, okay, well, you know,
have the cluster just give you all this information
and then pick the correct component, right?
This really happened.
That sounds great until the cluster spits out 5,000 results.
So this script needs to be a little bit smarter. Some of that you learn from experience, from running the scripts, from testing them, from actually having production-level workloads where
there could be 5,000 results instead of the six you did in testing. And those are the types
of things that we take back to the tenants. Those are the types of things that we peer review in
their SOPs, and we peer review their automation of those SOPs.
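To make the "5,000 results" lesson concrete, here is a hedged Python sketch of the kind of guard rail peer review tends to add to automated remediation. The `fetch_candidates` helper is a hypothetical stand-in for whatever cluster query an SOP uses; the point is filtering to exactly what you mean to touch and refusing to act on an implausibly large result set.

```python
def fetch_candidates():
    """Stand-in for an SOP's cluster query; production might return 5,000 items
    where testing only ever saw six."""
    return [{"name": f"component-{i}", "namespace": "tenant-a", "failing": i % 997 == 0}
            for i in range(5000)]


def pick_remediation_targets(items, namespace, max_targets=20):
    # Filter down to only the namespace and condition this SOP actually covers.
    targets = [i for i in items if i["namespace"] == namespace and i["failing"]]
    # Refuse to act on an implausibly large blast radius; escalate to a human instead.
    if len(targets) > max_targets:
        raise RuntimeError(
            f"{len(targets)} targets matched; expected <= {max_targets}. "
            "Aborting automated remediation, escalate to on-call."
        )
    return targets


if __name__ == "__main__":
    for target in pick_remediation_targets(fetch_candidates(), "tenant-a"):
        print("would remediate:", target["name"])
```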
Like I said, we're with them from the very beginning all the way to
the very end because at least especially my team,
we review every Prometheus query to make sure it makes
sense and to make sure we understand the time windows the error budget is covering.
Is your error budget a five-minute window?
Is it a rolling five minutes, or is it a five-minute period and then the next five minutes is a new five-minute period? Those are the types of things that we as SREs need to know, because it allows us
to make informed decisions on whether your SOP allows us to resolve your alert within the time frame
to stay within the SLO, so that we do not, you know, blow the error budget by being too slow.
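As a toy illustration of why the rolling-versus-fixed question matters (the data and budget below are made up), here is a small Python comparison of how the same stream of bad minutes looks under back-to-back five-minute buckets versus a rolling five-minute window.

```python
from collections import deque

# 1 = a "bad" minute that burns error budget, 0 = a good minute (made-up data).
minutes = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
BUDGET = 2  # allow at most 2 bad minutes per 5-minute window


def fixed_windows(samples, size=5):
    """Back-to-back buckets: minute 5 starts with a brand-new budget."""
    return [sum(samples[i:i + size]) for i in range(0, len(samples), size)]


def rolling_windows(samples, size=5):
    """Sliding window: every minute is judged against the previous 5 minutes."""
    window, burns = deque(maxlen=size), []
    for s in samples:
        window.append(s)
        burns.append(sum(window))
    return burns


fixed = fixed_windows(minutes)      # [2, 3] -> only the second bucket breaches
rolling = rolling_windows(minutes)  # under a rolling view the last two minutes breach
print("fixed buckets :", fixed)
print("rolling burns :", rolling)
print("rolling breaches at minutes:", [i for i, b in enumerate(rolling) if b > BUDGET])
```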
So we do all of that.
We do the SOPs.
We test them by hand.
There's a fun effort my team does.
It's a chaos engineering game
where we set up standalone copies of the infrastructure.
And we have three teams, red, blue, and green.
I call it combat, Kubernetes chaos combat.
So the three teams play against each other.
So the red team is the saboteurs.
So they've got in, they've broken things,
and then they will complain at you like a customer,
hey, this isn't working, I don't know why.
The green team has to kind of be like the equivalent to the CEE
or the customer experience engineering team,
where they talk about, okay, here's the things that I can see from my perspective,
here's the things the customer is saying.
And then, of course, the blue team is the actual SRE team,
so we have to go in and find and fix what was broken.
And so we've run these games a couple of times.
We're running them again.
Actually, I'm putting together the next round of games
for one of our managed services offerings,
and we've created it as a repeatable pattern. And, you know, it's cross-functional.
The Red Hat teams are joining in and participating
and, you know, generating, you know,
just being an observer
and generating production data for us
if that's, you know, something that's feasible
with the application.
So that we can all be better trained
on how to handle incidents, how to communicate,
how to do escalations if it's needed,
how to grab engineering if we have to
because it's a bug in the code and we can't fix that, which is actually an interesting point that I'll
come back to in a second.
And so we do these things as part of the training and part of the onboarding.
And we say, we know, we know your system.
We are ready for your pager.
We are trained.
You've met our requirements.
Let's go.
It's go time.
And then the service is generally available.
It sounds like a murder mystery party
It's awesome. It sounds like a fun time.
You know, I thought it was really fun. To be honest with you, some of the team found it really stressful, because it was like, oh gosh, I have to do my job in front of 30 people I've never seen before. And, you know, as reliability engineers, we're not usually people persons like that. So the feedback I got was: smaller groups, please. So more games, smaller groups. That's the new plan going forward, to allow people to feel a little bit more safe about failing or not doing something right.
So no live stream on Twitch for those then, huh?
I could convince a few of the people to do a live stream on Twitch, I absolutely could. But especially the very first time we did it,
where the game format was new, we've never done it before.
That was, you know, maybe 30 people on the call was a little much.
It was a little overwhelming.
And I took that feedback to heart.
So we're doing smaller games at this point,
which is actually easier to coordinate, to be honest with you.
If I just get six people going in a room,
that's way better than trying to get 30 people across three geographic regions.
I got a question for you.
If you look back at the last couple of projects
you ran where you helped your customers to get to that level,
in hindsight, do those have something in common?
Like, are these, let's say, certain mistakes, or not mistakes, but certain situations
that are kind of not good, for all of them?
And kind of, can we tell our audience who would like to think about, you know, coming
to you and having you look at the environment and actually bring it to the certain resiliency
state?
Are there certain things where, you know, hey, for every customer that comes to us,
we always have to tell them this and this and this, because this is always not there? I don't know,
good SLOs, or they have certain architectural principles that they completely neglected?
Are there certain things you see consistently across your customers, like your top three
list of things that we could discuss?
Because that would be interesting, right?
Because it could be interesting: hey, before you are calling Hilliary,
make sure that you've got these three things covered at least.
Yeah.
So the first thing I have to tell everybody is SRE is not your sysadmin team.
We are not sysadmins.
We will not be doing your upgrades for you.
We will not be doing your maintenance for you.
You need to build that into your operator.
You need to build that into your CICD pipelines.
You need to know how that's going to be done
in a reliable and automated way
that leaves your system highly available.
And that is usually the most interesting bar
for teams to have to meet because there is a
little bit of a, oh, SRE will just run it. It's fine. And the answer is no. And I've had to tell
people as gently as possible, no. And so one of the first things I tell people is just so you know,
SRE is not your sysadmins. And I think that's a really important difference that I think people forget.
The next thing is some of the teams have never been cloud native before.
And that is just, you know, the world is changing, so they're changing too.
So there'll be some things that they don't, even if they've learned about doing something
for Kubernetes, OpenShift is opinionated.
So we have to kind of coach them through the differences,
the minutia of, okay, but you can't really capacity plan like that because OpenShift has certain other constraints
that you're going to have to think about.
So you need to adjust.
So there's just a little bit of coaching people
through the differences of OpenShift and vanilla Kubernetes.
At the end of the day, that's not a very big lift. It's just very small things,
check mark items that we can get people through. I think it's a very important thing though,
right? Because I think there's obviously a big debate about opinionated versus giving you all
the freedom. But the benefit of being opinionated is that you're getting put on a certain trajectory that makes certain things easier, if you know the rules. On the other side, people may say, well, but then I have to stick to certain rules. But yeah, certain rules make it easier. So I understand that.
Yeah. And the last thing is: etcd is not a database. I don't care what anybody else tells you. I don't care what the training says. Etcd is not a database.
A lot of our worst problems have been around people using etcd without care to, you know, cleaning up after themselves, deleting old CRs and so forth. My main open source contribution is auto-pruning in Tekton, not because I wrote it, but because I had
to go to that team and say, hey, people are not cleaning up after themselves, you don't have auto-pruning on by default, and it's bringing down infrastructure because
etcd is getting completely full. And so, you know, they were really great. This is an example
of working with, they're actually not a managed service.
So, but working with an open source team,
you know, we have folks on that team in Red Hat.
So I went to talk to them and I was like,
we need this timeline adjusted on this, on this.
And they were so great.
They were so responsive to, you know,
enabling auto-pruning by default
and improving their documentation around
why auto-pruning is important and how to do it.
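The auto-pruning point generalizes to any custom resources that pile up in etcd. Here is a hedged sketch of the idea in Python driving `kubectl`; the resource kind, namespace, and retention period are placeholders, and where a project ships its own pruner, as discussed above for Tekton, that is the better option.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

KIND = "pipelineruns"  # placeholder: any CR kind that accumulates
NAMESPACE = "ci"       # placeholder namespace
KEEP_FOR = timedelta(days=7)


def old_resources():
    """Yield names of resources older than the retention period."""
    out = subprocess.run(
        ["kubectl", "get", KIND, "-n", NAMESPACE, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    cutoff = datetime.now(timezone.utc) - KEEP_FOR
    for item in json.loads(out)["items"]:
        created = datetime.fromisoformat(
            item["metadata"]["creationTimestamp"].replace("Z", "+00:00"))
        if created < cutoff:
            yield item["metadata"]["name"]


if __name__ == "__main__":
    for name in old_resources():
        # Dry run by default: print what would be deleted instead of deleting it.
        print(f"kubectl delete {KIND} -n {NAMESPACE} {name}")
```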
And so, you know, it's just one of those things where a reliability engineer is the type of person who can come to a team and say,
hey, this is what's happening because these things were not considered and we improve them.
And I know you said three things, but the fourth thing is not necessarily understanding some of the ways to set up your workloads and best practices.
So resource requests and limits on CPU and memory.
I know that there is a subset of folks in the Kubernetes world who don't think that CPU limits are necessary as long as you're setting requests.
I disagree with them.
I have RCAs for why.
Some of that is secret sauce, though.
So, you know, that's what it is.
But that's sort of our typical stance on this, is that you need to be setting those resources.
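For anyone unsure what "setting those resources" refers to, here is the shape of the stanza in question, expressed in Python for consistency with the other sketches; the numbers are placeholders, and real values should come from capacity planning and measurement, as discussed above.

```python
import json

# Placeholder sizing: pick real numbers from measurement, not from this sketch.
container_spec = {
    "name": "my-service",
    "image": "registry.example.com/my-service:1.0",
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
        "limits":   {"cpu": "500m", "memory": "512Mi"},  # the ceiling before throttling/OOM
    },
}


def has_limits(container):
    """A tiny sanity check in the spirit of the linting discussed later: no limits, no ship."""
    return bool(container.get("resources", {}).get("limits"))


assert has_limits(container_spec), "container is missing resource limits"
print(json.dumps(container_spec, indent=2))
```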
Yeah.
That reminds me of Christian Heckleman, Brian, if you remember him.
We had him on the show.
He did a talk at KubeCon, one of our colleagues.
And it was labeled things, how not to start with Kubernetes,
all the things that you probably do wrong,
and not setting limits or not the right limits.
It was definitely very, very up there on the list.
Yeah.
I think that one falls into the trap of moving to the cloud, right?
Because a lot of people think, I moved to the cloud, therefore I have unlimited resources
and these are things I don't have to think about anymore.
So I'm just going to push my stuff and it's going to run and it'll be magic.
But there's a responsibility you need to still own and maintain. So at the very least, you're not
just consuming tons of money in resources, but there's also the health of the platform that you
need to consider. But there just seems to be a hand-in-hand forgetfulness that
there are still physical limitations to everything, right? Physics doesn't go away.
Yeah, I definitely, that's exactly what I see.
I also think that, you know, there are some people who...
I've done very low-level, very close-to-the-hardware stuff,
where I've had to be super conscientious of resources of all kinds, literally even to the point of:
if we're using a CPU and the IoT server is getting too hot, it will fry,
and it affects the longevity of that device, right?
So I've had to consider it all the way down to that kind of degree.
I think there are a lot of folks who have spent their entire career in the cloud.
And so it's just a little bit of a lack of exposure
to what's actually happening on the hardware
that leads to some of the rules that we've designed.
And in some cases, when you have a healthy budget and you can have auto-scaling on,
you can kind of get away from that.
One of the things about OpenShift Dedicated
is it really doesn't auto-scale
because we're not going to just automatically make decisions
about people's budget
and what they're going to spend on their infrastructure.
So in our environment,
setting all of those, and properly doing
your capacity planning if you're a managed service offering,
and therefore knowing what to correctly set
is really mission critical for the end customer experience.
It's interesting.
I wonder if with this idea, when we move to cloud stuff
and all that, there's a lot of abstraction.
Even if you take it to the extreme to serverless functions, people have no clue what's running underneath there.
But I'm wondering if on the cloud side, there should be more of an effort to expose resource utilization.
Obviously, you might not have any control, but making the users aware of what's going on underneath everything
so that they're forcing
them to just see it and be hopefully conscientious about it because it's so abstracted and no one
does see it. No one is getting exposed to it. I wonder if there should be, I don't want to quite
say a movement, but should there be some sort of movement for lack of a better term to force people
to be aware of: here's what you're doing, right, your code's running, but here's what it's doing to everything else, right?
Yeah. So there's actually something a little bit like that happening within the Red Hat SRE groups. We are working on some
projects that kind of give customers more insights and information. There's a project that I've
been helping out with for a little over a year now called the Deployment Validation Operator,
which basically goes through and just does
a sanity check of your workloads to
make sure they're meeting Cloud-native best practices.
That uses an open source upstream
by StackRox called KubeLinter.
This is something that we actually evaluate
our tenant workloads against and make them pass
the checks for this or get an
exception before, you know, they're allowed to go to production. And actually, my team
owns a service, a managed service that is internal only. And we've had to go through these, the same
pain points that our tenants go through, we're going through it as a tenant as well. And so the
deployment validation operator is a great tool that we use for that. I can't say that that's, I mean,
I've contributed to the documentation there.
I'm the current technical lead of the project,
although that will be handed off to
somebody else for long-term ownership.
But KubeLinter is a great tool for it.
It's a static analysis tool.
You can put it into your GitOps workflows,
like in a GitHub action,
or you can use the deployment validation operator.
So you can continuously have things just checking.
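As a hedged example of the CLI route (without the operator), this is roughly what a pipeline step wrapping KubeLinter can look like; the manifest directory is a placeholder, and the KubeLinter docs linked in the show notes are the authority on current flags.

```python
import subprocess
import sys

MANIFEST_DIR = "deploy/"  # placeholder: wherever your rendered Kubernetes manifests live

# Run the kube-linter CLI; a non-zero exit code means one or more checks failed.
result = subprocess.run(["kube-linter", "lint", MANIFEST_DIR])

if result.returncode != 0:
    print("kube-linter found issues; fix them or request an exception before shipping.")
    sys.exit(result.returncode)

print("kube-linter checks passed")
```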
Can you spell that KubeLinter again? How do you spell that? Do you know?
Yeah. So, kube: K-U-B-E, dash, L-I-N-T-E-R.
Okay. Got it. Yeah.
Because it reminds me of what we are doing.
And I know, Hilliary, the way we actually met initially was through a podcast where we talked about Keptn.
And as part of our open source project Keptn, we are also automatically validating the health of a system by looking at a list of metrics.
We call them SLIs, right?
And then compare them against the objectives. And I was just wondering if we should
think about integrating KubeLinter
as one of the data sources, because we're
open to any type of data source.
And I think that could be really great to integrate it
into the orchestration as we're pushing out new builds.
Then not only looking at Prometheus,
but maybe also looking at stuff that
Kuplintr comes back with. That sounds really interesting. I think that it's a really fantastic
tool. That's part of why I got involved in the operator that Red Hat is writing. And, you know,
like I said, we ask our tenants to pass these checks. And so again, from a perspective as a tenant,
actually to one of my sister SRE teams,
my team has had to go through this
and we've had to do the same things.
Well, we were asked to run a tool
that actually is an open source tool,
but Red Hat has no ownership of at all.
It's a really great tool.
It is a fork of Sentry.
It's called GlitchTip.
And so when we're looking at running it, we're like, okay,
but unfortunately we have to make it meet our own standards.
So my team has made some open source contributions there,
just a few changes so that it would run on OpenShift,
some minor tweaks, and then we implemented an observability API.
Because even though this is a tool to help you have observability,
it also must be observable so that the other team that will be the SRE support for it,
if anything does go wrong,
can see what's happening with Prometheus and then run an SOP to fix it.
I'm having to eat my own dog food on this.
We have to meet our own standards.
There's no exceptions just because we're internal and it's an internal only tool.
We still have to be this tall to ride.
Hey, I know you have a lot of kind of secret sauces that you don't want to reveal.
And I'm not going to ask you about it, but I want to ask you one thing.
I was just in the States, traveling through two different states, interstates, now let's get it right. I went to visit customers,
and the big topic we discussed there was SLOs, service level objectives. And I do have a little,
let's say, workshop where I try to educate them: first of all, what are SLIs and SLOs, what are
good ones, what are bad ones. And then we go through a little exercise,
kind of as a group exercise, to define good SLOs.
Oh, your cat is back.
Look at that.
Oh, yeah, she likes to drop down onto my head during meetings
and scare people.
It's really a shame this is not on camera,
because everybody would have just seen this tabby cat just
drop down.
And now she's on my chair.
The good news is I can take a screenshot,
and we'll get it in somehow.
But yeah, coming back to the discussion of SLIs and SLOs,
this was a big topic.
And people are struggling a lot with defining SLIs and SLOs.
Good ones that actually are meaningful.
And I wonder, what is your take on how do you approach an SLO definition?
Like, where do you start?
Do you have kind of some recommendations
where you say it doesn't make sense at all
to define SLOs at a certain level?
Where should you start defining SLOs?
Who do you need to get into a room
to actually define SLOs?
Just, it's a hot topic for me and for us.
And therefore, I would be very interested in your take. And without revealing any secrets, obviously, of what you normally do in your work.
So unfortunately, none of that is secret sauce, right?
And so it's so not secret sauce that we're writing open source trainings about exactly this on our end as well.
So there's a couple of things. So because of, I think unlike a lot of
SREs, I actually very deeply care about what the SLA says, because I know that we're going to have
end customers using the workloads that I support. So it's multi-level. So what are we promising
customers, right? That's the SLA. So we might say we've got four nines of uptime, right? Great.
How do we make sure we give them four nines of uptime? And when do my alerts happen? So my
objectives are not actually to make sure our SLA is met. My objective is to be better than our SLA,
right? We should be better than our SLA. That's the goal. And then if we're not, that's when we
start firing alerts and getting SRE involved. So SRE should be there well before the customer
is like, knows or is upset or, you know. So that's really a lot about what it is. It's really
around what's the customer going to experience. And, you know, coming from a background of quality engineering, which I know you do as well, I often
consider SRE very much a quality-like function
because we have to help people think about that
from a lot of different angles.
So it's really about being better than your SLA
with your SLOs.
So whatever you want to promise to your customers,
measure for better than that.
And then alert if you're not doing,
if you're not doing,
if you're not performing that well. And that can be around things like we're guaranteeing, you know,
you know, latency, no latency, right? You can't, you can't ever guarantee a hundred percent
because you don't control all of the factors. You don't control global DNS. You don't control
AWS or GCP or, you know, a lot of other pieces that you're running on. So you should never be promising better than what your dependencies are promising.
And then you should never be promising 100%, even if you did control all your dependencies
because things happen.
The power goes out, you know, whatever, like things happen.
So those are some like general guidelines that we tell people about.
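To put rough numbers on "be better than your SLA" (the targets below are illustrative, not anyone's actual commitments), a quick downtime-budget calculation:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def downtime_budget(availability_target, minutes=MINUTES_PER_30_DAYS):
    """Minutes of allowed downtime per 30-day window for a given availability target."""
    return minutes * (1 - availability_target)


sla = 0.999   # what is promised to the customer (illustrative)
slo = 0.9995  # what the team holds itself to internally, deliberately tighter

print(f"SLA {sla:.2%}: {downtime_budget(sla):.1f} min of downtime per 30 days")  # ~43.2
print(f"SLO {slo:.2%}: {downtime_budget(slo):.1f} min of downtime per 30 days")  # ~21.6
# Alerts fire against the tighter SLO, so SRE is engaged well before the SLA is at risk.
```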
And then, of course, when you're looking into measuring it, you're saying, okay, if we want to make sure that we're not experiencing latency more than, I don't know, two requests a second taking longer than 200 milliseconds to come back, right?
Then that's what you start measuring.
And then you do it over a period of time.
And you say, okay, you know, we're seeing two requests a second coming back slower than we want over a period of an hour, right? That
might be like an error budget type. And I didn't actually sit there and do the math on that at all,
so that might be nonsense. But, you know,
those are the types of things that we're looking at: how do we want to measure it?
How do we want to aggregate those measurements? So we're looking at our overall picture, not just
our moment by moment picture. And we typically actually layer up the Prometheus
alerts in that kind of way to say, okay, if we want this much availability and,
if we're in breach of that, we want to alert SRE, you should probably have multiple layers
of measurement of that. So you're measuring it a few different ways,
so that you're capturing what the customer experience would be.
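A hedged sketch of what "layering up" the measurement can look like, independent of Prometheus syntax: the same latency SLI is evaluated over a short and a long window, and SRE is only paged when both agree something is burning. Thresholds and windows here are arbitrary examples.

```python
def bad_ratio(latencies_ms, threshold_ms=200):
    """Fraction of requests slower than the SLO threshold."""
    if not latencies_ms:
        return 0.0
    return sum(1 for l in latencies_ms if l > threshold_ms) / len(latencies_ms)


def should_page(short_window, long_window, short_limit=0.10, long_limit=0.02):
    """Page only when both the 5-minute and the 1-hour views look unhealthy,
    which filters out brief blips while still catching sustained burns."""
    return bad_ratio(short_window) > short_limit and bad_ratio(long_window) > long_limit


# Illustrative samples: a short spike that the hour-long view does not confirm.
last_5m = [150, 250, 300, 180, 220]
last_1h = [150] * 99 + [250]
print("page SRE?", should_page(last_5m, last_1h))  # False: a blip, not a sustained burn
```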
Yeah.
So basically, I mean, this is also
kind of the way I kind of advised our customers
when I was on the road.
The first thing that I came up with or I suggest
is start from where it matters most.
And that's kind of the business perspective
or the end user perspective.
Because it doesn't do me any good
if my backend services are
perfectly fast and available if the end user, for whatever
reason, cannot access your system,
obviously for reasons that you still have under control.
I mean, I understand your argument with you cannot
control global DNS.
But from that point on, from kind of as close as possible
to the consumer of your services, then break it down into leading indicators.
And then, as you said, you have leading indicators that tell you: if you fail here, you will then start failing, getting closer to where it matters most, which could be availability, your end-user experience, and so on and so forth.
So starting from the top and breaking it down, this is kind of what I advise people.
The challenge that I always have is
if you think about these complex systems
we are operating right now, you have applications,
there's hundreds of microservices potentially.
How granular do you go, or how long
do you walk through that exercise of defining SLOs for every single
service? Does it even make sense? Where does it not make sense to define SLOs?
And I think this is, for me, a challenging question as well. Does it make sense if I
have 100 microservices to define SLOs on every microservice? Or do I rather find
interface microservices that are then accessed by a third party or by your consumers.
So I'm just trying to figure out what your best practices are
there, what your approach would be.
Do you define SLOs on every service,
or it doesn't make sense?
So we have SLIs on every service.
And a lot of times, you'll see an SLO to an SLI
having a one-to-one relationship.
Not always.
And that's not, I think,
typically by design. I think it just sort of happens. But there are some things that are just
central pieces. Like if you think about like a spider web and it all comes down into this like
center circle, right? There's going to be that some like central piece that everything connects to. And that is really like when you,
when you can identify components like that,
those are probably the best components to have objectives around because it
doesn't really matter if some of the underlying pieces are not working or are
working great rather. If that piece is not working,
like if you have a pod that controls SSO, and that pod
is down, well, who cares about the rest of it, right? That should be one of your most
heavily measured things. Anything that's like ingress or egress, right,
should have all of your objectives. Some of the other pieces, you'll probably
want indicators around, because that guy is going to build up to your objective.
But in terms of this is an objective,
that's probably really what you're looking at.
Integration points in general, like, okay,
we are going to have to call out to this third-party API.
You can't have an objective around that API.
You can only have an objective around how
fault tolerant you are in cases of that API is down.
So a fault tolerance objective is probably
one that I don't see a lot, and I really wish I saw more.
We are fault tolerant, right?
We're not crashing because something came back.
We surface that something wrong came back
and that we can't do anything about it because of that.
But we don't crash.
We don't fail.
Those are the types of things that people don't think about.
But that's the place where
I would put objectives around.
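A hedged sketch of what a fault-tolerance objective can look like in code. The third-party call and counters are hypothetical; the point is that a dependency failure is surfaced and counted rather than allowed to crash the service, and the SLI becomes "what fraction of dependency failures did we absorb gracefully?"

```python
import random

# Hypothetical counters feeding a fault-tolerance SLI.
counters = {"dependency_failures": 0, "handled_gracefully": 0}


def call_third_party_api():
    """Stand-in for an external dependency we do not control."""
    if random.random() < 0.3:
        raise ConnectionError("third-party API unavailable")
    return {"status": "ok"}


def handle_request():
    try:
        return call_third_party_api()
    except ConnectionError as err:
        # Don't crash: degrade, tell the caller why, and count it.
        counters["dependency_failures"] += 1
        counters["handled_gracefully"] += 1
        return {"status": "degraded", "reason": str(err)}


if __name__ == "__main__":
    for _ in range(100):
        handle_request()
    failures = counters["dependency_failures"]
    tolerated = counters["handled_gracefully"]
    # Objective: tolerate (not crash on) essentially all dependency failures.
    print(f"fault tolerance: {tolerated}/{failures} failures handled gracefully")
```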
That's great.
So I guess you figure out those critical components as part of your architectural review, which
makes, I would assume, sense.
I really like the fault-tolerant objective.
That means how tolerant are you in case systems that you're relying on break?
And again, this is a classical use case where chaos engineering
comes in, right? Because you want to actually figure that out as part of your testing.
Like, I mean, Brian, right, we had podcasts with Ana Medina, I remember her call when she
talked about chaos engineering.
Yeah, a few of them. It's a fantastic topic. Andy and I both almost immediately fell in love
with the concept of chaos engineering. It's a lot more exciting than traditional performance testing.
Even though you can argue that traditional performance testing is chaos engineering,
because it's the first time when the system was actually brought under stressful situation,
because there was more than one user sitting in front of it, hitting it with requests.
Yeah.
Again, I don't want to take credit for this because I know other companies do it too,
but we took it to the next level
with chaos engineering as training, right?
And so I think that that has been one of the highlights
of my career as the tech lead,
was designing and organizing and implementing that game.
One of the things, one of the terms that I like that came out of the work I did with
Ana Medina was the term test-driven operations, because that's basically what chaos engineering
allows you to do. If you use chaos engineering, you can actually validate, does your monitoring work?
Do you get the right alerts?
Do the right people get notified,
even though it's obviously in an experiment?
And then actually, do people react in the right way
if these alerts come in, or the tools,
and the kind of test-driven operations?
It's like we do test-driven development; we can take this concept to operations,
or to SRE: test-driven operations.
Does everything work as expected?
Do my remediation scripts actually
work when they actually get executed?
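In that test-driven-operations spirit, here is a hedged sketch of a chaos check that breaks something deliberately and then asserts the expected alert actually fires. The pod selector, alert name, and Prometheus URL are placeholders; the `/api/v1/query` endpoint and the built-in `ALERTS` metric are standard Prometheus, but verify against your own setup.

```python
import json
import subprocess
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.com"   # placeholder
EXPECTED_ALERT = "MyServicePodsUnavailable"  # placeholder alert name


def inject_failure():
    """Deliberately delete pods of the service under test (placeholder selector)."""
    subprocess.run(["kubectl", "delete", "pod", "-n", "my-service",
                    "-l", "app=my-service", "--wait=false"], check=True)


def alert_is_firing(name):
    """Ask Prometheus whether the named alert is currently firing."""
    query = f'ALERTS{{alertname="{name}",alertstate="firing"}}'
    url = f"{PROM_URL}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return bool(data["data"]["result"])


if __name__ == "__main__":
    inject_failure()
    deadline = time.time() + 300  # give alerting five minutes to catch up
    while time.time() < deadline:
        if alert_is_firing(EXPECTED_ALERT):
            print("PASS: alert fired as expected")
            break
        time.sleep(15)
    else:
        print("FAIL: alert never fired; the monitoring or alert routing needs work")
```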
I'm taking that term
and using it from now on, test-driven operations.
I love that.
But actually, so on that,
I actually want to go back to something I said I'd come back to and I haven't yet,
is my SRE team,
unlike a lot of other SRE teams
when you talk to, you know,
across various companies,
we can't fix production bugs.
And that's because of the ProdSec requirement on where builds originate,
how they originate, and so forth. So we actually don't do that, which is a very different type
of SRE than you find. I found that there was a mix. There are people like us who SRE without
being able to fix the production bugs. So I can tell you what the production bug is,
but my ability to fix it is going to be fairly limited. And if it requires a code change... well, we typically are able to devise ways of working around
production bugs and getting services back up and restored to where they need to be.
But so we actually have to, one of the maturity requirements from our tenants is engineering
escalation path.
How do we page engineering at 3am and say, hey, your code has this bug? And again,
like I said, a lot of that comes down to it's, you know, there's like a lot of project requirements.
This allows us to be language agnostic in what types of workloads we support. So I actively
support Java and Ruby and Python code.
And, you know, I can debug just about anything, as I think most SREs can.
So we can, oh, and Golang, of course,
because the operators are written in Golang.
So, you know, we're debugging
across various different stacks.
You know, I have to dump a Java heap.
I have to, you know, go get some logs
around certain workloads, from certain parts within the pod even,
not just the pod logs.
So we can do all of that, but we don't fix the bugs.
So one of our maturity requirements
is actually being able to page engineering,
raise engineering, raise the BU even
in case that they are gonna need to do some sort of
discussion with the customer or what have you,
or just make them aware of,
hey, there's no way to not blow
our error budget for this customer, sorry.
That's part of our requirements
is having emergency escalation paths.
Then like I said, just general disaster planning.
Like, okay, I have to know I can restore the service,
but it requires sacrificing this data.
What's more important? Mm-hmm.
Yeah.
Have you, for the fixing production bugs,
how often do you see organizations
use feature flagging to actually shield off
or being able to turn off certain codes in case
you're looking at it and you see, hey, this
is actually the vulnerable code.
You cannot change it. But do you advise, for new features especially, to wrap things behind a feature flag, or is this not what you see?
It's not really what I see. There's a degree to which that can exist, but to be honest with you, when you have an operator maintaining state, it basically creates... it's a double-edged sword. And I talk about this in a talk I give called Helm and Back Again. I did this with Christian Hernandez at DevConf.cz, in the Czech Republic.
And so you have no snowflakes when your workloads are operator-based. And that's great, because I have a fleet that is changing size on the daily.
I never know how big my fleet is until it's like time to go look.
So I know exactly what to expect always.
The problem is if something has gone terribly wrong,
I know exactly what to expect always. And that always is very true.
And so you have to be very careful
about how you work around things.
I'm very sensitive to what the operator will
and will not try to maintain.
It's one of the reasons we do such a thorough
software review, actually, is because we will never be able to guarantee a snowflake.
You know, we turn something off, but then, you know,
nodes restart
and we lose those configuration changes.
The operator is going to restore things to the previous state,
which might break stuff.
So it's interesting.
And so feature flags, I see them from time to time,
but typically not.
And that's really the reason why.
That's good to hear.
Hey, Hilliary, I think, first of all,
thank you so much for getting on the show.
It's amazing to hear from you directly
and your day-to-day work and how you help organizations
to really bring the systems into a state
where SRE teams from your organization
are taking it over.
And it seems there's a, I mean, the service
that you deliver just alone by getting them there
is just amazing because you help them to just have better
systems that are more resilient by default.
Is there anything else maybe as a final conclusion
that you want to make sure this is a final thought from you
that people need to understand and need to know?
Gosh, you know, we really covered so much. I don't think so. I think at the end of the day,
you know, the SRE team, we are your allies, right? We are trying to help you have the best system possible. We're never intentionally combative. And one of the things that I would say,
if anybody is sitting in my shoes and they're like, I want to do the same thing Hillary does.
When I'm making recommendations to my tenants, I bring them evidence. All of my recommendations
are evidence-based. I'm like, this is why we recommend this. This is why we say this is a
repeatable pattern. That's actually something I should have covered. We basically have predefined patterns for services architecture, like, we know this works.
And we bring them why we know this works and why we know other things don't work.
And a lot of times we'll bring them RCAs.
And I'll talk to them about some RCAs.
And I'll talk to them about, you know, why.
Like the disaster planning workshop I said, it's all RCA based.
So pretty much everything we do when I'm bringing this to the tenants, I'm bringing them an RCA so they understand the why of it.
Because really, when you bring people evidence,
I find they're usually very willing to make some changes to their code or their architecture,
because they don't want the same consequences that the other team experienced.
Yeah.
And I think this is also a great place to say if people want to follow up,
we will obviously share your social media links. Is there any besides LinkedIn, Twitter, is there any other place or to follow up with you in case people want to know more about what you do and maybe they are now excited and want to do something similar or even join your team?
Really, Twitter and LinkedIn are the best place.
I'm a very private person.
I don't have a lot of public-facing social media.
That is by design.
So those are my preferred new friend ingress points.
And then after, like, I feel like I know somebody, I'm like, okay, you're fine. And then I might, you know, give them other avenues of contact to me.
Then the service mesh will route you to the other, yeah.
Yes, exactly.
The service mesh will route you to the other points of contact, yes.
You'll get behind the firewall.
I just got to say, I think this is amazing, too, because it's about maturity models, you know, and I think one of the challenges for other people trying to
do SRE at their own organization is they probably get told, Hey, we need SRE, go ahead and do it.
Right. And what you're proving out here is that there are a lot of best practices,
there are a lot of requirements, that can really help make it a stellar thing:
instead of just having a title, actually being impactful
and helping the organization reach the height
of what SRE is intended to.
So hopefully anybody listening who's in that SRE realm
took something away from this along those lines.
But yeah, thank you so much.
All right, it's our secret open source contributions,
pushing services to be better all the way upstream.
Awesome.
And remember folks, SREs are not your sysadmins.
I think that's a great line.
I will write all of this up in the summary.
Can you upgrade my Windows?
Yeah.
I need Microsoft Word installed on my...
No, it's a whole different one.
Anyhow, thank you so very much.
And I wish I could be a fly on the wall for the Rogue One V Empire Strikes Back conversation.
But that'll be another day.
Maybe I'll get to hear some of that.
Thank you for being on the show today.
And if anyone has any questions or comments, you can reach us at pureperformance@dynatrace.com or on Twitter at @pure_DT.
Thank you so much for being on, Hilliary.
It was a pleasure.
Thank you so much for having me.