PurePerformance - The SLO Dilemma: Slight Reliability Discussions with Stephen Townshend
Episode Date: August 1, 2022
For some out there, SLOs (Service Level Objectives) are the silver bullet to building and operating reliable software. But nothing is as shiny on the inside as it looks on the outside.
In this episode we invited Stephen Townshend, former Performance Engineer now converted to Site (Slight) Reliability. Stephen (@the_kiwi_sre) has experienced the tough side of establishing SLOs within an organization. It's a constant battle between focusing on reliability and new features and a lack of change in culture.
Listen in and learn about the 9 prerequisites for SLOs that Stephen has identified, such as having a certain level of observability, defining clear business objectives, defining ownership and giving autonomy, or establishing a blameless culture.
Stephen on LinkedIn: https://www.linkedin.com/in/stephentownshend/
Stephen on Twitter: https://twitter.com/the_kiwi_sre
Here are the additional resources we brought up during our talk:
Slight Reliability YouTube: https://www.youtube.com/c/SlightReliability
Slight Reliability Podcast: https://www.buzzsprout.com/1698445
Our LinkedIn discussion: https://www.linkedin.com/posts/scottmooreconsulting_7-steps-to-identify-and-implement-effective-activity-6938919857459462144--RI7
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to Pure Performance.
My name is Brian Wilson and as always I have with me my fantastic co-host Andy Grabner.
Andy, how are you doing today? I need a thesaurus, Andy.
I can't come up with good alternate descriptions for you.
Fantastic, wonderful, lovely.
I'm good with those, right?
Despicable.
If they're accurate, that's all good,
unless you fake it.
But I think you know, I'm wonderful.
I'm actually, I'm looking forward to tomorrow, though,
because tomorrow is my last day of traveling for a while. And it's been a lot of traveling in the last couple of weeks.
And as much as I love going back to travel,
I also love coming back home, because in the last couple of weeks, traveling by train, which was mostly what I've done, wasn't as convenient as it used to be. I think we in Europe have a lot
of challenges with trains these days, and with flights, and with everything. So let's say
it that way.
You're in a blue room today. Where are you?
Stuttgart, Germany. I'm in Stuttgart, Germany, yeah. And it's
actually funny, because Stuttgart is the home of Mercedes-Benz, but I'm sitting here next
to, I think, one of the largest infrastructure projects in Germany, which is renewing the whole
train station, the main train station and the whole area around it. It will be great when
it's done. Until it's done,
it still takes some time. But I
want to do one more thing here, kind of a
segue. In the last couple
of weeks, I really had a lot of
bad delays with flights
and also trains, which means
they did not meet their SLOs.
Because for me, a great SLO,
a service level objective would be that trains
are on time 99.99% of the time.
Especially in Germany.
Especially in Germany, exactly.
But talking about SLOs, we have a guest today
who at least I have known for a couple of years now.
And he is, I think, as big an advocate
for site reliability engineering and
SLOs as I would like to be. But I don't want to introduce him to the audience; I
want to give him the chance to introduce himself. Stephen, how are you? Who are you?
Welcome to the show.
Hi Andy, thank you. Yes, my name is Stephen Townshend. I am
technically a site reliability engineer; that's my title. I'm part of an enablement team within IAG, which is an insurance company in Australasia. I was a performance engineer for 13 years before that. And I talk a lot online and share my learning journey
because it was a bit of a shock going from performance engineering
when I felt like I knew what I was doing to SRE
where I felt like a complete beginner.
And I thought, I can't pretend to know what I don't know,
but I can say, hey, I don't know anything.
Come learn with me.
That's what I've been trying to do.
Well, welcome to the club because both Brian and I, I think we were
seasoned performance engineers in our
career and then we
started the whole journey towards observability
and then we started the podcast
and now we try to be smart
about what we're talking about, which basically we don't
have a whole lot of clue about.
No, I don't want to jinx it.
We bring smart guests on the show
to make us seem smart.
Exactly.
But then we learn from them.
So you are our smart guest that we're going to learn from.
I was going to say, when's the smart guest showing up?
But Stephen, we will obviously make sure that people
know how to get a hold of you to see what type of content you actually produce.
And one of the things that you are doing is you have a YouTube channel.
It's called Slight Reliability.
And I think you're doing, I don't know, regular shows.
At least it feels like, you know,
at least once a week things are coming out.
What motivated you to do this?
It's actually a podcast.
So although there's a YouTube channel,
it's also available on most podcast platforms as well.
And when I was in performance engineering, I actually had one called Performance Time, and I basically rebranded it.
I thought, let's change it to make it SRE-focused.
And like I said, the motivation changed.
With Performance Time, I really wanted to provide more of an educational resource, but also show what I wasn't seeing a lot in the
technology industry, which is empathy and humanity: just talking about the human beings
who do this really complicated work, how they get through it, how they problem-solve,
and their creativity. So that's what the original intent was, and I tried to carry some of that
through. But like I said, the focus now is also on just having this honest
conversation of: I have no idea what's going on, let's try and make sense of it.
So first of all, folks, if you're listening, we will link to all of the content that Stephen is
producing. But the reason why we're now actually on the call (I mean, we should have had you
on the podcast a long, long time ago)
is that there was a LinkedIn posting, and I am just looking at it. It was our friend Scott Moore
posting a blog post from Dynatrace talking about seven steps to identify and implement
effective service level objectives. And then you made a comment: "Possibly
controversial, but I think that if your organization or team does not value reliability
for any number of reasons, then I don't think defining SLOs or SLIs
should be the priority."
And then it was going back and forth between you
and also one of our colleagues, Saif, and then I chimed in.
And I thought this went into a really interesting discussion,
because you basically said: don't put the pressure on people and force SLOs on them without,
I think, good ownership and actually also changing the people, the culture, the processes.
And this reminded me that every time I talk with organizations
and try to educate
them on the value of SLOs, I sense, sometimes silently, that
people like the concept, but then they also say: we don't know how this should work
in our organization, because we have so many other things to do, and we always have the
pressure of delivering more features. How can we deliver more features and yet have
better reliability and better performance? So I would like to give you a little bit
of air time now and ask: what is your perspective? What do you see in your organization and
the organizations that you've talked and worked with in the past? Where are the problems in actually implementing SLOs?
So I'll speak about IAG where I currently work.
It's a reasonably large company for Australasia,
13,500 staff, big history.
It's sort of built out through acquiring other smaller companies
and becoming one giant company.
So imagine all the different technologies, different cultures, different teams and people coming together.
It's got a lot going on.
And so that's kind of the context.
When we first formed this SLO enablement team, we had a team come to us immediately and say, we want to do SLOs.
And we're like, great.
And they said, you're the experts, right?
And we said, yeah.
And then we went away and quickly tried to read and learn as much as possible.
And we just started experimenting with them.
And we created this interactive workshop where we would talk about their customers
and their key services.
And then we identified what they wanted to achieve
and came up with indicators and then actual thresholds
or objectives as well.
And the conversations were great.
But at the end of it, what we came up with just
didn't feel connected to anything real.
It just didn't seem to be enough.
And I understand that part of SLO, it's not just about coming
out with a number.
It's about the way you adjust your culture and mindset around operations and what you actually do.
You alert on different things.
You focus on different things because it's about can the customers use the service, not technical stuff.
But it just wasn't quite working.
And we tried that with the second team.
And again, the same kind of thing happened.
And that's when we had that alternative.
We pivoted and said, look, what are we trying to achieve here?
What is it we're trying to achieve with SLOs?
Because I guess it wasn't clear what the overall business objective was.
And that's something I want to talk about later.
So that's when we said, well, what's the goal?
The goal is really, we think, to help teams understand their customers
better, and how the technology changes they make impact those customers. That was the first
thing. And then in a wider sense, we thought our team goal really is to make the lives of our
customers and our colleagues and team members easier through the lens of reliability.
That's what our team's about. How can we do that? And the SLO definition, in the current context of
the organization, wasn't achieving that fast enough. And that's when I got thinking: you
know what, I think there might be prerequisites, or things that will set you up for success with
SLOs. And since those initial conversations, I've spoken more about it.
I shared six different potential prerequisites with you.
I've got another three here, which I haven't told you about yet.
I'm really starting to think that there are a lot of things that at least some of them
need to be in place, I think.
Or at least if you have these things there, it's going to make your life a lot easier.
So in the very beginning, I need to ask you a quick question.
In the very beginning, you said in your organization, you started forming the SLO enablement team.
Why was this team established?
It was, we had a pretty senior leader in the organization who had come from Groupon, I believe.
And so he had seen SRE and different ways of working come to fruition and really provide a lot of value. And so he wanted an opportunity for IAG to explore that.
Okay.
Because this was, for me, kind of the initial thing. I also had a
similar situation when an organization came to me and said, hey, we're now doing SLOs. And I said,
that's great, but why? What was the motivation behind it? "Oh, because I heard it from Google."
That's not the right answer. So I was just curious. On your side, you said, with the
first and the second team, well, let's focus on
the first team where you tried it: you sat down, you defined the indicators, you defined
the SLOs, the objectives. What were these, and why didn't they work?
Um, let me remember those. I don't know
if I can. So the first team was actually a team who runs a Kubernetes PaaS,
which other teams all over the company host their stuff on.
And it's quite big and quite important to the company.
And so there was this kind of unusual context
because we had to talk to them and say,
who are your customers?
And they were thinking about the end customer.
And I was saying, well, actually,
the customers you have control over are the other teams at IAG.
So that was a mindset shift.
And so one of the SLOs we came up with was around the success of deployments or rebuilds or something,
because it's really about the stability of that platform and making sure that people can update their code when they need to.
And for them, you know, it was a bit of a surprise.
It wasn't about the end customer and latency and those kinds of metrics. So that was one of them.
Yeah, because
essentially that team is like, if I would now build an app, I would probably go to
Amazon or Google or Microsoft and use their services, so they have a certain
SLA with me. And in your case, that platform team is kind of like a SaaS provider
or a cloud provider to your internal development teams.
And therefore, you want to make sure that the platform is up and running.
The deployment is an interesting one, right?
Like how often, how long does it take to deploy maybe an update,
and how stable are the systems that are running on there?
Or I guess, how fast can they deploy? That's a good metric. And why do you think this
didn't work? Because you said it didn't quite catch on.
I think in that case it was a
lack of experience within our little SRE team, which was literally
two engineers at that time, and not understanding the follow-through.
You know what I mean?
We were with them, and my colleague actually did operations with that team
for three months just to learn what they do,
but we didn't adjust that operational process at all. And
that was what was missing, right? The: okay, we have these SLOs now,
let's start focusing on those, and maybe stop alerting on the 400,000 other things
that we're currently panicking about all the time.
Maybe.
I don't know.
Maybe you know more about how to transition into SLOs
if you have a huge backlog of technical alerts and metrics.
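Editor's sketch: the deployment-success SLO described in this exchange could be tracked roughly as follows. This is a minimal illustration, not what the IAG team actually built; the `DeploymentWindow` shape, the sample counts, and the 95% target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DeploymentWindow:
    """Deployment outcomes observed over one SLO window (hypothetical data shape)."""
    succeeded: int
    failed: int


def success_ratio(window: DeploymentWindow) -> float:
    """SLI: fraction of deployments that succeeded in the window."""
    total = window.succeeded + window.failed
    # With no deployments there is nothing to violate, so treat the SLI as met.
    return 1.0 if total == 0 else window.succeeded / total


def error_budget_remaining(window: DeploymentWindow, target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    total = window.succeeded + window.failed
    allowed_failures = (1.0 - target) * total
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no room for any failure.
        return 1.0 if window.failed == 0 else float("-inf")
    return 1.0 - window.failed / allowed_failures


# Assumed sample data and objective, for illustration only.
window = DeploymentWindow(succeeded=970, failed=30)
TARGET = 0.95  # assumed objective: 95% of deployments succeed

print(f"SLI: {success_ratio(window):.3f}")
print(f"Error budget remaining: {error_budget_remaining(window, TARGET):.2f}")
```

Alerting on the remaining budget, rather than on every individual technical metric, is one way to act on the point above about not panicking over hundreds of thousands of other signals.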
I mean, for me, the question would be, again, focusing on this team:
let's assume you focused on the number of deployments or how fast you can deploy.
Is this a number the team came up with themselves?
Or did they also say, hey, we actually confirmed this as a requirement from our internal customers,
and they are actually not happy if we cannot deliver? I
think coming up with an SLO that makes sense for me as a software producer, a service producer,
is great, but if my customers actually don't care about it, then why should I care about it?
So my question would be: (a) have you monitored it, and (b) have you promoted this
number to the other teams that were using the Kubernetes platform, so that they can actually see
how successful that platform is and how things have changed over time?
Because if the customers start caring, and if they think this is actually a good number
or a good SLO, then I think it becomes a different thing. Then you also, as a platform provider, become more responsible.
I think it then actually makes more sense, because you see that you have
an impact, or that you have a negative impact on those internal customers if you don't meet
your goals.
And this is sometimes what I've seen: we come up with SLOs just for the sake of having SLOs, but in the
end we don't care if we don't meet them, because it doesn't have any impact if we don't meet them.
Right? And I think this is the big challenge with just creating SLOs. I mean, it was not
you, maybe it was somebody else, but I had a heated discussion and he said: Andy, don't go
around the world and tell everybody they have to define SLOs for every service, because SLOs are not about
defining SLOs for every service. And I said, you're right, because really what you need to do is
define SLOs where it really matters, and what really matters is the end user. And the end
user might be an end user that uses your services, or it might be an
internal team that tries to deploy their apps on the platform, and then they're expecting a certain
level of service. But just coming up with artificial numbers doesn't help anybody.
Yeah, I fully agree
with you there. I think that's a mistake that we made. It was internally developed within that platform,
the team that ran the PaaS, right?
We should have had representatives
of our internal customers to bounce these ideas off
and say, hey, this is what we're doing.
What matters to you?
Yeah.
And that's actually one of, yeah.
You go?
Yeah.
Yeah.
And maybe one additional thing,
because in the platform team, it could be interesting.
You could say, hey, we are the IAG platform team.
We provide you Kubernetes as a service.
You can go with us and we provide you this type of availability
and this type of speed.
If you don't like this, you can go to AWS, Google, or Microsoft,
but we don't want you to, because we believe we provide a better service for you.
And I think this is also, I think you then need to kind of try to figure out
who is the competition, right?
Because if there's no competition, if the mandate is you can only use
that platform, then there's also less pressure on you.
And I think that, yeah.
There was, there pretty much is no competition
because the reason it was built
is it's an insurance company,
very sensitive about where data goes.
It's a huge process to get a cloud service approved
and working and even, you know,
especially if there's customer information involved,
which there absolutely is with the systems hosted on this.
So there's very little competition for this particular platform.
So, yeah.
Yeah.
Cool.
But then you said you came up with prerequisites, right?
So that means like you learned things that don't work well or tried and it didn't succeed,
but now you have prerequisites.
What are these?
Do you want to share this?
Sure.
I will say just before that,
that there was a third encounter where I've been asked to help develop SLOs.
And that focus was, hey, we've got a gigantic, massive, complex program of work.
We're currently doing all these non-functional requirements around performance and reliability.
And we tick a bunch of boxes before we go live. Can we move to SLOs? And I thought initially, that's great. And then as I started unraveling the complexity of the program,
I thought, wow, this is going to be really difficult for a number of reasons. And that's where I came
up with the prerequisites that I'm going to talk about today. So I think the first prerequisite, and it's probably not a surprise,
is there needs to be a certain level of observability in place.
You know, it's pretty obvious, right?
If you can't see and measure the reliability of what's there now,
and if you can't track the status of your SLOs after the fact,
then you can't really do much with that, right?
Flying blind and just gut feelings, yeah.
Second, I think teams
need additional time and space set aside.
So in the context of this program,
there's a lot of focus on delivering features.
The teams I've talked to are under the pump.
I said, how busy are you? One team said 12 out of 10. I'm like, okay, well, how are you supposed
to engage with this culture change if you don't have sufficient time and mental space?
The third prerequisite I've been thinking about is just valuing reliability, or actually just
valuing quality in general, in the wider sense.
So if you just focus on "we've got to get these features over the line"
and that is the most important thing, then if you have to get something done by a deadline, what's the thing that's going to give?
It's going to be the quality of what you're delivering, right?
And I had a thought recently about that.
I think in a really large organization
like a cloud provider,
that it's a no-brainer
that the reliability of the services
and the customer experience is paramount.
I totally 100% back that.
But I think perhaps maybe
in the realm of smaller organizations,
it might actually be that the bigger value
is to improve the lives of the staff
who operate the software
and the cost of maintaining
or operating unreliable software,
if that makes sense.
That's a subtle difference.
That makes a lot of sense.
Yeah, because if I understand this correctly,
in a smaller organization, you have fewer resources,
and therefore you need to focus on reliability and efficiency right from the start,
because otherwise you're just burning the few resources you have
on working on technical debt instead of innovation.
I mean, I think in general, cost and efficiency and sustainability
should be a topic for big and small organizations.
But I think for very large organizations that have a lot of staff
or more resources and even maybe more money in the bank
that they can survive for a little longer,
it's easier to be a little less efficient and still make it over
the next quarter or something
like this.
Yeah, that's a good one. A couple of thoughts here, because I remember, going back to
the LinkedIn posting, this was exactly one of the arguments you made. You said:
why do you talk about SLOs? For you it should be about reliability. And I then said, for me SLOs are a measure of reliability,
because if I have an SLO on availability, it means my system needs to be reliable in order to be
available, and if I have an SLO on performance, my system has to be reliable, because otherwise
I cannot deliver the performance under different factors of load. But I want to say something about the cost,
because this week I had several conversations
with our customers as I was traveling through Germany.
And we all know that we have a big energy crisis here,
at least, I mean, I'm sure in all of the world,
but in Europe, we feel it quite a lot.
We're spending too much money on
inefficient things, and now sustainability and cost efficiency is a big topic for everybody.
And I really had conversations with organizations this week about defining SLOs
on costs (cost per feature, cost per app, cost per service) and actually driving them down, because people
realize that as we moved to the cloud, everything seemed like endless scale and we didn't
have enough focus on the costs. And now people are realizing that we're wasting not only a lot of
money but also a lot of power. And we don't have endless energy here.
And that's why I think every company wants to be more green,
more sustainable.
And we also discussed:
can we use SLOs as a vehicle to, in the end,
reach our performance and reliability goals,
but also bring down our costs?
And I think, we're all performance engineers.
If you have performance hotspots in your apps, and you can fix a CPU-intensive
algorithm, it not only makes your system more performant but also more efficient. And that's
also important.
So you're talking about, like, multi-SLOs: so not just my feature can complete within two seconds,
but my feature completes within two seconds
and I don't need 24 cores to run it.
Exactly.
You might be able to get something up to speed with enough power,
but what you're saying is you also need to get that power down,
and you can't use brute force to get it through,
because that's going to cost you in energy,
or even just the dollar cost of the cloud.
Exactly.
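Editor's sketch: the "multi-SLO" idea just described, where a feature must be fast and must not rely on brute-force compute, can be expressed as a combined check. The two-second latency limit comes from the conversation; the `FeatureMeasurement` shape and the four-core budget are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FeatureMeasurement:
    """One observation of a feature under load (hypothetical data shape)."""
    latency_seconds: float
    cpu_cores_used: float


def meets_combined_objective(
    m: FeatureMeasurement,
    max_latency_seconds: float = 2.0,  # the "two seconds" from the discussion
    max_cpu_cores: float = 4.0,        # assumed efficiency budget, not from the episode
) -> bool:
    """True only if the feature is both fast enough and cheap enough to run."""
    fast_enough = m.latency_seconds <= max_latency_seconds
    efficient_enough = m.cpu_cores_used <= max_cpu_cores
    return fast_enough and efficient_enough


# Fast, but only by throwing 24 cores at it: the combined objective fails.
brute_force = FeatureMeasurement(latency_seconds=1.5, cpu_cores_used=24.0)
# Fast and frugal: the combined objective is met.
tuned = FeatureMeasurement(latency_seconds=1.8, cpu_cores_used=2.0)

print(meets_combined_objective(brute_force))
print(meets_combined_objective(tuned))
```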
We've talked about this in the past a little bit.
Yeah, what we talked
about is exactly what's
the euro amount
of a particular feature
for a particular user base.
And then measuring
this over time and see
how it behaves. And also, not only how it changes over time,
but that you set certain goals that you need to keep it under a certain level
unless you make a strategic decision that you're packing additional features
into a certain user journey.
And therefore, you may accept an increase by 10 cents per transaction
or something like this.
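Editor's sketch: a cost-per-transaction objective with a deliberately accepted increase, as described above, might look like this. The function names, the euro figures, and the baseline are hypothetical; only the idea of accepting roughly a 10-cent increase for added features comes from the conversation.

```python
def cost_per_transaction(total_cost_eur: float, transactions: int) -> float:
    """Average euro cost of one transaction over a measurement window."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_cost_eur / transactions


def within_cost_objective(
    cost_eur: float,
    baseline_eur: float,
    accepted_increase_eur: float = 0.0,
) -> bool:
    """True if the cost stays at or below the baseline plus any strategically accepted increase."""
    return cost_eur <= baseline_eur + accepted_increase_eur


# Assumed numbers for illustration: 1,100 EUR spent on 2,000 transactions.
current = cost_per_transaction(total_cost_eur=1100.0, transactions=2000)  # 0.55 EUR each

# Against a 0.50 EUR baseline the objective is missed...
print(within_cost_objective(current, baseline_eur=0.50))
# ...unless the business has accepted a 10-cent increase for the new features.
print(within_cost_objective(current, baseline_eur=0.50, accepted_increase_eur=0.10))
```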
I wonder if we need to come up with a new term,
the efficiency level objective or ELO for short,
and then we could all start singing electric light orchestra songs.
It would be just a wonderful day for everyone.
Yeah.
But there's that efficiency factor, not just the performance.
Yeah.
Another slight variation on that is that I think
that in a
company like IAG, it's not
the P1 and P2 incidents which kill us,
it's the thousands of P3s
and P4s.
So I think having a focus on
let's understand those, let's bring them down,
let's build automation, let's simplify, let's
make things more reliable and robust.
And let's build a culture where we know how to deal with
and accept failure, and learn from it as well.
It's something I'm excited by.
And I think it's been a sort of click in my head in the last week,
actually.
Okay.
This is an area that has real value where I am right now.
I think that applies to a lot of things too.
I mean,
if you think about,
you know, the death by a thousand cuts concept is what you're talking about with those P3s.
And I remember a few years ago going back trying to figure out how can I decrease my monthly spend on bills, right?
And I was looking for that P1 that would save me a ton of money because I'd look at like my cable bill.
Like I can only cut $15 from that and $10 here
and then it finally
dawned on me
if I do all of those
now suddenly
they add up together
it's such a simple concept
that gets overlooked often
so I think that's
a great
thing to remind people
your idea of
like look at those P3s
you know
how much damage
are those doing in total
but in order to scale
you need automation
right
because you cannot
put a thousand people on a
thousand P3s. And that's where the automation comes in, and better observability with better
context information. Stephen, you had three prerequisites so far, and you mentioned earlier
three additional ones.
I've got six.
Six, yeah, okay. Let's move on.
All right. So,
ownership and autonomy. I think if teams don't have the ownership of their own service levels
and the autonomy to adjust them themselves, that kind of defeats the purpose of SLOs. I think
if you have an external party saying "these are your targets," then it's not an SLO, it's,
I don't know, something else.
An SLA or a contractual obligation, yeah.
Yeah, something different.
And related to that: business and technology stakeholders embedded and working together,
you know, a representative for the customers there in the team. If you don't have that, then how can
you independently develop your SLOs without going to some external parties and bringing them in?
That might work,
but it's just more complicated.
And the last of my original six,
blameless culture and psychological safety.
So it's funny that since joining SRE
and hearing conferences,
these things are talked about a lot,
a lot more than I expected,
because obviously SLOs are kind of messy and
tricky, and it's getting a lot of people to talk about scary new things at times. And that requires
experimentation, the right mindset, the ability to... what's it called, the opposite of fear of
failure? Joy and success? I don't know. The ability to fail and be like,
that's okay.
Because from my experience,
if you try and adopt SLOs for the first time,
you're going to fail continually.
And I think it's how you respond to those failures,
which is going to be make or break.
It's funny that that's not just a given yet,
right?
I mean,
that goes back forever.
I mean,
not just in modern computing in terms of if you fail properly, fail fast, fail often, right, and improve.
But if you go back, like the great story about how Post-it notes were created, right?
I'm sure you've all heard that story.
And if you haven't, you can look.
It was an accident.
Like those Post-it notes, the guy was trying to create a super glue, messed it up, and then he found these were sticking and he could take them off.
They're like, oh, wow, this is pretty awesome, right?
But even, you know, at least over here in the United States,
we had this thing called Bell Labs,
which was like this legendary place of invention where they would just take all these smart people,
toss them in and they'd say play, right?
And they were just screwing around,
trying out all this stuff.
Like a lot of the stuff never worked,
but then suddenly they'd finally hit something.
Like the fact that failure has to be embraced still
or has to be taught to be embraced,
it just boggles my mind.
Who was it
who made the quote
where he said, I didn't fail
a thousand times, I just found a thousand
ways that don't work. I haven't yet
found the one way that works.
I don't know who said that, but yeah, exactly.
Cool.
So I have three new ones though, right?
No, I just want to make a quick comment. I am
completely with you on all of these three
and I just want to add, ownership
and autonomy, I typically
talk only about ownership because for
me, ownership is important because otherwise, as I
said, if you measure something and nobody cares, then why even do it? I like the autonomy addition to it,
the autonomy to adjust, because you know how things change, and having the autonomy to set your
own goals. I'm also completely with you on business and technology. So every time I try to run
workshops now, I don't want the technical team to come to me and say, let's do SLOs.
I say, we also need the business stakeholders, because in the end, they need to tell us what the business objective is.
And then from that business objective, we need to break it down into technical objectives, so that we can always align every objective to the top business goal.
And for blameless culture, I think, Brian, we had a couple of
episodes recently around chaos engineering. We had Ana Medina
on the call, and she also talked about this, and we had a...
the name slips my mind. But we had a couple of episodes where we talked about
blameless culture especially in the context of chaos
engineering and kind of building resilient systems
and doing
game days and stuff like this and how
the organization deals with failure.
Cool.
All right.
My three new
prerequisites I've thought about.
Insert drumroll here.
Top ten.
The first one you just touched on,
I think is having clear business objectives
because everything flows from that.
And if you just go in and say,
let's do SLOs,
like we said in the beginning,
you're not going to get anywhere
that's aligned to where you want to go.
I had an aha moment recently.
That was my last podcast episode
where I talked to business leaders
and they said,
we want to do these things in the next five years we want to grow this with new customers and we
want to reduce our operational costs by this much and here's some ways we think we can do that i was
like wow i've never heard that before that is incredible how empowering yeah the the third one
and i'm still formulating this i think that there are having the right team structure
or architectural simplicity,
because those are linked because of Conway's law,
having the right architecture makes it easier.
So having, for example, a decoupled architecture
where components can independently be treated
almost like an individual product is great.
And when things are tightly coupled together, it just makes it harder to rationalize. Okay, to do the SLO, we kind of need
to get these teams talking together and working together, and then communication overload kicks in
and it makes it harder, I think. So I'm curious about your thoughts on that.
Yeah, I agree with you. I mean, there's also, have you heard about the book Team Topologies?
I think it's basically, I think that the idea is here that you have, as you said, right,
clear structures on like kind of what is the platform team, like in your organization,
you have a platform team that is building platform services,
then have individual value creation teams on top.
And you want to make sure that these are independent enough and they have clear
contracts to any other teams that they are depending on.
But then you have contractual and nice interfaces.
From a technical side, you have interfaces.
And from a team-to-team perspective,
you can basically then agree on the contract
on what this team provides and what the other team
provides, and then on these interfaces or
contracts, you can then, where it makes
sense, also define SLOs, but then they can
work independently as long as they don't
break the contracts.
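The idea of attaching an SLO to the contract between two teams can be sketched in a few lines. This is only an illustration under assumptions: the team names, the interface name, and the 99.5% target are hypothetical and were not mentioned in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSlo:
    """An SLO attached to one interface between a provider and a consumer team."""

    provider: str
    consumer: str
    interface: str
    target: float  # required fraction of good requests, e.g. 0.995

    def is_met(self, good: int, total: int) -> bool:
        # With no traffic there is nothing to violate, so treat the SLO as met.
        if total == 0:
            return True
        return good / total >= self.target


# Hypothetical contract: the platform team promises the payments team
# that 99.5% of calls to the deploy API succeed.
contract = InterfaceSlo("platform-team", "payments-team", "deploy-api", 0.995)

print(contract.is_met(good=9960, total=10000))
print(contract.is_met(good=9900, total=10000))
```

As long as a check like this holds on the interface, both teams can change their internals independently, which is the point made above about not breaking the contract.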
Yeah, definitely.
There's nothing
I can add to this,
even though it's easier said than done.
The only thing I would add to that in terms of what should go into the thought behind it,
I think the concept is there completely,
but when teams are considering
how are you going to design this to be decoupled,
what technology are we going to use?
One thing that I've actually just had a fun conversation
with one of our customers about it
is that you also have to think about
if we choose technology X,
are we going to be able to observe it
and are we going to be able to observe
the components of it that we need?
Because quite often what we find is people pick something,
they get it running and all,
and they're like, oh, now we need to start measuring it
and all this, and it's something
that's extremely difficult to do.
Or it's either going to be a full-on homegrown DIY thing that they're going to spend a lot of time and resources on, or if they take an off-the-shelf product, there are
certain capabilities that are there, but it wasn't part of that thought process
going into it. So not only do you have to think about the decoupling and that independence and
ability to manipulate those different pieces and just make it run on its own,
but add into that the observability side of it and even security.
How are we going to do this and keep it secure too?
So pulling that whole suite into that decision-making
and that approach going into it.
I have a feeling that perhaps the best situation to define SLOs would be
if you have long-lived feature teams who can independently deliver features completely to
customers, you know, from top to bottom. That's not the situation where I am now.
We have more component-based teams. That's the feeling that I have, but I haven't actually got the experience to say,
yes, that works.
I have a problem now.
I only took, sorry, I only wrote down eight of your prerequisites,
but not nine.
What did that mean?
Let's do the 10.
We have one more or?
One more.
I've got one more to go.
You have one more to go.
Okay. I thought you were ready, that's why.
Okay, I thought I missed one, because I know it's getting late and it was a long,
long day. But number nine.
Okay, very last one. Nine, it's very simple: having education, having a
shared understanding and shared terminology of what SLOs are, why they're helpful, and what the ultimate outcome is.
I will probably put number nine almost at the top.
I think before you have any discussion on the SLOs,
this is what I try to do now as well.
I want to educate people on the terminology
and why we're doing this.
Because if you just throw them into a workshop
and say we need to define SLOs,
and then we have a different understanding
on what is availability, what is an error rate,
what is XYZ, then I think you end up maybe with a result,
but everybody has a different understanding
on what that result actually is.
And therefore, education is key.
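To make the terminology point concrete, here is a small sketch of how two reasonable definitions of "availability" can disagree about the same incident. The traffic numbers, the outage length, and the 99.9% target are invented for illustration.

```python
def request_availability(good_requests: int, total_requests: int) -> float:
    """Availability as the fraction of requests that succeeded."""
    return good_requests / total_requests


def time_availability(down_minutes: float, window_minutes: float) -> float:
    """Availability as the fraction of time the service was up."""
    return 1 - down_minutes / window_minutes


# Hypothetical 30-day window with a single 10-minute outage at peak time,
# during which 2,000 of 1,000,000 requests failed.
WINDOW_MINUTES = 30 * 24 * 60
TARGET = 0.999

by_requests = request_availability(998_000, 1_000_000)
by_time = time_availability(10, WINDOW_MINUTES)

print(f"request-based: {by_requests:.5f} meets target: {by_requests >= TARGET}")
print(f"time-based:    {by_time:.5f} meets target: {by_time >= TARGET}")
```

With these made-up numbers the same month passes under one definition and fails under the other, which is why agreeing on the definition before the workshop matters.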
Our team is currently trying to think
of a way that we can create like an SLO hub within the company, which anyone can go to to learn about
it. But we want to make it actually really simple and visual and fun. So if you've got any ideas for
that, let me know.
Well, I know somebody. Exactly. I know somebody that is using, I think, Paint for his art. And I mean, who would that be?
I know somebody that has a YouTube channel,
even though he says it's just a podcast.
But I think he's very talented with educating people in a fun way.
Yeah.
We want to make it, we don't productize it,
so it's not tied to my MS Paint ability.
I can only draw so many pictures at a time.
But you know how that came about is that I wanted to create a YouTube channel
and,
you know,
even for blog posts,
I wanted artwork and my wife's really good at that kind of thing.
And she was always too busy.
So I said,
I need independence.
I can't draw.
MS Paint it is.
And that's how it all started.
Wow.
Yeah.
Brian, you haven't seen his artwork?
No, I got to go check it out.
You should check it out.
I love that all.
One thing I was going to ask about,
and I don't know if this would be a prerequisite
or if it fits in somewhere else,
and maybe what I'm going to suggest here
is completely wrong,
but when you mentioned right at the beginning
the idea that first team you worked with,
they were more the Kubernetes infrastructure
and they had to focus on their customer,
which is the people using the infrastructure, the Kubernetes platform, not the actual
customers. That got me thinking that your
SLO should be something within your control
that has a direct impact on your customers and not beyond it because anything
beyond that you don't have control over.
That Kubernetes team can't directly influence the end customer.
So they shouldn't be defining SLOs for that.
By trade-off of time spent and time spent adding up, yes, that's going to influence
the end customer. But it's really more downstream
teams that are going to influence that customer. So they should focus, let's say, on
their immediate customers.
So should SLOs be really focused on your direct customers
and only things that you have control over?
And I don't know if that could be a blanket statement at all,
or is that being too broad in general?
And is that a prerequisite,
or is that just part of the concept behind everything?
I think it's spot on because think about the, if I go to Amazon,
they give me a certain SLO and SLA on their services,
but they cannot guarantee that the app that I deploy
and the infrastructure is fast because that depends on how I write my app.
I could never make AWS responsible for how good or bad I write my code.
I mean, what they can do, though, and I think this is also, I guess,
Stephen, what your team does, they can give recommendations, right?
That's why they have the, what are they calling it?
The Well-Architected Framework.
So they have templates on if you want to build apps that run efficiently
on our platform, here are some templates, here are some architectural patterns. If you follow them,
the chances of becoming successful and efficient are higher. But yeah, that's the way I see it.
I agree as well.
It's also like one of the customers I visited yesterday. They also kind of, they didn't
call it the platform team, it was kind of like the delivery enablement team.
They were basically providing different templates of pipelines and different templates of deployment
definitions for microservices that they give to their internal customers, which are the
development teams.
And they said, hey, if you use these pipeline templates
like Jenkins, then we give you these five steps.
And if you enable step one, two, three, four,
then you get automated observability,
you get automated testing,
and then you get automated deployment.
So this is the template.
We suggest you use it, but if you want to do it your own way, it's up to you. But if you do this, our platform is optimized for that.
So here's a sort of conceptual challenge we were having recently. With the current solution that
I know about that we were going to do SLOs for, there might be one customer interaction, and that
hits components that are owned by 10 different teams.
And the question is always,
who has ownership for the end customer?
Well, maybe it's the digital front-end team
who look after that web UI.
Maybe it's them,
but then they need to now collaborate
with nine other different teams.
And it was just hard to conceptualize.
I haven't come up with a solution I'm happy with
to make that all work.
What I've seen with the end customer,
you typically either have, let's say, a mobile app
or a web app or something.
And whoever owns that interface, I think,
is the one that should have the ownership
of how many people are using this particular app
or feature, and what is the user
experience we expect. And if their front end depends on key back-end components, they then
need to basically break it down: in order to deliver a certain user experience on the top, what do we
need from service A, B and C? What level of service do we need in order to fulfill our goal? And then it
almost becomes like, I need this from you, and if I don't get this from you, right, in the
open market world I would go to somebody else. Like maybe I don't take the authentication service from
us internally, but I go with a public access service because they can deliver it. And so
that's what I've done in some of my workshops, where I start with a top-level goal
on individual user journeys, like opening up a mobile app and logging in, then making
a transfer in a mobile financial app.
And then I say, this is kind of what they want on the front end. Which backend services are directly involved, and what is kind of their contribution to it?
What's kind of their performance budget or their reliability budget?
What is their SLO that I need them to deliver to me in order for me to fulfill my top-level goal?
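A rough sketch of that breakdown, assuming a journey that only succeeds when every backend service succeeds and that failures are independent, so the individual availabilities multiply. The service names and the 99.5% journey target are hypothetical, and an even split is just one possible way to hand out the budget.

```python
import math


def journey_availability(service_targets: dict[str, float]) -> float:
    """Availability of a journey that needs every listed service to succeed.

    Assumes failures are independent, so the individual targets multiply.
    """
    return math.prod(service_targets.values())


def equal_service_target(journey_target: float, n_services: int) -> float:
    """Per-service target if the journey budget is split evenly."""
    return journey_target ** (1 / n_services)


# Hypothetical journey: log in, then make a transfer.
JOURNEY_TARGET = 0.995
services = ["auth", "accounts", "transfers"]

# Naive choice: every backend simply copies the journey target.
naive = {name: JOURNEY_TARGET for name in services}

# Budgeted choice: each backend gets a stricter target derived from the goal.
per_service = equal_service_target(JOURNEY_TARGET, len(services))
budgeted = {name: per_service for name in services}

print(f"naive journey availability:    {journey_availability(naive):.4f}")
print(f"per-service target needed:     {per_service:.5f}")
print(f"budgeted journey availability: {journey_availability(budgeted):.4f}")
```

The naive version falls short of the journey goal because three dependencies each spend the full budget, which is why the front-end owner has to ask each backend for a stricter number than the top-level target.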
While you were saying that, I was thinking about, have you ever heard of an organization
where you have almost like an internal marketplace, with maybe two or three different teams that are all developing the same thing, and you get that
sense of competition and driving for excellence?
Yeah, I mean, I never had this before,
but I think this is the ideal world. This would definitely result in
most likely more efficient and more reliable software, because then you're competing,
because you want to be the service that is used
and you want people to stay with you
and not go to the other cloud vendors.
They should go to your platform team
and not to AWS or Google or Microsoft.
Very interesting.
I understand it's tough if you don't have the choice
of having multiple services if you are constrained.
But then you need to have other things like,
you know, we need, you know,
I'm sure there's some type of competition in your world.
People can decide to not go with IAG,
but with other organizations.
And then that's the business driver, right?
That's the, how many, we need to make people happy.
And happy means they need to have a good digital experience
with our organization.
How do we achieve this?
Good.
Brian, I know we both have a kind of a hard stop
in a couple of minutes.
So we probably need to wrap it up.
I think so.
I think this was fascinating, Stephen.
Thank you so much for sharing it.
Before we do,
was there any other point you were hoping to,
you know, dying to get out on here?
You're starring in a new movie
and you're coming on to the late night show
to talk about it that you didn't get it in?
Yeah, I'm hosting MS PaintCon.
You know, I'm going to give you a challenge.
So for anybody listening to it, we'll see if you made the challenge.
I'm looking at your pictures right now.
Andy, can you send, this episode won't air until the 26th of July.
I don't want to put pressure on you, but if you can send that picture you took, Andy, over to Stephen,
if you could do a rendition in your style in MS
Paint, yeah, we'll use that for the image for the podcast. If not, totally understand,
you've got other real work to get done and your own stuff for your bit. But it would be a
fantastic honor to have one of your original pieces of art to use for the show. It would
be a genuine pleasure. I would love to do that.
Would this then be an NFT?
We can sell it for a really high price.
We're not going to get involved in any of that.
Don't get me going on those.
Same.
Anyhow, I really appreciate
you coming on today. Andy, did you have anything you wanted
to wrap with?
No, I should have said this in the beginning, because I know we are connecting three different continents today.
Yeah.
New Zealand, the US and Europe, thanks to technology today.
Anyway.
Well, you said continents and you kind of did a mix between...
Yeah.
I don't even know what that is.
I know. But what I wanted to say is, the technology that we used today was really resilient. It worked perfectly. For me, it
met all the SLOs. And thank you so much, Stephen. Hopefully this is not the last show that we do
together, because I know all of us, we constantly learn and we're also those that constantly like to share the learnings.
And therefore, I hope we have you back in future episodes.
Thank you.
Yeah, thank you very much, Andy and Brian.
I had a great conversation and I learned a few things.
So, yeah.
Awesome.
Thanks a lot.
And thanks to all of our listeners for listening to this episode.
If you have any questions or comments, you can tweet us at...
What's our Twitter?
Pure underscore...
Yeah, it is pure underscore DT, right?
DT, exactly.
Or you can send us an email at pureperformance at dynatrace.com.
But thanks for listening, everyone.
And again, a big thank you to Stephen for joining us.
And we look forward to talking to you in the future.
Thanks a lot, everyone.
Bye-bye.
Bye-bye.