PurePerformance - SREs must not be your SWAT Teams with Dana Harrison
Episode Date: April 8, 2024
SREs (Site Reliability Engineers) have varying roles across different organizations: from codifying your infrastructure, handling high-priority incidents, automating resiliency, ensuring proper observability, defining SLOs, or getting rid of alert fatigue. What an SRE team must not be is a SWAT team - or - as Dana Harrison, Staff SRE at Telus, puts it: "You don't want to be the fire brigade along the DevOps Infinity Loop."
In his years of experience as an SRE, Dana also used to run 1-week boot camps for developers to educate them on making apps observable, proper logging, resiliency architecture patterns, and defining good SLIs & SLOs. He talked about the 3 things that are the foundation of a good SRE: understand the app, understand the current state, and make sure you know when your systems are down before your customers tell you so!
If you are interested in seeing Dana and his colleagues from Telus talk about their observability and SRE journey, then check out the on-demand session from Dynatrace Perform 2024: https://www.dynatrace.com/perform/on-demand/perform-2024/?session=simplifying-observability-automations-and-insights-with-dynatrace#sessions
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my fantastic, wonderful, and potentially jet-lagged co-host Andy Grabner. How are you doing, Andy?
I'm very good, I'm very good actually.
Well, I'm actually happy that I'm still very good because, as you know, well, you just said I'm jet-lagged.
I'm in India right now.
And I just had my first tuk-tuk ride.
Actually, two tuk-tuk rides.
Like the little auto rickshaws.
Yeah, the auto rickshaws.
And they call them just autos. The way to the restaurant was already really interesting because getting through traffic here in Bangalore, where I am,
is challenging, especially at 6, 7 o'clock at night.
But on the way back, the driver asked me,
do you want to go slow or fast?
And I said, well, I want to go safe, but as fast as you can,
because I have a podcast.
And I've never been as fast through the streets of a big city as today, but I've never been as scared. It was a really interesting experience. But I can assure folks, if you ever make it to India or any other country where they have tuk-tuks, or auto rickshaws, or autos as they call them here, these folks are doing this every day. They know what they're doing, even though it looks scary. The guy told me they have 150,000 tuk-tuks active on the roads of Bangalore.
Wow. Now when you see those legendary pictures of the intersections in India, is that what you were experiencing during rush hour, or is that a different part?
Including cows and
dogs and everything else.
Wow. Okay.
Interesting.
I've heard
it's some sort of organized chaos that
you can't understand unless you're in it
and you know it. But it's
supposedly very, very safe somehow.
Well, the thing is, it seems like it's resilient.
The system's resilient, doesn't break down.
It's performing well.
It's amazing.
It's like, man.
What a segue.
What a segue.
Yeah.
Andy's the master, everybody.
Just remember.
And it seems we have a new voice today.
It sounds like a much better voice, Brian,
than you can produce in your microphone
and my crappy built-in microphone.
If I get right on top of it and get the proximity effect,
I can get a little bit of that sound, too.
All you have to do is eat the microphone.
Yes.
But yes, Andy, thank you. It is me. I am the new voice.
You are the new voice. Well, maybe you're taking over our job in the future, because you just get very good ratings everywhere where people are listening in to a podcast. But you know, without further ado, we have a guest, fortunately, that enlightens us on different topics.
And today, we will definitely hit on the topic of site reliability engineering.
But I want to shut up now for a little moment, at least.
I have a lot of questions.
Please.
But I first want to let our guest introduce himself.
Dana, please go ahead.
It's something I know a little bit about, you know, I've maybe picked up a thing or two here.
But yeah, thanks, Andy.
And thank you, Brian.
For what it's worth, Brian, you do sound way better.
It's a very nice mic.
I'm Dana Harrison.
I am a staff site reliability engineer here in Canada with a company called TELUS.
We are one of the largest telecom companies in the country.
So cell phone, internet, home phone, TV, all of that fun jazz.
I started as a site reliability engineer probably in the last, it would have been about five years ago at my previous employer, which was one of Canada's largest insurance companies.
And I've been working in tech consistently for the last,
oh no, I just realized it's been like 15 years and suddenly felt somehow quite aged at that.
But yeah, it's been 15 years with the last five or so in site reliability engineering.
It's been a fun journey getting here, I'll tell you that much.
But I won't give it all away right now.
Keep talking. That voice is amazing.
Yeah.
And all I can say is that I've been working in tech for, I guess,
24 years now.
And that's,
that wasn't my first job out of college.
So talk about feeling aged.
Somehow you have more hair than I do.
Yeah.
Well,
like I often shave it,
but yeah,
it's easier.
But then I'm catching up with you at some point.
Honestly, that's about it. If I grew my hair out, Andy, that's about what I'd look like, and I just like it. It looks great on you, you look wonderful, you're doing fabulous; on me it looks awful, so I keep it completely buzzed out, just because I like the way it looks better. But it does mean, and this is, I guess, a Canadian term, it does mean I'm wearing a touque around the house, you know, five months of the year.
What is a touque?
A beanie, a little hat.
Oh, okay, okay, okay.
I think it's borrowed from the French or French-Canadian.
Ah, there you go.
We learned something new again.
It's amazing. Now you've learned T-O-Q-U-E, if you need to look it up.
There we go.
T-O-Q-U-E.
We learned this word.
We learned that there's 150,000 tuk-tuks in Bangalore.
Oh.
Another thing that we learned, well, that we are going to teach the people now, because, you know, in my local time, it is 10:43 in the evening. Brian, what is it where you are?
It is 11:13 in the morning my time.
And Dana, for me it's just ticked over, it's now 1:14 p.m. in the afternoon.
It's really strange because it seems we're half an hour... I mean, we're not just regular hours apart, but we are hours plus half an hour apart.
So I think India and, Dana,
you mentioned earlier, there's other parts
of the world too that have half an hour time zones.
Newfoundland, as far as I
know, they may be the only two.
There might be other regions, but it's
definitely unusual. Newfoundland,
so Canada, for those who don't know, has two additional time zones after Eastern.
So East of Quebec, once you get into New Brunswick, we have Atlantic time.
So an hour ahead of me.
And then Newfoundland, because it's just that extra little bit further,
and they're just a wonderful, beautiful, special province.
They get another half hour tacked on on top of that.
That can't be too confusing at
all. Now, do they do daylight savings
or not? And then you have to, like,
I can imagine the levels of complexity.
It is different. I have no
idea who does and does not do
daylight savings. But if you
have such levels of complexity, right, you need
to make sure you can, I'm trying to do a transition here, you need to make sure you can
try to understand what complexities they're encountering and get ahead of them so that you can
make sure everything's running smooth, right? There are always challenges to getting all of
your stakeholders on board, whether you're trying to enforce a site reliability practice in a large
enterprise or daylight savings across your country
and continent, let me tell you.
Yeah. Hey, let's jump into the topic. And fun fact, well, interesting fact, not fun fact: yesterday I spent a couple of hours with some of your, not colleagues, but some
of your counterparts at another very big telecom, just across the
pond in a country that just recently exited the European Union.
And I think it's the biggest telecom in that country.
And I had about 20, 25 site reliability engineers, platform engineers in the room, we talked
about site reliability and platform engineering.
And it's interesting that I now have you on the podcast because we talked
a lot about, you know, what does this really mean? How can we, how can we, you know, make
sure that systems stay reliable, resilient? How have things changed over the years? And
I would like to actually pass it over to you because you have a great long history on site
reliability engineering. I think you were doing site reliability engineering, and were a site reliability engineer, before I even heard the term site reliability engineer.
So can you first of all walk us a little bit about your history, your background, where
you started, what things have changed, and especially, you worked with, I think it was Manulife
if I look at your LinkedIn post for 12 years and now for TELUS.
What is it like to be a site reliability
engineer in a large organization? Lessons learned, things that
work, things that don't work. Once I figure it out,
I'll let you know and I'll get back to you on this podcast.
We'll be back in five years with another
episode of Pure Performance.
There's a beautiful thing about being a site reliability
engineer. One, it's still
a relatively new practice. I mean, Google only wrote the book maybe 10, 12 years ago on what it
means to be a site reliability engineer. If I think back on my career and where I started,
I think I was an SRE before I knew what being an SRE was or meant or the full impact of that.
You mentioned I was at Manulife for 12 years. Yeah, they hired me right out of school. I've got to give a shout out to all of the wonderful people I worked with there, because they took a chance on a kid who did not complete his degree in physics, but had a light tech background from working at,
it was actually Future Shop
in Canada.
That was the Canadian version of Best Buy.
I worked tech there.
I got, you know, I had always been curious about tech and, you know, getting my hands
dirty and started out at Manulife many years ago as a like desktop and server admin.
One of my first jobs was to go around to 500 workstations and upgrade
the memory in them. I only killed two, which is a pretty good track record. But through my time and
my tenure at Manulife, I was really able to steer my career in the direction I wanted. So it started,
I think my journey into SRE in
particular started a lot with looking into the concept of toil reduction and automation.
It was a lot of like, oh, here's this manual process that I see 40 people in my department
doing, and it takes them each 30 minutes a day, and they're doing this every week.
But maybe I'll teach myself, because at that time, I didn't really code. Maybe I'll teach myself.
We were a heavy .NET shop. I learned C#, and I was able to automate some of the tasks that they
were doing. And that was a running theme through what I did at Manulife. Even as I moved out of
support, I was in a projects team for a bit. I was actually delivering. We were in waterfall.
We weren't agile or sprints or anything. I was delivering tasks and
delivering code into our environments. From there, I went into more of a consulting role.
And that theme of trying to automate and reduce toil and reduce manual effort and just increase
the value we're getting out of our team members and our applications in turn
was a concept I had really latched onto throughout.
But it wasn't until I was approached five years ago by my then manager at Manulife.
And he said, well, hey, we have this new team starting up.
It's called Site Reliability Engineering.
I had never heard of it.
He said, here's some of
the stuff we're looking at doing. Would you like to be a part of it? I went, absolutely.
And I think the thing I've learned so far about being an SRE is that it is a different role,
no matter where you look. And I think a lot of that is just because it is still relatively new.
I've interviewed in companies where an SRE is, like, you are literally hands-on keyboard doing Terraform, managing your infrastructure all day.
And that is what their SREs do.
And that is one definition of an SRE. Our team would take the P2s that were occurring throughout the org, go in, implement observability tooling, identify what the heck was going on with their application, and then go implement a bunch of fixes. So we would see issues with a website where we were like,
well, why is this website taking eight seconds to load?
Oh, because it's making four repeated calls to the same API.
We can consolidate that into one call and cache the response.
And we went from eight seconds to about 250 milliseconds in one shot.
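As a rough sketch of the kind of fix being described there, assuming a hypothetical profile endpoint and a simple in-memory cache (the names, URL, and TTL are invented for illustration, not the actual code from that project):

```typescript
// Hypothetical sketch: replace four identical fetches with one call whose
// result is cached briefly. Endpoint, types, and TTL are illustrative only.
type Profile = { id: string; name: string };

let cached: { value: Profile; expiresAt: number } | null = null;
const TTL_MS = 60_000; // keep the response for one minute

async function getProfile(id: string): Promise<Profile> {
  const now = Date.now();
  if (cached && cached.expiresAt > now) {
    return cached.value; // reuse the earlier response instead of calling the API again
  }
  const res = await fetch(`https://api.example.com/profile/${id}`);
  if (!res.ok) throw new Error(`profile call failed: ${res.status}`);
  const value = (await res.json()) as Profile;
  cached = { value, expiresAt: now + TTL_MS };
  return value;
}

// Before: four widgets each made the same call on page load.
// After: they all await the single cached result.
async function renderPage(id: string) {
  const profile = await getProfile(id);
  // ... render the widgets from that one response ...
  return profile;
}
```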
There are some really, really cool things we got to do as part of that team.
One of the other things I got to do at Manulife was stand up a reliability engineering bootcamp.
So it was this week-long developer hands-on bootcamp, which we only got to do in-person
twice before the pandemic hit. And then we went to completely remote,
which was exhausting to do,
but we still got a lot of people through.
We would get them hands-on with,
I'll insert redacted competitor tool name here
that we were using there,
how to instrument your code,
what these metrics mean,
what they look like,
how you can identify improvements in your app.
What does effective logging look like? Because you've got people who are just spitting out millions of
logs for no reason or for no discernible reason. They're not using the data or they don't need to
use that data. Circuit breaker patterns. How do you set up service level indicators, objectives,
and agreements? How do you set up an error budget? And then at the end of that one-week course, can you tell I really loved doing this course? At the end of this one-week
course, we would give them an app that we had set up, and using the things that they had learned
through the past four days, we gave them this app and then told them to perform a bunch of tasks on
it. So go instrument, go set up dashboards,
go set up alerting. And then once they got through that, we would actually break their app in the
backend and say, go fix it. And they had to identify any one of the things or so that we
had done to their app. So we would shut off a vendor API, like this third-party API that they
had no access to. Instead of loading up 50 records from a database at a time, we'd load up 500,000 of them at a time. And so their, you know, their load times would go off the rails. That was a really fun, and I think very important, exercise in that particular organization. Because it
really did a great job of getting the knowledge and skills and
tools that we had developed into the hands of the developers directly.
It became a grassroots initiative at Manulife. Instead of sort of trying to force our way down through management and then talking to the developers, it became, oh, here, we're just going to give this all to the developers. And then they were
the ones who were excited to use it.
They were the ones who were really
jazzed about all of this cool information
they could get because of what we did
as an SRE team.
I took so many notes.
Brian, hopefully you didn't hear me
typing because I know today I'm not using my
microphone.
First of all, thank you so much for that great idea of doing a bootcamp.
You call it a bootcamp or masterclass, whatever you want to call it,
but it's really amazing to put people through this.
Let me ask one question because I always get the question
from folks that I interact with.
We need to attract developers.
We need to figure out how to talk with developers,
how to engage them.
And it feels like you found a great way
because as an SRE, you're kind of like
in that perfect position where you're helping
the organization to understand
why things are currently not stable.
You mentioned earlier, you were basically looking into why does it take so long to start
up an app?
Because during startup, too many things happen and we can optimize this.
Why does it break at all?
But then you take this knowledge and then you enable and mentor your developers that are
actually creating the next generation of apps so that
they from the beginning understand the concept of resiliency, the power of observability.
I love the term effective logging, right?
Because we had a Guild meeting recently and Guild is an internal group within Dynatrace
where we meet with customers or Dynatrace users on a regular basis.
And I remember that it was Andrea who is actually an SRE at Dynatrace.
And she talked about how we are trying to really standardize what is effective logging within Dynatrace.
What are good logs and bad logs, and really enforce standards, because we don't need stuff that is just cluttering our storage and nobody needs it.
So I really think that you just gave me the perfect pitch deck,
the perfect pitch for how we need to pitch observability into an organization
and not necessarily starting with the individual developer,
but I think SRE, and SRE, correct me if I'm wrong, is also a part of platform engineering, because with platform engineering we try to provide site reliability engineering concepts as best practice or as a self-service. But getting it in there, and then going on the one side, and I know people cannot see me right now because I'm off camera.
But I can see you, Andy.
You can see me.
He's moving his hands.
Descriptive video for Andy's hands.
Exactly.
It's pretty cool.
So that's, you know, obviously on the one side,
you're making sure that production is stable
and you optimize what's running now.
But then you're really taking the time.
We took the time to educate developers
and put them through a bootcamp.
And I think the bootcamp is an awesome idea.
Yeah.
Well, thank you.
It was one of the most rewarding
career experiences I've had.
It was exhausting
just in terms of we were running it
once a month for a full week.
And especially, you know,
once we sort of settled into the pandemic and realized this was our reality.
You know, the first few months we were like, oh, sure, okay,
you know, we'll run two or three remote and we'll be back into the office.
And then we quickly realized that was going to be how that was going to go.
So, yeah, we ran it for over a year remote.
But I think that was the turning point for us
in actually being able to scale effectively.
Because before that, we were a relatively small team.
That, as you said, SRE, because you have SRE concepts in many other practices. Platform engineering definitely leverages SRE concepts. Development should leverage SRE concepts. Everybody should be in SRE, frankly.
And then I can retire and it'll be wonderful.
But I think that reaching out to the developers
was the key in scaling
because before having us just come in
and be sort of the SWAT team was really effective.
Like we got a lot done,
but it wasn't how we could deliver the most value
to the organization.
It wasn't how we could get sort of our knowledge out.
It was too much handholding almost
where we'd go in and fix
and then nobody would ever learn anything.
They'd just be like, oh, like, thanks, you know, Wonder Woman. And then we'd fly in and fix, and then nobody would ever learn anything.
They'd just be like, oh, thanks, Wonder Woman, and then we'd fly away on our invisible plane. And then nobody would learn a lesson
after that. So getting to that point, and prior to us
setting up the one-week boot camp, I will say Manulife had done a stellar job of setting
up a one-month developer boot camp that I had also taken part of, and there were a number
of other one-week bootcamps.
Ours was one of several.
So the fact that they had this program at all was wonderful
because it got everybody on the same page.
And then it enabled us to make other SREs.
We turned it into an SRE factory
because then you suddenly had people
who were coming through this program.
And may I, if he's listening, a special shout out to Rohan Shah,
who is now like a senior manager of SRE at Bank of Montreal here in Canada,
who was one of my students through the reliability engineering bootcamp at Manulife,
which was a lot of fun.
I will say he was my best student, and now that's on the record.
And it enabled us, again, to turn into like a self-replicating machine of SREs.
We could then take all of our knowledge and all of our concepts and all of the things
we were excited about and all of the things we were constantly changing the course material
to match.
Maybe we could tie this week's course into a recent P1 that happened and say, all right,
here's what happened.
Let's break it down.
We'll run you through it.
Here's how we could have avoided that.
And then everybody just went out from there.
And it solved, to me, the major challenge
with being a site reliability organization.
And it's a challenge that we're still facing here at TELUS.
People don't generally like being told what to do,
I think.
So if you come in and you're,
you're coming in to a P1,
everybody's already frazzled and you're going,
oh,
well we can just sort of fix it like this.
Yes.
It's wonderful.
Everybody's happy that the incident is resolved,
but not everybody's happy because sometimes you get a bit of a feeling,
in my experience,
people seem to feel like you're stepping on their toes
a little bit or you're stealing their thunder.
And obviously that's not the goal.
We're all part of an organization.
We're all getting paid by the same organization.
We want to collaborate on this.
But I get it.
It doesn't always feel like that.
I've been in that situation
where somebody comes in and fixes our stuff and it's like, well, now I feel like trash because of that. But the goal is to, again, just be part of something bigger. That's where the excitement started: when we're able to get the developers themselves
on board with everything that we're doing, then we don't have to be those people
who are coming in and stepping on your toes.
I do just want to quickly shout out, though, because it's been mentioned a few
times earlier this week, at least in the United States, but I think pretty much
globally was when the shutdown began four years ago.
So, uh,
when everything... Wow, yeah.
Yeah. So, anyway.
At the time of the recording, yeah?
Yeah, at the time of the recording.
March 14th, yeah.
And anyway, because it's been
brought up so many times, I'm like, oh my gosh, my daughter
was like, I can't believe it was four years. So, yeah, anyway.
Andy, you had a thought there.
I have a thought too, but Andy, you go with yours first
because I'm sure it's more relevant.
No, no, no.
It's just you mentioned earlier kind of the different things you were teaching.
And I think it's interesting for recap, like how to get observability,
effective logging, how to use the data to optimize,
circuit breaker patterns,
setting up SLIs and SLOs.
Because people always ask me, so now, what is SRE?
What are the three things you should do if you're applying SRE best practices?
And if you look back at the bootcamp on what you taught,
what are the three things
that developers definitely took away?
What are the top three things
that everybody has to have in their mind?
And this is the bare minimum
of building resilient systems.
That would be interesting.
Number one, understand your application.
That was a point that we really drove home: the observability tool we were using was not there to describe your entire application to you. You still have to understand what's going on and what you're dependent on. It might show you a little bit more, but it is not the source of all of the answers on what everything in your code base does. And beyond
just the code, what is the purpose of your application?
Why does it exist?
Where do you fit into the wider flow of,
you know, if you are one of 20 APIs
being called after somebody clicks something
on the front end, why are you there?
What are you doing?
What service are you offering as part of your code base?
I think is probably the first thing I would say.
The second is respect the data. So, okay, great. You've implemented Dynatrace. You have all of this wonderful data now. We've templated it out: you've got business events, and you've got all of your golden signals, and you've got synthetics, and you've got RUM. So what are you going to do with it?
Because you can set it up and ignore it,
but then you're just paying for it for no reason.
And that ties into respect the data
and respect your current state.
So understand your application, use data,
know what your current state is.
So at no point, and I've started calling this out
in a meeting that we have on Fridays,
where we go over all of our major incidents. It's a lot of fun. I've started calling out
incidents that I know are instrumented, but have agent or customer as the detection mechanism.
I never want to see that again. I should never, ever be relying on an agent or customer to tell me when
something is wrong with my application. Those are the big three that I would drive home from.
I mean, certainly there are lots of little things. Again, I could say like,
implement circuit breakers. That's just good developer practice. Implement effective logging.
Why are you spewing out a JSON object with every line of the JSON object on a separate log line? I've seen that. Great, okay, so you've just now spit out 800 log lines for one request or response. Why? Why have you done this to me?
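For illustration, a minimal sketch of the contrast: one compact, structured log line per request instead of a pretty-printed object spread across hundreds of lines (the logger, fields, and values here are invented for the example):

```typescript
// Anti-pattern: pretty-printing turns one object into one log line per JSON line.
// console.log(JSON.stringify(responseBody, null, 2));

// Sketch of the alternative: a single structured line per request, carrying
// only the fields someone will actually query later.
function logRequest(event: {
  route: string;
  status: number;
  durationMs: number;
  traceId?: string;
}) {
  console.log(
    JSON.stringify({
      level: "info",
      msg: "request completed",
      timestamp: new Date().toISOString(),
      ...event,
    })
  );
}

logRequest({ route: "/api/profile", status: 200, durationMs: 42, traceId: "abc123" });
```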
I love that. I'm just taking a lot of notes here, and especially the third point, which you said: you don't want to end up in a situation where some external party, whether it's an agent or a customer, is telling you that the system is down. You actually need to think about how you can detect that the system is not in a healthy state before it impacts your end user. I mean, that's really in the end what it's all about. And as you said,
there's different ways to then mitigate. Well, first of all, there's ways to detect it.
And then there's different technical things to mitigate things like the circuit breaker concept,
retries, and things like this. These are things to make a system more resilient
based on architectural patterns. But first of all, knowing where could things end up being a problem
for your consumer. And then how do you detect this?
How can you mitigate it from the architectural perspective?
Because there's many things we can do other than restarting or scaling up or scaling down.
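As a loose illustration of the circuit breaker idea mentioned here, a bare-bones sketch with invented thresholds, endpoints, and fallback behavior (not a production-grade implementation):

```typescript
// Minimal circuit-breaker sketch: after too many consecutive failures, stop
// calling the downstream dependency for a cool-down period and fail fast,
// so the caller can fall back or degrade gracefully instead of hanging.
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private readonly maxFailures = 3,
    private readonly coolDownMs = 30_000
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error("circuit open: skipping downstream call");
    }
    try {
      const result = await fn();
      this.failures = 0; // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.coolDownMs;
      }
      throw err;
    }
  }
}

// Usage sketch: wrap a flaky third-party call and serve a degraded response
// while the circuit is open.
const breaker = new CircuitBreaker();
async function getRates(): Promise<unknown> {
  try {
    return await breaker.call(() =>
      fetch("https://vendor.example.com/rates").then((r) => r.json())
    );
  } catch {
    return { rates: [], stale: true }; // degraded but still responsive
  }
}
```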
Every time I see an incident where somebody says, we fixed it by
restarting,
a part of me dies a little inside.
I was talking with a developer about this the other day
and asked, why did a restart fix it?
I mean, I get that a restart fixes it,
but why did you need to restart it?
Did you have a memory leak?
Is there a condition in which it just panicked and didn't know what to do? A restart is maybe the thing that gets you back up and running, but it's almost never the resolution, the final resolution, to fix things. It's not a fix, it's a band-aid.
Yeah. I have seen, you know, decades-old servers that just had a regular reboot set up nightly, because they're like, oh, if we don't do that, then the application crashes. Like, no. What? No, that's not a fix.
It reminds me of IIS app pools, where they have the setting that recycles it.
These were IIS servers.
I told you I started off
as a .NET guy, Andy.
That's a throwback. Wasn't that standard like every
24 hours it would recycle?
It's the default setting.
Is that still going on?
Not that we have to go into that.
Who knows?
I haven't touched IIS in years at this point.
Good riddance.
I mean, love you Microsoft, but not that.
Yeah.
Wow.
I mean, so this is stuff that you did in your previous job.
Yeah.
And this was years ago.
What has happened?
What has changed now?
Has anything changed, especially, you know, as you're moving,
I assume you're also moving towards cloud-native technology.
Kubernetes is a big thing.
You want to get a little bit more current, do you, Andy?
I guess I can understand that.
It's always good to have a little bit of history, right?
But on the other side, we want to...
Yeah, for sure.
So for a little bit more background,
I was brought on to TELUS to migrate us into Dynatrace SaaS two years ago
from unnamed competitor observability tool here.
And now we are just actually,
as of this week of recording,
as of last night,
have just completed our Dynatrace managed
to SaaS migration project.
Now, darn you Dynatrace,
because that SaaS migration tool you threw out
at the end of January
would have been really flipping handy for us.
But that's neither here nor there.
We got it done without it.
Yeah, so we've actually had a Dynatrace managed instance
for the last 12 years.
So getting all of that stuff into SaaS,
that's been the big thing over the past couple of months
was just getting all of that config.
And I have to laugh because, yeah,
we're definitely in the process of going cloud native.
One of the things about Telus Digital
that they sort of, when they were their own arm
of the organization, did so well
was that everything started off cloud native. It was completely Greenfield, which I mean
was wonderful then, but you know how much of that is deprecated now. Greenfield always turns brown eventually. But on the side of the rest of TELUS, I would say the majority of what we are monitoring
and supporting right now is still,
if not VMs in a cloud provider,
then actual physical hosts or on-prem VMs.
We still have hundreds of these, thousands of them.
And it's a constant struggle.
We've had incidents where
like, oh, we'll just roll out a OneAgent update and, oh no, this is now broken. It's injected RUM on this legacy
WebLogic 10.3 platform
and it's blown it up.
And we have to work through the challenges of
that. And that's, to be clear,
not blaming the tool,
blaming the legacy architecture that we're still using.
But we are slowly but surely marching towards
a completely cloud-native setup.
We have a number of Kubernetes clusters
that we're monitoring completely in full stack,
which is very exciting.
It's enabled a lot of really cool stuff for us.
But most recently, with that migration
from managed into SaaS,
the single most exciting thing for me out of that is that we
now have people who understand the context of how they relate to one another.
Because previously we had Dynatrace SaaS that was
Telus Digital. That was essentially everything that a user interacts with on the front end.
So you load up telus.com, you're interacting with things on the Telus Digital side.
If you log into your Telus account, maybe then you're starting to call back into some of our more legacy hosted APIs. But with those two instances disconnected and a whole bunch of messy proxy stuff in between, nobody really saw or got the trace context of what everybody did with one another.
So you'd see on the digital, on the Dynatrace SaaS side, you would see, okay, I clicked this, I've called all of these great Kubernetes APIs that we developed in the last six months because they're wonderful and shiny and new.
And then it goes off through our proxy and then it's gone.
And then nobody, I won't say nobody cares about where it goes,
but you don't see it.
So it's sort of like out of sight, out of mind,
which is too real for me as somebody with ADHD is out of sight, out of mind.
If it's hidden behind a wall or like a cupboard door or something,
it's gone.
It just functionally does not exist for me.
But what we've done now
with everybody being in this wonderful
unified SaaS instance
is I can now see,
all right, user clicks something on telus.com.
Here's the 20 downstream API calls
that I'm going to from there.
And here's the proxies that you're going through.
And here are the databases you're dependent on. And suddenly you have people who
have never talked to each other before. We were so siloed that nobody on what we formerly called
big Telus, on the big Telus side of things, was aware or really had any visibility
into what Telus Digital was calling them for
or what they were calling out to.
And now suddenly we've unlocked this superpower
of unreal traceability.
And that'll just get easier as we go more cloud native
because we'll have more stuff
that we can actually reliably instrument.
It's funny because when you mentioned that earlier on,
you mentioned I think rule number one was understand your app
and understand all the pieces of it.
Don't rely on your observability tool,
but at the same time, your observability tool is also key to understanding.
So there's a bit of a chicken and an egg component there, right?
There is for sure, yeah.
But I understand what you mean, like don't rely on that for that. You should know as much as you can about it as it is. It's a tool to help you uncover it, but
it's amazing how much it brought that knowledge to you.
And the end result is that more people are now talking.
Those silos are going away. I know for
several, for many years now actually, there's always been the idea of, well, if I'm going
to update my API, not only do I have to know what I'm communicating to downstream, I need
to know who my consumers are upstream so that they know that I'm making a change, that I'm
not impacting them, and that I need to let them know that I'm making a change, and whether or not I have to be backward compatible for how long, and all
that stuff, right?
And you can only do that if you have that awareness.
So if you're a great organization and everybody knows this stuff, fantastic.
If not, you have these tools to help you figure that stuff out, right?
Because again, if you're coming in years later, you have no idea what that is.
You mentioned that legacy stuff.
The reason why I think so many things are dependent on a restart in legacy stuff like that is because, who knows it well enough to even... like, I don't want to touch it. We were talking about, I know I'm rambling here, but we were talking about mainframe recently on a couple of other episodes, where it used to be like, don't even breathe near the mainframe. Now things are getting a little bit more modernized and people are starting to take more risks with mainframe. But whenever you have anything legacy, it's like, oh my gosh, if I touch it and that stops working, we don't know what's going to happen.
The issue of a lack of inherited knowledge is real in super legacy stacks like a lot of the stuff we deal with.
You mentioned mainframe.
Manulife was on a mainframe.
Here's a spoiler alert for you folks.
Most financial services and insurance companies you
deal with are on mainframes, and there's a good reason, because those things are solid. I mean, you can run, like, Docker on z/OS now. That stuff's crazy. But yeah, one of the things that we're
constantly dealing with and especially at a large org like this, where we're tens and tens of thousands of people
in Telus,
is with these 10, 15, 20-year-old stacks,
you're losing all of that knowledge.
As teams are sort of adjusted, moved around,
anybody who's dealt with a large enterprise
knows that a reorg is,
there's always a new reorganization
happening around the corner,
and that knowledge just gets lost.
So yeah,
as you said, the fix becomes, you know, don't release changes. The fix becomes don't breathe
near this or it'll fall over. The fix becomes like, nobody knows how to log into this. We are
just waiting for the day when it explodes and people find out the impact. It's a constant
challenge we're working with. I think it's one of the things that observability tooling can help alleviate a
little bit.
It definitely can't fix it.
But if you're at the very least able to get a better understanding of, okay, nobody's touched this host in three years, let's see what it's talking to at the very least. Let's see what the code is doing. It can definitely help, but it is a constant challenge.
For sure.
I have a couple of
more questions, and we're already
amazingly, 30-something
minutes in.
I talk a lot, I know, I'm sorry.
No, no, no. You spread
a lot of knowledge, that's what it is.
That's a better way to phrase it.
Diplomat Andy.
Thank you for calling it knowledge and not the usual one.
So one question I got today,
or maybe it was also yesterday during my meeting
with one of the other telecoms from the other country
on the other side of the pond,
was what is a good approach to SLOs?
Meaning, what do you teach your developers
or what do you ask from application owners
in terms of what are really good SLOs?
What is too much? What is too little?
What is the minimum?
And I just want to throw this over to you.
If you think about an application that you get,
what is a good SLO and what is not a good SLO?
So it's funny you say our application owners and developers.
And the concept of application ownership has been sort of another.
That's a struggle that we're working through.
And so a lot of the time, they are one and the same.
You have developers who are owning an application, but without necessarily understanding
the actual business impact
of specifying a given service level objective.
So that's sort of a whole other challenge.
But it's a really fascinating question
because I think if you use Dynatrace's
baselining technology with Davis...
No, let's throw that in there.
The concept of a good or bad SLO, I mean, there are lots of things that go into it. I think the concept of alert fatigue can really go into the definition of a good or bad SLO. For those who are... I assume everybody knows what alert fatigue is, so I'm just not going to go into it.
But we have teams who are like,
oh, this alert is going off all of the time
and there's nothing we can do about it.
That's a bit of a double-edged sword for me.
Why can't you do anything about it?
And why was that alert set at that threshold to begin with?
I think the concept of just setting a good SLO is to have an effective understanding of,
there's back to point one from the bootcamp, what purpose does your application serve?
Who are your consumers?
Who are you consuming?
And what do you ultimately need to do? If you are only being
called a few times an hour and it doesn't actually matter on the front end what your API is doing,
maybe you're running completely asynchronously, so it doesn't matter if your API takes 10 seconds
to respond. I don't want to see an API that takes 10 seconds to respond, but maybe that's okay
for the needs of your application. It's always contextual.
It's always all about what does your application need to do to set that SLO
and then start looking at the service level indicators you can use
to measure what that application needs to do.
So from the business case for your application,
if you say we need to be able to sustain 300 users
with a reasonable response time,
first off, what does reasonable mean?
That's up to an application owner to define.
You can use things like Google.
Google's standard, I think, back in 2010
was that a webpage should load in two seconds or less.
How many webpages do you deal with today
that are still loading way, way slower than two seconds? And that's not even to largest contentful paint. You've got websites where it's a greater than two-second time to first byte, and you're just sitting there
looking at nothing for two to five seconds. The key to setting a good SLO is really just having
that understanding, contextual understanding of
what your application needs to be doing. So if you say, all right, we know that on this front end,
we need to be able to service 300 users at a time, and they all need to be able to load this
web page in, you know, one and a half seconds. So immediately, you know that your indicators are going to be,
well, how many people are on the page or on the site right now? And what is my current response
time? How quickly am I serving up these pages? And from there, you can say, all right, at what
point do we start to worry about this application? SLAs are where you started getting into like,
at what point are we legally liable? Or, like, we have to start paying back money. But to me, an effective SLO toes that line of when do we start to care. An SLO is how you define, within a given application and your application's needs, how early or how late to be notified for anything.
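To make that concrete, a small sketch of turning such a target into a service level indicator and an error budget check; the 1.5-second threshold, 99% objective, and function names are illustrative only, not anyone's real SLOs:

```typescript
// Sketch: "99% of page loads complete in under 1.5 s" over a window of
// observed latencies, plus how much of the error budget has been consumed.
interface SloResult {
  sli: number;             // fraction of "good" requests in the window
  target: number;          // the objective, e.g. 0.99
  budgetRemaining: number; // 1 = budget untouched, 0 = budget exhausted
}

function evaluateSlo(latenciesMs: number[], thresholdMs = 1500, target = 0.99): SloResult {
  const total = latenciesMs.length;
  const good = latenciesMs.filter((ms) => ms <= thresholdMs).length;
  const sli = total === 0 ? 1 : good / total;
  const allowedBad = (1 - target) * total; // how many slow requests the budget allows
  const actualBad = total - good;
  const budgetRemaining = allowedBad === 0 ? 1 : Math.max(0, 1 - actualBad / allowedBad);
  return { sli, target, budgetRemaining };
}

// Example: 1,000 page loads, 15 of them slower than 1.5 s.
const latencies = Array.from({ length: 1000 }, (_, i) => (i < 15 ? 2000 : 800));
console.log(evaluateSlo(latencies));
// -> { sli: 0.985, target: 0.99, budgetRemaining: 0 }, so the budget is spent and it is time to act.
```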
And one of the things I have really tried to stress with people in
teaching the subject matter in the past is I think a lot of people think of
SLOs as a set and forget.
It's like you set it once, you're good for ages.
No, absolutely not.
SLOs can be fluid like the rest of your application.
If you suddenly, you know, you know you're making a release
that is maybe going to slow down response time,
but that is okay still as defined by your application owners
and within the bounds of what
your application needs to do, then maybe it's okay to loosen those SLOs a little bit. Are you getting
over-alerted for something over which maybe, again, you have genuinely no control? Maybe you're
calling something further downstream. You can maybe loosen those SLOs. In turn, maybe you've
now put in a change. You put in caching. You've put in some sort of reduced amount of downstream calls.
You've sped up your application otherwise.
Maybe you just removed some terrible old logic.
And now maybe you can tighten those SLOs.
You can say, all right,
we're going to hold ourselves accountable
to a higher standard.
I think the other part of that to me
is that your users start to expect what you deliver.
So if you are suddenly starting to deliver an API that was responding in two seconds and now responds in 500 milliseconds, you bet your users aren't going back to an API that responds in two seconds.
Nobody wants that.
And I think that can be a bit of a dangerous point
when you're playing with it at that
level, but I think the
real point is
it's all contextual, it's all on what
your application needs to be doing, and
it can change from day to day.
Or not day to day, but you know,
it can change.
I wanted to ask, though, on the SLO part, right? I hear people debate all the time on where SLOs should
be set. Obviously,
your end user is the number one key
factor. It should
be about the objectives and goals of
the workflow, the organization,
the purpose of the app.
But let's say you have
20 different APIs,
20 different services you're calling.
Before I even go there, let me take a step back and say,
good SLOs require good observability.
I'm not saying that as a plug for Dynatrace,
because I think people fall into the trap of setting an SLO
as a monitor threshold, as opposed to a real SLO.
Like, oh, I want to know when my CPU is over 80%.
Well, why?
That's got nothing to do with anything.
Is your application still responding well when your CPU is at 80%?
Then fine.
Yeah, who cares, right?
Care about that.
But then if you look at these different APIs, right,
these teams can see themselves as, this is my application, this section of it,
and anybody upstream is my customer, my end user.
So do we look at SLOs below the actual end user?
Is it appropriate to have SLOs at different API levels or different service levels?
I want my service to be up and running this amount of time.
I want my response time of my service to be X amount.
I want it to be able to,
even just using those basics,
handle 300 concurrent calls at the same time,
if it's a one-to-one relationship.
Or is that going too granular?
Or does it depend on many other factors?
I think it depends on a lot of factors.
The main thing,
and I'm glad you called this out,
is what is the customer actually experiencing? Define an SLO that
means something. Because yeah, 10 calls down in your API stack, sure, you could say, all right, we're going to track
response time and we're going to alert when the response time goes off the rails. But
the real point to setting these is
when does it actually become an issue? When do people need to care that you're doing this? Because otherwise you're just generating noise. And that great example you gave of, okay, we're going to alert when our CPU is over 80. Why? What's the impact of that? Unless you know that CPU goes over 80 and that is a precursor to, you know, a known previous incident or something, then okay, maybe that's a valid sort of thing to set. But outside of that, if you are trying to set all these granular SLOs on all of your downstream APIs without necessarily taking the time to understand what your end user impact is
or what your overall application state is,
then you're missing the point.
You're missing the forest for the trees.
Yeah, and I think the issue with that 80% one too
is if you know that that's a precursor to something,
well, then fix that.
Or it's like, okay, that's a problem.
Well, then what are you doing about it?
Or at least set something up like...
Implement regular restarts.
Exactly. Or maybe scaling has to come into it at that point and automate that scaling. Oh,
I need to know so I can add a new instance. Well, why are you adding a new instance manually when
that's happening? Right. But I think the bigger challenge we face all the time in this, to me,
and as you were talking about this, and this goes back to what you said earlier about the adoption of SRE and it's still new, I feel like everyone in IT is always scrambling.
Everything is always an emergency.
There's never like a normal running state.
It's, oh my gosh, this broke.
We've got to get it fixed.
Or, hey, there's this new feature.
We've got to hurry up and get it out, and people have to scramble to get it out. Part of the challenge I see with this SRE adoption is organizations
giving people the time and the ability to do this, right?
And again, we see a bunch of different customers all the time.
I still put SRE and the things you're talking about,
not quite in that unicorn phase because there's more than just the Googles doing it, but it's still
the fancy people doing it. It's the people who make a billion dollars or more
and I don't mean literally, but it's
a fancy state to be in. And the people we see day to day
are scrambling. They're working in tech stacks that were
not chosen for a purpose. They were chosen because we want to move to Kubernetes
or we want to be all serverless because it was just a decision.
We're going to do SRE, but no one's taking the time to teach them proper
SRE, so they're doing weird SLOs and then expecting
we need all the support for this stuff.
Really, the only word that comes to my mind is a scramble. And one of the questions I was thinking of asking before was, how do we, and I don't know if we have an answer, if this is some big existential question, how do we get it so that people can actually implement these things and take advantage of what there is out there to make all this stuff easier?
I mean, that feels like an existential
question, and again,
maybe that's another I'll call back in five years
with the answer.
I think one of the things
that I have tried to espouse is
that in your regular
sprints, because you're right,
it doesn't matter if you're purely dev, if you're ops, if you're DevOps, if you're whatever.
I've described
SRE as being that wonderful
DevOps infinity symbol
as if you imagine a dumpster
that's on fire that's just slowly
going around the infinity symbol with you.
That's us.
Because there's going to be breakpoints
at literally every possible point in that cycle.
And that's SRE.
It is like the truck crashing into the DevOps symbol
and just breaking it all to pieces.
I think I have tried to say
that if people aren't prioritizing reliability
in their application,
like you've got, on purely the dev side, you've got people like, all right, we need to get this new feature out in this sprint.
Now the product owner's coming to us
and we need to release this in the next two weeks.
And it's just a constant,
all you're doing is feature, feature, feature, feature, feature.
But at the same time,
all you're building up and behind
is technical debt and a lack of reliability
and a lack of understanding
of what your application is doing and why.
So for me, if I can convince product owners to be prioritizing reliability stories,
reliability and technical debt stories, at the same level as they're prioritizing new features, then I've done something right. It's not always that easy because, I get it, money is the ultimate
arbiter here.
And, you know, if we can't say, you know, implementing this story in this sprint is going to save you X dollars, as opposed to, hey, if you have this new feature, you're going to make a cool mill, that can be a bit of a tough sell sometimes. Where I think the real benefit comes in is back to when we were doing that boot camp and we
were talking with developers. And a lot of the same stuff applies with what we've been doing at
Telus. Get in with the developers and you are golden because then they can talk to their product
owners. They can talk to their business leaders and say, oh, hey, this isn't operating how I want it to operate
based on what I've learned through this SRE practice. Let's prioritize this. And then it's
less about, oh, the site reliability office is telling me I need to go do this or else.
It becomes more like an internal ideation. One of the things, and we talked about this at Perform,
was putting the tools into the hands of the developers
to make this easy.
Because yeah, implementing good observability takes time,
or it can take time, and getting that understanding.
So we are trying desperately to self-serve as much as we can.
We have these templates that we're leveraging
with Backstage and Monaco.
We have some other ones on Terraform
from the Telus Digital side of things
where essentially people fill out eight to 10 fields,
like what's your repo?
What environment are you targeting?
What's your CMDB ID?
Fill it out, submit,
and suddenly there's a PR against your repo, and you've pushed a
default set of alerting dashboards,
monitoring, etc., to your services
in Dynatrace with the
click of a button. That is the real power to me.
If you can get your
developers on board in a language
that they, and in a way that
they understand.
You're going to realize the biggest possible wins.
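As a loose sketch of what such a self-service template might capture; the field names and generated config below are invented for illustration and are not the actual Telus, Backstage, Monaco, or Terraform templates:

```typescript
// Hypothetical shape of the self-service form: a handful of fields the
// developer fills in, from which default monitoring config is generated
// and proposed as a pull request against their repo.
interface ObservabilityOnboardingRequest {
  repoUrl: string;                          // e.g. "https://github.com/acme/checkout-api"
  environment: "dev" | "staging" | "prod";
  cmdbId: string;                           // the service's CMDB identifier
  serviceName: string;
  teamEmail: string;                        // where default alerts should go
}

// Sketch: turn the form into a default alerting/dashboard blob that a config
// pipeline (Monaco, Terraform, etc.) could apply. Structure is illustrative only.
function buildDefaultMonitoringConfig(req: ObservabilityOnboardingRequest) {
  return {
    service: req.serviceName,
    environment: req.environment,
    owner: req.teamEmail,
    dashboards: [`${req.serviceName}-golden-signals`],
    alerts: [
      { name: `${req.serviceName}-error-rate`, condition: "error rate > 2% for 5m" },
      { name: `${req.serviceName}-latency`, condition: "p95 latency > 1.5s for 10m" },
    ],
    tags: { cmdbId: req.cmdbId, repo: req.repoUrl },
  };
}

console.log(
  buildDefaultMonitoringConfig({
    repoUrl: "https://github.com/acme/checkout-api",
    environment: "prod",
    cmdbId: "CI0001234",
    serviceName: "checkout-api",
    teamEmail: "checkout-team@example.com",
  })
);
```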
Awesome.
Speechless.
I thought so.
Speechless or Andy's just tired.
It's a combination of both, probably a little bit.
It's a little late here.
But I want to actually bring it to a close.
But I have one more question for you because I really love the bootcamp idea and what we
are planning and you've been at Perform, we always have hot days, hands-on training days,
and one of the things we've built for Perform was a hands-on training for platform engineering
and we now brought it to a GitHub repository where in the lightweight version
where we built it backstage and Argo and everything
that you can just launch in
five minutes and it stands up a reference
IDP and we're actually planning
to do more of these
to really teach people on
best practices and eventually we also want to
do maybe some type of certification
like you're certified in
building resilient apps because you're
a developer and you learned
how you do proper logging, how you do
this and this and how you set proper SLIs and
SLOs. So this was just
one of the things we're working on and I would love to
collaborate with you because you have
obviously...
Yeah, I will definitely.
But my
last question is, you mentioned
that you did the boot camp back
in your previous job. Have you run boot camps like this at TELUS as well, or are you planning to?
So it's not something we've had the opportunity to do at that level yet. It is still something I am very much pushing for and would like to do, though. I think it's a tough sell to a lot of leaders, very understandably, to say, I'm going
to take a dozen of your developers for a week and they aren't doing their regular job. So it's
building out the value proposition for that is the biggest challenge. Because I need to be able to
say, look, at the end of this, if you give me four cohorts of 10 students each, I know
ultimately I'm costing you how many tens or hundreds of thousands of dollars because of
that.
I need to be able to define exactly what wins you're going to get out of that.
So yeah, it's something I'd love to do.
But yeah, building out the business case for that is probably the biggest challenge for
me.
I took so many notes. I think I can probably write a novel almost.
A romantic fiction about an SRE.
Yeah, most likely. Yeah, yeah. I will figure something out. Maybe I use my friend to make it even nicer.
But now authors everywhere love you for saying that.
Yes. Maybe you can write a column for the India Times while you're down there.
Maybe, yeah, exactly. Actually, what's on the news today? Let's see. Privileged to be a part of Uttar Pradesh's growth.
Doesn't it say, Andy Grabner arrives in India today? Is that the headline?
No, it's... no, unfortunately not.
Austrian developer in near-death experience with tuk-tuk.
Yeah. Nothing exciting.
I'm going to plant the seed here publicly, but I think
you talk about this idea, the
boot camp and all that, and you're talking about hot days.
I think it would be, what do we usually do, a two-day hot day period?
It'd be interesting if there was an actual two-day hot session where it was a two-day boot camp on using an observability tool like Dynatrace to set up an SRE practice,
but almost like a scaled version of your week-long practice, if that could be fit to two days. Not, say, like a four-hour session; it's this two-day session you sign up for, you know.
And I cannot tell you how down I am for that. Like, hey, hit me up. Give me a ticket and a flight out to Vegas for Perform 2025, and I am there.
Yeah.
I mean, we used to have a week long, we call it the autonomous cloud lab where we actually
showed people in a week how to monitor a monolithic app, how to modernize the app.
And it was like pre-pandemic.
That's very cool.
And then maybe we should resurrect that idea.
Yeah.
May I propose another even smaller one?
I didn't touch on this earlier,
but for that bootcamp that we did,
we also had a one-day product owner version
where we were talking about why it was so good.
One, for the developers to be learning about this,
so it was sort of the sell job on them.
But two, what our business partners
and what our product owners
could also get out of Dynatrace.
It's not just a tool. Observability in general and these practices are not just for SREs.
They're not just for developers.
You as a product
owner can now suddenly have a
very easy red-green understanding
of what is my application
doing?
It's a power that they have.
Is my feature adopted as fast as I thought?
Exactly.
How's my app rollout going?
These are easy questions to answer now.
And hey, if it frees up developers
from answering these questions,
then all the better.
Yeah, I think education is the key
with all this stuff, right?
And to do an anti-segue, that's why we have wonderful guests like you on, to help our listeners become educated on this stuff.
I think it's the most important factor.
And people who listen are probably sick of hearing Andy and I talk about how we selfishly run this podcast so we can keep learning.
But I think it is the key, right?
It is when you think about even the complaints I was talking about before,
about everyone scrambling and it's like, how do you get people to do this?
If people start understanding the value and the benefits of these things,
that's what's going to finally get them to maybe pause and say,
all right, maybe I can give up my developers for the week to go to your boot camp
or even just two days or anything. Let's start down this path. Let's start getting small wins,
you know, and just to remind people, it doesn't have to be all or nothing. Start small and get
a little bit and a little bit and a little bit. And yeah, Andy, I don't know if you had any other
questions, or Dana, if you had any last ideas you wanted to get in.
I don't think so. I think I am all out of ideas; you have drained me of all of the ideas I have, and I have them all written down. And I'll do one more close, talking into my fancy mic here, to give you that public radio kind of voice. This is NPR.
Here at CBC.
That's right.
Yes, yes, yes.
Well, again then, thank you, everyone, for listening, and Dana, really thank you for your
time. Thanks for having me. It's fantastic.
It's wonderful. And
we hope everybody enjoyed it as much as we did,
and we'll talk to you all next time.
And Andy, enjoy your time in India.
And next time, if you take another tuk-tuk,
I would hope you can have enough courage to pull your camera out
and get a video of you going through.
Every minute of him screaming.
Every minute of him.
I will maybe post it later on, on social media somewhere.
I would love to see that.
Yeah.
All right.
Thanks, everyone.
Thank you.
Thank you.
Bye.
Bye.
Bye.