PurePerformance - SREs must not be your SWAT Teams with Dana Harrison

Episode Date: April 8, 2024

SREs (Site Reliability Engineers) have varying roles across different organizations: from codifying your infrastructure, handling high-priority incidents, automating resiliency, ensuring proper observability, defining SLOs, or getting rid of alert fatigue. What an SRE team must not be is a SWAT team - or - as Dana Harrison, Staff SRE at Telus, puts it: "You don't want to be the fire brigade along the DevOps Infinity Loop." In his years of experience as an SRE, Dana also used to run 1-week boot camps for developers to educate them on making apps observable, proper logging, resiliency architecture patterns, and defining good SLIs & SLOs. He talked about the 3 things that are the foundation of a good SRE: understand the app, understand the current state, and make sure you know when your systems are down before your customers tell you so! If you are interested in seeing Dana and his colleagues from Telus talk about their observability and SRE journey, then check out the on-demand session from Dynatrace Perform 2024: https://www.dynatrace.com/perform/on-demand/perform-2024/?session=simplifying-observability-automations-and-insights-with-dynatrace#sessions

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my fantastic, wonderful, and potentially jet-lagged co-host Andy Grabner. How are you doing, Andy? I'm very good, I'm very good actually. Well, I'm actually happy that I'm still very good because, as you know, well, you just said I'm jet-lagged. I'm in India right now.
Starting point is 00:00:48 And I just had my first tuk-tuk ride. Actually, two tuk-tuk rides. Like the little auto rickshaws. Yeah, the auto rickshaws. And they call them just autos. The way to the restaurant was already really interesting because getting through traffic here in Bangalore, where I am, is challenging, especially at 6, 7 o'clock at night. But on the way back, the driver asked me, do you want to go slow or fast?
Starting point is 00:01:18 And I said, well, I want to go safe, but as fast as you can, because I have a podcast. And I've never been as fast through the streets of a big city as today, but I've also never been as scared. It was a really interesting experience. But I can assure folks, if you ever make it to India or any other country where they have tuk-tuks, or auto rickshaws, or autos as they call them here, these folks are doing this every day. They know what they're doing, even though it looks scary. The guy told me they have 150,000 tuk-tuks active on the roads of Bangalore. Wow. Wow. Now, when you see those legendary pictures of the intersections in India, is that what you were experiencing during rush hour, or is that a
Starting point is 00:02:06 different part? Including cows and dogs and everything else. Wow. Okay. Interesting. I've heard it's some sort of organized chaos that you can't understand unless you're in it
Starting point is 00:02:22 and you know it. But it's supposedly very, very safe somehow. Well, the thing is, it seems like it's resilient. The system's resilient, doesn't break down. It's performing well. It's amazing. It's like, man. What a segue.
Starting point is 00:02:37 What a segue. Yeah. Andy's the master, everybody. Just remember. And it seems we have a new voice today. It sounds like a much better voice, Brian, than you can produce in your microphone and my crappy built-in microphone.
Starting point is 00:02:54 If I get right on top of it and get the proximity effect, I can get a little bit of that sound, too. All you have to do is eat the microphone. Yes. But yes, Andy, thank you. It is me. I am the new voice. You are the new voice. Well, maybe you're taking over our job in the future, because you just get very good ratings everywhere people are listening in to a podcast. But you know, without further ado, we have a guest, as always, that fortunately enlightens us on different topics.
Starting point is 00:03:26 And today, we will definitely hit on the topic of site reliability engineering. But I want to shut up now for a little moment, at least. I have a lot of questions. Please. But I first want to let our guest introduce himself. Dana, please go ahead. It's something I know a little bit about, you know, I've maybe picked up a thing or two here. But yeah, thanks, Andy.
Starting point is 00:03:47 And thank you, Brian. For what it's worth, Brian, you do sound way better. It's a very nice mic. I'm Dana Harrison. I am a staff site reliability engineer here in Canada with a company called TELUS. We are one of the largest telecom companies in the country. So cell phone, internet, home phone, TV, all of that fun jazz. I started as a site reliability engineer probably in the last, it would have been about five years ago at my previous employer, which was one of Canada's largest insurance companies.
Starting point is 00:04:25 And I've been working in tech consistently for the last, oh no, I just realized it's been like 15 years, and suddenly felt somehow quite aged at that. But yeah, it's been 15 years, with the last five or so in site reliability engineering. It's been a fun journey getting here, I'll tell you that much, but I won't give it all away right now. Keep talking. That voice is amazing. Yeah. And all I can say is that I've been working in tech for, I guess,
Starting point is 00:04:50 24 years now. And that's, that wasn't my first job out of college. So talk about feeling aged. Somehow you have more hair than I do. Yeah. Well, like I often shave it,
Starting point is 00:05:01 but yeah, it's easier. But then I'm catching up with you at some point honestly that's about if I grew my hair out Andy that's about what I'd look like and I just like it it just looks for on me it looks great on you you look wonderful you're doing
Starting point is 00:05:17 fabulous on me it looks awful so I keep it I keep it completely buzzed out just because I like the way it looks better but it does mean I'm wearing now this this is, I guess, a Canadian term. It does mean I'm wearing a touque around the house, you know, five months of the year. What is a touque? A beanie, a little hat. Oh, okay, okay, okay.
Starting point is 00:05:37 I think it borrowed from the French or French-Canadian. Ah, there you go. We learned something new again. It's amazing. Now you've learned T-O-Q-U-E, if you need to look it up. There we go. T-O-Q-U-E. We learned this word. We learned that there's 150,000 tuk-tuks in Bangalore.
Starting point is 00:05:53 Oh. Another thing that we learned, well, that we are going to teach the people now, because, you know, in my local time, it is 1043 in the evening. Brian, what is it than you are it is 11 13 in the morning my time 11 10 and dana for me it's it's just ticked over it's now 1 14 p.m the afternoon it's really strange because it's really strange because it seems we're half an hour. I mean, we're not just regular hours apart, but we are hours plus half an hour apart. So I think India and, Dana, you mentioned earlier, there's other parts
Starting point is 00:06:34 of the world too that have half an hour time zones. Newfoundland, as far as I know, they may be the only two. There might be other regions, but it's definitely unusual. Newfoundland, so Canada, for those who don't know, has two additional time zones after Eastern. So East of Quebec, once you get into New Brunswick, we have Atlantic time. So an hour ahead of me.
Starting point is 00:06:56 And then Newfoundland, because it's just that extra little bit further, and they're just a wonderful, beautiful, special province. They get another half hour tacked on on top of that. That can't be too confusing at all. Now, do they do daylight savings or not? And then you have to, like, I can imagine the levels of complexity. It is different. I have no
Starting point is 00:07:14 idea who does and does not do daylight savings. But if you have such levels of complexity, right, you need to make sure you can, I'm trying to do a transition here, you need to make sure you can try to understand what complexities they're encountering and get ahead of them so that you can make sure everything's running smoothly, right? There are always challenges to getting all of
Starting point is 00:07:37 your stakeholders on board, whether you're trying to enforce a site reliability practice in a large enterprise or daylight savings across your country and continent let me tell you yeah hey uh let's jump into the topic and fun fact well interesting fact not fun fact yesterday i spent a couple of hours with some of your not colleagues but some of your counterparts at another very big telecom, just across the pond in a country that just recently exited the European Union. And I think it's the biggest telecom in that country. And I had about 20, 25 site reliability engineers, platform engineers in the room, we talked
Starting point is 00:08:18 about site reliability and platform engineering. And it's interesting that I now have you on the podcast because we talked a lot about, you know, what does this really mean? How can we, you know, make sure that systems stay reliable, resilient? How have things changed over the years? And I would like to actually pass it over to you because you have a great long history in site reliability engineering. I think you've been doing site reliability engineering and been a site reliability engineer before I even heard about the term. So can you first of all walk us a little bit through your history, your background, where
Starting point is 00:08:50 you started, what things have changed and especially you work with, I think it was MenuLife if I look at your LinkedIn post for 12 years and now for TELUS. What is it like to be a site reliability engineer in a large organization? Lessons learned, things that work, things that don't work. Once I figure it out, I'll let you know and I'll get back to you on this podcast. We'll be back in five years with another episode of Pure Performance.
Starting point is 00:09:20 There's a beautiful thing about being a site reliability engineer. One, it's still a relatively new practice. I mean, Google only wrote the book maybe 10, 12 years ago on what it means to be a site reliability engineer. If I think back on my career and where I started, I think I was an SRE before I knew what being an SRE was or meant or the full impact of that. You mentioned I was at Manulife for 12 years. Yeah, they hired me right out of school. I've got to shout out to all of the wonderful people I worked with there, because they gave me, a kid who did not complete his degree in physics but had a light tech background from working at,
Starting point is 00:10:04 it was actually Future Shop in Canada. That was the Canadian version of Best Buy. I worked tech there. I got, you know, I had always been curious about tech and, you know, getting my hands dirty and started out at Manulife many years ago as a like desktop and server admin. One of my first jobs was to go around to 500 workstations and upgrade the memory in them. I only killed two, which is a pretty good track record. But through my time and
Starting point is 00:10:36 my tenure at Manulife, I was really able to steer my career in the direction I wanted. So it started, I think my journey into SRE in particular started a lot with looking into the concept of oil reduction and automation. It was a lot of like, oh, here's this manual process that I see 40 people in my department doing, and it takes them each 30 minutes a day, and they're doing this every week. But maybe I'll teach myself, because at that time, I didn't really code. Maybe I'll teach myself. We were a heavy.net shop. I learned C sharp and I was able to automate some of the tasks that they were doing. And that was a running theme through what I did at Manulife. Even as I moved out of
Starting point is 00:11:17 support, I was in a projects team for a bit. I was actually delivering. We were in waterfall. We weren't agile or sprints or anything. I was delivering tasks and delivering code into our environments. From there, I went into more of a consulting role. And that theme of trying to automate and reduce toil and reduce manual effort and just increase the value we're getting out of our team members and our applications in turn was a concept I had really latched onto throughout. But it wasn't until I was approached five years ago by my then manager at Manulife. And he said, well, hey, we have this new team starting up.
Starting point is 00:12:00 It's called Site Reliability Engineering. I had never heard of it. He said, here's some of the stuff we're looking at doing. Would you like to be a part of it? I went, absolutely. And I think the thing I've learned so far about being an SRE is that it is a different role, no matter where you look. And I think a lot of that is just because it is still relatively new. I've interviewed at companies where an SRE is, like, you are literally hands-on keyboard doing Terraform, managing your infrastructure all day. And that is what their SREs do.
Starting point is 00:12:34 And that is one definition of an SRE P2s that were occurring throughout the org, go in, implement observability tooling, identify what the heck was going on with their application, and then go implement a bunch of fixes. So we would see issues with a website where we were like, well, why is this website taking eight seconds to load? Oh, because it's making four repeated calls to the same API. We can consolidate that into one call and cache the response. And we went from eight seconds to about 250 milliseconds in one shot. There are some really, really cool things we got to do as part of that team. One of the other things I got to do at Manulife was stand up a reliability engineering bootcamp. So it was this week-long developer hands-on bootcamp, which we only got to do in-person
Starting point is 00:13:41 One of the other things I got to do at Manulife was stand up a reliability engineering bootcamp. It was this week-long, hands-on developer bootcamp, which we only got to do in person twice before the pandemic hit. And then we went completely remote, which was exhausting to do, but we still got a lot of people through. We would get them hands-on with, I'll insert redacted competitor tool name here that we were using there: how to instrument your code, what these metrics mean,
Starting point is 00:14:00 what they look like, how you can identify improvements in your app. What does effective logging look like? Because you've got people who are just spitting out millions of logs for no reason, or for no discernible reason. They're not using the data, or they don't need to use that data. Circuit breaker patterns. How do you set up service level indicators, objectives, and agreements? How do you set up an error budget?
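For readers who haven't met the circuit breaker pattern mentioned in that list, here is a rough, hand-rolled sketch in Python. A real service would more likely lean on a hardened library, and the thresholds below are made up.

```python
import time

class CircuitBreaker:
    """Stop hammering a failing dependency; let it recover, then probe again."""

    def __init__(self, max_failures=5, reset_after_seconds=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when tripped; None = closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of calling the dependency")
            # Cool-down elapsed: go half-open and allow one probe call through.
            self.opened_at = None

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        else:
            self.failures = 0                        # a healthy call resets the count
            return result
```

Wrapped around something like the "vendor API we shut off" in the exercise described below, a breaker turns a hung third-party dependency into a fast, visible failure instead of a pile-up of waiting threads.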
Starting point is 00:14:44 it. So go instrument, go set up dashboards, go set up alerting. And then once they got through that, we would actually break their app in the backend and say, go fix it. And they had to identify any one of the things or so that we had done to their app. So we would shut off a vendor API, like this third-party API that they had no access to. Instead of loading up 50 records from a database at a time, we load up 50 or 500,000 of them at a time. And so their, you know, their load times would go off the rails. That was, that was a really fun. And I think very important exercise in, in that particular organization. Because it really did a great job of getting the knowledge and skills and
Starting point is 00:15:28 tools that we had developed into the hands of the developers directly. It became a, a grassroots initiative at Manulife. It, instead of sort of trying to, to force our way down through management and then talking to the developers, it became, Oh, here, we're just going to give this all to the developers. And then they were
Starting point is 00:15:48 the ones who were excited to use it. They were the ones who were really jazzed about all of this cool information they could get because of what we did as an SRE team. I took so many notes. Brian, hopefully you didn't hear me typing because I know today I'm not using my
Starting point is 00:16:04 microphone. First of all, thank you didn't hear me typing because I know today I'm not using my microphone. First of all, thank you so much for that great idea of doing a bootcamp. You call it a bootcamp or masterclass, whatever you want to call it, but it's really amazing to put people through this. Let me ask one question because I always get the question from folks that I interact with. We need to attract developers. We need to figure out how to talk with developers,
Starting point is 00:16:29 how to engage them. And it feels like you found a great way because as an SRE, you're kind of like in that perfect position where you're helping the organization to understand why things are currently not stable. You mentioned earlier, you were basically looking into why does it take so long to start up an app?
Starting point is 00:16:51 Because doing startup, too many things happen and we can optimize this. Why does it break at all? But then you take this knowledge and then you enable mentor your developers that are actually creating the next generation of apps so that they from the beginning understand the concept of resiliency, the power of observability. I love the term effective logging, right? Because we had a Guild meeting recently and Guild is an internal group within Dynatrace where we meet with customers or Dynatrace users on a regular basis.
Starting point is 00:17:27 And I remember that it was Andrea, who is actually an SRE at Dynatrace. And she talked about how we are trying to really standardize what effective logging is within Dynatrace. What are good logs and bad logs, and really enforce standards, because we don't need stuff that is just cluttering our storage and nobody needs it. So I really think that you just gave me the perfect pitch deck, the perfect pitch for how we need to pitch observability into an organization, and not necessarily starting with the individual developer. But I think the SRE, and SRE,
Starting point is 00:18:08 correct me if I'm wrong, is also a part of platform engineering because with platform engineering we try to provide site reliability engineering concepts as best practice or as a self-service, but getting it in there and then going on the wrong side, and I know people
Starting point is 00:18:23 cannot see me right now because I'm off camera. But I can see you, Andy. You can see me. He's moving his hands. Descriptive video for Andy's hands. Exactly. It's pretty cool. So that's, you know, obviously on the one side,
Starting point is 00:18:38 you're making sure that production is stable and you optimize what's running now. But then you're really taking the time. We took the time to educate developers and put them through a bootcamp. And I think the bootcamp is an awesome idea. Yeah. Well, thank you.
Starting point is 00:18:52 It was one of the most rewarding career experiences I've had. It was exhausting just in terms of we were running it once a month for a full week. And especially, you know, once we sort of settled into the pandemic and realized this was our reality. You know, the first few months we were like, oh, sure, okay,
Starting point is 00:19:12 you know, we'll run two or three remote and we'll be back into the office. And then we quickly realized that was going to be how that was going to go. So, yeah, we ran it for over a year remote. But I think that was the turning point for us in actually being able to scale effectively. Because before that, we were a relatively small team. That's, as you said, SRE, because you have SRE concepts in many other practices.
Starting point is 00:19:42 Platform engineering definitely leverages SRE concepts. Development should leverage SRE concepts. Everybody should be in SRE concepts in many other practices. Platform engineering definitely leverages SRE concepts. Development should leverage SRE concepts. Everybody should be in SRE, frankly. And then I can retire and it'll be wonderful. But I think that reaching out to the developers was the key in scaling because before having us just come in and be sort of the SWAT team was really effective.
Starting point is 00:20:06 Like we got a lot done, but it wasn't how we could deliver the most value to the organization. It wasn't how we could get sort of our knowledge out. It was too much handholding almost where we'd go in and fix and then nobody would ever learn anything. They'd just be like, oh, like, thanks, you know, Wonder Woman. And then we'd fly in and fix, and then nobody would ever learn anything.
Starting point is 00:20:28 They'd just be like, oh, thanks, Wonder Woman, and then we'd fly away on our invisible plane. And then nobody would learn a lesson after that. So getting to that point, and prior to us setting up the one-week boot camp, I will say Manulife had done a stellar job of setting up a one-month developer boot camp that I had also taken part of, and there were a number of other one-week bootcamps. Ours was one of several. So the fact that they had this program at all was wonderful because it got everybody on the same page.
Starting point is 00:20:53 And then it enabled us to make other SREs. We turned it into an SRE factory because then you suddenly had people who were coming through this program. And may I, if he's listening, a special shout out to Rohan Shah, who is now like a senior manager of SRE at Bank of Montreal here in Canada, who was one of my students through the reliability engineering bootcamp at Manulife, which was a lot of fun.
Starting point is 00:21:19 I will say he was my best student, and now that's on the record. And it enabled us, again, to turn into like a self-replicating machine of SREs. We could then take all of our knowledge and all of our concepts and all of the things we were excited about and all of the things we were constantly changing the course material to match. Maybe we could tie this week's course into a recent P1 that happened and say, all right, here's what happened. Let's break it down.
Starting point is 00:21:45 We'll run you through it. Here's how we could have avoided that. And then everybody just went out from there. And it solved, to me, the major challenge with being a site reliability organization. And it's a challenge that we're still facing here at TELUS. People don't generally like being told what to do, I think.
Starting point is 00:22:10 So if you come in and you're, you're coming in to a P1, everybody's already frazzled and you're going, oh, well we can just sort of fix it like this. Yes. It's wonderful. Everybody's happy that the incident is resolved,
Starting point is 00:22:22 but not everybody's happy because sometimes you get a bit of a feeling, in my experience, people seem to feel like you're stepping on their toes a little bit or you're stealing their thunder. And obviously that's not the goal. We're all part of an organization. We're all getting paid by the same organization. We want to collaborate on this.
Starting point is 00:22:40 But I get it. It doesn't always feel like that. I've been in that situation where somebody comes in and fixes our stuff and it's like, well, now I feel like trash because of that. But the goal is to, again, just be part of something bigger. That's when the excitement started: when we're able to get the developers themselves on board with everything that we're doing, then we don't have to be those people who are coming in and stepping on your toes. I do just want to quickly shout out, though, because it's been mentioned a few times: earlier this week, at least in the United States, but I think pretty much
Starting point is 00:23:21 globally was when the shutdown began four years ago. So, uh, when everything... Wow, yeah. Yeah. So, anyway. It's time of the recording, yeah? Yeah, it's time of the recording. March 14th, yeah. And anyway, because it's been
Starting point is 00:23:39 brought up so many times, I'm like, oh my gosh, my daughter was like, I can't believe it was for you. So, yeah. Anyway, Andy, you had a thought there. I have a thought too, but Andy, you go with yours first, because I'm sure it oh my gosh, my daughter is like, I can't believe it was for you. So yeah, anyway. Andy, you had a thought there. I have a thought too, but Andy, you go with yours first because I'm sure it's more relevant. No, no, no. It's just you mentioned earlier kind of the different things you were teaching. And I think it's interesting for recap, like how to get observability,
Starting point is 00:24:00 effective logging, how to use the data to optimize, circuit breaker patterns, setting up SLIs and SLOs. Because people always ask me, so now, what is SRE? What are the three things you should do if you're applying SRE best practices? And if you look back at the bootcamp on what you taught, what are the three things that developers definitely got away?
Starting point is 00:24:28 What are the top three things that everybody has to have in their mind? And this is the bare minimum of building resilient systems. That would be interesting. Number one, understand your application. That was a point that we really drove home: that the observability tool we were using
Starting point is 00:24:45 was not there to describe to you, your entire application. You still have to understand what's going on and what you're dependent on. It might show you a little bit more. Um, but it is, it is not the source of all of the answers on what everything in your code base does. And beyond just the code, what is the purpose of your application? Why does it exist? Where do you fit into the wider flow of, you know, if you are one of 20 APIs being called after somebody clicks something
Starting point is 00:25:15 on the front end, why are you there? What are you doing? What service are you offering as part of your code base? I think is probably the first thing I would say. The second is respect the data. So, okay, great. You've implemented Dynatrace. You have all of these wonderful data now. We've templated out, you've got business events and you've got all of your golden signals and you've got synthetics and you've got rum. So what are you going to do with it? Because you can set it up and ignore it,
Starting point is 00:25:46 but then you're just paying for it for no reason. And that ties into respect the data and respect your current state. So understand your application, use data, know what your current state is. So at no point, and I've started calling this out in a meeting that we have on Fridays, where we go over all of our major incidents. It's a lot of fun. I've started calling out
Starting point is 00:26:11 incidents that I know are instrumented, but have agent or customer as the detection mechanism. I never want to see that again. I should never, ever be relying on an agent or customer to tell me when something is wrong with my application. Those are the big three that I would drive home from. I mean, certainly there are lots of little things. Again, I could say like, implement circuit breakers. That's just good developer practice. Implement effective logging. Why are you spewing out a JSON object with every line of the JSON object on a separate log line I've seen that, great, okay
Starting point is 00:26:50 so you've just now spit out 800 log lines for one request or response, why? why have you done this to me? I love the just like taking a lot of notes here and especially the third point, which you said, you don't want to end up in a situation where some external party, whether it's an
Starting point is 00:27:14 agent or a customer is telling you that the system is down so that you actually need to think about how can you detect if the system is not in a healthy state before it impacts your end user. I mean, that's really in the end what it's all about. And as you said, there's different ways to then mitigate. Well, first of all, there's ways to detect it. And then there's different technical things to mitigate things like the circuit breaker concept, retries, and things like this. These are things to make a system more resilient based on architectural patterns. But first of all, knowing where could things end up being a problem for your consumer. And then how do you detect this?
Starting point is 00:27:52 How can you mitigate it from the architectural perspective? Because there's many things we can do other than restarting or scaling up or scaling down. Every time I see an incident where somebody says, we fixed it by restarting, a part of me dies a little inside. I was talking with a developer about this the other day and asked, why did a restart fix it? I mean, I get that a restart fixes it,
Starting point is 00:28:16 but why did you need to restart it? Did you have a memory leak? Is there a condition in which it just panicked and didn't know what to do restart maybe the the thing that gets you back up and running but it's never almost never just like the the resolution the final resolution to fix things up it's not a fix it's a it's a it's a band-aid yeah um i had i have seen you know decades old servers that just had a regular reboot set up nightly because they're like oh if we don't do that then the application crashes like no what no that's not that's not a fix it reminds me of that reminds me of iis app pools
Starting point is 00:29:00 where they have the setting it's recycled it. These were IIS servers. I told you I started off as a.NET guy, Andy. That's a throwback. Wasn't that standard like every 24 hours it would recycle? It's the default setting. Is that still going on? Not that we have to go into that.
Starting point is 00:29:19 Who knows? I haven't touched IIS in years at this point. Good riddance. I mean, love you Microsoft, but not that. Yeah. Wow. I mean, so this is stuff that you did in your previous job. Yeah.
Starting point is 00:29:37 And this was years ago. What has happened? What has changed now? Has anything changed, especially, you know, as you're moving, I assume you're also moving towards cloud-native technology. Kubernetes is a big thing. You want to get a little bit more current, do you, Andy? I guess I can understand that.
Starting point is 00:29:55 It's always good to have a little bit of history, right? But on the other side, we want to... Yeah, for sure. So for a little bit more background, I was brought on to TELUS to migrate us into Dynatrace SaaS two years ago from unnamed competitor observability tool here. And now we are just actually, as of this week of recording,
Starting point is 00:30:17 as of last night, have just completed our Dynatrace managed to SaaS migration project. Now, darn you Dynatrace, because that SaaS migration tool you threw out at the end of January would have been really flipping handy for us. But that's neither here nor there.
Starting point is 00:30:35 We got it done without it. Yeah, so we've actually had a Dynatrace managed instance for the last 12 years. So getting all of that stuff into SaaS, that's been the big thing over the past couple of months was just getting all of that config. And I have to laugh because, yeah, we're definitely in the process of going cloud native.
Starting point is 00:30:58 One of the things about Telus Digital that they sort of, when they were their own arm of the organization, did so well was that everything started off cloud native. It was completely Greenfield, which I mean was wonderful then, but you know how much of that is deprecated now. Greenfield always turns brown eventually. But on the rest, the side of the rest of Telus, there are, I would say the side of the rest of TELUS, I would say the majority of what we are monitoring and supporting right now is still, if not VMs in a cloud provider,
Starting point is 00:31:35 then actual physical hosts or on-prem VMs. We still have hundreds of these, thousands of them. And it's a constant struggle. We've had incidents where like, oh, we'll just roll out like a one agent update and, oh no, this is now broken. It's injected ROM on this legacy
Starting point is 00:31:55 WebLogic 10.3 platform and it's blown it up. And we have to work through the challenges of that. And that's, to be clear, not blaming the tool, blaming the legacy architecture that we're still using. But we are slowly but surely marching towards a completely cloud-native setup.
Starting point is 00:32:14 We have a number of Kubernetes clusters that we're monitoring completely in full stack, which is very exciting. It's enabled a lot of really cool stuff for us. But most recently, with that migration from managed into SaaS, the single most exciting thing for me out of that is that we now have people who understand the context of how they relate to one another.
Starting point is 00:32:40 Because previously we had Dynatrace SaaS that was Telus Digital. That was essentially everything that a user interacts with on the front end. So you load up telus.com, you're interacting with things on the Telus Digital side. If you log into your Telus account, maybe then you're starting to call back into some of our more legacy hosted APIs. But with those two instances disconnected and a whole bunch of messy proxy stuff in between, nobody really saw or got the trace context of what everybody did with one another. So you'd see on the digital, on the Dynatrace SaaS side, you would see, okay, I clicked this, I've called all of these great Kubernetes APIs that we developed in the last six months because they're wonderful and shiny and new. And then it goes off through our proxy and then it's gone. And then nobody, I won't say nobody cares about where it goes, but you don't see it.
Starting point is 00:33:34 So it's sort of like out of sight, out of mind, which is too real for me as somebody with ADHD is out of sight, out of mind. If it's hidden behind a wall or like a cupboard door or something, it's gone. It just functionally does not exist for me. But what we've done now with everybody being in this wonderful unified SaaS instance
Starting point is 00:33:55 is I can now see, all right, user clicks something on telus.com. Here's the 20 downstream API calls that I'm going to from there. And here's the proxies that you're going through. And here are the databases you're dependent on. And suddenly you have people who have never talked to each other before. We were so siloed that nobody on what we formerly called big Telus, on the big Telus side of things, was aware or really had any visibility
Starting point is 00:34:25 into what Telus Digital was calling them for or what they were calling out to. And now suddenly we've unlocked this superpower of unreal traceability. And that'll just get easier as we go more cloud native because we'll have more stuff that we can actually reliably instrument. It's funny because when you mentioned that earlier on,
Starting point is 00:34:47 you mentioned I think rule number one was understand your app and understand all the pieces of it. Don't rely on your observability tool, but at the same time, your observability tool is also key to understanding. So there's a bit of a chicken and an egg component there, right? There is for sure, yeah. But I understand what you mean, like don't rely on that for that. You should be knowing as much as you can. This is as it is. It's a tool to help you uncover it, but
Starting point is 00:35:11 it's amazing how much it brought that knowledge to you. And the end result is that more people are now talking. Those silos are going away. I know for several, for many years now actually, there's always been the idea of, well, if I'm going to update my API, not only do I have to know what I'm communicating to downstream, I need to know who my consumers are upstream so that they know that I'm making a change, that I'm not impacting them, and that I need to let them know that I'm making a change, and whether or not I have to be backward compatible for how long, and all that stuff, right?
Starting point is 00:35:49 And you can only do that if you have that awareness. So if you're a great organization and everybody knows this stuff, fantastic. If not, you have these tools to help you figure that stuff out, right? Because again, if you're coming in years later, you have no idea what that is. You mentioned that legacy stuff. The reason why I think so many things are that is you mentioned that legacy stuff the reason why i think so many things are dependent on a restart in legacy stuff like that is because who knows it well enough to even like i don't want to touch it we were talking about i know i'm rambling here but
Starting point is 00:36:13 like we were talking about mainframe recently on a couple of other episodes where it used to be like don't even breathe near the mainframe right now things are getting a little bit more modernized and people are starting to take more risks with mainframe but whenever you have anything legacy it's like's like, oh my gosh, if I touch it and that stops working, we don't know what's going to happen. The issue of a lack of inherited knowledge is real in super legacy stacks like a lot of the stuff we deal with. You mentioned mainframe. Manulife was on a mainframe. Here's a spoiler alert for you folks. Most financial services and insurance companies you
Starting point is 00:36:45 deal with are on mainframes and there's a good reason because those things are solid i mean you can run like docker on zos now like that stuff's crazy um but yeah one of the things that we're constantly dealing with and especially at a large org like this, where we're tens and tens of thousands of people in Telus, is with these 10, 15, 20-year-old stacks, you're losing all of that knowledge. As teams are sort of adjusted, moved around, anybody who's dealt with a large enterprise
Starting point is 00:37:18 knows that a reorg is, there's always a new reorganization happening around the corner, and that knowledge just gets lost. So yeah, as you said, the fix becomes, you know, don't release changes. The fix becomes don't breathe near this or it'll fall over. The fix becomes like, nobody knows how to log into this. We are just waiting for the day when it explodes and people find out the impact. It's a constant
Starting point is 00:37:42 challenge we're working with. I think it's one of the things that observability tooling can help alleviate a little bit. It definitely can't fix it. Um, but if you're, if you're at the very least able to get a better understanding of, okay, nobody's touched this host in three years,
Starting point is 00:37:57 let's see what it's talking to at the very least. Let's see what the code is doing. Um, it, it can definitely help, but it is a constant challenge. For sure.
Starting point is 00:38:09 I have a couple of more questions, and we're already amazing, 30-something minutes in. I talk a lot, I know, I'm sorry. No, no, no. You spread a lot of knowledge, that's what it is. That's a better way to phrase it.
Starting point is 00:38:25 Diplomat Andy. Thank you for calling it knowledge and not the usual one. So one question I got today, or maybe it was also yesterday during my meeting with one of the other telecoms from the other country on the other side of the pond, was what is a good approach to SLOs? Meaning, what do you teach your developers
Starting point is 00:38:48 or what do you ask from application owners in terms of what are really good SLOs? What is too much? What is too little? What is the minimum? And I just want to throw this over to you. If you think about an application that you get, what is a good SLO and what is not a good as the low? So it's funny you say our application owners and developers.
Starting point is 00:39:11 And the concept of application ownership has been sort of another. That's a struggle that we're working through. And so a lot of the time, they are one and the same. You have developers who are owning an application, but without necessarily understanding the actual business impact of specifying a given service level objective. So that's sort of a whole other challenge. But it's a really fascinating question
Starting point is 00:39:38 because I think if you use Dynatrace's baselining technology with Davis... No, let's throw that in there. if you use Dynatrace's baselining technology with Davis, no, um, throw that in there. Um, the concept of a good or bad SLO, I mean, there are lots of things that go into it. Um,
Starting point is 00:39:53 you don't want, I think the, the concept of alert fatigue, um, can really go into, uh, the definition of a good or bad SLO. Um,
Starting point is 00:40:04 uh, for the, for those who are, I assume everybody knows what alert fatigue is. I'm just not going to go into it. But we have teams who are like, oh, this alert is going off all of the time and there's nothing we can do about it.
Starting point is 00:40:17 That's a bit of a double-edged sword for me. Why can't you do anything about it? And why was that alert set at that threshold to begin with? I think the concept of just setting a good SLO is to have an effective understanding of, there's back to point one from the bootcamp, what purpose does your application serve? Who are your consumers? Who are you consuming? And what do you ultimately need to do? If you are only being
Starting point is 00:40:47 called a few times an hour and it doesn't actually matter on the front end what your API is doing, maybe you're running completely asynchronously, so it doesn't matter if your API takes 10 seconds to respond. I don't want to see an API that takes 10 seconds to respond, but maybe that's okay for the needs of your application. It's always contextual. It's always all about what does your application need to do to set that SLO and then start looking at the service level indicators you can use to measure what that application needs to do. So from the business case for your application,
Starting point is 00:41:22 if you say we need to be able to sustain 300 users with a reasonable response time, first off, what does reasonable mean? That's up to an application owner to define. You can use things like Google. Google's standard, I think, back in 2010 was that a webpage should load in two seconds or less. How many webpages do you deal with today
Starting point is 00:41:44 that are still loading way, way slower than two seconds? And not even to largest contentful page. You've got websites that it's a greater than two second time to first byte, and you're just sitting there looking at nothing for two to five seconds. The key to setting a good SLO is really just having that understanding, contextual understanding of what your application needs to be doing. So if you say, all right, we know that on this front end, we need to be able to service 300 users at a time, and they all need to be able to load this web page in, you know, one and a half seconds. So immediately, you know that your indicators are going to be, well, how many people are on the page or on the site right now? And what is my current response
Starting point is 00:42:30 time? How quickly am I serving up these pages? And from there, you can say, all right, at what point do we start to worry about this application? SLAs are where you started getting into like, at what point are we legally liable? Or like, we have to start paying back money. But to me, an effective SLO toes that line of when do we start to SLO is how do you define within a given application and your applications needs, how early or how late to be notified for anything. And, and one of the things I really try have tried to stress with people in teaching the subject matter in the past is I think a lot of people think of
Starting point is 00:43:24 SLOs as a set and forget. It's like you set it once, you're good for ages. No, absolutely not. SLOs can be fluid like the rest of your application. If you suddenly, you know, you know you're making a release that is maybe going to slow down response time, but that is okay still as defined by your application owners and within the bounds of what
Starting point is 00:43:45 your application needs to do, then maybe it's okay to loosen those SLOs a little bit. Are you getting over-alerted for something over which maybe, again, you have genuinely no control? Maybe you're calling something further downstream. You can maybe loosen those SLOs. In turn, maybe you've now put in a change. You put in caching. You've put in some sort of reduced amount of downstream calls. You've sped up your application otherwise. Maybe you just removed some terrible old logic. And now maybe you can tighten those SLOs. You can say, all right,
Starting point is 00:44:14 we're going to hold ourselves accountable to a higher standard. I think the other part of that to me is that your users start to expect what you deliver. So if you are suddenly starting to deliver an API that was responding in two seconds and now responds in 500 milliseconds, you bet your users aren't going back to an API that responds in two seconds. Nobody wants that. And I think that can be a bit of a dangerous point when you're playing with it at that
Starting point is 00:44:48 level, but I think the real point is it's all contextual, it's all on what your application needs to be doing, and it can change from day to day. Or not day to day, but you know, it can change. I wanted to ask, though, on
Starting point is 00:45:04 the SLO part, right? I hear people debate all it can change. I wanted to ask, though, on the SLO part, I hear people debate all the time on where SLO should be set. Obviously, your end user is the number one key factor. It should be about the objectives and goals of the workflow, the organization,
Starting point is 00:45:20 the purpose of the app. But let's say you have 20 different APIs, 20 different services you're calling. Before I even go there, let me take a step back and say, good SLOs require good observability. I'm not saying that as a plug for Dynatrace, because I think people fall into the trap of setting an SLO
Starting point is 00:45:40 as a monitor threshold, as opposed to a real SLO. Like, oh, I want to know when my CPU is over 80%. Well, why? That's got nothing to do with anything. Is your application still responding well when your CPU is at 80%? Then fine. Yeah, who cares, right? Care about that.
Starting point is 00:45:56 But then if you look at these different APIs, right, these teams can see themselves as, this is my application, this section of it, and anybody upstream is my customer, my end user. So do we look at SLOs below the actual end user? Is it appropriate to have SLOs at different API levels or different service levels? I want my service to be up and running this amount of time. I want my response time of my service to be X amount. I want it to be able to,
Starting point is 00:46:27 even just using those basics, handle 300 concurrent calls at the same time, if it's a one-to-one relationship. Or is that going too granular? Or does it depend on many other factors? I think it depends on a lot of factors. The main thing, and I'm glad you called this out,
Starting point is 00:46:43 is what is the customer actually experiencing? Define an SLO that means something. Because yeah, you can 10 calls down in your API stack. Sure, you could say, all right, we're going to track response time and we're going to alert when the response time goes off the rails. But the real point to setting these is when does it actually become an issue? When do people need to care that you're doing this? Because otherwise you're just generating noise. And that great example you gave of, okay, we're going to alert when our CPU is over 80. Why? What's the impact of that unless you know that you know cpu goes over 80 and that is a precursor to you know a known previous incident or something then okay maybe that's a valid sort of thing to set but outside of that if you are trying to set all these you know granular uh slos on all of your
Starting point is 00:47:40 downstream apis without necessarily taking the time to understand what your end user impact is or what your overall application state is, then you're missing the point. You're missing the forest for the trees. Yeah, and I think the issue with that 80% one too is if you know that that's a precursor to something, well, then fix that. Or it's like, okay, that's a problem.
Starting point is 00:48:01 Well, then what are you doing about it? Or at least set something up like... Implement regular restarts. Exactly. Or maybe scaling has to come into it at that point and automate that scaling. Oh, I need to know so I can add a new instance. Well, why are you adding a new instance manually when that's happening? Right. But I think the bigger challenge we face all the time in this, to me, and as you were talking about this, and this goes back to what you said earlier about the adoption of SRE and it's still new, I feel like everyone in IT is always scrambling. Everything is always an emergency.
Starting point is 00:48:33 There's never like a normal running state. It's, oh my gosh, this broke. We've got to get it fixed. Or, hey, there's this new feature. We've got to hurry up and get it out, and people have to scramble to get it out. It's part of the challenge I see with this SRE adoption is organizations giving people the time and the ability to do this, right? And again, we see a bunch of different customers all the time. I still put SRE and the things you're talking about,
Starting point is 00:49:01 not quite in that unicorn phase because there's more than just the Googles doing it, but it's still the fancy people doing it. It's the people who make a billion dollars or more and I don't mean literally, but it's a fancy state to be in. And the people we see day to day are scrambling. They're working in tech stacks that were not chosen for a purpose. They were chosen because we want to move to Kubernetes or we want to be all serverless because it was just a decision. We're going to do SRE, but no one's taking the time to teach them proper
Starting point is 00:49:35 SRE, so they're doing weird SLOs and then expecting we need all the support for this stuff. Really, the only word that comes to my mind is a scramble and one of the questions i was thinking of asking before was how do we and i don't know if we have an answer if this is some big existential question how do we get it so that people can actually implement these things and take advantage of what there is out there to make all this stuff easier. I mean, that feels like an existential question, and again,
Starting point is 00:50:12 maybe that's another I'll call back in five years with the answer. I think one of the things that I have tried to espouse is that in your regular sprints, because you're right, it doesn't matter if you're purely dev, if you're ops, if you're devops, because you're right, it doesn't matter if you're purely dev, if you're ops, if you're devops, if you're whatever.
Starting point is 00:50:29 I've described SRE as being that wonderful devops infinity symbol as if you imagine a dumpster that's on fire that's just slowly going around the infinity symbol with you. That's us. Because there's going to be breakpoints
Starting point is 00:50:44 at literally every possible point in that cycle. And that's SRE. It is like the truck crashing into the DevOps symbol and just breaking it all to pieces. I think I have tried to say that if people aren't prioritizing reliability in their application, like you've got on purely the dev side,
Starting point is 00:51:04 you've got people like, all right, we need to get this new feature out in this sprint. Now've got, on purely the dev side, you've got people like, all right, we need to get this new feature out in this sprint. Now the product owner's coming to us and we need to release this in the next two weeks. And it's just a constant, all you're doing is feature, feature, feature, feature, feature. But at the same time,
Starting point is 00:51:17 all you're building up and behind is technical debt and a lack of reliability and a lack of understanding of what your application is doing and why. So for me, if I can convince product owners to be prioritizing reliability stories, reliability and technical debt stories at the same level as they're prioritizing new features, then I've done something right. It's not always that easy because I get like money is the ultimate arbiter here. And,
Starting point is 00:51:47 you know, if we can't say, you know, implementing this story in this sprint is going to save you X dollars as opposed to, Hey, if you have this new feature, you're going to make a cool mill.
Starting point is 00:51:57 That that's a, can be a bit of a tough sell sometimes where I think the, the real benefit comes in is back to when we were doing that boot camp and we were talking with developers. And a lot of the same stuff applies with what we've been doing at Telus. Get in with the developers and you are golden because then they can talk to their product owners. They can talk to their business leaders and say, oh, hey, this isn't operating how I want it to operate based on what I've learned through this SRE practice. Let's prioritize this. And then it's less about, oh, the site reliability office is telling me I need to go do this or else.
Starting point is 00:52:38 It becomes more like an internal ideation. One of the things, and we talked about this at Perform, was putting the tools into the hands of the developers to make this easy. Because yeah, implementing good observability takes time, or it can take time, and getting that understanding. So we are trying desperately to self-serve as much as we can. We have these templates that we're leveraging with Backstage and Monaco.
Starting point is 00:53:08 We have some other ones on Terraform from the Telus Digital side of things where essentially people fill out eight to 10 fields, like what's your repo? What environment are you targeting? What's your CMDB ID? Fill it out, submit, and suddenly there's a PR against your repo, and you've pushed a
Starting point is 00:53:28 default set of alerting dashboards, monitoring, etc., to your services in Dynatrace with the click of a button. That is the real hour to me. If you can get your developers on board in a language that they, and in a way that
Starting point is 00:53:44 they understand. You're going to realize the biggest possible wins. Awesome. Speechless. I thought so. Speechless or Andy's just tired. It's a combination of both, probably a little bit. It's a little late here.
Starting point is 00:54:01 But I want to actually bring it to a close. But I have one more question for you because I really love the bootcamp idea and what we are planning and you've been at Perform, we always have hot days, hands-on training days, and one of the things we've built for Perform was a hands-on training for platform engineering and we now brought it to a GitHub repository where in the lightweight version where we built it backstage and Argo and everything that you can just launch in five minutes and it stands up a reference
Starting point is 00:54:32 IDP and we're actually planning to do more of these to really teach people on best practices and eventually we also want to do maybe some type of certification like you're certified in building resilient apps because you're a developer and you learned
Starting point is 00:54:47 how you do proper logging, how you do this and this and how you set proper SLIs and SLOs. So this was just one of the things we're working on and I would love to collaborate with you because you have obviously, yeah, I will definitely. But my last question is, you mentioned
Starting point is 00:55:04 that you did the boot camp back in your previous job have you run boot camps like this at telus as well or are you planning to so it's not something we've had the opportunity to do at that level yet it is still something i am very much pushing for and would like to do though um i think it's it's a tough sell to a lot of leaders, very understandably, to say, I'm going to take a dozen of your developers for a week and they aren't doing their regular job. So it's building out the value proposition for that is the biggest challenge. Because I need to be able to say, look, at the end of this, if you give me four cohorts of 10 students each, I know ultimately I'm costing you how many tens or hundreds of thousands of dollars because of
Starting point is 00:55:51 that. I need to be able to define exactly what wins you're going to get out of that. So yeah, it's something I'd love to do. But yeah, building out the business case for that is probably the biggest challenge for me. I took so many notes. I think I can probably write a novel almost. A romantic fiction about an SRE romantic fiction. Yeah, most likely. Yeah, yeah. I will figure something out maybe I use my friend to make it even nicer but now
Starting point is 00:56:25 authors everywhere love you for saying that yes maybe you can write a column for the India Times while
Starting point is 00:56:32 you're down there maybe yeah exactly actually what's on the news
Starting point is 00:56:35 today let's see privilege to be a part of Utah Paradeesh Grove
Starting point is 00:56:41 doesn't it say Andy Grabner arrives in India today is that the
Starting point is 00:56:44 headline no it's no, unfortunately not. Austrian developer and near-death experience with TukTuk. Yeah. Nothing exciting. I'm going to plant the seed here publicly, but I think
Starting point is 00:56:59 you talk about this idea, the boot camp and all that, and you're talking about hot days. I think it would be, what do we usually do, a two-day hot day period? It'd be interesting if there was an actual two-day hot session where it was a two-day boot camp on using an observability tool like Dynatrace to set up an SRE practice, but almost like a scaled version of your week-long practice, if that could be fit to two days, don't say like a four hour session. It's like this two day session you sign up for, you know, and I cannot tell you how down I am for that. Like, hey, hit me up. It's give me a give me a give me
Starting point is 00:57:40 a ticket and a flight out to out to Vegas for performance 2025. And I am there. Yeah, I mean, we used to we used to have a week long, we call it a ticket in a flight out to Vegas for Perform 2025 and I am there. Yeah. I mean, we used to have a week long, we call it the autonomous cloud lab where we actually showed people in a week how to monitor a monolithic app, how to modernize the app. And it was like pre-pandemic. That's very cool. And then maybe we should resurrect that idea.
Starting point is 00:58:03 Yeah. May I propose another even smaller one? I didn't touch on this earlier, but for that bootcamp that we did, we also had a one-day product owner version where we were talking about why it was so good. One, for the developers to be learning about this, so it was sort of the sell job on them.
Starting point is 00:58:21 But two, what our business partners and what our product owners could also get out of Dynastrace. It's not just a tool, or observability in general in these practices are not just for SREs. They're not just for developers. You as a product
Starting point is 00:58:35 owner can now suddenly have a very easy red-green understanding of what is my application doing? It's a power that they have. Is my feature adopted as fast as I thought? Exactly. How's my app rollout going?
Starting point is 00:58:51 These are easy questions to answer now. And hey, if it frees up developers from answering these questions, then all the better. Yeah, I think education is the key with all this stuff, right? And to do an anti-segway, that's why we have wonderful guests like you on, to help our listeners become educated on this stuff. I think it's the most important factor.
Starting point is 00:59:16 And people who listen are probably sick of hearing Andy and I talk about how we selfishly run this podcast so we can keep learning. But I think it is the key, right? It is when you think about even the complaints I was talking about before, about everyone scrambling and it's like, how do you get people to do this? If people start understanding the value and the benefits of these things, that's what's going to finally get them to maybe pause and say, all right, maybe I can give up my developers for the week to go to your boot camp or even just two days or anything. Let's start down this path. Let's start getting small wins,
Starting point is 00:59:50 you know, and just to remind people, it doesn't have to be all or nothing. Start small and get a little bit and a little bit and a little bit. And yeah, Andy, I don't know if you had any other questions or Dana, if you had any last ideas you wanted to get in or i don't i don't think so i think i'm i am idea to how you have you have trained me of all of the ideas i have and uh i have them all written down and i'll do one more close talking to my my fancy mic here to to give you that uh public radio kind of voice there this is npr is there? This is NPR. Here at CBC. That's right. Yes, yes, yes.
Starting point is 01:00:33 Well, again then, thank you for everyone for listening, and Dana, really thank you for your time. Thanks for having me. It's fantastic. It's wonderful. And we hope everybody enjoyed it as much as we did, and we'll talk to you all next time. And Andy, enjoy your time in India. And next time, if you take another tuk-tuk, I would hope you can have enough courage to pull your camera out
Starting point is 01:00:55 and get a video of you going through. Every minute of him screaming. Every minute of him. I will make a post-it later on. I would love to see that. Yeah. may post it later on on social media somewhere. I would love to see that. Yeah. All right.
Starting point is 01:01:08 Thanks, everyone. Thank you. Thank you. Bye. Bye. Bye. Bye. Bye.
