PurePerformance - The SLO Dilemma: Slight Reliability Discussions with Stephen Townshend

Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to Pure Performance. My name is Brian Wilson and as always I have with me my fantastic co-host Andy Grabner. Andy, how are you doing today? I need a thesaurus, Andy. I can't come up with good alternate descriptions for you. Fantastic, wonderful, lovely. I'm good with those, right?

Starting point is 00:00:45 Despicable. If they're accurate, that's all good, unless you fake it. But I think you know, I'm wonderful. I'm actually looking forward to tomorrow, though, because tomorrow is my last day of traveling for a while. And it's been a lot of traveling in the last couple of weeks. And as much as I love going back to travel, I also love

Starting point is 00:01:10 coming back home, because in the last couple of weeks traveling by train, which was mostly what I've done, wasn't as convenient as it used to be. I think we in Europe have a lot of challenges with trains these days, and with flights, and with everything. Yeah. So let's put it that way. You're in a blue room today. Where are you? Stuttgart, Germany. I'm in Stuttgart, Germany, yeah. And it's actually funny, because Stuttgart is the home of Mercedes-Benz, but I'm sitting here next to, I think, one of the largest infrastructure projects in Germany, which is renewing the whole main train station and the whole area around it. It will be great when it's done. Until it's done, it still takes some time. But I

Starting point is 00:01:48 want to do one more thing here, kind of a segue. In the last couple of weeks, I really had a lot of bad delays with flights and also trains, which means they did not meet their SLOs. Because for me, a great SLO, a service level objective would be that trains

Starting point is 00:02:03 are on time 99.99% of the time. Especially in Germany. Especially in Germany, exactly. But talking about SLOs, we have a guest today whom at least I have known for a couple of years now. And he is, I think, as big an advocate for site reliability engineering and SLOs as I would like to be. But I don't want to introduce him to the audience; I

Starting point is 00:02:33 want to give him the chance to introduce himself. Stephen, how are you? Who are you? Welcome to the show. Hi Andy, thank you. Yes, my name is Stephen Townshend. I am technically a site reliability engineer; that's my title. I'm part of an enablement team within IAG, which is an insurance company in Australasia. I was in performance engineering for 13 years before that. And I talk a lot online and share my learning journey, because it was a bit of a shock going from performance engineering, where I felt like I knew what I was doing, to SRE, where I felt like a complete beginner. And I thought, I can't pretend to know what I don't know, but I can say, hey, I don't know anything.

Starting point is 00:03:19 Come learn with me. That's what I've been trying to do. Well, welcome to the club because both Brian and I, I think we were seasoned performance engineers in our career and then we started the whole journey towards observability and then we started the podcast and now we try to be smart

Starting point is 00:03:35 on what we're talking about, which, basically, we don't have a whole lot of clue about. No, I don't want to jinx it. We bring smart guests on the show to make us seem smart. Exactly. But then we learn from them. So you're our smart guest that we're going to learn from.

Starting point is 00:03:48 I was going to say, when's the smart guest showing up? But Stephen, we will obviously make sure that people know how to get a hold of you and see what type of content you actually produce. And one of the things that you are doing is you have a YouTube channel. It's called Slight Reliability. And I think you're doing, I don't know, regular shows. At least it feels like, you know, at least once a week things are coming out.

Starting point is 00:04:14 What motivated you to do this? It's actually a podcast. So although there's a YouTube channel, it's also available on most podcast platforms as well. And when I was in performance engineering, I actually had one called Performance Time, and I basically rebranded it. I thought, let's change it to make it SRE focused. And like I said, the motivation changed. With Performance Time, I really wanted to provide more of an educational resource, but also show what I wasn't seeing a lot in the

Starting point is 00:04:45 technology industry, which is sort of empathy and humanity, and just talking about the human beings who do this really complicated work and how they get through it, and, you know, how they problem solve, and their creativity. So that's what that original intent was, and I tried to carry some of that through. But like I said, the focus now is also on just having this honest conversation of, I have no idea what's going on, let's try and make sense of it. And then... So first of all, folks, if you're listening, we will link to all of the content that Stephen is producing. But the reason why we're now actually on the call, I mean, we should have had you on the podcast a long, long time ago,

Starting point is 00:05:25 but there was a LinkedIn post, and I am just looking at it. It was our friend Scott Moore posting a blog post from Dynatrace talking about seven steps to identify and implement efficient or effective service level objectives. And then you made a comment: possibly controversial, but I think that if your organization or team does not value reliability, for any number of reasons, then I don't think defining SLOs or SLIs should be the priority. And then it was going back and forth between you and also one of our colleagues, Saif, and then I chimed in.

Starting point is 00:05:59 And I thought this went into a really interesting discussion, because you basically said, you know, don't put the pressure on people and kind of force SLOs on them without, I think, good ownership, and without actually changing the people, the culture, the processes as well. And this kind of reminded me that every time I talk with organizations and try to educate them on the value of SLOs, I kind of feel, sometimes silently, but I sense it, that people like the concept, but then they also say, we don't know how this should work in our organization, because we have so many other things to do, and then we always have the

Starting point is 00:06:42 pressure of delivering more features. And how can we deliver more features and yet have better reliability and better performance? And so I would like to give you a little bit of air time now and ask: what is your perspective? What do you see in your organization, and the organizations that you've talked and worked with in the past, about telling them about SLOs? Where are the problems in actually implementing SLOs? So I'll speak about IAG, where I currently work. It's a reasonably large company for Australasia, 13,500 staff, big history. It's sort of built out through acquiring other smaller companies

Starting point is 00:07:22 and becoming one giant company. So imagine all the different technologies, different cultures, different teams and people coming together. It's got a lot going on. And so that's kind of the context. When we first formed this SLO enablement team, we had a team come to us immediately and say, we want to do SLOs. And we're like, great. And they said, you're the experts, right? And we said, yeah.

Starting point is 00:07:48 And then we went away and quickly tried to read and learn as much as possible. And we just started experimenting with them. And we created this interactive workshop where we would talk about their customers and their key services. And then we identified what they wanted to achieve and came up with indicators, and then actual thresholds or objectives as well. And the conversations were great.

Starting point is 00:08:12 But at the end of it, what we came up with just didn't feel connected to anything real. It just didn't seem to be enough. And I understand that part of SLOs is not just about coming up with a number. It's about the way you adjust your culture and mindset around operations and what you actually do. You alert on different things. You focus on different things, because it's about whether the customers can use the service, not technical stuff.

Starting point is 00:08:38 But it just wasn't quite working. And we tried that with the second team. And again, the same kind of thing happened. And that's when we had that alternative. We pivoted and said, look, what are we trying to achieve here? What is it we're trying to achieve with SLOs? Because I guess it wasn't clear what the overall business objective was. And that's something I want to talk about later.

Starting point is 00:08:59 So that's when we said, well, what's the goal? The goal, we think, is really to help teams understand their customers better, and how technology changes impact those customers. That was the first thing. And then, in a wider sense, we thought our team goal really is to make the lives of our customers and our colleagues and team members easier through the lens of reliability. So that's what our team's about. How can we do that? And the SLO definition, in the current context of the organization, wasn't achieving that fast enough. And that's when I got thinking: you know what, I think there might be prerequisites, or things that will set you up for success with

Starting point is 00:09:41 SLOs. And since those initial conversations, I've spoken more about it. I shared six different potential prerequisites with you. I've got another three here, which I haven't told you about yet. I'm really starting to think that there are a lot of things where at least some of them need to be in place, I think. Or at least if you have these things there, it's going to make your life a lot easier. So I need to ask you a quick question. In the very beginning, you said that in your organization you started forming the SLO enablement team.

Starting point is 00:10:17 Why was this team established? We had a pretty senior leader in the organization who had come from Groupon, I believe. And so he had seen SRE and different ways of working come to fruition and really provide a lot of value. And so he wanted an opportunity for IAG to explore that. Okay. Because this was kind of the initial thing for me, because I also had a similar situation when an organization came to me and said, hey, we're now doing SLOs. And I said, that's great, but why? What was the motivation behind it? Oh, because I heard it from Google. That's not the right answer. And so I was just curious on your side. Now, you said, with the

Starting point is 00:11:04 first and the second team... Well, let's focus on the first team where you tried it. You said you sat down, you defined the indicators, you defined the SLO objectives. What were these, and why didn't they work? Um, remember those? I don't know if I can. I might remember better. So the first team was actually a team who runs a Kubernetes PaaS, which other teams all over the company host their stuff on. And it's quite big and quite important to the company. And so there was this kind of unusual context, because we had to talk to them and say,

Starting point is 00:11:37 who are your customers? And they were thinking about the end customer. And I was saying, well, actually, the customers you have control over are the other teams at IAG. So that was a mindset shift. And so one of the SLOs we came up with was around the success of deployments or rebuilds or something, because it's really about the stability of that platform and making sure that people can update their code when they need to. And, you know, it was a bit of a surprise there.

Starting point is 00:12:03 It wasn't about the end customer and latency and those kinds of metrics. So that was one of them. Yeah, because essentially that team is like... it's like if I would now build an app, I would probably go to Amazon or Google or Microsoft and basically, you know, use their services. So they have a certain SLA with me. And in your case, that platform team is kind of like a SaaS provider or a cloud provider to your internal development teams. And therefore, you want to make sure that the platform is up and running. The deployment is an interesting one, right? Like how often, how long does it take to deploy maybe an update

Starting point is 00:12:41 and how stable are the systems that are running on there? Or I guess, how fast can they deploy? That's good, good metrics. And why do you think this didn't work? Because you said it didn't quite catch on. I think, in that case, it was a lack of experience with it within our little SRE team, which at that time was literally two engineers, and a lack of understanding of the follow-through. You know what I mean? We didn't talk about... okay, we did sit with them, and my colleague actually did operations with that team for three months just to learn what they do,

Starting point is 00:13:15 but we didn't adjust that operational process at all. And that was what was missing, right? The, okay, we have these SLOs now, let's start focusing on those, and maybe stop alerting on the 400,000 other things that we're currently panicking about all the time. Maybe. I don't know. Maybe you know more about how to transition into SLOs if you have a huge backlog of technical alerts and metrics.

Starting point is 00:13:43 I mean, for me, the question would be, again, focusing on this team. Let's assume you focused on the number of deployments or how fast you can deploy. Is this a number the team came up with themselves? Or did they also say, hey, we actually confirmed this as a requirement from our internal customers, and they are actually not happy if we cannot deliver? So I think coming up with an SLO that makes sense for me as a software producer, a service producer, is great, but if my customers actually don't care about this, then why should I care about it? So my question would be: A, have you monitored it? B, have you promoted this

Starting point is 00:14:26 number to the other teams that were using the Kubernetes platform, so that they can actually see how successful that platform is and how things have changed over time? Because if the customers start caring, and if they think this is actually a good number or a good SLO, then I think it becomes a different thing. Then you also, as a platform provider, become more responsible. I think it then actually makes more sense, because you then actually see that you have an impact, or that you have a negative impact on those internal customers if you don't meet your goals. And this is sometimes what I've seen: we come up with SLOs just for the sake of having SLOs, but in the

Starting point is 00:15:06 end we don't care if we don't meet them, because it doesn't have any impact if we don't meet them, right? And I think this is the big challenge with just creating SLOs. And, I mean, it was not you, maybe it was somebody else, I had a heated discussion, and he said, Andy, don't go around the world and tell everybody they have to define SLOs for every service, because SLOs are not about defining SLOs for every service. And I said, you're right, because really what you need to do is define SLOs where it really matters. And what really matters is the end user. And the end user might be an end user that uses your services, or it might be an internal team that tries to deploy their apps on the platform, and then they're expecting a certain

Starting point is 00:15:51 level of service. But just coming up with artificial numbers doesn't help anybody. Yeah, I fully agree with you there. I think that's a mistake that we made. It was internally developed within that platform team, the team that ran the PaaS, right? We should have had representatives of our internal customers to bounce these ideas off and say, hey, this is what we're doing. What matters to you? Yeah.

Starting point is 00:16:16 And that's actually one of, yeah. You go? Yeah. Yeah. And maybe one additional thing, because in the platform team, it could be interesting. You could say, hey, we are the IAG platform team. We provide you Kubernetes as a service.

Starting point is 00:16:30 You can go with us and we provide you this type of availability and this type of speed. If you don't like this, you can go to AWS, Google, or Microsoft, but we don't want you to, because we believe we provide a better service for you. And I think you then need to kind of try to figure out who is the competition, right? Because if there's no competition, if the mandate is you can only use that platform, then there's also less pressure on you.

Starting point is 00:17:01 And I think that, yeah. There was, there pretty much is no competition because the reason it was built is it's an insurance company, very sensitive about where data goes. It's a huge process to get a cloud service approved and working and even, you know, especially if there's customer information involved,

Starting point is 00:17:23 which there absolutely is with the systems hosted on this. So there's very little competition for this particular platform. So, yeah. Yeah. Cool. But then you said you came up with prerequisites, right? So that means like you learned things that don't work well or tried and it didn't succeed, but now you have prerequisites.

Starting point is 00:17:47 What are these? Do you want to share this? Sure. I will say, just before that, that there was a third encounter where I've been asked to help develop SLOs. And that focus was: hey, we've got a gigantic, massive, complex program of work. We're currently doing all these non-functional requirements around performance and reliability, and we tick a bunch of boxes before we go live. Can we move to SLOs? And I thought initially, that's great. And then as I started unraveling the complexity of the program,

Starting point is 00:18:14 I thought, wow, this is going to be really difficult, for a number of reasons. And that's where I came up with the prerequisites that I'm going to talk about today. So I think the first prerequisite, and it's probably not a surprise, is there needs to be a certain level of observability in place. You know, it's pretty obvious, right? If you can't see and measure the reliability of what's there now, and if you can't track the status of your SLOs after the fact, then you can't really do much with that, right? Flying blind and just gut feelings, yeah.
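To make "track the status of your SLOs" concrete, here is a minimal Python sketch of what observability data has to provide: counts of good and total requests over a window, from which an indicator (SLI) and the remaining error budget follow. The `SloWindow` shape, the function names, and the 99.9% target are assumptions for illustration; nothing here is taken from how IAG or any particular tool does it.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (hypothetical data shape)."""
    total_requests: int
    good_requests: int


def sli(window: SloWindow) -> float:
    """Fraction of requests that were good; 1.0 when there was no traffic."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SloWindow, target: float) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        # No budget at all: either nothing failed, or the budget is blown.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Assumed example numbers: one million requests, 600 of them bad.
window = SloWindow(total_requests=1_000_000, good_requests=999_400)
print(f"SLI: {sli(window):.4%}")
print(f"Budget left at a 99.9% target: {error_budget_remaining(window, 0.999):.0%}")
```

Without the two counts in `SloWindow`, neither function can be computed, which is the point of the prerequisite: the measurement has to exist before the objective means anything.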

Starting point is 00:18:51 Second is, I think teams need additional time and space set aside. So in the context of this program, there's a lot of focus on delivering features. The teams I've talked to are under the pump. You know, I said, how busy are you? One team said, 12 out of 10. I'm like, okay, well, how are you supposed to engage with this culture change if you don't have sufficient time and mental space, right? The third prerequisite I've been thinking about is just valuing reliability, or actually just

Starting point is 00:19:22 valuing quality in general, in the wider sense. So if you just focus on, we've got to get these features over the line, and that is the most important thing, then, well, if you have to get something done by a deadline, what's the thing that's going to give? It's going to be the quality of what you're delivering, right? And I had a thought recently about that. I think in a really large organization like a cloud provider, it's a no-brainer

Starting point is 00:19:51 that the reliability of the services and the customer experience is paramount. I totally 100% back that. But I think perhaps maybe in the realm of smaller organizations, it might actually be that the bigger value is to improve the lives of the staff who operate the software

Starting point is 00:20:08 and the cost of maintaining or operating unreliable software, if that makes sense. That's a subtle difference. That makes a lot of sense. Yeah, because if I understand this correctly, in a smaller organization, you have fewer resources, and therefore you need to focus on reliability and efficiency right from the start,

Starting point is 00:20:33 because otherwise you're just burning the few resources you have on this kind of trade-off between technical debt and innovation. I mean, I think in general, cost and efficiency and sustainability should be a topic for big and small organizations. But I think for very large organizations that have a lot of staff or more resources, and maybe even more money in the bank so that they can survive for a little longer, it's easier to be a little less efficient and still make it over

Starting point is 00:21:04 the next quarter or something like this. Yeah, that's a good one. A couple of thoughts here, because I remember, going back to the LinkedIn post, this was exactly one of the arguments you actually made, because you said, why do you talk about SLOs? For you, it should be about reliability. And I then said, for me, SLOs are a measure of reliability, because if I have an SLO on availability, it means my system needs to be reliable in order to be available. And if I have an SLO on performance, my system has to be reliable, because otherwise I cannot deliver the performance under different, you know, factors of load. But I want to say something about the cost, because this week I had several conversations

Starting point is 00:21:49 with our customers as I was traveling through Germany. And we all know that we have a big energy crisis here, at least, I mean, I'm sure in all of the world, but in Europe, we feel it quite a lot. We're spending too much money on inefficient things, and now, for everybody, sustainability and cost efficiency is a big topic. And I really had conversations with organizations this week about defining SLOs on costs, cost per feature, cost per app, cost per service, and actually driving it down, because people

Starting point is 00:22:26 realize that as we moved to the cloud, everything seemed like endless scale, and we didn't have enough focus on the costs. And now people are realizing that we're wasting not only a lot of money but also a lot of power. And we don't have endless energy here. And that's why I think every company wants to be more green, more sustainable. And we also discussed: can we use SLOs as a vehicle to, in the end, reach our performance, our reliability goal,

Starting point is 00:22:59 but also bring down our costs? And I think, we're all performance engineers. If you have performance hotspots in your apps, you know, if you can fix a CPU-intensive algorithm, it not only makes your system more performant but also more efficient. And it's also important. So you're talking about, like, multi SLOs, so not just my feature can complete within two seconds, but my feature completes within two seconds and I don't need 24 cores to run it.

Starting point is 00:23:32 Exactly. You might be able to get something up to speed with enough power, but what you're saying is you also need to get that power down, and you can't use brute force to get it through, because that's going to be the cost of the energy, just even the dollar cost of the cloud. Exactly. We've talked about this in the past a little bit.
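The "multi SLO" idea, fast enough and not brute-forced, can be sketched as one check over two measurements. This is a minimal Python illustration under stated assumptions: the two-second latency limit comes from the conversation, while the 8-core limit, the p95 choice, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """What was observed for one feature (hypothetical field names)."""
    p95_latency_seconds: float
    cpu_cores_used: float


@dataclass
class Objective:
    """Limits that must all hold at the same time."""
    max_p95_latency_seconds: float
    max_cpu_cores: float


def evaluate(measured: Measurement, objective: Objective) -> dict:
    """Check each dimension separately, then require all of them together."""
    results = {
        "latency": measured.p95_latency_seconds <= objective.max_p95_latency_seconds,
        "efficiency": measured.cpu_cores_used <= objective.max_cpu_cores,
    }
    results["overall"] = results["latency"] and results["efficiency"]
    return results


# Two seconds is from the discussion; the 8-core limit is an assumed value.
objective = Objective(max_p95_latency_seconds=2.0, max_cpu_cores=8.0)

# A brute-force run: fast, but only because 24 cores were thrown at it.
brute_force = Measurement(p95_latency_seconds=1.8, cpu_cores_used=24.0)
print(evaluate(brute_force, objective))
```

The brute-force run passes on latency and fails on efficiency, so it fails overall, which is the distinction being drawn in the conversation.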

Starting point is 00:23:48 Yeah, what we talked about is exactly what's the euro amount of a particular feature for a particular user base. And then measuring this over time and see how it behaves. And also, not only how it changes over time,

Starting point is 00:24:10 but that you set certain goals that you need to keep it under a certain level unless you make a strategic decision that you're packing additional features into a certain user journey. And therefore, you may accept an increase by 10 cents per transaction or something like this. I wonder if we need to come up with a new term, the efficiency level objective or ELO for short, and then we could all start singing Electric Light Orchestra songs.

Starting point is 00:24:35 It would be just a wonderful day for everyone. Yeah. But there's that efficiency factor, not just the performance. Yeah. Another slight variation on that is that I think that in a company like IAG, it's not the P1 and P2 incidents which kill us,
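A cost objective of the kind described here, a euro amount per transaction that must stay under a level unless a deliberate increase is accepted, could look like the following Python sketch. The function names and all euro figures are invented for illustration; only the idea of a strategically accepted 10-cent increase comes from the discussion.

```python
def cost_per_transaction(total_cost_eur: float, transactions: int) -> float:
    """Average euro cost of one transaction over a period."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_cost_eur / transactions


def within_cost_objective(cost: float, baseline: float,
                          allowed_increase: float = 0.0) -> bool:
    """True while cost stays at or below the baseline plus any accepted increase."""
    return cost <= baseline + allowed_increase


# Assumed numbers: a 0.40 EUR baseline, with a strategically accepted
# increase of 0.10 EUR per transaction for additional features.
baseline_eur = 0.40
accepted_increase_eur = 0.10

good_month = cost_per_transaction(total_cost_eur=4_500.0, transactions=10_000)
bad_month = cost_per_transaction(total_cost_eur=5_200.0, transactions=10_000)

print(within_cost_objective(good_month, baseline_eur, accepted_increase_eur))
print(within_cost_objective(bad_month, baseline_eur, accepted_increase_eur))
```

Keeping the accepted increase as an explicit parameter records the strategic decision separately from the baseline, so the objective does not quietly drift upward.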

Starting point is 00:24:54 it's the thousands of P3s and P4s. So I think having a focus on let's understand those, let's bring them down, let's build automation, let's simplify, let's make things more reliable and robust. And let's build a culture where we know how to deal, accept failure and learn from it as well.

Starting point is 00:25:11 It's something I'm excited by. And I think it's been a sort of click in my head in the last week, actually. Okay. This is an area that has real value where I am right now. I think that applies to a lot of things too. I mean, if you think about,

Starting point is 00:25:23 you know, the death by a thousand cuts concept is what you're talking about with those P3s. And I remember a few years ago going back trying to figure out how can I decrease my monthly spend on bills, right? And I was looking for that P1 that would save me a ton of money because I'd look at like my cable bill. Like I can only cut $15 from that and $10 here and then it finally dawned on me if I do all of those now suddenly

Starting point is 00:25:49 they add up together. It's such a simple concept that gets overlooked often. So I think that's a great thing to remind people, your idea of, like, look at those P3s.

Starting point is 00:25:57 You know, how much damage are those doing in total? But in order to scale, you need automation, right? Because you cannot put a thousand people on a

Starting point is 00:26:05 thousand P3s. And that's where the automation comes in, and better observability with better, you know, context information. Stephen, you had three prerequisites so far, and you mentioned earlier three additional ones. I've got six. Six, yeah. Okay, let's move on. All right. So, ownership and autonomy. I think if teams don't have the ownership of their own service level and the autonomy to adjust it themselves, that kind of defeats the purpose of these SLOs. I think if you have an external party saying, these are your targets, then it's not an SLO. It's, I don't know, something else. An SLA or a contractual obligation. Yeah, yeah, something different. Yeah. And related to that: business and technology stakeholders embedded and working together.

Starting point is 00:26:55 You know, a representative for the customers there in the team. If you don't have that, then how can you independently develop your SLOs without going to some external parties and bringing them in? Which might work, but it's just more complicated. And the last of my original six: blameless culture and psychological safety. So it's funny that since joining SRE and hearing conferences,

Starting point is 00:27:19 these things are talked about a lot, a lot more than I expected, because obviously SLOs are kind of messy and tricky, and it's getting a lot of people to talk about scary new things at times. And that requires experimentation, the right mindset, the ability to... what's it called, the opposite of fear of failure? Joy in success? I don't know. The ability to fail and be like, that's okay. Because from my experience,

Starting point is 00:27:47 if you try and adopt SLOs for the first time, you're going to fail continually. And I think it's how you respond to those failures, which is going to be make or break. It's funny that that's not just a given yet, right? I mean, that goes back forever.

Starting point is 00:28:02 I mean, not just in modern computing in terms of if you fail properly, fail fast, fail often, right, and improve. But if you go back, like the great story about how Post-it notes were created, right? I'm sure you've all heard that story. And if you haven't, you can look. It was an accident. Like those Post-it notes, the guy was trying to create a super glue, messed it up, and then he found these were sticking and he could take them off. They're like, oh, wow, this is pretty awesome, right?

Starting point is 00:28:27 But even, you know, at least over here in the United States, we had this thing called Bell Labs, which was like this legendary place of invention where they would just take all these smart people, toss them in and they'd say play, right? And they were just screwing around, trying out all this stuff. Like a lot of the stuff never worked, but then suddenly they'd finally hit something.

Starting point is 00:28:40 Like the fact that failure has to be embraced still, or has to be taught to be embraced, it just boggles my mind. Who was it who made the quote where he said, I didn't fail a thousand times, I just found a thousand ways that don't work? I haven't yet

Starting point is 00:28:59 found the one way that works. I don't know who said that, but yeah, exactly. Cool. So I have three new ones though, right? No, I just want a quick comment. I am completely with you on all of these three and I just want to add, ownership and autonomy, I typically

Starting point is 00:29:19 talk only about ownership, because for me, ownership is important, because otherwise, as I said, if you measure something and nobody cares, then why even do it? I like the autonomy addition to it, the autonomy to adjust, because, you know, things change, and having the autonomy to set your own goals matters. I'm also completely with you on business and technology. So every time I try to run workshops now, I don't want just the technical team to come to me and say, let's do SLOs. I say, we also need the business stakeholders, because in the end, they need to tell us what the business objective is. And then from that business objective, we need to break it down into technical objectives so that we can always align every objective to the top business goal.

Starting point is 00:30:01 And for blameless culture, I think, Brian, we had a couple of episodes recently around chaos engineering, and we had Ana Medina on the call, and she also talked about this. And we had... the name slips my mind. But we had a couple of episodes where we talked about blameless culture, especially in the context of chaos engineering and kind of building resilient systems and doing game days and stuff like this, and how

Starting point is 00:30:31 the organization deals with failure. Cool. All right. My three new prerequisites I've thought about. Insert drumroll here. Top ten. The first one you just touched on,

Starting point is 00:30:47 I think is having clear business objectives because everything flows from that. And if you just go in and say, let's do SLOs, like we said in the beginning, you're not going to get anywhere that's aligned to where you want to go. I had an aha moment recently.

Starting point is 00:31:00 That was my last podcast episode where I talked to business leaders and they said, we want to do these things in the next five years we want to grow this with new customers and we want to reduce our operational costs by this much and here's some ways we think we can do that i was like wow i've never heard that before that is incredible how empowering yeah the the third one and i'm still formulating this i think that there are having the right team structure or architectural simplicity,

Starting point is 00:31:30 because those are linked because of Conway's law. Having the right architecture makes it easier. So having, for example, a decoupled architecture where components can independently be treated almost like an individual product is great. And when things are tightly coupled together, it just makes it harder to rationalize. Okay, to do the SLO, we kind of need to get these teams talking together and working together, and then communication overload kicks in and it makes it harder, I think. So I'm curious about your thoughts on that.

Starting point is 00:32:01 Yeah, I agree with you. I mean, there's also, have you heard about the book Team Topologies? I think the idea here is that you have, as you said, clear structures on what is the platform team. Like in your organization, you have a platform team that is building platform services, then you have individual value creation teams on top. And you want to make sure that these are independent enough and they have clear contracts to any other teams that they are depending on. But then you have contractual and nice interfaces.

Starting point is 00:32:41 From a technical side, you have interfaces. And from a team-to-team perspective, you can basically then agree on the contract on what this team provides and what the other team provides, and then on these interfaces or contracts, you can then, where it makes sense, also define SLOs, but then they can work independently as long as they don't

Starting point is 00:32:57 break the contracts. Yeah, definitely. There's nothing I can add to this, even though it's easier said than done. The only thing I would add to that in terms of what should go into the thought behind it, I think the concept is there completely, but when teams are considering

Starting point is 00:33:17 how are you going to design this to be decoupled, what technology are we going to use? One thing that I've actually just had a fun conversation with one of our customers about it is that you also have to think about if we choose technology X, are we going to be able to observe it and are we going to be able to observe

Starting point is 00:33:34 the components of it that we need? Because quite often what we find is people pick something, they get it running and all, and they're like, oh, now we need to start measuring it and all this, and it's something that's extremely difficult to do. Or it's either going to be a full-on homegrown DIY thing that they're going to spend a lot of time and resources on, or if they take an off-the-shelf product, there are certain capabilities that are there, but it wasn't part of that thought process

Starting point is 00:33:57 going into it. So not only do you have to think about the decoupling and that independence and ability to manipulate those different pieces and just make it run on its own, but add into that the observability side of it and even security. How are we going to do this and keep it secure too? So pulling that whole suite into that decision-making and that approach going into it. I have a feeling that perhaps the best situation to define SLOs would be if you have long-lived feature teams who can independently deliver features completely to

Starting point is 00:34:33 customers, you know, from top to bottom. That's not the situation where I am now. We have more component-based teams. Yes, that's the feeling that I have, but I haven't actually got the experience to say, yes, that works. Um, I have a problem now. I only took nine bullet points and not, uh, sorry,

Starting point is 00:34:53 only wrote down eight of your prerequisites, but not nine. What did that mean? Let's do the 10. We have one more or? One more. I've got one more to go. You have one more to go.

Starting point is 00:35:04 Okay. I thought you were ready, that's why. Okay, I thought I missed one, because I know it's getting late and it was a long, long day. But number nine. Okay, very last one, nine. It's very simple: having education, having a shared understanding, shared terminology of what SLIs and SLOs are, why they're helpful, and what the ultimate outcome is. I will probably put number nine almost at the top. I think before you have any discussion on the SLOs, this is what I try to do now as well. I want to educate people on the terminology

Starting point is 00:35:41 and why we're doing this. Because if you just throw them into a workshop and say we need to define SLOs, and then we have a different understanding on what is availability, what is an error rate, what is XYZ, then I think you end up maybe with a result, but everybody has a different understanding on what that result actually is.

Starting point is 00:36:00 And therefore, education is key. Our team is currently trying to think of a way that we can create like an SLO hub within the company, which anyone can go to to learn about it. But we want to make it actually really simple and visual and fun. So if you've got any ideas for that, let me know. Well, I know somebody. Exactly. I know somebody that is using, I think, Paint for his art. And I mean, who would that be? I know somebody that has a YouTube channel, even though he says it's just a podcast.

Starting point is 00:36:30 But I think he's very talented with educating people in a fun way. Yeah. We want to productize it, so it's not tied to my MS Paint ability. I can only draw so many pictures at a time. But you know how that came about is that I wanted to create a YouTube channel and, you know,

Starting point is 00:36:51 even for blog posts, I wanted artwork and my wife's really good at that kind of thing. And she was always too busy. So I said, I need independence. I can't draw. MS Paint it is. And that's how it all started.

Starting point is 00:37:02 Wow. Yeah. Brian, you haven't seen his artwork? No, I got to go check it out. You should check it out. I love that all. One thing I was going to ask about, and I don't know if this would be a prerequisite

Starting point is 00:37:11 or if it fits in somewhere else, and maybe what I'm going to suggest here is completely wrong, but when you mentioned right at the beginning the idea that first team you worked with, they were more the Kubernetes infrastructure and they had to focus on their customer, which is the people using the infrastructure, the Kubernetes platform, not the actual

Starting point is 00:37:27 customers. That got me thinking that your SLO should be something within your control that has a direct impact on your customers and not beyond it because anything beyond that you don't have control over. That Kubernetes team can't directly influence the end customer. So they shouldn't be defining SLOs for that. By trade-off of time spent and time spent adding up, yes, that's going to influence the end customer. But it's really more downstream

Starting point is 00:37:59 teams that are going to influence that customer. So they should focus, let's say, on their immediate customers. So should SLOs be really focused on your direct customers and only things that you have control over? And I don't know if that could be a blanket statement at all, or is that being too broad in general? And is that prerequisite, or is that just part of the concept behind everything?

Starting point is 00:38:26 I think it's spot on, because think about it: if I go to Amazon, they give me a certain SLO and SLA on their services, but they cannot guarantee that the app that I deploy on their infrastructure is fast, because that depends on how I write my app. I could never make AWS responsible for how good or bad I write my code. I mean, what they can do, though, and I think this is also, I guess, Stephen, what your team does, they can give recommendations, right? That's why they have the, what are they calling it?

Starting point is 00:38:59 The well-defined architecture. So they have templates on if you want to build apps that run efficiently on our platform, here are some templates, here are some architectural patterns. If you follow them, the chances of becoming successful and efficient are higher. But yeah, that's the way I see it. I agree as well. It's also like one of the customers I visited yesterday. They didn't call it the platform team, it was kind of like the delivery enablement team. They were basically providing different templates of pipelines and different templates of deployment

Starting point is 00:39:39 definitions for microservices that they give to their internal customers, which are the development teams. And they said, hey, if you use these pipeline templates like Jenkins, then we give you these five steps. And if you enable step one, two, three, four, then you get automated observability, you get automated testing, and then you get automated deployment.

Starting point is 00:40:01 So this is the template. We suggest you use it, but if you want to do it your own way, it's up to you. But if you do this, our platform is optimized for that. So here's a sort of conceptual challenge we were having recently: in the current solution that I know about that we were going to do SLOs for, there might be one customer interaction, and that hits components that are owned by 10 different teams. And the question is always, who has ownership for the end customer? Well, maybe it's the digital front-end team

Starting point is 00:40:33 who look after that web UI. Maybe it's them, but then they need to now collaborate with nine other different teams. And it was just hard to conceptualize. I haven't come up with a solution I'm happy with to make that all work. What I've seen with the end customer,

Starting point is 00:40:50 you typically either have, let's say, a mobile app or a web app or something. And whoever owns that interface, I think, is the one that should have the ownership of how many people are using this particular app or feature and what user experience we expect. And if their front end depends on key back-end components, they then need to basically break down, in order to deliver a certain user experience on the top, what do we

Starting point is 00:41:19 need from service A, B, and C? What level of service do we need in order to fulfill our goal? And then it almost becomes like, I need this from you, and if I don't get this from you, right, in the open market world, I would go to somebody else. Like maybe I don't take the authentication service from us internally, but I go with a public access service because they can deliver it. And so that's what I've done in some of my workshops, where I start with a top level goal on individual user journeys, like opening up a mobile app and logging in, then making a transfer for a mobile financial app. And then I say, this is kind of what they want on the front end, which backend services

Starting point is 00:42:03 are directly involved and what is kind of their contribution to it? What's kind of their performance budget or their reliability budget? What is their SLO that I need them to deliver to me in order for me to fulfill my top level goal? While you were saying that, I was thinking about, have you ever heard of an organization where you have almost like an internal marketplace with maybe two or three different teams that all develop the same thing, and you get that sense of competition and driving for excellence? Yeah, I mean, I never had this before, but I think this is the ideal world. This would most likely result in more efficient and more reliable software, because then you're competing

Starting point is 00:42:45 because you want to be the service that is used and you want people to stay with you and not go to the other cloud vendors. They should go to your platform team and not to AWS or Google or Microsoft. Very interesting. I understand it's tough if you don't have the choice of having multiple services if you are constrained.
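[Editor's sketch] The workshop approach described above, starting from a top-level goal for a user journey and asking what SLO each backend service has to deliver, can be illustrated with a little arithmetic. This is only a minimal sketch under stated assumptions: the service names and all numbers are hypothetical, the journey is assumed to need every backend call to succeed, and failures are assumed to be independent, which real systems rarely guarantee.

```python
import math

def composite_availability(availabilities):
    """Best-case availability of a journey that needs every service to
    succeed, assuming failures are independent (a simplification)."""
    return math.prod(availabilities)

def equal_share_target(journey_target, n_services):
    """Availability each of n serial dependencies must deliver so that
    their product still meets the journey target (equal split)."""
    return journey_target ** (1.0 / n_services)

# Hypothetical top-level goal for a "log in and make a transfer" journey.
journey_target = 0.995

# Hypothetical SLOs offered by the backend teams the front end depends on.
backend_slos = {"auth": 0.999, "accounts": 0.999, "transfers": 0.998}

achievable = composite_availability(backend_slos.values())
meets_target = achievable >= journey_target
per_service_floor = equal_share_target(journey_target, len(backend_slos))

print(f"achievable journey availability: {achievable:.6f}")
print(f"meets journey target {journey_target}: {meets_target}")
print(f"equal-split floor per service: {per_service_floor:.6f}")
```

The equal split is just one way to hand out a reliability budget. In practice the front-end owner and the backend teams would negotiate uneven shares, which is the "I need this from you" conversation mentioned above.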

Starting point is 00:43:06 But then you need to have other things like, you know, we need, you know, I'm sure there's some type of competition in your world. People can decide to not go with IAG, but with other organizations. And then that's the business driver, right? That's the, how many, we need to make people happy. And happy means they need to have a good digital experience

Starting point is 00:43:28 with our organization. How do we achieve this? Good. Brian, I know we both have a kind of a hard stop in a couple of minutes. So we probably need to wrap it up. I think so. I think this was fascinating, Stephen.

Starting point is 00:43:46 Thank you for sharing it so much. Before we do, was there any other point you were hoping to, you know, dying to get out on here? You're starring in a new movie and you're coming on to the late night show to talk about it that you didn't get it in? Yeah, I'm hosting EMEA's PaintCon.

Starting point is 00:44:04 You know, I'm going to give you a challenge. So for anybody listening to it, we'll see if you made the challenge. I'm looking at your pictures right now. Andy, can you send, this episode won't air until the 26th of July. I don't want to put pressure on you, but if you can send that picture you took, Andy, over to Stephen, if you could do a rendition in your style in MS Paint, yeah, we'll use that for the image for the podcast. If not, I totally understand. You've got other real work to get done and your own stuff for your bit, but it would be a

Starting point is 00:44:35 fantastic honor to have one of your original pieces of art to use for the show. It would be a genuine pleasure. I would love to do that. Would this then be an NFT? We can sell it for a really high price. We're not going to get involved in any of that. Don't get me going on those. Same. Anyhow, I really appreciate you coming on today. Andy, did you have anything you wanted

Starting point is 00:44:59 to wrap with? No, I should have said this in the beginning because I know we are connecting three different continents today yeah New Zealand

Starting point is 00:45:09 the US and Europe thanks to technology today anyway well you said you said

Starting point is 00:45:17 continents and you kind of did a mix between yeah I don't even know what that is I know

Starting point is 00:45:22 but what I wanted to say is, the technology that we used today was really resilient. It worked perfectly. For me, it met all the SLOs. And thank you so much, Stephen. Hopefully this is not the last show that we do together, because I know all of us constantly learn, and we're also those that constantly like to share the learnings. And therefore, I hope we have you back in future episodes. Thank you. Yeah, thank you very much, Andy and Brian.

Starting point is 00:45:55 I had a great conversation and I learned a few things. So, yeah. Awesome. Thanks a lot. And thanks to all of our listeners for listening to this episode. If you have any questions or comments, you can tweet us at... What's our Twitter? Pure underscore...

Starting point is 00:46:10 Yeah, it is pure underscore DT, right? DT, exactly. Or you can send us an email at pureperformance at dynatrace.com. But thanks for listening, everyone. And again, a big thank you to Stephen for joining us. And we look forward to talking to you in the future. Thanks a lot, everyone. Bye-bye.

Starting point is 00:46:26 Bye-bye.
