PurePerformance - Decrypting software reliability into plain English with Ash Patel
Episode Date: July 1, 2024

"Because I don't want software to go down every single day in my next gig!" is what drives the motivation of Ash Patel, Reliability Advocate and host of the SREpath podcast, to talk about and educate IT professionals on the importance of building and operating reliable systems.

For 15 years, Ash was Director of Operations at a private health service organization. He experienced patients not getting the treatment they expected because of unreliable software he was responsible for. In our conversation, Ash talks about how he had to close his own knowledge gap on technology, but also solve the problem by having engineers understand the pain and the requirements of their end users. One way he educates more engineers is through his podcast, SREpath, where observability has recently become a hot topic. Tune in, hear memorable stories from his guests from CapitalOne, IKEA, and SquaredUp, and let's move towards a world where software is reliable by default.

Links as discussed today:
Ash on LinkedIn: https://www.linkedin.com/in/ash-patel-srepath/
SREpath Podcast: https://www.srepath.com/podcast/
Clearing Delusions in Observability: https://read.srepath.com/p/30-clearing-delusions-in-observability-2af
Boosting your observability data's usability: https://read.srepath.com/p/35-boosting-your-observability-datas-3f4
How to Enable Observability for Success: https://read.srepath.com/p/40-how-to-enable-observability-for
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance.
You can obviously tell by the voice that this is not Brian Wilson who typically does the intro.
Brian hopefully is fast asleep at the time of the recording.
He should be in bed by now. I think it's about midnight or maybe even after midnight probably.
But I am not here on my own. I actually have another great guest with me: Ash Patel. Ash, thank you so much for being on the show. I stumbled across your podcast a couple of weeks ago. Actually, it was interesting: one of my friends pointed it out, and then I listened, checked it out, and said, hey, this would actually be a cool guest on my podcast. Now I'll stop talking and you can start talking. First of all, thanks for being here. Can you quickly introduce yourself: who you are, what you do, and what is this podcast that I've been listening to?
First of all, Andy, it's a pleasure to be on your show. And unfortunately, I couldn't meet Brian, but one day I hope to do that in real life.
So I'm Ash Patel, like you mentioned, and I have a podcast as well.
It's focused more so on advocating reliability practices.
And we recently changed the name from just the SREpath Podcast (we're at srepath.com) to the Reliability Enablers podcast,
because that's what my aim is,
to help people who are trying to enable greater reliability,
particularly in the software side of things,
to have an impact within their organization.
That is an area that I feel is still quite lacking in a lot of situations. Big Tech has done a very good job at it, because they have robust practices and robust systems built around having engineers work on reliability. But more importantly, the other organizations that serve us in critical industries need to get better at it. And that's my aim: to share ideas so they can join the fun as well.
Ash, thank you for the introduction.
I'm just reading off of your LinkedIn page.
Folks, by the way: every link that I mention, whether it's LinkedIn or some of the podcast episodes that Ash sent over to me, the ones that also touch on observability, which is a big topic for us, you will find in the description of the podcast. But I want to just quickly read
your two sentences because I really like this. Once upon a time, I ran operations at a healthcare
business. Right now, I'm having an extended eat, pray, love moment and focusing on advocating
greater reliability in software because I don't want software to go down every single day in my next gig.
And I thought this was a really, you know, great thing to say,
because just as Brian and I and many of the listeners that we have,
you know, we've been trying for years to really make sure
that these critical systems don't go down every time, right?
And we are trying to do what we can, depending on which role we are in.
Brian and I have a big background in performance engineering and performance testing.
So we typically brought systems to a critical state, then tried to figure out where they were breaking, and then gave this feedback back to the engineers.
I think we've seen a shift towards really thinking about reliability engineering, like site reliability engineering or resilience engineering, where from the beginning you try to make sure that you're making the right decisions to keep systems reliable. In your world, you said you went from the SRE podcast to the reliability podcast, from a terminology perspective. What has changed for you since you have been in this eat-pray-love phase of your life, and what has changed over the last months and years?
Well, it's been about 18 months, and I can definitely say
it's been less stressful working with a few interesting organizations here and there to help them understand things a little better.
And I'm working on some interesting projects here in Australia right now.
And that's why Brian couldn't join us.
It's 2 a.m. in the Eastern time zone.
He's in the Eastern time zone, right?
He's in Denver, so he's mountain time.
Oh, okay.
But it's still, it's at least 1 a.m.
Yeah, yeah, yeah.
Yeah, it's got to be midnight or 1 a.m.
Yeah, something like that.
So yeah, that Eat, Pray, Love moment you mentioned just came about from the fact that when I was running operations, I had a whole bunch of other things in my technology portfolio besides dealing with our software. So to have to deal with it every day, people complaining, end users complaining directly to me every day...
We had a fairly flat structure in our organization.
So even a cashier or someone who was dealing with billing
could reach out to me and say,
we're having this problem.
And it was becoming annoying.
And I'm noticing that our organization was not the only one experiencing this.
So many different places that I've been talking with,
they're experiencing this right now.
From the software that they're using, whether it's external or internal; it's just something that we need to resolve.
Help me understand: you said people just came to you. What type of role did you have, that everybody knew, I've got to go to Ash in case something doesn't feel right?
Director of Operations. So essentially, yeah, it wasn't just technology oriented; there was also a people side of things. It's a smaller organization, so we wear multiple hats.
It's not just saying director of, you know... I started off as a sysadmin. I always thought I would just be a technical dude, and that's not how it pans out in the real world. Sometimes the role molds you into shape rather than the other way around. So there was a lot of people-side work in operations at that organization: working with clinicians, bringing them up to speed. It was a very multifaceted role.
I've learned a lot in the 15 years I spent there.
15 years, that's a long tenure for working for a company.
I mean, you've done the same, right?
I know, I know.
It's been 16 and a half.
I know I love it, right?
I love my job and I love the company where I am.
So that's right.
But it's still, I think it's still rare, right?
Compared to looking at other folks in our industry,
there's typically more change.
So that's why having 15 years in one organization
is quite an achievement.
I got a question for you.
So looking back: you mentioned cashiers or different people came to you.
Can you give us an overview from a software perspective?
What were the most common reasons why systems didn't behave as expected? Why did they crash? Why were they slow? Why were they simply not available or resilient? What were the top reasons?
So some of the reasons I can outline were that there was a disconnect
between what the engineers were developing in terms of our internal software
and the capabilities of our systems to handle it.
So we were doing a combination of on-prem as well as cloud.
Once we shifted to cloud,
that's actually when the problems worsened, funnily enough.
Yes.
Because the expertise that we had at that time,
and we're not talking a very long time ago,
we're talking only six, seven years ago,
was very little in terms of cloud computing.
When we tried to bring people with cloud computing expertise into our space, they didn't have the domain knowledge. That was a challenge we were having. And the people with the domain knowledge didn't have cloud computing expertise.
So what they would do is make software with the mindset of: I'm making things for this kind of rack, it's going to go onto this kind of rack. And I was trying to get them into the mindset of: you're actually working with VMs, the ops guys are doing this. We had a kind of you-build-it-you-run-it type model.
The biggest problem was delineating those shared and owned responsibilities.
Even though I'd make it clear,
somehow we would all get lost in translation.
And it got to the point where I had to codify what everybody was responsible for, to what
extent and how to do it.
And I think that's what a lot of organizations and a lot of teams need, so they're not stepping on each other's toes and saying: well, that's not my job, that should be that person's job.
So we were having a lot of issues with the VMs
just not being able to handle the load
because they were not properly configured
to the requirements of our workloads.
The code was very inefficient,
which I guess is a problem in a lot of environments
and in general, we did not do performance testing at all.
So folks, if you listen to this: not only Brian, but also folks like Mark Tomlinson, who actually inspired us to do this podcast about eight years ago; he's big in performance, and I met him through performance engineering work.
From my perspective, I've been in performance engineering for so long,
it's sad to hear that these things still happen, that people are not doing performance testing.
And I know it's not always easy, it's not always top of mind, because you're also under pressure to get your features out.
But I think that's also what you're trying to advocate for now, right? I hope so, at least. Ash, please say yes: if you're working with organizations, performance engineering should be top of mind.
That is interesting.
So you said VMs don't handle the load,
code is inefficient.
As you moved from on-premise to the cloud, I assume there were also parts already running in the cloud, communicating back to on-premise. That, I would assume, is a common architecture. Did you also experience any latency issues related to the fact that all of a sudden applications or services had to communicate from one environment into the cloud and back? Obviously latency is an issue, and throughput and cost, I would assume.
Was that also an issue for you?
Yes, it was a big issue, because we had a Dynamics CRM, a bit of a lower-end version of an SAP system. We weren't going to spend tens of millions on a system. So we spent, actually, not much less.
So these were boxes sitting on site
and they would communicate with our cloud-based systems.
And yes, there was latency.
There were latency issues.
And sometimes the latency would get bad
to the point where end users would complain
as to why am I watching this thing just going,
you know, that loading icon?
Why am I just watching this thing go around
and around in circles for five, 10 minutes?
It wasn't just us; I can't fully put it down to the internal people. There were a lot of vendor issues as well. I was also doing vendor management, so we'd be dealing with external vendors who were not providing adequate service and couldn't give answers. Our suppliers as well.
In the healthcare space,
you deal with a lot of suppliers
who also were going through the same issues and change
because they were just used to people sometimes calling in,
faxing in things.
This will just blow the minds of people in software engineering: to this day, in certain industries, people still fax orders in.
Yeah, it's crazy.
It is crazy. So, 15 years of experience in an organization, being responsible for operations, having to deal with the people that complain, obviously during that cloud migration. I think you said it nicely earlier: you were bringing people in that had some cloud experience but no domain experience, and the other way around. How did you solve this? What measures did you take to ease the problem?
Did you start to re-architect for the cloud?
Did you just try to really optimize the systems for the workloads?
What did you do to mitigate some of these issues?
So the first step was to actually look at re-architecture.
But of course, as you may know, that's not a simple step
with all the spaghetti mess of whatever you already have.
I'm not talking months.
It might be a multi-year project.
And for us, that's what it appeared to be.
So we decided to educate first.
And I think that's probably why I'm doing this now: trying to educate the engineers on how they can better understand what our needs are, and become more intimate with what the users are expecting. So that meant actually getting them, possibly for the first time in their lives, to actually talk with end users. Actually communicate with them, and not leave it to... well, we didn't have PMs, because we were building internal services. So it was a lot of requirements being written by analysts and sent to developers. It sounds very old school when I say it now, doesn't it?
No, but for me the interesting thing, and this is not the first time that I hear this: what you're explaining, talking with the end user, understanding the real problem that the end user wants the software to solve for them, is something that shouldn't need a cloud transformation project where all of a sudden you see things going bad. I mean, it should be a basic, common approach to software engineering. But I do know that in many cases it's typically not done, that people actually have the engineers sit down with their end users,
like really sit down if it's possible,
and then watch them, how they are dealing with the day-to-day tasks.
Because I think that then provides also some empathy, right?
I mean, you understand where they're struggling and why they're struggling.
And I don't know the healthcare business as well, obviously, as you do.
But if you then all of a sudden see what impact bad software has, not only on the end user, but probably also, I don't know, a patient or somebody that they are interacting with.
And if you feel this the first time as an engineer, I think you have a much better appreciation for really building better quality software.
Well, I can tell you a story that I would then tell some of these engineers, just so it really sunk in, so it really hit home with them. Well, I hoped it hit home with them, because it was a really emotional situation for me to experience. I was actually there, on site at one of the sites.
These are primary health services for people with chronic conditions, which the government contracts us to provide. So the general practitioners, the physicians, would say: all right, this person's health care can be optimized, we're having some issues, can your clinicians look into it and actually provide advice on how to improve their health condition? So it's quite an advanced practice, and obviously we need good technology to make all of that work effectively. There was a situation on site where a patient, and these are people who may have taken a long bus ride to get to where you are, or may have had other things going on in their day; people might think they're not busy, but they have things going on in their lives. And they got really upset, because the software was not loading. The software just kept crashing and going slow when it was loading, to the point where this patient just walked out and said, forget about this, we'll look at it some other time. And this is someone with a chronic health condition that needs our help, that needed our help. And we weren't able to provide them that service because of software not working effectively. We had the features there, but the performance wasn't there. The uptime wasn't there. If you look at things on a graph, everything might look okay. But when you look at that individual moment in time when the software was not working effectively, that's when you really feel it, if you're there.
Yeah, it's a very powerful story, as sad as it is. Hopefully, as you said, it impacted the next line of code that the developers were writing, or made them reconsider the importance of performance, of good architecture, of good best practices, and of how to build resilient systems. We have a big responsibility in the world with the software that we are creating. I remember I had the luxury of spending some time with Kelsey Hightower, a big name in the Kubernetes industry.
And he also told a story from when he was working for an organization in the US. When you don't have a high income there, you get, not food stamps, but the digital version of food stamps. And you basically go to the supermarket, already in a situation where, obviously, you're not in a good place in life if you're depending on this, and you try to pay: you swipe the card. And there were moments when the system was down. So people who are trying to get food for their families cannot pay. And then you have a line of people behind you, and they can see that you are on these food stamps, and it's not nice. And this is why we need to do whatever it takes to make sure that at least the stuff that we build and are responsible for works as expected.
Yeah, we've got to remember we're educated professionals, so we possibly don't consider this, but it's a case of dignity. In that food stamp situation, their dignity was compromised.
Yeah, because other people could see that even the food stamp is not working for them.
We don't want that to be the case.
Yeah, exactly.
So, Ash, this means your experience in your previous job
helped you to understand it's important to educate.
That made you create the podcast.
How long have you been doing this podcast now, did you say?
Just under 18 months, I'd say. And I've been writing on reliability topics since 2021, so a bit longer.
Yeah, so we should definitely make sure to get all these links out so folks can follow up with your stories, because education is key, in our industry and in every industry. If you think back over the last 18 months, were there some episodes, some guests, where you said: hey, this was a really cool and interesting moment, an interesting story? Any episodes that you would like to highlight for our listeners, so they can go back and say: hey, if you are in performance engineering and reliability, here are a handful of things that you enjoyed in the discussions with your guests?
Well, this year I've been focusing a lot on observability, because there are a lot of problems in that space to solve and ways we can improve on them. So I think those are the episodes that are probably the most memorable for me. There were a whole bunch of other people that I spoke with. There was someone who spoke about chaos engineering; he worked at BMO, which is a very large bank in Canada, and he handled their chaos engineering and resilience engineering. That was a very interesting topic. But I think observability is still top of mind. And to me in particular, it is a foundational capability of reliability work,
of any operational work.
I think you've called it XOps at one point.
Observability seems to be foundational
to a lot of XOps.
It can be things like AIOps,
SRE work, MLOps.
Everything needs to have that data in place.
I guess it's fitting that this podcast
is related to Dynatrace.
So I think we should focus on some of those topics
because that was an interesting area
and I've built a more rounded perspective on that particular area
than anything else yet.
Cool.
Yeah, obviously observability is something that Brian and I have been living and breathing over the last couple of years, and it's great that you're focusing on this topic. I have a question in mind, but first I want to go back to your previous life, when you were head of operations: did you have proper observability then?
No. No.
But you must have had some type of observability, right?
We did, we did. But when you say proper observability, I'm thinking about the entire work stream that I've built out and am sharing with people, and I'm thinking: whoa, we didn't have that. We had quite a bit in place. We knew when systems were down and when to respond to issues; we had fairly good coverage of the four golden signals, except for maybe latency.
Like I mentioned, that performance testing was not to the level that I would hope for.
But it was a massive transition.
You're going from people working with on-prem, to hybrid, to then everything all in the cloud. It's a big shift, all in the span of a couple of years. It sounds straightforward, but it's not. So yeah, that was definitely the case: we didn't have effective observability. So I do want to help whoever wants to have effective observability
to do it right.
And I've put out a reliability blueprint on the SREpath site.
Observability is one of the work streams there,
and you can see all the different facets that you need to be mastering
to be effective at observability.
So be sure to check that out.
Yeah, and as you said, I will definitely link to this. I just want to remind people: folks, if you listen to this, you may be in a car, on a bus, on a plane; wherever you're listening, make sure you check out the details with all the links. Sorry, I didn't want to interrupt you. Go on.
Oh, no problem. No problem at all. I was going to say that I've learned a lot from the people that I've spoken with on the podcast, and that's part of the reason I do the podcast: a lot of it is for my own learning as well. If I were just to passively listen, I probably wouldn't learn as much.
I'm one of those people who needs to actively communicate with people; that's the most effective way for me to learn, when someone's telling me something.
So there are a few people that I would like to highlight and specifically what they've done if we have time for that.
Yeah, definitely. Please go ahead.
I just want to highlight that the same is true for me and Brian. We love these podcasts because we learn so much from our guests; they all bring their perspective on various topics, and it's the best educational piece for us, just learning. So please, go ahead: who are the people that crossed your podcast as guests and that we should know about?
So there are three people who just really put it all together really well for me.
And it's at different stages of the observability lifecycle.
So the first person is David Cottle, who is an engineering manager at Capital One, a very large bank in the US.
And he's very frank about his views on observability.
He did a talk recently at Monitorama, and you should see the slide he put up.
I'll see if I can find it.
We can put that in the links as well.
So essentially his idea is that there are a lot of delusions around what observability can do and around what people think their problems are. To him, a lot of people have this delusion that they have some kind of scale problem, that they're at that level where they need the highest-end tooling and need to do all kinds of tricky things.
And the one thing that I learned from that is that you need proper alignment between what your problem actually is and the solution. A lot of people oversell their problem internally, and they end up going for shiny objects or failing in their projects because there's a disconnect between the problem and the solution. A too-fancy solution for too basic a problem is just going to make life hard for everybody. I'm sure you've seen that yourself.
Yeah, that's an interesting one. So
basically, could that also be interpreted like: we have a problem, we don't know exactly what it is, but in order to solve it, we just go the quote-unquote easy route by saying, we're buying this tool, and this tool will make it go away? Not exactly hiding the facts, but kind of avoiding actually building and architecting a proper solution. Just saying: hey, all these problems will go away if we buy shiny tool ABC, whatever that is, instead of actually fixing the real problem, which might be completely unrelated, right? Because observability may not get it there.
I don't want to talk down the need for observability; I think we both know that we need observability. But yeah, that's interesting. What I also often see is that people go to conferences, and I'm speaking at a lot of conferences, and we all get inspired by these speakers. I think most speakers, and that includes me as well, always tell a very nice story. Sometimes we overtell, we make it look even nicer: this tool can solve all of our problems. We sometimes hide the fact that a lot of other things were actually necessary as well, and that these tools
alone didn't solve it. But then if people go to conferences,
listen to podcasts like ours,
and then they say,
hey, Ash or Andy,
they talked about doing this and this,
and now we need to do this as well,
and then all our problems will go away,
so please give me the funding.
I understand why people may do this,
but obviously it will not solve
maybe the real problem that people have here.
Well, there was one thing that David said to me, and I want to summarize it because I've seen this even with the engineers I worked with. They get excited by the really cool problems, the interesting problems. Let's go extreme: the ones that look like they might require quantum computing, or developing your own AI. Well, I guess that's not as cool now, because everyone's doing AI, but years ago they wanted to almost build an AI, which was like: okay, that's more than what we need. Let's try to actually just solve the actual, and I know it's boring, the actual underlying problem that we're having. We can fix this in the next two or three sprints, so how about we do that? And then everyone just kind of looks dejected. But we have to accept that reality: we need to focus on solving specific problems rather than going for the shiny objects, or chasing windmills, essentially.
Are we just...
Actually, that's not the term, David.
Who's chasing ghosts in the system?
Could be.
You know what?
I'm not a native speaker, so I'm pretty sure there are certain proverbs out there where I don't know what they're called in English. But it sounds interesting.
But it sounds interesting.
The sad thing is I am a native English speaker.
I get stuck with idioms.
I'm like, yeah, I think that's the one.
I think a lot of people do. And we just kind of wing it, you know; we just try, and it's like, yeah, you get it. Right? Yeah, I got it.
Solve the boring problems and don't always chase the shiny objects. I think that's an interesting one. I mean, boring, I understand from a challenge perspective it might be boring, but impactful, right? Solve the impactful problems and don't just chase the shiny objects.
And it's an interesting trade-off as well, because obviously you want to keep your engineers happy,
so you want to give them also something that is exciting and new,
but you cannot just do this for the sake of not focusing
and solving the problems that actually advance the business,
because in the end, the business is the one that is paying the money
to actually have all these people employed.
Yeah, that was the issue that I was having,
because I was actually also responsible
partially for a balance sheet.
So I'm thinking, let's not waste money here, guys and gals.
So yeah.
Cool.
So, David. That's an interesting one I need to follow up on. I think there's a link; you sent me a link to his podcast episode. So folks, if you want to hear from David: Capital One, a very well-known financial entity in Canada, or in the US, at least in North America.
In the US, yeah.
I remember I met a couple of Capital One engineers pre-COVID, when they were talking about how they're doing continuous delivery, which is another big topic of mine. Capital One has been doing some great stuff. I would say they did a lot of things early on with the cloud and with new technology, even working in a highly regulated industry like finance. That's really cool.
Yeah, absolutely.
They do some amazing things.
We don't want to sound like an ad for them, but yeah, there are interesting people working there. What's that slogan: what's in your wallet? Wasn't that their slogan, if you look at their ads? I think that's their slogan: what's in your wallet? It's Capital One.
I'm not sure. I'm one of those guys who used to skip the ads, you know, record things, or just watch Netflix, or have ad blockers, so I haven't seen their ads in a long time.
Cool. So David was the first one. I think you had a couple more that you wanted to talk about?
That's right, that's right. So the second person I thought of, and I spoke with him very recently, about a month or two ago, is Tim Mahoney.
He works as part of the enabling team, the observability enablement team, at IKEA. Or, in North America and the English-speaking world, we'd say IKEA.
Yeah, so IKEA.
So he brought up some very interesting points. The first one, and I've rarely
heard people even talk about this, even engineering managers, maybe I'm not listening well enough,
I don't know. But he mentioned the concept of actually having an effective engineering baseline.
What is expected of all the teams and engineers?
How often do we talk about engineering baselines?
And it's a concept I've heard before, but I haven't heard people
say it to me often enough.
I think it's important to bring that to mind: these are the requirements for you to be an effective engineer.
That's interesting, because when you said engineering baseline, the first thing that came to mind for me was: how do we measure the productivity or the effectiveness of engineers? But I guess, as you kept talking, it's more like: what are the skills, and what should be the baseline for engineering practices? Not necessarily the output; maybe the output is something that comes with it if you are following these guidelines. But baseline for me immediately triggered engineering productivity, because that's a big topic I hear a lot right now when we talk about platform engineering. That's another topic that I speak about, where it's all about how we can make developers' lives easier, how we can make them more efficient by reducing the things that are not allowing them to contribute to what they should do, like building cool new shiny objects, maybe.
But sorry, go ahead. It's just that whenever I hear something, sometimes I just need to say something.
Oh, no problem, no problem. I was actually fascinated listening to that, because I think yours is, I would say, a more precise technical approach to looking at engineering baselines. But when I was talking with Tim about how this would pan out, I was interpreting it from more of a management perspective, because that's where I have been playing for most of my career.
So for me, just to say, okay, we're measuring things... Yes, okay, we're measuring things, but it's our job as managers not to just say: we're measuring you on this, we're measuring your productivity. How are we going to make sure that you reach that number, the number we're expecting from you? How are we going to make it happen? For me, when we were talking about the baselines, that's what it came across as. Maybe have a listen to that episode and tell me if I'm wrong, but that's what it sounded like.
There are a few things he mentioned like that. If I remember correctly, he was saying that observability is not just a checkbox. It's not just: yep, we're done, we've done this, this, and this. We've installed Dynatrace, we've installed this instrumentation, we've done OpenTelemetry, we are now done. There's that whole fallacy of the maturity model that we wanted to move past, because this is always a continuously moving object. The observability enablement team isn't just a project team. They're not just doing this and then going to be done in two years' time. They're constantly going to be updating these baselines.
I guess the best way to explain the baseline, from how I interpreted it, would be: the numbers that you're seeing, like DORA metrics, are the lagging indicators in terms of KPIs. But the leading indicators are going to be things like: how many of our teams are actually implementing all the things we need to do in observability? Are they doing this and this? Are they using the right tooling? Are they following that process we brought in last week? How many teams are actually doing that? That is going to directly contribute to how we see our lagging indicators, the DORA metrics. That's how I would say it.
I like that.
This is, I think, the first time I've heard DORA metrics described as lagging indicators. You're completely right, because if you're doing things right up front, that will benefit the DORA metrics. You cannot just expect the DORA metrics to magically get better without investing up front. If we come back to a sports analogy: we expect a sprinter to run the 100 meters in a certain time. And obviously, once they make it, that's great. But how do you get there? It's through training, training, training, and giving them the right advice, and maybe different styles of running, or whatever it is,
right? And I think that's the same thing here. So the leading indicators are how we are enabling
our engineers to get their job done, how we make sure that they're following the right practices,
that they're educated the right way, that they know how to use all these tools, that they know how to do certain
things, and this will then, in the end, if everybody's doing the right thing, impact metrics like DORA. Which, funny enough, on the DORA side: I think we both know what DORA means, for us at least, right? It's a DevOps metric. But DORA in Europe right now is the Digital Operational Resilience Act. So folks, don't confuse them; it's the same acronym.
Yeah, I wish they had actually looked up what DORA means before they created that act in the European Union.
Hey, well, we'll figure it out.
Exactly. Cool. So we had David...
Yeah, so actually, just to double down on that sports analogy, I really liked it.
So imagine Usain Bolt at 10 years old, or whatever age he started at, being told by his coach: okay, run hard, I'm going to measure you, and every time you run, I expect you to do even better than you did previously. Just that, just that piece of advice: I'm measuring you, do your magic, because you've got the inbuilt talent, you've had the training to run, you're a natural runner, I'm going to make sure you run a 10-second 100 meters. Nothing else. I don't know how that would work.
Yeah, that's right. I mean, that's why I think it's a great analogy. Thanks for that. So, the DORA metrics are the lagging indicator.
And we need to invest upfront in the baseline of our engineers,
which means we need to help them do the right things.
And we need to train them.
We need to give them feedback.
We need to have teams like Tim's, right? An observability enablement team, which is not a project-based thing, but continuously mentoring, continuously working with engineers to leverage the power of observability, to get better at their job and find problems earlier and fix them faster. And this translates into better DORA metrics, yeah? Cool.
Exactly. There was one other thing Tim said, just before we move on to the next person.
The next person is very intriguing as well in how he talks about observability; he actually talks about an area that we don't really think about too often. So we're going to get to Richard's ideas in a second. But Tim from IKEA mentioned the Dunning-Kruger effect, and I hadn't heard of it at all before he spoke about it. I'm sure engineers might know about it. Essentially, it's people overestimating their ability to do something, thinking they're better at something than they actually are. Unfortunately, I've fallen for that previously,
where I'm thinking yeah I'm good at this, confidence is good but
then you don't actually do as well.
I think we need to be a bit more humble about how good we are at observability. Teams, especially product teams that may be given that responsibility, might not actually be good at observability, and they just need to be okay with that. And then they can follow the engineering baseline, follow the guidance, follow the framework, to actually get better and better, just like how they did at coding. Simple as that.
Yeah. The only thing that I can say to this: maybe this is also why, in agile planning, you play planning poker, where it's not an individual
but the whole team basically then needs to put in the number and say, you know, how long
do we need for this?
And hopefully, unless the team overall is just overestimating themselves, every member, the more people you have in the team, the more accurate an estimate you get. But yeah, I can see this, and I think we've all fallen into this, right? We always thought we're much better: yeah, sure, Kubernetes, I can easily do this. Until you actually start with it, and then you say: oh, just watching these YouTube videos didn't solve it. Yeah.
What happened to that cluster?
Where did it go?
Exactly.
It took another break.
Oh, no, it broke.
Okay, yeah.
All right.
So Richard, huh?
Yeah.
Richard Benwell.
Yeah.
So Richard Benwell is a person who spoke about observability from a different angle. He has been in the monitoring space for 20-plus years, and he's the CEO of SquaredUp. Now, they are a vendor, but we tried to keep that conversation very much focused on the problem rather than talking about anything they do. I'm not too fussed about the ins and outs of what each piece of software does, or how it's going to solve the problem. We have to understand the problem first, before we even start looking at solutions. And there's a whole bunch of solutions you can pick for this. His area is boosting your observability data's usability. Right?
So, just scanning through my notes, I found that his insights were quite specifically on the fact that it's not all rainbows and unicorns once you get the data. There's a lot of focus in observability on the technical problems of collecting and storing the data. But then the usability aspect kind of falls into the too-hard basket, because, well, really, we've done our job, right? It's up to the people who are going to look at the data now to figure it out. But we've got to remember that humans are still the ones who are solving problems. And if they're going to be the ones solving the problem, they need to make effective use of that observability data. And the best way to do that is to make it easy to make sense of.
Yeah, it's funny that you mentioned Richard, because I met him, I'm not sure exactly when, it must have been two or three years ago, at a conference, and he showed me the stuff they built with SquaredUp. I think they're doing a lot of cool dashboarding, if I remember right. That's what they do. So, really making the data tell a story.
And then, as you said... I'm also taking a lot of notes here. The usability aspect of observability: that is a really, really interesting way to phrase it. Because we can capture all the data in the world, but what if nobody's looking at it, or nobody knows what this data actually tells them? I mean, we've been trying to solve this problem with different approaches, right? Automatically detecting patterns, automatically highlighting important data. But in the end, the question is: who is going to look at this data? Because in every organization you have different people that need to look into this observability data, with different backgrounds and different experiences. And therefore, you need to provide something that is flexible enough to adjust to the individual requirements of every organization and use case.
Exactly. The funny thing you mentioned was that he did some really cool dashboards. The one thing that I learned from him was that the key is to make all the data you're visualizing actually meaningful. If you're just putting up pretty graphs and dashboards, you're not helping anybody. What you really should be asking yourself, and this is something I've learned from all the conversations I've had with people in that space, in the visualization space, is: will this dashboard actually tell me where the issues are at a glance, when I'm not even thinking about looking at the screen?
People think about creating these dashboards
when they've got nothing else going on.
They're focusing on making that dashboard.
But you've got other things on your mind.
You're thinking about what's happening at home.
You're thinking about the incident
that you're looking at right now.
You're thinking about your boss's messages.
You're thinking about your colleagues' messages.
You're looking at emails.
You're trying to patch everything together. So you need something that gives you an effective picture at a glance. That's the key takeaway I took from that.
Yeah, and to add to this: somebody builds a dashboard because they have an understanding of the system. But the problem is, if they're the only ones that understand what they put on the dashboard, what if they're sick, what if they move on and somebody else needs to take it over? Dashboards need to be as intuitive as any mobile app that I install, and I need to know how to use them, as you said. And what we have been doing is also a little bit of measuring which dashboards are even used. Because in the end, you might be building dashboard after dashboard and maintaining them, but nobody ever uses them, because there might be people that build dashboards, but there are others that consume them. So, simple things like: let's measure which dashboards are used and which are not. You can also think about a thumbs up, thumbs down: is this dashboard useful? Does it tell you anything? Maybe you just rotate on a sprint basis, pick a couple of dashboards, show them to people and ask: is this meaningful for you? What does this dashboard tell you? And with this you also learn which dashboards are actually not effective, because nobody understands what's on there.
Yeah. One tip I would add, and I always used to train my staff on this: don't ask closed questions. Don't ask a yes-or-no type question; always ask for input. So: is this dashboard useful for you? People will just say yes. That's a problem we used to have. Yes, it's useful, and then later on: hey, I can't figure out how to use this dashboard. It used to happen far too often. So you want to ask a question like: what does this dashboard tell you?
Yeah. It's almost like an exam.
Yeah, scary as that might sound, to put people in that situation.
Keep it casual: hey, what do you think this dashboard is showing you right now? Can you just give me an interpretation? I'm trying to figure out if people are able to understand it well.
Hey, Ash, thank you so much for these three episodes: David, Tim, Richard. The links are in the description of the podcast. It's really great. I mean, I also love podcasting, and so do you, so keep going, keep doing this. We hope that some of our listeners will tune into some of your episodes, because it clearly sounds like some great content. I also know you told me earlier that, thanks to the podcast, people are now being made aware of what you do, who you are, and that you're passionate about this. And you're actually actively being brought in and contacted by organizations that need some help on this. I just want to confirm that.
So in case people have a need for an expert, I just want to make sure they can reach out to you.
Yeah. I'm enjoying the beautiful weather in Australia for what might be a while; it'll probably take a bigger issue for me to want to go anywhere else at this point. But yeah, I'm always happy to have a conversation with people who just need a bit of guidance on what's happening in their reliability journey.
And thank you, Andy, for having me on and love what you do here.
Thank you so much.
And Brian, sorry that you could not take part in this conversation today. Hopefully you had a chance to listen to it, though. This should probably air in about two weeks; we always record with a little bit of backlog, but I think in two weeks this will air. So that means in early July people will be able to listen to this. And hopefully then you will see some traction on your podcast, and hopefully some people will reach out. And while you are in Australia, the world is a small place, the world is connected. So folks, don't shy away even if you are somewhere completely remote from Australia. Ping Ash. A great connection to have.
Exactly, exactly.
Yeah, any final words, Ash, before we close this episode?
No, I think that's it, that's all, I think, Andy. Yeah, I appreciate it. I also am a salsa dancer, by the way.
Really? On1, On2, Cuban? What do you do?
Cuban.
Cuban, nice.
Yeah. It's rare, it's quite rare, but it's the thing that gets me going.
Yeah, cool. How long have you been dancing?
10 years.
10 years, nice.
Yeah, it's awesome.
And you met your wife dancing?
I did, yeah,
in Boston, on the dance floor.
That's amazing.
Yeah, exactly. She's Colombian, so she obviously knows her stuff. But yeah, it seems I made an impression when I met her, and after the second time dancing, she accepted my request to ask her out.
And yeah, seven years later, seven years married.
That's worked out pretty well.
Yeah, that's amazing.
Like, has she tried to get you to learn the Colombian way?
It's a little bit too fast for me, to be honest with you. And also, she learned a different type of salsa than the one we dance when we go to dance clubs here. But now we enjoy it very much.
Oh, nice. So what style do you do?
I also started Cuban in the very beginning, then did some rueda. But then I switched over to On1. I just like it a lot. I mean, I'll dance whatever, I don't care in the end, but On1 is my favorite.
Yeah, I think I'm actually going to take lessons for On1, just to be more versatile, rather than sitting out when all the followers only know On1, you know?
Yeah, definitely. I'll be calling you next time I am in Australia. I remember on my first couple of trips I always went dancing in Sydney, but obviously things have changed with the pandemic: which clubs are still out there and which ones are new. So yeah, it's good to know, good to know to have a salsero around the corner when I make it down to Australia.
[In Spanish:] A salsa friend, a salsero friend. You speak Spanish too?
Yes, yes, I speak Spanish, but sometimes it's very difficult, because nobody speaks Spanish here in Australia.
Yes, I understand. And you speak it, right?
Yes, a little, because my wife is Colombian. But I'm still just learning the language. And you know what's funny? We're still recording, and this is still going to be on the show. So let's see who is listening until the very end.
Folks, if you listen until the very end, and you hear us talk about salsa and speak a little bit of Spanish, make a comment on LinkedIn or wherever you find this podcast. It would be fun to see who else is dancing salsa out there.
That'd be awesome.
That'd be amazing.
All right. Hey, with this I need to say goodbye. Thank you so much.
Sure.
I'll stop the recording, but until next time: it was an honor, it was a pleasure having you, for sure.
Appreciate it, Andy. Cheers.
Cheers. Bye-bye.