PurePerformance - Resilience in the Age of AI and Why We Still Suck at It with Adrian Hornsby

Episode Date: March 2, 2026

Why do we still struggle with resilience in 2026? Is it the growing complexity of systems, the pressure to ship fast, or a lack of education around resilient design? In this episode we welcome Adrian Hornsby from Resilium Labs to explore these questions and learn about chaos, complexity, and the importance of continuous learning!

Adrian learned his chaos engineering skills while working at AWS for many years. He shares insights from his upcoming book and his experience helping organizations embrace resilience as a continuous learning practice. We discuss:

- Why traditional chaos engineering assumptions break down when AI starts writing your code.
- The rise of AI-powered SRE agents: are they a blessing or a missed learning opportunity?
- Organizational challenges and the importance of tracking near misses.

Links we discussed:
- Adrian's LinkedIn: https://www.linkedin.com/in/adhorn/
- Resilium Labs: https://www.resiliumlabs.com/
- Upcoming book: https://leanpub.com/whywestillsuckatresilience

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello and welcome to another episode of Pure Performance. My name is Brian Wilson. And as always I have with me my very wonderful and talented guest, Andy Grabner. Hi Andy, how are you doing today? Good.
Starting point is 00:00:35 You know, I was trying to make this head move, but unfortunately for the last couple of weeks I have a little bit of an issue with my spine, I think. I'm not as resilient anymore as I used to be. Are you still wearing the scarf? I do. Okay, good. You know, I was really confused about today's episode,
Starting point is 00:00:55 speaking of resilience, because I saw stuff about chaos and resiliency. I'm like, are we having a topic about the United States? But I don't think we are. I think we're going to keep this into the IT world side of things. I think so, too. But you're right. It's a little bit chaotic everywhere you look right now.
Starting point is 00:01:14 But let's keep it to, where we can have a positive impact, right? Yes, exactly. We actually have a repeat guest. And believe it or not, it's been six years. So it's back in 2019. So almost seven years. Crazy.
Starting point is 00:01:28 And crazy. And Adrian Hornsby, thank you so much for being back. Back in the days, we had two recordings. One was called The Art of Breaking Things. We talked about chaos engineering. And the second one was called How to Build Resilient Distributed Systems. Six years. What happened since?
Starting point is 00:01:45 Well, thanks so much for having me back. It's crazy, actually. Six years, it feels like it was yesterday. Yeah, it's hard to imagine. That was even pre-COVID if you want to put it into a crazy mindset. It's true. A lot has changed, actually.
Starting point is 00:02:01 I was at AWS back then. I think at that moment, six years ago, I was probably working on the early FIS, the AWS Fault Injection Service, project. We were probably producing the PRFAQ and the narrative to propose that to the organization, and I joined the team to build that. So I was a principal engineer on that team for four and a half years. And then early last year, I just left AWS and I went to build my own company, Resilium Labs.
Starting point is 00:02:46 And it took me a bit of time to figure out what I'm doing, but it turns out it's helping organizations understand why the resilience practices that they have in place often don't produce the results that they
Starting point is 00:03:06 intended to deliver, right? So, yeah, that's kind of the tagline is we fix resilience programs, right? It took me a bit to figure out. I didn't know initially that's what I was going to do, but it turns out that is what I'm doing. I just thought of a new tagline, though, if you need another,
Starting point is 00:03:29 we make resilience resilient. That's meta-resilience. Yes, resiliency for resilience. And you came up with this without even the help of an AI. Look at that. Look at that. I am an AI. You are?
Starting point is 00:03:44 I mean, Adrian, really cool to hear that the reason why you left AWS is because you wanted to start your own adventure. That's always great and also something that is admirable, because it's a big step. But I think you were also, I mean, obviously you deserve it. But if I look at your website, there's a quote from someone well known, maybe not everybody knows him, it seems. But a gentleman with the name Werner Vogels, VP and CTO at Amazon Web Services, gave you a nice quote that says, more often than not consultants can talk the talk, but cannot walk the walk.
Starting point is 00:04:23 And he talks about if you want or need to improve the resilience of your systems and operations, Adrian has proven that he can deliver. He's an educator at heart with in-depth knowledge based on real experience. And that's quite something. And I think he also even reposted your LinkedIn announcement, so that hopefully boosted your
Starting point is 00:04:51 impressions and got you a lot of new connections. It was quite a shock to see that. But yeah, it was very nice to see that. You know, when you work in a company the size of Amazon, it's hard to see the impact you have. I think it's so easy to get the imposter syndrome. And certainly I had it. And it's very difficult to see your impact because there's so many people, so much happens.
Starting point is 00:05:16 So yeah, it warmed my heart when I left to see that. It was very surprising. And it made me very emotional at the time. And that comes from somebody that lives in Helsinki, Finland. And the word emotional from us, yeah. Well, I'm half French, half English, so, you know,
Starting point is 00:05:44 I moved to Finland. So my emotions are really torn apart between those three, the Brits, the French, the extremes of emotions. Yeah. But, Adrian, let's talk a little bit about the stuff that you actually do. I think you're also writing a book, if I'm not mistaken. Can you quickly? Correct, yes. Talk a little bit about that.
Starting point is 00:06:07 I started writing the book a few years ago. It evolved a lot. I rewrote it three times, and I finally decided in the last few months to get it over the line. The topic is why we still suck at resilience. It's super interesting. We've been talking about resilience for so many years; resilience engineering is a 20-plus-year field. Technical resilience, like running game days,
Starting point is 00:06:46 is 15 years old, and then there have been books, there are tools, and it turns out organizations still struggle a lot with it. And I think the TL;DR is: the big problem is
Starting point is 00:07:02 organizations think it's a technical problem. They want to engineer their way out of resilience, where in fact it really is an organizational problem. It's about the whole socio-technical system, the people and the software they work with. And often, well, it's not often, it's all the time, the organization is kind of in the middle of a lot of different pressures, whether it's the efficiency pressure, which is a really big one, or the pressure of delivering fast versus
Starting point is 00:07:40 long-term thinking. And there are a few others, secondary tensions that come in the process: control, guardrails, all this kind of thing. So there are a lot of different tensions, and all these tensions affect the resilience practices that we put in place. And so the book talks about that
Starting point is 00:08:00 and kind of describes the tensions, goes through how to navigate the tensions, how to talk about the tensions, because often if you can talk about it, you can address it. Often that's the problem. We can't fix something that we can't discuss, right? So the vocabulary is so important. So I talk about this.
Starting point is 00:08:24 I talk about the importance of learning for resilience, because the ultimate solution really is to turn ourselves into a learning organization rather than trying to fix things through efficiency metrics. Instead of performance metrics, you need learning metrics. So these are very different. They are completely antagonistic forces. And yeah, so the book is going through that, trying to diagnose a little bit
Starting point is 00:08:54 what happened and trying to offer some solutions to that problem. Is this also then happening in your work at Resilium Labs, that when people hire you, most of the time you actually don't show them how to set up a chaos experiment, which tools to use, and what type of load to generate or chaos experiments to launch, but you start more from trying to figure out what is holding the organization back from becoming resilient? Is that the majority of your work then? It is almost all the work, really.
Starting point is 00:09:32 Like it turns out, technically, organizations are up there, right? The engineers are super smart. There are maybe a few things to improve here and there, but they know the things. They have implemented the practices. They've implemented the patterns. They do all that stuff. It's just the framing about the practices, how to measure it, how to think about it, how to talk about it.
Starting point is 00:10:02 It kind of drives the organization towards what I'm going to call in the book theater. It's like a demonstration that you have the practices versus actually learning, right? And this is because the tensions are putting pressure on what should be a learning practice, and it turns it into a performance act, right? Like, demonstrate that you can run an experiment versus run the experiment to find, you know, surprises. Or demonstrate that, okay, you run a game day, but it's very comfortable. It doesn't push you to the uncomfortable space, because you don't have time.
Starting point is 00:10:50 Because there's pressure to deliver, there's pressure to do features, there's pressure to get that checkbox. So all these different pressures turn those practices, which should have time, which should let people invest themselves, the organization investing time into learning, into the opposite. It's hard, like, I'm not too good at expressing myself quickly, but that's kind of the idea there. I think I get it. It's kind of, at the end of everything, how important is this to us, to our bottom line.
Starting point is 00:11:43 If we spend time on resilience, what is that taking away from? And if we have a problem because we're not resilient, is it a big enough problem that we're complaining, oh, we should have done that? Or is it like, yeah, that was a problem, but we were still getting features out, right? It sounds like it comes down a lot to a balance of priorities, right? I imagine, and I'm curious if this is what you see in a lot of cases, where an organization might think, well, if we do have some sort of outage because we're not resilient, is it more cost effective to have that outage or to spend the time and resources to set up a resilience practice? Does that ever come into play, or is it not as obvious as that? Yeah, often they don't even realize that they've moved away from learning, right? So they often start with good intentions, like they actually want to find a problem.
Starting point is 00:12:43 And the kind of problem you would want to find is what in resilience engineering is called the gap between what you imagine your system is, called work as imagined, versus what is actually happening in practice, work as done. And all those practices around resiliency are really there to understand where that gap is. And then often organizations set up those practices with that in mind, but without understanding exactly that this is what they're doing. They often do it because books tell them that you need to do those practices. So again, they start with good intentions, maybe not the exact right goal of learning.
Starting point is 00:13:32 And because of that, they fall into the trap of comparing themselves with a project that has a start date and an end date. And then because of that, they get assigned the same type of metrics as the projects that have delivery dates. And resilience can't compete with feature delivery, because feature delivery is very tangible. You can see it, it has customers, you get feedback out of it. You can think about it, design it, measure it, deliver it, and measure its impact. It's very concrete. Resilience is an ongoing process. It's the process of learning where that gap between the work as imagined and the work as done is.
Starting point is 00:14:22 It never stops. And if it's successful, prevention and resilience create non-events. Right. You can't really see the effect of it because it's successful. And because of that, the success of the practice often is what creates the problem. Because it's not common for an organization to celebrate an engineer that spends two months improving observability, looking deep into the logs to understand
Starting point is 00:14:57 the small things and make improvements. It's not in the habits of an organization to celebrate near misses. It's not in the habits of an organization to celebrate the invisible work that happens with engineers trying to maintain systems. But organizations celebrate feature delivery. They celebrate heroes fixing outages. They celebrate all this kind of stuff, which is the opposite of what you want. So it creates a kind of situation where your reward system, your incentives, also don't encourage resilience. There are so many tensions, so many habits in organizations that push resilience outside of the learning frame that it should take.
Starting point is 00:15:51 Does it make sense? Yeah, it reminds me too of, I don't know if you see it over where you all live, but, especially like in hospitals around here, you'll see signs up on the wall that says, you know, zero days since our last, or 20 days since our last safety incident, right? So they're counting since they had an incident, right?
Starting point is 00:16:10 To celebrate exactly what you're saying there, but it's very rare to see that kind of thing, right? So, you know, if organizations had, like, on their big boards, how many days since their last outage, that would just be some small change to help celebrate what they're doing, I guess. But even that is tricky, right? Because you are now incentivizing the organization and people to maybe try not to declare outages to actually keep that metric. And this is the hard thing: often the first intuition is to do exactly what you're talking about.
Starting point is 00:16:49 but the long-term effect is eroding those things. So, for example, instead of saying X days since our last outage, it would create a better incentive to say, how often do we find near misses, and how quickly do we recover? It's more about transparency about the act of failing,
Starting point is 00:17:28 understanding the system, learning. Surfacing, for example, surprises. Like, when you do a review, when you do a game day, a chaos engineering experiment, you want to understand what surprised you, what changed your mental model. And these are the kinds of things to celebrate
Starting point is 00:17:45 so that you actually incentivize people to openly talk about things that are about failure, because failure should be normal. You know, updating your mental model should be normal. Updating your understanding should be continuous, because your system is changing whether you want it or not. It's changing all the time.
Starting point is 00:18:12 So our mental model should change all the time. And often we play catch-up, right? So we should celebrate that catching up rather than saying, hey, we haven't had an outage because maybe we didn't declare it. But let's celebrate that. You see, but you understand what I mean? And talking about resiliency and unexpected things, I had to put myself on mute a little bit because my neighbor just decided to drill holes into their walls. But of course. Of course, exactly.
Starting point is 00:18:44 It happens when you don't expect it, and you never test for this either. But fortunately, I think he's done now. He's just using the hammer. But Adrian, the near misses. I had a conversation recently with one of our SRE leads in our organization, and this was also a big topic, the near misses. And he also did a talk at SREcon in Ireland. Can you enlighten the audience a little bit about what you mean by a near miss, what it really is, what we can learn from it, and how we should deal with it?
Starting point is 00:19:30 Yeah, so a near miss is an incident that basically got caught just before it impacted the customer, or that didn't go to the extreme, didn't cause the collapse. It almost went there, could have impacted the customer, but didn't. That's what we typically mean by near misses. It's the incident that didn't happen, but that we can learn a lot from. And often, organizations do not celebrate them. They do not look at them. They don't try to analyze them. And so what triggered the near miss basically goes back
Starting point is 00:20:16 into the system without being dealt with. So I would say it's like a volcano that doesn't erupt, but creates the earthquakes. You feel the signals, it's almost there, and you decide not to do anything about it because it didn't erupt. So you don't leave, you don't escape, you don't try to learn, you don't change your positioning. And that's really what a near miss is.
Starting point is 00:20:46 And I would say, if you want to improve resilience, you really need to build a culture where analyzing near misses is happening and people are being very open about it. Because there's so much information there. It's kind of the outage that almost happened. And so you can understand the dynamics, the assumptions that were there, the mental models that were broken before the customer impact. That's really what it is. So, yeah, with the organizations I work with, typically we try to make them deal with near misses the same way they deal with incidents.
Starting point is 00:21:39 So write a post-mortem, go in and spend time analyzing them, and really learn from them, because there's so much information there. Yeah. So if I maybe can try, not even an analogy but a story, because this is fresh on my mind. We work with one of our customers, and they had a queue with some default settings,
Starting point is 00:21:59 a default timeout before messages get thrown off the queue. I think the default timeout was one hour. That meant when the queue grew, and it was an order queue, and messages were older than one hour, they were kicked out without anybody noticing. And so it's revenue. Yeah, it's revenue. And the thing is, right, let's assume you are detecting
Starting point is 00:22:22 this, you detect that this is a problem and you fix it, because on that individual queue you're making changes, you're changing the default, you're making the queue bigger. But if you don't tell anybody, if you don't learn from it, it means everybody else may also just run with the defaults and never think about the fact that this is a key setting that decides how a complex distributed system actually works, because queues are everywhere, right?
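The queue scenario Andy describes can be sketched in a few lines. This is a hypothetical illustration, not the customer's actual system: `OrderQueue`, its retention default, and the `expired_count` counter are all invented for the example. The point is that a retention default silently drops old messages, and only an explicit metric turns that silent loss into a visible near miss.

```python
import time
from collections import deque


class OrderQueue:
    """Toy queue with a per-message retention limit, mimicking the
    silent-expiry behavior described in the story above."""

    def __init__(self, retention_seconds=3600):  # one-hour default, as in the story
        self.retention_seconds = retention_seconds
        self._items = deque()   # pairs of (enqueue_time, message)
        self.expired_count = 0  # the near-miss signal worth alerting on

    def put(self, message, now=None):
        self._items.append((now if now is not None else time.time(), message))

    def get(self, now=None):
        """Return the oldest non-expired message, silently dropping expired ones."""
        now = now if now is not None else time.time()
        while self._items:
            enqueued, message = self._items.popleft()
            if now - enqueued > self.retention_seconds:
                # An order is lost here; without a metric, nobody ever sees it.
                self.expired_count += 1
                continue
            return message
        return None
```

With injected timestamps, an order enqueued at t=0 and read after the one-hour default has passed simply disappears; watching `expired_count` (and alerting on it) is what surfaces the revenue loss instead of letting it stay invisible for years.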
Starting point is 00:22:57 Yeah. It's exactly that. They probably learned a lot about how to monitor them, how to alert, you know, what's the business impact. So, I mean, there's so much learning just in there. And how did this stay there so many years without anybody noticing? I think this is also super interesting: what happened, what was missing there, kind of the telemetry, all this kind of stuff. Yeah. Hey, another question that I have for you. And I need to quote somebody else, and somebody else's LinkedIn, and then also something
Starting point is 00:23:28 you wrote in your blog post. The quote that I read in the context of AI, I mean, AI is upon all of us. I'm pretty sure you're using AI to create code, documentation, chaos experiments. And I think the quote was from Max Kerbacher. He said we are creating more and more, but we are understanding less and less. And then I also read your latest blog post, which I think was titled When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy. And you make one statement in there that says traditional chaos engineering assumed code written by humans, humans you could ask
Starting point is 00:24:11 questions to, but that assumption breaks down when substantial portions of your codebase emerge from statistical models. So my question to you now, because you have been in that space for so long, and with your time at AWS, and AWS being one of the forerunners also when it comes to AI and all the different types of AI that we are using: I'm pretty sure you must have also seen a change in chaos engineering now as AI is emerging to generate code. So my question to you now, what is really changing? How does chaos engineering change?
Starting point is 00:24:39 what do site reliability engineers now need to change in their behavior, in the way they ensure resiliency? Most organizations are playing catch-up, actually, with AI. I mean, the chaos engineering practice already needed work, right? A lot of organizations were not there yet with chaos engineering for systems that were not powered by AI, right? They were still learning. So now, on top of that, you put AI. And so I would say the big thing that changes is there's a lot more code being produced.
Starting point is 00:25:23 So the mental model will change even faster. So, you know, doing those practices and trying to identify the gap between what you think the system is and what it actually is, is even more important, because it grows even faster. So, you know, you need to really use these kinds of tools, like load testing, chaos engineering, and there's a bunch of other things, to
Starting point is 00:25:51 try to make sense of what is being built and especially how it scales, how it breaks, how it degrades, because otherwise you will learn that in production. At some point, the
Starting point is 00:26:08 AI is, especially if you use automation and AI agents to try to make sense of a system and to recover it. Autonomous agents for SRE are something that's coming. But eventually they will fail, right? They're going to make a bad decision. And at that moment, do you wait for the AI to fix itself, or do you jump in and try to figure out
Starting point is 00:26:40 the system. From a pressure point of view, I can guarantee humans are going to be thrown in the middle there, right? Somebody is going to say, we can't wait for the AI to fix itself and try to find the problem.
Starting point is 00:26:56 It's been driving us in the wrong direction for the last two hours. Let's stop this madness and put some humans there. The problem is, if your humans haven't been in the code, trying to figure out the system and updating their mental models, well, it is going to happen in production.
Starting point is 00:27:16 It's going to take them time, and it's going to be a problem. So, you know, this is the interesting thing. I think people are trying to make sense of how to put these practices in kind of their AI workflow at the moment. My suggestion is to do them continuously, to use AI to make sense of data and find patterns, but stay in the loop as a human, because you still need to learn continuously. I still don't believe systems will be fully autonomous anytime soon. Maybe some companies will be able to do that. But besides a few, maybe cloud-native, smaller companies that don't have too complex systems, most companies will have humans in the loop for a long time.
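A minimal version of the continuous practice Adrian suggests can fit in a short script. This is a hedged sketch; the service, names, failure rate, and fallback below are invented for illustration. The idea is to inject faults into a dependency and verify a steady-state hypothesis ("the page still renders when the recommender times out"), rather than just demonstrating that the tooling runs.

```python
import random


def flaky(call, failure_rate, rng):
    """Fault-injection wrapper: makes `call` raise TimeoutError with the given probability."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    # Stand-in for a real downstream call (hypothetical).
    return ["item-1", "item-2"]


def homepage(user_id, recommender):
    # Steady-state hypothesis under test: the page degrades gracefully
    # (empty recommendations) instead of propagating the downstream failure.
    try:
        recs = recommender(user_id)
    except TimeoutError:
        recs = []  # fallback: serve the page without recommendations
    return {"user": user_id, "recommendations": recs}


# Run the experiment: 50% injected failures over 100 requests.
rng = random.Random(42)
chaotic = flaky(fetch_recommendations, failure_rate=0.5, rng=rng)
results = [homepage("u1", chaotic) for _ in range(100)]

# The hypothesis holds if every request still produced a page.
assert all(isinstance(r["recommendations"], list) for r in results)
```

The interesting outcome is not that the assertion passes, but what surprises you when it doesn't; that surprise is exactly the mental-model update the conversation keeps coming back to.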
Starting point is 00:28:21 So, you know, I think that's the danger. I went everywhere all of a sudden. Yeah. Now, but it's, I mean, for me, as you explained this, right, if we end up in the world where we get AI generated code and let's assume it works most of the time,
Starting point is 00:28:38 but all of a sudden things break and the AI cannot fix itself. And then we ask the human, it feels like the classical, throw it over the wall, what we tried to solve back in the days when we talked about DevOps. So because in the end,
Starting point is 00:28:51 you say the human, in the end, is the one that needs to fix it if the AI cannot fix it. But it's going to be really challenging, right? If you have no clue, what's actually happening. And this is why, and I'm also very new to the whole world, but it feels the way we can really get better in this
Starting point is 00:29:10 is to start by writing better requirements, better assumptions, better specifications of what we actually expect from the system. Because in your blog post, you said when humans were creating the code, you could always ask the human, what was the intention, what was the requirement? You could always ask questions. But you can only ask the AI what we've told the AI,
Starting point is 00:29:33 so that means we need to get much better at requirements engineering, really, and at specifying: hey, an order queue should never lose orders; even if it backs up, we need to have a different plan, right? Yeah, no, it's exactly that. And the models, I mean,
Starting point is 00:29:50 the context is not infinite as well. The models can keep some of the context for a few days, a few weeks, but eventually they can't keep the context for everybody all that time. I mean, not that humans remember six months later why they made a certain decision, but at least you can try to understand why a decision was made. And then typically you get knowledge diffusion there. You get
Starting point is 00:30:29 an understanding of the evolution of mental models. I mean, in all the teams I've worked with, typically there's always this institutional knowledge that you can query through the people working in the team, right? Of course, the team can change completely, and then you lose that, which is a problem. But often you can query that institutional knowledge within the team. And that's so important during outages, right?
Starting point is 00:30:56 because often that's when you need to coordinate and find the why of a particular decision so that you can fix things. Now, what I talked about, the AI kind of adding more problems, is not a new thing. Actually, in 1983, it was Lisanne Bainbridge who wrote the paper Ironies of Automation, where she already talked about that in the context of automation in industrial systems. She stated a few ironies, but the most relevant one here is that the more automated the systems are, the more expert you're going to have to be to be able to fix the automation itself.
Starting point is 00:31:52 That applies completely with AI, right? Because all of a sudden, now you need to be an AI expert, then you need to be a system expert. You need to understand all that. So the expertise and the knowledge is expanding dramatically there as well. So it's a very interesting time. For me, I feel a little bit sad for junior engineers because that's a lot to catch up with.
Starting point is 00:32:18 So it's a lot of knowledge to get right away. Yeah. I was going to say, Andy, this calls to mind a lot of the conversation we had with Jeff Blankenberg a few episodes ago. So Jeff actually works here with us at Dynatrace, but he was using AI to help him build a website. And it was finding a balance between having AI write the code, but being able to accurately envision and tell the AI
Starting point is 00:32:48 what it needs to do for him and why, right? Which is very much exactly what you're saying. Like, you still need somebody there architecting. You still need to give it the reasons. If it gives you a generic table, you're like, no, no, no, no, I need the table to do this and that, right? And then I think in his example, he needed another table. And it's like, no, no, no, based on the previous one I gave you because I don't want a generic one again, but AI doesn't pick up on those things. So you have to be very aware of what you're asking it to do, right?
Starting point is 00:33:13 And that's the human part of the knowledge. But I do think it would be interesting, at the end of having it write something, to have AI produce, like, okay, what were the instructions I gave you for each part that you wrote? Because one thing, yeah, if you have the people who had the knowledge who are there, that's great, you can ask them. But if they move on somewhere else, then you don't have that. And if at least you have it written, like again, smart uses for AI: have AI write why it did everything. And now at least you have your code documentation. And how many developers like writing code documentation? Zero, right? So get AI to write the code documentation.
Starting point is 00:33:48 I wonder, does anyone use AI to write comments in the code, or is that gone forever now? The only problem with that is, often when you ask AI the reason why it does something, it also comes up with a statistical answer. It's not necessarily the real reason, because the real reason is the statistical inference. Well, true, true. It's mathematics, right? So there's no real way yet to really understand the decision process of AI. There are attempts to do that.
Starting point is 00:34:28 I think what Jeff was talking about, though, on the previous episode was like you still have to review what the AI does, right? And he was even talking about like junior developers, right? You know, junior developers do the code review, right, in terms of what they're going to be doing. Anytime AI gives you something, you still have to take a look at it, but you don't have to do the heavy lifting of writing it all yourself. So there's still, and that kind of ties into your,
Starting point is 00:34:48 how long is it going to take for it to be more independent, writing all this stuff. It's probably still going to be a long time, because there's a gap: there are not only technical issues, like you're saying, it's going to be based on whatever model it is, but then there's going to be that trust factor of, okay, how do we know this is doing the right thing if I don't look at it, right? It's like outsourcing. Yeah.
Starting point is 00:35:13 That's exactly what I said on the call. They're going to give you exactly what you asked for. I mean, outsourcing has gotten better, right? But as I said, anyone who was in the early days of outsourcing especially, you got exactly what you asked for and nothing more. And like, hey, this doesn't work. Well, you didn't ask for it to work, right? But the other thing is there's no accountability with AI, right?
Starting point is 00:35:32 Whereas humans, there is accountability. So I think that all ties into that loop too, where the human still has to step in and be accountable, because there's got to be some accountability somewhere. And I don't mean to punish people, but to make sure things are doing what they're supposed to be doing. Yeah, and this is what worries me, because at some point, somebody is going to ask,
Starting point is 00:35:51 who is responsible for something, and somebody is going to be thrown under the bus for no apparent reason, simply because we over-trusted AI or we just delegated everything without humans in the loop. And from an efficiency perspective, it's very, very tempting, because it goes fast. It works 90% of the time or 99% of the time. Now, the one time where it's not going to work or something is going to happen.
Starting point is 00:36:26 That's when the problems are going to be very difficult to deal with. I think Jeff, in the previous podcast, also talked about using Git issues, or GitHub issues, for kind of his extended context. So he's always documenting everything in Git issues, everything that he asked for, all the refinements. Because as you mentioned earlier, right, context is limited when you have a conversation with an AI. But if you have everything well documented, you at least know what was the discussion you
Starting point is 00:36:58 had. And I think this was a little bit of a lesson learned for him, because he has been writing a blog series called 31 days of vibe coding and what he learned, how it started on his first day and how he also changed the way he is using the AI, and tips and tricks like, you know, put everything in a Git issue, start every conversation with a new Git issue, don't build big features, break them down into smaller things,
Starting point is 00:37:22 be very descriptive and have everything documented in Git issues so later on the AI can also go back, or you can go back yourself, obviously. Yeah, I think vibe coding is great. I see vibe coding as a great way to create a proof of concept, refine your ideas. But then at some point, you need to just stop it. And then, you know, go back to writing your specs, figure things out, and then break it down into what I call micro-tasks,
Starting point is 00:37:50 like micro-tasks. Then you have a lot less room for going rogue. Now, the problem is, this requires a lot of knowledge, a lot of experience, because you need to know, like you said before, you need to know what matters. You have to think a few steps ahead about how you design those tasks. Because eventually they pile onto each other. And if you do not have this long-term vision and the way,
Starting point is 00:38:32 especially in the way you accumulate tasks, right, I think it often creates big problems, because you realize too late that you made a bad decision, and you have to go back and force your way through, and then you're back to having a big problem. So it's not fundamentally different from how we did it before. Like, without AI, we still did a POC and then we had to take a step back and kind of look at the bigger system and then produce the first version of it and break it down into tasks. So it's just the
Starting point is 00:39:14 efficiency pressure is so tempting with AI, because you can go so fast and it feels productive. But you didn't have that before, because you couldn't produce the code so fast. But I think this kind of
Starting point is 00:39:36 pressure affects you in very, very different ways. Because it's like the dopamine thing. You can produce code and it's really satisfying. It's like, you know, well, I've produced code. Well, you haven't. But it's funny. You know, Andy, I'm thinking, like, remember way back, and I think you did it, it might have been with,
Starting point is 00:40:01 I mean, going way back, Wilson Moore, right? And I think he was saying, like, for performance stuff, like, find something to automate that you repeat a lot, right? And it feels like AI should be used in the same way. Like, find your tedious tasks and use it for that. So if I think about construction, like you're going to build a new building, right? In the old days, first you still had to design the building. You still had to figure out all the safety.
Starting point is 00:40:25 But then when it comes to starting it, you'd have to get, like, you know, 40, 100 people to dig the big ditch, right, to start laying the foundation. Then it's like, all right, well, now let's just get a, you know, mechanical shovel to dig it, right? And to me, it sounds like the safer way through AI is we still do all the higher-level tasks, and it's finding those tedious manual tasks to get AI to do for you, but it's doing it, again, based on what you've designed and what you require. So, you know, kind of apply the same principles of what do I automate?
Starting point is 00:40:58 It's like, what do I AI, in that same idea? Yeah, that also goes back to what Jeff said, right? In the end, we still follow our software engineering best practices. We start with a good requirements document. We then think about a good architecture based on the environment we live in, based on the constraints. There might be legal reasons why you have to make certain decisions, but you need to write all this down, like you would write it down for somebody that you then contract to do your work.
Starting point is 00:41:26 And now that contractor is an AI, right? And you have different AIs that you interact with and so you get your job done. We go back to a waterfall model. If it's faster, though. If it's faster, though, yeah, that's a good point. I mean, that's an interesting idea, you know. Would an old model come back if it's made much more efficient? Like, I think it's, I mean, I think it's a stretch,
Starting point is 00:41:49 but I think people shouldn't completely close the door on that stuff, right? Like, oh, we can't do that. Like, well, maybe we can now, you know. Who knows? It's interesting. I mean, I think some new models will come up, a combination of a few things. Adrian, I wanted to use the last
Starting point is 00:42:07 couple of minutes that we have for one more topic. Obviously, we are already on the AI theme. I wanted to talk a little bit about the AI SRE agents, because if you look around, and I think the same holds true also for your previous employer, all of these vendors are producing and coming up with new cloud services
Starting point is 00:42:27 around SRE agents or however they're calling them, you know, giving potential insights, into root cause in case something fails. So, for instance, if I make a configuration and deployment mistake on AWS, because I misconfigured my API gateway or whatever it is, I get, I can call the SRE agent as far as I understand. And then the SAC agent tells me what is the root cause and gives me recommendations. Do you have any experience on this, do you know where this is going
Starting point is 00:42:52 and kind of, are you excited about this or is this just, you know, it's just another service? Yeah, I think it's a fascinating era. I think, again, it's related to how to use AI for what is great. You know, what's really good at. So pattern recognition, all this kind of things, you know, it's good dealing with large amount of data. So kind of reduce your cognitive load during incidents.
Starting point is 00:43:21 So you can do more of the thinking. I think this is great. Now, it's going to be very interesting to see how it evolves, because it also introduces a new type of problem. It can, and I've seen that happen, it can create anchoring bias, right? It puts you into a direction so convincingly that, you know, you go there, sometimes only to realize that, no, that was not the right place, and you spend a lot of time.
Starting point is 00:44:02 And that's going to be a problem, I think, if you offload all your thinking to those agents. So you need to, I would say, you need to keep thinking and use an agent to verify your hypothesis, rather than delegate the thinking to AI. And I think this is really what is worrying me a little bit. It's also going to take learning opportunities away from people during incidents.
Starting point is 00:44:36 That's all that. You know, if you ask senior engineers, often they learn the biggest lesson during outages, you know, when you have to recover under stress, assumptions you've made mental models that were broken. You know, so you kind of, I think incident is, a reality check and often reality checks
Starting point is 00:45:02 if we're outsource everything is okay, how do we learn? And then you might end up with teams that come operate if AI is not there because there's a lot of outages that happen where your monitoring is not there. Flying blind happens regularly at least once a year
Starting point is 00:45:25 I have a customer that is flying blind. So, yeah, I mean, there's going to be a skill atrophy, I think, happening at large scale. And the question is going to be, it's even more relevant not to do game days, right, to do chaos injuring, to do all these practices, because they practice your operational muscles, like, you know, if done, right? So I would say, yeah, it's great, but yeah, it's going to have to be tempered with with organizations that are understanding what they're doing, what they're trading off with. So again, you do trade, you trade efficiency for resilience, and AI is pushing you really much
Starting point is 00:46:23 towards the efficiency side of things. Yeah. Yeah, but I really like what you said, and I think this is something I want to just reiterate. If we outsource everything, how do we learn? And we're taking away the learning opportunity. And I just did a little bit of vibe coding today myself. So I created a new app.
Starting point is 00:46:43 And immediately, the first batch that it created was full of errors. But instead of me looking and learning what mistakes it made, I just said, hey, look at these errors, these compile errors and fix it. And I didn't even bother to check, right? So by just saying, you know, AI, you can do this, do this, do this. You're right, right? I mean, we don't learn.
Starting point is 00:47:04 We will never get better at giving better instructions. And we will repeat these mistakes. And the question then is, will the AI, at the time when it really matters, be fast and good and fix it in the time we need it? Or will we then throw humans at the problem, when the humans have also lost all the knowledge about things?
Starting point is 00:47:27 So that's going to be interesting. I mean, it's another abstraction level, right? You can argue that, you know, I started my career in assembly, right? You can argue that we no longer optimize
Starting point is 00:47:41 our assembly because, you know, processors have become so powerful and the available memory is massive. You don't have to think about that anymore. And maybe the systems are going to be evolving
Starting point is 00:47:57 so that they can fix things much better and faster than humans. Maybe. But yeah, then it's like, what is left for us? Like, what are we doing? Martinis on the beach. Martinis on the beach, yeah. So that's how they try to sell us the dream, right? Yeah, exactly.
Starting point is 00:48:19 But no, it is a little scary. I think the profession of software engineering is being redefined. But we've said that for every technology that has happened in the last at least 25 years. I've heard the same story every time there's a new technology. So yeah, it's just we're going to have to reinvent ourselves yet again. And that's it. Yeah. Hey, Adrian, thank you so much for the conversation.
Starting point is 00:49:00 Is there anything else that you wanted to get off of your chest, any other topic before we close out? Wow. Like, that's a... I know, this is maybe the door opening to another episode. Yeah, I would say, stay curious with incidents and keep learning. I think if there would be one thing, I would strongly encourage everybody
Starting point is 00:49:24 is to see the discipline of resilience, kind of the as a learning practice, nothing else. It's not the project that you need to track through efficiency metric or performance metric. It's a learning opportunity to understand how your system really behaves versus what you have in mind. And whether it's AI, whether all these kind of tools still needs to be within that context of learning.
Starting point is 00:49:59 So I think when you start using AI, think about how is it helping you learn versus how you are outsource learning. That would be my last comment. And I think that's going to be an awesome sound bite, Brian, if I'm. may say. You can say that all you want. I'm kidding. Now, CEI is a learning opportunity. And not just as an outsource.
Starting point is 00:50:29 Yeah. I think, yeah, if people can understand resilience is learning. Yeah. That's a good thing. Cool. Adrian, thank you so much. All the best for your new,
Starting point is 00:50:45 new business for Resilium Labs, folks. We'll definitely make sure that all of these links will make it to the podcast description as well and also do your blog. And also, yeah, if there's anything else, just let us know that we should link to. Once the book is out. Exactly.
Starting point is 00:51:07 And I'll just send you copies. Thank you. Thank you. Also, just remember to analyze your near misses. I really like that lesson as well. It's very easy to say, yeah, we dodged a bullet on that one. Let's ignore it.
Starting point is 00:51:23 Fantastic advice. So thank you so much for being on. It's been wonderful. Hope you have a great rest of your week. Thank you, Brian. Hopefully talk to you soon. Bye-bye. Bye-bye.
