PurePerformance - Resilience in the Age of AI and Why we Still Suck at it with Adrian
Episode Date: March 2, 2026

Why do we still struggle with resilience in 2026? Is it the growing complexity of systems, the pressure to ship fast, or a lack of education around resilient design? In this episode we welcome Adrian Hornsby from Resilium Labs to explore these questions and learn about chaos, complexity, and the importance of continuous learning! Adrian learned his chaos engineering skills while working at AWS for many years. He shares insights from his upcoming book and his experience helping organizations embrace resilience as a continuous learning practice.

We discuss:
- Why traditional chaos engineering assumptions break down when AI starts writing your code.
- The rise of AI-powered SRE agents: are they a blessing or a missed learning opportunity?
- Organizational challenges and the importance of tracking near misses.

Links we discussed:
Adrian's LinkedIn: https://www.linkedin.com/in/adhorn/
Resilium Labs: https://www.resiliumlabs.com/
Upcoming Book: https://leanpub.com/whywestillsuckatresilience
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello and welcome to another episode of Pure Performance.
My name is Brian Wilson.
And as always I have with me my very wonderful and talented guest, Andy Grabner.
Hi Andy, how are you doing today?
Good.
You know, I was trying to make this head move,
but unfortunately for the last couple of weeks I have a little bit of an issue with my,
it's not a bit of the spine or I think I'm not as resilient anymore.
as I used to be.
Are you still wearing the scarf?
I do.
Okay, good.
You know, I was really confused about today's episode,
speaking of resilience,
because I saw stuff about chaos and resiliency.
I'm like, are we having a topic about the United States?
But I don't think we are.
I think we're going to keep this into the IT world side of things.
I think so, too.
But you're right.
It's a little bit chaotic everywhere you look right now.
But let's keep it to,
where we can have a positive impact, right?
Yes, exactly.
We actually have a repeat guest.
And believe it or not, it's been six years.
So it's back in 2019.
So almost seven years.
Crazy.
And crazy.
And Adrian Hornsby, thank you so much for being back.
Back in the days, we had two recordings.
One was called the Art of Breaking Things.
We talked about chaos engineering.
And the second one was called How to Build Resilient Distributed Systems.
Six years.
What happened since?
Well, thanks so much for having me back.
It's crazy, actually.
Six years, it feels like it was yesterday.
Yeah, it's hard to imagine.
That was even pre-COVID if you want to put it into a crazy mindset.
It's true.
A lot has changed, actually.
I was in AWS back then.
I think at that moment, six years ago,
I was probably working on the early FIS project, the AWS Fault Injection Service.
We were probably producing the PR/FAQ and the narrative to propose that to the organization, and then I joined the team to build it.
I was a principal engineer on that team for four and a half years.
And then early last year, I left AWS and went to build my own company, Resilium Labs.
And it took me a bit of time to figure out what I'm doing, but it turns out it's helping organizations understand why the resilience practices that they have in place often don't produce the results that they intended to deliver, right?
So, yeah,
that's kind of
the tagline is we fix resilience programs, right?
It took me a bit to figure out.
I didn't know initially that's what I was going to do,
but it turns out that is what I'm doing.
I just thought of a new tagline, though, if you need another,
we make resilience resilient.
That's meta-resilience.
Yes, resiliency for resilience.
And you came up with this without even the help of an AI.
Look at that.
Look at that.
I am an AI.
You are?
I mean, Adrian, really cool to hear that you left AWS because you wanted to start your own adventure.
That's always great, and also something that's admirable, because it's a big step.
And obviously you deserve it.
But if I look at your website, there's a quote from someone well known, maybe not everybody knows him, but a gentleman by the name of Werner Vogels, VP and CTO at Amazon Web Services, gave you a nice quote that says,
more often than not consultants can talk the talk,
but cannot walk the walk.
And he talks about if you want or need to improve the resilience
of your systems and operations, Adrian has proven that he can deliver.
He's an educator at heart with an in-depth knowledge based on real experience.
And that's quite something.
And I think he also said, he even reposted your LinkedIn announcement, so that hopefully boosted your impressions and got you a lot of new connections.
It was quite a shock to see that.
But yeah, like it was very nice to see that.
You know, like when you work in a company, like the size of Amazon,
and then it's hard to see the impact you have.
I think it's so easy to get the imposter syndrome.
And certainly I had it.
And it's very difficult to see your impact
because there's so many people, so much happens.
So yeah, it was,
it warmed my heart when I left to see that.
It was, yeah, it was very surprising.
And, like, it made me very emotional at the time.
And that comes from somebody that lives in Helsinki, Finland.
And the word emotional from us, yeah.
Well, I'm half French, half English, so, you know,
I moved to Finland.
So my emotions are really torn apart between those three, the Brits, the French, and the Finns, like the extremes of emotions.
Yeah.
But, Adrian, let's talk a little bit about the stuff that you actually do.
I think you're also writing a book, if I'm not mistaken.
Can you quickly?
Correct, yes.
Talk a little bit about that.
I started writing the book a few years ago.
It evolved a lot.
I rewrote it three times, and I finally decided in the last few months to get it over the line.
The topic is why we still suck at resilience.
It's super interesting.
We've been talking about resilience for so many years; resilience engineering is a 20-plus-year field, and technical resilience practices like chaos engineering during game days are 15 years old.
There have been books, there are tools, and it turns out organizations still struggle a lot with it.
And I think the kind of TL;DR is: the big problem is organizations think it's a technical problem.
They want to engineer their way out of resilience, where in fact it really is an organizational problem.
It's about the whole socio-technical system, the people and the software that they work with.
And the organization is, well, it's not often, it's all the time, the organization is kind of in the middle of a lot of different pressures, whether it's the efficiency pressure, which is a really big one, or the pressure of delivering fast versus long-term thinking.
And there are a few others.
There are secondary tensions that come into the process: control and guardrails, all this kind of thing.
So there are a lot of different tensions, and all these tensions affect the resilience practices that we put in place.
And so the book talks about that, describes the tensions, goes through how to navigate them, how to talk about them, because often if you can talk about it, you can address it.
Often that's the problem.
We can't fix something that we can't discuss, right?
So the vocabulary is so important.
So I talk about this.
I talk about the importance of learning for resilience, because the ultimate solution really is to turn ourselves into a learning organization rather than trying to fix things through efficiency metrics.
Instead of performance metrics, you need learning metrics.
These are very different; they are completely antagonistic forces.
And yeah, so the book is going through that, trying to diagnose a little bit
what happened and trying to offer some solutions to that problem.
Is this also then happening in your work at Resilium Labs that when people hire you,
that most of the time you actually don't show them how to set up a chaos experiment,
which tools to use, what type of load to generate, or which chaos experiments to launch,
but more starting from trying to figure out what is holding the organization back
from becoming resilient?
Is that the majority of your work then?
It is almost all the work, yeah.
It turns out that technically, organizations are up there, right?
The engineers are super smart.
There are maybe a few things to improve here and there, but they know the things.
They have implemented the practices.
They've implemented the patterns.
They do all that stuff.
It's just the framing around the practices, how to measure them, how to think about them, how to talk about them.
It kind of drives the organization towards what I'm going to call in the book "theater": a demonstration that you have the practices versus actually learning, right?
And this is because the tensions are putting pressure on what should be a learning practice, and it turns it into a performance act, right?
Like, demonstrate that you can run an experiment, versus run the experiment to find surprises.
Or demonstrate that, okay, you ran a game day, but it was very comfortable.
It doesn't push you.
It doesn't push you to the uncomfortable space because you don't have time.
Because there's pressure to deliver, there's pressure to do features, there's pressure to get that checkbox.
So all these different pressures really turn those practices, which should have time, which should let people invest themselves, with the organization investing time into learning, into the opposite.
It's hard, like, I'm not too good at expressing myself quickly, but that's kind of the idea there.
I think I get it.
At the end of everything, it's: how important is this to us, to our bottom line?
If we spend time on resilience, what is that taking away from?
And if we have a problem because we're not resilient, is it a big enough problem that we're like complaining that, oh, we should have done that?
Or is it like, yeah, that was a problem, but we were still getting features out, right?
It sounds like it comes down to a lot to a balance of priority, right?
I imagine, and is curious if this is what you see in a lot of cases where an organization might think, like, well, if we do have some sort of outage because we're not resilient, is it more cost effective to have that outage or to spend the time and resources to set up a resilient practice?
Does that ever come into play, or is it not as obvious as that?
Yeah, often they don't even realize that they've turned, they've moved away from learning, right?
So they often start with good intention, like they actually want to find a problem.
And the kind of problems you would want to find are what in resilience engineering is called the gap between what you imagine your system is, which is called work-as-imagined, versus what is actually happening in practice, work-as-done.
And all those practices around resiliency are really there to understand where that gap is.
And then often the organization set up those practices with that in mind,
but not understanding exactly that it is what they're doing.
They often do it because books tell them that you need to do those practices.
So again, they start with good intention.
maybe not the exact right goal of learning.
And because of that, they fall into the trap of comparing themselves with a project that has a start date and end date.
And then because of that, then they get assigned the same type of metrics as the projects that have delivery date.
And you can't compete, resilience can't compete with feature delivery because feature delivery is very tangible.
You can see it as, you know, it has customers, it has, you get feedback out of it.
You can think about it, design it, measure it, deliver it and measure its impact.
It's very concrete.
Resilience is an ongoing process.
It's the process of learning where that gap between the work as imagined and work as done is.
It never stops.
And if it's successful, prevention and resilience create non-events.
Right.
You can't really see the effect of it because it's successful.
And because of that, the success of the practice often is what creates the problem.
Because it's not common for organizations to celebrate an engineer that spends two months improving observability, looking deep into the logs to understand the small things and make improvements.
It's not in the habit of an organization to celebrate near misses.
It's not in the habit of an organization to celebrate the invisible work of engineers trying to maintain the system.
But organizations celebrate feature delivery.
They celebrate heroes, fixing outages.
They celebrate all this kind of stuff, which is the opposite of what you want.
So it creates a kind of situation where your reward system, your incentives, also don't encourage resilience.
There are so many tensions, so many habits in organizations that push resilience outside of the learning frame that it should take.
Does it make sense?
Yeah, it reminds me too of, I don't know if you see it over where you all live, but,
especially like in hospitals around here,
you'll see signs up on the wall that says,
you know,
zero days since our last,
or 20 days since our last safety incident, right?
So they're counting since they had an incident, right?
To celebrate exactly what you're saying there,
but it's very rare to see that kind of thing, right?
So, you know, if organizations had, like, on their big boards,
like, you know, how many days since their last outage,
like that would just be some small change to help celebrate what they're doing, I guess.
But even that is tricky, right?
Because you are now incentivizing the organization and people to maybe try not to declare outages, to keep that metric.
And this is the hard thing: often the first intuition is to do exactly what you're talking about, but the long-term effect is eroding those things.
So, for example, instead of saying X days since our last outage, it would create a better incentive to ask: how often do we find near misses, and how quickly do we recover?
It's more about transparency around the act of failing, understanding the system, learning.
So surfacing, for example, surprises.
Like, when you do a review, when you do a game day, a chaos engineering experiment, you want to understand what surprised you, what changed your mental model.
And these are the kinds of things to celebrate, so that you actually incentivize people to openly talk about things that are about failure, because failure should be normal.
You know, updating your mental model should be normal, and updating your understanding should be continuous,
because your system is changing regardless of what you want or not.
It's changing all the time.
So our mental model should change all the time.
And often we play catch-up, right?
So we should celebrate that catching up, rather than saying, hey, we haven't had an outage, when maybe we just didn't declare it.
You understand what I mean?
And talking about resiliency and unexpected things, I had to put myself on mute a little bit because my neighbor just decided to drill holes into their walls.
But of course.
Of course, exactly.
It happens when you don't expect it and you never test for this either.
But fortunately, I think he's done now.
He's just using the hammer.
But Adrian, the near misses, I had a conversation recently with one of our SRE leads in our organization.
And this was also a big topic, the near misses.
And he also did a talk at SRE CON in Ireland.
Can you enlighten us and the audience a little bit: what do you mean by a near miss, what is it really, what can we learn from it, and how should we deal with it?
Yeah, so a near miss is an incident that basically got caught just before it impacted the customer, or that didn't push to the extreme, didn't quite reach the collapse.
It almost went there, could have impacted customers, but didn't.
That's what we typically mean by near misses.
And often, organizations do not celebrate them.
They do not look at them.
They don't try to analyze them.
And so what triggered the near miss basically goes back into the system without being dealt with.
I would say it's like a volcano that doesn't erupt but creates the earthquakes.
You feel the signals, it's almost there, and you decide not to do anything about it because it didn't erupt.
So you don't leave, you don't escape, you don't try to learn, you don't change your position.
And that's really what a near miss is.
And I would say, if you want to improve resilience, you really need to build a culture where analyzing near misses happens and people are very open about it.
Because there's so much of information there.
It's kind of the outage that almost happened.
And so you can understand the dynamics, the assumptions that were there, the mental models that were broken, before the customer impact.
That's really what it is.
So, yeah, with the organizations I work with, we typically try to make them deal with near misses the same way they deal with incidents.
So write a post-mortem, go in and spend time analyzing them, and really learn from them, because there's so much information there.
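Adrian's "treat near misses like incidents" advice can be sketched as a small record plus a learning metric. This is a hypothetical structure, not any tracker's real schema; every field and function name below is an illustrative assumption:

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical record: a near miss gets the same post-mortem fields
# as a real incident, so it can be analyzed and learned from the same way.
@dataclass
class NearMiss:
    title: str
    detected_on: date
    trigger: str                # what almost caused the outage
    caught_by: str              # alarm, code review, luck...
    broken_assumption: str      # the work-as-imagined vs. work-as-done gap it revealed
    actions: list = field(default_factory=list)

def learning_metrics(near_misses):
    """Learning metrics: how many near misses were surfaced and how many
    produced follow-up actions, instead of 'days since last outage'."""
    return {
        "surfaced": len(near_misses),
        "acted_on": sum(1 for nm in near_misses if nm.actions),
    }

log = [
    NearMiss(
        title="Order queue almost overflowed",
        detected_on=date(2026, 2, 12),
        trigger="traffic spike doubled queue depth",
        caught_by="on-call noticed the depth alarm",
        broken_assumption="we assumed the queue always drains within minutes",
        actions=["raise retention", "alert on oldest-message age"],
    ),
]
print(learning_metrics(log))  # {'surfaced': 1, 'acted_on': 1}
```

The design choice mirrors Adrian's point: the metric rewards surfacing and acting on failure signals, where a days-since-last-outage counter rewards not declaring them.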
Yeah.
So, if I maybe can try, not even an analogy but a story,
because this is fresh on my mind.
We work with one of our customers,
and they had a queue with some default settings, a default timeout before messages get thrown off the queue.
I think the default timeout was one hour.
That meant when the queue grew, and it was an order queue, and messages became older than one hour, they were kicked out without anybody noticing.
Yeah, it's revenue.
And the thing is, right, let's assume you are detecting this, you detect that this is a problem, and you fix it, because on that individual queue you're making changes, you're changing the default, you're making the queue bigger.
But if you don't tell anybody, if you don't learn from it, it means everybody else may also just run with the defaults and never think about the fact that this is a key setting that decides how a complex distributed system actually works, because queues are everywhere, right?
Yeah.
It's exactly that.
They probably learned a lot about how to monitor them, how to alert, you know, what's the business impact.
So it's like, I mean, there's so much learning just in there.
And how did this stay there so many years without anybody noticing?
What was that?
I think what is also super interesting is what happened there, you know, what was missing, kind of the telemetry, all this kind of stuff.
Yeah.
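Andy's queue story reduces to a minimal simulation. The class and the one-hour default below are illustrative, not any specific broker's API; the point is that an unexamined retention default silently drops orders unless the expiries are counted somewhere:

```python
# Minimal sketch of the story above: a queue whose default message
# retention silently expires anything older than the cutoff.
DEFAULT_RETENTION_S = 3600  # one hour: the kind of default nobody revisits

class OrderQueue:
    def __init__(self, retention_s=DEFAULT_RETENTION_S):
        self.retention_s = retention_s
        self._messages = []  # list of (enqueued_at_seconds, payload)
        self.expired = 0     # the metric that was missing: silently dropped orders

    def put(self, now, payload):
        self._messages.append((now, payload))

    def get(self, now):
        # Expire silently, as the defaults did, but count it so it is visible.
        while self._messages and now - self._messages[0][0] > self.retention_s:
            self._messages.pop(0)
            self.expired += 1
        return self._messages.pop(0)[1] if self._messages else None

q = OrderQueue()
q.put(0, "order-1")
q.put(3000, "order-2")
# The consumer falls behind; by t=4000s, order-1 is older than one hour.
assert q.get(4000) == "order-2"
assert q.expired == 1  # revenue lost, but now at least counted and alertable
```

An `expired` counter like this is exactly the kind of telemetry Adrian says was missing: once it exists, you can alert on it instead of discovering the losses years later.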
Hey, another question that I have for you.
And I need to quote somebody else, something from somebody else's LinkedIn, and then also something you wrote in your blog post.
The quote that I read in the context of AI, and AI is upon all of us, I'm pretty sure you're using AI to create code, documentation, chaos experiments.
I think the quote was from Max Körbächer.
He said we are creating more and more, but we are understanding less and less.
And then I also read your latest blog post, which I think was titled "When AI Writes Your Code, Chaos Engineering Writes Your Insurance Policy."
And you make one statement in there that says traditional chaos engineering assumed code written by humans, humans you could ask questions to, but that assumption breaks down when substantial portions of your codebase emerge from statistical models.
So my question to you now, because you have been in that space for so long, and with your time at AWS, AWS being one of the forerunners also when it comes to AI and all the different types of AI that we are using:
I'm pretty sure you must have also seen a change in chaos engineering now as AI is emerging to generate code.
So, what is really changing?
How does chaos engineering change?
What do site reliability engineers now need to change in their behavior, in the way they ensure resiliency?
Most organizations are playing catch-up, actually, with AI.
I mean, the chaos engineering practice already needed work, right?
A lot of organizations were not there yet with chaos engineering for systems that were not powered by AI, right?
they were still learning.
So now on top of that, you put AI.
And so I would say the big thing that changes is there's a lot more code being produced.
So the mental model will change even faster.
So, you know, doing those practices and trying to identify the gap between what you think the system is and what it is is even more important because it grows even faster.
So you really need to use these kinds of tools, like load testing, chaos engineering, and a bunch of other things, to try to make sense of what is being built, and especially how it scales, how it breaks, how it degrades, because otherwise you will learn that in production.
At some point, especially if you use automation and AI agents to try to make sense of a system and to try to recover, like the autonomous agents for SRE that are coming, eventually they will fail, right?
They're going to make a bad decision.
And at that moment, do you wait for the AI to fix itself, or do you jump in and try to figure out the system?
From a pressure point of view, I can guarantee humans are going to be thrown in the middle there, right?
Somebody is going to say, we can't wait for the AI to fix itself while trying to find the problem.
It's been driving us in the wrong direction for the last two hours; let's stop this madness and put some humans in there.
The problem is, if your humans haven't been in the code, trying to figure out the system and update their mental models, well, it is going to happen in production.
It's going to take them time, and it's going to be a problem.
So, you know, this is the interesting thing.
It's like, I think people are trying to make sense of how to put these practices into their new, kind of AI-powered workflows at the moment.
My suggestion is to do them continuously, to use AI to make sense of data and find patterns, but to stay in the loop as a human, because you still need to learn continuously.
I still don't believe system will be fully autonomous anytime soon.
Like, you know, maybe some companies will be able to do that.
But besides a few, maybe cloud-native, smaller companies that don't have too-complex systems, most companies will have humans in the loop for a long time.
So, you know, I think that's the danger.
I went everywhere all of a sudden.
Yeah.
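One concrete way to stay in the loop is the kind of experiment Adrian describes: inject faults into a dependency call and test the hypothesis that callers degrade gracefully instead of crashing. A minimal sketch; the function names and the fallback policy are illustrative assumptions, not any chaos tool's real API:

```python
import random

def inject_faults(fn, error_rate=0.2, seed=None):
    """Wrap a dependency call so a fraction of calls raise an injected fault."""
    rng = random.Random(seed)
    def wrapped(*args, **kwargs):
        if rng.random() < error_rate:
            raise TimeoutError("injected fault: dependency timed out")
        return fn(*args, **kwargs)
    return wrapped

def fetch_price(item):
    # Stand-in for a real dependency call (database, pricing service, ...).
    return {"item": item, "price": 9.99}

def fetch_price_with_fallback(item, fetch):
    # The hypothesis under test: the caller degrades instead of crashing.
    try:
        return fetch(item)
    except TimeoutError:
        return {"item": item, "price": None}  # degraded response, not an outage

# Run the experiment: with half the calls failing, no call should crash.
flaky = inject_faults(fetch_price, error_rate=0.5, seed=42)
results = [fetch_price_with_fallback("book", flaky) for _ in range(100)]
degraded = sum(1 for r in results if r["price"] is None)
assert 0 < degraded < 100  # some calls hit the fault, none escaped unhandled
```

If that last assertion fails, or an exception escapes, that is exactly the surprise a game day is supposed to surface before production does.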
Now, but it's, I mean, for me,
as you explained this, right,
if we end up in the world
where we get AI generated code
and let's assume it works most of the time,
but all of a sudden things break
and the AI cannot fix itself.
And then we ask the human,
it feels like the classical,
throw it over the wall,
what we tried to solve
back in the days when we talked about DevOps.
So because in the end,
you say the human,
in the end, is the one that needs to fix it
if the AI cannot fix it.
But it's going to be really challenging, right?
If you have no clue,
what's actually happening.
And this is why, and I'm also very new to this whole world, but it feels like the way we can really get better at this is to start by writing better requirements, better assumptions, better specifications of what we actually expect from the system.
Because you in your sentence, right, in your blog post,
you said when humans were creating the code,
you could always ask the human, what was the intention,
what was the requirement?
You could always ask questions.
but you can only ask the AI what we've told the AI.
So that means we need to get much better at real requirements engineering, and at specifying: hey, an order queue should never lose orders; even if it backs up, we need to have a different plan, right?
Yeah, no, it's exactly that.
And the models, I mean, the context is not infinite as well.
The models can keep some of the context for a few days, a few weeks, but eventually they can't keep the context for everybody all the time.
I mean, not that humans remember six months later why they made a certain decision, but at least you can try to understand why a decision was made.
And then, you know, typically you get knowledge diffusion there; you get an understanding of the evolution of mental models.
I mean, all the teams I've worked with,
typically there's always this institutional knowledge
that you can query through the people working team, right?
Of course, the team can be changed completely,
then you lose that, which is a problem.
But often you can query that institutional knowledge within the team.
And that's so important during outages, right?
because often that's when you need to coordinate and find the why of a particular decision, so that you can, you know, fix things.
Now, what I talked about, the AI kind of adding more problems, is not a new thing, actually.
In 1983, it was Lisanne Bainbridge who wrote the paper "Ironies of Automation," where she already talked about that in the context of automation in industrial systems.
She stated a few ironies, but the most relevant one here is that the more automated the systems are, the more expert you're going to have to be to be able to fix the automation itself.
That applies completely with AI, right?
Because all of a sudden, now you need to be an AI expert,
then you need to be a system expert.
You need to understand all that.
So the expertise and the knowledge is expanding dramatically there as well.
So it's a very interesting time.
For me, I feel a little bit sad for junior engineers
because that's a lot to catch up with.
So it's a lot of knowledge to get right away.
Yeah.
Yeah, and it's a bit...
I was going to say, Andy, this calls to mind a lot of the conversation we had with
Jeff Blankenberg a few episodes ago.
So Jeff actually works here with us at Dynatrace, but he was using AI to help him build a website.
And it was finding a balance between having AI write the code,
but being able to accurately envision and tell the AI.
what it needs to do for him and why, right?
Which is very much exactly what you're saying.
Like, you still need somebody there architecting.
You still need to give it the reasons.
If it gives you a generic table, you're like, no, no, no, no, I need the table to do this and that, right?
And then I think in his example, he needed another table.
And it's like, no, no, no, based on the previous one I gave you because I don't want a generic one again, but AI doesn't pick up on those things.
So you have to be very aware of what you're asking it to do, right?
And that's the human part of the knowledge.
But I do think it would be interesting, at the end of having it write something, to have the AI produce: okay, what were the instructions I gave you for each part that you wrote?
Because one thing, yeah, if you have the people who had the knowledge
who are there, that's great, you can ask them. But if they move on somewhere else, then you don't
have that. And if at least you have it written, like again, smart uses for AI, have AI write
why it did everything. And now at least you have your code documentation. And how many developers
like writing code documentation? Zero, right? So get AI to write the code documentation.
I wonder, does anyone use AI to write comments in the code, or is that gone forever now?
The only problem with that is often when you ask AI the reason why it does something,
it also comes up with a statistical answer.
It's not necessarily the real reason, because the real reason is that statistical inference.
Well, true, true.
It's mathematics, right?
So there's no real way yet to really understand the decision process of AI.
There's an attempt to do that.
I think what Jeff was talking about, though,
on the previous episode was like you still have to review what the AI does, right?
And he was even talking about like junior developers, right?
You know, junior developers do the code review, right,
in terms of what they're going to be doing.
Anytime AI gives you something, you still have to take a look at it,
but you don't have to do the heavy lifting of writing it all yourself.
So there's still, and that kind of ties into your,
how long is it going to take for it to be more independent,
writing all this stuff.
It's probably still going to be a long time, because there are not only technical issues, like you're saying, it's going to be based on whatever models it is,
but then there's going to be that trust factor of, okay,
how do we know this is doing the right thing if I don't look at it, right?
And if it's like outsourcing.
Yeah.
That's exactly what I said on the call.
They're going to give you exactly what you asked for.
I mean, outsourcing has gotten better, right?
But as I said, anyone who's in the early days of outsourcing, especially,
like you got exactly what you asked for and nothing more.
And like, hey, this doesn't work.
Well, you didn't ask for it to work, right?
But the other thing is there's no accountability with AI, right?
Whereas humans, there is accountability.
So I think that all ties into that loop too,
where the human still has to step in and be accountable
because there's got to be some accountability somewhere.
And I don't mean to punish people,
but to make sure things are doing what they're supposed to be doing.
Yeah, and this is what worries me,
because at some point, somebody is going to ask,
who is responsible for something,
and somebody is going to be thrown under the bus
for no apparent reason,
simply because we overtrusted AI, or we just delegated everything without humans in the loop.
And from an efficiency perspective, it's very, very tempting because it goes fast.
It works 90% of the time or 99% of the time.
Now, the one percent where it's not going to work, or something is going to happen, that's when the problems are going to be very difficult to deal with.
I think Jeff, in the previous podcast, also talked about using Git issues, GitHub Issues, for kind of his extended context.
So everything that he asked for, all the refinements.
Because as you mentioned earlier, right, context is limited when you have a conversation
with an AI.
But if you have everything well documented, you at least know what was the discussion you
had.
And I think this was a little bit of a lesson learned for him, because he has been writing a blog series called 31 Days of Vibe Coding: what he learned, how it started with his first day,
and how he also changed the way he is using the AI
and tips and tricks like, you know, put everything in a Git issue,
start every conversation with a new Git issue,
don't build big features, break them down into smaller things,
be very descriptive and everything documented in Git Issues
so later on the AI can also go back or you can yourself go back, obviously.
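As a loose sketch of what that habit could look like in practice (our own illustration; the feature name, tasks, and helper below are hypothetical, not something from the episode), you could imagine generating one self-contained issue body per AI conversation like this:

```python
# Sketch: render a feature and its micro-tasks as a Git-issue body,
# so every AI conversation starts from a documented, self-contained issue.
# All names below are hypothetical examples.

def issue_body(feature, context, tasks):
    """Render a markdown issue body with context notes and a micro-task checklist."""
    lines = [f"## Task: {feature}", "", "### Context"]
    lines += [f"- {note}" for note in context]
    lines += ["", "### Micro-tasks"]
    lines += [f"- [ ] {task}" for task in tasks]
    return "\n".join(lines)

body = issue_body(
    "Add retry logic to the payment client",
    ["Spec: link to the requirements doc", "Constraint: standard library only"],
    ["Wrap the HTTP call in exponential backoff",
     "Add unit tests for the timeout path"],
)
print(body)
```

The idea is simply that the issue, not the chat window, becomes the durable record, so you, or the AI, can go back to the full history of refinements later.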
Yeah, I think vibe coding is great.
I see vibe coding as a great way to create a proof of concept,
refine your ideas.
But then at some point, you need to just stop it.
And then, you know, go back to writing your specs,
figure things out, and then break it down into what I call micro-tasks.
But you have a lot less room for going rogue.
Now, the problem is, this requires a lot of knowledge,
a lot of experience because you need to know, like you said before,
you need to know what matters.
You need to think a few steps ahead in how you design those tasks.
Because eventually they pile onto each other.
And if you do not have this long-term vision,
especially in the way you accumulate tasks, right,
I think it often creates big problems, because you realize too late that you made a bad decision,
and then you have to backtrack, and you end up with a big problem.
So it's not fundamentally different from how we did it before.
Like, without AI, we still did a POC, and then we had to take a step back and kind of look at the bigger system,
and then produce the first version of it
and break it down into tasks.
So it's just the
efficiency pressure
is so tempting with AI
because you can go so fast
and it feels productive.
But you didn't have that before
because you couldn't produce the code
so fast. But I think
this kind of
pressure affects you in very, very different ways.
Because it's like the dopamine thing.
You can produce code and it's really satisfying.
It's like, you know, well, I've produced code.
Well, you haven't.
But it's funny.
You know, Andy, I'm thinking, like, remember way back,
and I think you did it, it might have been with,
I mean, going way back, Wilson Moore, right?
And I think he was saying, like, for performance stuff,
find something to automate that you repeat a lot, right?
And it feels like AI should be used in the same way.
Like find your tedious tasks and use that.
So if I think like construction, like you're going to build a new building, right?
In the old days first, you have to still have to design the building.
You still have to figure out all the safety.
But then when it comes to starting it, you'd have to get like, you know,
40, 100 people to dig the big ditch, right, to start laying the foundation.
Then it's like, all right, well, now let's just get a, you know,
mechanical shovel to build it, right?
And to me, it sounds like the safer way through AI is we still do all the higher-level
tasks, and it's finding those tedious manual tasks to get AI to do for you, but it's doing
it, again, based on what you've designed and what you require.
So, you know, kind of apply the same principles of what do I automate?
It's like, what do I AI in that same idea?
Yeah, that also goes back to what Jeff said, right?
In the end, we still follow our software engineering best practices.
We start with a good requirements document.
We then think about a good architecture based on the environment we live in, based on the constraints.
There might be legal reasons why you have to make certain decisions, but you need to write
all this down, like as you would write it down for somebody that you then contract to do
your work.
And now that contractor is an AI, right?
And you have different AIs that you interact with and so you get your job done.
We go back to a waterfall model.
If it's faster, though.
If it's faster, though, yeah, that's a good point.
I mean, that's an interesting idea, you know.
Would an old model come back if it's made much more efficient?
Like, I think it's, I mean, I think it's a stretch,
but I think people shouldn't completely close the door on that stuff, right?
Like, oh, we can't do that.
Like, well, maybe we can now, you know.
Who knows?
It's interesting.
I mean, I think some new models will come up,
a combination of a few things.
Adrian, I wanted to use the last
couple of minutes that we have with one more topic.
Obviously, we are already on the AI theme.
I wanted to talk a little bit about the AI
SRE agents, because if you look around,
and I think the same holds true
also for your previous employer,
all of these vendors are producing
and coming up with new cloud services
around SRE agents or however they're calling them,
you know, giving potential insights,
into root cause in case something fails.
So, for instance, if I make a configuration and deployment mistake on AWS,
because I misconfigured my API gateway or whatever it is,
I can call the SRE agent, as far as I understand.
And then the SRE agent tells me what is the root cause and gives me recommendations.
Do you have any experience on this, do you know where this is going
and kind of, are you excited about this or is this just, you know,
it's just another service?
Yeah, I think it's a fascinating era.
I think, again, it's related to how to use AI for what it's great at.
You know, what it's really good at.
So pattern recognition, all these kinds of things, you know,
it's good at dealing with large amounts of data.
So it kind of reduces your cognitive load during incidents,
so you can do more of the thinking.
I think this is great.
Now, it's going to be very
interesting to see how it evolves, because it also introduces new types of problems.
It can, and I've seen that happen, it can create anchoring bias, right?
It puts you in a direction so convincingly that, you know,
you go there, sometimes only to realize that, no, that was not the right place,
and you've spent a lot of time.
And that's going to be a problem, I think,
if you offload all your thinking through those agents.
So you need to, I would say,
you need to keep thinking and use an agent
to verify your hypothesis rather than delegate the thinking to AI.
And I think this is really what
is worrying me a little bit.
It also is going to take learning opportunities away from people during incidents,
and all that.
You know, if you ask senior engineers,
often they learn the biggest lesson during outages,
you know, when you have to recover under stress,
assumptions you've made, mental models that were broken.
You know, so you kind of, I think an incident is
a reality check,
and often the reality check,
if we outsource everything,
is, okay, how do we learn?
And then you might end up
with teams that cannot operate if AI
is not there, because there are a lot of
outages that happen where your monitoring
is not there. Flying blind happens regularly;
at least once a year
I have a customer that is flying blind.
So, yeah, I mean, there's going to be a skill atrophy, I think, happening at large scale.
And the question is going to be, it's even more relevant now to do game days, right,
to do chaos engineering, to do all these practices, because they exercise your operational muscles,
you know, if done right.
So I would say, yeah, it's great, but it's going to have to be tempered by organizations
understanding what they're doing, what they're trading off.
So again, you trade efficiency for resilience, and AI is pushing you very much
towards the efficiency side of things.
Yeah.
Yeah, but I really like what you said,
and I think this is something I want to just reiterate.
If we outsource everything, how do we learn?
And we're taking away the learning opportunity.
And I just did a little bit of vibe coding today myself.
So I created a new app.
And immediately, the first batch that it created was full of errors.
But instead of me looking and learning what mistakes it made,
I just said, hey, look at these errors,
these compile errors and fix it.
And I didn't even bother to check, right?
So by just saying, you know, AI, you can do this, do this, do this.
You're right, right?
I mean, we don't learn.
We will never get better at giving better instructions.
And we will repeat these mistakes.
And the question then is, will the AI, at the time when it really matters,
be fast and good, and fix it really in the time we need it?
Or will we then throw humans at
the problem, but the humans have also lost
all the knowledge about
these things.
So that's going to be interesting.
I mean, it's
another abstraction level, right?
You can argue that,
you know, I started
my career in assembly, right?
You can argue that we don't
optimize our assembly anymore,
because, you know,
processors have become
so powerful
and the available memory
is massive.
You don't have to think about that anymore.
And maybe the systems are going to be evolving
so that they can fix things much better, faster than human.
Maybe.
But yeah, then it's like what is left for us.
Like, what are we doing?
Martinis on the beach.
Martinis on the beach, yeah.
So that's how they tried to sell us the dream, right?
Yeah, exactly.
But no, it is a little scary.
I think the profession of software engineering is being redefined.
But we've said that for every technology that has happened in the last at least 25 years.
I've heard the same story every time there's a new technology.
So yeah, it's just we're going to have to reinvent ourselves yet again.
And that's it.
Yeah.
Hey, Adrian, thank you so much for the conversation.
Is there anything else that you wanted to get off of your chest, any other topic before we close out?
Wow.
Like, that's a...
I know, this maybe opens the door to another episode.
Yeah, I would say, stay curious with incidents.
and keep learning.
I think if there would be one thing,
I would strongly encourage everybody
is to see the discipline of resilience
as a learning practice,
nothing else.
It's not a project that you need to track through
efficiency metrics or performance metrics.
It's a learning opportunity to understand
how your system really behaves versus what you have in mind.
And whether it's AI or any of these kinds of tools, it still needs to be within that context of learning.
So I think when you start using AI, think about how it is helping you learn versus how you are outsourcing learning.
That would be my last comment.
And I think that's going to be an awesome sound bite, Brian, if I
may say.
You can say that all you want.
I'm kidding.
Now, AI as a learning opportunity.
And not just as outsourcing.
Yeah.
I think, yeah, if people can understand
resilience is learning.
Yeah.
That's a good thing.
Cool.
Adrian, thank you so much.
All the best for your new
business, Resilium Labs.
Folks, we'll definitely make sure that all of these links
will make it to the podcast description as well,
and also to your blog.
And also, yeah, if there's anything else,
just let us know that we should link to.
Once the book is out.
Exactly.
And I'll just send you copies.
Thank you.
Thank you.
Also, I just remembered:
analyze your near misses. I really like
that lesson as well.
It's very easy to say, yeah, we dodged
a bullet on that one. Let's ignore it.
Fantastic advice.
So thank you so much
for being on. It's been wonderful.
Hope you have a great rest of your week.
Thank you, Brian. Hopefully talk to you soon.
Bye-bye.
Bye-bye.
