PurePerformance - AI-Augmented Chaos Engineering in Practice with Bartek Pisulak

Episode Date: December 8, 2025

Chaos Engineering is the practice of introducing controlled failures into a system with the goal of improving its overall resiliency! What started with "let's see what happens when we unplug that server" ... to "let's simulate network latency issues" or "let's kill critical pods and see if the system recovers gracefully" is now seeing new experiments being conducted that are identified by a new companion: AI.

In this episode we have invited Bartek Pisulak, Director of Cloud Quality Engineering at Pegasystems, who has been educating quality engineers on AI-Augmented Chaos Testing in Practice. Tune in and learn how AI can improve efficiency in the 5 critical phases of a chaos experiment: Steady State, Hypothesis, Run Experiment, Verify, Improve! To learn more about the foundational principles, make sure to watch some of the conference talks from Bartek listed below.

Links discussed:
Bartek's LinkedIn: https://www.linkedin.com/in/bart%C5%82omiej-pisulak-82b94036/
Talk at Cloud Native Days Austria: https://www.youtube.com/watch?v=xUVCKNpMEz8&list=PLtLBTEzR4SqU9GwgWiaDt10-yOVIN0nzM&index=10
Talk at Porto Tech Hub: https://www.youtube.com/watch?v=-ZuEaA2PoTo
Kraken (Krkn): https://github.com/krkn-chaos/krkn
ChaosEater: https://github.com/ntt-dkiku/chaos-eater

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello, everybody, and welcome to another episode of Pure Performance. My name is Brian Wilson. And as always, I have with me my wonderful co-host, Andy Grabner. How are you doing today, Andy? Pretty good, pretty good.
Starting point is 00:00:36 It's the, I think we both survived Halloween, huh? We did both survive Halloween. And today's already off to a good start. If you know, you know, I'll just leave it at that. There we go. Yeah. But how was your, everything good? No, I think you told me that you bought too much candy.
Starting point is 00:00:54 Well, yeah, we usually, so we're on like this weird half-crescent street with no street lights, but a lot of the people have like Halloween lights up and stuff and also we try to sound terrible. We try to lure kids to our houses with candy. Boy, I never thought of Halloween that way. But either way, you know, we just didn't have a high turnout. But last year I started buying, you know, the large candy bars, which was always like the gold, you know, the holy grail for as a kid.
Starting point is 00:01:21 And we just didn't have too many this year. So I have a bunch of leftover large candy bars that I get to eat all myself. So I guess it's a win-win, you know. Yeah. And I think also because you didn't have too many kids, it was not too chaotic. You didn't have to deal with two chaotic situations in front of your house, kids fighting for candy. What do you do if you run out of candy? I have to come up with a hypothesis first, yeah.
Starting point is 00:01:44 I think so, yeah. What happens if the chaos is in my house? What if I ran out of candy? What would I do if I ran out of candy and had none? Yeah, a lot of things. Andy, this is fantastic. A lot of things can happen on Halloween. And you've got to plan for all of them, right?
Starting point is 00:01:59 Yeah, and as Halloween is over, also our jokes are now over because it's about time to introduce our guest. Bartek, thank you so much. Bartek, thank you so much for being on the podcast today. Sorry that you had to go through one or two minutes of the two of us trying to be funny. But we also try to always do a little bit of a segue to the topic, which means today we're going to talk about chaos engineering. Bartek, please do me a favor. Introduce yourself who you are, what is your background, and also what excites you about chaos engineering?
Starting point is 00:02:33 Of course, and thank you for having me. By the way, I think Halloween is a great introduction to chaos in general. So I think it's a really good start. So my name is Bartek, and I'm working right now at Pegasystems. I'm responsible for the overall quality of our cloud infrastructure. So this is what I do with my team. And, of course, if you think about quality, it's a very broad term. And you can think about the technical quality, but it's not everything.
Starting point is 00:03:05 It's also about documentation quality, process quality. So basically, there are a lot of things that you can put under the umbrella of that single word. And why am I really interested in the chaos engineering topic? This is something that I've been exploring for a couple of recent years, and I've been giving different talks at different conferences about chaos theory and chaos engineering. Right now, in the modern world, when you have enterprise systems, which are basically a set of different microservices or services working together, any failure introduced to that complex system can cascade and can cause unexpected
Starting point is 00:03:54 problems everywhere. And this is the technical side of chaos engineering and chaos theory in general in the IT aspect. But also, if you think about small things like how the fact that you had a great party last night as a software engineer can affect the quality of your code, which then can eventually cascade, can lead to cascading problems in production. So this is another aspect of chaos theory in the context of IT. And we saw great examples recently with AWS issues. I think it was last week, if I recall correctly. Azure had recent issues.
Starting point is 00:04:40 So all of that is just a business justification for really, really playing with chaos engineering and really having it and embracing it in different parts of the organization. I have a question now: do you think a valid chaos experiment would be to organize parties in the middle of the week with a lot of alcohol and then see what happens the next day? Oh yeah, it sounds like a great plan. I don't know if you're aware of this, I think it's called the Ballmer Peak, if I recall correctly, which is correlating the level of alcohol in your blood with the quality of the code that you can produce, and there's this small peak of productivity which is the highest productivity. So yeah, it might be pretty interesting, to be fair, to have something like this. So there was a study that said like at a certain level of alcohol you're going to have better code, and then after that you break the threshold? Yes, yes, but the peak is a very, very small range. Yeah, you can hit the peak and all of a sudden after that it goes down sharply. Yeah, I just want to make sure, I just want to make sure
Starting point is 00:05:50 listeners, we are not encouraging you to now start drinking and then writing production code or automation that is doing things. So please stay sober. Especially with all the beer in all the offices. Yeah. Exactly. Encouraging you. Yeah.
Starting point is 00:06:05 Yeah. Now to a more serious note, Bartek, I saw you at Cloud Native Days Austria. It was really good to have you co-present, and you talked about, you gave a lot of background information about
Starting point is 00:06:20 chaos theory. You gave a lot of basic information about also introductory information about what is the general test cycle, software delivery life cycle, where does chaos engineering fit in? So folks, if you're listening to this and you would like to see a good
Starting point is 00:06:36 introduction about chaos engineering, we'll definitely have the links ready for you in the podcast summary. For me, what I would like to know a little bit more, especially for those that may have listened also to some of our previous podcast on the topic, even though the last episode has been, I think, at least two or three years ago.
Starting point is 00:06:57 How has chaos engineering changed over the last maybe a year or two, especially with AI? Because you also brought up in your talk, right, the chaos engineering with AI. Do we see benefits? Like, where can AI really help us in which aspect of chaos engineering? That's a great question. And before I answer directly in the context of chaos engineering, I would like to start with the statement that AI is changing and shaping our industry right now, including chaos engineering as well.
Starting point is 00:07:31 And something that I was presenting during my presentation, one of the slides, or I think I don't have it, to be honest, I had it in my different presentation. But you can easily Google something that is called the Gartner hype cycle, So in short, the Gartner is the company that is providing information about how different technologies, how mature they are right now. And this is a source of knowledge for different companies to really follow the specific trend that is right now on the market or not. And one of the concepts that they have is exactly the hype cycle where they take a look at where the specific technology on the market is. and how mature it is. And if you look at this chart that they have provided,
Starting point is 00:08:25 for most of new technologies, you see the hype, then you see the peak of the hype. Then you have the moment when the hype is going down. And there is the moment when the market is saying, okay, now show me real tools that can help to drive my business forward, so I can really get an advantage. And if you look at the current Gartner hype cycle for generative AI, and you look at different areas that are related to AI,
Starting point is 00:08:58 you can see that most of them are actually started entering the stage where market really wants to see tools that people can use to generate the value. And this is where we are. So the second thing that is related to the current revolution of AI, is that we are seeing more and more companies that they're actually introducing different products and different technologies, which are speeding up the processes,
Starting point is 00:09:27 automating their processes, boosting productivity. And there are different surveys that are saying that some percentage of people who are not embracing or using AI to boost the productivity is actually going to be laid off. And we are seeing that on the market.
Starting point is 00:09:46 One of the research studies that I'm using in one of my presentations is saying that in 2025, 35% of business leaders are saying that AI will replace employees at the companies. But on the other hand, the same people are saying, and it's nine in ten companies, that they're going to be hiring people with AI skills. So, on the one hand, those people who are not following, who are not learning, if you're not learning new stuff, yes, they can be worried. But at the same time, it's not about laying off people. It's about boosting productivity.
Starting point is 00:10:25 And there is a smart quote from Jensen Huang, who is the CEO of Nvidia; he said something really smart that I'm trying to convey in almost every discussion that I have about AI. He said that AI is not going to take your job. The person who uses AI is going to take your job. So that's basically the summary. So whether you like it or not, AI is there. We may not be thinking
Starting point is 00:10:46 that it's a silver bullet for any problem, for solving all the problems, but at the same time, if you don't want to stay behind, you just have to embrace the fact that it's there and play with it and try to see how it can help you boost your productivity.
Starting point is 00:11:00 So jumping into chaos engineering itself: as any other area of IT, it is also affected by the fact that we have artificial intelligence and all those great things that come with it, especially generative artificial intelligence. So what I'm observing is that, as in every other area,
Starting point is 00:11:26 we're seeing more and more proof of concepts, more and more people playing with AI in different contexts of chaos engineering, in the chaos engineering steps that you're normally doing when you think about performing an experiment. And something that I was focusing on during the presentation are those five steps of doing an experiment. So you have the definition of steady state, you have the hypothesis generation, running the experiment, verifying and improving. And in every single one of those stages, you can actually think about using artificial intelligence to boost the productivity and to speed up the entire cycle and to get better results. Thank you so much for this.
Starting point is 00:12:14 I really took down the note, by the way, or the quote from the CEO of Nvidia. I think it's definitely a great reminder that as the industry is changing, we just need to make sure we learn the latest trends and the latest tools,
Starting point is 00:12:28 because as you said very nicely, AI is not going to take your job, but the person that knows AI better than you may kick you out of the job, and so you need to step up. So we all should experiment with it. From this, coming back to chaos engineering and practical examples in these five steps along that life cycle, where do you see the biggest change also in the way you run chaos experiments?
Starting point is 00:13:00 Where have you seen the biggest impact in terms of you leveraging AI for, I don't know, for the hypothesis generation, for the load generation, for enforcing chaos? Let us know what should people, where should people focus on as well? Yeah, sure. So we can go one stage after another, and I think that would be the best way to actually describe what I have in my head right now. So let's start with the definition of steady state, and the definition of steady state usually requires some kind of observability tool to be applied to this specific system. So you can basically identify, first of all, which metrics you should observe,
Starting point is 00:13:47 which metrics are actually defining your steady state. That's the first step. The second is to make sure that you can clearly say in which ranges the specific metric should stay to be able to say, hey, this is my steady state. And normally what is happening, if you try to detect if you're out of steady state, is that you have set some thresholds. It can be a low threshold and a high threshold. And if the specific metric is above or below, then some kind of alarm should be triggered.
Starting point is 00:14:26 Now, what is really helpful in the context of implementing AI for that kind of observability model is that AI can actually learn from the past experience and from the past data and can trigger different kinds of alarms way faster compared to just setting the threshold. So let's imagine that this is the response time of one of your services and one of the APIs that you have exposed. It's, I don't know, set to 200 milliseconds or something like that. This is your high threshold. And you have to wait until this really happens.
Starting point is 00:15:08 If this is higher than that, you're going to get an alarm. Now, if you can feed your model with data from the past, with some patterns, observations, the model can actually learn behaviors, and if it's going to detect that the specific metric is going up, and the pattern is something that happened in the past that can lead to failure of your system, it can trigger an alarm way sooner than reaching those 200 milliseconds, because it can trigger an alarm, let's say, around 100 milliseconds or something like that. So you just don't waste more time before reacting. So that's one of the things where you can actually implement and use generative AI, not only for chaos engineering, but also for observability
Starting point is 00:15:57 and alerting in general. Right. So for that one, it sounds like, you know, and I'm not here to talk about Dynatrace in any way, shape, or form, right? But we've always done the baselining of the data and we're looking for a variance in a baseline. It's not saying, we're not setting 200 milliseconds, it's like this is your normal state, if you vary 5%, 10%, whatever the defaults are, throw the alert. That's the starting point, right? But what you're looking at, what you're talking about, is looking for patterns, right? So let's say you have a response time starting to climb, right, but it hasn't violated your baseline yet, but at the same time you see, maybe it's a response time on a service, but at the same time you see response time maybe on a database starting to climb, right?
Starting point is 00:16:46 That might indicate, and neither of them have violated, but if the AI recognizes that pattern from a problem in the past, it's not just going to say, well, you haven't violated yet. I'm like, oh, I'm seeing this pattern again, right, which makes the analysis, at least from the human point of view, more complicated, right? but if I understand what you're saying the AI can look for these patterns it's learned in the past to say if A plus B equals X then I know we're going to have that problem whereas in the past it was just
Starting point is 00:17:15 but if I understand what you're saying, the AI can look for these patterns it's learned in the past to say if A plus B equals X, then I know we're going to have that problem, whereas in the past it was just, if A is going off, I'm not going to say anything until it reaches that threshold. Is that a correct understanding? Yeah, this is exactly what I'm talking about, and AI is great in finding patterns
Starting point is 00:17:31 using the large data, large amount of data, which is pretty hard for people to do. So that's the beauty of implementing AI in this context. But this is exactly what I'm talking about here. So maybe last word on this. For me, this seems like a combination, Brian, as you said, because the baselining or understanding what is the normal behavior of every individual metric,
Starting point is 00:17:56 this is something where we don't need to throw an AI algorithm at it. This is classical statistical means of defining what is a baseline. The really interesting thing is then to look at historical data and understand if five different things behave in a certain way, then we know from history that something bad is going to happen, most likely. I mean, we at Dynatrace do this through our Smartscape and the way we analyze the data. And obviously, if you're putting all of your observability data into an AI nowadays, you can probably also get into some of this pattern detection. I think that's nice, yeah.
Starting point is 00:18:35 That's a cool, cool thought. It also helps out a lot more with predicting earlier, as opposed to, as soon as we start seeing the signs of that pattern, maybe it doesn't do a high-level alert, but maybe it's like, patterns starting to happen, right? It's awesome. I think the pattern, again, we talked about a bunch of times about, you know, when, Bartek, when you talk about AI hype, right? There's a lot of BS uses of AI these days because it's all hype, right? These are the ones that fascinate me because I'm like, these are actually really, really good, these are really helpful, because it's taking where, you know, it's way too complicated or complex for somebody to set up a system to do and say, all right, let's, you know, if we go back even to the old, the goat analogy, I don't know if you know the goat analogy, it was
Starting point is 00:19:29 the people were using goats to clear fields, right? And this is like maybe within the last 10 years or something, and people who do lawn services were like, oh my gosh, these goats are going to put us out of business. And it's like, no, the goats aren't going to put you out of business because you're much more efficient, right? So it's the idea that whichever is the most efficient way to do it, that's the one that's going to win. And if you're already efficient, you'll be fine. If you're not, it's going to be the new thing. But the goats aren't going to be more efficient, so don't worry about the goats. In this case, we have even better than lawn mowers, right? Yeah, that's an interesting analogy, I've never heard about it, but I think it's really good. Cool. So steady state is really understanding your current
Starting point is 00:20:16 system, understanding what type of patterns that emerge in your distributed system will potentially lead to a problem. So really understanding what is normal and what leads to abnormal situations. So that's the steady state phase. By the way, if you're okay, you have also had this, obviously, in your slides at Cloud Native Days Austria. We'll try to exactly find the link to that time stamp where you presented these cycles, because I think that's a really useful visualization that you used. Steady state, what comes next? Hypothesis, I believe.
Starting point is 00:20:54 Yeah, the next one is hypothesis, but before I jump into hypothesis directly, I want to jump into verify, and there is one reason, because it's very similar to what we have in the case of steady state. And it's also about finding patterns and classification. So basically, if we think about verify and verifying results of our experiment, where you can apply artificial intelligence, it can help you to compare test results against the expected behavior. And again, you can do that in the context of all the data from the past. So you can feed your model with any data that you have already.
Starting point is 00:21:36 And then you can just ask, hey, give me a classification, highlight the deviations and things like that. So again, this is about using the fact that AI is really good at finding patterns, classification, stuff like that. So this is why I wanted to jump directly into this verify stage first, because those two are connected conceptually if we think about using AI. So that's what I really wanted to highlight here.
Starting point is 00:22:04 Let's get back to hypothesis because that was actually the question. So this is, again, the beauty of how AI can analyze a large set of data. So if you can think about an entire application stack. And I'm going to be talking here, let's say, about a Kubernetes stack
Starting point is 00:22:29 because usually when we have some Kubernetes-based enterprise system, it is usually pretty complex. There are a lot of pods there. There are plenty of services running inside. And let's imagine that we have a stack like that with plenty of deployments, plenty of pods.
Starting point is 00:22:51 And if you want to chaos test it properly, you actually need to understand the architecture and you have to design those test cases depending on the nature of the service. And you have plenty of services in the cluster, let's say. So if you want to cover this as an engineer, you need to run hundreds of tests, hundreds of iterations, you need to talk with different service teams, there are additional calls with that, and even then you can miss edge cases. Now, let's imagine that you can actually do the same automatically. And there are different tools like that that help with analyzing either the architecture of the system
Starting point is 00:23:43 or the running deployment and the entire cluster to actually be able to produce hypotheses for different services that are running inside this cluster, which can then be used for running specific experiments. So let's imagine that you have a design, an architecture of your system. That can be one source of data for the AI model, so it can analyze it. You can have some kind of documentation, requirements, plus the running deployment. All of that, analyzed by AI, can produce hypotheses targeting specific edge use cases for every single service. And then this is an entry point for the experiment. So to make this a little bit more real, a hypothesis could mean I analyze my architectural diagram
Starting point is 00:24:35 or I analyze distributed traces from my live application. The AI could say, hey, in your architecture, you have a weak point because 90% of your transactions rely on a central service that runs on a single host that has no failover. This is a clear thing we need to figure out if this is resilient. And so the hypothesis is if you're bringing chaos into the system, it could be that probably the system breaks or the other way. I think the chaos engineering is always trying to test something positive,
Starting point is 00:25:09 at least I hope so. We have this system and we are bringing chaos, and we believe, because it is set up that way, even though it's a central component, it will then automatically scale up with high load, and Kubernetes will automatically figure out how to divert the traffic. So
Starting point is 00:25:27 are these some of the realistic use cases? What is a hypothesis? What will fall out of it? What will be? Yeah, so what you just described is exactly what we are looking for here. We are looking for weak spots, and analyzing an enterprise system,
Starting point is 00:25:45 you have multiple services, you can easily miss those edge cases, miss those things that you just mentioned. Now, the trust in the way AI is doing the analysis is that it won't miss those spots. It's actually going to be very clear in pointing out those weak spots that we can target with our experiment. And we have tools like that out there, actually, especially when we're talking about Kubernetes. I don't want to, like, advertise anything here, but I believe this is an open source project. So, for example, there is a framework called Kraken (Krkn) that you can use for chaos engineering in the context of Kubernetes.
Starting point is 00:26:29 What is a good add-on to this framework is something that is called Chaos Recommender. So it's doing exactly what I just described. It's analyzing the Kubernetes cluster, it's analyzing pods, and it's looking for weak spots. That's one of the examples of a framework that you can use. Another one is called ChaosEater. This one is also, I think, open source, if I recall correctly. Again, in the context of a Kubernetes cluster, it is introducing AI in every single stage of experiments, so it can generate hypotheses, but it also actually can do analysis, can do post-processing and everything that you can do during an experiment, using LLM agents under the hood.
Starting point is 00:26:59 Again, in the context of Kubernetes Cluster, it is introducing AI in every single stage of experiments, so it can generate hypothesis, but it also actually can do analysis, can do post-processing and everything that you can do during experiment using LLM agents under the hood. Yeah, and this is one of the topics I thought was interesting because I think especially going back to the idea of the first step of steady state, right? At least where we're at now, I think, I don't want to say a danger, but a caution of the AI in the hypothesis is it's only going to be able to test for what it knows. So if your inputs are bad, right, you know, we work in observability and we see all the time people only put observability into certain areas, only the areas maybe they can afford to, or maybe they have a mix of tools and getting it all in.
Starting point is 00:28:07 Right, so that's going to impact the data that the AI has to create a hypothesis with, right? So if it's blind in a certain area, it's not going to be able to account for that. But on the flip side, you know, I completely agree. If you do have a data set, it's going to be fantastic at finding those real edge cases. You know, I don't know. I'm curious, like, let's assume you had complete coverage and complete visibility into everything, so the AI had all visibility. We know in the earlier days of chaos testing,
Starting point is 00:28:44 some of the hypotheses were, well, what if somebody trips over a power cable in the data center? Right? How are you going to detect that? Now, I don't know if those are still done, right? I don't know if it's still valid, but I think there is definitely a blend of automation testing versus human-generated chaos. Do you foresee it always being, okay, we have the AI to do a bunch of all the really cool
Starting point is 00:29:13 crazy edge cases like that, but we should also still have people thinking even further of things that the AI wouldn't even consider, likely tripping over the cable kind of thing, to work in tandem as it goes, or do you see it possibly just all AI completely? I think that you made a great point mentioning that the data and quality of the data that you use for training a model is the key part here. So the better data we provide, obviously, the more precise answers we're going to get and more reliable ones. So that's, I think, the key point. Now, if we're going to assume the scenario that you mentioned, to cover all the data that you can possibly get about the specific system.
Starting point is 00:30:04 I strongly believe that still the human touch is really needed. AI won't be able to provide all the possible scenarios. I don't believe that it's ever going to happen, to be honest. And I still believe that the person, like, when you have a model, I mean, when you're applying an AI model to any IT area, it can be chaos engineering or something else, what is really necessary is this feedback loop that you have, so the model is generating some answer,
Starting point is 00:30:43 but then this answer always needs to be somehow validated by the person, and this should be like a constant thing. So a human being a part of this feedback loop is a crucial thing, also in the context of chaos engineering. So the tandem that you mentioned, I think this is the way to go. Right. So it's more like we free up the humans from doing the more mundane ones of trying to find all that other stuff so they can really focus on verifying that the AI has got everything that it needs, and also getting to the more creative, outside-the-box kind of thing. So it's not a full replacement, but let's give the tedious work
Starting point is 00:31:17 to the AI to do. Yeah, and well, just like if you think about, you know, machines in a factory, you still need people in there to repair them, make sure the stuff's coming out fine, and also do other things that the machines can't do. So it's just changing the role of the chaos engineer, but they're still part of that, but they don't have to do as much of the dirty work themselves. Yeah. Yeah, exactly. And I can give you a good example, which is kind of outside of the chaos engineering space, but this is something from the field.
Starting point is 00:32:08 So let's imagine that you have some kind of system that is scanning Docker images and finding vulnerabilities. And there are different tools out there that are reporting that kind of vulnerabilities that you can use. One common thing that is happening is that plenty of those vulnerabilities are actually false positives, because of different reasons and the way those scanners work. And if you get a list of those vulnerabilities, you can assign it to an engineer owning this image to verify if those are false positives or not. Now, what if we can train a model that, based on the SBOM, for example, generated for this image, or any data that you can provide, can go through this list and give you an initial assessment of whether the vulnerability is potentially a false positive or not, with the reason.
Starting point is 00:32:59 Now, how easy is that for the engineer to have a specific suggestion now, to verify if what AI produced is valid or not? It's way easier, because if you have the information and the reasoning behind why a model decided that this specific vulnerability may or may not be treated as a false positive, it's way easier to go straight to verifying this hypothesis that has been provided by artificial intelligence. So this is one of the examples where you still need someone to take a look at the final result, but it's a huge boost of productivity for the entire process.
Starting point is 00:33:43 As you said, while it might be outside of chaos engineering, even though actually I think there is a use case for chaos engineering on this as well, because you could run an experiment where you are deploying vulnerable code and you want to make sure that your observability actually detects that you have vulnerable code. So right there, that's the human element of the chaos hypothesis, Andy. Yeah, exactly. So do you identify and how fast can you react to it? And the hypothesis could be, I don't know, we are fixing it in an hour,
Starting point is 00:34:14 Yeah, exactly. So do you identify it, and how fast can you react to it? And the hypothesis could be, I don't know, we are fixing it in an hour, but maybe you prove through your chaos tests that it's never been identified. Hey, I have another question. Is the idea of the chaos test, so let's assume the hypothesis, we analyze the architecture, we identify that if we introduce latency between service A and service B, we assume something will happen, right? Is then the idea that you first run the chaos experiment to really validate your hypothesis that it really fails, and therefore if it fails you have to do something, but if it doesn't fail, actually you made a wrong hypothesis? Or is it you find the hypothesis and then you actually improve the system so that when you run the experiment, nothing will happen?
Starting point is 00:35:04 What is the right approach? Typically what I follow is actually to make sure if my hypothesis is right or not. So I just run the experiment and I'm just observing what is going on. I was right or not. If I assume that something is going to fail and it's working, I'm happy, of course, in a positive way, but at the same time, I'm becoming suspicious why, you know, why I was wrong. But yeah, it could be that your model, maybe it could be that you didn't have all the detailed information about the architecture.
Starting point is 00:35:38 Maybe the quality of the documentation wasn't good. You mentioned this earlier. The quality of the data is the thing. And whether it's you as a human or an AI mix and creates a hypothesis based on flawed data, this could happen. Yeah, and of course one of the reasons why I do that always
Starting point is 00:35:55 is exactly what you just said. There may be multiple reasons why I may be wrong. But if I'm wrong in the right way, so everything works after the experiment, even if I assume
Starting point is 00:36:05 that something is going to break, it's still less work for me to verify that with an experiment than, let's say, not doing that and trying to fix something that is actually working, yeah?
Starting point is 00:36:23 So that's basically the thing here. Cool. Awesome. So if I recap, we started with steady state where I want to understand how our system is behaving, what patterns do we have? Very closely related is actually the verify you said,
Starting point is 00:36:39 right, obviously, because it's similar steps. But then I'm looking, it's easy for me now I'm looking at your circle with the five different stages. connected together. So folks, if you listen to this, really check out the presentation, check out the visuals. So steady state, coming up with a hypothesis, AI can definitely help here to come out
Starting point is 00:37:01 connected together. So folks, if you listen to this, really check out the presentation, check out the visuals. So steady state, coming up with a hypothesis, AI can definitely help here to identify weak points, maybe in combination with some well-known things. Tripping over the cable is one thing, but we obviously know certain rules. Right. When you are, if you think about Java-based applications with memory pressure, with garbage collection, I think you can make certain, you don't need an AI, I think it's just like certain rules that we've learned over the years, or the database connection pool sizes,
Starting point is 00:37:26 latency, these are all things that typically are weak points in systems. So hypothesis is great. Now, running the experiment, is it just a load test that you run and then you are injecting chaos like killing pods or injecting latency or can we also use AI there to run the experiment?
Starting point is 00:37:48 Yeah, so the main part where I see artificial intelligence can help and speed up productivity in terms of running the experiment is actually creating the experiment. Of course, it depends on what kind of framework you're using. It can be any framework that you have out there. And I think it's not different than just generating, I would say, code. So you have different AI-based tools like Copilot, which can speed up
Starting point is 00:38:29 generating the code that you're actually using to create your service or your product. And it's exactly the same in the context of chaos engineering. And that was the meaning of the slide. So, for example, and this is something that we've been talking about during this presentation, if you're using something as simple as AWS Fault Injection Simulator, you can actually use Amazon Bedrock or whatever it is to generate a template that you can just upload to Fault Injection Simulator and just run. That's it. It's no different than, I don't know, generating unit tests using some AI tool for your code. Yeah.
Starting point is 00:39:02 So this is exactly what I had in mind when I was talking about running experiments. So it can, as the example from my slide, yeah, you can just go and say, introduce 100 milliseconds latency on database calls, and you just get a template that you can upload to AWS Fault Injection Simulator and just run. That's it. You don't have to write any code. That's the productivity boost that you can get
Starting point is 00:39:28 when you're talking about this stage. Cool. And yeah, I would think too, and I don't want to spend much time here because I know we need to get to a couple more, but going back to our old origins, you know, with the load testing, like if you're going to be writing scripts to be running traffic against these systems, this is where I think AI can really help
Starting point is 00:39:47 as well. I mean, yeah, I know as old script creators, right, we love to pride ourselves on the complexity of our scripts, but to the same point with AI, you know, even if I go back to LoadRunner from back in the day, it used to have a correlation engine. So you run the script twice and it would look for what changes and it would replace, right? No AI. It was just, you know, but at the same time, if we take the same idea of the inputs for the AI for the chaos, you can point the script generator at the API endpoint. You can point it at all the documentation for the API, all the different kinds of inputs and everything, and it could just generate all those scripts for you so that you can then run those a lot quicker and more efficiently in this chaos cycle. So even outside of chaos, the components of building that whole chaos cycle could be used in a lot of ways. And in fact, I wonder, yeah, how much longer test creators or script creators will be around, because that seems like an easier one, not necessarily the load design but the script creation. Well, even the load design, I would argue, right, because if an AI gets access to your observability tool where you know how much traffic is coming in, you know, simulate, create a load profile
Starting point is 00:40:58 that covers the peak period in a month. True. What is the peak period? Yeah. So I think that's also all good. Cool. Yeah, and of course, one important thing here is that I don't believe that
Starting point is 00:41:11 AI can generate, I don't know, a full enterprise application. Again, this is this human in the loop, yeah. Whatever is going to be generated, it definitely needs to be reviewed, but that's a boost of productivity again. Yeah.
Starting point is 00:41:26 Yeah. Very good. Trying to kind of finish that cycle. So steady state hypothesis, running the experiment, now we're actually at Verify, which you say it is closely related to steady state. What happens in the Verify?
Starting point is 00:41:44 Yeah, so conceptually it is really close to the idea that we have in steady state. So again, you have some test results that you get from the experiment. You have some expected behavior. You have some data from the past experiments. You can actually use machine learning or any other approach that is backed by AI
Starting point is 00:42:10 to classify whatever results you get, to link metrics to causes, highlight deviations, everything automatically, based on data. So the most important points can be highlighted. And very often what is happening after running an experiment, when you get all that data, you can also get a lot of noise. Now, AI can help with filtering that noise
Starting point is 00:42:37 and pointing to those specific, most important metric patterns that you can get after running the experiments. So that's, again, that's the beauty of finding patterns by AI. So pretty easy and very connected conceptually to the steady state.
Starting point is 00:42:56 Which means we are at the last step of our chaos life cycle: improving, improving the system. So assuming that we really have our hypothesis confirmed that something breaks, I guess this is now the place to improve the system.
Starting point is 00:43:14 Yeah. So again, if we think about how we can apply AI here, there are a few levels that we can talk about. The most common case, and the easiest to apply, is actually, again, to use AI to propose specific targeted improvements in your system. And we're coming back kind of to the hypothesis here.
Starting point is 00:43:46 So if the model knows about the architecture, the model knows about all the documentation, use cases, deployments, et cetera, et cetera, then based on the results and based on the expected behaviors, it can analyze even the repository of your code and can propose specific changes that you can apply to improve. That's the one level. The second level, obviously, because where we're heading with AI is automation.
Starting point is 00:44:11 So the second level is where your system can actually detect those improvements and can also apply them. Now, let's imagine the situation where we have all of that automated, and you constantly and continuously run the experiment, get the results. After getting results, automatically, AI can propose improvements, can apply those improvements to your repository, can rebuild your service, can rerun all the testing, including the same experiment. So you have like a continuous improvement cycle. Of course, in the perfect world without the human intervention,
Starting point is 00:44:58 but as we discussed previously, we need someone out there to take care of that entire process. But if you can automatically do all of that, again, it's a huge productivity boost. So there are a couple of levels of how autonomous the entire improvement system can be. There is one use case that I'm also talking about during this presentation because I found it really interesting. It's the idea where you actually have AI agents running in a specific system, which can fix issues in the system at runtime, like, for example, changing the configuration of your
Starting point is 00:45:40 system. And one of the examples, I have no idea if this is true or not, but I find it fascinating as the as the example is the auto remediation agent by Netflix. They call it Nightingale. And again, I don't have any proof that it's true, but the entire concept is really, really great. So there's an agent they have that can apply memory configurations on the fly, depending on how the system behaves, which they say can remediate 56% of all the
Starting point is 00:46:15 configuration, memory configuration errors, and then can reduce cost at the same time. So, and all of that is based on artificial intelligence, which is analyzing, on the fly, the patterns of what is going on in the systems. I found it really fascinating in terms of improvement. Yeah, it's really cool. I want to, Brian, this reminds me, and maybe a quick shout out to our friends from Akamas. That's exactly what I was thinking.
Starting point is 00:46:39 I was going to figure that point of, yeah. Yeah, yeah, because Akamas, right, we had them on the podcast and they are basically, doing continuous performance improvements where they run constant load and then make changes to all the different configuration settings where it's in your Kubernetes, in your JVM, in your databases
Starting point is 00:46:57 and basically do goal-based optimization. So they are basically optimizing the system until they reach a certain goal. And the goal could be faster performance, lower memory footprint. They're doing something similar in that area, yeah. Yeah, if we just go back to their early days, I know they do a lot more now, but the early days was like there, how many JVM settings are there? Like a couple hundred, right?
Starting point is 00:47:21 And people know just about a few, so the idea was puts it in there, looks at your observability to see what's going on. It makes a tweak, checks your observability, see if that made anything better, and it just keeps on tweaking until it gets the right state. Again, relying on those inputs for steady state, right? So, you know, the other really cool thing I was thinking about with this whole idea is, you know, listening to, or if you think from the scientific method, right, your goal is not to prove your hypothesis correct. It's to prove your hypothesis wrong, right? When you're going to do an experiment, you know, proving it correct. You can never prove it correct, but you can prove it that you haven't been able to fail it, right? And if you think about the process of doing that, that requires multiple different iterations,
Starting point is 00:48:13 okay, the first thing I looked at didn't prove it wrong, but now how many other things do I have to do until I get to a confidence, like, okay, I'm very confident that I can't prove this wrong. Still doesn't mean it's right, obviously, in the scientific idea, but it just hasn't been proved wrong. The idea of, okay, what happens if our database is out for an hour, right? many different iterations and many different tests of how that might happen, this and that, right? So that AI is going to speed up all those iterations of how to do that because it's going to know all the different levers that can pull to stop talking to the database for an hour, right?
Starting point is 00:48:45 Yeah, and this is exactly what it's all about. And in the end, as we said a couple of times, we need someone, a human, looking at the entire process. But there is one thing that I'm always talking about when I'm describing the mission and vision of my organization, which is that we want quality at speed, which means that you have to automate. And now, having artificial intelligence, we can actually do that, because in the past there were lots of areas that you couldn't automate, really. You needed to really have a person doing all that tedious work instead of the same person doing creative stuff.
Starting point is 00:49:53 Right now, you can actually use artificial intelligence and improve most of those areas. So that's the beauty. It's all about efficiency. Why have a human, or even why have a human rack a server when you can spin up a pod, right? This goes into the AI stuff, like, why have a human do any of that stuff?
Starting point is 00:50:11 if you can put that stuff onto the AI and have the humans do the pieces they need. One last thing I wanted to get in there, Andy, was, Bartek, you mentioned all the different inputs for the, after validation, the step is called what, like, not remediation, but what's the step after validation called? Improvement. Improvement, right? And one thing you said in the beginning, which I thought was really fascinating, Andy, but it didn't come up when you said this other part, is requirements, right? So in terms of the improvement, right, putting the, if the AI has access to not only what our goal is for speed and all this, but maybe we only want to spend, you know, so much on our infrastructure to run this component.
Starting point is 00:50:55 Maybe this component has to be responding within a certain amount of time. Like if you're rolling out a new release or not even necessarily new, but what are the requirements for some of these pieces so that we know from a business point they're successful? including costs and all that, right? Because if the remediation is, hey, we have to throw, you know, another 100 virtual CPUs at the problem, right, that might not be the good remediation, right? So by adding those requirements, as you mentioned earlier, the AI is going to put that all in mind and be finding a solution that fits into all those aspects of it, the business side as well as the technical side.
Starting point is 00:51:34 That's true. That's correct. That's one of the reasons why I mentioned requirements. Exactly, exactly because of how you just described. You know, overall, the software that we write is running some kind of business, and this is what the business is looking for, yeah, more for making sure that any requirements that the specific system is having and the requirements that our clients are looking for are fulfilled.
Starting point is 00:51:58 Bartek, unfortunately, time is up, because it's amazing how time flies when we speak about topics and when we have guests that are working in the area where we have our background, which is around quality and about performance testing. I can just encourage everyone, check out your talks, we'll link to one or two talks that you gave recently, especially the one at Cloud Native Days Austria. Also, Bartek, if you're okay with it, I will share the link to your LinkedIn profile so people can
Starting point is 00:52:40 also connect and follow up and yeah I'm sure now that we are connected I'm pretty sure we'll cross paths anyway and if there's anything coming up if AI is going to make some additional leaps in the future that will make the life of
Starting point is 00:52:58 engineers even easier, then we're happy to have you back for another episode. Yeah, of course, sounds great, and you know, thank you for having a great discussion today. Yeah, it was amazing. And I can't wait to see. My last thought on this all is,
Starting point is 00:53:13 as people start incorporating AI into more parts of their architecture, right, I don't mean in the chaos cycle but in their actual deployment, messing with the AI is going to have to be part of chaos tests as well. So you're going to have AI trying to mess up the other AIs that are running in the system. So it's just this fun
Starting point is 00:53:34 Yeah, this stuff can go on forever. It's amazing though. Thank you so much, Bartek, for being on. I wonder if AI gets as excited when it learns this kind of stuff like we do. But you know, for Andy and I, chaos testing, or chaos engineering, has always been, you know, as we've been exploring it, it was fascinating, especially for me, coming again from the old, you know, performance and load testing side, where it was not quite, you know, it was this grand extension to what we were trying to do in performance testing, or not even, you know, trying to simulate a load, see if the system fails,
Starting point is 00:54:08 and then this broader, really amazing field of chaos engineering on top of that, and then add this AI stuff into the cycles. It just reinvigorates the real coolness of chaos engineering. So look forward to seeing what comes with it. And I hope everybody enjoyed today's episode. So thanks again, Bartek. And Andy, thank you.
Starting point is 00:54:28 Thank you. Bye-bye. Thank you.
