Software Huddle - AI and Proactive Reliability with Kolton Andrus

Episode Date: April 8, 2026

Today we're talking with Kolton Andrus, the Founder and CEO of Gremlin, about what happens to reliability when AI is writing most of the code. Kolton helped build the Chaos Engineering practice at both Amazon and Netflix before starting Gremlin. In our conversation we talk about scar tissue: the intuition engineers develop from being woken up at 3:00 AM to fix production outages, and how AI doesn't have any of it. It generates code in an afternoon that previously might have taken a team weeks to build, but none of those painful lessons come along for the ride. We dig into why 10x more code might mean 10x more failures; the concept of reliability guardrails (think ethical guardrails, but for keeping your systems up); why you still have to test in production no matter how good your staging environment is; how Gremlin is rethinking their product for a world where agents, not engineers, are essentially the primary users; and why we're entering a painful, narrow part of the hourglass before AI gets good enough to handle all of this on its own.

Transcript
Starting point is 00:00:00 Hello, everyone. Welcome to Software Huddle. I'm Sean Falconer, and today we're talking with Kolton Andrus, the founder and CEO of Gremlin, about what happens to reliability when AI is writing most of the code. Kolton helped build the chaos engineering practice of both Amazon and Netflix before starting Gremlin. And in our conversation, we talk about scar tissue. Yes, scar tissue. The hard-won intuition engineers develop from being woken up at 3 a.m. to fix production outages and how AI doesn't have any of it. It generates code in an afternoon that maybe took a team previously weeks to build, but none of those painful lessons come along for the ride. We dig into why 10x more code might mean 10x more failures. The concept of reliability guardrails, think ethical guardrails, but for keeping your systems up, why you still have to test in production no matter how good your staging environment is, how Gremlin is rethinking their product for the world where agents, not engineers, are essentially the primary users, and why we're entering a painful, narrow part of the hourglass
Starting point is 00:00:59 before AI gets good enough to handle all of this on its own. And with that, let's get into it. All right, Kolton, welcome to Software Huddle. Thanks, pleasure to be here. Yeah, thanks so much for being here. You know, it's funny, we're going to be talking a lot about, I think, reliability, you know, what does that mean in sort of this AI era? We had our own reliability challenges just getting connected today.
Starting point is 00:01:22 Normally we record these using Riverside. Riverside seems to be having some connectivity issues. We switched over to Zoom in order to solve that problem. So we were joking that perhaps someone had AI coded something last night, put it into production, and that led to some outage on their end. So I don't know, maybe it's a taste of what's to come. Yeah. Well, I think that's the thing about reliability is, you know,
Starting point is 00:01:44 it's often not at the forefront of people's minds until it doesn't work. And then all of a sudden it becomes the most important thing. You're not able to do what you're there to accomplish, and you have to react in the moment, and that's why we have SREs, that's why we have ops teams, that's why we talk about reliability, and really try to avoid these events
Starting point is 00:02:02 and turn them into non-events wherever possible. Yeah, do you think that's one of those things where it's, you know, one of those functional areas where, you know, when everything is going right, nobody's, like, nobody cares, basically. And then when things are going wrong, it's like, you know, suddenly they are in the hot seat, and it's like the most, uh, you know, It's what everybody essentially suddenly cares about.
Starting point is 00:02:26 Yeah, 100%. I think that's actually one of the biggest problems about the reliability space is, you know, when things are working fine, it's like, why do we even need you? What do you even do here? And when things are broken, it's like, oh, my gosh, why is it broken? What's going on? And we see that with the way incentives work within organizations as well. Oh, hey, you know, we haven't had an outage in a while.
Starting point is 00:02:49 Everyone's kind of gotten, you know, amnesia about it. they've forgotten, oh, well, reliability, we're doing fine. We don't need to invest more. We don't need to talk about it a lot. Oh, we just got off of a major cloud outage that caused us to go down for eight hours. It's the only thing we're going to talk about for a week and it's going to become the most important company, you know, objective for the next quarter or two. Yeah, I think you see that with like backups and disaster recovery as well. It's like insurance essentially where, you know, it's there, it's easy to deprioritize and sort of kick the can down the road and then suddenly when there is, you actually need it, it becomes a big, big issue for the company.
Starting point is 00:03:30 Yeah. Yeah. So I wanted to take things back a ways, you know, before Gremlin, you were at Netflix, you spent time at Amazon. You know, can you kind of take us back to those early days? You know, what did even this concept of like, you know, chaos engineering even mean? when the term was first being used. And I know Gremlin has evolved a lot as a company since then,
Starting point is 00:03:51 but I just want to get the sense from you, you know, where did all this come from that led eventually to Gremlin? And how were kind of people thinking about that back in the early days of Netflix and Amazon? Yeah. So my journey really started at Amazon. I joined the retail website availability team. And it was our job to make sure the website didn't go down. And this was 2009.
Starting point is 00:04:12 So, you know, nowadays, quite a few years ago. And the truth is we were having outages on a semi-regular basis, and it was a big deal. And every Amazon retail outage was a big impact to revenue. It was money lost, and it was customers that were upset. And we had a great incident response team. We had a great, you know, we had call leaders. I had the opportunity to serve as a call leader. So one of the, you know, dozen people that took turns getting paged when there was a large
Starting point is 00:04:41 outage and managing that outage, triaging it, correcting the issue. But what we said as a business is this is too important for us to just react to it all the time. We really need to find a way to get in front of it. And that's where this idea of proactive failure testing came from. What's funny is we, I don't want to toot Amazon's horn too much, but we were talking about this before Chaos Monkey hit the zeitgeist. And it was an idea that we had come up with and we wanted to go in classic Amazon fashion, build a service, build a set of tooling. I did a bunch of really what's now developer evangelism within Amazon to go promote the idea, to tell people it was important, show them how to do it and teach them and get them doing it. And we had a lot of success with that
Starting point is 00:05:28 at Amazon. We had hundreds of teams that were doing this on a regular basis. We saw improvements to our reliability. And while we were in flight of building this and rolling it out, that's when Netflix launched Chaos Monkey. And it was really a great moment because the whole industry was feeling this pain. And a lot of people were talking about it. And it gave us this opportunity to really understand, hey, this is an important thing. But it's a thing that we can be getting in front of and doing a better job of, as opposed to waiting for it to occur and then just, you know, picking up the pieces after things happened.
Starting point is 00:06:04 So I had an opportunity. I was just going to say that. I feel like there's, you know, consistently you see this with while the, you know, I think like large tech companies, you know, whether that's Netflix, Amazon, Google and so on, where, you know, they kind of end up establishing these like functional areas ahead of maybe where the rest of the industry is. And perhaps that's because they're operating at a scale that's beyond like what most companies can even, you know, comprehend. So they're kind of at the very forefront of, you know, pushing technology. They hit these things earlier, maybe than other companies and probably at a
Starting point is 00:06:42 scale where, you know, the existing systems just don't work for them. So they have to come up with some sort of solution. Do you think that's a big part of that? Yeah. Yeah, I do. And I think it was, it's about feeling the pain. So you're at scale. You know, Amazon was one of the first adopters of what was really service-oriented architecture before it became microservice architecture. Yeah. And while it's a great pattern for distributing work and being able to segment how your code is developed and deployed, it adds all these network dependencies in between all of your services. And so now your failure modes have grown exponentially and you're dealing with that problem. So by being an early adopter of some of those technology patterns, you really feel the pain and have to go devise solutions
Starting point is 00:07:23 to get in front of and address those. Yeah. And there's, I remember years ago I had an old boss of mine, and he was hired into a company to, he was trying to organize and kind of modernize their engineering practices. This was way before, you know, any concept of like chaos monkey or chaos engineering was around. But I remember him telling me that he just started pulling the power out of computers to see what would break. Because, you know, there was so much stuff essentially running on local machines at that point. This is, you know, 25 years ago or something where it wasn't super uncommon for someone to be running something off of their desktop. Probably not ideal, but that was the situation at the time.
Starting point is 00:08:04 So, and I think there's this kind of classic, you know, image of chaos engineering of like pull to plug, see what happens. Like, how has things matured, you know, since those days? Yeah. Well, I think it's funny you tell that story because that's really what the genesis of Chaos Monkey at Netflix was. And so, you know, while we're building this at Amazon, Netflix is moving to the cloud. And one of the things they come to realize is, you know, in the cloud, a host can be replaced from underneath you. at any time. And so if you have developers that have put a bunch of state on disk or are trying to run things in kind of that old fashion where you had a few computers, you had full control,
Starting point is 00:08:41 you could make those assumptions and they could hold true. Now those assumptions are false. And they did a great, I think it was really a management decision to say, hey, this is happening regularly in the cloud. So we're going to go make it happen to our developers in dev, in staging, in production so that they're understanding this is what the world looks like. And what that uncovered is all these places where the services were storing state on disk so they weren't idempotent. And so it forced us to really learn and adapt to that cloud environment. That was really predated my time.
Starting point is 00:09:19 I joined Netflix after the first wave of Chaos Monkey. And when I joined, there was really an opportunity to take that to the next level. And so they'd been doing a lot of the Chaos Monkey approach, but what we needed to do was take it further. And really, give engineers the tools to go do this testing proactively and do more than just host rebooting. I mean, I think Chaos Monkey's great, but it's like the most basic failure mode, a host went away.
Starting point is 00:09:46 And what we really needed is, you know, hey, what happens when we can't talk to a dependency? What happens when a service we're talking to, you know, get slow or disappears? That's really that next level. And so that was the opportunity I had while on Netflix was to really build upon the good work that had come before and go create a set of tooling that made it safe and scalable to run these tests that covered a wider variety of things. And I think safety is an important piece of that. The other part of this is, you know, engineers don't want to cause outages and they don't want to cause failures.
Starting point is 00:10:19 And so if they view it as risky, if they view it as something that they might get in trouble for because they accidentally caused a production outage, they won't do it. And so that's where, you know, I think I personally learned a lot about how do we do this safely. So to answer your question, you know, how has it evolved? Well, in the early days, it was kind of the Wild West. Go pull some plugs, go shut down some hosts. Let's just see what happens and we'll respond and react in the moment. But in order to build a repeatable engineering process that felt safe and, you know, wise to be doing, it really required: let's build a set of tooling. Let's talk about the scope, the blast radius. If we're going to run this experiment, where are we going to run it? What might the impact be? Let's build in things like the ability
Starting point is 00:11:06 to revert experiments if they're not going the way we expect. Because if something goes wrong, and you unplug it, you know, back to the data center analogy, you unplug a cord and everything goes wrong, it might take you hours. In fact, one of my first Amazon outages that I was part of, we found out that we had a network, part of our network that ran through our office. So back to your like, oh, it's running on my desktop. And when they ran the exercise and severed that network connection, not only did it take down the website, it took down our offices as well. And so we were kind of flying blind.
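(A rough sketch of the pattern Kolton is describing here: a scoped blast radius, a watched health signal, and an automatic revert. All of the names and numbers below are illustrative assumptions, not Gremlin's or Netflix's actual tooling.)

```python
# Hypothetical sketch of a scoped, revertible latency-injection experiment.
# None of these names come from real chaos tooling; they only illustrate the
# "blast radius + abort condition + guaranteed revert" pattern.
import time

class LatencyExperiment:
    def __init__(self, target_service, delay_ms, blast_radius_pct, abort_error_rate):
        self.target_service = target_service        # dependency to slow down
        self.delay_ms = delay_ms                    # injected latency
        self.blast_radius_pct = blast_radius_pct    # % of hosts/traffic affected
        self.abort_error_rate = abort_error_rate    # revert early if errors exceed this

    def inject(self):
        print(f"Injecting {self.delay_ms}ms latency on calls to "
              f"{self.target_service} for {self.blast_radius_pct}% of traffic")

    def revert(self):
        print(f"Reverting latency injection on {self.target_service}")

    def run(self, duration_s, error_rate_fn):
        """Run the experiment, polling a health signal and reverting early if needed."""
        self.inject()
        try:
            deadline = time.time() + duration_s
            while time.time() < deadline:
                if error_rate_fn() > self.abort_error_rate:
                    print("Abort condition hit, reverting early")
                    return False
                time.sleep(5)
            return True
        finally:
            self.revert()  # always undo the injected failure, even on errors

# Start small: 5% of traffic, auto-revert if the error rate exceeds 1%.
# experiment = LatencyExperiment("recommendations", 300, 5, 0.01)
# experiment.run(duration_s=600, error_rate_fn=lambda: fetch_error_rate())
```

The design choice worth noting is the finally block: the injected failure is always undone, even if the experiment itself goes sideways, which is the "safely revert" step described above.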
Starting point is 00:11:42 We had to go do a lot of work to bring everything back up. And it was a great thing to uncover. We separated that. We made sure that wasn't a shared dependency anymore. But that ability to safely revert things is, I think, one of the steps forward that were made, especially in the early 2010s. Yeah, I still remember those, even the early days with myself as an engineer sort of wrapping my head around going from sort of being able to deploy things to, you know, physical like hardware that you could either racks that existed in the business or at least. the transfer of something where you still had the concept of like this to the world of cloud and not being able to necessarily just think of like, okay, well, I can just save this to a disk
Starting point is 00:12:33 on an EC2 instance. Like that might not be the same essentially EC2 instance that I'm hitting later and those things move around. And that was a little bit challenging in the early days kind of wrapping your head around. So I can imagine that led to all sorts of problems. In terms of, you know, we talked about how a lot of this stuff started at companies like Amazon and Netflix. At what point did it go from, I don't know, what people might consider like these elite engineering orgs to something that mainstream engineers started to do? Yeah, well, I think, again, architecture was really a driver. So the move to the cloud and the move to, you know, microservice architecture meant that everybody else started to feel this pain. And so I saw that a lot, yeah, kind of early to mid-2010s where people were really adopting those patterns.
Starting point is 00:13:23 You go to a conference. Everyone was talking about how do you do microservices right? And as everybody really embraced that, they started feeling the pain and finding that they needed to change their approach. You can't just rely... it's funny, you talk about the data center. But, you know, I did some data center work early in my career, you know, double redundant power supplies, you know, hot-swappable RAIDs, you know, all the things that we'd built into hardware to really try to address these problems so that we wouldn't have to, you know, deal with the servers going down. And, you know, I remember a couple outages where I had to go to a data center and swap some things around in order to bring things back up.
Starting point is 00:14:03 And so cloud gives us a lot of advantages in that regard, being able to handle things programmatically, being able to have a lot of redundancy built in. But as you mentioned, it really changed the paradigm of software development. You can't rely on those individual hosts that can come and go. You can't rely on an individual disk. You need to have a separate disk service, so to speak, that allows you to store and recover those things. So I think that's really where you saw the shift start to happen. Yeah.
Starting point is 00:14:32 And I guess like sort of fast-forwarding to today, I think we're still in a place where the majority of companies treat reliability as something that's reactive in nature. You know, something breaks, you go fix it, you write a post-mortem, you wake somebody up at two in the morning, you know, and a team is deployed. What has to change, you know, culturally or technically for proactive reliability to somehow become the default? Yeah.
Starting point is 00:15:00 I've thought a lot about this because we've seen this and we've struggled with this with some of our customers over the years, especially the early days of Gremlin, kind of the late 2010s, where a lot of folks were still in this reactive mindset. And I think the thing that I learned from that is it's really about incentives. Does leadership say that it's important to be ahead of this problem, or is leadership okay with just reacting when it occurs? If you don't ever give your engineers time to work on it,
Starting point is 00:15:29 and there's always a product roadmap, there's always a deliverable pending, there's always something that people are working on, then they're never going to really get in front of the problem. And so that's where, yeah, the way I describe it to people is you can pay the cost when the outage occurs. And it's actually quite expensive. It's more than you think it is. There's whatever revenue you lost, there's the customer and the brand impact. And there's the engineering impact.
Starting point is 00:15:56 All the engineers that are involved in that, they get woken up in the middle of the night. They're not doing as good of work the next day. They're spending the next day or two digging through monitoring and logs to understand and piece together what happened. You hold a post-mortem, you know, you find some action items and half those action items go off into Jira to die and don't even necessarily get properly remediated. And so you can spend all that time or you can amortize it. You can spend, you know, an hour a week or, you know, an afternoon a month. And you can really mitigate a lot of that time that you spend. So it's thinking about, you know, how you plan for it and how you budget for it.
Starting point is 00:16:34 the other side of it is how you reward people for it. And there's this great comic that I've showed a few times. And it's like, there's a little fire. There's a guy and there's a little fire. And one guy puts out the fire and the other guy doesn't. And then the fire gets bigger. And then that guy says, oh, my gosh, things are on fire. And everybody huddles around.
Starting point is 00:16:53 And then he puts it out. And he's the hero because he put out this big fire. And the guy who put it out while it was small really doesn't get any recognition or reward. And so how, you know, if I, and I saw this at companies I've been at, being a call leader or being an incident commander, whatever people call it, that's usually a badge of honor. You get respect because they know when things hit the fan, you're available to come and fix it and make sure that the right things happen. But really, you want to reward the people that are finding and fixing things before they ever occur
Starting point is 00:17:27 so that the system's kind of boring so that things, you know, aren't getting into that state. But as you said, and as we talked about at the beginning, if it's not on fire, it's not top of mind. And so people aren't thinking about it as much. And so it really requires another level of maturity and sophistication to address the problem in a more, you know, thoughtful engineering approach. Yeah. I mean, exactly. Like, it goes back to what we were saying where, you know, when things aren't on fire, you know, nobody cares and nobody's getting promoted or anything like that. And I think it's a classic issue in all engineering of, or any company really is when you are
Starting point is 00:18:08 trying to assess people's impact or, you know, figure out how do you recognize people for promotion or bonuses and things like that. You have to be careful for what you measure because people are going to optimize for the thing that you measure. Even when I was at Google, we used to always joke that, you know, if you want to get promoted, spend a quarter like, implementing something purposely inefficiently, and then the next quarter turn around and actually fix the inefficiencies and be like, look, 1000x improvement in the efficiency. Because you get more recognition for improving the thing that was not working well versus actually implementing it correctly the first time. Yeah. Promotion driven development. You know, how many,
Starting point is 00:18:50 how many projects and things are built that, you know, aren't necessarily the best use of the company's time or efforts, but they get the right visibility, they get the right rewards. And And so they end up being done that way. Yeah. I mean, we're seeing that now. You know, we talked about this whole, you know, paradigm shift that happened when people starting to move the cloud. We're going through this paradigm shift now with essentially AI coding tools.
Starting point is 00:19:13 There's companies that are incentivizing people around token usage, AI usage. Of course, people are going to end up, you know, gamifying that where it's like, well, if I know in my performance review, the amount of AI usage is going to come up as some sort of metric that I'm evaluated on, well, why not just spin up a whole bunch of, you know, agents doing useless things just to drive up my AI usage, essentially? Yeah. Yeah. Well, and I think, you know, it's quite an interesting time, and I think it is a good corollary
Starting point is 00:19:44 because we're seeing a big desire for people. They want to learn it. It's interesting. It's important. It's all you see, like, in the news or in tech, you know, circles. And it's great in many ways. It's also, you know, we're pushing out so much more, so much faster that it's really impacting our rate of change and our defect rate. And so, you know, for me, I look at it and I say it's great, but, you know, we're going to have to think about how to get in front of this one as well and how to do it in a thoughtful way that we're being efficient with our resources, that we're testing the code that's being written.
Starting point is 00:20:22 I think it's great if you can have AI write a bunch of code for you. What we've seen is, you know, 10 times the amount of changes means 10 times the amount of failures that get introduced. And so, you know, how are you going to keep up with that at scale if you're continuing to do things the old way? Yeah, I mean, I think this is at the heart of, you know, one of the major problems that is going to face the tech industry over the next couple of years, which is that even if we just assume that the number of errors per lines of code written by a human, you know, essentially, I don't know, let's say it's a thousand lines of human written code lead to one potential bug or error. If we assume that ratio is the same for AI generated code, the fact that we're generating 100 times as much code means that we will have 100 times as many errors.
It's not necessarily even that this calls into question how good the AI-written code is. If we assume it's on par with human code, we're still going to have essentially more errors at this point. And the other thing is we're compressing essentially this time of sort of thought to generation of the code, but we're not necessarily compressing things like reliability and engineering at the same rate. So we're creating this bigger and bigger gap between our ability to verify that the code actually works and works reliably versus our ability to generate code.
Starting point is 00:21:45 And I think that puts a lot of things, you know, into question and risk around, you know, how should these organizational structures be? how do we need to evolve as an industry in order to address essentially this growing gap? Yeah, yeah. The metrics I've seen, and I'm keeping an eye on this, even as the models have improved, the defect rate has remained fairly constant. And in fact, if you count how much we've got to push fixes as a result, we're actually going down a little bit in quality. We've got velocity in a great spot. But overall, by the time, you know, everything is as it needs to be into the quality we need, we're still spending more cycles and more time to get to that point. So yeah, I think it's, you know, it's back to like,
Starting point is 00:22:30 how are we, how are we prioritizing or incentivizing? You know, in a lot of cases, if you're not telling the AI explicitly, hey, I want you to think about security, hey, I want you to think about reliability. Hey, I want you to go validate. You know, one of the tools that we use internally is, you know, write your test first and make sure you've got a clear set of acceptance criteria. But to me, I mean, that's an opportunity in our space. Hey, you know, what part of the acceptance testing for the system that you're going to deploy is, can it, can it handle these types of chaos experiments? Can you validate that the system can withstand these types of issues that do occur? And some of those are hardware issues.
Starting point is 00:23:11 They're not even software issues. You know, at scale, data centers, hosts, disks, they fail. And in the small numbers, they fail infrequently. but at large numbers, they fail every day, every hour. And so you have to be able to accommodate those. So I think there's an opportunity there to really build a great feedback loop where, you know, one, you're explicit to the AI saying, hey, look, I need you to think about reliability. And then another one is really as part of this like software development life cycle and pipeline,
Starting point is 00:23:44 there should be a step where, you know, not just are you passing the unit test, but are you passing the integration test? and then the system level test, which are really those reliability tests. Yeah. I mean, so with models in model development, people put in, or companies put in,
Starting point is 00:24:02 like ethical guardrails, there's also security guardrails. Do we need some kind of, especially when it comes to coding, some form of reliability guardrails built in. And so, like, what would that look like? Yeah. I mean, funny enough,
Starting point is 00:24:17 that is the exact same phrasing that we've been using internally is reliability garb rails. And yeah, I think as I just described, what it really is is giving an opportunity for that system to go run those tests and see the results and make sure it's passing. And by creating a feedback loop there, where you can go test the code that's been written
Starting point is 00:24:39 to make sure that it can withstand these common failure modes, and feeding that back into the model, back into the agent so that they can take that into account, that's how you're going to be able to get that higher quality software that's more resilient. The other thing, you know, I've been thinking about along this similar conversation is that historically when humans have written code, sometimes over weeks, maybe even longer, we kind of build up this like mental model of how things fail.
Starting point is 00:25:09 And we naturally think about retries, timeouts, what happens when a dependency is down. We also have scar tissue from historic failures of being woken up at three and the more. morning to have to deal with an issue. But, you know, AI doesn't necessarily have the same charge issue. It generates code in maybe an afternoon. We don't necessarily pass along that intuition to the AI or develop the same intuition when AI is generating that code. Do you think that perhaps fundamentally changes the type of the nature of failures we might expect? I mean, I think for all the reasons you just stated, it, it changes, yeah, what we need. to do. We need to be thinking about those failure modes. You know, when you're launching a
Starting point is 00:25:57 tier one service at one of these large companies, you're typically doing a failure mode analysis. You're thinking through the potential failure modes. I love your example of the scar tissue. You know, yeah, I've been woken up many a times in the middle of the night. I've solved incidents off the side of I-5 on my motorcycle in the rain. You know, those are things that really embedded in me this deep understanding and need for reliability. And so whenever I build software, it's always at the forefront of my mind, how are we going to teach that to the models
Starting point is 00:26:28 in a way that they care about it in the appropriate way? How does it fit into the context window, when we're trying to do these large tasks where we're generating a lot of code? But I think as well, to your point, are we going to see new failure modes? Probably not at the infrastructure level. I think, you know, in computers,
Starting point is 00:26:45 it's the same 10 things that tend to fail. CPU, disk, memory, network, dependencies, etc. But when it comes to logical bugs and logical failures, I'm sure there's all sorts of new assumptions or new shortcuts that are made. It's pattern matching under the hood in many ways. And so, oh, this may work in this business context, but it absolutely doesn't work in this business context. I think that's the other one.
Starting point is 00:27:10 One of the things being a software engineer is you have to learn your domain. Not only do you have to be good at the discipline of software engineer, You have to understand your company, your business, and its business logic to make sure that you're doing the right thing for your customers or your business in those circumstances. And so I imagine it would be great if AI as a whole could just encompass that. But really, it has to have some cues and understand which domains it's operating in and how the business operates in order to make some of the right tradeoff decisions. How can you develop those cues? What would be your guidance for companies to try to solve this problem? Yeah, well, I think in today's world, you know, making sure you're just being explicit about it.
Starting point is 00:27:54 You know, there's great jokes going around about, you know, hey, if you can just tell AI exactly what you wanted to do, it'll build exactly the right software. And that's kind of like, oh, if you've got a project manager or a product manager that just knows all the right tradeoffs and decisions to make up front, well, then, yeah, you're going to get perfect code. But oftentimes we're not able to go to that level of specificity. And there's a bit of an engineering hubris here that like, if you really want to get to that specificity, we call that code. That's how we get to that level of specificity to say exactly what we want the machine to do. But I think being explicit in the way in which you're engineering things, you know, if you have agents that are riding parts of code, you have another agent or a backup process that goes through and asks about reliability questions that runs the reliability test, that puts those guardrails in place to make sure that those things are being thought about and accounted.
Starting point is 00:28:46 for. I imagine similarly, we're going to see some of those develop in different domains as well. Hey, I'm in e-commerce or I'm in finance or I'm in, you know, manufacturing. Those are going to have a different set of business constraints that are going to need that context to be fed into the AI and then something at the tail end to validate to make sure that we're meeting that criteria. So, yeah, for me, it's, you know, at this point, it feels like just spelling out as much as you can, spending more time on the spec, more time on exactly how you want it to behave on what should go well, what should not go well, and defining that in a way that you can go validate that, as opposed to, you know, we talk about or joke about vibe coding.
Starting point is 00:29:29 And in my mind, you know, I have this mental image of like a real short prompt. Hey, go write me Amazon.com, you know, and it spits out some things and maybe it gets a bunch right. But that's really, you think about the number of people that have gone into building Amazon.com over the years and all the lessons learned and all of the criterion requirements that have come out of it. And if you wanted to build that spec, that spec alone is probably, you know, the size of a book. Yeah, absolutely. I mean, it'd be a lot. I'm not sure that the models would be able to handle that quite yet. But it would be interesting.
Starting point is 00:30:03 See, yeah. One of the things you mentioned there was around potentially. AI like running or generating your reliability experiments or your tests. You know, we're in this place where AI is writing code. It's often writing the unit tests. We might be writing the reliability tests. Sometimes we're using AI to review the PRs as well because we can't keep up with the volume of PRs that being generated by AI.
Starting point is 00:30:30 So we kind of have AI everywhere within this checking its own homework, generating the homework, generating the tests around the homework. But I guess, like, where does the human judgment stay essential in this entire process? Yeah, there's a bit of a joke there about who's watching the watchers, right? If you're having AI watch itself and, you know, is it really going to get you good results or sometimes it's going to make the wrong decision? It's funny because we were discussing that internally recently. And, you know, one of the big questions right now, if you have a SaaS is, can somebody just vibe code your SaaS?
Starting point is 00:31:06 and will they do it? And one of the things that gave me a little bit of peace of mind and comfort is, do you really want to ask AI to write all your failure tests? Because if you don't define it well and you say, hey, what happens when this database goes away and AI decides to just delete your database? Well, that's not the test you want to run. You want to run that test in a way that's safe, that you can revert it, that you've got confidence in it.
Starting point is 00:31:31 So I think that's where we're going to have an opportunity to really think about how do we go test the things that we want, but do it in a way that is almost like a third-party validation? And I think that's where, yeah, I don't know. Again, as an engineer, my opinion is, you know, we've learned a lot. We are intelligent beings. We want to make sure that the right things are happening.
Starting point is 00:31:55 So we still play a very important role in that feedback loop and in supervising to make sure that things are being done in a thoughtful way. And I think that's really what. what divides kind of the vibe coding, slop coding jokes from true engineering is it's not outsource your thinking and just allow the AI to do whatever you want, whatever it wants. It's about really giving it specific tasks and instructions to automate pieces that are of less interest or of consequence, but still being in a position where you understand what's being built. You're validating that it's meeting your requirements.
Starting point is 00:32:31 and you're keeping an eye on the quality and the overall output to ensure that it's meeting what the business needs or what the industry needs. Do you think there's a like a spectrum in terms of where you need to be more thoughtful about that in any particular software company? For example, you know, if I'm at the like infrastructure layer and I have my customers essentially depend on whatever product, like your database or, you know, something like this. and I screw up some part of the security. I mess up some part of how we handle a network. That's going to be pretty catastrophic to everybody that depends on us. Whereas if I mess up, maybe something in the UI, you know, I could push out of change. It's less catastrophic in terms of like taking down everybody's database.
Starting point is 00:33:19 Do people need to be thinking about, you know, what are the layers of the onion within my company where we might be able to charge ahead and go really fast with AI versus, kind of slow down and be really thoughtful? Yeah, I think that's, and that's probably a good dichotomy you described there. You know, we already have these paradigms within our systems. We have critical systems and we have less critical systems or we have tiering around it. You know, we have pieces that, you know, absolutely must stay up or must work for the system to operate. And we have pieces that maybe we get to handle a few bugs without it being catastrophic.
Starting point is 00:33:51 So, yeah, I think, you know, this comes up all the time as well when it comes to production, You know, I love the idea of AI SREs and AI ops. But when you, you know, when you've done this for many years, you have to drop into a situation, have enough context to understand what's happening, but also have an idea of what actions you can take that are reasonable actions. And you don't want to be in a position where an AI is deciding it's going to take a very unreasonable action that it thinks it's saying to recover things and actually make things much. worse. Yeah. Do you think we're potentially heading toward a world where either proactive reliability or some other part of this has to be like a regulatory expectation?
Starting point is 00:34:39 Like we, you know, we have certain standards in the industry around security. Now, you can, from a compliance standpoint, you can argue how valuable those are. But, you know, there's still certain expectations around, you know, the treatment of personal information or if you are selling into the enterprise, there's certain expectation of, you know, some level of standards on the security perspective that you're meeting. You're handling health care information. There's, you know, hippoc compliance. There's certain guardrails in place that you have to meet those expectations. Do you think we need something like that as we are, you know, spitting out more and more code to ensure engineering best practices?
Starting point is 00:35:20 Yeah, it's interesting. I mean, does regulation really cause the quality we need? I think, it's a bit of a stop gap in the end. It's to make sure that we're covering the worst cases. And we have a lot of that in place. You know, when it comes to disaster recovery or business continuity or compliance, we've seen more standards arise. But I mean, to the heart of your question, yes. I think we do need to make sure that we are addressing these problems and we have some measure of accountability. I think that's the other thing we've learned over the last, you know, decade doing this as a company is if there's no accountability, then there's really not that incentive to go fix and do a better job. And so you have to provide that somewhere. Now, does that
Starting point is 00:36:05 belong at the governmental, you know, legislative layer? Does it belong within the company? Does it belong within the team? I think that, you know, is a open question that is going to be based upon the industry and where you're at. But I think the analogies to security are very apt. you know, the cost of being wrong in security can be quite extensive. And if you leave the front door unlocked, regardless of if you've done all the other best things you need to do, you know, you can lose the whole, you know, lose everything right there. Similarly, you know, reliability takes defense in depth. You really need to address it at multiple layers and you need to make sure that you're thoughtful about how it's being done.
Starting point is 00:36:46 So I do think incentives and accountability factor into that to make sure whether that's AI, incentives and accountability or engineering incentives or accountability becoming perhaps one and the same, but you need something there to make sure that the right things are getting done. They're not getting forgotten. And really, it's to mitigate some of those worst case scenarios where if not, you know, yeah, we've seen, you know, just as a bit of side commentary, we've seen quite a few number of large outages in the last six to nine months. Yeah.
Starting point is 00:37:17 Every major cloud provider, you know, major services that the internet relies upon. And whether that's due to AI generated code or not, it is due to us moving quickly at times. And I'm all for velocity. I think that's one of the great things going on right now is we can do more and we can explore more and we can experiment more. But there's also, there's an experimentation phase and there's a hardening phase for something that's truly production ready. And I think of it like it's our digital infrastructure. You know, you're building a bridge. It's fine to vibe code the first version of the bridge that you build in your lab and you make sure that you understand how it behaves and you run it through the wind tunnel and you do all these things.
Starting point is 00:37:56 But when you go to ship that into the real world and build it and people's lives are on the line and there's a large amount of property and personnel, you know, caught up in that, you need to make sure that it's really hardened, that it's really going to behave well. And that's where you need some extra checks and balances and you need to make sure people are giving it the appropriate thought that is needed. Yeah, I just recently wrote about this problem where I think even before AI generated code, we've always had this like 90/10 illusion in software where essentially 90% of the work
Starting point is 00:38:34 takes 10% of the time and then the last 10% takes 90% of the time. So you can generate the demo really quickly. It looks great. But then that last 10% is all the hard stuff. of how do you make this enterprise ready, how do you actually make this production ready? And I think what we've done is with AI generate code right now, we've taken that 10% to generate the first 90, basically down to zero,
Starting point is 00:38:58 because we can prompt our way to a demo in an afternoon. And I think, unfortunately, because of all the hype in this area, it's also creating now pressure on teams to move at the speed of the demo. And people get frustrated that, hey, we saw the demo, why does it take another six months for this thing? to hit production. And it's because of all the other things that come along with that, where it's, you know, security, it's networking. It's, uh, uh, the dependencies on the infrastructure. There's a whole bunch of different things that go into making something enterprise grade
Starting point is 00:39:28 and production ready that gets lost in the headlines of, hey, we just fired our entire engineering team and one person did this, you know, did six months of work in, in a day. Yeah, I think that's a great, a great way to talk about it, you know, this idea that, oh, you're 90% done. So, you know, what's, you know, what's, taken so long, just ship it, you know, we're there. Why aren't we finished already? But the last 10% taking, you know, much more of the time and being things that require a bit more rigor, a bit more validation, you know, a bit more thought and critical thinking to make sure they work well. You know, realistically, what we're saying is, well, AI can get us the first 10%,
Starting point is 00:40:07 but the last 90%, you know, still requires a lot more effort and work. And, you know, I think in part, the marketing buzz is a little bit to blame, as you refer to there, that everybody's hearing these stories and they're seeing these demos. And so they're like, why is my team not able to just ship something to production in a single day? And that's the answer. The answer is, yeah, you can get, you know, what feels like 90% of the way there. But realistically, you know, maybe you're half. Maybe you're 10%. Because there's all these other things that need to be done to really make it hardened and production ready and, you know, have that longevity. that we really expect from our software in today's era.
Starting point is 00:40:48 Yeah, absolutely. Yeah, and we haven't sped up a lot of the other parts of the software development lifecycle. So you either cut corners in order to meet the demand, work crazy hours, or you just have to slow down in order to make these things actually verify that they're working. So I mean, to be fair, though, like this is where, you know, one of the places that we see a lot of opportunity. Yeah.
Starting point is 00:41:08 You know, we want to make it easy for agents to come interact with Gremlin so that they can go run these tests in a safe, revertible way, so they can have that feedback loop. So we can build those reliability garg rails. So we can take that first pass that's demo ready, run it through the rigor that we've learned over time, and impart that back into the model so that it can go harden and adjust.
Starting point is 00:41:31 And so I think we'll see that in other places. I think we're seeing that already in security as well, where, you know, there are ways that we can help keep that momentum and speed up that velocity. but it takes, you know, it takes some delegation to expertise. We're not yet to the point where that's just the default expectation you get out of your software. Yeah. So I wanted to actually, I wanted to talk about this new disaster recovery testing product that you launched.
Starting point is 00:41:59 I guess is that related to some of these ideas that we're talking about? You know, what was essentially the gap that you were hearing from your customers that made you invest, you know, your time and resources into building that? Yeah, well, I think, you know, just as we've discussed, reliability can be time intensive and the disaster recovery process, one that's fairly regulated or has compliance needs to it. You know, as we're working with our customers, they're doing this, but it takes thousands of engineering hours to go execute these tests. They have to coordinate across the entire company. They have to make sure everyone's done their preparation. You know, whether they're running it synchronously or asynchronously, there's a lot of time that goes into that. we need everyone to be online this weekend. We're going to fail out of our data center.
Starting point is 00:42:45 And if something goes wrong, we need you there to go, you know, address it or adjust it or we got to roll it back. And if we've, you know, scheduled time for all of our engineers to be online on the weekend, you know, we can't really, can't really just hit the first failure, say, oh, no, you know, it didn't work and come back the next weekend. There's a lot of effort that needs to go into that. So, you know, this for us is less about AI in particular and more just. about how do we streamline engineering work that people are spending a lot of time on.
Starting point is 00:43:16 And so we can put it, you know, how we built this into the product is things that allow people to go do all their pre-working conditions and validate that they're ready to go before we get everybody in a room to run it. It's less about testing each service and how they handle the data center failure when we get those large exercises. It's more about testing the glue and the duct tape in between everything, the network, the routing, the security, to make. make sure that nothing's being missed. As well, there's just a lot of administrivia work that goes into that to make sure that everything's set up, that everything's being coordinated.
Starting point is 00:43:50 And software is pretty good at that. You know, we can help make sure we get all the right pieces in place. We've run all the right preconditions. And then lastly, safety, you know, that ability, if things are going wrong, you know, if you're doing a large data center or cloud region evacuation and things go wrong, especially in production, And ultimately, this is my little aside, people are like, oh, just run in dev and staging. Well, if you think of a pie chart of things and go wrong,
Starting point is 00:44:16 like a third of that pie chart only lives in production. It's the production routing. It's the production groups. It's the DNS. It's how we actually, it's a production load. It's how customers are interacting with the system. And so we have to test in production in the end because that's the system that matters. And that's the system we need to understand and harder.
Starting point is 00:44:35 But if we're testing in production and things go wrong, it usually means the stakes high, especially at that magnitude. And so how do we automatically revert and roll everything back so that we can get back to steady state, get back to our customers doing what they need to do as quickly as possible? That's another place where, you know, software automation and kind of the expertise that we bring to the table can help make that a much more streamlined process, a much more repeatable process. And I think that repeatable bit is pretty important. A lot of these companies, they run this once or twice a year. You know, with the amount of software, changes we're making on a daily basis, the system you tested yesterday is not the system you're
Starting point is 00:45:14 operating today. And so how are you preparing for, you know, real failures if you're working on a model that's three to six months old? Well, realistically, you need to run this on a more regular basis. And maybe the big expensive ones, you only run monthly or at a cadence. But, you know, if it's tens of thousands of engineering hours to run it, you just can't afford to run it more often. So by giving people a streamlined way to run it more frequently and more repeatably and more safely, we can run it more often, which means we're actually getting much better results, much more realistic results close to what production looks like, and we're uncovering risks before we trigger them.
Starting point is 00:45:53 And that's really the key is we want to find the thing that's going to sink us before it happens. And so we need to be able to be doing that on a cadence that really uncovers it and gives us time to adapt and prepare. Yeah, you talked about the cost up front of potentially thousands of engineering hours, you know, under the sort of the prior systems. What is the cost reduction in this more streamlined approach? Yeah, well, I mean, I think what used to take, so it's a bit twofold of a question. I think the first half is amortize out to what the teams can do on their own. And they should be doing that on their own anyway. That's part of just operating a good service.
Starting point is 00:46:32 So they can be running these zone or region or data center failures on their own weekly or monthly so that they know that they're in a good place so that when we get everyone together, we're not having to go rehash and deal with those individual pieces. But then, you know, that ability to run everything together or to automate that entire exercise, now we don't need to have necessarily everybody in place. And if we have those safety mechanisms where we can revert things, then maybe we can just page somebody if something goes wrong. as opposed to having them on the phone live. And I've seen some of these war room calls with hundreds of engineers on them. And you just do the math on that real quick, that's quite an expensive exercise.
Starting point is 00:47:14 So I think that costs, the time and the cost is what makes it prohibitive to do often. So if we can make it so that from a safety perspective, we don't need that level of oversight because we've got guarantees that we can revert things if they go wrong. Now you can really have your SRE team run that entire exercise and engage the folks that are needed after they've reverted the impact as
Starting point is 00:47:37 opposed to having everybody online ready to go in case something goes wrong. Yeah. So if we look ahead a few years from now when AI is basically accelerating everything in terms of engineering and all the surface area around engineering, what does this entire practice of ensuring reliability look like? Yeah, well, I think, you know, and then in the midterm, it's about how do we interact with agents as opposed to how do we interact with people. I think a lot of our software is built for engineers to make engineers more effective and to move more quickly. So that's one of the things we're thinking a lot about is how do we make agents able to interact with our system, learn from our system, have that feedback loop and be able to incorporate into their designs and into their systems.
Starting point is 00:48:30 I think longer run, it's a good question. You know, I've varied between being a skeptic and an optimist quite a few times over the last couple of years. And I think, you know, will we arrive at a point where AI can hold all the context about our complex distributed systems in order to always make the right decision so that we can trust it to take actions without a human overseeing it? Well, at some point, foreseeably yes. But is that tomorrow or next month? I don't think it is because all the things that we've learned, the things we were talking about, the things that we have scar tissue around, you know, this is actually one of the reasons why reliability testing and chaos engineering came about is, you know, the naive approach would
Starting point is 00:49:15 just be unit test everything. And truthfully, when you have a system with hundreds of services and components, the time it takes to test all combinations of failure, the sun would burn out before we finish running test. It's an NP-complete problem. We can't do that. And so we had to find a reasonable approximation, which is let's simulate or recreate real failures and see how it responds to those, the ones that we're most concerned about, because that's going to help us mitigate most of the issues and uncover most of the vulnerabilities. Now, can AI follow our same approach? Yes. But if we try to take a formal, you know, formal methodology and test all possible things, well, unless, unless
Starting point is 00:49:57 unless we do, well, and I don't know, maybe if we see quantum computing get to the point where that's publicly accessible and something that we're able to iterate on, then we'll have a breakthrough where we can truly grok and understand the whole system and be able to take those actions and move through all those tests in a reasonable time frame. So yeah, to me, that's really the bottleneck. How big is your context window? How many variables do you have in flight? How many interconnected pieces do you have within your system that have to operate correctly for everything to work well and how are you going to, you know, basically test all of those in a reasonable way? I think that's the barrier we're looking at today.
Starting point is 00:50:38 Yeah, absolutely. I mean, I think that one of the things you said there is something I've thought a lot about as well recently is like how do you make essentially your systems navigable by agents? You know, we've sort of talked about developer experience in terms of, you know, how does the developer understand docs and APIs and your developer console, what's that onboarding experience? And now the experience that probably really matters for the future of any, you know, infrastructure, dev tools, company is how easy can the agent essentially use those tools?
Starting point is 00:51:10 And if you're not part of that tool chain or it's too difficult, it's probably like a real existential threat to the future of that business because this is how everybody's going to be building. At least that seems to be the case. And then if we take essentially the optimistic view, that eventually AI will reach a point of essentially engineering superintelligence where it's better than any human could possibly be. And it is capable of not only generating the code, but making the right decisions across
Starting point is 00:51:43 really complex distributed systems and testing it reliably and all the parts that go into it, there's still a period where we're going to be in a real growing pain situation where we're leveraging these tools, we have people kind of pumping out code that don't always understand fully what that code does and why it does the thing that it does. And it's going to lead, you know, inevitably to more outages and more, you know, bugs and things like that. And we have to kind of get through this narrow part of the hourglass in order to achieve this utopia of where, you know, AI is always making the right decision. Yeah. Yeah. I think the next couple of years for, for all the greatness and benefits we're going to see, we're going to have some growing pains.
Starting point is 00:52:24 And we're going to have to feel those pains and go solve those pains to find something that's truly sustainable and, you know, has longevity. Yeah. I mean, it's a little bit like if you had taken, I don't know, early days of autonomous vehicles. And then we didn't have all these essentially checks and balances in place where we could. And people just put them on the road. I'm sure there would have been a lot of, you know, really meaningful downsides to that. But it had to be much more purposeful and careful along the way. I know before that, you know, essentially went into a place.
Starting point is 00:52:54 where we have Waymo's driving around San Francisco and pockets of different cities and so forth. That took decades, essentially, of work to get there. Well, anyways, as we come up on time, is there anything else you'd like to share? I think we'd gotten to cover a lot of it. You know, I love that we kind of started with the reliability guardrails because that to me is, is, you know, the next exciting thing that we're building at Grimel. How do we make it easy to go test these things? How do we really provide that structure and those boundaries to me?
Starting point is 00:53:24 make sure that our systems are meeting the needs that we need. And how do we enable AI acceleration in a way that is safe and thoughtful? I mean, a little bit me, I'm still waiting for my three laws of robotics equivalent for AI. You know, we need is not just to be able to move fast, but to be able to move fast in a way that we feel comfortable with, that we have safety guarantees and that we're doing the right thing. And as all things in technology, we have to balance speed of execution and speed of innovation with those concerns. And so I think that's, you know, the opportunity this year, next year, as things continue
Starting point is 00:53:58 to accelerate to find out how we can have arcade can eat it too. Yeah. And I think that's a challenge right now where you're in such an arms race across like, are there the model companies to always be pushing an envelope, how the best model, or all the companies that are leveraging these models to generate, you know, new products. Everybody's essentially competing with each other to move faster and faster and faster. There's a lot of pressure to do that. inevitably when there's pressure to do that certain things get skipped along the way that
Starting point is 00:54:25 under other circumstances we might be a little bit more thoughtful and careful along the way and be thinking about those things so i agree like i think there has to be at some point we're going to be in a world where uh we need some regulation or at least you know some cruelly defined best practices of what the best companies in the world are doing was this yeah yeah the pain is going to drive that just like it did and you know the architectural changes of the 2000s and the 2010s. And hopefully we can, you know, learn and get ahead of some of that before the pain has too much of an impact. You know, better to have growing pains and not, you know, catastrophic pains along the way. Yeah, absolutely. Well, Colton, thank you so much for being here. I really enjoyed this.
Starting point is 00:55:06 Yeah, my pleasure. Thanks for having me, Sean. Great discussion. Yeah, cheers.
