Software Huddle - Software Reliability Agents with Amal Kiran

Episode Date: April 29, 2025

So if you're writing code or keeping systems running, you probably know the drill. Late night pages, chasing down weird bugs, dealing with alert storms. It's tough! It costs money when things break, and honestly, nobody loves that experience. So the big question is, can we actually use something like AI, AI agents in particular, to make reliability less painful, more systematic? That's what we're talking about today. We have on the show with us Amal Kiran, the CEO and Co-founder of Temperstack. They're building tools aimed at automating SRE tasks: think automatically finding monitoring gaps and alerts, helping with root cause analysis, even generating runbooks using AI. So if you wanna hear about applying AI to real world SRE problems and all the tech behind it, we think you're gonna enjoy this.

Transcript
Starting point is 00:00:00 We're on a mission to make sure that no on-call engineer ever has to wake up at 2 a.m. and solve a production issue in 2025, when it should be an AI agent jumping in and solving this problem for them. How far away do you think we are from having like a fully autonomous system for this? Technology-wise, I think we're there. These systems can take autonomous action. Obviously, there will be some of these hairy problems. And given that you're in the business of SRE incident management, how do you make sure that you don't have
Starting point is 00:00:36 your incidents on your side? Hi, everyone. It's Sean. Welcome back to Software Huddle. So if you're writing code or keeping systems running, you probably know the drill. Late night pages, chasing down weird bugs, dealing with alert storms.
Starting point is 00:00:50 It's tough, it costs money when things break. And honestly, I don't think anybody loves that experience. So the big question is, can we actually use something like AI, AI agents in particular, to make reliability less painful, more systematic? And that's what we're talking about today. So this week on the show, I have Amal Kiran, the CEO and co-founder of Temperstack.
Starting point is 00:01:12 They're building tools aimed at automating SRE tasks. Think, you know, automatically finding monitoring gaps, alerts, helping with root cause analysis, even generating runbooks using AI. So if you want to hear about applying AI to real world SRE problems and all the tech behind it, I think you're going to enjoy this. So here's my conversation with Amal. Amal, welcome to Software Huddle. Hi, Sean. Good morning.
Starting point is 00:01:37 Yeah, thanks for being here. So I want to start off with you. Who are you? What do you do? Well, I'm Amal. I'm the co-founder and CEO of Temperstack. Temperstack is an AI SRE agent. We're on a mission to make sure that no on-call engineer ever has to wake up at 2 a.m. and solve a production issue in 2025, when it should be an AI agent
Starting point is 00:02:01 jumping in and solving this problem for them. We just demoed that full self-drive version of Temperstack, where it goes from alert to issue resolution completely autonomously, last week. So yeah, that's what we're most excited about. Awesome. Well, yeah, doing the Lord's work there, making people not have to get up at 2 a.m. So I myself have never worked as an SRE,
Starting point is 00:02:25 but I was basically on call for like seven years when I was a founder. So I get some of the pain. And from an outsider's perspective, it feels like what we were doing for SREs like a decade ago is essentially sort of what we are doing today. So why is it the case that, and maybe I'm wrong, correct me if I'm wrong,
Starting point is 00:02:44 things haven't changed? We still have people having a beeper go off at two in the morning and having to get up and fix things. Right. Yeah. I think things have been moving forward in some sense, right? And software reliability has gone through an evolution of its own. We started off actually, it's kind of funny, because we are probably right now exactly where we started, even though things have moved forward in the last 10 years. But we started from a point where it was just assumed that things will fail. When they do, somebody will jump in and solve the problem. But I think over the last 10 years, a lot of investment happened into observability tooling.
Starting point is 00:03:30 How do I get my metrics, logs, events, traces, all of them into one database where I can access it whenever I need it? But I think the problem with that is it still was reactive, in the sense that when an issue happens, it's typically the customer who finds out, because they are using the software more than you or I or the engineer is actually using it. So they're typically the first ones to find out, and they complain to customer service, and then customer service complains to engineering. That's when they actually open the observability tools to figure out, oh, what's going on. So it became more like a post-mortem tool or sort of like a diagnostic tool rather than a prevention tool. And that's what I think the next stage was, can we have
Starting point is 00:04:17 alerts so that the system actually tells us whenever there's a problem? But the process of setting up alerts is so manual and so cumbersome and so time consuming that engineers just don't have the time to set up all of the alerts that they need to let them know before something happens. So even there, there is a bit of a reactive stance that they take that whenever I have an incident, I will go and set up some of the alerts around that blast radius. But
Starting point is 00:04:46 obviously your next incident is not coming from exactly the same thing that failed today. It's coming from the 90% of the stuff where you have absolutely no alerts or absolutely no forewarning. I think those are the things that we're trying to fix with Temperstack: how do we first of all get you to 100% alert coverage, and then, when problems happen, can we have our agents respond? And that's how it's going to be so much different than the last 10 years. Right. You mentioned we have all these observability and incident tools
Starting point is 00:05:17 now. You have your Datadogs, New Relics, PagerDuties, and so forth. And they've done a good job of giving you a place to go when you know that there's a problem. But I think another thing that's happening is we have more data available to us than ever before, but that also creates a problem where that's just a lot of data to sift through. And how do you think about combating that problem of, we have all this data at our fingertips, but if I don't even know where to start, or it's an overwhelming amount of data,
Starting point is 00:05:49 can I kind of get lost in that? Yeah, so absolutely right. And I think along with the data fatigue that you're talking about, the other problem that the industry has is alert fatigue. And what we realized is both of these are actually a problem that has its roots in the reactive stance that people have. When I am in that reactive stance, I'm always sort of worried that, hey, do I have all of
Starting point is 00:06:16 the data that I need when I need it? And so I keep instrumenting more logs, more metrics, more traces, and so on. And again, when I'm in this reactive stance, when I have that incident, now I'm scared, and I go and set up alerts not just on the leading indicators of problems, but on all 20 metrics that I have access to. I think that's what results in this ballooning observability data cost and ballooning problem of alert fatigue. I think the way we solve for that at Temperstack is to try to take smaller slices of this data when there is a leading indicator that goes off. Only have alerts on leading indicators, and then,
Starting point is 00:07:02 around that, whatever metrics might be affected, even if they are not in alarm state yet, try and get the last five minutes of that data and then process that to see, are there any spiky behaviors? Are there any abnormal behaviors? To get to the root cause, and then eventually the solution of that. So yeah, I think there is a lot of work that's happening right now in terms of how do we reduce those log sizes?
Starting point is 00:07:27 How do we make sure that we only have the metrics that we really need, because costs are blowing up? Right. Obviously, there's a negative impact on people who work in these roles, having to deal with these incidents, kind of scramble, deal with postmortems and things like that. But what is sort of the business impact of these things, beyond just the individual engineer that's responsible for some of this stuff? Oh, yeah, absolutely. I think the business impact is huge, right? One of the articles that I was reading from Oxford Economics puts that value at $400 billion. And this is just from the Global 2000 companies that they surveyed. And I think what a lot of the engineering leadership is realizing is that the cost of downtime was only attributed in the past to the direct loss of revenue. I was down for
Starting point is 00:08:21 an hour; if my software had been up, what direct transactions would have been enabled, or what is the indirect result of this downtime on customer perception of our software, and hence churn, and so on. However we calculate it, it was thought that it is just this direct loss of revenue. But I think what's becoming clearer and clearer is that there are so many other costs that you typically don't attribute to that downtime itself. You end up paying regulatory fines in a lot of countries. If you're serving another business and you've breached your SLAs, you end up paying SLA penalties. And I think the most important is that typically the most senior resources
Starting point is 00:09:06 in the engineering team tend to be on call. And the disruption that that causes for developer productivity, not just during that one or two hours of downtime, but as a fallout, the amount of time that that senior engineer now needs to spend to figure out what exactly happened and what the long-term fix for this issue is, et cetera, is just so much developer productivity lost. So that's what makes it, I think, an economically huge problem that enterprises really want to solve. Yeah. And I think there's also a challenge.
Starting point is 00:09:43 You touched on this, where a lot of times, especially if it's a major incident, you fly in these superheroes of the organization. There's a handful of people who end up having a bus factor of one on a potential incident, a person who can't go on vacation because of the risk of some sort of outage or something like that. No, absolutely. I think this came up recently at the SRE Day conference that I was attending, and there was one talk about just the culture of SRE and engineering. Oftentimes, we celebrate these superheroes that you're talking about, these folks who really know the architecture really well and can jump in and solve any problem because they know exactly what's going on. But I think from an organization's perspective, every time you're having to call in that superhero, it also means that something in your process is not working.
Starting point is 00:10:50 What happens if tomorrow this person leaves? That's a question that's constantly looming. So in the moment, it seems like, oh, wow, this person is a hero, and they came and saved the day. But if I were running that org, I would be very concerned that, hey, I need to have processes that we depend on and not people that we depend on for this. And one part of that challenge also, I think, is how do we then get these superheroes to externalize their knowledge so that it's available for other people to use?
Starting point is 00:11:33 And that, I think, is the challenge for any tool in this segment: is it constantly encouraging people to share whatever their tribal learning is with the tool, so that the tool is then able to resurface that the next time that knowledge is required. Yeah. So getting into Temperstack, just from a user's perspective, what does this look like? If I'm using this, how do I use it? What is the setup, and what is it kind of taking over for me? Sure, so like I said, in the last 10 years, there's a lot of investment
Starting point is 00:12:12 that has happened in observability tooling. So typically for any software, you at least have two monitoring tools, one on the infrastructure side. So if you're on AWS, think of something like CloudWatch. And then on the application performance side, you would be using something like a New Relic, Datadog, Splunk, something like that.
Starting point is 00:12:30 And then you have logging tools, like your Loki and Coralogix and things like that. So Temperstack's job actually starts downstream from that. So we connect with the existing monitoring stack of the organization and do everything downstream that an SRE engineer today would. And that breaks down into a few things. The first, of course, like we were speaking about earlier, is, can you get me to that 100% alert
Starting point is 00:12:58 coverage so that I never have a customer-reported incident? My engineering team will always be the first one to know. So that's what Temperstack does. It audits your current monitoring tools, figures out what alerts you have, what alerts are recommended by your AWS Well-Architected, Azure Well-Architected frameworks and all of those,
Starting point is 00:13:18 and helps you get to that 100% alert coverage. And then when incidents happen, you typically have 20 alarms going off, and the engineer is trying to find the needle in the haystack, which is the root cause that caused all of this cascading effect. So Temperstack gets you there instantaneously. It's able to look at the cluster of alerts,
Starting point is 00:13:36 get to root cause. And then once you have the root cause, it's about, OK, what actions do I need to take to fix this problem? So we create that runbook in real time. And then with a daemon that actually sits inside the customer's infrastructure, we are able to execute that runbook as well and take those corrective actions.
Starting point is 00:13:56 But of course, you can decide how much of this you want control over and how much of this you want to let run on its own. So from 100% control to 100% autonomous, it's like a sliding scale that you can move along. Okay. And you started off by talking about how you're basically getting the data from all these different sources, you're connecting to these systems. How does that connection work? Is this like an agent that's installed to listen on something, or, you know, is there some sort of other type of integration
Starting point is 00:14:31 that's happening? Yeah, so most of the monitoring tools have standard APIs that they expose. And right now it's a simple 10-minute integration that you do with those standard APIs, right? So in most cases, it's just creating an IAM user and then plugging that key into Temperstack. That's all it takes, 10 minutes for each integration.
Starting point is 00:14:53 Once we have integrated is where that agent comes in, which, through these APIs, is looking at what alerts are there, coming up with a recommendation of what alerts should be there, and so on. But the integration itself is 10 minutes. Yeah.
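As a rough illustration of the kind of read-only API integration Amal describes (a hypothetical sketch, not Temperstack's connector; it assumes CloudWatch via boto3 and the access key of the IAM user he mentions), the first step is simply enumerating the alarms that already exist so they can be audited downstream:

```python
import boto3

def list_existing_alarms(access_key_id: str, secret_access_key: str, region: str) -> list:
    """Enumerate the CloudWatch alarms an account already has, as input for a coverage audit."""
    cloudwatch = boto3.client(
        "cloudwatch",
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name=region,
    )
    alarms = []
    # describe_alarms is paginated, so walk every page of metric alarms.
    for page in cloudwatch.get_paginator("describe_alarms").paginate():
        for alarm in page["MetricAlarms"]:
            alarms.append(
                {
                    "name": alarm["AlarmName"],
                    "metric": alarm.get("MetricName"),
                    "namespace": alarm.get("Namespace"),
                    "threshold": alarm.get("Threshold"),
                }
            )
    return alarms
```

The same pattern would apply to the other monitoring tools he mentions: each exposes an API for listing its existing alert rules, and that inventory is the input to the coverage audit discussed next.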
Starting point is 00:15:09 So how does that audit work? How do you determine that you're missing an alert and are able to make a recommendation? Right. Again, so actually, when you look at the documentation of some of these monitoring tools, all of this is actually well-documented. So, for example, if you have an EC2 server on your AWS, CloudWatch will let you monitor about 40 different metrics for EC2. But obviously, the rest of the
Starting point is 00:15:39 metrics are not as important. There are some metrics that are more important than the others, which will always be leading indicators of a problem, like your CPU utilization, your memory utilization, your network in and out that's happening, et cetera. So what we're looking at is basically our agent is one, ingesting all of this documentation from AWS and figuring out what are their recommended alerts
Starting point is 00:16:01 for a resource type, which is an EC2. Then also, modifying and customizing that based on, okay, is it a T2 small or a T3 large? What class of resource is this? Those are some of the inputs that we use to build this recommendation list. Then, of course, because we have integrations, we're checking if those alerts actually are there on each instance of your EC2 server or not. Then over time, we also have a module that does the optimization. Imagine your CPU utilization was at 30%, and so we set up an alert at 50%. Over the next six months, you have more traffic coming in,
Starting point is 00:16:46 so your normal operating range is actually inching up. If you leave it there, then that's how you end up with noisy alerts, because now every 5% increase is hitting that alert threshold in some sense. That's where the optimization keeps giving you suggestions on, hey, we think that maybe you should change the eval period here, you should change your threshold levels here, et cetera. So that's a continuous audit that's happening. Yeah. In terms of that alert audit, based on the recommendations given,
Starting point is 00:17:21 you're running this EC2 in this particular class of server, how much of that is relying on AI inference to figure out some of that stuff versus a heuristic, rule-based approach? Right, so I think it's a little bit of both. The heuristic, sort of rule-based approach is a good starting point. But when we have to recommend, okay, now over time, what does this alert need to change
Starting point is 00:17:50 to, or what does the threshold need to change to? That's where the AI sort of inference elements kind of come in, right? Because initially, yes, it's a great rule-based thing to say, let's start you off at this point, which seems like a fair point to start. But over time, how it needs to change has a lot more factors going into it. What is this EC2 connected to? Typically, what is the operating range, and things like that. That's where the AI inference comes in.
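A minimal sketch of the two audit steps described above: the coverage gap check and the threshold drift suggestion. The recommended metric names, the 95th-percentile rule, and the headroom number are illustrative assumptions, not Temperstack's actual heuristics:

```python
# Illustration only: generic EC2 metrics that serve as leading indicators.
RECOMMENDED_EC2_METRICS = {"CPUUtilization", "mem_used_percent", "NetworkIn", "NetworkOut"}

def find_alert_gaps(alarmed_metrics: set) -> set:
    """Recommended metrics that currently have no alarm on this instance."""
    return RECOMMENDED_EC2_METRICS - alarmed_metrics

def suggest_threshold(recent_values: list, current_threshold: float, headroom: float = 0.2) -> float:
    """If the normal operating range has crept up toward the alarm line, propose a higher threshold."""
    p95 = sorted(recent_values)[int(0.95 * (len(recent_values) - 1))]
    if p95 > current_threshold * (1 - headroom):
        # Cap the suggestion so we never recommend alerting above 95% utilization.
        return round(min(p95 * (1 + headroom), 95.0), 1)
    return current_threshold

# Example: CPU alarms exist, memory and network alarms are missing.
print(find_alert_gaps({"CPUUtilization"}))
# Example: normal range crept from ~30% to ~45% against a 50% alarm.
print(suggest_threshold([44, 46, 45, 43, 47], current_threshold=50))
```

In practice, as Amal notes, the starting point can be purely rule-based, while deciding how a threshold should evolve takes more context into account.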
Starting point is 00:18:18 In terms of optimization and making those suggestions, is there any risk that the new alert fatigue becomes too many recommendations, essentially? If you're bombarding people with recommendations about optimizing the alerts, adding alerts, then you're creating a new problem potentially. Right. No, not really. So what typically happens is that it's not like every time there's a recommendation, I will go and fix it. I mean, we are competing against this never being fixed. So even if somebody goes back to that dashboard like once a month and you have 10 suggestions, and that's what we typically see
Starting point is 00:19:02 our customers doing is not really go to it every day. But once a month or once a fortnight, you just go back to that dashboard and say, oh, okay, these are the suggestions that we have currently. And so let me accept or deny some of these. And it takes like five minutes to do that. So that typically doesn't really cause the alert fatigue. But yeah, you're right in the sense that there always are these things that you never knew about. And so when you start setting up these alerts, you will start seeing new alerts. But that's not a bad thing, per se, because otherwise, these
Starting point is 00:19:39 are the things that would have caused downtime for you. So in some sense, you're trying to decide whether you want more alerts or you want more downtime. So that's an easy choice to make. But what some of our customers also do is go one microservice at a time and start setting up these alerts, so that you can choose how much of that additional workload
Starting point is 00:20:01 Could you also tune that based on how critical that particular service is? If this is my payment service, I want to make sure that I don't have an incident there. So maybe I bias towards having more information rather than less information. But for something that's maybe less consequential to the business,
Starting point is 00:20:23 I'm kind of OK with potentially missing something. Oh yeah, absolutely. So you can decide. So because Temperstack also subsumes the role of your incident management system, one of the ways in which our customers handle this is to say that, you know what, for my most critical microservices or resources, let me have a phone call with an escalation policy in place, so that if I don't pick up in five minutes, it goes to Sean, and things like that. But for my less critical systems, maybe a Slack message is okay. And for some of the other things, maybe an email is okay. So you can decide what level of intrusion you want in the notification system based on how critical this service is for you.
Starting point is 00:21:10 So that's one way to handle it. And the second way to handle it is in the template: you can go and make changes and say, you know what, this is a very critical system, so I want more of a window for errors, so let's say I set it up at 70% instead of 90%. But for something else where I know this is not too important and I can go and quickly fix it, I'm okay to have a 90% threshold where I have less time to react, right? So that's another way you can sort of handle this.
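To make the criticality-based routing concrete, here is a small, made-up configuration sketch in the spirit of what Amal describes; the service names, tiers, channels, and thresholds are invented for illustration:

```python
# Invented tiers, channels, and thresholds, purely to illustrate per-service criticality settings.
ESCALATION_POLICY = {
    "payments-api":   {"tier": "critical", "notify": ["phone", "slack"], "escalate_after_min": 5,  "cpu_alert_pct": 70},
    "search-service": {"tier": "standard", "notify": ["slack"],          "escalate_after_min": 30, "cpu_alert_pct": 90},
    "batch-reports":  {"tier": "low",      "notify": ["email"],          "escalate_after_min": None, "cpu_alert_pct": 90},
}

def notify_channels(service: str) -> list:
    """Pick notification channels based on how critical the service is (default to email)."""
    return ESCALATION_POLICY.get(service, {"notify": ["email"]})["notify"]

print(notify_channels("payments-api"))  # ['phone', 'slack']
```

The point is that both the notification channel and the alert threshold become per-service settings driven by how critical the service is.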
Starting point is 00:21:42 One of the things that we are excited about and we are working on, which we still don't have, is can we also automatically, using some sort of AI inference, automatically categorize problems as severity one, two, three, which we are still working on. We don't have that yet, but I think that would be very interesting as well.
Starting point is 00:22:03 Yeah, in that situation, are you thinking, I know you don't have anything at the moment, but that feels akin to standard recommendation models, using more of a purpose-built model versus using something like a foundation model to solve that problem. Is that the way that you're thinking about this? Yeah, so again, so I think one of the underlying things for us to do a good job of this categorization
Starting point is 00:22:30 is understanding the topology of the software system itself, and understanding if the problem that's occurring right now, is that a very important central node, or is that somewhere in the corner of the tree, which would determine what's the blast radius if this fails? I think that would be one of the biggest inputs. Today, the way we figure that out is we know which systems are talking to which systems from data from the monitoring tools. Also, we know what service this was linked to. So there is some idea of that.
Starting point is 00:23:08 And the one piece that I think is missing in this input dataset is the real user monitoring and the traces, right? I think traces would just take this to the next level, and we'll get there. We'll start ingesting some of the traces, which allows us to know, okay, how many of the customer flows actually hit this node, right? And if
Starting point is 00:23:31 that is very high, then okay, obviously that's a very important node. If that's very low, then maybe it's not such an important node. So I think that would be the most important input for us to figure out the severity of the problem. Do you take into account human feedback? Taking into account that people presumably are already working as SREs, and have probably been there a while, they likely have a lot of knowledge that is not documented somewhere. Is there a part that allows you to take that into account as an input?
Starting point is 00:24:10 Yeah, 100 percent, and at all levels. When we're setting up alerts, the templates and suggestions that we create, you can always go and modify that and say, hey, I'm less adventurous than you are, or more adventurous than you are, so I want a higher or lower threshold. That's one way in which we accept that input. Every time we generate a runbook, you can either literally delete that runbook and create your own,
Starting point is 00:24:37 or you can say, here's my subjective feedback on whatever the suggestions are, can you please take that into account the next time you're generating this runbook? So you can do that. And from the auto-healing perspective, the way we've built it is there's a rules engine and there is a script library. So you can literally say,
Starting point is 00:25:00 and all engineering teams have these three pesky problems that they know happen every month. And they know these are the rules. When x, y, z conditions are met, this happens. And when that happens, this is exactly what I do. So you can actually write a script and upload that into our script library and say, do exactly this that I'm talking about.
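A hedged sketch of the rules engine plus script library pattern he's describing: each rule pairs a condition over incoming alert data with a script the team has uploaded. The rule contents here are invented examples, not Temperstack's:

```python
# Invented example rules; real teams would upload their own scripts and conditions.
REMEDIATION_RULES = [
    {
        "name": "disk-full-on-log-volume",
        "condition": lambda alert: alert["metric"] == "disk_used_percent" and alert["value"] > 90,
        "script": "scripts/rotate_and_compress_logs.sh",
    },
    {
        "name": "stuck-worker-queue",
        "condition": lambda alert: alert["metric"] == "queue_depth" and alert["value"] > 10_000,
        "script": "scripts/restart_worker_pool.sh",
    },
]

def match_rule(alert: dict):
    """Return the first uploaded script whose condition matches the incoming alert, if any."""
    for rule in REMEDIATION_RULES:
        if rule["condition"](alert):
            return rule["script"]
    return None  # no known fix: fall back to the agent-generated runbook

print(match_rule({"metric": "queue_depth", "value": 25_000}))
```

If no rule matches, the flow falls back to the agent-generated runbook rather than taking a scripted action.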
Starting point is 00:25:21 So at all levels, I think the human is definitely in the loop. And that's what makes, I think, the feedback loop much, much better for us. So you mentioned auto-healing there. Can you explain that a little bit more? Yeah, absolutely. So again, that, I think, is our mission.
Starting point is 00:25:39 No engineer should have to wake up at 3 a.m. to solve a production problem. But obviously, it's easier said than done. I think some of these problems do get hairy. But the idea is that some of the enterprises actually do solve for this, but it's all been written as internal tools for one specific microservice. If I look at what's the common denominator, what's the architecture for it: you need a trigger, and the trigger is typically an alert.
Starting point is 00:26:12 When that happens, you need something to do the root cause analysis and figure out what exactly the problem is. And then, now that I know what the problem is, you need something to come up with a solution, and then take that action itself. So that is exactly what Temperstack enables you to do. The last part, actually taking the action,
Starting point is 00:26:35 we don't want to do that as a SaaS tool from outside of our client's environment. So we have a daemon that sits inside of the client's environment, and that is the one that is executing some of these actions. Again, like I said, we suspect that the ones that will get automated sooner rather than later are the ones where either the engineering team knows that these are the conditions and these are the solutions, and this is exactly what we do. Or it could also be just the diagnostic steps that you're choosing to let Temperstack take action on, even before you get paged and you come on the call.
Starting point is 00:27:13 Where all the information that you need from that particular instance that's failing right now has already been gathered, and that is automated. Then you can look at what are the recommendations for resolution and take those actions yourself. Or you can do it as, just like you're coding on Cursor today with vibe coding, you can do vibe troubleshooting, where you're actually working with Temperstack to say,
Starting point is 00:27:38 okay, let me execute this recommendation of the diagnostic step. And every time it comes up with that result, it actually auto adjusts and rewrites the whole runbook. You can actually step through it and say, okay, let me run the first diagnostic step. Okay, now a bunch of things have changed. Literally go one step at a time and execute one command at a time,
Starting point is 00:28:02 and also do that same thing for the resolution. Right. So, yeah. But what is exciting is that future where engineers don't really need to be on call, and instead of having to wake up at 2 a.m., they can look at an email at 8 a.m. and say, oh, this happened, and okay, let me figure out how to make sure that this never happens again. And that, I think, should be the engineer's job. That's the part that they enjoy doing. And that's what we want them to be doing.
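A deliberately simplified, runnable sketch of the alert to root cause to runbook to action loop described above, with the autonomy dial Amal mentions. All of the analysis is stubbed out; this only shows the shape of the control flow, not Temperstack's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    diagnostics: list = field(default_factory=list)
    fixes: list = field(default_factory=list)

def analyze_root_cause(alert_cluster: list) -> str:
    # Stub: in the real flow this is where correlation, topology, and the LLM come in.
    return f"suspected root cause behind {len(alert_cluster)} correlated alerts"

def generate_runbook(root_cause: str) -> Runbook:
    return Runbook(
        diagnostics=[f"collect the last 5 minutes of metrics around: {root_cause}"],
        fixes=["restart the affected service"],
    )

def handle_alert(alert_cluster: list, autonomy: str = "suggest") -> str:
    root_cause = analyze_root_cause(alert_cluster)
    runbook = generate_runbook(root_cause)
    if autonomy == "full":
        return f"daemon executed: {runbook.fixes}"  # full self-drive
    if autonomy == "diagnose":
        return f"daemon pre-ran: {runbook.diagnostics}; paging on-call with the results"
    return f"paging on-call with proposed runbook: {runbook.diagnostics + runbook.fixes}"

print(handle_alert([{"metric": "CPUUtilization"}, {"metric": "5xxErrorRate"}], autonomy="diagnose"))
```

In the product, the last step runs through the daemon inside the customer's environment rather than in the SaaS control plane.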
Starting point is 00:28:35 Yeah. So with this, how do you kind of build trust with these teams? We talked about these superheroes previously. I would think that some people who have that status in the company like having that status, and maybe feel a little bit protective about their status as the person that you have to go to and rely on. How do you break down those barriers or those walls that are put up? Right. So the first part of the question is that the superheroes actually are very,
Starting point is 00:29:13 very critical to this whole process because they are the ones who will externalize their knowledge and actually teach the agent to perform better and better over time. Like I said, there are feedback loops where either you can give the agent subjective feedback and who better than those superheroes in your team to actually play around with the agent and get them into some of these situations and train them. I think it doesn't really diminish their role in the company. If anything, it probably increases their role,
Starting point is 00:29:45 just that you don't have to wake up at 3 AM to do it. You can train the agent in peacetime in your work hours instead of having to wake up, which I'm sure nobody enjoys. But the second part, I think about the trust, which is a very important point that you brought up, is I think we've seen that happen with our alerting recommendations as well.
Starting point is 00:30:08 So humans obviously start from a low trust point, right? And then we think that it's the product's responsibility to build that trust over time. And what we saw happening with our alert recommendations as well is that initially people were skeptical, they would go and tweak some things around. But I think over time, they realized that they're not making too many changes. And that's the point at which they were okay to put that
Starting point is 00:30:35 in a full self-drive mode. But there is a setting where you can go and say, I just want Temperstack to start setting up whatever alerts it thinks are right. We see that even on the resolution side, it's going to be a similar journey, where initially it's going to be recommendations, it's going to be people stepping through and working with the agent to solve these problems, until it gets to a point where either they know that this problem happens and this is the exact solution, or they can see that this is the script that's going to get executed.
Starting point is 00:31:07 I think that also helps. Showing that this is what we're going to do lets them know that, okay, I'm still in control and I can pause wherever I want. That's one part of it. And then getting it right over time and having the right recommendations is the second part of it, where if they feel that, hey, you know what, these recommendations seem to make sense,
Starting point is 00:31:30 and those superheroes start feeling that, hey, this is exactly what I would have done. That is a critical moment for Temperstack, where they start feeling, hmm, this makes sense, because this is exactly what I would have done. I think, yeah, it's a process, and we have to gain that trust over time. Yeah. How far away do you think we are from having a fully autonomous system for this? Well, we actually already do, and that's the interesting part of it. The demo that we ran was 100% autonomous.
Starting point is 00:32:06 It started from the alert and went all the way through root cause to identifying what exactly is the action to be taken and taking that action as well. So technology-wise, I think we're there. These systems can take autonomous action. Obviously, there will be some of these hairy problems, and they come more from, I think, let's say the recommendation is kill this process.
Starting point is 00:32:32 But you know that this process has 200 other dependencies, and I can't just kill this process. I have to figure out something else. So those are the kinds of things I think that we will have to figure out. That's where the vibe troubleshooting comes in, where you say, okay, till here is fine, but the next step, let me do this myself. In some sense, I think that technology exists even today, but to leverage that
Starting point is 00:33:02 again goes back to, are our systems architected well enough? And so that is what is, I think, going to take time. That we get it to a point where you can make the most of this agent. And that's going to take time. And it has both technology, culture, all of those angles to it. Right. You're right. So I want to get into the agent architecture a little bit.
Starting point is 00:33:27 If someone's running all of Temperstack, how many agents are we talking about behind the scenes? It's hard to say how many agents, because literally every monitoring system has its own agent built in. But it is a multi-agent setup. There are agents talking to each other. If I look at it from an outcomes perspective, again, there are four outcomes that we want to achieve. One is that alert setup. The second is
Starting point is 00:34:01 the root cause analysis. That is one major outcome. The third is coming up with recommendations and what to be done, like the runbook. And then the fourth is actually taking action. And all of these use different kinds of agents within the system. In terms of the architecture, I think there are three things that the agent needs to work with.
Starting point is 00:34:25 One is the underlying LLM model, which right now for us in production is 3.7 Sonnet. That's what we're using in production, but in parallel, we're also training our own Llama model, and we're beginning to see some improvements in accuracy when we use that fine-tuned model rather than the public model. So that's one part of the architecture. The second is it needs data stores that it has access to. So again, there are both relational data stores, which is where your alerts, alarms,
Starting point is 00:34:58 the resource data, all of that is stored. And then you have your vector databases where more of the knowledge base is stored. So that's where we're sort of ingesting data from your stack overflows, the monitoring tools, documentation, the forums on Reddit, et cetera, where some of these conversations around incidents are happening.
Starting point is 00:35:19 So all of that goes into our vector database and potentially in the future, we could also have the company's own knowledge base sort of brought into this vector database. So those are all the data stores that the agent has to make the decisions that it needs. And then there are tools. So for example, every single integration that we have
Starting point is 00:35:40 with the monitoring tools becomes one tool for the agent to use, which it can use to set up alerts or improve some of the alerts, et cetera. Similarly, there's a runbook generator, because something has to generate that final UI, which is all created on the fly. So those are some of the tools that the agent has access to, to be able to achieve those four outcomes. And that's how I think we think of the architecture largely.
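As a loose sketch of how two of those four outcomes could be wired together as nodes in an agent graph (shown here with LangGraph, which Amal names a little later in the conversation; the node logic is stubbed and this is not Temperstack's code):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class IncidentState(TypedDict):
    alert: dict
    root_cause: str
    runbook: list

def root_cause_node(state: IncidentState) -> dict:
    # Stub: in practice this is where the LLM call plus retrieved context would go.
    metric = state["alert"].get("metric", "unknown metric")
    return {"root_cause": f"suspected root cause behind the {metric} alert"}

def runbook_node(state: IncidentState) -> dict:
    return {"runbook": [f"diagnose: {state['root_cause']}", "apply fix", "verify recovery"]}

graph = StateGraph(IncidentState)
graph.add_node("root_cause", root_cause_node)
graph.add_node("runbook", runbook_node)
graph.set_entry_point("root_cause")
graph.add_edge("root_cause", "runbook")
graph.add_edge("runbook", END)
app = graph.compile()

result = app.invoke({"alert": {"metric": "CPUUtilization", "value": 97}, "root_cause": "", "runbook": []})
print(result["runbook"])
```

Real nodes would call the Bedrock-hosted model with retrieved context from the data stores and the per-integration tools, instead of returning canned strings.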
Starting point is 00:36:09 Yeah. And for each of these agents, or really the combination of them, how do they run? Are these running as microservices on Kubernetes or something like that? What does that setup look like? For us? Yeah.
Starting point is 00:36:23 Internally, I think the access to the LLMs is built on Bedrock. A lot of the agentic workflow is built on LangGraph. Any of the interactions that are happening between these agents is happening on LangGraph. Zilliz is our vector database. I think those would be the tools that we use, but one piece that sits outside of all of this is that daemon that sits in the client's infrastructure. For that, the idea there again is, one, that daemon will take the actions that need to be taken, but it's also looking at, hey, what are the actions that are being taken by engineers
Starting point is 00:37:09 outside of what is recommended? And so that also becomes part of the feedback loop in terms of saying, okay, maybe we missed something here and we need to sort of think about that as well and that gets added to the context and next time it's used. Yeah, I don't know if I answered your question on this one. Well, I was just kind of curious about, I understand you're using LandGraph, but where does this actually run? It's running within your cloud environment,
Starting point is 00:37:40 Well, I was just kind of curious about, I understand you're using LangGraph, but where does this actually run? It's running within your cloud environment, but I guess how do you run that? How do you serve that? Is it serverless? Is it running on Kubernetes? That was the question I was asking. Yes. Actually, right now, I don't think we are on Kubernetes. We are on EC2. Because we don't have too many microservices, it's built as a monolith right now. EC2 works better for us, but as we scale, we'll probably break it down into more microservices and have it run on Kubernetes. Where we use serverless is actually to run those actions. So when the daemon is running actions within the client's infrastructure, that's where it would spin up probably a Lambda function and run whatever script is there on that Lambda function. That's where we use serverless, but within our own SaaS offering,
Starting point is 00:38:31 there's not too much serverless that we use. Okay. And then what happens is, or how do you handle a situation where if you have multiple nodes in something like a LandGraph that's executing some sort of agentic component of this entire workflow, and that part fails or generates an output that is unexpected or something like that, how do you handle that failure scenario?
Starting point is 00:38:57 Right, so Bedrock does have a governance layer on top of it. LangGraph also has that. And then we have some feedback loops built in saying, okay, if this fails, then... I mean, two things happen. One is, in the instance, what is the recovery? If the output right now is not the ideal one I expect, can I do retries? That is the in-product fix where we have those control loops. But also, a lot of these are getting logged to our systems, where some engineer can go take a look at it and look at, for some of these use cases, what was the hallucination score, and what were the learning scores of how much the agent
Starting point is 00:39:47 is actually learning versus how much of it is, in some sense, a repetition of what it has seen before. So we're looking at all of these governance metrics to make sure that we are constantly also feeding information to make it better and better. And I would say those are the two parts to that control loop. One is how do I fix it in product while it's generating and something failed? And the second is, from a long-term perspective, how do we keep making these improvements based on some of these governance metrics
Starting point is 00:40:20 that we are tracking? Mm-hmm. For the hallucination score, are you using a model to evaluate the hallucination level? I'm actually not sure how we're measuring it. I do know that we get a hallucination score from Bedrock itself, if I remember right. I'm actually not sure. I think our CTO would be the better
Starting point is 00:40:46 person to answer that question. But I do know that he sometimes sends me snippets of what those scores look like. And that's where my understanding of this comes from. But how exactly that hallucination score is being generated, I will have to get back to you on that. Yeah. Okay. Yeah. I was just curious. In terms of testing and evaluating, how do you handle that situation where presumably you're going to have to iterate on this? You may change a prompt, you're doing tests right now of the fine-tuned model versus the Sonnet 3.7. How do you know that you're actually moving things in the right direction versus creating something that's actually worse?
Starting point is 00:41:30 parts to this. One is just like how a consumer or customer would test for this, we're looking at all of the suggestions. Are they making sense for us as engineers? And we also have a bunch of people that we share it with and say, is this making sense? That is one level of, I think, which is more like a gut check to see if is this making sense? And we've actually gone through an evolution there as well. When we started, we would just send the alert to the public LLM and see if it's able to generate
Starting point is 00:42:10 what we need from it. And the accuracy was probably like 20, 30%, and it was pretty bad. But with the DAG model, I think we saw that go up to 70, 80% in terms of how often does a human engineer agree with the recommendations, whether it's on alerts or actions to be taken, root cause, et cetera. But with the fine-tuned Llama model right now, we see that that accuracy is growing to 80, 90%.
Starting point is 00:42:42 So I think that's one side of it. The second is also the more context we're able to set, which comes from how much information we have in our vector database and how many different scenarios that vector database has documented. So in some of the cases, we have actually had to hand-code different kinds of scenarios. For example, for the root cause analysis, where it's trying to do a correlation between
Starting point is 00:43:12 alerts, we've had to seed that with about 5,000 different scenarios that we have seen, where we know that, okay, if this is the cluster of alerts, then this is the most likely root cause. And with that seeding, now we see that for things in that vicinity, it's able to extrapolate from that learning data and say, okay, so if I now combine these two use cases, then what happens? Those are things the agent is learning to do on its own. So one is ingesting public data. Two is writing our own use cases and doing that hard-coded learning data. And the third
Starting point is 00:43:57 is, of course, as we work with more and more customers, there's a lot of that feedback that's coming in. So those are, I think, all of the loops that we have to keep making this better and better; that's how the learning's happening. And given that you're in the business of SRE incident management, how do you make sure that you don't have your incidents on your side? So we do use Temperstack for our own monitoring as well.
Starting point is 00:44:28 You know, and I think there are enough redundancies built in on our side. All of this architecture, our CTO has done this for companies as large as Goldman Sachs. He's been the CISO for really high-growth startups like Dunzo, which is like the DoorDash of India. So I think that initial expertise that exists within the team has all been in this. He's been in the DevOps and SRE space for the last 15 years. So in fact, a lot of our customer conversations start with some sort of a consulting on how they should instrument their reliability and observability tools and things like that. I think we, of course, have taken care of some of those things.
Starting point is 00:45:18 We have our ISO coming through, SOC 1, SOC 2 coming in as well. So all of that, of course, is the external validation of the efforts that have already gone into making sure that we have a very robust system working at the heart of all of this. And in terms of building these kinds of GenAI products, these multi-agent systems, are there lessons learned along the way that you can talk a little bit about, for anybody that's listening that's looking to start to build these types of systems? Sure, yeah. So I think, like, I was talking to somebody yesterday at one of the conferences, and this is exactly the question that they had. And what I realized is, for us, it's always been sort of problem-based learning. Obviously, we've had to
Starting point is 00:46:11 learn a bunch about how some of these systems work, and what are the best tools out there to use for ourselves. But I think we were always trying to solve one specific problem, because GenAI is an ocean, right? I don't think there's any way you could learn everything that is out there. But for us, when we started solving the alert problem, we said, OK, what do we need to make sure X happens? And then started looking at where
Starting point is 00:46:40 we can leverage AI to do that. And same with the runbook and the root cause analysis use cases as well. So everything starts with a problem. And then once we know, this is how I would want to solve the problem, we see if AI is the best thing to use for it, what AI is the best to use for it, and what tools should we bring in to solve that problem?
Starting point is 00:47:09 I think one of the problems is that with AI, when you have a hammer, everything looks like a nail. And I see that approach with a lot of engineering teams that are actually starting off with, hey, you know what? I want to build an agent, and then let me figure out what... Yeah, they reverse engineer the problem. Yeah, I spoke on a panel this week about AI readiness, and one of the points I made is, like, start with the business value and, you know, what is the problem you're trying
Starting point is 00:47:32 to solve, then figure out whether it makes sense to use AI or something else, right? I agree, like I think there's a lot of pressure on organizations right now to go to the innovation team, go to their, you know, leaders in engineering and be like, what are we doing about agents? We need agents. And they don't actually know why they need it. They just know that they want it. Right. Yeah. And I think people in that blind chase are discounting how useful workflows are. So I think for us, it's been a combination of workflows and agents that actually unlocks the most power of AI. I think that's how it's been. It's been problem-based discovery of different AI tools that can be useful for solving that particular problem.
Starting point is 00:48:20 Yeah, there's actually a really good GitHub repo that I read through this week called the 12 Factor Agents, which is principles for building reliable LLM applications that kind of relates to what we're talking about. But in that, basically the author of that has tried all the frameworks and kind of talks about the reality of building agents and what are the things that you really want to control. And I think the things that we see in the news or make headlines, make keynotes and stuff like that are these very open world problems of like, we had a world where at one point software was like a single node and then we went to sort of more workflows
Starting point is 00:49:01 where we had like a defined DAG. And I think the promise of agents is that you just have nodes and the edges basically get dynamically figured out by some sort of reasoning agent. But I think what actually is being successful in business is a lot more close to the DAG structure with a little bit of dynamic choices around what tool to use or sometimes which edge to go down. But it's not just completely open world, like, you know, here's access to AWS,
Starting point is 00:49:31 go, you know, fix all my problems. It's a lot more constrained. And you want to remove as much non-determinism as possible, and really only rely on non-determinism when it makes sense. I think that's the key to it. Makes a lot of sense. Yeah, yeah, absolutely. Makes a lot of sense. Awesome. Well, is there anything else you'd like to share?
Starting point is 00:49:54 Not really. I think right now we're super excited about the full self-drive version of that agent, which is how I'm kind of looking at it. We would love for people to try it out on their own, maybe in some test environments that they're okay to break. I would love to see people actually try that out and see how it works. So yeah, I think that's what we're most excited about right now, putting this agent in the hands
Starting point is 00:50:30 of as many people as possible, and have them try to break the agent itself, in some sense, right? And find those use cases where the agent is completely flummoxed and doesn't know what to do. Right? So yeah. Yeah, awesome. Yeah, I mean, they say no
Starting point is 00:50:46 product survives its first encounter with real users. So yeah, well, thanks so much for being here. Really enjoyed it. I love the vision of what you guys are building. Thank you so much, Sean. Thanks for having me. And it's been a great conversation. Thanks so much. I will look up the article that you just mentioned on GitHub. Yeah, cheers.
