Software Huddle - Software Reliability Agents with Amal Kiran
Episode Date: April 29, 2025
So if you're writing code or keeping systems running, you probably know the drill. Late-night pages, chasing down weird bugs, dealing with alert storms. It's tough! It costs money when things break, and honestly, nobody loves that experience. So the big question is, can we actually use something like AI, AI agents in particular, to make reliability less painful, more systematic? That's what we're talking about today. We have on the show with us Amal Kiran, the CEO and co-founder of Temperstack. They're building tools aimed at automating SRE tasks: think automatically finding monitoring gaps, setting up alerts, helping with root cause analysis, even generating runbooks using AI. So if you want to hear about applying AI to real-world SRE problems and all the tech behind it, we think you're going to enjoy this.
Transcript
We're on a mission to make sure that no on-call engineer ever has to wake up at 2 a.m.
and solve a production issue in 2025, when it should be an AI agent jumping in and solving this problem for them.
How far away do you think we are from having like a fully autonomous system for this?
Technology-wise, I think we're there. These systems can take autonomous action.
Obviously, there will be some of these hairy problems.
And given that you're in the business of SRE incident
management, how do you make sure that you don't have
your incidents on your side?
Hi, everyone.
It's Sean.
Welcome back to Software Huddle.
So if you're writing code or keeping systems running,
you probably know the drill.
Late night pages, chasing down weird bugs,
dealing with alert storms.
It's tough, it costs money when things break.
And honestly, I don't think anybody loves that experience.
So the big question is,
can we actually use something like AI,
AI agents in particular,
to make reliability less painful, more systematic.
And that's what we're talking about today.
So this week on the show, I have Amal Kiran, the CEO and co-founder of Temperstack.
They're building tools aimed at automating SRE tasks.
Think, you know, automatically finding monitoring gaps, setting up alerts,
helping with root cause analysis, even generating run books using AI.
So if you want to hear about applying AI to real world SRE problems
and all the tech behind it, I think you're going to enjoy this.
So here's my conversation with Amal.
Amal, welcome to Software Huddle.
Hi, Sean. Good morning.
Yeah, thanks for being here.
So I want to start off with you.
Who are you? What do you do?
Well, I'm Amal. I'm the co-founder and CEO of Temperstack.
Temperstack is an AI SRE agent.
We're on a mission to make sure that no on-call engineer ever has to
wake up at 2 a.m. and solve a production issue in 2025,
when it should be an AI agent
jumping in and solving this problem for them.
We just demoed the full self-drive version of Temperstack
where it goes from alert to issue resolution
completely autonomously last week.
So yeah, that's what we're most excited about.
Awesome. Well, yeah, doing the Lord's work there,
making people not have to get up at 2 a.m.
So I myself have never worked as an SRE,
but I was basically on call for like seven years
when I was a founder.
So I get some of the pain.
And from an outsider's perspective,
it feels like what we were doing for SREs like a decade ago
is essentially sort of what we are doing today.
So why is it the case that,
and maybe I'm wrong, correct me if I'm wrong,
but why is it the case that things haven't changed? We still have people having a beeper
go off at two in the morning and having to get up and fix things.
Right. Yeah. I think things have been moving forward in some sense, right? And software
reliability has gone through an evolution of its own. It's actually kind of funny, because we are probably right now
exactly where we started, even though things have moved forward in the last 10 years.
But we started from a point where it was just assumed that things will fail.
When it does, somebody will jump in and solve the problem.
But I think over the last 10 years, a lot of investment happened into observability tooling.
How do I get my metrics, logs, events, traces, all of them into one database where I can access it
whenever I need it? But I think the problem with that is it still was reactive, in the sense that when an issue happens,
it's typically the customer who finds out, because they are using the software more than
you or I or the engineer is actually using it.
So they're typically the first ones to find out, and they complain to customer service, and
then customer service complains to engineering.
That's when engineers actually open the observability tools to figure out, oh, what's going on. So it became more like a post-mortem tool or sort of like a diagnostic
tool rather than a prevention tool. And that's what I think the next stage was, can we have
alerts so that the system actually tells us whenever there's a problem? But the process
of setting up alerts is so manual and so cumbersome and so time consuming
that engineers just don't have the time
to set up all of the alerts that they
need to let them know before something happens.
So even there, there is a bit of a reactive stance
that they take that whenever I have an incident,
I will go and set up some of the alerts around that blast radius. But
obviously your next incident is not coming from exactly the same thing that failed today.
It's coming from the 90% of the stuff where you have absolutely no alerts or absolutely
no forewarning. I think those are the things that we're trying to fix with Temperstack:
how we first of all get you to 100% alert coverage. And then when problems happen, can we have our agents respond?
And that's how it's going to be so much different than the last 10
years.
Right.
You mentioned we have all these observability incident tools
now.
You have your Datadog, New Relic, PagerDuty, and so forth.
And they've done a good job of giving you a place to go when
you know that there's a problem. But I think another thing that's happening is we have more data
available to us than ever before, but that also creates a problem: that's
a lot of data to sift through. And how do you think about combating that problem of,
we have all this data at our fingertips, but I don't even know where to start,
or it's such an overwhelming amount of data
that I kind of get lost in it?
Yeah, so absolutely right.
And I think along with the data fatigue
that you're talking about,
the other problem that the industry has is alert fatigue.
And what we realized is both of these are actually problems that have their roots in the
reactive stance that people have.
When I am in that reactive stance, I'm always sort of worried that, hey, do I have all of
the data that I need when I need it?
And so I keep instrumenting more logs, more metrics, more traces, and so on.
And again, when I'm in this reactive stance, when I have that incident, now I'm scared and I
go and set up not just alerts on the leading indicators of problems, but on the 20 metrics
that I have access to.
I think that's what results in this ballooning observability data cost and ballooning problem of alert fatigue.
I think the way we solve for that at Temperstack is to try to take smaller slices of this data
when there is a leading indicator that goes off. Only have alerts on leading indicators and then
around that, whatever metrics might be affected,
even if they are not in alarm state yet,
try and get the last five minutes of that data
and then process that to see, are there any spiky behaviors?
Are there any abnormal behaviors, to get to the root cause
and then eventually the solution of that?
So yeah, I think there is a lot of work
that's happening right now in terms of how do we reduce those log sizes?
How do we make sure that we only have the metrics that we really need, because costs are blowing up?
Right. Obviously, there's a negative impact on people who work in these roles, having to deal with these incidents, kind of scramble, deal with postmortems and things like that. But what is sort of like the business impact of
these things beyond just like the individual engineer that's responsible for some of this
stuff?
Oh, yeah, absolutely. I think the business impact is huge, right? One of the articles
that I was reading, from Oxford Economics, puts that value at $400 billion. And this is just from the Global 2000 companies
that they surveyed. And I think what a lot of the engineering leadership is realizing is that
the cost of downtime was, in the past, attributed only to the direct loss of revenue: I was down for
an hour, and if my software had been up, these are the transactions that
would have been enabled, plus whatever the indirect result of this downtime is on customer perception of our
software, and hence churn and so on. However you calculate it, it was thought that it is just this
direct loss of revenue. But I think what's becoming clearer and clearer is that there are so many other costs that
you typically don't attribute to that downtime itself, which is you end up paying regulatory
fines in a lot of countries. You end up having SLA penalties if you're serving another business
and you've breached your SLAs. And I think the most important is that
typically the most senior resources
in the engineering team tend to be on call. And the disruption that that causes for developer
productivity, not just during that one or two hours of downtime, but as a fallout,
the amount of time that that senior engineer now needs to spend to figure out what exactly happened and figure out what the long-term fix for this issue is, et cetera,
is just so much developer productivity lost.
So that's what makes it, I think, an economically huge problem that enterprises really want
to solve.
Yeah.
And I think there's also a challenge.
You touched on this, where a lot of times, especially if it's a major incident, you fly in these superheroes of the organization. There's a handful of people who end up having a bus factor of one on a potential
incident; a person can't go on vacation because of the risk of some sort of outage or something like
that. No, absolutely. I think this came up recently at the SRE Day conference that I was attending,
and there was one talk about just the culture of SRE and engineering. Oftentimes, we celebrate
these superheroes that you're calling, these guys who really know the architecture really well and
can jump in and solve any problem because they know exactly what's going on. But every time,
I think from an organization's perspective, every time you're having to call
in that superhero, it also means that something in your process is not working.
What happens if tomorrow this person leaves, right?
That's a question that's constantly looming.
So in the moment, it seems like, oh, wow, this person, the guy or lady is a hero, and they
came and saved the day.
But every time, if I were running that org, I would be very concerned that, hey, I need
to have processes that we depend on and not people that we depend on for this.
And one part of that challenge also, I think, is how do we then sort of get these superheroes
to externalize their knowledge so that it's available for other people to use?
And that, I think, is a challenge for any tool in this segment: is it constantly
encouraging people to share whatever their tribal learning is with the tool, so that the tool is then able to resurface that
the next time that knowledge is required.
Yeah. So getting into Temperstack, just from a user's perspective, what's this look like?
If I'm using this, how do I use it? What is this sort of, what is the setup
and what is it kind of taking over for me?
Sure, so like I said,
so over the last 10 years, a lot of investment
has happened in observability tooling.
So typically for any software,
you at least have two monitoring tools,
one on the infrastructure side.
So if you're on AWS, think of something like CloudWatch.
And then on the application performance side,
you would be using something like New Relic, Datadog,
Splunk, something like that.
And then you have logging tools like your Loki
and Coralogix and things like that.
So Temperstack's job actually starts downstream from that.
So we connect with the existing monitoring stack
of the organization and do everything
downstream that an SRE engineer today would.
And that breaks down into a few things.
The first, of course, like we were speaking earlier is, can you get me to that 100% alert
coverage so that I never have a customer reported incident?
My engineering team will always be the first one to know.
So that's what Temperstack does.
It audits your current monitoring tools,
figures out what alerts you have,
what alerts are recommended by
the AWS Well-Architected and
Azure Well-Architected frameworks and all of those,
and helps you get to that 100% alert coverage.
And then when incidents happen,
you typically have 20 alarms going off.
And the engineer is trying to find the needle in the haystack,
which is the root cause here that caused all
of this cascading effect.
So Temperstack gets you there instantaneously.
It's able to look at the cluster of alerts,
get to root cause.
And then once you have the root cause, it's about, OK,
what actions do I need to take to fix this problem?
So we create that run book in real time.
And then with a daemon that actually sits
inside the customer's infrastructure,
we are able to execute that run book as well
and take those corrective actions.
But of course, like you can decide how much of this
you want control over and how much of this
you want to let it run on its own. So from 100% control to 100%
autonomous, it's like a sliding scale that you can move along.
Okay. And you started off by talking about how you're basically getting the data from all these
different sources, you're connecting to these systems. How does that connection work? Is this like an agent that's installed
to listen on something or, you know,
is there some sort of other type of integration
that's happening?
Yeah, so most of the monitoring tools have standard APIs
that they expose.
And right now it's a simple 10 minute integration
that you do with those standard APIs, right?
So in most cases, it's just creating an IAM user
and then plugging that key into TemporStack.
That's all it takes, 10 minutes for each integration.
Once we have integrated is where that agent comes in,
which is through these APIs,
sort of looking at what alerts are there,
coming up with a recommendation
of what alerts should be there,
and so on.
But the integration itself is 10 minutes.
Yeah.
So how does that audit work?
How do you determine that you're missing an alert
and are able to make a recommendation?
Right.
Again, so actually, when you look
at the documentation of some of these monitoring tools,
all of this is actually well-documented. So, for example, if you have an EC2 server on your AWS,
CloudWatch will let you monitor about 40 different metrics for EC2. But obviously, not all of those
metrics are equally important. There are some metrics that are more important than the others,
which will always be leading indicators
of a problem, like your CPU utilization,
your memory utilization, your network in and out
that's happening, et cetera.
So what we're looking at is basically our agent is one,
ingesting all of this documentation from AWS
and figuring out what are their recommended alerts
for a resource type, which is an EC2. Then also,
modifying and customizing that based on, okay, is it a t2.small or a t3.large? What class of
resource is this? Those are some of the inputs that we use to build this recommendation list.
Then, of course, because we have integrations, we're checking if those alerts actually are there on each instance of your EC2 server or not.
Then over time, we also have a module that does the optimization.
Imagine your CPU utilization was at 30%, and so we set up an alert at 50%.
Over the next six months, you have more traffic coming in,
so your normal operating range is actually inching up. If you leave it there, then that's how you
end up with noisy alerts, because now every 5% increase is hitting that alert threshold in some
sense. That's where the optimization keeps giving you suggestions on,
hey, we think that maybe you should change the eval period here.
You should change your threshold levels here, et cetera.
So that's a continuous audit that's happening.
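To make that coverage-audit idea concrete, here is a minimal sketch of the kind of check being described, using boto3 against CloudWatch and EC2. The recommended-metric list and the report format are illustrative assumptions, not Temperstack's actual rules.

```python
# Hedged sketch: list which EC2 instances are missing alarms on a set of
# leading-indicator metrics. RECOMMENDED_EC2_METRICS is illustrative only.
import boto3

RECOMMENDED_EC2_METRICS = ["CPUUtilization", "NetworkIn", "NetworkOut", "StatusCheckFailed"]

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

# Which (instance, metric) pairs already have a CloudWatch alarm?
covered = set()
for page in cloudwatch.get_paginator("describe_alarms").paginate():
    for alarm in page["MetricAlarms"]:
        if alarm.get("Namespace") != "AWS/EC2":
            continue
        for dim in alarm.get("Dimensions", []):
            if dim["Name"] == "InstanceId":
                covered.add((dim["Value"], alarm["MetricName"]))

# Report the gaps for every instance in the account.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        for metric in RECOMMENDED_EC2_METRICS:
            if (instance["InstanceId"], metric) not in covered:
                print(f"{instance['InstanceId']}: no alarm on {metric}")
```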
Yeah.
In terms of that alert audit, based on the recommendations given that
you're running this EC2 in this particular class of server,
how much of that is relying on AI inference
to figure out some of that stuff versus a heuristic,
rule-based approach?
Right, so I think it's a little bit of both.
The heuristic, sort of rule-based approach
is a good starting point.
But when we have to recommend, okay, now over time, what does this alert need to change
to or what does the threshold need to change to?
That's where the AI sort of inference elements kind of come in, right?
Because initially, yes, it's a great rule-based thing to say, let's start you off at this
point, which seems like a fair point to start.
But over time, how it needs to change has a lot more factors going into it.
What is this EC2 connected to?
Typically, what is the operating range, and things like that.
That's where the AI inference comes in.
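A rough sketch of that two-stage idea: a rule-based starting threshold, plus a data-driven suggestion once the normal operating range is known. The default values, the 95th-percentile heuristic, and the 1.5x headroom factor below are assumptions for illustration, not how Temperstack actually computes its suggestions.

```python
# Illustrative two-stage threshold logic: heuristic start, then a
# data-driven suggestion when normal load creeps toward the threshold.
def initial_threshold(metric_name: str) -> float:
    """Heuristic starting threshold per metric type (assumed defaults)."""
    defaults = {"CPUUtilization": 70.0, "MemoryUtilization": 75.0}
    return defaults.get(metric_name, 80.0)

def suggested_threshold(recent_values: list[float], current_threshold: float) -> float | None:
    """Suggest a higher threshold if normal load has crept up toward the current one."""
    if not recent_values:
        return None
    p95 = sorted(recent_values)[int(0.95 * (len(recent_values) - 1))]
    proposed = min(p95 * 1.5, 95.0)  # keep headroom, but stay below 100%
    return round(proposed, 1) if proposed > current_threshold else None

# Example: normal CPU has drifted up to ~60%, so a 50% alarm is now noisy.
print(initial_threshold("CPUUtilization"))            # 70.0
print(suggested_threshold([55, 58, 60, 62, 59], 50))  # suggests raising the threshold
```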
In terms of optimization and making those suggestions,
is there any risk that the new alert fatigue essentially becomes too
many recommendations? If you're bombarding people with recommendations about
optimizing the alerts, adding alerts, then you're creating a new problem potentially.
Right. No, not really. So what typically happens is that
it's not like every time there's a recommendation, I will go and fix it.
I mean, we are competing against this never being fixed. So even if somebody goes back to that
dashboard like once a month and you have 10 suggestions, that's fine. And that's what we typically see
our customers doing: not really going to it every
day, but once a month or once a fortnight, you just go back to that dashboard and say, oh, okay,
these are the suggestions that we have currently. And so let me accept or deny some of these. And
it takes like five minutes to do that. So that typically doesn't really cause the alert fatigue.
But yeah, you're right in the sense that there always are these things that you never knew about.
And so when you start setting up these alerts,
you will start seeing new alerts.
But that's not a bad thing, per se, because otherwise, these
are the things that would have caused downtime for you.
So in some sense, you're trying to decide
whether you want more alerts or you want more downtime.
So that's an easy choice to make.
But what some of our customers also do
is go one microservice at a time
and start setting up these alerts
so that you can choose how much of that additional workload
you want to handle.
Could you also tune that based on how critical
that particular service is?
If this is my payment service, I want
to make sure that I don't have an incident there.
So maybe I bias towards having more information rather
than less information, but something
that's maybe less consequential to the business.
I'm kind of OK with potentially missing something. Oh yeah, absolutely. So you can decide. So because
Temperstack also subsumes the role of your incident management system, one of the ways in which our
customers handle this is to say that, you know what, for my most critical microservices or resources,
let me have a phone call with like escalation policy
in place so that in five minutes if I don't pick up, it goes to Sean and things like that.
But for my less critical systems, maybe a Slack message is okay. And for some of the other things,
maybe an email is okay. So you can decide what level of intrusion you want in the notification system
based on how critical this service is for you.
So that's one way to handle it.
And the second way to handle it is also in the template,
you can go and make changes and say, you know what,
this is a very critical system, so I want more of a window for errors.
So, let's say I set it up at 70% instead of 90%. But for something else where I know, you
know, this is not too important, I can go and quickly fix it,
I'm okay to have a 90% threshold where I have less time to react, right?
So that's another way you can sort of handle this.
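The escalation idea above could be expressed as a simple routing table keyed by service criticality; the channel names, targets, wait times, and tiers below are hypothetical, just to show the shape of such a policy.

```python
# Hypothetical escalation policies: more critical services get more
# intrusive notification channels. All values are invented for illustration.
ESCALATION_POLICIES = {
    "critical": [  # e.g. the payments service
        {"channel": "phone_call", "target": "primary_on_call", "wait_minutes": 0},
        {"channel": "phone_call", "target": "secondary_on_call", "wait_minutes": 5},
    ],
    "standard": [
        {"channel": "slack", "target": "#service-alerts", "wait_minutes": 0},
        {"channel": "phone_call", "target": "primary_on_call", "wait_minutes": 15},
    ],
    "low": [
        {"channel": "email", "target": "team-list", "wait_minutes": 0},
    ],
}

def notify(criticality: str, alert_title: str) -> None:
    """Walk the escalation steps for a service of the given criticality."""
    for step in ESCALATION_POLICIES[criticality]:
        # A real implementation would dispatch to the channel integration here.
        print(f"[{criticality}] {step['channel']} -> {step['target']} "
              f"after {step['wait_minutes']} min: {alert_title}")

notify("critical", "payment-service error rate above threshold")
```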
One of the things that we are excited about
and we are working on, which we still don't have,
is can we also,
using some sort of AI inference,
automatically categorize problems as severity one, two, three.
We don't have that yet,
but I think that would be very interesting as well.
Yeah, in that situation, are you thinking,
I know you don't have anything at the moment,
but that feels akin to standard recommendation models,
using more of a purpose-built model
versus using something like a foundation model
to solve that problem.
Is that the way that you're thinking about this?
Yeah, so again, so I think one of the underlying things for us to do a good job of this categorization
is understanding the topology of the software system itself and understanding if the problem
that's occurring right now, is that a very important central node or is that somewhere in the corner of a tree,
which would determine what's the blast radius if this fails?
I think that would be one of the biggest inputs.
Today, the way we figured that out is we know which systems are talking
to which systems from data from the monitoring tools.
Also, we know what service this was linked to. So there is some idea of that.
And the one piece that I think is missing in this sort of
input dataset is the real user monitoring
and the traces, right?
I think traces would just take this to the next level,
and we'll get there.
We'll start ingesting some of the traces,
which allows
us to know, okay, how many of the customer flows actually hit this node, right? And
if that is very high, then okay, obviously, that's a very important node. If that's very
low, then maybe it's not such an important node. So I think that would be the most important
input for us to figure out the severity of the problem.
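One way to picture the topology input being described: walk the dependency graph downstream from the failing node, then weight the result by how much user traffic the impacted services carry. This is a toy sketch with an invented graph, traffic shares, and severity cutoffs, not Temperstack's actual severity logic.

```python
# Toy blast-radius sketch: everything downstream of the failing service,
# weighted by traffic share. Data and thresholds are invented.
from collections import deque

def blast_radius(dependents: dict[str, list[str]], failing: str) -> set[str]:
    """dependents maps a service to the services that depend on it."""
    seen, queue = set(), deque([failing])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def severity(dependents, traffic_share, failing: str) -> str:
    impacted = blast_radius(dependents, failing) | {failing}
    affected = sum(traffic_share.get(s, 0.0) for s in impacted)
    if affected > 0.5:
        return "SEV1"
    return "SEV2" if affected > 0.1 else "SEV3"

dependents = {"postgres": ["orders", "payments"], "payments": ["checkout"]}
traffic = {"checkout": 0.4, "payments": 0.3, "orders": 0.2, "postgres": 0.0}
print(severity(dependents, traffic, "postgres"))  # SEV1: most user flows are hit
```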
Do you take into account human feedback? Presumably people
are already working as SREs, maybe they've been there for a while, and they probably have a lot of knowledge
that is not documented anywhere.
Is there a part that allows you to take that
into account as an input?
Yeah, 100 percent, and at all levels.
When we're setting up alerts,
the templates and suggestions that we create,
you can always go and modify that and say,
hey, I'm less adventurous than you are,
or more adventurous than you are,
so I want a higher or lower threshold. That's one way in which we accept that input. Every time we
generate a run book, you can either literally delete that run book and create your own,
or you can say, here's my subjective feedback on whatever the suggestions are,
can you please take that into account
the next time you're generating this runbook?
So you can do that.
And from the auto-healing perspective,
the way we've built it is there's a rules engine
and there is a script library.
So you can literally say,
and all engineering teams have these three pesky problems
that they know happen every month.
And they know these are the rules.
When x, y, z conditions are met, this happens.
And when that happens, this is exactly what I do.
So you can actually write a script
and upload that into our script library
and say, do exactly this that I'm talking about.
So at all levels, I think the human is definitely
in the loop.
And that's what makes, I think, the feedback loop much, much
better for us.
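The rules engine plus script library could look roughly like the sketch below: a list of user-defined conditions, each mapped to a script the team has uploaded. The rule format and script names here are made up for illustration.

```python
# Minimal rules-engine sketch: known conditions mapped to uploaded scripts.
RULES = [
    {
        "name": "disk_pressure_on_log_volume",
        "condition": lambda a: a["metric"] == "DiskUtilization" and a["value"] > 90,
        "script": "rotate_and_compress_logs.sh",
    },
    {
        "name": "stuck_worker_queue",
        "condition": lambda a: a["metric"] == "QueueDepth" and a["value"] > 10_000,
        "script": "restart_consumer_pool.sh",
    },
]

def match_rule(alert: dict):
    """Return the first user-defined rule whose condition matches the alert."""
    return next((rule for rule in RULES if rule["condition"](alert)), None)

rule = match_rule({"metric": "QueueDepth", "value": 25_000})
if rule:
    # In the setup described, the script is handed to the daemon running
    # inside the customer's environment, not executed from the SaaS side.
    print(f"dispatch {rule['script']} to the in-environment daemon")
```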
So you mentioned auto-healing there.
Can you explain that a little bit more?
Yeah, absolutely.
So again, that, I think, is our mission.
No engineer should have to wake up at 3 a.m.
to solve a production problem.
But obviously, it's easier said than done. I think some of these problems do get hairy.
But the idea is that if I look at what's the common denominator, and some of the enterprises
actually do solve for this, but they've all been written as internal tools for one specific
microservice. But if I look at what's the common denominator,
what's the architecture for it,
you need a trigger, which the trigger is typically an alert.
When that happens, you need something to do the root cause analysis,
figure out what exactly the problem is,
and then you need something to,
now that I know what the problem is,
what is the solution, come up with a solution,
and then take that action itself.
So that is exactly what Temperstack enables you to do.
The last part of actually the taking action,
we don't want to do that as a SaaS tool
from outside of our client's environment.
So we have a daemon that sits inside
of the client's environment,
and that is the one that is executing some of these actions. Again, like I said, we suspect that the ones that will get
automated sooner rather than later are the ones where either the engineering team knows that these
are the conditions and these are the solutions, that this is exactly what we do. Or it could also be just the diagnostic steps that you're choosing to let
Temperstack take action even before you get paged and you come on the call.
Where all the information that you need from
that particular instance that's failing right now has already been gathered,
and that is automated.
Then you can look at what are
the recommendations for resolution and take those actions yourself.
Or you can do it, just like you're coding on Cursor today
with vibe coding, you can do vibe troubleshooting,
where you're actually working with Temperstack to say,
okay, let me execute this recommendation
of the diagnostic step.
And every time it comes up with that result,
it actually auto adjusts and rewrites the whole runbook.
You can actually step through it and say,
okay, let me run the first diagnostic step.
Okay, now a bunch of things have changed.
Literally go one step at a time and execute one command at a time,
and also do that same thing for the resolution.
Right.
So, yeah.
So I think what is exciting is that future where engineers don't really need
to be on call and instead of having to wake up at 2 a.m., they can look at an email at
8 a.m. and say, oh, this happened. And, okay, let me
figure out how to make sure that this never happens again. And that, I think, should be the
engineer's job. That's the part that they enjoy doing. And that's what we, I think,
we want them to be doing. Yeah. So with this, how do you kind of build trust
with these teams?
We talked about these superheroes previously.
I would think that some people who have that status in the company like having that status
and maybe feel a little bit protective about their status as the person that you have to
go to and rely on.
How do you break down those barriers or those walls that are put up?
Right. So the first part of the question is that the superheroes actually are very,
very critical to this whole process because they are the ones who will externalize their knowledge
and actually teach the agent to perform better and better over time. Like I said, there are feedback loops where either you can give
the agent subjective feedback and who better than
those superheroes in your team to actually play around with
the agent and get them into
some of these situations and train them.
I think it doesn't really diminish their role in the company.
If anything, it probably increases their role,
just that you don't have to wake up at 3 AM to do it.
You can train the agent in
peacetime in your work hours instead of having to wake up,
which I'm sure nobody enjoys.
But the second part, I think about the trust,
which is a very important point that you brought up,
is I think we've seen that happen
with our alerting recommendations as well.
So humans obviously start from a low trust point, right?
And then we think that it's the product's responsibility
to build that trust over time.
And what we saw happening
with our alert recommendations as well
is that initially people were skeptical,
they would go and tweak some things around. But I think over time, they realized that
they're not making too many changes. And that's the point at which they were okay to put that
in a full self-drive mode. But there is a setting where you can go and say, I just want
Temperstack to start setting up whatever alerts it thinks are right. We see that even on the resolution side,
it's going to be a similar journey where initially it's going to be recommendations,
it's going to be people stepping through and working with the agent to solve
these problems until it gets to a point where either they know that this problem happens,
and this is the exact solution.
I can see that this is the script
that's going to get executed.
I think that also helps.
Showing that this is what we're going to do
lets them know that, okay, I'm still in control
and I can pause wherever I want.
That's one part of it.
And then getting it right over time
and having the right recommendations is the second part of it,
where if they feel that, hey, you know what, these recommendations seem to make sense,
and those superheroes start feeling that, hey, this is exactly what I would have done.
That is a critical moment for us, where they start feeling that, hmm, this makes sense,
because this is exactly what I would have done.
I think, yeah, it's a process, and we have to gain that trust over time.
Yeah. How far away do you think we are
from having a fully autonomous system for this?
Well, we actually already do and that's the interesting part of it.
The demo that we ran was 100% autonomous.
It started from the alert and went all the way
through root cause to identifying what exactly
is the action to be taken and taking that action as well.
So technology-wise, I think we're there.
These systems can take autonomous action.
Obviously, there will be some of these hairy problems,
and they come more from, I think,
let's say the recommendation is kill this process.
But you know that this process has 200 other dependencies,
and I can't just kill this process.
I have to figure out something else.
So those are the kinds of things I think that we will have
to figure out.
That's where the vibe troubleshooting comes in,
where you say, okay, till here is fine, but the next step, let me do this myself.
In some sense, I think that technology exists even today, but to leverage that
again goes back to, are our systems architected well enough?
And so that is what is, I think, going to take time:
getting it to a point where you can make the most of this agent.
And that's going to take time.
And it has both technology, culture, all of those angles to it.
Right.
You're right.
So I want to get into the agent architecture a little bit.
If someone's running all of Temperstack,
how many agents are we talking about behind the scenes?
It's hard to say how many agents,
because literally every monitoring system has
its own agent built in.
But it is a multi-agent setup.
There are agents talking to each other. If I look at it from an outcomes perspective,
again, there are four outcomes that we want to achieve. One is that alert setup. The second is
the root cause analysis. That is one major outcome.
The third is coming up with recommendations
and what to be done, like the run book.
And then the fourth is actually taking action.
And all of these use different kinds of agents
within the system.
In terms of the architecture, I think there are three things
that the agent needs to work with.
One is the underlying LLM model,
which right now for us in production is Claude 3.7 Sonnet.
That's what we're using in production, but in parallel,
we're also training our own Llama model,
and we're beginning to see some improvements in accuracy
when we use that fine-tuned model rather than the public model.
So that's one part of the architecture. The second is it needs data stores that it has access to.
So again, there are both relational data stores, which is where your alerts, alarms,
the resource data, all of that is stored. And then you have your vector databases where more
of the knowledge base is stored.
So that's where we're sort of ingesting data
from your Stack Overflows,
the monitoring tools' documentation,
the forums on Reddit, et cetera,
where some of these conversations
around incidents are happening.
So all of that goes into our vector database
and potentially in the future,
we could also have the company's own knowledge base
sort of brought into this vector database.
So those are all the data stores that the agent has
to make the decisions that it needs.
And then there are tools.
So for example, every single integration that we have
with the monitoring tools becomes one tool
for the agent to use, which it can use to set up alerts or improve some of the alerts, et cetera.
Similarly, there's a runbook generator,
because something has to generate that final UI,
which is all created on the fly.
So those are some of the tools that the agent has access to
to be able to achieve those four outcomes.
And that's how I think we think of the architecture largely.
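To make those four outcomes concrete, here is a minimal sketch of how a linear alert-to-action flow (root cause, runbook, execution) might be wired as a graph using LangGraph, which comes up below as the workflow layer. The state shape and node bodies are stubs and assumptions, not Temperstack's actual graph.

```python
# Hedged sketch of a linear incident-handling graph in LangGraph.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class IncidentState(TypedDict, total=False):
    alert: dict
    root_cause: str
    runbook: list
    executed: bool

def analyze_root_cause(state: IncidentState) -> dict:
    # Placeholder: would call the LLM with the alert cluster plus recent metric data.
    return {"root_cause": "db connection pool exhausted"}

def generate_runbook(state: IncidentState) -> dict:
    # Placeholder: would draft diagnostic and remediation steps for the root cause.
    return {"runbook": [f"inspect pool for: {state['root_cause']}", "restart the affected service"]}

def execute_actions(state: IncidentState) -> dict:
    # Placeholder: would hand the runbook to the daemon inside the customer environment.
    return {"executed": True}

builder = StateGraph(IncidentState)
builder.add_node("root_cause", analyze_root_cause)
builder.add_node("runbook", generate_runbook)
builder.add_node("execute", execute_actions)
builder.add_edge(START, "root_cause")
builder.add_edge("root_cause", "runbook")
builder.add_edge("runbook", "execute")
builder.add_edge("execute", END)
graph = builder.compile()

print(graph.invoke({"alert": {"metric": "CPUUtilization", "value": 98}}))
```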
Yeah.
And for each of these agents, or really the combination of them,
like how do they run?
Are these running as microservices on Kubernetes
or something like that?
What's that set up look like?
For us?
Yeah.
Internally, I think the access to the LLMs is built on Bedrock.
A lot of the agentic workflow is built on LangGraph.
Any of the interactions that are happening between these agents are happening on LangGraph.
Zilliz is our vector database.
I think those would be the tools that we use, but one
piece that sits outside of all of this is that daemon that sits in the client's infrastructure.
For that, the idea there again is one, that daemon will take the actions that need to be taken, but it's also looking at,
hey, what are the actions that are being taken by engineers
outside of what is recommended?
And so that also becomes part of the feedback loop
in terms of saying, okay, maybe we missed something here
and we need to sort of think about that as well
and that gets added to the context and next time it's used.
Yeah, I don't know if I answered your question on this one.
Well, I was just kind of curious about, I understand you're using LangGraph,
but where does this actually run? It's running within your cloud environment,
but I guess how do you run that? How do you serve that? Is it serverless? Is it running on Kubernetes? That was the question I was asking.
Yes. Actually, right now, I don't think we are on Kubernetes. We are on EC2.
Because we don't have too many microservices, it's built as a monolith right now. EC2 works
better for us, but as we scale, we'll probably break it down into
more microservices and have it run on Kubernetes. Where we use serverless is actually to run those
actions. So when the daemon is running actions within the client's infrastructure, that's where
it would spin off probably a Lambda function and run whatever script is there on that Lambda function.
That's where we use serverless, but within our own SaaS offering,
there's not too much serverless that we use.
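For the serverless execution path, a daemon-side step might look something like the following boto3 call, with the Lambda function name and payload shape invented for illustration.

```python
# Hedged sketch: run one remediation step by invoking a Lambda function.
import json
import boto3

lambda_client = boto3.client("lambda")

def run_remediation_step(script_name: str, params: dict) -> dict:
    """Invoke a (hypothetical) remediation-runner Lambda and return its result."""
    response = lambda_client.invoke(
        FunctionName="remediation-runner",    # hypothetical function name
        InvocationType="RequestResponse",     # wait synchronously for the result
        Payload=json.dumps({"script": script_name, "params": params}).encode(),
    )
    return json.loads(response["Payload"].read())

print(run_remediation_step("restart_consumer_pool.sh", {"service": "orders-worker"}))
```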
Okay. And then what happens is, or how do you handle a situation where if you have
multiple nodes in something like a LangGraph
that's executing some sort of agentic component
of this entire workflow, and that part fails
or generates an output that is unexpected
or something like that, how do you handle
that failure scenario?
Right, so Bedrock does have a governance layer on top of it.
LangGraph also has that.
And then we have some feedback loops
built in saying, okay, if this fails, then... I mean, two things happen. One is, in the instance,
what is the recovery? If the output right now is not what I expect, can I do retries? That is the in-product
fix where we have those control loops. But also, a lot of these are getting logged to
our systems, where some engineer can go take a look at it and see, for some of these
use cases, what was the hallucination score, and what were the learning scores of how much the agent
is actually learning versus how much of it is, in some sense, a repetition of what it has seen before.
So we're looking at all of these governance metrics to make sure that we are constantly
also feeding information to make it better and better. And I would say those are the two parts to that control loop.
One is how do I fix it in product while it's generating
and something failed?
And the second is from a long-term perspective,
how do we keep making these improvements
based on some of these governance metrics
that we are tracking?
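The in-product part of that control loop, retry until a validation check passes and log every attempt for later review, can be sketched generically like this; generate() and validate() stand in for whatever generation and governance checks are actually in place.

```python
# Generic retry-with-validation control loop (illustrative only).
import logging

logger = logging.getLogger("agent.governance")

def generate_with_retries(generate, validate, max_attempts: int = 3):
    """Return the first output that passes validation, or None to escalate."""
    for attempt in range(1, max_attempts + 1):
        output = generate()
        ok, reason = validate(output)
        logger.info("attempt %d: valid=%s reason=%s", attempt, ok, reason)
        if ok:
            return output
    # Fall back to flagging for human review instead of acting on a bad output.
    logger.warning("all attempts failed validation; escalating to a human engineer")
    return None
```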
Mm-hmm.
For the hallucination score, are you using a model
to evaluate the hallucination level?
I'm actually not sure how we're measuring that.
I do know that we get a hallucination score from Bedrock itself, if I remember right.
I'm actually not sure.
I think our CTO would be the better
person to answer that question. But I do know that he sometimes sends me snippets of
what those scores look like. And that's where my understanding of this comes from. But how
exactly that hallucination score is being generated, I will have to get back to you
on that. Yeah. Okay. Yeah. I was just curious. In terms of testing and evaluating,
how do you handle that situation where presumably you're going to have to iterate on this?
You may change a prompt, you're doing tests right now of the fine-tuned model versus Sonnet 3.7.
How do you know that you're actually moving things in the right
direction versus creating something that's actually worse? Yeah, again, so there are two
parts to this. One is just like how a consumer or customer would test for this, we're looking at
all of the suggestions. Are they making sense for us as engineers? And we also have a bunch of people that we share it with
and say, is this making sense?
That is one level, I think, which is more like a gut check
to see if this is making sense.
And we've actually gone through an evolution there as well.
When we started, we would just send the alert
to the public LLM and see if it's able to generate
what we need from it.
And the accuracy was probably like 20, 30%
and it was pretty bad.
But with the RAG model, I think we saw that go up to 70, 80%
in terms of how often does a human engineer agree with the
recommendations, whether it's on alerts or actions to be taken, root cause, et cetera.
But with the fine-tuned Llama model right now, we see that accuracy growing to 80,
90%.
So I think that's one side of it.
The second is also the more context we're able to set,
which comes from how much information do we have
in our vector database and how many different scenarios
has that vector database documented.
So in some of the cases, we have actually had to write
like hand code sort of different kinds of scenarios.
For example, for the root cause analysis, where it's trying to do a correlation between
alerts, we've had to seed that with about 5,000 different scenarios that we have seen, where
we know that, okay, if this is the cluster of alerts, then this is the most likely root
cause.
But with that
seeding, now we see that for things in that vicinity, it's able to extrapolate from that learning data
and say, okay, so if I now combine these two use cases, then what happens? So those things
the agent is learning on its own and learning to do.
So one is ingesting public data. Two is writing our own use cases and creating that hand-coded learning data. And the third
is, of course, as we work with more and more customers, there's a lot of that feedback
that's coming in. So those are, I think, all of the loops
that we have to keep making this better and better,
and that's how the learning is happening.
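The seeding idea can be pictured as nearest-neighbor retrieval over embedded alert-cluster scenarios. The sketch below uses plain NumPy and a placeholder embed() function just to show the mechanics; in practice this would sit on top of a real embedding model and the vector database mentioned earlier.

```python
# Illustrative retrieval over seeded alert-cluster -> root-cause scenarios.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: pseudo-random vector keyed off the text hash (stand-in only).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=128)

SEEDED_SCENARIOS = [
    ("high CPU + connection timeouts + 5xx spike", "db connection pool exhausted"),
    ("memory growth + GC pauses + latency creep", "memory leak in application process"),
]
SEED_VECTORS = np.stack([embed(desc) for desc, _ in SEEDED_SCENARIOS])

def likely_root_causes(alert_cluster: str, top_k: int = 3):
    """Return (root_cause, similarity) pairs for the closest seeded scenarios."""
    query = embed(alert_cluster)
    sims = SEED_VECTORS @ query / (
        np.linalg.norm(SEED_VECTORS, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(sims)[::-1][:top_k]
    return [(SEEDED_SCENARIOS[i][1], round(float(sims[i]), 3)) for i in order]

print(likely_root_causes("connection timeouts and 5xx spike on checkout"))
```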
And given that you're in the business of SRE incident
management, how do you make sure that you don't have
your incidents on your side?
So we do use Temperstack for our own monitoring as well.
You know, and I think there are enough redundancies built in on our side.
All of this architecture, our CTO has done this for companies as large as Goldman Sachs.
He's been the CISO for really high-growth startups like
Dunzo, which is like the DoorDash of India. So I think that initial expertise that exists within
the team has all been in this. He's been in the DevOps and SRE space for the last 15 years.
So in fact, a lot of our customer conversations start with some sort of a consulting on how
should they instrument their reliability and observability tools and things like that.
I think we, of course, have taken care of some of those things.
We have our ISO coming through, and SOC 1, SOC 2 coming in as well. So all of that, of course, is the external
sort of the validation of the efforts that have already gone into making sure that we have a very
robust system working at the heart of all of this. And in terms of building these kind of like
GenAI products, these multi-agent systems, are there things,
like lessons learned along the way, that you can talk a little bit about for anybody that's listening that's looking to start to build these types of systems?
Sure, yeah. So I think, like I was talking to somebody yesterday at one of the conferences,
and this is exactly the question that they had. And what I
realized is for us, it's always been sort of problem-based learning. Obviously, we've had to
learn a bunch about how some of these systems work and what are the best tools out
there to use for ourselves. But I think we were always trying to solve one specific problem,
because, like, GenAI is an ocean, right?
I don't think there's any way you could learn everything
that is out there.
But for us, when we started solving the alert problem,
we said, OK, what do we need to make sure X happens?
And then started looking at where
we can leverage AI to do that.
And the same with the runbook and the root cause analysis use cases as well.
So everything starts with a problem.
And then when we know the problem, and
this is how I would want to solve the problem,
then we see if AI is the best use for it
and what tools
should we bring in to solve that problem.
I think one of the problems is that with AI, when you have a hammer, everything looks like
a nail.
And I see that approach with a lot of engineering teams that are actually starting off with,
hey, you know what?
I want to build an agent and then let me figure out what...
Yeah, rather than first figuring out the problem.
Yeah, I spoke on a panel this week about AI readiness, and one of the points I
made is, start with the business value and, you know, what is the problem you're trying
to solve, then figure out whether it makes sense to use AI or something else, right? I agree, like
I think there's a lot of pressure on organizations right now to go to the innovation team, go to
their, you know, leaders in engineering and be
like, what are we doing about agents? We need agents. And they don't actually know why they
need it. They just know that they want it. Right. Yeah. And I think people in that blind
chase are discounting how useful workflows are. So I think for us, it's been a combination of workflows and
agents that actually unlocks the most power of AI. I think that's how it's been. It's been problem-based
discovery of different AI tools that can be useful for solving that particular problem.
Yeah, there's actually a really good GitHub repo that I read through this week called
the 12 Factor Agents, which is principles for building reliable LLM applications that
kind of relates to what we're talking about.
But in that, basically the author of that has tried all the frameworks and kind of talks
about the reality of building agents and what are the things that you really want to control.
And I think the things that we see in the news or make headlines, make keynotes and
stuff like that are these very open world problems of like, we had a world where at
one point software was like a single node and then we went to sort of more workflows
where we had like a defined DAG.
And I think the promise of agents is that you just have nodes and the edges basically
get dynamically figured out by some sort of reasoning agent.
But I think what actually is being successful in business
is a lot more close to the DAG structure
with a little bit of dynamic choices around what tool to use
or sometimes which edge to
go down. But it's not just completely open world, like, you know, here's access to AWS,
go, you know, fix all my problems. It's a lot more constrained. And you want to remove
as much non-determinism as possible, and really only rely on non-determinism when it makes
sense. I think that's going to be key.
Makes a lot of sense.
Yeah, yeah, absolutely.
Makes a lot of sense.
Awesome.
Well, is there anything else you'd like to share?
Not really.
I think we're right now super excited
about the full self-drive version of that agent, is how I'm kind of looking at it.
We would love for people to try it out on their own, maybe some test environments that
they're okay to break. I would love to see people actually try that out and see how it works.
So yeah, I think that's what we're most excited about
right now of putting this agent in the hands
of as many people as possible
and have them try to break the agent itself, in some sense, right?
And find those use cases when the agent is completely
sort of flummoxed and doesn't know what to do.
Right?
So yeah.
Yeah, awesome.
Yeah, I mean, they say no
product survives its first encounter with real users. So yeah, well, thanks so much for being
here. Really enjoyed it. I love the vision of what you guys are building. Thank you so much,
Sean. Thanks for having me. And it's been a great conversation. Thanks so much. I will look up the
article that you just mentioned on GitHub. Yeah, cheers.