Software Huddle - From Code Red to Green: Incident Management with Nora Jones of Jeli and Dan McCall from PagerDuty
Episode Date: November 21, 2023

Today's episode is all about Incident Management. We have two amazing guests, Nora Jones, founder and CEO of Jeli, and Dan McCall, the VP and GM of Incident Management at PagerDuty. There's of course a technical aspect to managing incidents that PagerDuty excels at and is very well known for, and there's also a human side: how do you learn from an incident so it doesn't happen again in the future? This is where Jeli steps in. In the episode, Nora and Dan talk through the evolution of incident management, the hard problems in the space, and a future that leverages AI with a human-in-the-loop component to scalably and proactively manage incidents and reduce outages. We also touch on the recent announcement that Jeli was acquired by PagerDuty.
Transcript
Incidents are really important to businesses because they rob your engineers from doing more productive things like building innovative software.
We think that the humans and machines together is actually the ideal situation.
First thing you want to do is you want to help the human by finding the signal in the noise.
And we use AI within the operations cloud in order to find that. So a lot of what we've done with Jeli is not enable the generative AI to make decisions,
but to instead summarize data and inform the human.
And so give the human information
that would have otherwise maybe taken them a long time
to understand, deduce,
or maybe they wouldn't have even looked into it.
Hey everyone, Sean Falconer from Software Huddle.
And today's episode is all about incident management. I have two amazing guests, Nora Jones, founder and CEO
of Jeli, and Dan McCall, the VP and GM of incident management at PagerDuty. There's, of course,
a technical aspect to managing incidents that PagerDuty excels at, very well known for. And
there's also a human side, like how do you learn from an incident so it doesn't happen again
in the future? And this is where Jeli steps in.
In the episode, Nora and Dan talk through the evolution of incident management, the hard problems in the space,
and a future that leverages AI with a human in the loop component to scalably and proactively manage incidents and reduce outages.
We also touch on the recent announcement that Jeli was acquired by PagerDuty.
You probably couldn't find two more informed people to talk about incident management than
these two. They made their careers in this space. And I think you're really going to love the
episode. So let me stop rambling and kick it over so you can hear from Nora and Dan. But just one
last thing. If you enjoy the episode, please subscribe to Software Huddle and follow all
our updates on Twitter and LinkedIn at Software Huddle.
All right, over to the episode.
Nora and Dan, welcome to Software Huddle.
Hello there.
Hi, Sean.
Hi. Yeah, thanks so much for being here.
I think, you know, we got a lot to get into today.
Incident management, exciting updates about Jeli that directly relate to both of you being here.
But before we get into all that,
let's start with some basics. Who are you and what do you do? Nora, why don't we start with you?
Yeah. Hi, I'm Nora. I'm the founder and CEO of Jeli.io, an incident management platform.
Awesome. And then over to you, Dan.
Awesome. Nice to meet you. I'm Dan McCall. So I'm the VP of product for incident response here at PagerDuty. I've been here just over two years
and I'm in the Bay Area. Fantastic. Yes. We were just covering that in the sort of get to know you
side before we started recording that. I'm also in the Bay Area. So I wanted to focus, start our
conversation anyway, focused on incident management. And, you know, an incident is typically some kind
of event that requires like a team's immediate attention. What are some of the categories or types of incidents that force
the team to kind of go into this like, you know, firefighting mode?
Yeah, so I think, you know, incidents are something that is really important to businesses,
because they rob your engineers from doing more productive things like building innovative software. And so there's a
wide spectrum that we see in the industry in terms of the different categories that they can fall in.
So you have your kind of normal everyday incidents that are, say, of a lower priority.
And these are pretty common. They happen pretty frequently, but they're not necessarily show stopping.
Then you have the ones where your main product is no longer working.
So if you imagine you're an e-commerce company and people can't check out, as an example, that is an emergency.
That's something that is both urgent and critical, that you need to really redirect your resources to solve that problem immediately.
And of course, we see these issues happening for software companies like an e-commerce company, but we also see them happening across a wide spectrum of our customers.
So, for example, we even have a customer that has physical merchandise that has to stay at a certain temperature.
And if the temperature of those refrigerators that the merchandise is in drops to a certain level,
that is also an emergency because it will spoil those goods, as an example. And one of the things
that we see is that sometimes those kind of critical incidents that are happening within
an organization are sometimes evidence of another problem, like a security problem,
for example.
And so we see, you know, whether it's software, whether it's physical goods, whether it's some sort of security issue,
there's a broad spectrum of what can go wrong within an organization.
And it's incumbent upon that organization to be ready when those things happen and to have pre-rehearsed what to do so that they can really minimize the
disruption that it creates because time is money. And you really want to get back up and running
for whichever category it is as quickly as possible. Fantastic. And then, Nora, do you have
anything to add to that? Yeah, I think Dan brought up a good point about there are so many different
circumstances every day that might take engineers or other folks from the organization away from
their normal work that they had planned to do that day. And I think one of the biggest things
with incidents is, yeah, defining is this thing an incident right now? Do we actually need to drop
what we're doing in order to resolve this right now? And a lot of times, you know, if folks are wondering and debating it,
the answer is yes, right? Like they're already taking time out of their day. They're already
thinking about it. And I think there is a whole, there are whole different categories of incidents,
right? Like there's incidents that require legal and require PR. There are incidents that require
a certain set of engineers or incidents that require customer service. And I think one thing that I see in healthy organizations
is that there is alignment across the business about KPIs, about what an incident is and what
it is not. So I think to go back to your original question, like what types of incidents do we see?
All kinds. And it really, truly depends on your business, like Dan said.
And how do companies like typically handle these incidents? And like, what are the processes and
tooling that's in place for doing that? It sounds like in the ideal scenario, there's some kind of
process or framework in place where people have alignment or agreement about what an incident is,
potentially different classes of incidents, as well as how severe an incident is.
But once they have some kind of framework in place,
how are they actually kind of dealing with this and triaging them?
Yeah, so I can start.
So we see, again, a spectrum.
And so we published something called the Digital Operations Maturity Model.
And it basically describes the different phases of maturity that different companies have, in order to kind of gauge their readiness in terms of how they would deal with this.
And it goes from manual to reactive to responsive to proactive and finally preventative.
That's kind of the
spectrum in and of itself. And on the manual side, we see that people kind of hero their processes,
right? And so manifestations of this look a lot like, oh, a major incident happens. Get everyone
on a conference bridge right now, right? Where literally 100 people will all join the same conference bridge simultaneously. And it's hugely disruptive, and hugely inefficient when folks do that.
And so as you see companies kind of progress through that, you know, maturity, you start to see
that they evolve, right, they evolve both their processes and their software. And so the PagerDuty Operations Cloud, which is the portfolio of products that we have, is really designed to help companies evolve in their maturity so that they can make sure that the right person at the right time is alerted when there is a problem, and that we prevent the kind of broad disruption that we see happening in less mature
companies. But you want to be very respectful of the humans.
We believe passionately in the value of humans and that we should respect them.
And so one of the things that we think really helps with that is layering in
automation alongside the humans together. We think that the humans and machines together is actually
the ideal situation. So for example, let's say that you're listening to your observability,
you're getting this giant alert stream. Your first thing you want to do is you want to help the human
by finding the signal in the noise. And we use AI within the operations cloud in order to find that,
right? So once you've reduced the noise and you found the signal, then you might want to wake up
a human at that point. But you might want to do some things, again, to respect that human first.
So for example, there may be some diagnostics that you want to run that may take a few minutes.
Give your human a few more minutes to sleep first.
Run that diagnostics.
And then only when you have the results back, then kind of alert the right person.
And when you alert them, make sure it's the right person, the person that's actually on call for that particular part of the software.
And so we see that there's a lot of ways that software can help mature a company.
And the operations cloud really helps companies along that maturity journey.
So we see that this can be really helpful.
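As a rough sketch, the flow Dan describes (suppress the noise, run diagnostics first, and only then page the right on-call) might look like the following. Everything here (`Alert`, `run_diagnostics`, `page_on_call`) is an illustrative placeholder, not a PagerDuty API:

```python
# Hypothetical sketch of a "respect the human" alert pipeline:
# drop non-actionable noise, gather diagnostics before waking anyone,
# then page the on-call with that context already attached.
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    message: str
    is_actionable: bool  # set upstream by noise reduction / grouping


def run_diagnostics(service: str) -> str:
    # Placeholder: in practice this might run health checks or scripts
    # that take a few minutes, buying the human more sleep.
    return f"diagnostics for {service}: checks complete"


def page_on_call(service: str, context: str) -> None:
    # Placeholder for paging whoever is on call for this service.
    print(f"PAGE on-call for {service}: {context}")


def handle_alert_stream(alerts: list[Alert]) -> int:
    """Process an alert stream; return how many pages were sent."""
    paged = 0
    for alert in alerts:
        if not alert.is_actionable:  # signal vs. noise: skip the noise
            continue
        context = run_diagnostics(alert.service)  # diagnose before paging
        page_on_call(alert.service, context)      # then alert the right person
        paged += 1
    return paged
```

The ordering is the point: diagnostics run before the page goes out, so the human who wakes up already has results in hand.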
And then, so you mentioned, you know, there's this whole spectrum, like with lots of, you know, tools and processes that companies might go through and they go through this sort of growth curve, but kind of taking a step back and looking at the evolution of incident management, like how has the space changed in the last, you know, 10 years or so?
Like, have we gotten better at sort of essentially managing this process, being more efficient about it?
How has this evolved in the last decade or so?
Yeah, I think that's a great question.
And I think PagerDuty actually started a lot of this back in 2011, right?
They gave the tech industry the ability to hold a pager and get alerted on things.
And we were keeping our businesses always up. Folks were
not expecting maintenance pages anymore. Like they were actually expecting us to respond and,
you know, care about the end usage of our products. And I think Dan is actually bringing
up good points. Like in order to care about the end usage of our products, we also have to care
about how folks coordinate during incidents as well. And so I think there has been a big
evolution in the tech industry from like an always-on, you know, I don't sleep, I'm chugging Red
Bulls all night to like, keep my systems up to like a more sustainable pace, and also more focused
on systemic expertise rather than individual expertise, which I think incidents actually have a really great way
to reveal and grow systemic expertise afterwards. And so what I mean by that is like,
you know, say you do page the person on call and they might need help and that's totally okay,
but that itself is data that you can use to grow the rest of the system that you can use to grow
the expertise of the other folks. And so I think a big way it's evolved is that we're now paying attention to coordination
and cognition during incidents. And those are two important pieces in order to
make incidents a little bit more normal for our organizations, but also more normal for
our customers. And when I say more normal, I mean,
better, like, I mean, you know, not as not as impactful, not as long, things like that.
Maybe over to you, Dan, in terms of what trends or technologies are sort of having the most impact
in terms of helping organizations with essentially scalably
and also efficiently responding to incidents
and learning from them?
Yeah, so I think it first starts with really knowing
that an incident is even happening in the first place.
And one of the ways that we've really been addressing this
is to think holistically about a company
in terms of how it interacts with its own customers. Okay. So despite all the tools
and all the software that exists in the world, you may be surprised to know that over half of the
time, the way that a company learns that a major incident is happening is because a customer has
called them, which is pretty surprising, right? And so when a customer calls you, how do you want to show up, right?
You want to show up in a coordinated and mature way.
But what we find is that many customer service organizations are actually disconnected
from their own engineering teams within their company.
And so when that customer calls, that customer service team is caught off guard
in a way that makes them feel kind of, you know, disconnected, right? And doesn't help them show up well to their customers. So part of our
operations cloud is that we have software specifically for the customer service teams
that shows up inside Salesforce, inside Zendesk, inside, you know, their implementation where they
live. And it gives them direct visibility
into what's happening live inside of their engineering and IT organizations, so that when
they receive that call, they can be on their front foot. And they can say, hey, thanks for calling,
we're aware of the situation. And it's going to be resolved in seven minutes, right, where they can
be much more direct and informed when that customer calls. So I think that's the first thing is just having your organization show up in a coordinated way. Also, one of the things that we have
discovered, especially with our enterprise customers, is that every customer is unique.
And although we talk about incidents and major incidents and security incidents, as if they're
this like static thing that is shared between all companies, companies are actually quite unique. And so they
have specific processes and policies inside of their own companies that are quite distinct from
one another. And so one of the things that we're trying to do is really build our software in a
platform way such that each company can tailor the solution to their unique needs. So, for example, during major incidents, let's say you have a SEV1 or SEV2 incident,
there are often a list of actions that you're supposed to take inside of a company.
And we find that, again, on that operational maturity model, on the left-hand side where you're more manual,
this literally might be in a physical binder somewhere.
You know, over to the right-hand side, we see that companies are a lot more digital about this. And
so one of the new capabilities we've released over the past year is what we call incident
workflows. And what this does is it allows a company to basically define in advance the actions
that should happen with different classifications of incidents.
So, for example, if you have a SEV1 incident in your EMEA data center, you might run this
runbook or incident workflow, and it could have 10 steps in it, okay?
But if you had a SEV1 incident and it's in your APAC region, for example, maybe it's
a different seven steps.
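The classification-keyed workflow idea Dan describes could be sketched like this. The step names, the `(severity, region)` key, and the drafted-not-sent status update are all illustrative assumptions, not PagerDuty's actual schema:

```python
# Illustrative sketch of incident workflows: actions defined in advance,
# keyed by incident classification. Note the human-in-the-loop step:
# a status update is drafted for a human, never auto-sent.
WORKFLOWS: dict[tuple[str, str], list[str]] = {
    ("SEV1", "EMEA"): [
        "create_slack_channel",      # automated: no human time wasted here
        "start_video_bridge",        # automated
        "page_incident_commander",
        "notify_customer_service",   # keeps support teams on their front foot
        "draft_status_update",       # human in the loop: drafted, not sent
    ],
    ("SEV1", "APAC"): [
        "create_slack_channel",
        "page_incident_commander",
        "draft_status_update",
    ],
}


def steps_for(severity: str, region: str) -> list[str]:
    """Look up the pre-defined runbook for this incident classification."""
    # Fall back to an empty workflow when no classification matches.
    return WORKFLOWS.get((severity, region), [])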
And what we find is that, you know, you need to, again, kind of respect your humans and help them do the things that they are not necessarily disproportionately competent at. So for example,
is it worthwhile disrupting a human's time from trying to solve a problem in order to spin up a
Zoom bridge or spin up a Slack channel?
No, those should just be automated. Those should just happen so that the human can focus on the
thing that the human is best at doing. And so what we found is that during a major incident,
you don't want automation to just run automatically if it requires a human intervention.
And so one of the things that we're passionate about is this concept of a human in the loop action. And the way that these work
is basically that you might remind a human to send a status update, but you shouldn't just
automatically write and send one, right? There should be a human touch that is involved in these
things. And so we really think that, you know, whether it's the intake of the incidents, whether it's during the incidents or whether it's learning from the incidents afterwards, there's been a lot of innovation that's happened between our two companies over the past few years that are really making this fundamental process that companies care about a lot more fluid and more mature. Yeah, I think you raise a number of good points there in terms of, I think,
it's much more comforting as a customer to, if you call
or interact with customer support, and they're already aware of the issue, it makes you
feel like, okay, well, this might be a problem, but at least they have it under control or they're aware of it
rather than that person potentially being completely blindsided. So a lot of it's about
sort of information sharing
and visibility across the organization,
especially as companies grow and scale,
that gets harder and harder to probably do
without essentially having like tools
and technology there to facilitate it.
And then the other thing I think
that was really interesting was around the uniqueness
of what an incident might mean to an organization and how you
actually deal with it. You know, if you're a B2C consumer application, like you mentioned,
the checkout process might be the, you know, the highest impact thing where if there's an incident,
like everybody needs to kind of like, you know, jump on it, the right people need to jump on it.
Whereas if you're, I don't know, a B2B API first product, maybe the definition of the most severe
incident is going to be different.
And the way that you essentially deal with that also needs to be different.
So whatever tools and technologies that you're using need to be able to be flexible, essentially, depending on how you define these types of things.
Once you've sort of figured out some of these baseline challenges, how do you actually measure or quantify whether what you're doing is effective and
actually continue to make, you know, improvements? How do you essentially know that the things that
you're doing today are better than what you were doing maybe six months ago?
Yeah. And I think it kind of goes back to how every organization is different. And so I think
finding your baseline and your normal is a really good thing to look at
and to see how you've improved over time. So there are all sorts of different metrics
that you can kind of index on, but without context, you're sort of over-indexing on them.
So you don't want to just look at incident length as one number. You want to look at:
How much time did we spend in the repair phase of the incident? How much time did we spend
diagnosing it? How much time did we spend detecting it? How long did it take us to get the
right people in the room? If there was an engineer kind of waffling for like 45 minutes
before they brought other people in, like all of that is data. And so ideally we over time get
better at coordinating and bringing the right people in, bringing, you know,
having better camaraderie, like during incidents, so that it is a more normal experience. And so
I think you can measure this on like, you know, do we still have to rely on this same person
every time for all these incidents? Like, and you asked earlier, Sean, like, how has the software
industry changed? I think there was a while ago a very like hero culture, right?
Like we have these few engineers that come in and save the day.
I mean, I worked at one organization where every time a certain engineer showed up in the Slack channel for the incident, people would react with the Batman emoji. And while that person was awesome,
I think it's better for our orgs when we're actually spreading out that
expertise and not relying on Batman every time, but instead creating many more Batmans. And so
I think that's actually something you can track too. And it is something that we do
in Jeli for you, you know, like how often are you relying on folks in certain types of incidents that are not
on call? That can tell you a little bit about an expertise gap in an organization. And by closing
that gap, you'll get better at incidents over time. And in fact, you might even see less of
them in certain areas of your business as well. Just to build on that, because I think everything
Nora said is absolutely right. And once
you get those things right, we're seeing something really interesting happen in the industry, which
is once you've kind of got these core metrics under control, you can then elevate your thinking
to business metrics. And we're starting to see, especially our enterprise customers, really
weigh in here because then you can start to measure how is this impacting our ability to grow revenue, right? How is this impacting our ability to
control our costs and be more efficient? How is this affecting our ability to mitigate risk?
And those are things that their executives care about. And so one of the things we've been really
effective at doing is getting the operations under control so that the people in charge can then elevate themselves
to a higher order of problems inside their organization,
which leads to things like promotion for that person, right?
Which requires those core business metrics to be improving as well.
And so I think that's a key part of that story.
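The phase-based measurement Nora describes (how long to detect, to diagnose, to repair, rather than one overall incident length) can be computed from an incident's timeline milestones. The milestone names here are assumptions for illustration:

```python
# Hedged sketch: break one incident's duration into detection,
# diagnosis, and repair phases from timeline timestamps, so you can
# see *where* the time went rather than just how long it took overall.
from datetime import datetime


def phase_durations(timeline: dict[str, str]) -> dict[str, float]:
    """timeline maps milestone name -> ISO-8601 timestamp for one incident."""
    t = {name: datetime.fromisoformat(ts) for name, ts in timeline.items()}
    return {
        "detect_minutes":   (t["detected"] - t["started"]).total_seconds() / 60,
        "diagnose_minutes": (t["diagnosed"] - t["detected"]).total_seconds() / 60,
        "repair_minutes":   (t["resolved"] - t["diagnosed"]).total_seconds() / 60,
    }


durations = phase_durations({
    "started":   "2023-11-01T10:00:00",
    "detected":  "2023-11-01T10:12:00",
    "diagnosed": "2023-11-01T10:40:00",
    "resolved":  "2023-11-01T10:47:00",
})
```

Tracked over time and with context attached, a shrinking diagnose phase is a much more telling signal than a shrinking total length alone.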
Yeah, I mean, I think that it's really important,
especially as organizations get big,
if you're looking for additional resourcing and funding into an area like incident management, you have to make a business case for it.
And things like revenue growth, revenue solves all problems in many ways, or essentially tying those to business impact and having tooling that helps you tell that story and sort of, you know, point to certain metrics that actually impact the business's bottom line, is really, really key.
You mentioned, I liked your story, Nora, around the Batman emoji.
You know, the problem there is that Batman is not very scalable.
So you need to figure that out.
And, you know, I think, you know, we've been talking about these trends and evolution that's happened in the past decade. I would think another thing too
is that the scale that organizations are running at, like the
number of things that we're running in the cloud today, the
sort of massive scale that companies can reach also impacts
essentially how sophisticated they need to be about
being able to respond
to these different types of incidents, and also the consequences of that.
If you are a company that only has a handful of customers and you have an outage
or something like that, that's not ideal, but if you have billions of users
or millions of users, not only the
impact to your business bottom line, but the visibility
of that is really bad if you have
like a major outage that could lead to sort of outside the company consequences in terms of like
negative PR or, you know, people losing trust with the platform. What are some of the kind of like
hard problems that you see both technically and maybe, you know, not maybe more even on the sort
of, I don't know, the social economic side of the business
in terms of responding and dealing with incidents?
Well, the first thing I would say is that you're totally right about trust.
And we, you know, PagerDuty, our customer base is more than two-thirds of the Fortune
100 and more than half of the Fortune 500. They trust us to be up on their worst day,
right? And so I think that that's key, is you need to be able to trust the platforms that you use,
because you're entrusting them with the trust of your own customers, right? I think in terms of,
you know, really adapting to scale and things of that nature, we have some of the largest customers in the world
using us. And what we see is that there is, again, a spectrum around the way that they organize
themselves. So some of our customers, for example, are still using a network operations center model
or kind of a centralized NOC setup. And then you have other customers that are full DevOps,
right, where you build it,
you own it. And what we're seeing is actually that some of these things are starting to come together
more into something we're labeling hybrid ops, where you have a bit of both in this world.
But I think that everyone is trying to figure out how do they incorporate AI? How do they
incorporate automation? How do they make the
best use of their humans so that they can focus them on the problems that are unique,
right? And they can add more automation around the things that are more common.
They're trying to find pockets where there is extreme cost. For example, a service that represents 25% of all your incidents.
Well, maybe you shouldn't use that service the way that you're using it anymore.
Maybe it's time for transforming that service into something that's more reliable.
So I think that they're trying to stitch together not only different models of operating, but different technologies to help them get a lot more efficient so that they can do this at scale throughout their organizations.
And Nora, what do you see as some of the hard problems that organizations are struggling with in the space now?
Yeah, and it all depends on kind of where they're at as a business.
And you brought up a good point earlier, Sean, like there's some companies with billions and billions of users and reputation really, really matters there. We also, we have companies all throughout the spectrum, right? We
have startups, we have Fortune 500 companies. We have companies that, you know, have billions of
dollars of revenue. And I think one thing that I find that's really interesting is when we come
into an org that is all of a sudden getting really popular
and it's like they were a startup and then overnight they are a very different organization.
And I've been in orgs like that and it's really exciting and it's really fun,
but it's also incredibly chaotic. And it's like, all of a sudden you've gotten popular very quickly.
Everyone's using your product and you haven't quite prepared for that scale yet. And why would you have? You know, it didn't make business sense until you've
actually gotten that scale. And that's where I see folks really need to get a handle of an incident
management program quickly. Like they might've had a culture where it's very autonomous, which
is great. Not a lot of process, but during incidents, you need a process. You need to be able to rely on your colleague communicating about the incident the same way you are communicating about
the incident. And that's where I see some of the hard problems is when folks are holding different
views of the situation, which is actually a very normal thing to happen. But that's where it's
really important to retrospect on that afterwards. I feel like when companies are in these pivotal growth points and it can happen at any
point during a company's journey, you know, it can happen as a startup suddenly gets really big. It
can happen when a company IPOs, it can happen when an IPO company opens up a new business unit or
anything like that. And those junctures are like super, super important to actually reflect
on the incidents that are happening then because they set the tone for your culture
and they can actually like really save you money in a lot of time if you are looking at them and
if you are looking at how everyone's viewpoints are differing from each other.
Yeah, I'm glad you brought that up in terms of sort of the post-event analysis and learning.
That was something, you know, during my time at Google, a big part of the engineering process
there is to run essentially like a blameless postmortem on any kind of incident that happens,
and it's officially documented and filed along with learnings and what changes are being made. What are you seeing in terms of
how companies tend to handle post-incident analysis
and learning and what are your recommendations around what
they should be doing, essentially? Yeah, I think there are very
different types of incidents and I think that a lot of orgs say,
okay, if it's a SEV1 incident and like, say,
you know, they define SEV1 as: it impacted our highest ACV customers, and, you know, folks
noticed. I worked at a company where it was a SEV1 if it hit Twitter. And that was a while ago, but,
you know, that was like an interesting aspect of it. And where I see a lot of companies sort of
get into trouble is they only retrospect incidents on the SEV1s.
But the SEV1s don't always account for what's anomalous in your organization. And so when I
say anomalous, I mean, like, you know, maybe usually you have six folks that work on an
incident, and it usually takes you a few minutes. And, you know, you usually don't like have to spend a lot of time figuring out what's
going on. And say that flips: say it wasn't a SEV1 incident, it was a SEV4, but you know,
40 people had to fix it. And it took, you know, maybe it took seven minutes, maybe it took less
than normal, but it took most of that time to actually figure out what was going on.
That is an example of an anomalous incident occurring, and it's actually a really good
example of something to retro on.
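One way to operationalize Nora's point, as a rough sketch: flag incidents for a retro by how far they deviate from the org's baseline (responder count, time to diagnose), independent of their severity label. The field names and the 3x threshold are assumptions for illustration:

```python
# Illustrative sketch: an incident is "anomalous" (and worth a retro)
# if it needed far more people than usual, or took far longer than
# usual to understand, regardless of its SEV level.
def is_anomalous(incident: dict, baseline: dict, factor: float = 3.0) -> bool:
    return (
        incident["responders"] > factor * baseline["typical_responders"]
        or incident["diagnose_minutes"] > factor * baseline["typical_diagnose_minutes"]
    )


# Hypothetical baseline: most incidents need ~6 people, ~5 minutes to diagnose.
baseline = {"typical_responders": 6, "typical_diagnose_minutes": 5}

# Nora's example: a SEV4 that pulled in 40 people still deviates from normal.
sev4 = {"severity": "SEV4", "responders": 40, "diagnose_minutes": 6}
```

A rule like this surfaces the SEV4-with-40-responders case that a SEV1-only retro policy would miss.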
And so I think there can be different levels of retro too, right?
There can be a level where maybe we don't have a lot of time to prepare, but we all
get in a room and we talk about it for 30 minutes and we share our experiences and we
collaborate about it and we figure out what we learned from it. Maybe there's a deeper level
where if it's really anomalous for the org, you are spending a lot of time preparing for it. You
have someone that didn't participate in the incident that is investigating it. So you can
get like a third party review and an unbiased review and interviewing all the people that
participated. I mean, that's a whole other level, but I think there's a lot in between those two
worlds that folks can do as well. And I would look at how anomalous the incident is for your org to
decide which level you want to do. And one of the things Jeli published early on is called the
How We Got Here guide, and you can find it at jeli.io/howie.
And it goes into all the ways that you can conduct a good blame-aware or blameless post-incident review.
And, you know, this doesn't mean, like, no names mentioned or no accountability.
It, in fact, you know, means the opposite.
You know, it creates a safe space for names to get mentioned. It creates a safe space for
accountability and, you know, creates a good environment for people to talk and share
what contributed to the incident and how it unfolded. Yeah. And I think creating a culture
where you essentially can be transparent about these things, even if you were like,
you know, it was, I don't know, your commit that led to like a bug that led to an outage or
something like that. Creating a culture where you're not in fear of the consequences of that
is probably really important as well, because otherwise people are going to hide the fact that
these things are happening versus, you know, essentially, you know, assembling the troops to
deal with it as quickly as possible. Absolutely. Yeah. And that's where I see folks get into trouble, too. It's like everyone could
find out who pushed the thing, and it's not like it was their fault, but it is important
to have that person and that thing, that action that got taken as part of the contributors to
the incident and be able to get that person's
experience. And it's hard. It is not an easy thing to do in an organization to just have that. You
have to build it and invest in it. Yes, absolutely. So you talked a little bit about Jeli there,
and I want to talk a little bit now about Jeli and also your founder journey, and maybe starting
out with you as a founder.
So I know you've worked in the past at great companies like Netflix and Slack. So I guess
my first question is, why become an entrepreneur? Why take on that headache? There's lots of ways,
I think, of getting some of the thrill of being in the startup world without necessarily
taking on the responsibility of having to fundraise and,
you know, hire everyone, maybe let people go and all these kind of hard decisions that come
along with being an entrepreneur. Yeah, absolutely. And I've had, you know, even more time to reflect
on a lot of that over the past couple weeks, you know, with the acquisition news.
And one of your questions earlier was how has incident management changed?
I have been in incidents and in reliability since I graduated college.
You know, it's been like my entire career so far.
And so I've, you know, in very different organizations, I've been a part of that sort of line of defense.
And I've always kind of seen and wanted a better way of doing
things that actually paid attention to the humans that were doing it. And like some of the things
that made the in the moment stuff hard. And when you ask about how things changed, I think about
an interview that I had maybe nine years ago, when I was interviewing for an incident commander role. And as part of the interview,
my interviewer had a stopwatch and they gave me a problem to solve. And then they gave me
all these people that I could talk to. And so, you know, there was customer service,
there were customers, there was a VP DMing me, there were like, there was a whole simulation,
like kind of set up. And I also had to guess how much time had gone by since my
last update. So the interviewer was holding a stopwatch and being like, how much time do you
think has passed? You know, has it been 10 minutes? Has it been five minutes?
And it, you know, reminds me of all these things we have to hold in our head as incident commanders,
all the things we have to hold in our head running an incident. And I think,
you know, that was a pivotal moment for me because I was like, we can help with a lot of this and we can create more of these people that are not just like able to really drop into these moments,
which these people exist and they are amazing. But like, we can spread out this expertise and
equip more people in the organization with this expertise.
And so I saw a big opportunity to really pay attention to a lot of the human aspects. We pay so much
attention to the technical side, to what's breaking, rather than to how things are breaking, what kind of fed into it, what contributed to it,
who was participating. And what I really noticed was when I was in certain organizations,
post-incident reviews went better and conversations went better when there was an artifact.
So rather than like me going to Dan and being like, this is my opinion on what happened, having data to show and orient
around makes it so it's us against the problem rather than like us against each other. And so
I wanted to create these artifacts for people. And so I started tinkering around with it and
getting ideas for it when I was working at Netflix. And then I worked at Slack for a little bit, and it was just, you know, I had to go do it.
And so it was, um, it was something I really wanted to create and I wanted to exist in the software industry.
It was tooling that I wish I could have used, um, in these roles.
And so that was a lot of what drove it. And I also wanted
to create an organization and a culture that I really wanted to work in. And like, yeah, Sean,
you mentioned like there's all sorts of aspects that are really hard. And I think one of the most
interesting things is like, I understand the system from a very different level than I did
before. You know, I was looking at the system as an SRE, as an incident commander, as an engineering leader, but now I have a view of
the system as a CEO, as someone that is like having to raise money, as someone that is, you know,
interacting with customers and managing an organization. And there's all sorts of different
sharp ends that contribute to an incident that exist all throughout the
organization. And so I think it's been like a really fascinating journey and rewarding journey
from a lot of aspects, but it's also been a rewarding journey from the aspect of how it's
changed and evolved my approach to incidents too. Yeah. So a big part of what you mentioned there was some of the hard problems that you encounter within an organization, and places where you feel like there's an opportunity for efficiency gains and better processes.
So how do you sort of manage this human component? How do you help people do their job better?
You know, I love the crazy interview that you did.
And that goes back to your point earlier about, like,
you have to be Batman in that moment, and, you know,
how many of those people exist? But can you build tools and technologies that allow, you know,
more people to essentially hit Batman level in terms of efficiency?
So how is Jeli essentially doing some of that stuff?
Like how are you essentially helping people with this human part of incident management? Yeah. And so we've been integrated with PagerDuty actually from day
zero. So I was like building some of these tools and looking at things manually before where I was
seeing, okay, you know, what is certain data during incidents? Like what is the people data?
Like how, how long has this person worked here? Have they been trained to do incident response?
Are they actually on call? Like how long did it take them to get into a channel? And so one of the things I
knew from the very beginning was that we wanted to hook into PagerDuty. You know, it kind of starts
with that alert and, you know, that on-call schedule. And I wanted to create
something that fed back and like improved it over time too. And so, um, you know, you get alerted and then,
then I think you decide, do I want to call an incident? And one of the things that we try to
do with our tool is help cut that part out, because honestly it can be a waste of time
if it is an incident and you're sort of waffling about for 30 minutes. And so we try to
make it really easy for you to just spin up a channel,
spin up a Zoom room, spin up whatever makes sense for your organization
and get the right people in the room.
And so we try to take that cognitive burden off of the human that noticed the thing.
Because it's like they're already noticing the thing.
It might be three in the morning.
Like there's probably a lot going on.
And so we try to create and make all those aspects really easy.
So, um, you know, letting the right people know, putting the stuff in a room.
We have also like a UI where folks can go and see if there's an incident going on and
that doesn't require any manual input from anyone.
So, um, I think again, we're taking a lot of the cognitive burden
off of the responder. I mean, you can spin up the Jeli process if it's a true emergency
with just the click of a button. If you have a little bit more time, you can name the incident.
You can tell us what workflow to execute. You can tell us like if you want to integrate with Jira
in a certain part of it. But if it's a true emergency, you can just click the button and have it execute like a common process that you define or that you
let us define. And so again, helping the responder just focus on what they do best, which is
responding and fixing the problem rather than trying to let every party know what's going on.
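The one-click spin-up Nora describes could be sketched, in a very reduced form, like this. The client classes and function names here are illustrative assumptions standing in for real Slack and PagerDuty API calls, not Jeli's actual implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Illustrative stand-ins for the chat and paging clients a real tool would wrap.
@dataclass
class ChatClient:
    channels: List[str] = field(default_factory=list)
    messages: List[str] = field(default_factory=list)

    def create_channel(self, name: str) -> str:
        self.channels.append(name)
        return name

    def post(self, channel: str, text: str) -> None:
        self.messages.append(f"#{channel}: {text}")

@dataclass
class Pager:
    pages: List[str] = field(default_factory=list)

    def page_on_call(self, service: str) -> None:
        self.pages.append(service)

def declare_incident(chat: ChatClient, pager: Pager, service: str,
                     name: Optional[str] = None) -> str:
    """One-click incident spin-up: nothing required beyond the button press.

    Creates a channel, pages the on-call for the affected service, and posts
    a kickoff message, so the responder can focus on responding.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    channel = chat.create_channel(name or f"inc-{service}-{stamp}")
    pager.page_on_call(service)
    chat.post(channel, f"Incident declared for {service}. On-call has been paged.")
    return channel
```

The point of the sketch is the ordering: the human clicks once, and the coordination steps (channel, page, announcement) all happen without further input from them.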
We really help with that part. And then
once the incident is finished or closed, we automatically ingest it into the Jeli platform
where you can analyze the incident and you can get a sense of what happened. And one of the things I
really like about Jeli is our narrative builder. It walks you backwards through telling a story about
the incident because ultimately stories drive change.
You know, filled out templates that people don't actually want to do and meetings that people don't actually want to attend don't drive change.
But stories drive change.
Stories drive change about how something happened, how something unfolded.
And so we try to make it really easy to tell a story that people want to read, want to engage with. And so we actually added an AI component to our incidents
called Jeli Catch Up. And you can just run it in an incident channel and it will just ping you,
like it's not going to interrupt the whole channel. And as a CEO, I actually use this all
the time. Like if I'm on a call with someone I'm prospecting and I see that we potentially have an
incident going on, I can easily run that and see what's going on.
And we found that that is really useful for stakeholders and organizations. But it's also
really useful for responders when they're jumping into a long running incident. And then the AI
feature that we have on the post incident side allows you to generate a narrative. So if you're
having writer's block or you're stuck there, we can
analyze the data that you inputted and you can click a button and we give you sort of prompts
to start with like, Hey, it looked like Dan was first to enter this incident. And then Sean from
the search team came in and Sean was on call. Um, but he first checked, you know, this part of the
database. And so it gives you things to start with so that you can add to it to, again, make your life a little bit easier, remove some of that
coordinative and cognitive burden. Okay, fantastic. I mean, so I definitely want to
dive a little bit into some of the AI stuff there. But maybe before we get there,
you mentioned the acquisition from, you know, PagerDuty, and that happened or it was announced a few weeks ago.
And maybe starting with you, Dan, why, I guess, does that sort of merger between these two organizations make sense from the PagerDuty side?
And then, Nora, you can kind of speak to the Jeli side.
Yeah, absolutely.
So we are very excited about this acquisition
because we think it's a better together story.
And if you think about kind of the life cycle of an incident,
we have been admiring Jeli for a while
in terms of their learning from incidents philosophy,
where basically every incident is an opportunity to grow and learn, right? And if you think about
that being kind of the tail end of an incident lifecycle, we really wanted to create kind of an
all-in-one incident management solution that filled the entire end-to-end portfolio.
And PagerDuty is really known for kind of the first part of that lifecycle. And Jeli is,
you know, really well known for the second half. And so we think it's kind of a chocolate and peanut butter moment where these
things really come together in support of what our customers want. And already with the news,
we've had many customers reaching out on both sides of the equation, saying things like,
what took you so long? Or why didn't you do this earlier? In fact, 90% of Jeli's customers are already PagerDuty customers. And so we think that there's just a tremendous amount of synergy that exists not only with our products, but also with our people, the Dutonians. And we've really found a very common bond in terms of the culture
between PagerDuty and Jeli. So that's kind of the second element of this. And lastly is Nora herself.
We are big believers that part of our responsibility as a company in this space
is to be thought leaders and to be out in the industry talking. And this is something that
Nora just does very organically. And she's really built a reputation within the industry and with
customers of people that trust her. And so we are just thrilled not only that their products are
coming together, but that Nora and her team of Jeli beans are also joining PagerDuty as well.
Fantastic. Yeah. I mean, I like the peanut butter and chocolate moment.
It sounds like very sort of complementary ends of the spectrum in terms of incident management. And Nora, maybe I'll kick it over to you for you to sort of comment on how you see this alignment between the two companies.
Yeah, I mean, I'm really excited. It makes
it makes a lot of sense to me. I think you see a lot of acquisitions out there that you're kind
of trying to piece together from the outside. And I think this just makes sense to everyone
internally. But also, you know, folks I'm talking to externally, which is really
validating. And it feels really good, because I know we're certainly very excited about it. I
mean, like I mentioned, we've been integrated with PagerDuty since day zero. It was one of the
first integrations we've built before we had customers. And I'm really excited to get even
deeper with that integration. Like I think there's all sorts of possibilities we could do. And I'm
really excited for the industry to see some incident improvement from it as well. Great. Yes. So I want to talk a little bit about AI before we kind of wrap up, because one,
you know, like there's probably an obligation for every podcast in the universe to mention AI,
at least a handful of times at this point. But there's also, you know, you mentioned
Jeli Catch Up, an AI investment that Jeli made. But I also feel like
incident management seems like an excellent fit for leveraging AI to help transform something
that's maybe historically been somewhat of a reactive process to being more of a proactive
process and also probably certain efficiency gains you can get with managing these different
challenges. So what's the history of using AI in incident response?
And what are some of the sort of trends that you're seeing
in terms of leveraging some of the new things that are happening in Gen AI?
So I can start on our end.
So the first is we've been in the AI game for a while.
In fact, if you look back at the history of the PagerDuty Operations Cloud,
we've been doing things with machine learning
and AI for about 10 years.
In fact, we have an entire product called AI Operations
that is part of our portfolio.
And where we've historically leveraged
a lot of those techniques and capabilities
is really to, again, kind of help the humans focus
by finding signal in the noise.
That's really been kind of the main thrust of that investment,
because when you're listening to all of your services,
you're getting an incredible volume of alerts and events coming in.
And many of them are either duplicates or they're erroneous red herrings or
whatnot.
And so the ability to leverage,
you know,
AI and machine learning to basically say, Hey, we've seen an incident just like this before.
So we think it's legitimate and you should react to this as if, you know, you declared it manually.
Right. Or to say, hey, not only have we seen this incident before, but this is the team that solved it previously.
So you should, you know, go assign it to that team.
Or, for example, to say
like, hey, this entire event storm, we've seen something like this before. And you should ignore
it because it was a false positive last time. So really, that's been the main thrust of our
historical investment there. And that is hugely helpful to our customers with very low setup time.
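The "we've seen this before" matching Dan describes can be sketched, in a very reduced form, as fingerprinting incoming events, grouping duplicates, and looking up what happened last time a fingerprint fired. The fingerprint fields and history structure here are illustrative assumptions, not PagerDuty's actual implementation:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def fingerprint(event: dict) -> Tuple:
    # A crude signature: same service plus same error class counts as "the same" event.
    return (event.get("service"), event.get("error_class"))

def group_events(events: List[dict]) -> Dict[Tuple, List[dict]]:
    """Collapse a noisy event stream into one group per fingerprint, so
    responders see a handful of candidate incidents instead of hundreds of alerts."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for e in events:
        groups[fingerprint(e)].append(e)
    return dict(groups)

def triage(groups: Dict[Tuple, List[dict]], history: Dict[Tuple, dict]) -> List[dict]:
    """Annotate each group with what happened the last time this fingerprint fired:
    the team that resolved it, or a note that it was a false positive."""
    out = []
    for fp, evs in groups.items():
        prior = history.get(fp)
        out.append({
            "fingerprint": fp,
            "count": len(evs),
            "suggested_team": prior.get("team") if prior else None,
            "likely_false_positive": bool(prior and prior.get("false_positive")),
        })
    return out
```

Real systems use far richer signals (timing, topology, learned similarity), but the shape is the same: compress the event storm first, then attach prior outcomes so the human starts with context instead of raw noise.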
One of the key things that differentiates us in this way is the amount of time it takes
to configure this.
Some of our competitors offer solutions in this area that take months to set up and a
lot of consultants.
And ours is really much more of a kind of you turn it on, it starts working.
But it's built in a platform way such that you can customize it and reflect
the reality that, again, every customer is unique. And so you may want to make it your own
and not just use the default settings. So that's kind of the AI operations. But
with the onset of actual generative AI, which came on the scene this year, we've really been
pioneers in this area. And, you know, Nora mentioned
earlier how there's some just obvious use cases, like, hey, I just got pulled into this incident,
there's already been 10 people working on it for an hour, you know, like, catch me up, right? Like,
that's a perfect example of something that is just like immediately useful to a team. And we
think that there are many more of these. So on
the PagerDuty side, we've already released several capabilities in this area. And again,
they're in that kind of obvious category where we're trying to help the humans focus on the
unique parts by empowering them with things that are more reproducible. So for example, we ingest the entire, you know, Slack thread that is going on,
we know all of the event data coming in, we even have 700 integrations with every possible
tool you could want that are populating data into PagerDuty. And we use that centralized data set
in order to be able to then propose rational and credible status updates, as an
example. So when we prompt you, say 20 minutes into an incident, that you should send out a
status update, rather than prompting you with a blank screen, we can prompt you with a credible
status update. But back to the human in the middle concept that we talked about earlier,
we don't just send it because it might be wrong because AI does hallucinate from time to time, right? So you want to be able to give the human an
opportunity to edit versus author from scratch. That's a great example of one of the things we're
doing. We also see that in kind of the learning from incidents path that oftentimes after you've
determined that there's a frequent recurring issue that's
happening in your infrastructure, you want to add automation to prevent that from happening
in the future. And so again, rather than having to like author automation from scratch,
we can automatically generate a first pass at what that automation should be. And then it's
really just a matter of going in and tweaking it to be perfect before you kind of, you know, set it up into your automatically running set of scripts. So we think
those are some key elements that really help your operations team in general.
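The draft-then-edit flow Dan describes could be sketched like this: the system assembles a credible draft from data it already has, but a human must review and approve it before anything is published. The template and class names are illustrative assumptions; in the real products a language model would generate the draft text:

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    approved: bool = False  # flipped only by a human reviewer

def propose_status_update(minutes_elapsed: int, responders: int, summary: str) -> Draft:
    """Assemble a draft status update from data already in the incident record.
    A template stands in here for the generative model."""
    return Draft(
        text=(f"Update ({minutes_elapsed} min in): {responders} responders engaged. "
              f"Current understanding: {summary}")
    )

def send_update(draft: Draft, publish) -> None:
    """Human-in-the-loop gate: nothing goes out until a person approves the draft,
    because generated text can be wrong."""
    if not draft.approved:
        raise ValueError("Draft must be reviewed and approved by a human before sending.")
    publish(draft.text)
```

The design choice is that the AI fills the blank screen and the human retains the send button: editing a credible draft is faster than authoring from scratch, and the approval gate keeps a hallucinated update from ever reaching stakeholders.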
And then Nora, where do you see potential for leveraging AI and sort of helping with the human
component of incident management?
Yeah, I mean, I think generative AI
is really good at summarizing data and giving you an answer. And so I think that is,
you know, a double-edged sword, right? And so I think as humans, we have to be aware
of what decisions that it's making and be actually a really big part of those decisions. And so a lot of what
we've done with Jeli is not enable the generative AI to make decisions, but to instead summarize
data and inform the human. And so give the human information that would have otherwise maybe taken
them a long time to understand, deduce, or maybe they wouldn't have even looked into it. And so
I think of it as kind of shoulders for the human to stand on, which I think ultimately helps you resolve your incidents a little bit
better and understand them a little bit better because it's combing through this world of data
that is otherwise very hard for human eyes to comb through. So how do you think about keeping
humans really at the center of incident management? You
touched on a couple of things there, but essentially we want machines to help humans
do their jobs better rather than necessarily displace them. So how are you thinking big
picture about this, keeping the human in the loop, leveraging their expertise, going into the future
where there's so much sort of development that's happening in the
space on the gen AI side. I think it's really important to have the generative AI show its
work, right? And so you as the human are actually learning where it's getting its answers from.
And that is something that we've actually built into Jeli, where everything that you are putting
in your narrative timeline, regardless of whether you're using generative
AI or generating it on your own, is backed by supporting evidence. Like
we have you show your work. And so I think that is a big part of it. It will help the human
and the machine work together. I think in this new world we're sort of entering into, but that's
sort of the really important thing is that we're not replacing the human, we're supplementing them and we're giving them shoulders to stand on there.
And I think it's really important for us as the designers and the creators of these technologies
to really keep that in mind, which will really empower individuals and
organizations as well. Yeah, I would think sort of elements of explainable AI really play a critical piece here when
you're dealing with something like, how can you learn essentially from these incidents
that happen if you don't have an AI system that can explain itself when it's essentially
trying to help facilitate that process?
And Dan, did you have something to add to this?
Just that I think there's a lot of philosophy that also is part of this, right?
And philosophically, as the creators of these solutions, right,
you need to really care and respect about that human themselves and want to empower them, right?
And so I think there's this interesting analogy I heard a while back called centaur chess, okay?
So centaur chess works like this.
A human that plays a pure AI at chess, the human will lose to the AI 100 percent of the time. OK, but a human that is assisted with AI will beat a pure AI.
And that's called the centaur.
OK, and so we think philosophically that it really is the case that humans and machines together are the ideal solution.
And so what we're trying to do is recognize that humans will always be a key component of this whole industry.
How can we help them with all these incredible new tools that are coming on the market to make them have superpowers so that they can spend more time on innovation, so that they can
spend more time with their kids, so that they can, you know, spend their time being productive and
not trying to deal with all the toil and the minutiae that also corresponds in this area.
Yeah, I think that's a fantastic point. And I mean, this is a super rich area. I think we could
easily spend an hour just talking about this. So, but as we start to wrap up,
is there anything else that you'd like to share?
I'll start with you, Nora.
I think that we're kind of just getting started
with a lot of the stuff that we're doing
with generative AI and, you know,
really the next chapter of not only incident management,
but actually how it impacts
the broader bottom line of businesses.
And so I'm very excited for what the future holds there. Awesome. And Dan?
Yeah, just, you know, even though we've been at this at PagerDuty for 15 years, it still feels
like we're just getting started, right? And I think that with PagerDuty and Jeli together,
this really accelerates our ability to bring incredible end-to-end solutions to our customers.
And so we're really excited about this joint opportunity.
In fact, I saw a tweet last week that said, it goes together like PD and jelly.
And I thought that really kind of nailed the mark.
So I'm just really excited for Nora and her team to join and to really get started.
Awesome.
Well, Nora, Dan, thanks so much for being here.
And I'm really excited to see what's next for the,
well, I guess now one company,
but PagerDuty and Jeli together.
Cheers.
Thanks, Sean.
Awesome. Thanks, Sean.