Software Huddle - From Code Red to Green: Incident Management with Nora Jones of Jeli and Dan McCall from PagerDuty
Episode Date: November 21, 2023

Today's episode is all about Incident Management. We have two amazing guests, Nora Jones, founder and CEO of Jeli, and Dan McCall, the VP and GM of Incident Management at PagerDuty. There's of course a technical aspect to managing incidents that PagerDuty excels at and is very well known for, and there's also a human side: how do you learn from an incident so it doesn't happen again in the future? This is where Jeli steps in. In the episode, Nora and Dan talk through the evolution of incident management, the hard problems in the space, and a future that leverages AI with a human-in-the-loop component to scalably and proactively manage incidents and reduce outages. We also touch on the recent announcement that Jeli was acquired by PagerDuty.
Transcript
Incidents are really important to businesses because they rob your engineers from doing more productive things like building innovative software.
We think that the humans and machines together is actually the ideal situation.
First thing you want to do is you want to help the human by finding the signal in the noise.
And we use AI within the operations cloud in order to find that. So a lot of what we've done with Jeli is not enable the generative AI to make decisions,
but to instead summarize data and inform the human.
And so give the human information
that would have otherwise maybe taken them a long time
to understand, deduce,
or maybe they wouldn't have even looked into it.
Hey everyone, Sean Falconer from Software Huddle.
And today's episode is all about incident management. I have two amazing guests, Nora Jones, founder and CEO
of Jeli, and Dan McCall, the VP and GM of incident management at PagerDuty. There's, of course,
a technical aspect to managing incidents that PagerDuty excels at, very well known for. And
there's also a human side, like how do you learn from an incident so it doesn't happen again
in the future? And this is where Jeli steps in.
In the episode, Nora and Dan talk through the evolution of incident management, the hard problems in the space,
and a future that leverages AI with a human in the loop component to scalably and proactively manage incidents and reduce outages.
We also touch on the recent announcement that Jeli was acquired by PagerDuty.
You probably couldn't find two more informed people to talk about incident management than
these two. They made their careers in this space. And I think you're really going to love the
episode. So let me stop rambling and kick it over so you can hear from Nora and Dan. But just one
last thing. If you enjoy the episode, please subscribe to Software Huddle and follow all
our updates on Twitter and LinkedIn at Software Huddle.
All right, over to the episode.
Nora and Dan, welcome to Software Huddle.
Hello there.
Hi, Sean.
Hi. Yeah, thanks so much for being here.
I think, you know, we got a lot to get into today.
Incident management, exciting updates about Jeli that directly relate to both of you being here.
But before we get into all that,
let's start with some basics. Who are you and what do you do? Nora, why don't we start with you?
Yeah. Hi, I'm Nora. I'm the founder and CEO of Jeli.io, an incident management platform.
Awesome. And then over to you, Dan.
Awesome. Nice to meet you. I'm Dan McCall. So I'm the VP of product for incident response here at PagerDuty. I've been here just over two years
and I'm in the Bay Area. Fantastic. Yes. We were just covering that in the sort of get to know you
side before we started recording that. I'm also in the Bay Area. So I wanted to focus, start our
conversation anyway, focused on incident management. And, you know, an incident is typically some kind
of event that requires like a team's immediate attention. What are some of the categories or types of incidents that force
the team to kind of go into this like, you know, firefighting mode?
Yeah, so I think, you know, incidents are something that is really important to businesses,
because they rob your engineers from doing more productive things like building innovative software. And so there's a
wide spectrum that we see in the industry in terms of the different categories that they can fall in.
So you have your kind of normal everyday incidents that are, say, of a lower priority.
And these are pretty common. They happen pretty frequently, but they're not necessarily show stopping.
Then you have the ones where your main product is no longer working.
So if you imagine you're an e-commerce company and people can't check out, as an example, that is an emergency.
That's something that is both urgent and critical, that you need to really redirect your resources to solve that problem immediately.
And of course, we see these issues happening for software companies like an e-commerce company, but we also see them happening across a wide spectrum of our customers.
So, for example, we even have a customer that has physical merchandise that has to stay at a certain temperature.
And if the temperature of those refrigerators that the merchandise is in drops to a certain level,
that is also an emergency because it will spoil those goods, as an example. And one of the things
that we see is that sometimes those kind of critical incidents that are happening within
an organization are sometimes evidence of another problem, like a security problem,
for example.
And so we see, you know, whether it's software, whether it's physical goods, whether it's some sort of security issue,
there's a broad spectrum of what can go wrong within an organization.
And it's incumbent upon that organization to be ready when those things happen and to have pre-rehearsed what to do so that they can really minimize the
disruption that it creates because time is money. And you really want to get back up and running
for whichever category it is as quickly as possible. Fantastic. And then, Nora, do you have
anything to add to that? Yeah, I think Dan brought up a good point about there are so many different
circumstances every day that might take engineers or other folks from the organization away from
their normal work that they had planned to do that day. And I think one of the biggest things
with incidents is, yeah, defining is this thing an incident right now? Do we actually need to drop
what we're doing in order to resolve this right now? And a lot of times, you know, if folks are wondering and debating it,
the answer is yes, right? Like they're already taking time out of their day. They're already
thinking about it. And I think there is a whole, there are whole different categories of incidents,
right? Like there's incidents that require legal and require PR. There are incidents that require
a certain set of engineers or incidents that require customer service. And I think one thing that I see in healthy organizations
is that there is alignment across the business about KPIs, about what an incident is and what
it is not. So I think to go back to your original question, like what types of incidents do we see?
All kinds. And it really, truly depends on your business, like Dan said.
And how do companies like typically handle these incidents? And like, what are the processes and
tooling that's in place for doing that? It sounds like in the ideal scenario, there's some kind of
process or framework in place where people have alignment or agreement about what an incident is,
potentially different classes of incidents, as well as how severe an incident is.
But once they have some kind of framework in place,
how are they actually kind of dealing with this and triaging them?
Yeah, so I can start.
So we see, again, a spectrum.
And so we published something called the Digital Operations Maturity Model.
And it basically describes the different phases of maturity that different companies have, in order to kind of gauge their readiness in terms of how they would deal with this.
And it goes from manual to reactive to responsive to proactive and finally preventative.
That's kind of the
spectrum in and of itself. And on the manual side, we see that people kind of hero their processes,
right? And so manifestations of this look a lot like, oh, a major incident happens. Get everyone
on a conference bridge right now, right? Where literally 100 people will all join the same conference bridge simultaneously. And it's hugely disruptive, and hugely inefficient when folks do that.
And so as you see companies kind of progress through that, you know, maturity, you start to see
that they evolve, right, they evolve both their processes and their software. And so the PagerDuty Operations Cloud, which is the portfolio of products that we have, is really designed to help companies evolve in their maturity so that they can make sure that the right person at the right time is alerted when there is a problem, and that we prevent the kind of broad disruption that we see happening in less mature
companies. But you want to be very respectful of the humans.
We believe passionately in the value of humans and that we should respect them.
And so one of the things that we think really helps with that is layering in
automation alongside the humans together. We think that the humans and machines together is actually
the ideal situation. So for example, let's say that you're listening to your observability,
you're getting this giant alert stream. Your first thing you want to do is you want to help the human
by finding the signal in the noise. And we use AI within the operations cloud in order to find that,
right? So once you've reduced the noise and you found the signal, then you might want to wake up
a human at that point. But you might want to do some things, again, to respect that human first.
So for example, there may be some diagnostics that you want to run that may take a few minutes.
Give your human a few more minutes to sleep first.
Run that diagnostics.
And then only when you have the results back, then kind of alert the right person.
And when you alert them, make sure it's the right person, the person that's actually on call for that particular part of the software.
And so we see that there's a lot of ways that software can help mature a company.
And the operations cloud really helps companies along that maturity journey.
So we see that this can be really helpful.
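As a rough sketch, the flow Dan describes (suppress the noise, run diagnostics first, and only then page the right on-call) might look like the following. Everything here (`Alert`, `run_diagnostics`, `page_on_call`) is an illustrative placeholder, not a PagerDuty API:

```python
# Hypothetical sketch of a "respect the human" alert pipeline:
# drop non-actionable noise, gather diagnostics before waking anyone,
# then page the on-call with that context already attached.
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    message: str
    is_actionable: bool  # set upstream by noise reduction / grouping


def run_diagnostics(service: str) -> str:
    # Placeholder: in practice this might run health checks or scripts
    # that take a few minutes, buying the human more sleep.
    return f"diagnostics for {service}: checks complete"


def page_on_call(service: str, context: str) -> None:
    # Placeholder for paging whoever is on call for this service.
    print(f"PAGE on-call for {service}: {context}")


def handle_alert_stream(alerts: list[Alert]) -> int:
    """Process an alert stream; return how many pages were sent."""
    paged = 0
    for alert in alerts:
        if not alert.is_actionable:  # signal vs. noise: skip the noise
            continue
        context = run_diagnostics(alert.service)  # diagnose before paging
        page_on_call(alert.service, context)      # then alert the right person
        paged += 1
    return paged
```

The ordering is the point: diagnostics run before the page goes out, so the human who wakes up already has results in hand.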
And then, so you mentioned, you know, there's this whole spectrum, like with lots of, you know, tools and processes that companies might go through and they go through this sort of growth curve, but kind of taking a step back and looking at the evolution of incident management, like how has the space changed in the last, you know, 10 years or so?
Like, have we gotten better at sort of essentially managing this process, being more efficient about it?
How has this evolved in the last decade or so?
Yeah, I think that's a great question.
And I think PagerDuty actually started a lot of this back in 2011, right?
They gave the tech industry the ability to hold a pager and get alerted on things.
And we were keeping our businesses always up. Folks were
not expecting maintenance pages anymore. Like they were actually expecting us to respond and,
you know, care about the end usage of our products. And I think Dan is actually bringing
up good points. Like in order to care about the end usage of our products, we also have to care
about how folks coordinate during incidents as well. And so I think there has been a big
evolution in the tech industry from like an always-on, you know, I don't sleep, I'm chugging Red
Bulls all night to like, keep my systems up to like a more sustainable pace, and also more focused
on systemic expertise rather than individual expertise, which I think incidents actually have a really great way
to reveal and grow systemic expertise afterwards. And so what I mean by that is like,
you know, say you do page the person on call and they might need help and that's totally okay,
but that itself is data that you can use to grow the rest of the system that you can use to grow
the expertise of the other folks. And so I think a big way it's evolved is that we're now paying attention to coordination
and cognition during incidents. And those are two important pieces in order to
make incidents a little bit more normal for our organizations, but also more normal for
our customers. And when I say more normal, I mean,
better, like, I mean, you know, not as not as impactful, not as long, things like that.
Maybe over to you, Dan, in terms of what trends or technologies are sort of having the most impact
in terms of helping organizations with essentially scalably
and also efficiently responding to incidents
and learning from them?
Yeah, so I think it first starts with really knowing
that an incident is even happening in the first place.
And one of the ways that we've really been addressing this
is to think holistically about a company
in terms of how it interacts with its own customers. Okay. So despite all the tools
and all the software that exists in the world, you may be surprised to know that over half of the
time, the way that a company learns that a major incident is happening is because a customer has
called them, which is pretty surprising, right? And so when a customer calls you, how do you want to show up, right?
You want to show up in a coordinated and mature way.
But what we find is that many customer service organizations are actually disconnected
from their own engineering teams within their company.
And so when that customer calls, that customer service team is caught off guard
in a way that makes them feel kind of, you know, disconnected, right? And doesn't help them show up well to their customers. So part of our
operations cloud is that we have software specifically for the customer service teams
that shows up inside Salesforce, inside Zendesk, inside, you know, their implementation where they
live. And it gives them direct visibility
into what's happening live inside of their engineering and IT organizations, so that when
they receive that call, they can be on their front foot. And they can say, hey, thanks for calling,
we're aware of the situation. And it's going to be resolved in seven minutes, right, where they can
be much more direct and informed when that customer calls. So I think that's the first thing is just having your organization show up in a coordinated way. Also, one of the things that we have
discovered, especially with our enterprise customers, is that every customer is unique.
And although we talk about incidents and major incidents and security incidents, as if they're
this like static thing that is shared between all companies, companies are actually quite unique. And so they
have specific processes and policies inside of their own companies that are quite distinct from
one another. And so one of the things that we're trying to do is really build our software in a
platform way such that each company can tailor the solution to their unique needs. So, for example, during major incidents, let's say you have a SEV1 or SEV2 incident,
there are often a list of actions that you're supposed to take inside of a company.
And we find that, again, on that operational maturity model, on the left-hand side where you're more manual,
this literally might be in a physical binder somewhere.
You know, over to the right-hand side, we see that companies are a lot more digital about this. And
so one of the new capabilities we've released over the past year is what we call incident
workflows. And what this does is it allows a company to basically define in advance the actions
that should happen with different classifications of incidents.
So, for example, if you have a SEV1 incident in your EMEA data center, you might run this
runbook or incident workflow, and it could have 10 steps in it, okay?
But if you had a SEV1 incident and it's in your APAC region, for example, maybe it's
a different seven steps.
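The classification-keyed workflow idea Dan describes could be sketched like this. The step names, the `(severity, region)` key, and the drafted-not-sent status update are all illustrative assumptions, not PagerDuty's actual schema:

```python
# Illustrative sketch of incident workflows: actions defined in advance,
# keyed by incident classification. Note the human-in-the-loop step:
# a status update is drafted for a human, never auto-sent.
WORKFLOWS: dict[tuple[str, str], list[str]] = {
    ("SEV1", "EMEA"): [
        "create_slack_channel",      # automated: no human time wasted here
        "start_video_bridge",        # automated
        "page_incident_commander",
        "notify_customer_service",   # keeps support teams on their front foot
        "draft_status_update",       # human in the loop: drafted, not sent
    ],
    ("SEV1", "APAC"): [
        "create_slack_channel",
        "page_incident_commander",
        "draft_status_update",
    ],
}


def steps_for(severity: str, region: str) -> list[str]:
    """Look up the pre-defined runbook for this incident classification."""
    # Fall back to an empty workflow when no classification matches.
    return WORKFLOWS.get((severity, region), [])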
And what we find is that, you know, you need to, again, kind of respect your humans and help them do the things that they are not necessarily disproportionately competent at. So for example,
is it worthwhile disrupting a human's time from trying to solve a problem in order to spin up a
Zoom bridge or spin up a Slack channel?
No, those should just be automated. Those should just happen so that the human can focus on the
thing that the human is best at doing. And so what we found is that during a major incident,
you don't want automation to just run automatically if it requires a human intervention.
And so one of the things that we're passionate about is this concept of a human in the loop action. And the way that these work
is basically that you might remind a human to send a status update, but you shouldn't just
automatically write and send one, right? There should be a human touch that is involved in these
things. And so we really think that, you know, whether it's the intake of the incidents, whether it's during the incidents or whether it's learning from the incidents afterwards, there's been a lot of innovation that's happened between our two companies over the past few years that are really making this fundamental process that companies care about a lot more fluid and more mature. Yeah, I think you raise a number of good points there in terms of, I think,
it's much more comforting as a customer to, if you call
or interact with customer support, and they're already aware of the issue, it makes you
feel like, okay, well, this might be a problem, but at least they have it under control or they're aware of it
rather than that person potentially being completely blindsided. So a lot of it's about
sort of information sharing
and visibility across the organization,
especially as companies grow and scale,
that gets harder and harder to probably do
without essentially having like tools
and technology there to facilitate it.
And then the other thing I think
that was really interesting was around the uniqueness
of what an incident might mean to an organization and how you
actually deal with it. You know, if you're a B2C consumer application, like you mentioned,
the checkout process might be the, you know, the highest impact thing where if there's an incident,
like everybody needs to kind of like, you know, jump on it, the right people need to jump on it.
Whereas if you're, I don't know, a B2B API first product, maybe the definition of the most severe
incident is going to be different.
And the way that you essentially deal with that also needs to be different.
So whatever tools and technologies that you're using need to be able to be flexible, essentially, depending on how you define these types of things.
Once you've sort of figured out some of these baseline challenges, how do you actually measure or quantify whether what you're doing is effective and
actually continue to make, you know, improvements? How do you essentially know that the things that
you're doing today are better than what you were doing maybe six months ago?
Yeah. And I think it kind of goes back to how every organization is different. And so I think
finding your baseline and your normal is a really good thing to look at
and to see how you've improved over time. So there are all sorts of different metrics
that you can kind of index on, but without context, you're sort of over-indexing on them.
So you don't want to just look at incident length as one number. You want to look at:
How much time did we spend in the repair phase of the incident? How much time did we spend
diagnosing it? How much time did we spend detecting it? How long did it take us to get the
right people in the room? If there was an engineer kind of waffling for like 45 minutes
before they brought other people in, like all of that is data. And so ideally we over time get
better at coordinating and bringing the right people in, bringing, you know,
having better camaraderie, like during incidents, so that it is a more normal experience. And so
I think you can measure this on like, you know, do we still have to rely on this same person
every time for all these incidents? Like, and you asked earlier, Sean, like, how has the software
industry changed? I think there was a while ago a very like hero culture, right?
Like we have these few engineers that come in and save the day.
I mean, I worked at one organization where every time a certain engineer showed up in the Slack channel for the incident, people would react with the Batman emoji. And while that person was awesome,
I think it's better for our orgs when we're actually spreading out that
expertise and not relying on Batman every time, but instead creating many more Batmans. And so
I think that's actually something you can track too. And it is something that we do
in Jeli for you, you know, like how often are you relying on folks in certain types of incidents that are not
on call? That can tell you a little bit about an expertise gap in an organization. And by closing
that gap, you'll get better at incidents over time. And in fact, you might even see less of
them in certain areas of your business as well. Just to build on that, because I think everything
Nora said is absolutely right. And once
you get those things right, we're seeing something really interesting happen in the industry, which
is once you've kind of got these core metrics under control, you can then elevate your thinking
to business metrics. And we're starting to see, especially our enterprise customers, really
weigh in here because then you can start to measure how is this impacting our ability to grow revenue, right? How is this impacting our ability to
control our costs and be more efficient? How is this affecting our ability to mitigate risk?
And those are things that their executives care about. And so one of the things we've been really
effective at doing is getting the operations under control so that the people in charge can then elevate themselves
to a higher order of problems inside their organization,
which leads to things like promotion for that person, right?
Which requires those core business metrics to be improving as well.
And so I think that's a key part of that story.
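The phase-based measurement Nora describes (how long to detect, to diagnose, to repair, rather than one overall incident length) can be computed from an incident's timeline milestones. The milestone names here are assumptions for illustration:

```python
# Hedged sketch: break one incident's duration into detection,
# diagnosis, and repair phases from timeline timestamps, so you can
# see *where* the time went rather than just how long it took overall.
from datetime import datetime


def phase_durations(timeline: dict[str, str]) -> dict[str, float]:
    """timeline maps milestone name -> ISO-8601 timestamp for one incident."""
    t = {name: datetime.fromisoformat(ts) for name, ts in timeline.items()}
    return {
        "detect_minutes":   (t["detected"] - t["started"]).total_seconds() / 60,
        "diagnose_minutes": (t["diagnosed"] - t["detected"]).total_seconds() / 60,
        "repair_minutes":   (t["resolved"] - t["diagnosed"]).total_seconds() / 60,
    }


durations = phase_durations({
    "started":   "2023-11-01T10:00:00",
    "detected":  "2023-11-01T10:12:00",
    "diagnosed": "2023-11-01T10:40:00",
    "resolved":  "2023-11-01T10:47:00",
})
```

Tracked over time and with context attached, a shrinking diagnose phase is a much more telling signal than a shrinking total length alone.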
Yeah, I mean, I think that it's really important,
especially as organizations get big,
if you're looking for additional resourcing and funding into an area like incident management, you have to make a business case for it.
And things like revenue growth, revenue solves all problems in many ways, or essentially tying those to business impact and having tooling that helps you tell that story and sort of, you know, point to certain metrics that actually impact the business's bottom line, is really, really key.
You mentioned, I liked your story, Nora, around the Batman emoji.
You know, the problem there is that Batman is not very scalable.
So you need to figure that out.
And, you know, I think, you know, we've been talking about these trends and evolution that's happened in the past decade. I would think another thing too
is that the scale that organizations are running at, like the
number of things that we're running in the cloud today, the
sort of massive scale that companies can reach also impacts
essentially how sophisticated they need to be about
being able to respond
to these different types of incidents, and also the consequences of that.
If you are a company that only has a handful of customers and you have an outage
or something like that, that's not ideal, but if you have billions of users
or millions of users, not only the
impact to your business bottom line, but the visibility
of that is really bad if you have
like a major outage that could lead to sort of outside the company consequences in terms of like
negative PR or, you know, people losing trust with the platform. What are some of the kind of like
hard problems that you see both technically and maybe, you know, not maybe more even on the sort
of, I don't know, the social economic side of the business
in terms of responding and dealing with incidents?
Well, the first thing I would say is that you're totally right about trust.
And we, you know, PagerDuty, our customer base is more than two-thirds of the Fortune
100 and more than half of the Fortune 500. They trust us to be up on their worst day,
right? And so I think that that's key, is you need to be able to trust the platforms that you use,
because you're entrusting them with the trust of your own customers, right? I think in terms of,
you know, really adapting to scale and things of that nature, we have some of the largest customers in the world
using us. And what we see is that there is, again, a spectrum around the way that they organize
themselves. So some of our customers, for example, are still using a network operations center model
or kind of a centralized NOC setup. And then you have other customers that are full DevOps,
right, where you build it,
you own it. And what we're seeing is actually that some of these things are starting to come together
more into something we're labeling hybrid ops, where you have a bit of both in this world.
But I think that everyone is trying to figure out how do they incorporate AI? How do they
incorporate automation? How do they make the
best use of their humans so that they can focus them on the problems that are unique,
right? And they can add more automation around the things that are more common.
They're trying to find pockets where there is extreme cost. For example, a service that represents 25% of all your incidents.
Well, maybe you shouldn't use that service the way that you're using it anymore.
Maybe it's time for transforming that service into something that's more reliable.
So I think that they're trying to stitch together not only different models of operating, but different technologies to help them get a lot more efficient so that they can do this at scale throughout their organizations.
And Nora, what do you see as some of the hard problems that organizations are struggling with in the space now?
Yeah, and it all depends on kind of where they're at as a business.
And you brought up a good point earlier, Sean, like there's some companies with billions and billions of users and reputation really, really matters there. We also, we have companies all throughout the spectrum, right? We
have startups, we have Fortune 500 companies. We have companies that, you know, have billions of
dollars of revenue. And I think one thing that I find that's really interesting is when we come
into an org that is all of a sudden getting really popular
and it's like they were a startup and then overnight they are a very different organization.
And I've been in orgs like that and it's really exciting and it's really fun,
but it's also incredibly chaotic. And it's like, all of a sudden you've gotten popular very quickly.
Everyone's using your product and you haven't quite prepared for that scale yet. And why would you have? You know, it didn't make business sense until you've
actually gotten that scale. And that's where I see folks really need to get a handle of an incident
management program quickly. Like they might've had a culture where it's very autonomous, which
is great. Not a lot of process, but during incidents, you need a process. You need to be able to rely on your colleague communicating about the incident the same way you are communicating about
the incident. And that's where I see some of the hard problems is when folks are holding different
views of the situation, which is actually a very normal thing to happen. But that's where it's
really important to retrospect on that afterwards. I feel like when companies are in these pivotal growth points and it can happen at any
point during a company's journey, you know, it can happen as a startup suddenly gets really big. It
can happen when a company IPOs, it can happen when an IPO company opens up a new business unit or
anything like that. And those junctures are like super, super important to actually reflect
on the incidents that are happening then because they set the tone for your culture
and they can actually like really save you money in a lot of time if you are looking at them and
if you are looking at how everyone's viewpoints are differing from each other.
Yeah, I'm glad you brought that up in terms of sort of the post-event analysis and learning.
That was something, you know, during my time at Google, a big part of the engineering process
there is to run essentially like a blameless postmortem on any kind of incident that happens,
and it's officially documented and filed along with learnings and what changes are being made. What are you seeing in terms of
how companies tend to handle post-incident analysis
and learning and what are your recommendations around what
they should be doing, essentially? Yeah, I think there are very
different types of incidents and I think that a lot of orgs say,
okay, if it's a SEV1 incident and like, say,
you know, they define SEV1 as: it impacted our highest ACV customers, and, you know, folks
noticed. I worked at a company where it was a SEV1 if it hit Twitter. And that was a while ago, but,
you know, that was like an interesting aspect of it. And where I see a lot of companies sort of
get into trouble is they only retrospect incidents on the SEV1s.
But the SEV1s don't always account for what's anomalous in your organization. And so when I
say anomalous, I mean, like, you know, maybe usually you have six folks that work on an
incident, and it usually takes you a few minutes. And, you know, you usually don't like have to spend a lot of time figuring out what's
going on. And say that flips: say it wasn't a SEV1 incident, it was a SEV4, but you know,
40 people had to fix it. And it took, you know, maybe it took seven minutes, maybe it took less
than normal, but it took most of that time to actually figure out what was going on.
That is an example of an anomalous incident occurring, and it's actually a really good
example of something to retro on.
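One way to operationalize Nora's point, as a rough sketch: flag incidents for a retro by how far they deviate from the org's baseline (responder count, time to diagnose), independent of their severity label. The field names and the 3x threshold are assumptions for illustration:

```python
# Illustrative sketch: an incident is "anomalous" (and worth a retro)
# if it needed far more people than usual, or took far longer than
# usual to understand, regardless of its SEV level.
def is_anomalous(incident: dict, baseline: dict, factor: float = 3.0) -> bool:
    return (
        incident["responders"] > factor * baseline["typical_responders"]
        or incident["diagnose_minutes"] > factor * baseline["typical_diagnose_minutes"]
    )


# Hypothetical baseline: most incidents need ~6 people, ~5 minutes to diagnose.
baseline = {"typical_responders": 6, "typical_diagnose_minutes": 5}

# Nora's example: a SEV4 that pulled in 40 people still deviates from normal.
sev4 = {"severity": "SEV4", "responders": 40, "diagnose_minutes": 6}
```

A rule like this surfaces the SEV4-with-40-responders case that a SEV1-only retro policy would miss.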
And so I think there can be different levels of retro too, right?
There can be a level where maybe we don't have a lot of time to prepare, but we all
get in a room and we talk about it for 30 minutes and we share our experiences and we
collaborate about it and we figure out what we learned from it. Maybe there's a deeper level
where if it's really anomalous for the org, you are spending a lot of time preparing for it. You
have someone that didn't participate in the incident that is investigating it. So you can
get like a third party review and an unbiased review and interviewing all the people that
participated. I mean, that's a whole other level, but I think there's a lot in between those two
worlds that folks can do as well. And I would look at how anomalous the incident is for your org to
decide which level you want to do. And one of the things Jeli published early on is called the
How We Got Here guide, and you can find it at jeli.io/howie.
And it goes into all the ways that you can conduct a good blame-aware or blameless post-incident review.
And, you know, this doesn't mean, like, no names mentioned or no accountability.
It, in fact, you know, means the opposite.
You know, it creates a safe space for names to get mentioned. It creates a safe space for
accountability and, you know, creates a good environment for people to talk and share
what contributed to the incident and how it unfolded. Yeah. And I think creating a culture
where you essentially can be transparent about these things, even if you were like,
you know, it was, I don't know, your commit that led to like a bug that led to an outage or
something like that. Creating a culture where you're not in fear of the consequences of that
is probably really important as well, because otherwise people are going to hide the fact that
these things are happening versus, you know, essentially, you know, assembling the troops to
deal with it as quickly as possible. Absolutely. Yeah. And that's where I see folks get into trouble, too. It's like everyone could
find out who pushed the thing, and it's not like it was their fault, but it is important
to have that person and that thing, that action that got taken as part of the contributors to
the incident and be able to get that person's
experience. And it's hard. It is not an easy thing to do in an organization to just have that. You
have to build it and invest in it. Yes, absolutely. So you talked a little bit about Jeli there,
and I want to talk a little bit now about Jeli and also your founder journey, and maybe starting
out with you as a founder.
So I know you've worked in the past at great companies like Netflix and Slack. So I guess
my first question is, why become an entrepreneur? Why take on that headache? There's lots of ways,
I think, of getting some of the thrill of being in the startup world without necessarily
taking on the responsibility of having to fundraise and,
you know, hire everyone, maybe let people go and all these kind of hard decisions that come
along with being an entrepreneur. Yeah, absolutely. And I've had, you know, even more time to reflect
on a lot of that over the past couple weeks, you know, with the acquisition news.
And one of your questions earlier was how has incident management changed?
I have been in incidents and in reliability since I graduated college.
You know, it's been like my entire career so far.
And so I've, you know, in very different organizations, I've been a part of that sort of line of defense.
And I've always kind of seen and wanted a better way of doing
things that actually paid attention to the humans that were doing it. And like some of the things
that made the in the moment stuff hard. And when you ask about how things changed, I think about
an interview that I had maybe nine years ago, when I was interviewing for an incident commander role. And as part of the interview,
my interviewer had a stopwatch and they gave me a problem to solve. And then they gave me
all these people that I could talk to. And so, you know, there was customer service,
there were customers, there was a VP DMing me, there were like, there was a whole simulation,
like kind of set up. And I also had to guess how much time had gone by since my
last update. So the interviewer was holding a stopwatch and being like, how much time do you
think has passed? You know, has it been 10 minutes? Has it been five minutes?
And it, you know, reminds me of all these things we have to hold in our head as incident commanders,
all the things we have to hold in our head running an incident. And I think,
you know, that was a pivotal moment for me because I was like, we can help with a lot of this and we can create more of these people that are not just like able to really drop into these moments,
which these people exist and they are amazing. But like, we can spread out this expertise and
equip more people in the organization with this expertise.
And so I saw a big opportunity to really pay attention to a lot of the human aspects. We pay so much
attention to the technical side, to what's breaking, rather than to how things are breaking, what kind of fed into it, what contributed to it,
who was participating. And what I really noticed was when I was in certain organizations,
post-incident reviews went better and conversations went better when there was an artifact.
So rather than like me going to Dan and being like, this is my opinion on what happened, having data to show and orient
around makes it so it's us against the problem rather than like us against each other. And so
I wanted to create these artifacts for people. And so I started tinkering around with it and
getting ideas for it when I was working at Netflix. And then I worked at Slack for a little bit, and it was just, you know, I had to go do it.
And so it was, um, it was something I really wanted to create and I wanted to exist in the software industry.
It was tooling that I wish I could have used, um, in these roles.
And so that was a lot of what drove it. And I also wanted
to create an organization and a culture that I really wanted to work in. And like, yeah, Sean,
you mentioned like there's all sorts of aspects that are really hard. And I think one of the most
interesting things is like, I understand the system from a very different level than I did
before. You know, I was looking at the system as an SRE, as an incident commander, as an engineering leader, but now I have a view of
the system as a CEO, as someone that is like having to raise money, as someone that is, you know,
interacting with customers and managing an organization. And there's all sorts of different
sharp ends that contribute to an incident that exist all throughout the
organization. And so I think it's been like a really fascinating journey and rewarding journey
from a lot of aspects, but it's also been a rewarding journey from the aspect of how it's
changed and evolved my approach to incidents too. Yeah. So a big part of what you mentioned there was some of the hard problems that you encounter within an organization, and places where you feel like there's an opportunity for efficiency gains and better processes.
So how do you sort of manage this human component? How do you help people do their job better?
You know, I love the crazy interview that you did.
And that goes back to your point earlier about, like,
you have to be Batman in that moment, and, you know,
how many of those people exist? But can you build tools and technologies that allow, you know,
more people to essentially hit Batman level in terms of efficiency?
So how is Jeli essentially doing some of that stuff?
Like how are you essentially helping people with this human part of incident management? Yeah. And so we've been integrated with PagerDuty actually from day
zero. So I was like building some of these tools and looking at things manually before where I was
seeing, okay, you know, what is certain data during incidents? Like what is the people data?
Like how, how long has this person worked here? Have they been trained to do incident response?
Are they actually on call? Like how long did it take them to get into a channel? And so one of the things I
knew from the very beginning was that we wanted to hook into PagerDuty. You know, it kind of starts
with that alert and, you know, that on-call schedule. And I wanted to create
something that fed back and like improved it over time too. And so, um, you know, you get alerted and then,
then I think you decide, do I want to call an incident? And one of the things that we try to
do with our tool is help cut that part out, because honestly it can be a waste of time
if it is an incident and you're sort of waffling about for 30 minutes. And so we try to
make it really easy for you to just spin up a channel,
spin up a Zoom room, spin up whatever makes sense for your organization
and get the right people in the room.
And so we try to take that cognitive burden off of the human that noticed the thing.
Because it's like they're already noticing the thing.
It might be three in the morning.
Like there's probably a lot going on.
And so we try to create and make all those aspects really easy.
So, um, you know, letting the right people know, putting the stuff in a room.
We have also like a UI where folks can go and see if there's an incident going on and
that doesn't require any manual input from anyone.
So, um, I think again, we're taking a lot of the cognitive burden
off of the responder. I mean, you can spin up the Jeli process if it's a true emergency
with just the click of a button. If you have a little bit more time, you can name the incident.
You can tell us what workflow to execute. You can tell us like if you want to integrate with Jira
in a certain part of it. But if it's a true emergency, you can just click the button and have it execute like a common process that you define or that you
let us define. And so again, helping the responder just focus on what they do best, which is
responding and fixing the problem rather than trying to let every party know what's going on.
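The one-click spin-up Nora describes could be sketched, in a very reduced form, like this. The client classes and function names here are illustrative assumptions standing in for real Slack and PagerDuty API calls, not Jeli's actual implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Illustrative stand-ins for the chat and paging clients a real tool would wrap.
@dataclass
class ChatClient:
    channels: List[str] = field(default_factory=list)
    messages: List[str] = field(default_factory=list)

    def create_channel(self, name: str) -> str:
        self.channels.append(name)
        return name

    def post(self, channel: str, text: str) -> None:
        self.messages.append(f"#{channel}: {text}")

@dataclass
class Pager:
    pages: List[str] = field(default_factory=list)

    def page_on_call(self, service: str) -> None:
        self.pages.append(service)

def declare_incident(chat: ChatClient, pager: Pager, service: str,
                     name: Optional[str] = None) -> str:
    """One-click incident spin-up: nothing required beyond the button press.

    Creates a channel, pages the on-call for the affected service, and posts
    a kickoff message, so the responder can focus on responding.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
    channel = chat.create_channel(name or f"inc-{service}-{stamp}")
    pager.page_on_call(service)
    chat.post(channel, f"Incident declared for {service}. On-call has been paged.")
    return channel
```

The point of the sketch is the ordering: the human clicks once, and the coordination steps (channel, page, announcement) all happen without further input from them.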
We really help with that part. And then
once the incident is finished or closed, we automatically ingest it into the Jeli platform
where you can analyze the incident and you can get a sense of what happened. And one of the things I
really like about Jeli is our narrative builder. It walks you backwards through telling a story about
the incident because ultimately stories drive change.
You know, filled out templates that people don't actually want to do and meetings that people don't actually want to attend don't drive change.
But stories drive change.
Stories drive change about how something happened, how something unfolded.
And so we try to make it really easy to tell a story that people want to read, want to engage with. And so we actually added an AI component to our incidents
called Jeli Catch Up. And you can just run it in an incident channel and it will just ping you,
like it's not going to interrupt the whole channel. And as a CEO, I actually use this all
the time. Like if I'm on a call with someone I'm prospecting and I see that we potentially have an
incident going on, I can easily run that and see what's going on.
And we found that that is really useful for stakeholders and organizations. But it's also
really useful for responders when they're jumping into a long running incident. And then the AI
feature that we have on the post incident side allows you to generate a narrative. So if you're
having writer's block or you're stuck there, we can
analyze the data that you inputted and you can click a button and we give you sort of prompts
to start with like, Hey, it looked like Dan was first to enter this incident. And then Sean from
the search team came in and Sean was on call. Um, but he first checked, you know, this part of the
database. And so it gives you things to start with so that you can add to it to, again, make your life a little bit easier, remove some of that
coordinative and cognitive burden. Okay, fantastic. I mean, so I definitely want to
dive a little bit into some of the AI stuff there. But maybe before we get there,
you mentioned the acquisition from, you know, PagerDuty, and that happened or it was announced a few weeks ago.
And maybe starting with you, Dan, why, I guess, does that sort of merger between these two organizations make sense from the PagerDuty side?
And then, Nora, you can kind of speak to the Jeli side.
Yeah, absolutely.
So we are very excited about this acquisition
because we think it's a better together story.
And if you think about kind of the life cycle of an incident,
we have been admiring Jeli for a while
in terms of their learning from incidents philosophy,
where basically every incident is an opportunity to grow and learn, right? And if you think about
that being kind of the tail end of an incident lifecycle, we really wanted to create kind of an
all-in-one incident management solution that filled the entire end-to-end portfolio.
And PagerDuty is really known for kind of the first part of that lifecycle. And Jeli is,
you know, really well known for the second half. And so we think it's kind of a chocolate and peanut butter moment where these
things really come together in support of what our customers want. And already with the news,
we've had many customers reaching out on both sides of the equation, saying things like,
what took you so long? Or why didn't you do this earlier? In fact, 90% of Jeli's customers are already PagerDuty customers. And so we think that there's just a tremendous amount of synergy that exists not only with our products, but also with our people, the Dutonians. And we've really found a very common bond in terms of the culture
between PagerDuty and Jeli. So that's kind of the second element of this. And lastly is Nora herself.
We are big believers that part of our responsibility as a company in this space
is to be thought leaders and to be out in the industry talking. And this is something that
Nora just does very organically. And she's really built a reputation within the industry and with
customers of people that trust her. And so we are just thrilled not only that their products are
coming together, but that Nora and her team of Jeli beans are also joining PagerDuty as well.
Fantastic. Yeah. I mean, I like the peanut butter and chocolate moment.
It sounds like very sort of complementary ends of the spectrum in terms of incident management. And Nora, maybe I'll kick it over to you for you to sort of comment on how you see this alignment between the two companies.
Yeah, I mean, I'm really excited. It makes
it makes a lot of sense to me. I think you see a lot of acquisitions out there that you're kind
of trying to piece together from the outside. And I think this just makes sense to everyone
internally. But also, you know, folks I'm talking to externally, which is really
validating. And it feels really good, because I know we're certainly very excited about it. I
mean, like I mentioned, we've been integrated with PagerDuty since day zero. It was one of the
first integrations we've built before we had customers. And I'm really excited to get even
deeper with that integration. Like I think there's all sorts of possibilities we could do. And I'm
really excited for the industry to see some incident improvement from it as well. Great. Yes. So I want to talk a little bit about AI before we kind of wrap up, because one,
you know, like there's probably an obligation for every podcast in the universe to mention AI,
at least a handful of times at this point. But there's also, you know, you mentioned
Jeli Catch Up, an AI investment that Jeli made. But I also feel like
incident management seems like an excellent fit for leveraging AI to help transform something
that's maybe historically been somewhat of a reactive process to being more of a proactive
process and also probably certain efficiency gains you can get with managing these different
challenges. So what's the history of using AI in incident response?
And what are some of the sort of trends that you're seeing
in terms of leveraging some of the new things that are happening in Gen AI?
So I can start on our end.
So the first is we've been in the AI game for a while.
In fact, if you look back at the history of the PagerDuty Operations Cloud,
we've been doing things with machine learning
and AI for about 10 years.
In fact, we have an entire product called AI Operations
that is part of our portfolio.
And where we've historically leveraged
a lot of those techniques and capabilities
is really to, again, kind of help the humans focus
by finding signal in the noise.
That's really been kind of the main thrust of that investment,
because when you're listening to all of your services,
you're getting an incredible volume of alerts and events coming in.
And many of them are either duplicates or they're erroneous red herrings or
whatnot.
And so the ability to leverage,
you know,
AI and machine learning to basically say, Hey, we've seen an incident just like this before.
So we think it's legitimate and you should react to this as if, you know, you declared it manually.
Right. Or to say, hey, not only have we seen this incident before, but this is the team that solved it previously.
So you should, you know, go assign it to that team.
Or, for example, to say
like, hey, this entire event storm, we've seen something like this before. And you should ignore
it because it was a false positive last time. So really, that's been the main thrust of our
historical investment there. And that is hugely helpful to our customers with very low setup time.
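The "we've seen this before" matching Dan describes can be sketched, in a very reduced form, as fingerprinting incoming events, grouping duplicates, and looking up what happened last time a fingerprint fired. The fingerprint fields and history structure here are illustrative assumptions, not PagerDuty's actual implementation:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def fingerprint(event: dict) -> Tuple:
    # A crude signature: same service plus same error class counts as "the same" event.
    return (event.get("service"), event.get("error_class"))

def group_events(events: List[dict]) -> Dict[Tuple, List[dict]]:
    """Collapse a noisy event stream into one group per fingerprint, so
    responders see a handful of candidate incidents instead of hundreds of alerts."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for e in events:
        groups[fingerprint(e)].append(e)
    return dict(groups)

def triage(groups: Dict[Tuple, List[dict]], history: Dict[Tuple, dict]) -> List[dict]:
    """Annotate each group with what happened the last time this fingerprint fired:
    the team that resolved it, or a note that it was a false positive."""
    out = []
    for fp, evs in groups.items():
        prior = history.get(fp)
        out.append({
            "fingerprint": fp,
            "count": len(evs),
            "suggested_team": prior.get("team") if prior else None,
            "likely_false_positive": bool(prior and prior.get("false_positive")),
        })
    return out
```

Real systems use far richer signals (timing, topology, learned similarity), but the shape is the same: compress the event storm first, then attach prior outcomes so the human starts with context instead of raw noise.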
One of the key things that differentiates us in this way is the amount of time it takes
to configure this.
Some of our competitors offer solutions in this area that take months to set up and a
lot of consultants.
And ours is really much more of a kind of you turn it on, it starts working.
But it's built in a platform way such that you can customize it and reflect
the reality that, again, every customer is unique. And so you may want to make it your own
and not just use the default settings. So that's kind of the AI operations. But
with the onset of actual generative AI, which came on the scene this year, we've really been
pioneers in this area. And, you know, Nora mentioned
earlier how there's some just obvious use cases, like, hey, I just got pulled into this incident,
there's already been 10 people working on it for an hour, you know, like, catch me up, right? Like,
that's a perfect example of something that is just like immediately useful to a team. And we
think that there are many more of these. So on
the PagerDuty side, we've already released several capabilities in this area. And again,
they're in that kind of obvious category where we're trying to help the humans focus on the
unique parts by empowering them with things that are more reproducible. So for example, we ingest the entire, you know, Slack thread that is going on,
we know all of the event data coming in, we even have 700 integrations with every possible
tool you could want that are populating data into PagerDuty. And we use that centralized data set
in order to be able to then propose rational and credible status updates, as an
example. So when we prompt you, say 20 minutes into an incident, that you should send out a
status update, rather than prompting you with a blank screen, we can prompt you with a credible
status update. But back to the human in the middle concept that we talked about earlier,
we don't just send it because it might be wrong because AI does hallucinate from time to time, right? So you want to be able to give the human an
opportunity to edit versus author from scratch. That's a great example of one of the things we're
doing. We also see that in kind of the learning from incidents path that oftentimes after you've
determined that there's a frequent recurring issue that's
happening in your infrastructure, you want to add automation to prevent that from happening
in the future. And so again, rather than having to like author automation from scratch,
we can automatically generate a first pass at what that automation should be. And then it's
really just a matter of going in and tweaking it to be perfect before you kind of, you know, set it up into your automatically running set of scripts. So we think
those are some key elements that really help your operations team in general.
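The draft-then-edit flow Dan describes could be sketched like this: the system assembles a credible draft from data it already has, but a human must review and approve it before anything is published. The template and class names are illustrative assumptions; in the real products a language model would generate the draft text:

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    approved: bool = False  # flipped only by a human reviewer

def propose_status_update(minutes_elapsed: int, responders: int, summary: str) -> Draft:
    """Assemble a draft status update from data already in the incident record.
    A template stands in here for the generative model."""
    return Draft(
        text=(f"Update ({minutes_elapsed} min in): {responders} responders engaged. "
              f"Current understanding: {summary}")
    )

def send_update(draft: Draft, publish) -> None:
    """Human-in-the-loop gate: nothing goes out until a person approves the draft,
    because generated text can be wrong."""
    if not draft.approved:
        raise ValueError("Draft must be reviewed and approved by a human before sending.")
    publish(draft.text)
```

The design choice is that the AI fills the blank screen and the human retains the send button: editing a credible draft is faster than authoring from scratch, and the approval gate keeps a hallucinated update from ever reaching stakeholders.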
And then Nora, where do you see potential for leveraging AI and sort of helping with the human
component of incident management?
Yeah, I mean, I think generative AI
is really good at summarizing data and giving you an answer. And so I think that is,
you know, a double-edged sword, right? And so I think as humans, we have to be aware
of what decisions that it's making and be actually a really big part of those decisions. And so a lot of what
we've done with Jeli is not enable the generative AI to make decisions, but to instead summarize
data and inform the human. And so give the human information that would have otherwise maybe taken
them a long time to understand, deduce, or maybe they wouldn't have even looked into it. And so
I think of it as kind of shoulders for the human to stand on, which I think ultimately helps you resolve your incidents a little bit
better and understand them a little bit better because it's combing through this world of data
that is otherwise very hard for human eyes to comb through. So how do you think about keeping
humans really at the center of incident management? You
touched on a couple of things there, but essentially we want machines to help humans
do their jobs better rather than necessarily displace them. So how are you thinking big
picture about this, keeping the human in the loop, leveraging their expertise, going into the future
where there's so much sort of development that's happening in the
space on the gen AI side. I think it's really important to have the generative AI show its
work, right? And so you as the human are actually learning where it's getting its answers from.
And that is something that we've actually built into Jeli, where everything that you are putting
in your narrative timeline, regardless of whether you're using generative
AI or generating it on your own, is backed by supporting evidence. Like
we have you show your work. And so I think that is a big part of it. It will help the human
and the machine work together. I think in this new world we're sort of entering into, but that's
sort of the really important thing is that we're not replacing the human, we're supplementing them and we're giving them shoulders to stand on there.
And I think it's really important for us as the designers and the creators of these technologies
to really keep that in mind, which will really empower individuals and
organizations as well. Yeah, I would think sort of elements of explainable AI really play a critical piece here when
you're dealing with something like, how can you learn essentially from these incidents
that happen if you don't have an AI system that can explain itself when it's essentially
trying to help facilitate that process?
And Dan, did you have something to add to this?
Just that I think there's a lot of philosophy that also is part of this, right?
And philosophically, as the creators of these solutions, right,
you need to really care and respect about that human themselves and want to empower them, right?
And so I think there's this interesting analogy I heard a while back called centaur chess, okay?
So centaur chess works like this.
A human that plays a pure AI at chess, the human will lose to the AI 100 percent of the time. OK, but a human that is assisted with AI will beat a pure AI.
And that's called the centaur.
OK, and so we think philosophically that it really is the case that humans and machines together are the ideal solution.
And so what we're trying to do is recognize that humans will always be a key component of this whole industry.
How can we help them with all these incredible new tools that are coming on the market to make them have superpowers so that they can spend more time on innovation, so that they can
spend more time with their kids, so that they can, you know, spend their time being productive and
not trying to deal with all the toil and the minutiae that also corresponds in this area.
Yeah, I think that's a fantastic point. And I mean, this is a super rich area. I think we could
easily spend an hour just talking about this. So, but as we start to wrap up,
is there anything else that you'd like to share?
I'll start with you, Nora.
I think that we're kind of just getting started
with a lot of the stuff that we're doing
with generative AI and, you know,
really the next chapter of not only incident management,
but actually how it impacts
the broader bottom line of businesses.
And so I'm very excited for what the future holds there. Awesome. And Dan?
Yeah, just, you know, even though we've been at this at PagerDuty for 15 years, it still feels
like we're just getting started, right? And I think that with PagerDuty and Jeli together,
this really accelerates our ability to bring incredible end-to-end solutions to our customers.
And so we're really excited about this joint opportunity.
In fact, I saw a tweet last week that said, it goes together like PD and jelly.
And I thought that really kind of nailed the mark.
So I'm just really excited for Nora and her team to join and to really get started.
Awesome.
Well, Nora, Dan, thanks so much for being here.
And I'm really excited to see what's next for the,
well, I guess now one company,
but PagerDuty and Jeli together.
Cheers.
Thanks, Sean.
Awesome. Thanks, Sean.