Software at Scale 17 - John Egan: CEO, Kintaba

Episode Date: April 20, 2021

John Egan is the CEO and Co-Founder of Kintaba, an incident management platform. He was the co-creator of Workplace by Facebook, and previously built Caffeinated Mind, a file transfer company, which was acquired by Facebook. In this episode, our focus is on incident management tools and culture. We discuss learnings about incident management through John's personal experiences at his startup and at Facebook, and his observations through customers of Kintaba. We explore the stage at which a company might be interested in having an incident response tool, the surprising adoption of such tools outside of engineering teams, the benefits of enforcing cultural norms via tools, and whether such internal tools should lean towards being opinionated or flexible. We also discuss postmortem culture and how the software industry moves forward by learning through transparency of failures.

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining here with me today is John Egan, who is the CEO and co-founder of Kintaba, an incident management and reporting company. And previously, you were the co-founder of Workplace by Facebook, which is the best way I can think of describing it to a lot of people is Yammer, but like Facebook's equivalent. It's like a workplace social media tool. Is that roughly accurate?
Starting point is 00:00:41 Yeah, that's right. It's Facebook's enterprise offering. That was a pretty exciting tool to get to work on at Facebook before leaving and starting Kintaba. Cool. Thanks for joining me here today. Great. Thanks for having me. Yeah. So I want to start off with, first of all, let's talk about Kintaba, like what it is. And I'm really curious about the origin name. Like where did you get the idea for naming your company Kintaba? Because it's a pretty unique name. Yeah.
Starting point is 00:01:11 So Kintaba is actually a derivative name of the word Kintsugi, which is a Japanese art form where you will take broken pottery and you'll reassemble it by using golden inlay in the cracks. So you end up with this really interesting piece of pottery that will have gold in these interesting shapes and designs. And what's fascinating about it is as an art form, the reconstructed version of that pottery then becomes more valuable than the original. So you might've had a somewhat valueless piece, right? That has then been reconstructed and now it's more unique, it's more resilient, and you can see its scars. And we thought that was a really amazing metaphor to take back towards
Starting point is 00:01:43 building a resiliency company, a company that's trying to encourage people to do more about talking about recording, identifying, and putting process behind outages and incidents. And really highlights this idea of the scars within a company really aren't something you should be hiding. That's part of your company's history and it's part of the value of the company going forward. So it's a derivative of that. So if you take Kintsugi and you convert it over to Kentaba, there's a little bit more of a repetitive end to it because this is something you're consistently doing as the ABA and out in, that you're consistently applying into your organization as opposed to something that happens once. And so that's the history of it. It also rolls off the tongue nicely and was an
Starting point is 00:02:24 available domain. So all of the positives of getting a company off the ground. But yeah, we've always liked that history. And we introduced that a lot when we talked to customers as well as just in the general storytelling. I think it's a great way to approach both Kentaba and the resiliency space in general. Yeah. It reminds me of this concept of like anti-fragility of what's the name? Neil Nicholas Taleb or something like that. Nassim Nicholas Taleb. Yeah. The idea that there are some things that get stronger if you hurt them.
Starting point is 00:02:51 Yeah. And I think the openness of incidents and sort of their causes and effects has become a bit of a movement on its own, right? Over the last five, 10 years, it's always been something in technical circles. We've always wanted to know what happens. But these days, when there's a major outage on Amazon or Cloudflare or any of these big internet buoying companies, that's the conversation that comes up on like Hacker News and in Reddit, right? That people want to talk about is they want to know what happened, what went wrong. And I think part of that is because we all share the same
Starting point is 00:03:22 infrastructure now, right? In the past, we were running our own servers and we had pretty custom situations. These days, you're probably on a cloud and you're probably on the same cloud that most of the big guys are on. And so that has also pushed this understanding of not just being anti-fragile as an organization, but being open about the pieces that are still fragile. Because if you haven't found them yet, it's likely other people haven't as well. So I think that publicness, openness about the scars, wounds, failures is the evolution maybe of anti-fragile. And I think that's pretty exciting. Yeah. There's this podcast, which I just discovered, it's called The Downtime Project. I think it's like a brand new podcast, two episodes in talking about outages of various
Starting point is 00:04:04 companies. So I feel like you and your crew would be interested in just listening to those. That sounds really cool. Yeah, you'll have to send me that link. I think, again, as an organization, if you're practicing resilience, I think, and incident management in general,
Starting point is 00:04:20 you almost start to become a consumer of outages, whether they happen inside your own organization or elsewhere. I think the coolest example of this is NASA. NASA actually has a webpage out there where they collect incidents from other companies. And then they try to apply those internally to NASA because they don't have enough internal incidents. Their resiliency bar is so high, but their actual activity is quite low in comparison, that they're almost out there feeding on other people's outages because then they can go and learn. They can say, okay, here's a major industrial accident that happened in Ohio, but it turns out there's some learnings here. We need to
Starting point is 00:04:54 apply to the way that we prep for space launch. I think like all of these, that podcast is a good example. It's propagation of that because once you get into this consistent mode of lowering your barrier to filing incidents and realizing that the more of these things you're filing, actually the more resilient you become, you almost become greedy to the idea of incidents. And you want to see everyone else's and you want to gather as many as you can because it's that last step, right? It's that learning that's valuable and you can't get there without having outages. So I think there's almost like an economy
Starting point is 00:05:25 of incident reports. And I think that podcast sounds representative of that, right? Because there's more desire, because you're learning from those makes you a better practitioner. I feel like we'll see more of that. I think we'll see more websites that collect this kind of information, podcasts that talk about it, deep analysis from technically inclined folks. It'll be fantastic. That is super interesting, especially about the NASA website. I think I'll have to check that out. So you'll have to send me a link after this.
Starting point is 00:05:52 Yeah, I will. You can put it under the podcast. I don't have the, it's a long web address. I wish it was just like nasaincidentmanagement.com, but there's a long web address and they just collect them from around industry. Maybe we can park that domain name and you never know 10 years later. Before this goes live, we'll make sure to register it and get it sold back to NASA.
Starting point is 00:06:11 So can you tell me a little bit about what Kentaba does? Yeah. It sounds like there's incident management, but yeah, can you just elaborate? Yeah. So Kentaba is an incident management platform. It's really designed to make it super easy for a company of any size to come in and start implementing the basic best practices of incident management. And historically, a lot of this was only practiced by, or formally practiced at least by the big guys, right? Like by the Facebooks, the Googles, the Netflixes of the world, where once you got to great scale, you started to implement these types of practices. And what's happened is all of those engineers have started going off to other
Starting point is 00:06:50 companies. And there's become this sort of revolution of adoption of the core practices of incident response across companies that are startups, 5, 10 people, 30 people, 50 people, long before they get to that great scale. because it turns out it's valuable at every stage from a practice process. So with Contaba, you can come in and get that core set of features that gets you up and running from a process standpoint. So this is about declaring publicly when you have incidents publicly within your company, publicly about incidents that are happening, bringing together your response team in an automated way, and then letting that response team collaborate together and mitigate the issue and track all of those actions. And then finally, actually writing that postmortem, the learning document,
Starting point is 00:07:34 distributing that throughout the company, and having a reflection meeting occur. All of these little things are all handled inside of one platform. And there's a lot of depth, right? Each piece can do multiple things. You can call out to webhooks, you can call automations, you can automatically add people. But really, it's that core process that companies are adopting and moving from what I'd almost call a state of chaos when there's an outage, right? If you ask a five-person startup on their first week, what do you do when the site goes down? A lot of them will say, we just run to Slack and panic. It's really just about wrapping those four or five steps into an easy to implement process and then rolling that out in a way that your company can adopt.
Starting point is 00:08:14 And so historically to do this, you would either have to go write down a pretty long product process inside of Notion or whatever your wiki is internally and distribute it. Or you'd have to go ladder up to a pretty expensive system. You'd have to go out to a service now. Maybe you'd have to pay for that extra high tier and pager duty to get postmortem writing. You'd have to work pretty hard to actually get it implemented. So our goal is low barrier to entry, easy to implement incident management for everyone. And that's really the ethos of the product.
Starting point is 00:08:42 That makes sense. I'm sure the question that people will be asking you is that how is Kintapa better than just using Google Docs and Slack and Zoom all together? Yeah, it's like you're tying all these things up that are slightly different. And I can already see some downsides of trying to coordinate across three. But yeah, what's your take on that? Most companies we talked to initially are doing just that. They're stitching tools together. And these are in the remote world, we're especially looking at like the slacks and everyone piling into slack channels. In the old world, it was everyone would go run around to
Starting point is 00:09:14 whoever's trying to solve the problems computer and start like talking. And the reality is if you're trying to build a process across a bunch of different tools, you're much better off generally in building a tool that's purpose-built for what you're actually trying to solve. And it won't always be the most apparent advantage the minute you put it in. It's more of a long-term advantage, right? It's like, how are we doing this in a consistent fashion that everyone knows what to expect when things are going wrong? And it scales as the company scales, right? When you're saying, okay, we write our postmortems over here in Google Docs, we sometimes create Slack channels and they're named in different ways. We try to announce it in this channel whenever it goes out, like everyone remember to send the email out to Phil
Starting point is 00:09:53 when this goes wrong, because Phil really cares, right? You start in a world where you look at something like this and say, oh, that's easy to script. And then you rapidly realize that the depth of each step needs to be consistent and recorded somewhere. And really quickly, it just becomes easier to have another tool. And a lot of what Kentaba does is coordinate those existing tools for you. We operate inside of Slack. A lot of our companies do most of their incident response inside of Slack, right? But then you have the Kentaba UI that you can go to when you're trying to reference
Starting point is 00:10:19 back to other incidents and you're trying to search across all of them and see the reference of what incidents were tagged, what way and happened over this time period, et cetera, without having to leave all of your Slack channels open. And suddenly you have every single incident that's ever happened, right? A separate Slack channel that never closes in your history books. So it's not really something where you can't go respond to an incident without an incident management process. You're just going to be a lot happier if you have one in place and you have a tool there. And the goal during incident response is you don't really want to care about that overhead and that administration. That's the last thing you want to do is argue
Starting point is 00:10:52 about where are we dealing with this fire? What you want to do is deal with it because if it's gotten to the point of being an incident, you've already exhausted a lot of your automated systems for dealing with the problem. You're probably at the point where you need human expertise. This problem's never happened before. And a lot of it is really that investigation. You're firing up your observability tools. You're firing up your log chasers. You're doing all of that work. So having to do any kind of administrative overhead, even as simple as saying, I'm going to add this to a Google doc that logs all of our active incidents, it's just too much and it slows you down and minutes matter. Okay.
Starting point is 00:11:29 So to dig a little deeper, so Kentaba will let you do things like assign like an incident manager, assign like maybe a technical manager or something. Also maybe create like tasks or like Jira tasks or something like that. Because I remember at Dropbox, we had this process of, we had this internal tool, like completely like custom built back in in 2011 or something like that. It would create a Slack channel, it would create a task and a fabricator task, if you're familiar with... You should be familiar with that. Yeah, we use fabricator internally here.
Starting point is 00:11:55 We're still on it. Yeah. And now it's moved to creating a Jira task and all that. So Kentaba helps with all of those things. Is that roughly accurate? Yeah, yeah, that's right. So I haven't seen the Dropbox tool, but I imagine Kintaba would feel similar in that in the creation of an incident, a lot of things are happening automatically. It's adding the right people based on the tags you've assigned to the incident. It's pulling in the
Starting point is 00:12:19 appropriate, if you're running an incident commander or an iMock, it's grabbing that individual. It's allowing people to subscribe who just want to stay up to date. Some people will have their notification settings set to tell them every time an incident happens, if it's a certain severity or above, all of these things are happening automatically. So that human action, which is I'm declaring the incident as one piece, and then Kentaba in the background is doing all of that other work for you. In terms of connectivity into something like JIRA, we have those types of integrations. So everything from when you create an incident, maybe a master task is built in Jira, all the way through to a few follow-up tasks that are discovered during the incident. You can actually file those directly in Contaba and they get fired off to Jira
Starting point is 00:12:54 as sub to that master task. So you can go and reference back all of the follow-up tasks, both there and in the postmortem that you eventually are so dead on in terms of... And I imagine that tool at Dropbox was similar to the tool we had at Facebook is similar to the tool that they use at Google. And 2011, 2012 was really when a lot of this stuff was developed. That's when a lot of those companies were implementing tools as they started to realize that the ad hoc method of trying to deal with these problems was pretty rough as you scaled your companies up. I'm not sure how big Dropbox was at the time. Were you in the hundreds of employees at the time, maybe tens? Probably like no idea, but probably in the hundreds. Cause I remember like they had their series A or series B in October, 2011. And that's when everything changed and like hyper
Starting point is 00:13:33 growth and all that stuff. And that's when the processes start to really matter, right? The saying is, if you want to implement a process, you implement a tool, right? If you just go write it down, maybe someone's going to read it if you're lucky, but usually you come back to that process after the fact when it's written down, as opposed to during the situation. So I think it might've been the Robinhood CTO who had a really interesting talk at one point about you must have a tool to implement process, especially when your company is scaling, because that's the only way you're going to enforce the process in a way that doesn't require people to go read a document. that they're not going to read when the process needs to be implemented. That makes a lot
Starting point is 00:14:08 of sense. And you can do some things culturally, like you can make sure that your postmortems are blameless and that can be like a cultural thing. But in order to make sure everybody follows the same steps when like an incident happens, it's easier to just automate it as much as possible. It's cool. Like the cultural shift happening in a lot of companies, when we started Contavo, we really thought we were mostly going to be coming into companies and they were already going to have these processes in place. And they were just looking for a tool to replace what you were describing for a set of stitched together other tools. And the reality is the cultural shift that's happening is really much more basic than that.
Starting point is 00:14:42 It's much closer just to the idea of we should probably file these things at all. We should probably be recording these things. There's a gap, I think, between task management systems where you can file things all the way up to their highest priority. And then you're sort of completely panicked moments of the entire site is down and everything's broken. And it turns out the space in between those two things is actually really where incident management is powerful, is happening in your SEV2s, your SEV3s, where you want to lower that barrier in general. File more of these things that you're doing the diligence of going through the process of reflecting on how did we lose? Maybe it wasn't the entirety of the customer base accessing the site, but maybe it was
Starting point is 00:15:19 10% and that's big enough for a SEV3. And the process, it turns out, is really valuable in that gap. So a lot of the time what we find ourselves talking to companies about is that core statement, which is you just should be filing more of these things. You're probably having a larger number of incidents than you realize. And if you start to file them, the advantage you're going to get is you're going to have fewer SEV1s. You're going to catch more SEV2s, SEV3s. Netflix, we call these near misses. The airline industry has been great at this forever, right? Capture absolutely everything that goes even remotely wrong.
Starting point is 00:15:49 And your goal there, right, in the airline industry is to keep planes in the air and from crashing. But in software is keep the major outages at bay. And I think providing the tool there for that also helps compared to just a written down process from a cultural standpoint of saying, oh yeah, of course we file these things and stuff that we have a tool for it. Why wouldn't I go do it? And I actually think it's pretty similar to the revolution that task management went through a little bit longer ago, right? Maybe even 20 years ago, when we moved task management pretty
Starting point is 00:16:15 aggressively away from the project management approach, which is this very formal top down, here's exactly how it's going to look. And you're going to just be a recipient of that task list into more of a distributed, everyone should be creating these things. More tasks isn't bad. This is tracking the work you have to do. And we're being more honest about what has to happen. And this kind of merging came together, right? Of like the to-do lists that everyone was already keeping notepad and otherwise, and the formal project management. And in the middle there, you actually have operational task work, which is what every startup and every tech company really operates on now. I think incident management is going through is at the very early stages of the same thing, where we're moving from this world of like, incidents are the scariest thing at your company,
Starting point is 00:16:56 this terrifying thing that no one wants to deal with, to it's part of the day-to-day. And if we treat it that way, then it's less likely we're going to hit these horrible emergencies and they're going to happen less often. That is really interesting. And even my first instinct was that you probably want to introduce a tool like Kentaba, like when you're a company that's big enough,
Starting point is 00:17:17 your post-product market fit. And it seems like that's what you were thinking initially as well, but it seems like even tiny startups are adopting tools like Intaba. And how would you say that? Where is the demand coming from? Because my instinct is if I'm like three founders on a couch, it seems like a high overhead too. Startups specifically, particularly, especially startups with live products,
Starting point is 00:17:41 even if you're three people, you're really operating in a real-time fashion at this point. You have a lot of work that's happening in real time and you're probably dealing with it in Slack. You're probably pinging each other and jumping into chat rooms. Maybe in the pre-pandemic days, you were all in an office on the couch together. And I think even at that stage, having a method to record and track these things has a degree of value. Ketava is free for fewer than five people, primarily because we're just trying to get you in the motions at that point. Does a two-person company really have to have this on day one?
Starting point is 00:18:11 Probably not. You can probably get away with not having it. But honestly, when you get to four or five, it starts to become valuable and you start to force yourself through that process of all these things you were dealing with, then taking a moment even to reflect on them. We had a piece we put out a little while ago that was called, I think, the four-second post-mortem, right? It was this idea that if you just write something down after the outage, the incident, the real-time emergency has happened, you're going to get more cultural value out of it. So if you're a four-person startup and two of you are asleep and two of
Starting point is 00:18:41 you fix a problem and you write a one-sentence post-mortem on exactly what's not going to happen again. And then that emails out to everyone. That's better than by the time those other two people wake up, that entire Slack conversation has been pushed away. And they might A, not know the incident never happened or B, just never get any feedback from it and fall in the same pitfalls. The revolution outside of tech that happened in this space was 80 plus years ago, which was really just this core idea that we should really ask the people who were involved in the outages what happened, because they're actually the best people to account for what went wrong from a systemic standpoint. coming in after the fact and taking that sort of outsider's view and say, oh, we didn't have enough pods spun up in Kubernetes, or we didn't have our DNS fallback set up. Ha ha, I would never have made that mistake. Joe's an idiot. It's a very common way to approach even in a small team. And if you get Joe to go write down the one sentence back that says, all right,
Starting point is 00:19:39 all of our assumptions about the way namespacing and Kubernetes works were wrong. And it turns out you can never do X, Y, and Z, then even in that small company, you're going to get value. And Kentaba doesn't really get in your way at that point, right? It's a tool where you've probably clicked two buttons and you talked in Slack if you're that small. So you're not really getting an additional layering on top of it. And it's one of our challenges, I think, from a marketing standpoint is that assumption. Is incident management, it's this heavy handed thing, right? It's the NTSB report. It's the scary meeting you go into with the rest of the department heads.
Starting point is 00:20:11 It's all of those pieces. And I think if that's all that incident management is, I'm not even that interested in it, right? It's not that exciting of a space. I think where it's exciting is when you start thinking about it as a daily practice, almost. We've got a couple of companies using us that are 20, 30 people, and we're looking at tens of incidents a week, which for a company historically would be a lot, but it's because they're treating these real-time response efforts as low severity incidents. And because of that, they now have great tracking on it. They have reporting on it. They understand the, and they're having fewer SEV1s. So I think this works, starts to work around four, really starts to work around
Starting point is 00:20:50 10, and then just becomes indispensable somewhere around 20 or 30. I think that makes sense. And I think the key thing that you mentioned was it's like super low overhead, right? I think traditionally, engineers are used to the manual work, right, of creating a doc, filling out five whys on why something happened, and it's just a pain, communicating super manually, emailing out at the right time, making sure your email doesn't have typos, all of those things. It seems like Kentabra removes a lot of overhead from and also normalizes the entire act of creating an incident, like it's not a culture, like, I'm not scared about filing an incident anymore. I think that's the key part. Sounds like... Yeah. And I think practicing those lower
Starting point is 00:21:32 SEV levels, like the twos and the threes is what really encourages that. And culturally, it kicks off really the other side of this being open, making sure everyone can see the incidents. Means as long as you've got one person in the company who gets comfortable filing these things, it just propagates out really cleanly. Everyone else sees that and they say, oh, that's just part of how we operate. We file SEV3s, we file SEV2s. And we don't think of them as, oh no, I'm going to go wake people up because we have our configuration set up that people don't get woken up for a SEV3 or a SEV2. And I think that's really what's critical is when you're thinking about building a company, even at our scale, we're small. We're eight to 10 people at Kentaba.
Starting point is 00:22:07 And even at our scale, we use the product pretty heavily. And it's just a culturally positive thing. And it helps the fact that when things do go very badly, that cultural positivity propagates forward. And we're like, oh, it's an outage. Those happen. Let's deal with it quickly. And you really can only do that through practice, right?
Starting point is 00:22:25 You can really only learn to accept failure once you record it often enough that you realize it's going to always be there. I think as an engineer, right? The worst thing that can happen as an engineer is you join a team and your engineering manager sits you down and says, we don't have outages here.
Starting point is 00:22:39 We don't have errors here because the implication they're making is you're going to be fired if there ever is one and you're at all involved. And it's an antiquated way of thinking about software operations, right? It's throwing out a hundred years of learnings, especially post-World War II learnings about humans and responsibility. And so long as your hiring bar isn't terrible, which we all like to be really positive about our hiring bars, right? As long as your hiring bar is pretty good, then a major problem is almost always systemic. There's something about
Starting point is 00:23:07 the context that you were placed in that caused that situation to happen. And your job as someone involved in it is to make sure that context doesn't occur again, to record your account of what happened. And that's, again, where this just comes into play, right? It's like, how do we make sure we're logging these things, make sure there's awareness of them and make sure they're distributed. And none of it's complex, right? It should sound simple. When I talk to people about the product, people don't generally come back to me and say, ah, it's a really complex, it's just an easy idea, which can be a barrier in and of itself, right? You can, as a company leader, you can look at these things and say, that's obvious and easy. But I think the evidence consistently is we don't do it naturally.
Starting point is 00:23:44 We're even in software as objective as we like to think of ourselves, right? We're not. We blame people all over the place for things that go wrong that aren't really their fault. And I think that's why this evolution, right? From like SRE teams as a practice to DevOps as a cultural revolution to like resiliency as a company change is so cool to me because we're expanding out the understanding of these things that we hold dear, I think, as engineers to the point where not just our bosses and our bosses and our bosses, but even the folks running marketing, the folks running legal, the folks running all these other departments also can embrace it. And that's really what's
Starting point is 00:24:21 critical about implementing the process and making it open. And I think that's one of the biggest things that changes when you adopt the tool as well is the openness of it. If you as an engineering team implement incident response, you're probably going to silo that into your engineering Slack channel, into your engineering Google Doc. You're going to have a lot of limitation on visibility there. It's the opposite of what you want. You want lots of openness. Yeah. I think the lack of trust really comes from the lack of clear communication, right? Like if marketing doesn't know why the site went down and if there's something, if this is going to happen again, that's when you start losing trust. That makes sense to me. One question I had was around like how standardized and opinionated is Kintaba? Like how
Starting point is 00:25:04 easy does it let you customize the process or everything in place because from what I've seen like Google and Facebook and even like larger companies they generally have a fairly standardized incident management process there's like SEV levels as you've been talking about SEV1, SEV2. I know somebody who works on an oil platform they also have like SEV1 and SE SEV2s. So it seems like a universal naming practice. But then each company has its own quality metrics, right? We declare a SEV2 when the site is down for more than 10 minutes versus like 30 minutes. So how do you configure all of these different things? And how opinionated is Kentaba versus how freeform? So we try to be really opinionated
Starting point is 00:25:42 about the core flow. How do you define how public is an incident that's been declared? What does the command center feel like? Where do you go to have the communications? Does the postmortem have to be written? All of that stuff is pretty hard-coded. You really need to go through the flow. The sort of depth of each flow is a little bit more configurable. So for example, in the response itself, determining who are the responders that should be automatically added, should you be calling external tooling when certain types of incidents are created that are a SEV1 versus a SEV2? How are we categorizing? That kind of stuff is configurable.
Starting point is 00:26:14 And then we really break it down into three or four core areas of the product, right? There's people, there's tags for defining how you want to organize these things. There's your SEV levels, there's rotations, on-call rotations, which are really more about roles. What role might you have inside of the product? And then all of that gets tied together with a product called automations. And automations just lets you do an if this, then that across all of those things, right? If sev one and inside of this tag, then make sure this on-call is added, right? If this on-call is added and it's been open for more than 20 minutes, then upgrade to sev two, right? You can define those roles through sort of an engine that runs inside of Kintaba, but the out-of-the-gate configuration of it works just fine as well, especially if you're small.
Starting point is 00:26:56 If you're like five, 10 people, you might not have any automations, right? You might really just have the declaration, which propagates out and announces in a Slack channel, and then everyone runs to that Slack channel, and that's good enough. So out of the box, it's pretty opinionated in terms of best practices. And then as you're getting a little bit more complex as an organization, you can apply those rules down as automations. When it comes to things like sev levels out of the box, it's one, two, three. We do support up to five. We originally didn't. We were pretty aggressive about not letting people add SEV levels. It turned out there was a pretty good argument for some companies to have,
Starting point is 00:27:31 especially an informational SEV level, which some companies would call like a SEV five. Some companies call it info to log these really near misses that weren't even SEV worthy, but needed to be logged. And then some companies will even use those additional levels for things like running like game days, running like a planned planned sev where they want to log it and they want to record it separately, but it needs to have its own icon. So we landed on five. We had a lot of internal discussion and five seems to be enough for everyone. Three tends to be enough for most people. And so we try not to do too much configuration there. The worst tools to me are the ones where you spend weeks configuring them and you don't actually get a benefit from the
Starting point is 00:28:09 tool because all you've done is taken your own existing opinions, which might not be best practices and enforce them into someone else's world. The sales forces of the world, I don't want to knock Atlassian, but the Atlassians of the world, that's what they are. They're like IT configurable tools. And the other downside of that tends to be you end up with kind of a class of people inside of the company who can then control the tool and everyone else has to go to them and beg for the things that they want changed. Whereas the advantage of being a little bit more opinionated is because there are fewer things to configure, we also don't have to lock those down as aggressively. We don't have to say, if you're not in this group, you can't configure things. So while even in Contaba, only admins can add people, for example, to make sure
Starting point is 00:28:54 you're not adding someone from outside your organization. Anyone can add and change tags. Anyone can create automations. Anyone can declare another, I guess this is too big, but like clear an external web hook that gets called. Those things can be configured by anyone because they should be, right? Like you shouldn't have to go through channels to go and make sure that you get pinged whenever a sev3 gets created. That's just a natural thing you should be allowed to do. Yeah. That whole discussion around increasing sev levels makes so much sense to me. Like again, at Dropbox, we started with was like, if we're down for a long time or if we lose any customer data, and that would automatically page people. And then eventually we're like,
Starting point is 00:29:30 sometimes we want to notify something is a big deal or we're going to run out of capacity in 12 months. So it's not a step three because that's really bad, but we still don't want to wake people up. So then we shoehorned this fake SEV called internal SEV zero. It's like the coming SEV. You can see the train somewhere in the distance. Yeah. And I think companies deserve some degree of flexibility for sure in terms of how they want to log these.
Starting point is 00:29:53 Facebook would actually log a lot of their pending SEVs as real SEVs. They believed pretty strongly in saying if it is going to eventually be a SEV1, it's a SEV1. But even within teams there, they were a little bit different across. And I think we have to provide a degreeV1. It's a SEV1. But even within teams there, they were a little bit different across. And I think we have to provide a degree of flexibility. It's one of the challenges of a tool like this. We want to walk that line. And I want to be out of the box easy for a small company that doesn't want to do any configuration. But I also want to be useful for a larger company, a thousand plus people who are on the product a couple of times a week at a minimum daily at a
Starting point is 00:30:22 maximum. And I want to make sure that they don't get shoehorned into something that's completely inappropriate. So post-mortem templates are another one, right? Those are configurable and can be changed based on the type of incident that it is. You can have a different template for different tags. You can have a different template for different sub levels. You just need that kind of stuff. I've always been very envious is the wrong word, but of Asana's approach, right? When they came into the task management space. And he took a really clean approach to lightweight configuration, but a lot of opinion in the UI. And I think if Kentavla can iterate continually towards that kind of ideal, I think you get products that engineers and non-engineers are willing to go and adopt without maybe necessarily having to go through a vendor selection purchase process against the service nows of the world. That brings me to another question,
Starting point is 00:31:09 which is, have you seen non-engineers use file incidents? Yeah. I think what happens when you get that barrier lower and more incidents are filed, the more people can see them is you start to imagine as a non-engineer the things you could potentially also use this for. And if you're tagging appropriately, you can create different populations that are alerted for different types of incidents. So a natural one was one of our larger companies had a SEV1 during the Texas power outages. They had a large portion of their user population, or sorry, employee population in Texas who had lost power. And this isn't an engineering problem, like your engineers can't log in, but really it's an HR people safety power. And this isn't an engineering problem, right? Your engineers can't
Starting point is 00:31:45 log in, but really it's an HR people safety problem. And it was really the HR team pushing to go and file that incident. And I think when it first happened, that company was a little surprised. I remember we got some feedback from them that was like, hey, we filed an incident for this. Is that okay? And we're like, yeah, of course. You can file incidents for anything. Your marketing team can file an incident when, or your PR team can file an incident when a bad article comes out. The process is the same. You're bringing a team together, you're dealing with it. And once it's responded to, you're probably writing a postmortem. How did we get into such a bad relationship with the Washington Post that they wrote that terrible article about us? I don't think there's anything about incident
Starting point is 00:32:21 management at its core process level that only works for engineering, I just think the engineering community has embraced it the most because it's, for lack of a better word, the biggest fire. It happens consistently in manageable ways that they really want to report on. What becomes challenging when you get out of the engineering org is metrics start to become a little bit more difficult. I think Dave Renzin over at Google has a talk that he gives about how marketing departments could adopt SLOs and SLAs. And I think it's a really interesting talk as an engineer, but I think the reality is that the marketing department doesn't want to be measured on a lot of that stuff. So they want
Starting point is 00:32:56 the core process because it's valuable. But imagining how you might do metrics against them forces a lot of those organizations to push back and say, we don't really want to be tracked on this. This isn't one of the ways we want our team to be measured. So I think that comes from hopefully making things more lightweight and making it more about process adoption for those types of teams. But even engineering outages deserve to have non-technical folks involved, right? If you have an outage, you probably need your PR person, your marketing person. If you're a B2B company, your salesperson, they all probably need to know what's going on. And the worst thing you can have is all of them shooting emails out to the edge manager or the responder themselves saying like, what's going on? What's going on? Is it
Starting point is 00:33:32 going to be fixed? What's going on? And I think the reason people do that is we have this natural fear that the people we work with don't really understand how we're impacted by things. There's like an outage. And in the engineering community, we're pretty aware of the fact that if someone stopped typing, they're probably doing something important, working on the problem. But in the sales world, they don't know that and they don't see the conversations you're having. So their impression is, well, maybe they're just not going to work on this till tomorrow. And they don't realize that it's having an impact on my ability to do my job. And so just making all this stuff more open just cuts down the overhead
Starting point is 00:34:05 communication because they can see the channel, they can see the conversations, they can see what's going on. And you would think like the risk there would be, okay, now those channels just get super noisy. But in reality, that's not what happens. Like what most people want is awareness of the response more than they really need to know like the details of every single action being taken. So given access to the channel and seeing that level of depth and conversation happening gets 90% of people to back off and let it continue to flow forward. Yeah. As long as they understand what's going on. And yet to your point, yeah, sales might be doing a demo of the product while the site's
Starting point is 00:34:40 going down. I've certainly seen potential issues or near misses on things like that. So yeah, people want to just be aware of what's going on. And I think that can impact priority, right? Like you might have a SEV2 and a B2B company because you think it's not affecting anyone important and then find out that it's impacting an in-process sales pitch to a company that's going to make your entire sales book for the rest of the year. And suddenly it's probably, maybe that's a Sev one. Maybe you actually need to redefine what your timeline is going to be. And I think that's okay. The Kentaba takes an audit approach to a lot of this stuff. Anyone can change the Sev level, but it records that you did it. So you'll eventually have to back up why you changed it. If you went into that channel and you made it a Sev one, you can't just willy nilly
Starting point is 00:35:21 do it and then hope no one will notice. So I think audit logs are really powerful at preventing people from doing nefarious action. So think a little bit about it because it's going to log the change. But yeah, I think in general, these teams benefit just as much as eng teams do inside of companies. And I think companies like Facebook and Google, there's a lot of public awareness inside of the company of what's going on, all the way to the point where you can almost respond if there's a conversation happening somewhere else and say, oh yeah,
Starting point is 00:35:48 there's a SEV1 for that. And that just ends the conversation. It's okay. It's already recognized as the most important thing. And you and I can talk about this and we know what a SEV1 is, we're aware of it. But at a company that's just adopting this stuff, the whole concept is new to non-engineers especially. And it's a really healthy concept to understand. Oh, there's something more important than an important task. There's this other thing called a SEV1. And the minute I know that exists, I have a terminology now to use for my eng team is panicking and working as fast as they can. And I don't just have to convert that into, why don't you realize it's an emergency? I think
Starting point is 00:36:22 that's pretty awesome. You're introducing vocabulary along with culture. Yeah. I think the nice thing about that terminology is that everybody understands that no, 7 is probably bad. And even though some companies might have the numbers upside down. I think it's funny you had zero actually at Dropbox. I didn't realize that. I've always been a huge anti-proponent to the zero.
Starting point is 00:36:44 Yeah. I feel like it's an excuse. Oh, there's one more. We didn't have that. I've always been a huge anti-proponent to the zero. Yeah. I feel like it's an excuse. Oh, there's one more. Like we didn't have emergency enough to have a bigger emergency. I think the idea was that at a self zero,
Starting point is 00:36:52 it's like you've lost customer data, which is basically like close to if it's really bad, it could be like a company ending event. Yeah. So what are those things
Starting point is 00:36:58 that could be like a company ending event? Oh, we have an availability hit for 30 minutes and we need to pay back our enterprise customers because we breached our contract and stuff. But I think at the end, these things, 7-0s, they end up getting overused as well. But that's a conversation if I feel for another time.
Starting point is 00:37:17 I think Microsoft had a priority level zero, they had a P-0 that they introduced at some point because P-1 wasn't low enough. Facebook has a concept of unbreak now, which is basically your real-time task kind of fill-in. I think there's a really interesting conversation at these companies. Where does the one end and the other begin? Where does unbreak now and then incident response begin? And I think it's really just about incident responses tend to require teams to solve and tasks tend to be individual-based. I don't know if Dropbox ran the same way, but it was this idea that someone always owns the task and they're really the one doing the work and getting it fixed and it gets passed if someone else is doing it. But with an incident, you
Starting point is 00:37:54 actually can have a response team. You can have people doing research. You can have people fixing one part of the system, people fixing another part of the system. And they represent like the various arms generally of your engineering infrastructure. Yeah. Yeah. We generally have a T-lock or like a technical leader on call or something, and they would be the main person just deciding what to do, but then the tasks get farmed out with other people. And the nice thing about SEV0s and like SEV3s and all of that was like, yeah, the boundaries were not super clear, but everybody knew that a SEV1 was more important than a SEV3. And by the end of it, internal tools, teams, and all of that were not allowed to file SEV1s or SEV0s.
Starting point is 00:38:30 We can't have something that's not user-facing actually be a SEV0. And the flip side was, now we know that a SEV0 is really important, you need to make sure your action items are done within 30 days and all of these other things. Which actually I think was really important and really useful because these things can so
Starting point is 00:38:47 easily be forgotten, especially when the organization grows like super big and some team might just not be completing their tasks on time. And we might have a repeat of the outage that there's a code yellow and all of that. Like you need all of this process, especially when you're like, And then you get the log too, right? So when review time comes up, especially in a larger company, you can point back to these things and say, yeah, I was on the response team to these major incidents that happened. And you'll notice they never happened again. It's a good thing to
Starting point is 00:39:15 be able to point back to. Again, it's moving culturally away from a fear to file into almost not an excitement, but an encouragement to file. Like maybe it didn't really happen if you don't file it. And so I think that that's really one of the things we like to see. Like when Contaba is working, and I try not to say it this way, but it's almost like your incident count goes up, right? It's like you think about, I'm going to implement an incident management solution. And what do I expect? Well, my incidents count will go down.
Starting point is 00:39:43 And that's not actually what you're trying to do. The reality is these things are happening and you're not logging them. And so what will be perceived to happen is your incident count goes up, which means every time that happens, your learning goes up, your knowledge goes up, your ability to respond gets better and your acceptance of the process improves. And so we're working like really hard to figure out in the product, how do you celebrate that? It's a hard thing to celebrate. Like, yay, more SEV3s, right? But it's actually really good. The airline industry has a really interesting history of rewarding pilots who file more
Starting point is 00:40:13 incidents. And it's a hard line to walk, right? Because you're asking for abuse at some point. You're like, oh, every little thing gets filed just for... Or maybe I break something on purpose so I can file more of these. But there's almost an incentive for organizations to say, more of these are better. And the result of that will be fewer catastrophes, fewer horrible major problems. And that's hard to see immediately. You have to believe that, and then you have to see it for a little while. And then the results comes out long-term.
Starting point is 00:40:39 But you just get powerful impacts on numbers of major outages, as well as just a cultural impact of positivity around the natural progression of a company. I think companies, especially startups, fall apart because of incidents. And they often fall apart, especially with younger founders. I know when I was in my early 20s and running my first company, our company almost fell apart a huge number of times because of the fights we got into because of outages. And I think if we'd been a little bit more exposed to the reality of the world before that, and worked on a few more high-speed unicorn rapid growth companies, none of that would have come as a shock. And I think tools like this, we're talking again about smaller companies, help those companies
Starting point is 00:41:17 accept, oh, we have a tool for it, so it must not be that uncommon. And it gets you a little bit over that, you screwed this up, I screwed this up, right? Like your context in a rapidly growing company is we're moving really fast. We're growing really fast. We're making mistakes, right? The learnings are going to be less about this co-founder needs to go. And they're going to be more about, okay, we moved too fast over here and we need to change the way we move quickly on that one piece of the product. So maybe we don't need to push every hour into the live site. Maybe it should be every day. Things like that, I think, are the learnings you can get that are just healthy outcomes, especially in the post-coronavirus world where we can't even see each other during these outages, which just makes everything harder because you're trying to be blunt and quick and aggressive in
Starting point is 00:41:58 these chat channels. And you end up coming off as angry, unhappy, and miserable. And you don't always want to hop on Zoom. And so I think we're only going to see this propagate, I hope, as we move more and more towards remote work, especially where you're going to have that defined tool process because you're not going to all be standing in a room yelling at each other. It reminds me of a product called Git Prime, where eventually once Kentawa has millions of customers, you can show like new customers. On average, a company in your industry of your size has six incidents a month. And this is where you stack rank compared to other companies. I'm saying this half as a joke, but yeah, you have a lot of opportunity to educate people on this is how things work out. And the
Starting point is 00:42:42 more I think about it, it seems like you can also help guide people. Like maybe people don't know whether to file it as like a SEV1 or a SEV2. I know that this was a problem at Dropbox. Like, should I file this as SEV1 or SEV2? And it actually gave you like this workflow or like this, like where it would ask, is there like an availability hit?
Starting point is 00:42:58 How long has it been going on for? Does it affect actual customers or is it like this internal customers? And it will automatically decide the SEV level for you you so it's also like a way to educate people this is what a large incident at our company means so yeah we do like configurable hints right now in cantaba like when you're picking it the explanation of what is that level is configurable by the company but i think you're dead on because we get more data into the product, right? Like I think AI ops is like a really popular conversation term these days. And I think where AI machine learning, where those things are really interesting to me
Starting point is 00:43:32 is that kind of stuff, right? It's how often does this tend to happen? What kinds of incidents have happened like this before across your industry? Should you be surprised that this piece of infrastructure was affected? And so now it turns out like within engineering teams on AWS, this piece of infrastructure is affected. And so now it turns out within engineering teams on AWS, this piece of infrastructure is generally responsible for the problem. That kind of stuff's really cool from an AI ML standpoint. Where I cringe and I see companies chasing after AI ML is when they're saying, it will come in and solve these problems for you.
Starting point is 00:43:58 It's like, great, we're going to run an AI system kicking off your scripts. That'll kick off script A, B, C, D, depending on what it thinks is happening. And I think just a lot of what AI has shown us over the last 10 years is rather than putting AI in front of that, you probably ought to just run those scripts before you file the incident in the first place. And if it doesn't work, get a human involved. We all want AI to be human-like, and that's not yet what AI is good at. AI is good at repeating things that have been done in the past and propagating forward everything from the good to the biases. Recruiting is the famous example of where you really don't want to use AI ML because all you're going to do is get
Starting point is 00:44:34 We all want AI to be human-like, and that's not yet what AI is good at. AI is good at repeating things that have been done in the past and propagating everything forward, from the good to the biases. Recruiting is the famous example of where you really don't want to use AI and ML, because all you're going to do is get more of exactly who you already have, and that's a quick path to having no diversity. And I think with incident response it's similar, right? If you're practicing it properly, your incidents are black swans. These things aren't necessarily predictable, and if they were predictable, you should have caught them before they ever hit your incident system. At that point, humans are coming in and fixing the problem, and then they're going to make changes to those automated systems after the fact so you don't get the same incident again. Wrapping all of that into your incident management system starts to turn your incident response system into just a day-to-day ops system, and really that piece could just be automated out. So I think it'll be very interesting to see the direction a lot
Starting point is 00:45:14 of this stuff goes. I know PagerDuty recently bought Rundeck, which is making scripted actions easier to access for individuals who are responding to incidents. And we watch that very closely in terms of, okay, how are they going to implement that? Does that really fit in, or does it just become more overhead? So if you're implementing this thing,
Starting point is 00:45:31 it's another layer that you've got to configure before you can get these basic values out of the product. Yeah, AI definitely seems like a stretch to me, but again, my perspective is probably just jaded from larger companies. I can give you another personal example. So I was on call for Dropbox.com, and there were a lot of things that could go wrong,
Starting point is 00:45:50 right? For a site that's fairly large. The only thing we really wanted to do was, we had this triage dashboard: as soon as we get an alert saying availability is down, we have to look at this massive dashboard and go through every graph to see where the availability hit is coming from, and that would take a lot of time. So the only script that we really wanted to run
Starting point is 00:46:11 was one that hit our metrics system and printed out all of the metrics that were deviating from their baseline: this is supposed to be at 99%, but it's at 95. So I see a lot of value in something like that, just give me more information when an incident is going wrong. But maybe the answer is that the observability tool needs to show that to us rather than the incident tool.
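A rough sketch of the kind of triage script being described, assuming a snapshot of current metric values versus their expected baselines. The metric names, values, and tolerance are made up for illustration and do not come from Dropbox's real systems.

```python
# Compare current metric values against expected baselines and print only
# the ones that have drifted below them.

# metric name -> (current value %, expected baseline %)
metrics = {
    "web.availability": (95.0, 99.0),
    "api.availability": (99.2, 99.0),
    "login.success_rate": (91.5, 98.5),
}

TOLERANCE = 1.0  # percentage points of allowed drop below baseline

def deviating(snapshot: dict, tolerance: float) -> list:
    """Return human-readable lines for every metric below its baseline."""
    lines = []
    for name, (current, baseline) in sorted(snapshot.items()):
        if baseline - current > tolerance:
            lines.append(f"{name}: {current:.1f}% (expected ~{baseline:.1f}%)")
    return lines

for line in deviating(metrics, TOLERANCE):
    print(line)
# Prints web.availability and login.success_rate, the two metrics below baseline.
```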
Starting point is 00:46:30 Yeah, I think there's an obvious intersection there, right? Between observability tools and incident response tools. They're both real-time tools, both designed for "I'm investigating something." And if you go back eight, nine years, observability wasn't even a space, right? We were like, oh, this is metrics, isn't it? It took a while to figure out that there's a real-time opportunity there. So I do think there's a really interesting intersection. Today they really do live as separate tools: you're using something like Honeycomb and you're using something like Kintaba. They work next to each other really cleanly,
Starting point is 00:47:00 but it's a pretty natural integration that you would eventually have. And you have to be careful with those integrations, because what you don't want to do is just pile information into your high-signal response channel automatically. But there are situations like you're describing where, depending on the infrastructure piece being affected, the severity level, or the tag, you probably do always need a certain chart. You'll grab that chart and inject it in just so everyone's got it, if for no other reason than that you might want a record of that chart at that point in time, even if it turns out it wasn't affected.
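As a sketch of that chart-injection idea, assuming a hypothetical mapping from affected component to dashboards and a placeholder snapshot helper; no real observability or incident API is implied.

```python
# When an incident is opened, look up the charts that always matter for its
# severity and affected component, and attach point-in-time snapshots so
# responders and the eventual postmortem have that record.

CHARTS_BY_COMPONENT = {
    "database": ["db-replication-lag", "db-connections"],
    "frontend": ["web-availability", "p95-latency"],
}

def snapshot_chart(chart_id: str, incident_id: str) -> str:
    # Placeholder: in practice this would call an observability tool's export API.
    return f"snapshot://{incident_id}/{chart_id}"

def attach_relevant_charts(incident_id: str, component: str, severity: str) -> list:
    attached = []
    # Only auto-attach for higher severities, to avoid piling noise into the channel.
    if severity in ("SEV1", "SEV2"):
        for chart_id in CHARTS_BY_COMPONENT.get(component, []):
            attached.append(snapshot_chart(chart_id, incident_id))
    return attached

print(attach_relevant_charts("INC-42", "frontend", "SEV1"))
```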
Starting point is 00:47:33 I always say the scariest emergencies are the ones where everything looks okay. It's not so bad when the chart's gone red; you at least have somewhere to start. The worst thing in the world is when your dashboards are all green but your customers are telling you they can't access the site. Even worse is when you can access it and no one else can. You end up turning Wi-Fi off on your phone and trying to connect over LTE to see if it's something about the network. Those are the scary ones. And I think those situations are the perfect examples of incident response: they're so fuzzy that when you do find the reason for the problem, it's going to be this interesting and novel thing. It's not just going to be,
Starting point is 00:48:10 I needed to expand the memory, or I didn't turn auto-scaling on. It might be, but more likely it's going to be something weird. It's going to be something like, my auto-provisioner ran out of characters. Yeah, like file descriptors: we ran out of file descriptors on this one machine and that caused a cascading problem. Or it turns out you can't name an internal user www, or it overwrites your internal systems for www. That actually happened. That kind of stuff is just novel. And coming back to the original conversation we had about postmortems and the publicity of these write-ups: everyone wants to know. There was a Robinhood outage that was pretty famous, right around daylight savings time. And I remember everyone on Hacker
Starting point is 00:48:49 News was just making fun of them and saying, ha ha, they forgot about daylight savings time, what a bunch of fools. And I can't remember if they publicly wrote the postmortem or if someone gave a talk about it, but it turned out to be some completely different thing. It had something to do with namespacing and auto-provisioning, and it just happened to happen on daylight savings time. And I think that's why it's so cool: the fact that we all jumped to the conclusion that it was this obvious thing makes it all the more critical that someone, especially within that company, propagates out the information that that's not what happened. It was this completely different problem that we maybe had no system in place to detect.
Starting point is 00:49:25 The other one is natural disasters, which are pretty tough. Your data center goes dark, or something horrible like that happens, and it's, okay, my dashboard isn't just red, it's off. So the space is fascinating to me. We get to see all kinds of cracks in people's day-to-day lives, and we're all running around as firefighters.
Starting point is 00:49:42 So I think the tools that have to exist here have to be practical tools for everyone to use. They can't be specialist tools. And I think that's really the fun opportunity in the space to me, the culturally valuable one, and the one I was upset didn't exist. We went out to the market pretty aggressively and looked. We thought, surely PagerDuty has solved this. Surely Atlassian has solved this.
Starting point is 00:50:02 And the tools are just complex. There are lots of fields to fill out and they just feel like more work, and that's not a good way to deal with outages. Your goal is to get your firemen on the ground, not to give them more forms to fill out. It's a joke in policing, right? They'll let you go if there are too many forms to fill out. You want to get the forms out of there and just let people do their jobs. Yeah. I can imagine a world, maybe five years from now, where Kintaba will have a one-button publish to web.
Starting point is 00:50:29 So you can publish a postmortem publicly, and then post it to Hacker News with another button. And I totally think the first step to that is just getting more incidents recorded inside of companies, to encourage them to write these things at all. But you're totally right: the more public publishing that happens, the better, especially where it benefits companies. I think I trust companies that write their postmortems publicly more than I trust companies that don't. Cloudflare has let me down before as a service provider, and I think of them as really weak
Starting point is 00:51:00 because of it when it happens. But then they write the postmortem, and I realize the depth of their technical knowledge, and I read it over and think, okay, I couldn't have handled that, so it's still better for them to be the ones in charge. And that's the opportunity, right? Even if you're not publishing a postmortem, just being public about incidents is your opportunity to show your expertise. The way you deal with major outage situations is how people know that you deserve to be a steward of data or a provider of a service. And then when you can give that account of them, there's a... Sidney Dekker has a chapter in his book about accountability,
Starting point is 00:51:36 where one of the early pushbacks against incident response, or blameless approaches to incident response, was the idea that we're not really holding people to account. But if you dig into that word, being accountable means, among other things, that you should be providing the account of what happened. And once that happens, you almost always realize that it has very little to do with the person and an awful lot to do with the system. And the system, as a company owner, is yours. I really like that idea. You're always pushing the problem outwards to the external factors. And you're saying, this wasn't the subordinate of the
Starting point is 00:52:11 subordinate inside of a hierarchical organization. That's not what happened. What happened was that this company process caused blah, blah, blah. And the minute you do that, you've actually pushed the accountability across the chain, right? Because we all agree to processes, we all build those, we all built the company that allowed that to happen. And that's the right way to do it. If nothing else, it softens the blow, because we all take it on as a learning and we're all responsible. Every time an airplane crashes, the entire industry treats it as something they need to be aware of and deal with. And why shouldn't companies? Every time Dropbox.com goes down, that has the potential to take your whole company down.
Starting point is 00:52:50 It doesn't matter if your job is technical writer or PR person; you probably care an awful lot about that response, and whether maybe you, in some tiny way, were responsible for the process that allowed it to happen. Being open to that learning is just powerfully healthy from a cultural as well as a company-building standpoint. Yeah, that makes sense. And maybe this is a good last question, which is: what's your plan,
Starting point is 00:53:14 or what's your approach been, to educate the industry about adopting tools like Kintaba? So we're working really hard to get more and more of what I call self-service, bottoms-up adoption as an organization. We really want small and up-and-coming companies to be able to adopt it without having to hire a consultant to come in and teach them the cultural pieces.
Starting point is 00:53:34 So this is really a UX problem throughout the product. It's not about putting you through a training course; it's about making each page that you hit, each screen that you encounter, informative so you can learn as you go. Take filing incidents: how do we make that more accessible? Where do we put it in Slack? How do we make the bot work? How do we encourage it more often?
Starting point is 00:53:54 How do we give you more exposure to the ones that have been filed? How do we give you an idea of how often they get filed, so you don't feel like you're running outside the norm? That's the core stuff the product has to work pretty hard at, and that we've been iterating on pretty aggressively and continue to. The other part of it is that there's more and more personification of the product happening. In the beginning, the tool was very tool-like: here's your tool, it disappears, and you use it. And in incident response at a company like Facebook or Dropbox, you'll have either an IMOC, an incident manager on call,
Starting point is 00:54:21 or an incident commander, and it tends to be a trained role, right? You went through formal training to learn this stuff, and you place them in every incident regardless of whether it's their problem, if nothing else to make sure everyone's acting correctly. They can help adjust SEV levels, they can help move things around. We used to call them the responsible adults in the room. I really do start to feel like the future is almost the elimination of that role. The tool should take on that role of helping you understand: how do I file? When do I file? What do I file? Where do I file? When do I close it? Et cetera. Because otherwise, just like you said with how you teach people, you have to train. And if the barrier to the tool is a training class, it just ends up being this thing that only large organizations will ever get to really adopt.
Starting point is 00:55:13 And so what will really make this work for every company is partially on us from a feature and UX standpoint, but it's also tool crafting and messaging, everything from the marketing page, to how you talk about your own incidents and product, to these conversations we're having right now. I hope the entire tech world is adopting this cultural feeling of, we ought to be doing incident management. That's the beginning, and then you propagate forward into the tool and beyond. To me, these things are all product problems, which is why we build a product in this space. Because when I think about iteration, I think less about how do I go and own everything your SRE team does, which is a very natural way to think about a product
Starting point is 00:55:54 like this. Instead, I like to think about it from the standpoint of, how do we make your marketing person participate in incidents? Because if I can get your marketing person, who's never even heard of incident management, to do this, then a small company of engineers who maybe also haven't practiced it in the past can adopt it overnight. So it's really about who uses it. I reference Asana a lot because I'm really amazed by that product in general, but I think it's similar in that the rest of the world had to-do lists, we had project management, and there was that middle ground. And the only way you bridge that is by building a really beautiful product that helps people understand.
Starting point is 00:56:28 I think incident management is the same way. Marketing teams already have somewhere they run to when there's a fire, but they don't actually have a tool that makes this work consistently. I think you're working on a really important and exciting goal. And if more companies had the product, I would enjoy working at those companies more, for sure. So that's the dream.
Starting point is 00:56:51 It should be a required tool, right? Imagine showing up at a company that doesn't have a task management system. Or code review, at this point. Yeah, or code review. It's like, okay, you're way out of date,
Starting point is 00:57:03 that would be their response. So I feel like we're really in the early days of this. Google published the SRE book in what, 2016? That was the SRE handbook, which was really the first time they publicly talked about incident command. And I think we're refining that now into what it looks like for non-Googles to have the same kind of effectiveness without having to do that heavy lift of a full command system. Yeah. I'm excited to read about how smaller companies can translate some of the learnings from the Google SRE book and take the most important ones that are
Starting point is 00:57:38 relevant for themselves. I don't think there's a smaller version of the Google SRE book out there, and if somebody could drive that, it would be amazing. Yeah, I think there are a couple of okay resources, but you're right. Actually, Gremlin, if you're familiar with them, in the chaos engineering space, has a really interesting write-up on running a high-severity incident management process. It's the write-up, not the tool, but it's pretty cool. So I definitely recommend other folks check that out. Yeah, I'll take a look at that. I'm going to read all the other links you send me. And thank you for being a guest. I think I've learned a lot this episode. Yeah, thanks for having me. I think it's a great topic and I'm a big fan of the show, so I'm excited to be part of it. Thank you so much.
