Software at Scale - Software at Scale 17 - John Egan: CEO, Kintaba
Episode Date: April 20, 2021

John Egan is the CEO and Co-Founder of Kintaba, an incident management platform. He was the co-creator of Workplace by Facebook, and previously built Caffeinated Mind, a file transfer company, which was acquired by Facebook.

In this episode, our focus is on incident management tools and culture. We discuss learnings about incident management through John's personal experiences at his startup and at Facebook, and his observations through customers of Kintaba. We explore the stage at which a company might be interested in having an incident response tool, the surprising adoption of such tools outside of engineering teams, the benefits of enforcing cultural norms via tools, and whether such internal tools should lean towards being opinionated or flexible. We also discuss postmortem culture and how the software industry moves forward by learning through transparency of failures.

Apple Podcasts | Spotify | Google Podcasts
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining here with me today is John Egan, who is the CEO and co-founder of Kintaba, an incident management and reporting company.
And previously, you were the co-founder of Workplace by Facebook, which is the best way
I can think of describing it to a lot of people is Yammer, but like Facebook's equivalent.
It's like a workplace social media tool.
Is that roughly accurate?
Yeah, that's right.
It's Facebook's enterprise offering.
That was a pretty exciting tool to get to work on at Facebook before leaving and starting Kintaba.
Cool. Thanks for joining me here today. Great. Thanks for having me.
Yeah. So I want to start off with, first of all, let's talk about Kintaba, like what it is. And I'm
really curious about the origin name. Like where did you get the idea for naming your company
Kintaba? Because it's a pretty unique name.
Yeah.
So Kintaba is actually a derivative name of the word Kintsugi, which is a Japanese art form where you will take broken pottery and you'll reassemble it by using golden inlay
in the cracks.
So you end up with this really interesting piece of pottery that will have gold in these
interesting shapes and designs.
And what's fascinating about it is as an art form, the reconstructed version of that pottery
then becomes more valuable than the original. So you might've had a somewhat valueless piece,
right? That has then been reconstructed and now it's more unique, it's more resilient,
and you can see its scars. And we thought that was a really amazing metaphor to take back towards
building a resiliency company, a company that's trying to
encourage people to do more around talking about, recording, identifying, and putting process behind outages and incidents. And it really highlights this idea that the scars within a company really aren't
something you should be hiding. That's part of your company's history and it's part of the value of the company going forward. So it's a derivative of that.
So if you take Kintsugi and you convert it over to Kintaba, there's a little bit more of a repetitive end to it, hence the "-aba" ending, because this is something you're consistently doing, that you're consistently applying in your organization, as opposed to something that happens once. And so that's the history of it. It also rolls off the tongue nicely and was an
available domain. So all of the positives of getting a company off
the ground. But yeah, we've always liked that history. And we introduced that a lot when we
talked to customers as well as just in the general storytelling. I think it's a great way to approach
both Kentaba and the resiliency space in general. Yeah. It reminds me of this concept of like
anti-fragility of what's the name? Neil Nicholas Taleb or something like that.
Nassim Nicholas Taleb.
Yeah.
The idea that there are some things that get stronger if you hurt them.
Yeah.
And I think the openness of incidents and sort of their causes and effects has become
a bit of a movement on its own, right?
Over the last five, 10 years, it's always been something in technical circles.
We've always wanted to know what happens. But these days, when there's a major outage on Amazon
or Cloudflare or any of these big internet buoying companies, that's the conversation that comes up
on Hacker News and on Reddit, right? What people want to talk about is, they want to know
what happened, what went wrong. And I think part of that is because we all share the same
infrastructure now, right? In the past, we were running our own servers
and we had pretty custom situations. These days, you're probably on a cloud and you're probably on
the same cloud that most of the big guys are on. And so that has also pushed this understanding
of not just being anti-fragile as an organization, but being open about the pieces that are still
fragile. Because if you haven't found them yet, it's likely other people haven't as well. So I think that publicness, openness about the scars, wounds, failures is the evolution
maybe of anti-fragile. And I think that's pretty exciting.
Yeah. There's this podcast, which I just discovered, it's called The Downtime Project.
I think it's like a brand new podcast, two episodes in talking about outages of various
companies.
So I feel like you and your crew
would be interested in just listening to those.
That sounds really cool.
Yeah, you'll have to send me that link.
I think, again, as an organization,
if you're practicing resilience, I think,
and incident management in general,
you almost start to become a consumer of outages,
whether they happen inside your own organization or elsewhere. I think the coolest example of this is NASA. NASA actually
has a webpage out there where they collect incidents from other companies. And then they
try to apply those internally to NASA because they don't have enough internal incidents.
Their resiliency bar is so high, but their actual activity is quite low in comparison, so they're almost out there feeding on
other people's outages because then they can go and learn. They can say, okay, here's a major
industrial accident that happened in Ohio, but it turns out there's some learnings here. We need to
apply to the way that we prep for space launch. I think like all of these, that podcast is a good
example. It's propagation of that because once you get into this consistent mode of lowering your
barrier to
filing incidents and realizing that the more of these things you're filing, actually the more
resilient you become, you almost become greedy to the idea of incidents. And you want to see
everyone else's and you want to gather as many as you can because it's that last step, right?
It's that learning that's valuable and you can't get there without having outages. So I think
there's almost like an economy
of incident reports. And I think that podcast sounds representative of that, right? Because there's more desire for them, because learning from those makes you a better practitioner.
I feel like we'll see more of that. I think we'll see more websites that collect this kind
of information, podcasts that talk about it, deep analysis from technically inclined folks.
It'll be fantastic. That is super interesting,
especially about the NASA website.
I think I'll have to check that out.
So you'll have to send me a link after this.
Yeah, I will.
You can put it under the podcast.
I don't have the, it's a long web address.
I wish it was just like nasaincidentmanagement.com,
but there's a long web address
and they just collect them from around industry.
Maybe we can park that domain name
and you never know 10 years later. Before this goes live, we'll make sure to register it and get it sold back to NASA.
So can you tell me a little bit about what Kintaba does?
Yeah. It sounds like there's incident management, but yeah, can you just elaborate?
Yeah. So Kintaba is an incident management platform. It's really designed to make it super easy for a company of any size
to come in and start implementing the basic best practices of incident management. And historically,
a lot of this was only practiced by, or formally practiced at least by the big guys, right? Like
by the Facebooks, the Googles, the Netflixes of the world, where once you got to great scale,
you started to implement these types of
practices. And what's happened is all of those engineers have started going off to other
companies. And there's become this sort of revolution of adoption of the core practices
of incident response across companies that are startups, 5, 10 people, 30 people, 50 people,
long before they get to that great scale, because it turns out it's valuable at every stage as a practice and process. So with Kintaba, you can come in and get that
core set of features that gets you up and running from a process standpoint.
So this is about declaring publicly within your company when you have incidents, being public about incidents that are happening, bringing together your response team
in an automated way, and then letting that response team collaborate together and mitigate the issue and track all of those actions.
And then finally, actually writing that postmortem, the learning document,
distributing that throughout the company, and having a reflection meeting occur.
All of these little things are all handled inside of one platform.
And there's a lot of depth, right?
Each piece can do multiple things. You can call out to webhooks, you can call automations, you can automatically
add people. But really, it's that core process that companies are adopting and moving from
what I'd almost call a state of chaos when there's an outage, right? If you ask a five-person startup
on their first week, what do you do when the site goes down? A lot of them will say, we just run to Slack and panic. It's really just about wrapping those four or five steps into
an easy to implement process and then rolling that out in a way that your company can adopt.
And so historically to do this, you would either have to go write down a pretty long product
process inside of Notion or whatever your wiki is internally and distribute it. Or you'd have to go
ladder up to a pretty expensive system.
You'd have to go out to a ServiceNow. Maybe you'd have to pay for that extra-high tier in PagerDuty to get postmortem writing.
You'd have to work pretty hard to actually get it implemented.
So our goal is low barrier to entry, easy to implement incident management for everyone.
And that's really the ethos of the product.
That makes sense.
I'm sure the question that people will be asking you
is, how is Kintaba better than just using Google Docs and Slack and Zoom all together?
Yeah, it's like you're tying all these things up that are slightly different. And I can already
see some downsides of trying to coordinate across three. But yeah, what's your take on that?
Most companies we talked to initially are doing just that. They're stitching tools together.
And these days, in the remote world, we're especially looking at the Slacks, and everyone piling into Slack channels. In the old world, it was everyone would go run around to the computer of whoever's trying to solve the problem and start talking. And the reality is if
you're trying to build a process across a bunch of different tools, you're much better off generally
in building a tool that's purpose-built for what you're actually trying to solve. And it won't always be the most apparent
advantage the minute you put it in. It's more of a long-term advantage, right? It's like,
how are we doing this in a consistent fashion that everyone knows what to expect when things
are going wrong? And it scales as the company scales, right? When you're saying, okay, we write
our postmortems over here in Google Docs, we sometimes create Slack channels and they're named in different ways. We try to announce
it in this channel whenever it goes out, like everyone remember to send the email out to Phil
when this goes wrong, because Phil really cares, right? You start in a world where you look at
something like this and say, oh, that's easy to script. And then you rapidly realize that the
depth of each step needs to be consistent and recorded somewhere.
And really quickly, it just becomes easier to have another tool.
And a lot of what Kintaba does is coordinate those existing tools for you.
We operate inside of Slack.
A lot of our companies do most of their incident response inside of Slack, right?
But then you have the Kintaba UI that you can go to when you're trying to reference
back to other incidents and you're trying to search across all of them and see the reference
of what incidents were tagged, what way and happened over this time period, et cetera,
without having to leave all of your Slack channels open. And suddenly you have every
single incident that's ever happened, right? A separate Slack channel that never closes
in your history books. So it's not really something where you can't go respond to an
incident without an incident management process. You're just going to be a lot happier if you have
one in place and you have a tool there. And the goal during incident response is you don't really want to
care about that overhead and that administration. That's the last thing you want to do is argue
about where are we dealing with this fire? What you want to do is deal with it because if it's
gotten to the point of being an incident, you've already exhausted a lot of your automated systems
for dealing with the problem. You're probably at the point where you need human expertise. This problem's never happened before. And a lot of it is really that
investigation. You're firing up your observability tools. You're firing up your log chasers. You're
doing all of that work. So having to do any kind of administrative overhead, even as simple as
saying, I'm going to add this to a Google doc that logs all of our active incidents, it's just too
much and it slows you down and minutes matter.
Okay.
So to dig a little deeper, so Kintaba will let you do things like assign an incident manager, assign maybe a technical manager or something.
Also maybe create like tasks or like Jira tasks or something like that.
Because I remember at Dropbox, we had this process of, we had this internal tool, like
completely custom built back in 2011 or something like that. It would create a Slack channel, it would create a task, a Phabricator task, if you're
familiar with...
You should be familiar with that.
Yeah, we use Phabricator internally here.
We're still on it.
Yeah.
And now it's moved to creating a Jira task and all that.
So Kintaba helps with all of those things.
Is that roughly accurate?
Yeah, yeah, that's right. So I haven't seen the Dropbox tool, but I imagine Kintaba would feel
similar in that in the creation of an incident, a lot of things are happening automatically. It's
adding the right people based on the tags you've assigned to the incident. It's pulling in the appropriate people; if you're running an incident commander or an IMOC, it's grabbing that
individual. It's allowing people to subscribe who just want to stay up to date. Some people will have their notification settings set to tell them
every time an incident happens, if it's a certain severity or above, all of these things are
happening automatically. So that human action, which is I'm declaring the incident as one piece,
and then Kintaba in the background is doing all of that other work for you. In terms of connectivity
into something like JIRA, we have those types of integrations. So everything from when you create an incident, maybe a master task is built in Jira,
all the way through to a few follow-up tasks that are discovered during the incident.
You can actually file those directly in Kintaba and they get fired off to Jira
as sub to that master task. So you can go and reference back all of the follow-up tasks,
both there and in the postmortem that you eventually write. And you're dead on in terms of that comparison.
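To make that declaration-time automation concrete, here is a minimal sketch of the kind of glue being described: open a channel, page the right on-call by tag, open a parent tracking task, and notify subscribers. Every helper, name, and routing table here is a hypothetical stand-in, not Kintaba's or Jira's actual API.

```python
# Hypothetical sketch of what an incident tool automates at declaration time.
# The helpers below just print; in a real system they would call the Slack,
# paging, and issue-tracker integrations described in the conversation.
from dataclasses import dataclass, field

@dataclass
class Incident:
    title: str
    severity: int                      # 1 = most severe
    tags: list[str] = field(default_factory=list)

# Invented routing table: which on-call rotation owns which tag.
ONCALL_BY_TAG = {"payments": "payments-oncall", "web": "web-oncall"}

def declare_incident(incident: Incident) -> None:
    # 1. Open a dedicated channel so responders have one place to talk.
    channel = "#inc-" + incident.title.lower().replace(" ", "-")
    print(f"created channel {channel}")

    # 2. Pull in the right responders based on tags.
    for tag in incident.tags:
        if tag in ONCALL_BY_TAG:
            print(f"paging {ONCALL_BY_TAG[tag]}")

    # 3. Create a parent tracking task; follow-up tasks hang off it later.
    print(f"created tracker task for '{incident.title}'")

    # 4. Notify anyone subscribed to incidents at this severity or above.
    if incident.severity == 1:
        print("notifying everyone subscribed to SEV1s")

declare_incident(Incident("Checkout errors spiking", severity=1, tags=["payments"]))
```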
And I imagine that tool at Dropbox was similar to the tool we had at Facebook, which is similar to the tool that they use
at Google. And 2011, 2012 was really when a lot of this stuff was developed. That's when a lot
of those companies were implementing tools as they started to realize that the ad hoc method
of trying to deal with these problems was pretty rough as you scaled your companies up. I'm not
sure how big Dropbox was at the time. Were you in the hundreds of employees at the time, maybe tens?
Probably, I have no idea, but probably in the hundreds. Because I remember they had their Series A or Series B in October 2011. And that's when everything changed and like hyper
growth and all that stuff. And that's when the processes start to really matter, right? The
saying is, if you want to implement a process, you implement a tool, right? If you just go write it
down, maybe someone's going to read it if
you're lucky, but usually you come back to that process after the fact when it's written down,
as opposed to during the situation. So I think it might've been the Robinhood CTO who had a really
interesting talk at one point about you must have a tool to implement process, especially when your
company is scaling, because that's the only way you're going to enforce the process in a way that
doesn't require people to go read a document that they're not going to read when the process needs to be implemented. That makes a lot
of sense. And you can do some things culturally, like you can make sure that your postmortems are
blameless and that can be like a cultural thing. But in order to make sure everybody follows the
same steps when like an incident happens, it's easier to just automate it as much as possible.
It's cool. Like the cultural shift happening in a lot of companies,
when we started Kintaba, we really thought we were mostly going to be coming into companies
and they were already going to have these processes in place. And they were just looking
for a tool to replace what you were describing for a set of stitched together other tools.
And the reality is the cultural shift that's happening is really much more basic than that.
It's much closer just to the idea of we should probably file these things at all. We should probably be recording these things.
There's a gap, I think, between task management systems where you can file things all the way up
to their highest priority. And then you're sort of completely panicked moments of the entire site
is down and everything's broken. And it turns out the space in between those two things is actually
really where incident management is powerful. It's happening in your SEV2s, your SEV3s, where you want to lower that barrier in general.
File more of these things so that you're doing the diligence of going through the process of reflecting on what we lost. Maybe it wasn't the entirety of the customer base losing access to the site, but maybe it was 10%, and that's big enough for a SEV3.
And the process, it turns out, is really valuable in that gap.
So a lot of the time what we find ourselves talking to companies about is that core
statement, which is you just should be filing more of these things. You're probably having a
larger number of incidents than you realize. And if you start to file them, the advantage you're
going to get is you're going to have fewer SEV1s. You're going to catch more SEV2s, SEV3s. Netflix,
we call these near misses. The airline industry has been great at this forever, right?
Capture absolutely everything that goes even remotely wrong.
And your goal there, right, in the airline industry is to keep planes in the air and
from crashing.
But in software is keep the major outages at bay.
And I think providing the tool there for that also helps compared to just a written down
process from a cultural standpoint of saying, oh yeah, of course we file these things, and since we have a tool for it, why wouldn't I go do
it? And I actually think it's pretty similar to the revolution that task management went through
a little bit longer ago, right? Maybe even 20 years ago, when we moved task management pretty
aggressively away from the project management approach, which is this very formal top down,
here's exactly how it's going to look. And you're going to just be a recipient of that task list into more of a distributed, everyone should be creating these things.
More tasks isn't bad. This is tracking the work you have to do. And we're being more honest about
what has to happen. And this kind of merging came together, right? Of like the to-do lists that
everyone was already keeping notepad and otherwise, and the formal project management. And in the
middle there, you actually have operational task work, which is what every startup and every tech company really operates on now.
I think incident management is going through is at the very early stages of the same thing,
where we're moving from this world of like, incidents are the scariest thing at your company,
this terrifying thing that no one wants to deal with, to it's part of the day-to-day.
And if we treat it that way, then it's less likely we're going to hit
these horrible emergencies
and they're going to happen less often.
That is really interesting.
And even my first instinct was that
you probably want to introduce a tool like Kentaba,
like when you're a company that's big enough,
your post-product market fit.
And it seems like that's what you were thinking
initially as well,
but it seems like even tiny startups
are adopting tools
like Kintaba. And how would you say that? Where is the demand coming from? Because my instinct is if
I'm like three founders on a couch, it seems like a high overhead too.
Startups specifically, especially startups with live products,
even if you're three people, you're really operating in a real-time fashion at this point.
You have a lot of work that's happening in real time and you're probably
dealing with it in Slack. You're probably pinging each other and jumping into chat rooms. Maybe in
the pre-pandemic days, you were all in an office on the couch together. And I think even at that
stage, having a method to record and track these things has a degree of value. Kintaba is free for
fewer than five people, primarily because we're just trying to get
you in the motions at that point.
Does a two-person company really have to have this on day one?
Probably not.
You can probably get away with not having it.
But honestly, when you get to four or five, it starts to become valuable and you start
to force yourself through that process of all these things you were dealing with, then
taking a moment even to reflect on them. We had a piece we put out a little while ago that was called, I think,
the four-second post-mortem, right? It was this idea that if you just write something down after
the outage, the incident, the real-time emergency has happened, you're going to get more cultural
value out of it. So if you're a four-person startup and two of you are asleep and two of
you fix a problem and you write a one-sentence post-mortem on exactly what's not going to happen again. And then that emails out to everyone.
That's better than by the time those other two people wake up, that entire Slack conversation
has been pushed away. And they might, A, not know the incident ever happened, or, B, just never get
any feedback from it and fall in the same pitfalls. The revolution outside of tech that happened in
this space was 80-plus years ago, which was really just this core idea that we should ask the people who were involved in the outages what happened, because they're actually the best people to account for what went wrong from a systemic standpoint, as opposed to coming in after the fact, taking that sort of outsider's view, and saying, oh, we didn't have enough pods spun up in Kubernetes, or we didn't have our DNS fallback set up. Ha ha, I would never have made that mistake. Joe's an idiot. It's a very common way to approach it, even in a small team.
And if you get Joe to go write down the one sentence back that says, all right,
all of our assumptions about the way namespacing and Kubernetes works were wrong.
And it turns out you can never do X, Y, and Z, then even in that small company, you're going to get value.
And Kintaba doesn't really get in your way at that point, right? It's a tool where you've
probably clicked two buttons and you talked in Slack if you're that small. So you're not really
getting an additional layering on top of it. And it's one of our challenges, I think, from a marketing standpoint, that assumption that incident management is this heavy-handed thing, right?
It's the NTSB report.
It's the scary meeting you go into with the rest of the department heads.
It's all of those pieces.
And I think if that's all that incident management is, I'm not even that interested in it, right?
It's not that exciting of a space.
I think where it's exciting is when you start thinking about it as a daily practice, almost.
We've got a couple of companies using us that are 20, 30 people, and we're looking at tens of incidents a week, which
for a company historically would be a lot, but it's because they're treating these real-time
response efforts as low severity incidents. And because of that, they now have great tracking on
it. They have reporting on it. They understand it, and they're having fewer SEV1s. So I think this starts to work around four, really starts to work around
10, and then just becomes indispensable somewhere around 20 or 30. I think that makes sense. And I
think the key thing that you mentioned was it's like super low overhead, right? I think traditionally,
engineers are used to the manual work, right, of creating a doc, filling out five
whys on why something happened, and it's just a pain, communicating super manually, emailing out
at the right time, making sure your email doesn't have typos, all of those things. It seems like
Kintaba removes a lot of overhead from, and also normalizes, the entire act of creating an incident, like, I'm not scared about filing an incident anymore.
I think that's the key part. Sounds like... Yeah. And I think practicing those lower
SEV levels, like the twos and the threes is what really encourages that. And culturally,
it kicks off really the other side of this being open, making sure everyone can see the incidents.
It means as long as you've got one person in the company who gets comfortable filing these things, it just propagates out really cleanly. Everyone
else sees that and they say, oh, that's just part of how we operate. We file SEV3s, we file SEV2s.
And we don't think of them as, oh no, I'm going to go wake people up because we have our
configuration set up that people don't get woken up for a SEV3 or a SEV2. And I think that's really
what's critical is when you're thinking about building a company, even at our scale, we're small.
We're eight to 10 people at Kintaba.
And even at our scale, we use the product pretty heavily.
And it's just a culturally positive thing.
And it helps the fact that when things do go very badly, that cultural positivity propagates
forward.
And we're like, oh, it's an outage.
Those happen.
Let's deal with it quickly.
And you really can only do that through practice, right?
You can really only learn to accept failure
once you record it often enough
that you realize it's going to always be there.
I think as an engineer, right?
The worst thing that can happen as an engineer
is you join a team
and your engineering manager sits you down
and says, we don't have outages here.
We don't have errors here
because the implication they're making
is you're going to be fired if there ever is one
and you're at all involved. And it's an antiquated way of thinking about software
operations, right? It's throwing out a hundred years of learnings, especially post-World War II
learnings about humans and responsibility. And so long as your hiring bar isn't terrible,
which we all like to be really positive about our hiring bars, right? As long as your hiring
bar is pretty good, then a major problem is almost always systemic. There's something about
the context that you were placed in that caused that situation to happen. And your job as someone
involved in it is to make sure that context doesn't occur again, to record your account of
what happened. And that's, again, where this just comes into play, right? It's like, how do we make
sure we're logging these things, make sure there's awareness of them and make sure they're distributed. And none of it's complex,
right? It should sound simple. When I talk to people about the product, people don't generally
come back to me and say, ah, it's a really complex, it's just an easy idea, which can be a
barrier in and of itself, right? You can, as a company leader, you can look at these things and
say, that's obvious and easy. But I think the evidence consistently is we don't do it naturally.
Even in software, as objective as we like to think of ourselves, right? We're not. We blame people all over the place for
things that go wrong that aren't really their fault. And I think that's why this evolution,
right? From like SRE teams as a practice to DevOps as a cultural revolution to like resiliency
as a company change is so cool to me because we're expanding out the
understanding of these things that we hold dear, I think, as engineers, to the point where not just our bosses and our bosses' bosses, but even the folks running marketing, the folks running
legal, the folks running all these other departments also can embrace it. And that's really what's
critical about implementing the process and making it open. And I think that's one of the biggest things that changes when you adopt the tool as well
is the openness of it. If you as an engineering team implement incident response, you're probably
going to silo that into your engineering Slack channel, into your engineering Google Doc.
You're going to have a lot of limitation on visibility there. It's the opposite of what
you want. You want lots of openness. Yeah. I think the lack of trust really comes from the lack of clear communication, right?
Like if marketing doesn't know why the site went down, or whether this is going to happen again, that's when you start losing trust. That makes sense to me.
One question I had was around how standardized and opinionated Kintaba is. Like, how easy does it let you customize the process or everything in place? Because from what I've seen, Google and Facebook and even larger companies generally have a fairly standardized incident management process. There's SEV levels, as you've been talking about, SEV1, SEV2. I know somebody who works on an oil platform; they also have SEV1s and SEV2s. So it seems like a universal naming practice.
But then each company has its own quality metrics, right? We declare a SEV2 when the site is down
for more than 10 minutes versus like 30 minutes. So how do you configure all of these different
things? And how opinionated is Kintaba versus how freeform? So we try to be really opinionated
about the core flow. How do you define how public is an
incident that's been declared? What does the command center feel like? Where do you go to
have the communications? Does the postmortem have to be written? All of that stuff is pretty
hard-coded. You really need to go through the flow. The sort of depth of each flow is a little
bit more configurable. So for example, in the response itself, determining who are the responders
that should be automatically added, should you be calling external tooling when certain types of incidents are created that are a SEV1 versus a SEV2?
How are we categorizing?
That kind of stuff is configurable.
And then we really break it down into three or four core areas of the product, right?
There's people, there's tags for defining how you want to organize these things.
There's your SEV levels, there's rotations, on-call rotations, which are really more about roles. What role might you have inside of the product?
And then all of that gets tied together with a product called automations. And automations just
lets you do an if-this-then-that across all of those things, right? If SEV1 and inside of this tag, then make sure this on-call is added, right? If this on-call is added and it's been open for more than 20 minutes, then upgrade to SEV2, right? You can define those roles through sort of an engine that runs inside of Kintaba,
but the out-of-the-gate configuration of it works just fine as well, especially if you're small.
If you're like five, 10 people, you might not have any automations, right? You might really
just have the declaration, which propagates out and announces
in a Slack channel, and then everyone runs to that Slack channel, and that's good enough.
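As a rough illustration of that if-this-then-that idea, here is a toy rule engine. The rule conditions, field names, and thresholds are invented for the sketch; they are not Kintaba's actual automation schema.

```python
# Toy "if this, then that" automation pass over an open incident.
import time

def rule_add_oncall(incident, now):
    """If it's a SEV1 tagged 'database', make sure the database on-call is added."""
    if incident["severity"] == 1 and "database" in incident["tags"] \
            and "db-oncall" not in incident["responders"]:
        incident["responders"].append("db-oncall")
        return "added db-oncall"

def rule_escalate_stale(incident, now):
    """If a SEV3 has been open for more than 20 minutes, upgrade it to SEV2."""
    if incident["severity"] == 3 and now - incident["opened_at"] > 20 * 60:
        incident["severity"] = 2
        return "upgraded to SEV2"

RULES = [rule_add_oncall, rule_escalate_stale]

def run_automations(incident):
    now = time.time()
    return [msg for rule in RULES if (msg := rule(incident, now)) is not None]

incident = {
    "severity": 3,
    "tags": ["web"],
    "responders": [],
    "opened_at": time.time() - 25 * 60,   # opened 25 minutes ago
}
print(run_automations(incident))          # ['upgraded to SEV2']
```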
So out of the box, it's pretty opinionated in terms of best practices. And then as you're
getting a little bit more complex as an organization, you can apply those rules down
as automations. When it comes to things like sev levels out of the box, it's one, two, three.
We do support up to five. We originally didn't. We were pretty aggressive about not letting people add SEV
levels. It turned out there was a pretty good argument for some companies to have,
especially an informational SEV level, which some companies would call like a SEV five.
Some companies call it info to log these really near misses that weren't even SEV worthy,
but needed to be logged. And then some companies will even use those additional levels for things
like running game days, running a planned SEV where they want to log it and they
want to record it separately, but it needs to have its own icon. So we landed on five. We had a lot
of internal discussion and five seems to be enough for everyone. Three tends to be enough for most
people. And so we try not to do too much configuration there. The worst tools to me are
the ones where you spend weeks configuring them and you don't actually get a benefit from the
tool, because all you've done is taken your own existing opinions, which might not be best practices, and enforced them onto someone else's world. The Salesforces of the world, I don't want to knock Atlassian, but the Atlassians of the world, that's what they are. They're IT-configurable tools. And the other downside of that tends to be
you end up with kind of a class of people inside of the company who can then control the tool
and everyone else has to go to them and beg for the things that they want changed.
Whereas the advantage of being a little bit more opinionated is because there are fewer things to
configure, we also don't have to lock those down as aggressively. We don't have to say, if you're not in this group, you can't
configure things. So while even in Kintaba only admins can add people, for example, to make sure you're not adding someone from outside your organization, anyone can add and change tags. Anyone can create automations. Anyone can, I guess this is getting too deep, but, like, configure an external webhook that gets called. Those things can be configured by anyone because they should be, right? Like you shouldn't
have to go through channels to go and make sure that you get pinged whenever a sev3 gets created.
That's just a natural thing you should be allowed to do. Yeah. That whole discussion around increasing
SEV levels makes so much sense to me. Like, again, at Dropbox, what we started with was, if we're
down for a long time or if we lose any
customer data, and that would automatically page people. And then eventually we're like,
sometimes we want to notify something is a big deal or we're going to run out of capacity in
12 months. So it's not a SEV3, because that's really bad, but we still don't want to
wake people up. So then we shoehorned this fake SEV called internal SEV zero.
It's like the coming SEV.
You can see the train somewhere in the distance.
Yeah.
And I think companies deserve some degree of flexibility for sure in terms of how they
want to log these.
Facebook would actually log a lot of their pending SEVs as real SEVs.
They believed pretty strongly in saying if it is going to eventually be a SEV1, it's
a SEV1.
But even within teams there, they were a little bit different. And I think we have to provide a degree of flexibility. It's one of the challenges
of a tool like this. We want to walk that line. And I want to be out of the box easy for a small
company that doesn't want to do any configuration. But I also want to be useful for a larger company,
a thousand-plus people who are on the product a couple of times a week at a minimum, daily at a
maximum. And I want to make sure that they don't get shoehorned into something that's completely
inappropriate. So post-mortem templates are another one, right? Those are configurable
and can be changed based on the type of incident that it is. You can have a different template for
different tags. You can have a different template for different SEV levels. You just need that kind
of stuff. I've always been very envious is the wrong word, but of Asana's approach, right? When
they came into the task management space.
And he took a really clean approach to lightweight configuration, but a lot of opinion in the UI.
And I think if Kentavla can iterate continually towards that kind of ideal, I think you get products that engineers and non-engineers are willing to go and adopt without maybe necessarily having to go through a vendor selection purchase process against the service nows of the world. That brings me to another question,
which is, have you seen non-engineers file incidents?
Yeah. I think what happens when you get that barrier lower and more incidents are filed,
the more people can see them is you start to imagine as a non-engineer the things you could
potentially also use this for.
And if you're tagging appropriately, you can create different populations that are alerted for different types of incidents. So a natural one was one of our larger companies had a SEV1
during the Texas power outages. They had a large portion of their user population,
or sorry, employee population in Texas who had lost power. And this isn't an engineering problem, right? Your engineers can't log in, but really it's an HR, people-safety problem. And it was really the HR team pushing
to go and file that incident. And I think when it first happened, that company was a little
surprised. I remember we got some feedback from them that was like, hey, we filed an incident for
this. Is that okay? And we're like, yeah, of course. You can file incidents for anything.
Your marketing team can file an incident when, or your PR team can file an incident when a bad article comes out. The process is the same.
You're bringing a team together, you're dealing with it. And once it's responded to, you're
probably writing a postmortem. How did we get into such a bad relationship with the Washington Post
that they wrote that terrible article about us? I don't think there's anything about incident
management at its core process level that only works for engineering. I just think the engineering community has embraced it the most because it's, for lack
of a better word, the biggest fire.
It happens consistently in manageable ways that they really want to report on.
What becomes challenging when you get out of the engineering org is metrics start to
become a little bit more difficult.
I think Dave Rensin over at Google has a talk that he gives about how marketing departments could adopt SLOs and SLAs.
And I think it's a really interesting talk as an engineer, but I think the reality is that
the marketing department doesn't want to be measured on a lot of that stuff. So they want
the core process because it's valuable. But imagining how you might do metrics against them
forces a lot of those organizations to push back and say, we don't really want to be tracked on this. This isn't one of the ways we want our team to be measured.
So I think that comes from hopefully making things more lightweight and making it more
about process adoption for those types of teams. But even engineering outages deserve to have
non-technical folks involved, right? If you have an outage, you probably need your PR person,
your marketing person. If you're a B2B company, your salesperson, they all probably need to know
what's going on. And the worst thing you can have is all of them shooting emails out to the
eng manager or the responder themselves saying, like, what's going on? What's going on? Is it
going to be fixed? What's going on? And I think the reason people do that is we have this natural
fear that the people we work with don't really understand how we're impacted by things.
There's like an outage. And in the engineering community, we're pretty aware of the fact that if someone stopped typing,
they're probably doing something important, working on the problem. But in the sales world,
they don't know that and they don't see the conversations you're having. So their impression
is, well, maybe they're just not going to work on this till tomorrow. And they don't realize that
it's having an impact on my ability to do my job. And so just making all this stuff more open just
cuts down the overhead
communication because they can see the channel, they can see the conversations, they can see
what's going on. And you would think like the risk there would be, okay, now those channels just get
super noisy. But in reality, that's not what happens. Like what most people want is awareness
of the response more than they really need to know like the details of every single action being
taken. So given access to the
channel and seeing that level of depth and conversation happening gets 90% of people
to back off and let it continue to flow forward. Yeah. As long as they understand what's
going on. And yet to your point, yeah, sales might be doing a demo of the product while the site's
going down. I've certainly seen potential issues or near misses on things like that. So yeah, people want to just be aware of what's going on.
And I think that can impact priority, right? Like you might have a SEV2 and a B2B company
because you think it's not affecting anyone important and then find out that it's impacting
an in-process sales pitch to a company that's going to make your entire sales book for the
rest of the year. And suddenly, maybe that's a SEV1. Maybe you actually need to redefine what your timeline is going to be.
And I think that's okay. Kintaba takes an audit approach to a lot of this stuff. Anyone can change the SEV level, but it records that you did it. So you'll eventually have to back up why you changed it. If you went into that channel and you made it a SEV1, you can't just willy-nilly
do it and then hope no one will notice. So I think audit logs are really powerful at preventing people from taking nefarious actions. You think a little bit about it, because it's going to log the change.
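A minimal sketch of that audit-log idea, with invented field names: anyone can change the severity, but every change is appended to a record of who did it and when.

```python
# Anyone can change the SEV level, but the change is always recorded.
from datetime import datetime, timezone

audit_log = []

def change_severity(incident, new_severity, changed_by):
    audit_log.append({
        "incident": incident["id"],
        "old": incident["severity"],
        "new": new_severity,
        "by": changed_by,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    incident["severity"] = new_severity

incident = {"id": "INC-42", "severity": 2}
change_severity(incident, 1, "sales@example.com")  # escalate freely...
print(audit_log)                                    # ...but it's on the record
```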
But yeah, I think in general, these teams benefit just as much as eng teams do inside
of companies.
And I think companies like Facebook and Google, there's a lot of public awareness inside of
the company of what's going on, all the way to the point where
you can almost respond if there's a conversation happening somewhere else and say, oh yeah,
there's a SEV1 for that. And that just ends the conversation. It's okay. It's already recognized
as the most important thing. And you and I can talk about this and we know what a SEV1 is,
we're aware of it. But at a company that's just adopting this stuff, the whole concept is new
to non-engineers especially. And it's a really
healthy concept to understand. Oh, there's something more important than an important
task. There's this other thing called a SEV1. And the minute I know that exists, I have a
terminology now to use for when my eng team is panicking and working as fast as they can.
And I don't just have to convert that into, why don't you realize it's an emergency? I think
that's pretty awesome. You're introducing vocabulary along with culture.
Yeah.
I think the nice thing about that terminology is that everybody understands that a SEV1 is probably bad, even though some companies might have the numbers upside down.
I think it's funny you had zero actually at Dropbox.
I didn't realize that.
I've always been a huge anti-proponent to the zero.
Yeah.
I feel like it's an excuse.
Oh, there's one more.
Like we didn't have
emergency enough
to have a bigger emergency.
I think the idea was
that at a SEV0,
it's like you've lost
customer data,
which is basically like
close to if it's really bad,
it could be like
a company ending event.
Yeah.
So what are those things
that could be like
a company ending event?
Oh, we have an availability
hit for 30 minutes
and we need to pay back
our enterprise
customers because we breached our contract and stuff. But I think in the end, these things, SEV0s, they end up getting overused as well. But that's a conversation, I feel, for another time.
I think Microsoft had a priority level zero, they had a P-0 that they introduced at some
point because P-1 wasn't low enough. Facebook has a concept of unbreak now, which is basically your real-time task kind of fill-in. I think there's a really
interesting conversation at these companies. Where does the one end and the other begin?
Where does unbreak now and then incident response begin? And I think it's really just about
incident responses tend to require teams to solve and tasks tend to be individual-based.
I don't know if Dropbox ran the same way,
but it was this idea that someone always owns the task and they're really the one doing the work and
getting it fixed and it gets passed if someone else is doing it. But with an incident, you
actually can have a response team. You can have people doing research. You can have people fixing
one part of the system, people fixing another part of the system. And they represent like the
various arms generally of your engineering infrastructure.
Yeah. Yeah. We generally have a TLOC, or like a technical leader on call or something, and they would be the main person just deciding what to do, but then the tasks get farmed out to other
people. And the nice thing about SEV0s and like SEV3s and all of that was like, yeah, the boundaries
were not super clear, but everybody knew that a SEV1 was more important than a SEV3. And by the end of it, internal
tools, teams, and all of that were not
allowed to file SEV1s or SEV0s.
We can't have
something that's not user-facing
actually be a SEV0.
And the flip side was, now we know that
a SEV0 is really important, you need to
make sure your action items are done within 30
days and all of these other things.
Which actually I think was really important and really useful because these things can so
easily be forgotten, especially when the organization grows like super big and some team might just
not be completing their tasks on time.
And we might have a repeat of the outage, and then there's a code yellow and all of that.
Like you need all of this process, especially when you're like,
And then you get the log too, right?
So when review time comes up, especially in a larger
company, you can point back to these things and say, yeah, I was on the response team to these
major incidents that happened. And you'll notice they never happened again. It's a good thing to
be able to point back to. Again, it's moving culturally away from a fear to file into almost
not an excitement, but an encouragement to file.
Like maybe it didn't really happen if you don't file it.
And so I think that that's really one of the things we like to see.
Like when Kintaba is working, and I try not to say it this way, but it's almost like your incident count goes up, right?
It's like you think about, I'm going to implement an incident management solution.
And what do I expect?
Well, my incident count will go down.
And that's not actually what you're trying to do. The reality is these things are happening and you're
not logging them. And so what will be perceived to happen is your incident count goes up, which
means every time that happens, your learning goes up, your knowledge goes up, your ability to
respond gets better and your acceptance of the process improves. And so we're working like really
hard to figure out in the product, how do you celebrate that? It's a hard thing to celebrate.
Like, yay, more SEV3s, right?
But it's actually really good.
The airline industry has a really interesting history of rewarding pilots who file more
incidents.
And it's a hard line to walk, right?
Because you're asking for abuse at some point.
You're like, oh, every little thing gets filed just for...
Or maybe I break something on purpose so I can file more of these.
But there's almost an incentive for organizations to say, more of these are better. And the result of that will be fewer catastrophes, fewer horrible
major problems. And that's hard to see immediately. You have to believe that,
and then you have to see it for a little while. And then the results come out long-term.
But you just get powerful impacts on numbers of major outages, as well as just a cultural impact
of positivity around the natural progression of a company. I think companies, especially startups,
fall apart because of incidents. And they often fall apart, especially with younger founders.
I know when I was in my early 20s and running my first company, our company almost fell apart a
huge number of times because of the fights we got into because of outages. And I think if we'd been
a little bit more exposed to the reality of the world before that, and worked on a few more
high-speed unicorn rapid growth companies, none of that would have come as a shock.
And I think tools like this, we're talking again about smaller companies, help those companies
accept, oh, we have a tool for it, so it must not be that uncommon. And it gets you a little bit
over that, you screwed this up, I screwed this up, right? Like your context in a rapidly growing company is we're moving really fast. We're growing really
fast. We're making mistakes, right? The learnings are going to be less about this co-founder needs
to go. And they're going to be more about, okay, we moved too fast over here and we need to change
the way we move quickly on that one piece of the product. So maybe we don't need to push every hour
into the live site. Maybe it should be every day. Things like that, I think, are the learnings you can get that are just healthy outcomes,
especially in the post-coronavirus world where we can't even see each other during these outages,
which just makes everything harder because you're trying to be blunt and quick and aggressive in
these chat channels. And you end up coming off as angry, unhappy, and miserable. And you don't
always want to hop on Zoom. And so I think we're
only going to see this propagate, I hope, as we move more and more towards remote work, especially
where you're going to have that defined tool process because you're not going to all be
standing in a room yelling at each other. It reminds me of a product called Git Prime,
where eventually, once Kintaba has millions of customers, you can show new customers: on average, a company in your industry of your size has six incidents a
month. And this is where you stack rank compared to other companies. I'm saying this half as a joke,
but yeah, you have a lot of opportunity to educate people on this is how things work out. And the
more I think about it, it seems like you can also help guide people.
Like maybe people don't know
whether to file it as like a SEV1 or a SEV2.
I know that this was a problem at Dropbox.
Like, should I file this as SEV1 or SEV2?
And it actually gave you this workflow
where it would ask:
is there an availability hit?
How long has it been going on for?
Does it affect actual customers
or just internal customers?
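A sketch of what that kind of guided triage might look like; the thresholds and the mapping to SEV levels here are invented for illustration, not Dropbox's actual rules.

```python
# Answer a few questions, get a suggested SEV level (illustrative thresholds).
def suggest_sev(availability_hit: bool, minutes_ongoing: int,
                affects_external_customers: bool) -> int:
    if availability_hit and affects_external_customers and minutes_ongoing >= 30:
        return 1   # prolonged, customer-facing outage
    if availability_hit and affects_external_customers:
        return 2   # customer-facing, but caught early
    return 3       # internal-only or degraded: still worth filing and learning from

print(suggest_sev(availability_hit=True, minutes_ongoing=45,
                  affects_external_customers=True))   # 1
```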
And it would automatically decide the SEV level for you. So it's also a way to educate people: this is what a large incident at our company means.
So yeah, we do configurable hints right now in Kintaba. When you're picking a level, the explanation of what that level is, is configurable by the company. But I think you're dead on, because we get more data into the product, right? Like, I think AI ops is a really popular conversation term these days.
And I think where AI machine learning, where those things are really interesting to me
is that kind of stuff, right?
It's how often does this tend to happen?
What kinds of incidents have happened like this before across your industry?
Should you be surprised that this piece of infrastructure was affected?
And so now it turns out, within engineering teams on AWS, this piece of infrastructure is generally responsible for the problem.
That kind of stuff's really cool from an AI ML standpoint. Where I cringe and I see companies
chasing after AI ML is when they're saying, it will come in and solve these problems for you.
It's like, great, we're going to run an AI system kicking off your scripts. That'll kick
off script A, B, C, D, depending on what it thinks is
happening. And I think just a lot of what AI has shown us over the last 10 years is rather than
putting AI in front of that, you probably ought to just run those scripts before you file the
incident in the first place. And if it doesn't work, get a human involved. We all want AI to
be human-like, and that's not yet what AI is good at. AI is good at repeating things that have been
done in the past and propagating forward everything from the good to the biases. Recruiting is the
famous example of where you really don't want to use AI ML because all you're going to do is get
more of exactly who you have. And that's a quick path to have no diversity. And I think with
incident response, it's similar, right? If you're practicing it properly, your incidents are black
swans, right? These things aren't necessarily predictable. And if they were predictable,
you should have caught them before they ever hit your incident system. Like at this point,
you're getting people, humans are coming in and they're fixing the problem. And then they're
going to make changes to those automated systems after the fact. So you don't get another incident
again. I think wrapping all these things into your incident management system starts to turn that incident response system into just a day-to-day ops system that really that piece
could just be automated out. So I think it'll be very interesting to see the direction a lot
of this stuff goes. I know PagerDuty recently bought Rundeck, which is making scripted
actions easier to access for individuals who are responding to incidents. And we watch that very closely in terms of, okay, what's that? How are they going to implement that?
Does that really fit in?
Or does that just become more overhead?
So if you're implementing this thing,
it's another layer that you've got to configure
before you can get these basic values out of the product.
Yeah.
AI seems definitely like a stretch to me,
but again, my perspective is probably just jaded
from larger companies.
I can give you, again, another personal example.
So I was on call for Dropbox.com and there were a lot of things that could go wrong,
right?
For like a site that's like fairly large.
So the only thing we really had to work with was this triage dashboard.
As soon as we get an alert saying availability is down, we have to look at this massive dashboard
and try to go through every graph
and see where is the availability hit from.
And that would take a lot of time.
So the only real script that we wanted to run was: hit our metric system and print out all of the metrics that are deviating from a baseline.
This is supposed to be like 99%,
but it's like 95.
So I see like a lot of value in something like,
just give me more information
when like an incident is going wrong. But maybe the answer where there is the observability tool
needs to show that to us rather than... Yeah, I think there's an obvious intersection there,
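As a rough illustration of that kind of triage script, here's a minimal sketch that flags metrics sitting below an expected baseline. The metric names, baselines, and the fetch_current_metrics helper are made up for the example; a real version would query whatever metrics system the team actually runs.

```python
# Hypothetical baselines: metric name -> expected value (availability percentages).
BASELINES = {
    "web.availability": 99.0,
    "api.availability": 99.5,
    "db.read_success_rate": 99.9,
}


def fetch_current_metrics() -> dict[str, float]:
    """Stub; a real version would call the metrics system's API."""
    return {
        "web.availability": 95.2,  # stubbed values for the sketch
        "api.availability": 99.6,
        "db.read_success_rate": 99.9,
    }


def deviating_metrics(tolerance: float = 0.5) -> list[str]:
    """Return a line for every metric sitting below its baseline."""
    current = fetch_current_metrics()
    report = []
    for name, baseline in BASELINES.items():
        value = current.get(name)
        if value is not None and value < baseline - tolerance:
            report.append(f"{name}: expected ~{baseline}%, currently {value}%")
    return report


for line in deviating_metrics():
    print(line)
```

Printed into the incident channel, a report like that replaces scanning the whole dashboard by hand.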
Yeah, I think there's an obvious intersection there, right? The observability tools and the incident response tools. They're both real-time tools. They're both designed for "I'm investigating something." And I think if you go back eight,
nine years, observability wasn't
even a space, right? We were like, oh, this is metrics, isn't it? And it took a while to figure
out that there's real-time opportunity. So I do think there's a really interesting intersection
there. And I think today they really do live as separate tools. You're using something like
Honeycomb and you're using something like Kintaba. They work next to each other really cleanly,
but it's a pretty natural integration that you would eventually have.
And you have to be careful with those integrations, because what you don't want to do is just pile information into your high-signal response channel. The worst thing you can do is dump things in there automatically. But there are situations like the one you're describing, where depending on the infrastructure piece being affected, the severity level, or the tag, you probably do always need a certain chart. You'll grab that chart and inject it in just so everyone's got it, if for no other reason than that you might want a record of that chart at that point in time, even if it wasn't affected.
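To make that concrete, here's a minimal sketch of that kind of rule: when an incident is filed with a matching severity and tag, a snapshot of the relevant chart gets posted into the response channel. The rule table, the chart URLs, and the post_to_incident_channel helper are all invented for the example rather than drawn from any real product.

```python
# Hypothetical routing rules: (severity, tag) -> chart to inject into the incident channel.
CHART_RULES = {
    ("SEV1", "database"): "https://dashboards.example.com/db-latency.png",
    ("SEV1", "network"): "https://dashboards.example.com/edge-traffic.png",
    ("SEV2", "payments"): "https://dashboards.example.com/checkout-errors.png",
}


def post_to_incident_channel(incident_id: str, message: str) -> None:
    """Stub; the real version would call your chat or incident tool's API."""
    print(f"[{incident_id}] {message}")


def on_incident_created(incident_id: str, severity: str, tags: list[str]) -> None:
    # Inject only the charts the rules say are always relevant,
    # so the response channel stays high-signal.
    for tag in tags:
        chart_url = CHART_RULES.get((severity, tag))
        if chart_url:
            post_to_incident_channel(
                incident_id,
                f"Snapshot of the {tag} dashboard at time of filing: {chart_url}",
            )


# Example: a SEV1 tagged "database" gets the latency chart attached automatically.
on_incident_created("INC-123", "SEV1", ["database", "frontend"])
```

Keeping the rules narrow is what preserves the channel as a place for signal rather than noise.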
I always say the scariest emergencies are the ones where everything looks okay. It's not so bad when the chart's gone
red. You can at least have somewhere to start. The worst thing in the world is your dashboards
are all green, but your customers are telling you they can't access the site. Even worse is when you can
access it and no one else can. You end up turning Wi-Fi off on your phone and you're trying to
connect on LTE to see if it's something about the network. Those are the scary ones. And I think in
those situations, those are the perfect examples of incident response. They're so fuzzy that when
you do find the reasoning for
the problem, it's going to be this interesting and novel thing. It's not just going to be,
I needed to expand the memory or I didn't turn auto-scaling on. It might be, but more likely,
it's going to be something weird. It's going to be something like my auto-provisioners
ran out of characters. Yeah. Like file descriptors. We ran out of file descriptors
on this one machine and that caused a cascading problem. It turns out you can't name an internal user www or it overwrites your internal systems for www. That actually happened. That kind of stuff is just novel. And I think, coming back to
the original conversation we had about like postmortems and publicity of these write-ups is
that everyone wants to know. There was a Robinhood outage that was pretty famous, right around daylight saving time. And I remember everyone on Hacker News was just making fun of them and saying, ha ha, they forgot about daylight saving time,
but what a bunch of fools. And I can't remember if they publicly wrote the postmortem or if someone
gave a talk about it, but it turned out to be some completely different thing. It had something to do
with namespacing and auto-provisioning. And it just happened to happen right around daylight saving time.
And I think that's why it's so cool because the fact that we all jumped to the conclusion
that it's this obvious thing makes it all the more critical that someone, especially
within that company, propagates out the information about that's not what happened.
It was this completely different problem that we maybe had no system in place to detect.
The other ones are natural disasters, which are pretty tough. Your data center goes dark, something horrible happens like that. And it's, okay, my dashboard's not just red, it's off.
And so the space is fascinating to me.
We get to see all kinds of cracks in people's day-to-day lives.
And we're all running around as firefighters.
So I think the tools that have to exist here have to be practical tools for everyone to use.
They can't be specialist tools.
And I think that's really the fun opportunity in the space to me.
And I think the culturally valuable one.
And the one I was upset didn't exist.
We went out pretty aggressively to the market and looked.
We thought, surely PagerDuty has solved this.
Surely Atlassian has solved this.
And the tools are just complex.
And there are lots of fields to fill out and they just feel like more work.
And that's not a good way to deal with outages. That's like, your goal is to get your firemen on
the ground, not to give them more forms to fill out. It's a joke in policing, right? They'll let
you go if there are too many forms to fill out. Like you want to get the forms out of there. You
want to just let people do their jobs. Yeah. I can imagine a world, maybe five years from now, where Kintaba will have a one-button publish to web, so you can publish a postmortem publicly and then post it on Hacker News with another button.
And I totally think the first step to that is just get more incidents recorded inside
of companies to encourage them to write these things at all.
But you're totally right.
Like the more public publishing that happens, the better, especially where it benefits companies. I think
I trust companies that write their postmortems publicly more than I trust companies that don't.
Cloudflare has let me down before as a service provider, and when it happens, I think of them as really weak because of it. And then they write the postmortem and I realize the depth of their technical knowledge. I read it over and I think, okay, I couldn't have handled that, so it's still better for them to be the ones in charge.
And it's that opportunity, right? Even if you're not publishing a postmortem,
just being public about incidents, I think is your opportunity to show your expertise.
That's how people know whether you deserve to be a steward of data or a provider of a service: the way you deal with major outage situations, and whether you can give an account of them. There's a... Sidney Dekker has a chapter in his book about accountability,
where one of the early pushbacks against incident response or blameless approaches to incident
response was the idea that we're not really holding people
to account. And if you dig into that word, being accountable means, among other things, that you should be providing the account of what happened. And once that happens, you almost always realize that it has very little to do with the person and an awful lot to do with the system. And the system, as a company owner, is yours. And I really like that idea.
You're always pushing the problem
outwards to the external factors. And you're saying, this wasn't the subordinate of the
subordinate inside of a hierarchical organization. That's not what happened. What happened was
this company process caused blah, blah, blah. And the minute you do that, you've actually pushed
the accountability across the chain, right? Because we all agree
to processes. We all build those. We all built the company that allowed that to happen.
And that's the right way to do it. If nothing else, it softens the blow, because we all take it on
as a learning and we're all responsible. Every time an airplane crashes, right? The entire industry
treats that as something they need to be aware of and deal with. And why shouldn't companies,
right? Every time Dropbox.com goes down, that had the potential to take your whole company down.
It doesn't matter if your job is a technical writer or a PR person, you probably care an
awful lot about that response and whether or not maybe you in some tiny way were responsible for
the process that allowed it to happen. And being open to that learning is just powerfully healthy
from a cultural as well
as a company building standpoint.
Yeah, that makes sense.
And maybe this is a good last question,
which is what's your plan
or what's your approach been
to educate the industry
about adopting tools like Kintaba?
So we're working really hard to get more and more of what I call self-service, bottoms-up adoption as an organization. We really want small and up-and-coming companies to be able to adopt it without having to go and hire a consultant to come in and teach them the cultural pieces. So this is really a UX problem throughout the product. It's not about putting you through a training course. It's about making each page that you hit, each screen that you encounter, informative so you can learn as you go.
Let's file incidents always.
How do we make that more accessible?
Where do we put it in Slack?
How do we make the bot work?
How do we encourage it more often?
How do we give you more exposure to the ones that have been filed?
How do we give you an idea of how often they get filed so you don't feel like you're running
outside of the norm?
That's the core stuff that the product has to work pretty hard at and that we've been
iterating on pretty aggressively, and we continue to. The other part of it is that there's more and more personification of the product happening. In the beginning, the tool was very tool-like: here's your tool, it disappears and you use it.
And in incident response at a company like Facebook or Dropbox, you'll have either an IMOC, an incident manager on call, or an incident commander, and it tends to be a trained role, right? You went through formal training to learn this stuff, and you place that person in every incident regardless of whether it's their problem, if nothing else to make sure everyone's acting correctly. You can help adjust
sev levels, you can help move things around. We used to call this the responsible adults in the room. I really do start to feel like the future is almost the
elimination of that role, right? The tool should take on that role of helping you understand how
do I file? When do I file? What do I file? Where do I file? When do I close it? Et cetera, et cetera.
Because otherwise, just like you said about how you teach people, you have to train. And if the barrier to the tool is a training class,
it just ends up being this thing that only large organizations will ever get to really adopt.
And so what will really make this work for every company is partially on us from a feature and UX standpoint, but it's also tool crafting and messaging, everything from the marketing page to how you talk about your own incidents and product, to these conversations we're having right now. I hope that the entire tech world is adopting this cultural feeling that we ought to be doing incident management, right? That's the
beginning. And then you propagate forward into the tool and beyond. To me, these things are all
product problems, which is why we build a product in this space. Because when I think about iteration, I think less about how do
I go and own everything your SRE team does, which is a very natural way to think about a product
like this. And instead, I like to think about it from the standpoint of how do we make your
marketing person participate in incidents? Because if I can get your marketing person,
who's never even heard of incident management, to do this, then a small company of engineers who maybe also haven't practiced it
in the past can also adopt it overnight. So it's really about who uses it. And I reference Asana
a lot because I'm really amazed by that product in general. But I think it's similar in that
the rest of the world had to-do lists and we had project management and there's that middle ground.
And the only way you do that is by building a really beautiful product that helps people
understand.
I think incident management is the same way.
Marketing teams already have somewhere they run to when there's a fire, but they don't actually
have a tool to go and make this work consistently.
I think you're working on a really important and exciting goal.
And if more companies had the product, I would enjoy working at those companies more, for sure.
So that's the dream.
It should be a required tool, right?
Imagine showing up at a company
that doesn't have
a task management system.
Or like code review at this point.
Yeah, or code review.
Yeah, it's like, okay, you're way out of date. That's the response.
So I feel like we're really in the early
days of this. Google published the book in what, 2016? The SRE book.
And that was the SRE handbook, which was really the first time they publicly talked about
incident command. And I think we're refining that now into what does it look like for non-Googles
to have the same kind of effectiveness without having to maybe do that heavy lift of a
full command system. Yeah, I am excited to read about how smaller companies can translate some of the learnings from the Google SRE book and take the most important ones that are relevant for themselves. I don't think there's a smaller version of the Google SRE book out there, and if somebody can drive that, it would be amazing. Yeah, I think there are a couple of okay resources, but you're right.
Actually, Gremlin, if you're familiar with them in the chaos engineering space, has a really interesting write-up on running a high-severity incident management process. It's the write-up, not the tool, but it's pretty cool. So I definitely recommend other folks check that out.
Yeah, I'll take a look at that. I'm gonna read all the other links you send me. And
thank you for being a guest. I think I've learned a lot this episode. Yeah, thanks for having me. I think it's a great topic, and I'm a big fan of the show, so I'm excited to be part of it. Thank you so much.