Screaming in the Cloud - Incidents, Solutions, and ChatOps Integration with Chris Evans
Episode Date: July 7, 2022
About Chris
Chris is the Co-founder and Chief Product Officer at incident.io, where they're building incident management products that people actually want to use. A software engineer by trade, Chris is no stranger to gnarly incidents, having participated in (and caused!) them at everything from early-stage startups through to enormous IT organizations.
Links Referenced:
incident.io: https://incident.io
Practical Guide to Incident Management: https://incident.io/guide/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
DoorDash had a problem.
As their cloud-native environment scaled and developers delivered new features,
their monitoring system kept breaking down.
In an organization where data is used to make better decisions about technology and about the business,
losing observability means the entire company loses their competitive edge.
With Chronosphere, DoorDash is no longer losing visibility into their application suite.
The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution
that gives the observability lead at DoorDash business confidence and peace of
mind. Read the full success story at snark.cloud slash chronosphere. That's snark.cloud slash
C-H-R-O-N-O-S-P-H-E-R-E. Let's face it, on-call firefighting at 2 a.m. is stressful. So there's
good news and there's bad
news. The bad news is that you probably can't prevent incidents from happening, but the good
news is that Incident.io makes incidents less stressful and a lot more valuable. Incident.io
is a Slack-native incident management platform. It allows you to automate incident processes, focus on fixing the issues,
and learn from incident insights to improve site reliability and fix your vulnerabilities.
Try Incident.io to recover faster and sleep more. Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest is Chris Evans, who's the CPO and co-founder
of Incident.io. Chris, first, thank you very much for joining me. And I'm going to start
with an easy question. Well, easy question, hard answer, I think. What is an Incident.io exactly?
Incident.io is a software platform that helps entire organizations to respond to,
recover from, and learn from incidents. When you say incident, that means an awful lot of things.
And depending on where you are in the ecosystem in the world, that means different things to
different people. For example, oh, incident, are you talking about the noodle incident? Because
we had an agreement that we would never speak about that thing again, style, versus folks who are steeped in DevOps or
SRE culture, which is, of course, a fancy way to say those who are sad all the time, usually about
computers. What is an incident in the context of what you folks do? That, I think, is the killer
question. I think if you look at organizations in the past,
I think incidents were those things that happened once a quarter, maybe once a year,
and they were the thing that brought the entirety of your site down because your big central
database that was in a data center sort of disappeared. The way that modern companies
run means that the definition has to be very, very different. So most places now rely on
distributed systems, and there is no sort of binary sense of up or down these days.
And essentially, in the general case, like most companies are continually in a sort of state of things being broken all of the time.
And so for us, when we look at what an incident is, it is essentially anything that takes you away from your planned work with a sense of urgency.
And that's the sort of the pithy definition that we use there.
Generally, that can mean anything. It means different things to different folks. And like when we talk to folks,
we encourage them to think carefully about what that threshold is. But generally for us at
Incident.io, that means basically a single error that is worth investigating, that you would stop doing your backlogged work for, is an incident. And also an entire app being down, that is an
incident. So there's quite a wide range there. But essentially, by sort of having more incidents and lowering that threshold, you suddenly have a heap of benefits, which I can go very deep into and talk for hours about.
It's a deceptively complex question.
When I talk to folks about backups, one of the biggest problems in the world of backup and building a DR plan, it's not building the DR plan, though that's no picnic either.
It's, okay, in the time of cloud,
all your planning figures out,
okay, suddenly the site is down.
How do we fix it?
There are different levels of down,
and that means different things to different people,
where, especially the way we build apps today,
it's not, is the service or site up or down,
but with distributed systems, it's how down is it?
And, oh, we're seeing elevated
error rates in the U.S. Tire Fire 1 region of AWS. At what point do we begin executing on our disaster
plan? Because the worst answer in some respects is every time you think you see a problem, you start
failing over to other regions and other providers and the rest, and three minutes in, you've
irrevocably made the cutover, and it's going to take 15 minutes to come back up, and oh yeah, then your primary site comes back up,
because whoever unplugged something plugged it back in, and now you've made the wrong choice.
Figuring out all things around the incident, it's not what it once was. When you're running your own
blog on a single web server, and it's broken, it's pretty easy to say, is it up or is it down?
As you scale out, it seems like that gets more and more diffuse.
But it feels to me that it's also less of a question of how the technology is scaled,
but also how the culture and the people have scaled. When you're the only engineer somewhere,
you pretty much have no choice but to have the entire state of your stack shoved into your head.
When that becomes 15 or 20 different teams of people in some cases, it feels like it's almost less of a technology problem than it is a problem of how you communicate and how you get people involved and the issues in front of the people who are empowered and insightful in a certain area that needs fixing.
100%. This is like a really, really key point, which is that organizations themselves are very complex.
And so you've got this combination of systems getting
more and more complicated, more and more sort of things going wrong and perpetually breaking,
but you've got very, very complicated information structures and communication throughout a whole
organization to keep things up and running. The very best orgs are the ones where they can engage
the entire sort of every corner of the organization when things do go wrong. And I lived and breathed this firsthand at various different previous companies, but most recently at Monzo, which is a bank here in the UK. When an incident happened
there, like one of our two physical data center locations went down, the bank wasn't offline,
everything was resilient to that, but that required an immediate response. And that meant that
engineers were deployed to go and fix things, but it also meant the customer support folks might be
required to get involved because we might be slightly slower processing payments. And it means that risk and compliance folks might need
to get involved because they need to be reporting things to regulators. And the list goes on.
There's like this need for a bunch of different people who almost certainly have never worked
together or rarely worked together to come together, land in this sort of like empty space
of this incident room or virtual incident
room, and figure out how they're going to coordinate their response and get things back
on track in this sort of most streamlined way and as quick as possible. Yeah, when your bank
is suddenly offline, that seems like a really inopportune time to be introduced to the database
team. It's, oh, we have one of those. Wonderful. I feel like you folks are going to come in handy later today. You want to have those pathways of communication open well in advance of
these issues. A hundred percent. And I think the thing that makes incidents unique is that fact.
And I think the solution to that is this sort of consistent level playing field that you can put
everybody on. So if everybody understands that the way incidents are dealt with is consistent: we declare it like this, and under these conditions, these things happen. And if I flag
this kind of level of impact, we have to pull in someone else to come and help make a decision.
At the core of it, there's this weird kind of duality to incidents where they are both
kind of semi-formulaic in that you can basically encode a lot of the processes that happen, but equally
they are incredibly chaotic and require a lot of human impact to be resilient and figure these
things out because stuff that you have never seen happen before is happening and failing in ways
that you never predicted. And so this is where Incident.io plays into this, is that we try to
take the first half of that off of your hands, which is we will
help you run your process so that all of the brain capacity you have, it goes on to the bit that
humans are uniquely placed to be able to do, which is responding to these very, very chaotic sort of
surprise events that have happened. It feels as well, because I played around in this space a bit
before. I used to run ops teams and more or less, I really should have had a t-shirt then
that said, I am the root cause.
Because yeah, I basically did a lot of self-inflicted outages
in various environments.
Because it turns out I'm not always the best with computers.
Imagine that.
There are a number of different companies
that play in the space
that look at some part of the incident lifecycle.
And from the outside, first, they all look alike.
Because it's, oh, so you're incident.io.
I assume you're PagerDuty.
You're the thing that calls me at two in the morning
to make sure I wake up.
Conversely, for folks who haven't worked deeply
in that space as well of setting things on fire,
what you do sounds like it's highly susceptible
to the Hacker News problem where, wait,
so what you do is effectively just getting
people to coordinate and talk during an incident? Well, that doesn't sound hard. I could do that in
a weekend. And no, no, you can't. If this were easy, you would not have been in business as long
as you have, have the team the size that you do, the customers that you do. But it's one of those
things that until you've been in a very specific
set of problems, it doesn't sound like it's a real problem that needs solving.
Yeah, I think that's true. And I think that the Hacker News point is a particularly pertinent
one in that someone else sort of in an adjacent area launched on Hacker News recently. And the
amount of feedback they got around, you know, you're a Slack bot, how is this a company, was kind of staggering. And I think generally where that
comes from is, well, first of all, that bias that engineers have, which is just everything you look
at as an engineer is like, yeah, I could build that in a weekend. I think there is often infinite
complexity under the hood that just gets kind of brushed over. But yeah, I think at the core of it,
you probably could build a Slack bot in a weekend that creates a channel for you in Slack and allows you to post somewhere that...
Oh, good.
More channels in Slack.
Just what everyone wants.
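[A hedged aside for readers, not part of the conversation: the "weekend version" being riffed on here, a bot that creates a channel and posts into it, really is only a few lines against the Slack Web API. The sketch below assumes the slack_sdk Python client; the token, channel naming scheme, and helper name are invented for illustration, and it is nothing like the full product being discussed.]

```python
# Minimal sketch of the "weekend Slack bot": create an incident channel, post a kickoff message.
# Assumes a bot token with channels:manage and chat:write scopes; names are placeholders.
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")  # placeholder token


def declare_incident(slug: str, summary: str) -> str:
    """Create a dedicated incident channel and post the initial summary into it."""
    channel_id = client.conversations_create(name=f"inc-{slug}")["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident declared: {summary}",
    )
    return channel_id
```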
Well, there you go.
I mean, that's a particularly pertinent one because our tool does do that.
And one of the things...
So I built at Monzo a version of incident.io that we used at the company there.
And that was something that I built evenings and weekends.
And among the many, many things I never got around to building, archiving and cleaning up channels
was one of the ones that was on that list.
And so Monzo did have this problem
of littered channels everywhere.
I think that's sort of like part of the problem here
is like it is easy to look at a product like ours
and sort of assume it is this sort of friendly Slack bot
that helps you orchestrate some very basic commands.
And I think when you actually dig into the problems
that organizations above a certain size have, they are not solved by Slack bots. They're solved by platforms that
help you to encode your processes that otherwise have to live on a Google Doc somewhere, which is
five pages long. And when it's 2am and everything's on fire, I guarantee you not a single person reads
that Google Doc. So your process is as good as not in place at all. That's the beauty
of a tool like ours. We have a powerful engine that helps you basically to encode that and take
some load off of you. To be clear, I'm also not coming at this from a position of judging other
people. I just look right now at the Slack workspace that we have at the Duckbill group,
and we have something like a 10 to 1 channel to human ratio. And the proliferation of channels is a very real thing.
And the problem that I've seen across the board
with other things that try to address incident management
has always been fanciful at best
about what really happens when something breaks.
Like you talk about, oh, here's what happens.
Step one, you will pull up the Google Doc
or you'll pull up the wiki or the rest
or in some aspirational places,
ah, something seems weird.
I will go open a ticket in Jira.
Meanwhile, here in reality,
anyone who's ever worked in these environments
knows it's step one.
Oh shit, oh shit, oh shit, oh shit, oh shit.
What are we going to do?
And all the practices and procedures that often exist,
especially in orgs that aren't very practiced
at these sorts of things,
tend to fly out the window and people are going to do what they're going to do.
So any tool or any platform that winds up addressing that has to accept the reality of meeting people where they are, not trying to educate people into different patterns of
behavior as such. One of the things I like about your approach is, yeah, it's going to be a lot of conversation in Slack. That is a given. We can pretend otherwise,
but here in reality, that is how work gets communicated, particularly in extremis.
And I really appreciate the fact that you are not trying to, I guess, fight what feels almost
like a law of nature at this point. Yeah, I think there's a few things in that. The
first point around the document approach or the clearly defined steps of how an incident works,
in my experience, those things have always gone wrong. The data center's down, so we're going to
the wiki to follow our incident management procedure, which is in the data center that just lost power. Yeah, there's a dependency problem there too. 100%, 100%. I think part of the problem
that I see there is that very, very often you've got this situation where the people designing the
process are not the people following the process. And so there's this classic, I've heard it through
John Osborne, but there's a bunch of other folks who talk about the difference between people,
you know, at the sharp end or the blunt end of the work. And I think the problem that people
have faced in the past is you have these people who sit in the sort of metaphorical upstairs of the office and think that they make a company safe by defining a process on paper.
And they ship the piece of paper and go, that is a good job for me done.
I'm going to leave and know that I've made the bank, or whatever it is your organization does, much, much safer.
And I think this is where things fall down.
I want to ambush some of those people in their performance reviews with: cool, just for fun, for all that documentation, we're going to pull up the analytics to see how often that stuff gets viewed.
Oh, nobody ever sees it.
It's frustrating.
It's frustrating because that never, ever happens clearly.
But the point you made around like meeting people where you are, I think that is a huge
one, which is incidents are founded on great communication.
Like, as I said earlier, this is like form a team with someone you've never ever worked with before. And the last thing you want to do is be
like, Hey, Corey, I've never met you before, but let's jump out onto this other platform somewhere
that I've never been, or I haven't been for weeks. And we'll try and figure stuff out over there.
It's like, no, you're going to be communicating. We use Slack internally, but we have a WhatsApp
chat that we wind up using for incident stuff. So go ahead and log into WhatsApp, which you haven't done in 18 months, and join the chat. In the dawn of time,
in the mists of antiquity, you vaguely heard something about that in your first week,
and then never again. This stuff has to be practiced, and it's important to get it right.
How do you approach the inherent and often unfortunate reality that incident response and management becomes very different depending upon the specifics of your company or your culture or something like that? In other words, how cookie-cutter is what you have built versus adaptable to the different environments it finds itself operating in?
Man, the amount of time we spent as a founding team in the early days deliberating over
how opinionated we should be versus how flexible we should be was staggering. The way we like to
describe it is we are quite opinionated about how we think incidents should be run. However,
we let you imprint your own process into that. So putting some color onto that, we expect
incidents to have a
lead. That is something you cannot get away from. However, you can call the lead whatever makes sense for you at your organization. So some folks call them an incident commander or a manager or
whatever. It's the overwhelming militarization of these things. Like, oh yes, we're going to wind up taking a bunch of terms from the military here. It's like, you realize that your entire giant
screaming fire is that the lights on the screen
are in the wrong pattern. You're trying to make them the right pattern. No one dies here in most
cases. So it feels a little grandiose for some of those terms being tossed around in some cases,
but I get it. You've got to make something that is unpleasant and tedious in many respects,
a little bit more gripping. I don't envy people. Messaging's hard.
Yeah, it is. And I think if you are overly
virtuistic and inflexible, you're sort of fighting an uphill battle here, right? So
folks are going to want to call things what they want to call things. And you've got
people who want to import ITIL definitions for severities into the platform because that's what
they're familiar with. That's fine. What we are opinionated about is that you have some
severity levels because absent the academic criticism of severity levels, they are
a useful mechanism to very coarsely and very quickly assess how bad something is and to take
some actions off of it. So yeah, we basically have various points in the product where you can
customize and put your own sort of flavor on it. But generally we have a relatively opinionated
end-to-end expectation of how you will run that process. The thing that I find annoys me
in some cases the most
is how heavyweight the process is.
It's clearly built by people
in an ivory tower somewhere
where there's effectively a two-day long
post-mortem analysis of the incident
and so on and so forth.
And okay, great.
Your entire site isn't blown off the internet.
Yeah, that probably makes sense.
But as soon as you start broadening that
to things like, okay,
an increase in 500 errors on this service for 30 minutes, great. Well, we're going to have a two-day
post-mortem on that. It's, yeah, it would be nice if you could go two full days without having
another incident of that caliber. So in other words, what, are we going to hire a new team whose full-time job is just to go ahead and triage and learn from all these incidents?
Seems to me like that's sort of throwing wood behind the wrong arrows. Yeah, I think it's very reductive to suggest that learning only happens in a
postmortem process. So I wrote a blog actually not so long ago that is about running postmortems and
when it makes sense to do it. And as part of that, I made the statement that, at the time I wrote it, we hadn't run a single postmortem at Incident.io, which is probably shocking to many people because we're an incident company and we talk about this stuff.
But we were also a company of five people.
And when something went wrong, the learning was happening.
And these things were sort of, we were carving out the time, whether it was called a post-mortem
or not, to learn and figure out these things.
Extrapolating that to bigger companies, there is little value in following processes for the sake
of following processes. And so you could... Someone in compliance just wound up spitting
their coffee all over their desktop as soon as you said that, but I hear you.
Yeah. And it's those same folks who are the ones who care about the document being written,
not the process and the learning happening. And I think that's deeply frustrating.
But all the plans, of course, assume that people will prioritize the company over their own family for certain kinds of disasters.
I love that, too.
It's this divorce from reality that's ridiculous on some level.
Speaking of ridiculous things, as you continue to grow and scale, I imagine you integrate
with things beyond just Slack.
You grab other data sources and over the fullness of time.
For example, I imagine one of your most popular requests from some of your larger customers
is to integrate with their HR system in order to figure out who's the last engineer who left.
Therefore, everything is immediately their fault because, Lord knows, the best practice is to pillory whoever left last, because then they're not there to defend themselves anymore and no one's going to get dinged for that irresponsible jackass's decisions, even if they never touched the system at all.
I'm being slightly hyperbolic, but only slightly.
Yeah, I think it's an interesting one.
I am definitely going to raise that feature request
for a pre-filled root cause category,
which is, you know, the value is just the last person
who left the organization.
It's a wonderful scapegoat situation there.
I like it.
To the point around what we do integrate with,
I think the thing that's actually quite interesting with incidents is that there is a lot of tooling that exists in the space that does little pockets of useful, valuable things in the shape of incidents. So you have PagerDuty, which is this system
that does a great job of making people's phone make a noise, but that happens and then you're
dropped into this sort of empty void of nothingness and you've got to go and figure out what to do.
And then you've got things like Jira where clearly you want to be able to track actions that are coming out of things going wrong in some cases. And
that's a great tool for that and various other things in the middle there. And yeah, our value
proposition, if you want to call it that, is to bring those things together in a way that is
massively ergonomic during an incident. So when you're in the middle of an incident, it is really
handy to be able to go, oh, I've shipped this horrible fix to this thing. It works, but I must remember to undo that.
And we put that at your fingertips in an incident channel from Slack that you can just
log that action, lose that cognitive load that would otherwise be there, move on with fixing
the thing. And I think it's that sort of thing, multiplied by a thousand in incidents, that is just what makes it feel delightful. And I cringe a little bit saying that
because it's an incident at the end of the day,
but genuinely it feels magical
when some things happen that are just like,
oh my gosh, you've automatically hooked
into my GitHub thing and someone else merged that PR
and you've posted that back into the channel for me.
So I know that that happens.
That would otherwise have been a thing
where I'd jump out of the incident
to go and figure out what was happening.
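[Another hedged aside for readers: the GitHub-back-into-the-channel behavior described above is, in general shape, a webhook that hears about a merged pull request and posts it into the relevant Slack channel. The sketch below is a guess at what that glue might look like, assuming Flask and slack_sdk; the route, token, and repo-to-channel mapping are invented for illustration, and this is not incident.io's actual implementation.]

```python
# Illustrative glue: announce merged GitHub PRs into an incident's Slack channel.
from flask import Flask, request
from slack_sdk import WebClient

app = Flask(__name__)
slack = WebClient(token="xoxb-your-bot-token")          # placeholder token
INCIDENT_CHANNELS = {"acme/payments": "C0INCIDENT1"}    # hypothetical repo -> channel map


@app.post("/webhooks/github")
def github_webhook():
    event = request.get_json(silent=True) or {}
    pr = event.get("pull_request", {})
    # Only announce merged PRs, so responders see the fix land without leaving the channel.
    if event.get("action") == "closed" and pr.get("merged"):
        channel = INCIDENT_CHANNELS.get(event.get("repository", {}).get("full_name"))
        if channel:
            slack.chat_postMessage(
                channel=channel,
                text=f":white_check_mark: {pr['user']['login']} merged {pr['title']} ({pr['html_url']})",
            )
    return "", 204
```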
This episode is sponsored in part by our friends at EnterpriseDB. EnterpriseDB powers PostgreSQL on-premises and in private cloud, and they just announced a fully managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor
because they're too busy launching another half-dozen managed databases to focus on any
one of them that they didn't build themselves. Instead, work with the experts over at Enterprise
DB. They can save you time and money. They can even help you migrate legacy applications,
including Oracle, to the cloud.
To learn more, try Big Animal for free.
Go to biganimal.com slash snark
and tell them Corey sent you.
The problem with cloud, too, is that when they start seeing an incident happen, almost the number one decision point is: is this my shitty code, something we have just pushed in our stuff, or is it the underlying provider itself? Which is why
the AWS status page being slow to update is so maddening, because those are two completely
different paths to go down, and you are having to pursue both of them equally at the same time
until one can be ruled out. And that is why time to identifying at least what side of the universe it's on is so important.
That has always been a bit of a tricky challenge.
I want to talk a bit about circular dependencies.
You target a certain persona of customer,
but I'm going to go out on a limb
and assume that one explicit company
that you are not going to want to do business with
in your current iteration is Slack itself.
Because a tool to manage, okay, so our service is down, so we're going to go to Slack to fix it, doesn't work
when the service is Slack itself. So that becomes a significant challenge. As you look at this
across the board, are you seeing customers having problems where you have circular dependency issues
with this? Easy example, Slack is built on top of AWS. When there's an underlying degradation of,
huh, suddenly US East 1 is not doing
what it's supposed to be doing,
now Slack is degraded as well,
as well as the customer site.
It seems like at that point,
you're sort of in a bit of tricky positioning as a customer.
Counterpoint, when neither Slack nor your site are working,
figuring out what caused that issue doesn't seem like it's the biggest stretch of the imagination at that point.
I've spent a lot of my career working in infrastructure platform type teams, and I
think you can end up tying yourself in knots if you try and over-optimize for avoiding these
dependencies. I think it's one of those sort of turtles all the way down situations. So yes, Slack are unlikely to become a customer because they are clearly going to want to use
our product when they are down. They reach out: we'd like to be your customer. Response: please don't be. None of us is going to be happy with this outcome. Yeah. I mean, the interesting thing
there is that we're friends with some folks at Slack and they, believe it or not, they do use
Slack to navigate their incidents. They have an internal tool that they have written.
And I think this sort of speaks to the point we made earlier,
which is that incidents and things failing are not these sort of big binary events.
And so all of Slack is down is not the only kind of incident
that a company like Slack can experience.
And it goes so far that it's most commonly not that.
It's most commonly that you're navigating incidents where it is a degradation
or some edge case or something else that's happened. And so the pragmatic solution here is not to avoid the
circular dependencies, in my view. It's to accept that they exist and make sure you have sensible
escape hatches for when something does go wrong. So a good example: we use Incident.io
at Incident.io to manage incidents that we're having with Incident.io. And 99% of the time,
that is absolutely fine, because we are having some error in some corner of the product
or a particular customer is doing something
that is a bit curious.
And I could count literally on one hand
the number of times that we have not been able
to use our product to fix our product.
And in those cases, we have a fallback, which is-
I assume you put a little thought into what happened.
Well, what if our product is down?
Well, I guess we'll never be able to fix it
or communicate about it.
It seems like that's the sort of thing that given what you do,
you might have put more than 10 seconds of thought into.
We've put a fair amount of thought into it.
But at the end of the day, it's like, if stuff's down,
like what do you need to do?
You need to communicate with people.
So jump on a Google chat, jump on a Slack huddle,
whatever else it is.
We have various different like fallbacks in different order.
And at the core of it, I think this is the thing is like, you cannot be prepared for every single
thing going wrong. And so what you can be prepared for is to be unprepared and just accept that
humans are incredibly good at being resilient. And therefore all manner of things are going to
happen that you've never seen before. And I guarantee you will figure them out and fix them
basically. But yeah, I say this, if my SOC 2 auditor is listening, we also do have a very well-defined backup plan in our SOC 2,
in our policies and processes that is the thing that we will follow there. But yeah.
The fact that you're saying the magic words of SOC 2, yes, exactly. Being a responsible adult
and living up to some baseline compliance obligations is really the sign of a company
that's put a little thought into these things.
So as I pull up incident.io,
the website, not the company, to be clear,
and look through what you've written
and how you talk about what you're doing,
you've avoided what I would almost certainly have not
because your tagline front and center on your landing page
is manage incidents at scale without leaving Slack.
If someone were to reach out and
say, well, look, we're down all the time, but we're using Microsoft Teams, so I don't know that we can
use you, like the immediate instinctive response that I would have to that to the point where I
would put it in the copy is, okay, this piece of advice is free. I would posit that you're down all
the time because you're the kind of company to use Microsoft Teams. But that doesn't tend to win a whole lot of friends in various places.
In a slightly less sarcastic end, do you see people reaching out with,
well, we want to use you because we love what you're doing, but we don't use Slack?
Yeah, we do.
A lot of folks, actually.
And we will support Teams one day.
I think there is nothing especially unique about the product that means that we are tied to Slack. It is a great way to distribute our product and it sort of aligns with the companies that
think in the way that we do in the general case. But like at the core of what we're building,
it's a platform that augments our communication platform to make it much easier to deal with a
high stress, high pressure situation. And so in the future, we will support ways for you to connect
Microsoft Teams or, if Zoom sorted out rich app experiences, talk on a Zoom and be able to do various things
like logging actions and communicating with other systems and things like that.
But yeah, for the time being, very, very deliberate focus mechanism for us.
We're a small company.
We're like 30 people now.
And so, yeah, focusing on that sort of very slim vertical is working well for us.
And it certainly seems to be working to your benefit.
Every person I've talked to who has encountered you folks has nothing but good things to say.
We have a bunch of folks in common listed on the wall of logos, the social proof eye
chart thing of here's people who are using us.
And these are serious companies.
I mean, your last job before starting Incident.io was at Monzo, as you mentioned.
You know what you're doing in a regulated, serious sense. I would be, quite honestly, extraordinarily skeptical if your
background were significantly different from this because, well, yeah, we worked at Twitter for Pets
and our three-person SRE team, we can tell you exactly how to go ahead and handle your incidents.
Yeah, there's a certain
level of operational maturity that, just based upon the name of the company there, I don't
think that Twitter for Pets is going to nail. Monzo is a bank. Yes, you know what you're talking
about, given that you have not basically been shut down by an army of regulators. It really
does breed an awful lot of confidence. But what's interesting to me is
that the number of people that we talk to in common are not themselves banks. Some are, and
they do very serious things, but others are not these highly regulated command and control top
down companies. You are nimble enough that you can get embedded at the startupiest of startup
companies once they hit a certain point
of scale and wind up helping them arrive at a better outcome. It's interesting in that you
don't normally see a whole lot of tools that wind up being able to speak to both sides of
that very broad spectrum and most things in between very effectively, but you've somehow
managed to thread that needle. Good work. Thank you. Yeah. What else can I say other than thank you?
I think it's a deliberate product positioning
that we've gone down to try and be able to support
those different use cases.
So I think at the core of it,
we have always tried to maintain
that incident.io should be installable
and usable in your very first incident
without you having to have a very steep learning curve.
But there is depth behind it
that allows you to support
a much more sophisticated incident setup.
So, I mean, you mentioned Monzo.
Like, I just feel incredibly fortunate to have worked at that company.
I joined back in 2017 when they were, I don't know, like 150,000 customers.
And it was just getting its banking license.
And I was there for four years and was able to then see it scale up to 6 million customers.
And all of the challenges and pain that goes along with that, both from building infrastructure and the technical side of things, but from an organizational side of things
and was like a front-row seat to being able to work with some incredibly smart people, and sort of
see all these various different pain points. And honestly, it feels a little bit like being in sort
of a cheat mode where we get to just import a lot of that knowledge and pain that we felt at Monzo
into the product. And that happens to resonate with a bunch of folks. So yeah, I feel like things are sort of coming out quite well
at the moment for folks. The one thing I will say before we wind up calling this an episode
is just how grateful I am that I don't have to think about things like this anymore. There's
a reason that the problem I chose to work on, expensive AWS bills, is very much a business-hours-only style of problem. We're a services
company. We don't have production infrastructure that is externally facing. Oh no, one of our data
analysis tools isn't working internally. That's an interesting curiosity, but it's not an emergency
in the same way that, oh, we're an ad network and people aren't seeing ads right now because we're broken, is. So I am grateful that I don't have to think about these
things anymore. And also a little wistful because there's so much that you do that would have made
dealing with expensive and dangerous outages back in my production years a lot nicer.
Yeah, I think that's what a lot of folks are telling us, essentially. There's this curious thing where this product didn't exist however many years ago, and I think it's sort of been quite emergent in a lot of companies that, as things have moved on, something needs to exist in this little pocket of space dealing with incidents in modern companies. So I'm very pleased that what we're able to build here is sort of working and filling that for folks.
Yeah, I really want to thank you for taking so much time to go through the ethos of what you do, why you do it, and how you do it. If people want to learn more, where's the best place for them to go? Ideally, not during an incident.
Not during an incident, obviously. Handily, the website is the company name, so incident.io is a great place to go and find out more. We've literally, literally just today actually launched our practical guide to incident management, which is a really full piece of content, which hopefully will be useful to a bunch of different
folks. Excellent. We will, of course, put a link to that in the show notes. I really want to thank
you for being so generous with your time. Really appreciate it. Thanks so much. It's been an
absolute pleasure. Chris Evans, Chief Product Officer and co-founder of Incident.io. I'm
cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice,
along with an angry comment telling me why your latest incident is all the intern's fault. If your AWS bill keeps rising
and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill
by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.