Screaming in the Cloud - Incidents, Solutions, and ChatOps Integration with Chris Evans
Episode Date: July 7, 2022
About Chris
Chris is the Co-founder and Chief Product Officer at incident.io, where they're building incident management products that people actually want to use. A software engineer by trade, Chris is no stranger to gnarly incidents, having participated in (and caused!) them at everything from early-stage startups through to enormous IT organizations.
Links Referenced:
incident.io: https://incident.io
Practical Guide to Incident Management: https://incident.io/guide/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
DoorDash had a problem.
As their cloud-native environment scaled and developers delivered new features,
their monitoring system kept breaking down.
In an organization where data is used to make better decisions about technology and about the business,
losing observability means the entire company loses their competitive edge.
With Chronosphere, DoorDash is no longer losing visibility into their application suite.
The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution
that gives the observability lead at DoorDash business confidence and peace of
mind. Read the full success story at snark.cloud slash chronosphere. That's snark.cloud slash
C-H-R-O-N-O-S-P-H-E-R-E. Let's face it, on-call firefighting at 2 a.m. is stressful. So there's
good news and there's bad
news. The bad news is that you probably can't prevent incidents from happening, but the good
news is that Incident.io makes incidents less stressful and a lot more valuable. Incident.io
is a Slack-native incident management platform. It allows you to automate incident processes, focus on fixing the issues,
and learn from incident insights to improve site reliability and fix your vulnerabilities.
Try Incident.io to recover faster and sleep more. Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest is Chris Evans, who's the CPO and co-founder
of Incident.io. Chris, first, thank you very much for joining me. And I'm going to start
with an easy question. Well, easy question, hard answer, I think. What is an Incident.io exactly?
Incident.io is a software platform that helps entire organizations to respond to,
recover from, and learn from incidents. When you say incident, that means an awful lot of things.
And depending on where you are in the ecosystem in the world, that means different things to
different people. For example, oh, incident, are you talking about the noodle incident? Because
we had an agreement that we would never speak about that thing again, style, versus folks who are steeped in DevOps or
SRE culture, which is, of course, a fancy way to say those who are sad all the time, usually about
computers. What is an incident in the context of what you folks do? That, I think, is the killer
question. I think if you look at organizations in the past,
I think incidents were those things that happened once a quarter, maybe once a year,
and they were the thing that brought the entirety of your site down because your big central
database that was in a data center sort of disappeared. The way that modern companies
run means that the definition has to be very, very different. So most places now rely on
distributed systems, and there is no sort of binary sense of up or down these days.
And essentially, in the general case, like most companies are continually in a sort of state of things being broken all of the time.
And so for us, when we look at what an incident is, it is essentially anything that takes you away from your planned work with a sense of urgency.
And that's the sort of the pithy definition that we use there.
Generally, that can mean anything. It means different things to different folks. And like when we talk to folks,
we encourage them to think carefully about what that threshold is. But generally for us at
Incident.io, that means basically a single error that is worth investigating, that you would stop doing your backlogged work for, is an incident. And also an entire app being down, that is an
incident. So there's quite a wide range there. But essentially, by sort of having more incidents and lowering that threshold, you suddenly have a heap of benefits, which I can go very deep into and talk for hours about.
It's a deceptively complex question.
When I talk to folks about backups, one of the biggest problems in the world of backup and building a DR plan, it's not building the DR plan, though that's no picnic either.
It's, okay, in the time of cloud,
all your planning figures out,
okay, suddenly the site is down.
How do we fix it?
There are different levels of down,
and that means different things to different people,
where, especially the way we build apps today,
it's not, is the service or site up or down,
but with distributed systems, it's how down is it?
And, oh, we're seeing elevated
error rates in the U.S. Tire Fire 1 region of AWS. At what point do we begin executing on our disaster
plan? Because the worst answer in some respects is every time you think you see a problem, you start
failing over to other regions and other providers and the rest, and three minutes in, you've
irrevocably made the cutover, and it's going to take 15 minutes to come back up, and oh yeah, then your primary site comes back up,
because whoever unplugged something plugged it back in, and now you've made the wrong choice.
Figuring out all things around the incident, it's not what it once was. When you're running your own
blog on a single web server, and it's broken, it's pretty easy to say, is it up or is it down?
As you scale out, it seems like that gets more and more diffuse.
But it feels to me that it's also less of a question of how the technology is scaled,
but also how the culture and the people have scaled. When you're the only engineer somewhere,
you pretty much have no choice but to have the entire state of your stack shoved into your head.
When that becomes 15 or 20 different teams of people in some cases, it feels like it's almost less of a technology problem than it is a problem of how you communicate and how you get people involved and the issues in front of the people who are empowered and insightful in a certain area that needs fixing.
100%. This is like a really, really key point, which is that organizations themselves are very complex.
And so you've got this combination of systems getting
more and more complicated, more and more sort of things going wrong and perpetually breaking,
but you've got very, very complicated information structures and communication throughout a whole
organization to keep things up and running. The very best orgs are the ones where they can engage
the entire sort of every corner of the organization when things do go wrong. And I lived and breathed this firsthand at various different previous companies, but most recently at Monzo, which is a bank here in the UK. When an incident happened
there, like one of our two physical data center locations went down, the bank wasn't offline,
everything was resilient to that, but that required an immediate response. And that meant that
engineers were deployed to go and fix things, but it also meant the customer support folks might be
required to get involved because we might be slightly slower processing payments. And it means that risk and compliance folks might need
to get involved because they need to be reporting things to regulators. And the list goes on.
There's like this need for a bunch of different people who almost certainly have never worked
together or rarely worked together to come together, land in this sort of like empty space
of this incident room or virtual incident
room, and figure out how they're going to coordinate their response and get things back
on track in this sort of most streamlined way and as quick as possible. Yeah, when your bank
is suddenly offline, that seems like a really inopportune time to be introduced to the database
team. It's, oh, we have one of those. Wonderful. I feel like you folks are going to come in handy later today. You want to have those pathways of communication open well in advance of
these issues. A hundred percent. And I think the thing that makes incidents unique is that fact.
And I think the solution to that is this sort of consistent level playing field that you can put
everybody on. So if everybody understands that the way incidents are dealt with is consistent: we declare it like this, and under these conditions, these things happen. And if I flag
this kind of level of impact, we have to pull in someone else to come and help make a decision.
At the core of it, there's this weird kind of duality to incidents where they are both
kind of semi-formulaic in that you can basically encode a lot of the processes that happen, but equally
they are incredibly chaotic and require a lot of human impact to be resilient and figure these
things out because stuff that you have never seen happen before is happening and failing in ways
that you never predicted. And so this is where Incident.io plays into this, is that we try to
take the first half of that off of your hands, which is we will
help you run your process so that all of the brain capacity you have, it goes on to the bit that
humans are uniquely placed to be able to do, which is responding to these very, very chaotic sort of
surprise events that have happened. It feels as well, because I played around in this space a bit
before. I used to run ops teams and more or less, I really should have had a t-shirt then
that said, I am the root cause.
Because yeah, I basically did a lot of self-inflicted outages
in various environments.
Because it turns out I'm not always the best with computers.
Imagine that.
There are a number of different companies
that play in the space
that look at some part of the incident lifecycle.
And from the outside, first, they all look alike.
Because it's, oh, so you're incident.io.
I assume you're PagerDuty.
You're the thing that calls me at two in the morning
to make sure I wake up.
Conversely, for folks who haven't worked deeply
in that space as well of setting things on fire,
what you do sounds like it's highly susceptible
to the Hacker News problem where, wait,
so what you do is effectively just getting
people to coordinate and talk during an incident? Well, that doesn't sound hard. I could do that in
a weekend. And no, no, you can't. If this were easy, you would not have been in business as long
as you have, have the team the size that you do, the customers that you do. But it's one of those
things that until you've been in a very specific
set of problems, it doesn't sound like it's a real problem that needs solving.
Yeah, I think that's true. And I think that the Hacker News point is a particularly pertinent
one in that someone else sort of in an adjacent area launched on Hacker News recently. And the
amount of feedback they got around, you know, you're a Slack bot, how is this a company, was kind of staggering. And I think generally where that
comes from is, well, first of all, that bias that engineers have, which is just everything you look
at as an engineer is like, yeah, I could build that in a weekend. I think there is often infinite
complexity under the hood that just gets kind of brushed over. But yeah, I think at the core of it,
you probably could build a Slack bot in a weekend that creates a channel for you in Slack and allows you to post somewhere that...
Oh, good.
More channels in Slack.
Just what everyone wants.
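[A hedged aside for readers, not part of the conversation: the "weekend version" being riffed on here, a bot that creates a channel and posts into it, really is only a few lines against the Slack Web API. The sketch below assumes the slack_sdk Python client; the token, channel naming scheme, and helper name are invented for illustration, and it is nothing like the full product being discussed.]

```python
# Minimal sketch of the "weekend Slack bot": create an incident channel, post a kickoff message.
# Assumes a bot token with channels:manage and chat:write scopes; names are placeholders.
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")  # placeholder token


def declare_incident(slug: str, summary: str) -> str:
    """Create a dedicated incident channel and post the initial summary into it."""
    channel_id = client.conversations_create(name=f"inc-{slug}")["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: Incident declared: {summary}",
    )
    return channel_id
```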
Well, there you go.
I mean, that's a particularly pertinent one because our tool does do that.
And one of the things...
So I built at Monzo a version of incident.io that we used at the company there.
And that was something that I built evenings and weekends.
And among the many, many things I never got around to building, archiving and cleaning up channels
was one of the ones that was on that list.
And so Monzo did have this problem
of littered channels everywhere.
I think that's sort of like part of the problem here
is like it is easy to look at a product like ours
and sort of assume it is this sort of friendly Slack bot
that helps you orchestrate some very basic commands.
And I think when you actually dig into the problems
that organizations above a certain size have, they are not solved by Slack bots. They're solved by platforms that
help you to encode your processes that otherwise have to live on a Google Doc somewhere, which is
five pages long. And when it's 2am and everything's on fire, I guarantee you not a single person reads
that Google Doc. So your process is as good as not in place at all. That's the beauty
of a tool like ours. We have a powerful engine that helps you basically to encode that and take
some load off of you. To be clear, I'm also not coming at this from a position of judging other
people. I just look right now at the Slack workspace that we have at the Duckbill group,
and we have something like a 10 to 1 channel to human ratio. And the proliferation of channels is a very real thing.
And the problem that I've seen across the board
with other things that try to address incident management
has always been fanciful at best
about what really happens when something breaks.
Like you talk about, oh, here's what happens.
Step one, you will pull up the Google Doc
or you'll pull up the wiki or the rest
or in some aspirational places,
ah, something seems weird.
I will go open a ticket in Jira.
Meanwhile, here in reality,
anyone who's ever worked in these environments
knows it's step one.
Oh shit, oh shit, oh shit, oh shit, oh shit.
What are we going to do?
And all the practices and procedures that often exist,
especially in orgs that aren't very practiced
at these sorts of things,
tend to fly out the window and people are going to do what they're going to do.
So any tool or any platform that winds up addressing that has to accept the reality of meeting people where they are, not trying to educate people into different patterns of
behavior as such. One of the things I like about your approach is, yeah, it's going to be a lot of conversation in Slack. That is a given. We can pretend otherwise,
but here in reality, that is how work gets communicated, particularly in extremis.
And I really appreciate the fact that you are not trying to, I guess, fight what feels almost
like a law of nature at this point. Yeah, I think there's a few things in that. The
first point around the document approach or the clearly defined steps of how an incident works,
in my experience, those things have always gone wrong. The data center's down, so we're going to
the wiki to follow our incident management procedure, which is in the data center that just lost power. Yeah, there's a dependency problem there too. 100%, 100%. I think part of the problem
that I see there is that very, very often you've got this situation where the people designing the
process are not the people following the process. And so there's this classic, I've heard it through
John Osborne, but there's a bunch of other folks who talk about the difference between people,
you know, at the sharp end or the blunt end of the work. And I think the problem that people
have faced in the past is you have these people who sit in the sort of metaphorical upstairs of the office and think that they make a company safe by defining a process on paper.
And they ship the piece of paper and go, that is a good job for me done.
I'm going to leave and know that I've made the bank, or whatever it is your organization does, much, much safer.
And I think this is where things fall down.
I want to ambush some of those people in their performance reviews with: cool, just for fun, for all that documentation, we're going to pull up the analytics to see how often that stuff gets viewed.
Oh, nobody ever sees it.
It's frustrating.
It's frustrating because that never, ever happens clearly.
But the point you made around like meeting people where you are, I think that is a huge
one, which is incidents are founded on great communication.
Like, as I said earlier, this is like form a team with someone you've never ever worked with before. And the last thing you want to do is be
like, Hey, Corey, I've never met you before, but let's jump out onto this other platform somewhere
that I've never been, or I haven't been for weeks. And we'll try and figure stuff out over there.
It's like, no, you're going to be communicating. We use Slack internally, but we have a WhatsApp
chat that we wind up using for incident stuff. So go ahead and log into WhatsApp, which you haven't done in 18 months, and join the chat. In the dawn of time,
in the mists of antiquity, you vaguely heard something about that in your first week,
and then never again. This stuff has to be practiced, and it's important to get it right.
How do you approach the inherent and often unfortunate reality that incident response and management becomes very different depending upon the specifics of your company or your culture or something like that? In other words, how cookie-cutter is what you have built versus adaptable to the different environments it finds itself operating in?
Man, the amount of time we spent as a founding team in the early days deliberating over
how opinionated we should be versus how flexible we should be was staggering. The way we like to
describe it is we are quite opinionated about how we think incidents should be run. However,
we let you imprint your own process into that. So putting some color onto that, we expect
incidents to have a
lead. That is something you cannot get away from. However, you can call the lead whatever makes sense for you at your organization. So some folks call them an incident commander or a manager or
whatever. It's the overwhelming militarization of these things. Like, oh yes, we're going to wind up taking a bunch of terms from the military here. It's like, you realize that your entire giant
screaming fire is that the lights on the screen
are in the wrong pattern. You're trying to make them the right pattern. No one dies here in most
cases. So it feels a little grandiose for some of those terms being tossed around in some cases,
but I get it. You've got to make something that is unpleasant and tedious in many respects,
a little bit more gripping. I don't envy people. Messaging's hard.
Yeah, it is. And I think if you are overly
virtuistic and inflexible, you're sort of fighting an uphill battle here, right? So
folks are going to want to call things what they want to call things. And you've got
people who want to import ITIL definitions for severities into the platform because that's what
they're familiar with. That's fine. What we are opinionated about is that you have some
severity levels because absent the academic criticism of severity levels, they are
a useful mechanism to very coarsely and very quickly assess how bad something is and to take
some actions off of it. So yeah, we basically have various points in the product where you can
customize and put your own sort of flavor on it. But generally we have a relatively opinionated
end-to-end expectation of how you will run that process. The thing that I find annoys me
in some cases the most
is how heavyweight the process is.
It's clearly built by people
in an ivory tower somewhere
where there's effectively a two-day long
post-mortem analysis of the incident
and so on and so forth.
And okay, great.
Your entire site isn't blown off the internet.
Yeah, that probably makes sense.
But as soon as you start broadening that
to things like, okay,
an increase in 500 errors on this service for 30 minutes, great. Well, we're going to have a two-day
post-mortem on that. It's, yeah, it would be nice if you could go two full days without having
another incident of that caliber. So in other words, what, are we going to hire a new team whose full-time job is just to go ahead and triage and learn from all these incidents?
Seems to me like that's sort of throwing wood behind the wrong arrows. Yeah, I think it's very reductive to suggest that learning only happens in a
postmortem process. So I wrote a blog actually not so long ago that is about running postmortems and
when it makes sense to do it. And as part of that, I made the statement that, at the time I wrote it, we hadn't run a single postmortem at Incident.io, which is probably shocking to many people because we're an incident company and we talk about this stuff.
But we were also a company of five people.
And when something went wrong, the learning was happening.
And these things were sort of, we were carving out the time, whether it was called a post-mortem
or not, to learn and figure out these things.
Extrapolating that to bigger companies, there is little value in following processes for the sake
of following processes. And so you could... Someone in compliance just wound up spitting
their coffee all over their desktop as soon as you said that, but I hear you.
Yeah. And it's those same folks who are the ones who care about the document being written,
not the process and the learning happening. And I think that's deeply frustrating.
But all the plans, of course, assume that people will prioritize the company over their own family for certain kinds of disasters.
I love that, too.
It's this divorce from reality that's ridiculous on some level.
Speaking of ridiculous things, as you continue to grow and scale, I imagine you integrate
with things beyond just Slack.
You grab other data sources and over the fullness of time.
For example, I imagine one of your most popular requests from some of your larger customers
is to integrate with their HR system in order to figure out who's the last engineer who left.
Therefore, everything is immediately their fault because, Lord knows, the best practice is to pillory whoever left last, because then they're not there to defend themselves anymore and no one's going to get dinged for that irresponsible jackass's decisions, even if they never touched the system at all.
I'm being slightly hyperbolic, but only slightly.
Yeah, I think it's an interesting one.
I am definitely going to raise that feature request
for a pre-filled root cause category,
which is, you know, the value is just the last person
who left the organization.
It's a wonderful scapegoat situation there.
I like it.
To the point around what we do integrate with,
I think the thing that's actually quite interesting with incidents is that there is a lot of tooling that exists in the space that does little pockets of useful, valuable things in the shape of incidents. So you have PagerDuty, which is this system
that does a great job of making people's phone make a noise, but that happens and then you're
dropped into this sort of empty void of nothingness and you've got to go and figure out what to do.
And then you've got things like Jira where clearly you want to be able to track actions that are coming out of things going wrong in some cases. And
that's a great tool for that and various other things in the middle there. And yeah, our value
proposition, if you want to call it that, is to bring those things together in a way that is
massively ergonomic during an incident. So when you're in the middle of an incident, it is really
handy to be able to go, oh, I've shipped this horrible fix to this thing. It works, but I must remember to undo that.
And we put that at your fingertips in an incident channel from Slack that you can just
log that action, lose that cognitive load that would otherwise be there, move on with fixing
the thing. And I think it's that sort of thing, multiplied by a thousand in incidents, that is just what makes it feel delightful. And I cringe a little bit saying that
because it's an incident at the end of the day,
but genuinely it feels magical
when some things happen that are just like,
oh my gosh, you've automatically hooked
into my GitHub thing and someone else merged that PR
and you've posted that back into the channel for me.
So I know that that happens.
That would otherwise have been a thing
where I'd jump out of the incident
to go and figure out what was happening.
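[Another hedged aside for readers: the GitHub-back-into-the-channel behavior described above is, in general shape, a webhook that hears about a merged pull request and posts it into the relevant Slack channel. The sketch below is a guess at what that glue might look like, assuming Flask and slack_sdk; the route, token, and repo-to-channel mapping are invented for illustration, and this is not incident.io's actual implementation.]

```python
# Illustrative glue: announce merged GitHub PRs into an incident's Slack channel.
from flask import Flask, request
from slack_sdk import WebClient

app = Flask(__name__)
slack = WebClient(token="xoxb-your-bot-token")          # placeholder token
INCIDENT_CHANNELS = {"acme/payments": "C0INCIDENT1"}    # hypothetical repo -> channel map


@app.post("/webhooks/github")
def github_webhook():
    event = request.get_json(silent=True) or {}
    pr = event.get("pull_request", {})
    # Only announce merged PRs, so responders see the fix land without leaving the channel.
    if event.get("action") == "closed" and pr.get("merged"):
        channel = INCIDENT_CHANNELS.get(event.get("repository", {}).get("full_name"))
        if channel:
            slack.chat_postMessage(
                channel=channel,
                text=f":white_check_mark: {pr['user']['login']} merged {pr['title']} ({pr['html_url']})",
            )
    return "", 204
```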
This episode is sponsored in part by our friends at EnterpriseDB. EnterpriseDB powers PostgreSQL on-premises and in private cloud, and they just announced a fully managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor
because they're too busy launching another half-dozen managed databases to focus on any
one of them that they didn't build themselves. Instead, work with the experts over at Enterprise
DB. They can save you time and money. They can even help you migrate legacy applications,
including Oracle, to the cloud.
To learn more, try Big Animal for free.
Go to biganimal.com slash snark
and tell them Corey sent you.
The problem with cloud, too, is that when they start seeing an incident happen, almost the number one decision point is: is this my shitty code, something we have just pushed in our stuff, or is it the underlying provider itself? Which is why
the AWS status page being slow to update is so maddening, because those are two completely
different paths to go down, and you are having to pursue both of them equally at the same time
until one can be ruled out. And that is why time to identifying at least what side of the universe it's on is so important.
That has always been a bit of a tricky challenge.
I want to talk a bit about circular dependencies.
You target a certain persona of customer,
but I'm going to go out on a limb
and assume that one explicit company
that you are not going to want to do business with
in your current iteration is Slack itself.
Because a tool to manage, okay, so our service is down, so we're going to go to Slack to fix it, doesn't work
when the service is Slack itself. So that becomes a significant challenge. As you look at this
across the board, are you seeing customers having problems where you have circular dependency issues
with this? Easy example, Slack is built on top of AWS. When there's an underlying degradation of,
huh, suddenly US East 1 is not doing
what it's supposed to be doing,
now Slack is degraded as well,
as well as the customer site.
It seems like at that point,
you're sort of in a bit of tricky positioning as a customer.
Counterpoint, when neither Slack nor your site are working,
figuring out what caused that issue doesn't seem like it's the biggest stretch of the imagination at that point.
I've spent a lot of my career working in infrastructure platform type teams, and I
think you can end up tying yourself in knots if you try and over-optimize for avoiding these
dependencies. I think it's one of those sort of turtles all the way down situations. So yes, Slack are unlikely to become a customer because they are clearly going to want to use
our product when they are down. They reach out: we'd like to be your customer. Response: please don't be. None of us is going to be happy with this outcome. Yeah. I mean, the interesting thing
there is that we're friends with some folks at Slack and they, believe it or not, they do use
Slack to navigate their incidents. They have an internal tool that they have written.
And I think this sort of speaks to the point we made earlier,
which is that incidents and things failing are not these sort of big binary events.
And so all of Slack is down is not the only kind of incident
that a company like Slack can experience.
And it goes so far that it's most commonly not that.
It's most commonly that you're navigating incidents where it is a degradation
or some edge case or something else that's happened. And so the pragmatic solution here is not to avoid the
circular dependencies, in my view. It's to accept that they exist and make sure you have sensible
escape hatches for when something does go wrong. So a good example: we use Incident.io
at Incident.io to manage incidents that we're having with Incident.io. And 99% of the time,
that is absolutely fine, because we are having some error in some corner of the product
or a particular customer is doing something
that is a bit curious.
And I could count literally on one hand
the number of times that we have not been able
to use our product to fix our product.
And in those cases, we have a fallback, which is-
I assume you put a little thought into what happened.
Well, what if our product is down?
Well, I guess we'll never be able to fix it
or communicate about it.
It seems like that's the sort of thing that given what you do,
you might have put more than 10 seconds of thought into.
We've put a fair amount of thought into it.
But at the end of the day, it's like, if stuff's down,
like what do you need to do?
You need to communicate with people.
So jump on a Google chat, jump on a Slack huddle,
whatever else it is.
We have various different like fallbacks in different order.
And at the core of it, I think this is the thing is like, you cannot be prepared for every single
thing going wrong. And so what you can be prepared for is to be unprepared and just accept that
humans are incredibly good at being resilient. And therefore all manner of things are going to
happen that you've never seen before. And I guarantee you will figure them out and fix them
basically. But yeah, I say this, if my SOC 2 auditor is listening, we also do have a very well-defined backup plan in our SOC 2,
in our policies and processes that is the thing that we will follow there. But yeah.
The fact that you're saying the magic words of SOC 2, yes, exactly. Being a responsible adult
and living up to some baseline compliance obligations is really the sign of a company
that's put a little thought into these things.
So as I pull up incident.io,
the website, not the company, to be clear,
and look through what you've written
and how you talk about what you're doing,
you've avoided what I would almost certainly have not
because your tagline front and center on your landing page
is manage incidents at scale without leaving Slack.
If someone were to reach out and
say, well, look, we're down all the time, but we're using Microsoft Teams, so I don't know that we can
use you, like the immediate instinctive response that I would have to that to the point where I
would put it in the copy is, okay, this piece of advice is free. I would posit that you're down all
the time because you're the kind of company to use Microsoft Teams. But that doesn't tend to win a whole lot of friends in various places.
In a slightly less sarcastic end, do you see people reaching out with,
well, we want to use you because we love what you're doing, but we don't use Slack?
Yeah, we do.
A lot of folks, actually.
And we will support Teams one day.
I think there is nothing especially unique about the product that means that we are tied to Slack. It is a great way to distribute our product and it sort of aligns with the companies that
think in the way that we do in the general case. But like at the core of what we're building,
it's a platform that augments our communication platform to make it much easier to deal with a
high stress, high pressure situation. And so in the future, we will support ways for you to connect
Microsoft Teams or, if Zoom sorted out rich app experiences, talk on a Zoom and be able to do various things
like logging actions and communicating with other systems and things like that.
But yeah, for the time being, very, very deliberate focus mechanism for us.
We're a small company.
We're like 30 people now.
And so, yeah, focusing on that sort of very slim vertical is working well for us.
And it certainly seems to be working to your benefit.
Every person I've talked to who has encountered you folks has nothing but good things to say.
We have a bunch of folks in common listed on the wall of logos, the social proof eye
chart thing of here's people who are using us.
And these are serious companies.
I mean, your last job before starting Incident.io was at Monzo, as you mentioned.
You know what you're doing in a regulated, serious sense. I would be, quite honestly, extraordinarily skeptical if your
background were significantly different from this because, well, yeah, we worked at Twitter for Pets
and our three-person SRE team, we can tell you exactly how to go ahead and handle your incidents.
Yeah, there's a certain
level of operational maturity that, just based upon the name of the company there, I don't
think that Twitter for Pets is going to nail. Monzo is a bank. Yes, you know what you're talking
about, given that you have not basically been shut down by an army of regulators. It really
does breed an awful lot of confidence. But what's interesting to me is
that the number of people that we talk to in common are not themselves banks. Some are, and
they do very serious things, but others are not these highly regulated command and control top
down companies. You are nimble enough that you can get embedded at the startupiest of startup
companies once they hit a certain point
of scale and wind up helping them arrive at a better outcome. It's interesting in that you
don't normally see a whole lot of tools that wind up being able to speak to both sides of
that very broad spectrum and most things in between very effectively, but you've somehow
managed to thread that needle. Good work. Thank you. Yeah. What else can I say other than thank you?
I think it's a deliberate product positioning
that we've gone down to try and be able to support
those different use cases.
So I think at the core of it,
we have always tried to maintain
that incident.io should be installable
and usable in your very first incident
without you having to have a very steep learning curve.
But there is depth behind it
that allows you to support
a much more sophisticated incident setup.
So, I mean, you mentioned Monzo.
Like, I just feel incredibly fortunate to have worked at that company.
I joined back in 2017 when they were, I don't know, like 150,000 customers.
And it was just getting its banking license.
And I was there for four years and was able to then see it scale up to 6 million customers.
And all of the challenges and pain that goes along with that, both from building infrastructure and the technical side of things, but from an organizational side of things
and was like a front-row seat to being able to work with some incredibly smart people, and sort of
see all these various different pain points. And honestly, it feels a little bit like being in sort
of a cheat mode where we get to just import a lot of that knowledge and pain that we felt at Monzo
into the product. And that happens to resonate with a bunch of folks. So yeah, I feel like things are sort of coming out quite well
at the moment for folks. The one thing I will say before we wind up calling this an episode
is just how grateful I am that I don't have to think about things like this anymore. There's
a reason that the problem I chose to work on, expensive AWS bills, is very much a business-hours-only style of problem. We're a services
company. We don't have production infrastructure that is externally facing. Oh no, one of our data
analysis tools isn't working internally. That's an interesting curiosity, but it's not an emergency
in the same way that, oh, we're an ad network and people aren't seeing ads right now because we're broken, is. So I am grateful that I don't have to think about these
things anymore. And also a little wistful because there's so much that you do that would have made
dealing with expensive and dangerous outages back in my production years a lot nicer.
Yeah, I think that's what a lot of folks are telling us, essentially. There's this curious thing where this product didn't exist however many years ago, and I think it's sort of been quite emergent in a lot of companies that, as things have moved on, something needs to exist in this little pocket of space dealing with incidents in modern companies. So I'm very pleased that what we're able to build here is sort of working and filling that for folks.
Yeah, I really want to thank you for taking so much time to go through the ethos of what you do, why you do it, and how you do it. If people want to learn more, where's the best place for them to go? Ideally, not during an incident.
Not during an incident, obviously. Handily, the website is the company name, so incident.io is a great place to go and find out more. We've literally, literally just today actually launched our practical guide to incident management, which is a really full piece of content, which hopefully will be useful to a bunch of different
folks. Excellent. We will, of course, put a link to that in the show notes. I really want to thank
you for being so generous with your time. Really appreciate it. Thanks so much. It's been an
absolute pleasure. Chris Evans, Chief Product Officer and co-founder of Incident.io. I'm
cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this episode, please leave a five-star review on your podcast platform of choice,
along with an angry comment telling me why your latest incident is all the intern's fault. If your AWS bill keeps rising
and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill
by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.