Screaming in the Cloud - A Chaos Engineering & Jeli Sandwich with Nora Jones

Episode Date: February 11, 2021

About Nora
Nora is the founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent to share her experiences helping organizations large and small reach crucial availability with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analysis from reliability incidents across various organizations, and the business impacts of doing so.

Links:
Jeli main webpage: https://www.jeli.io/
Chaos Engineering Book: https://www.amazon.com/Chaos-Engineering-System-Resiliency-Practice/dp/1492043869
Learning From Incidents: https://www.learningfromincidents.io/
Jeli contact us form: https://www.jeli.io/contact-us/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Nora Jones, who, despite having a storied history, is probably best known these days for being the founder and CEO of Jeli. Nora, welcome to the show. Thank you, Corey. So that's Jeli, J-E-L-I. I'll avoid the various jam puns we could wind up going with. Let's start at the very beginning. What the heck is Jeli? So first of all, please never stop using the puns. We all use puns internally. We call ourselves
Starting point is 00:01:31 Jeli Beans, so it's complete pundemonium in our Slack. But Jeli is an incident analysis platform. So we've built the first incident analysis platform that allows companies to not only learn from their incidents, but address everything they can from them so that they're actually understanding what's contributing to some of their major failure modes, what they're doing well at versus what they think they're doing well at, and exposing the delta between those two worlds, and honestly using incidents as a catalyst for helping orgs understand themselves better so that they can make better decisions. This can lead to things like helping them with their OKRs, helping them
Starting point is 00:02:10 with their headcount on teams. It can lead to a number of things. Really, Jeli is using the incident as a catalyst for helping you understand how you think your org works versus how it actually works. Okay, let's back up a little bit. You have a history as a software engineer. You were at Jet, and then you wound up at Netflix. You quite literally wrote the book on chaos engineering. Then you went to Slack, and now you founded a company of your own that's aimed at this. What's the common thread there? So I was actually in hardware prior to Jet, and I was working on reliability there. And I've been focused on developer productivity, reliability, honestly, my entire career. And
Starting point is 00:02:53 I was seeing some of almost the exact same flavor of incidents happening at some of the companies I was working at. It's always DNS, a disk fills up, that sort of thing? Or are you talking about something beyond that? Honestly, even with certain tools. Like, I love Consul. I think it's a really great mechanism. But I was seeing the same types of Consul incidents at certain companies, like, five or six years apart
Starting point is 00:03:19 from working at them. And I thought that was kind of incredible how folks were using it in the same unintended ways that were leading to certain failure modes. And I just started thinking like, wow, it would be really helpful if we as an industry even started sharing this stuff with each other a little bit more and also even giving these companies tools to understand it more. When I was at Jet, we were having incidents pretty regularly. Our marketing team was crushing it, and we were just growing really fast. And it
Starting point is 00:03:51 was a trade-off at the time, and we had amazing engineers. It was a lot of speed. And so we were trying to figure out what to do with incidents at that time. Well, when you say a lot of incidents, I mean, I've worked in shops where that means different things, so how many, what is a lot? And everyone's going to have questions about that. From my perspective, it's, oh, we've had eight incidents. Oh, when? And of course your company isn't at that. No, no, this morning. And it comes down to at that point when everything's an incident, nothing is, and everyone feels sort of trapped into wherever they are because of architectural decisions, because of business requirements, et cetera. And it's easy to sit here without running an infrastructure myself and have these conversations. But when you're in the middle of it, it feels exhausting, never-ending, and the rest.
Starting point is 00:04:34 And that's such a good point, Corey, is that what is an incident at our company? And I've challenged a few orgs that I've worked with in the past to ask, you know, I'll say, if I ask five different people at this company, what is an incident here? How many different answers am I going to get? And I'll usually get the answer five. Some folks will be cheeky and say, you know, you'll get 10 answers. But that's the problem at some of these organizations is that when everything's an incident, nothing is. And so there needs to be some key business metrics. And that has to be something that people can consistently answer from legal all the way to engineering, to marketing. Everyone should be able to have
Starting point is 00:05:16 a consistent answer on what an incident means. And I don't mean a document that you could pull up that's five pages that I have to figure out if this is an incident, what level incident. I'm trying to find the document. I've wasted 10 minutes at this point and it's two in the morning and I'm really tired. Like none of that should happen. This should just be kind of a consistent KPI level metric that you can grok and people can pull on. And I realize that's different with different products, but at Jet, we were having a lot of incidents. And when I say incidents, I mean, we were having incidents every day. And I worked with amazing folks there that were working around the clock to pull things back together and some of the best and brightest
Starting point is 00:05:54 engineers I've ever seen. But when I went to Netflix, I noticed that everyone kind of knew what an incident was at Netflix. Everyone knew what the key business metrics were and how they impacted things. And it was just, there was this alignment. And so it was never a question on whether something was an incident, on whether something was worth waking up at two in the morning for. It was just understood and baked into the fabric of the culture pretty early on. Unfortunately, this kind of doesn't help in some respects because it feels like it's just another example of,
Starting point is 00:06:29 oh, well, Netflix is, of course, otherworldly and far beyond what any other mortal company could wind up doing. And I don't know that that's necessarily true, but it also feels reminiscent of chaos engineering insofar as getting buy-in to fix things by breaking them on purpose is often a very heavy lift for folks who can't get to a point of stability. Similarly, it feels like learning from incidents is going to be very hard with respect to finding the time to do it when you're
Starting point is 00:06:58 buried in them. It almost feels like you have to educate your customer before you can help them. Is that at all accurate, or am I misunderstanding something dramatic? No, it's a really interesting point, Corey. And when I was at Jet, when we were having all of those incidents, we were kind of reaching a point where we were like, let's just try a few different things. And I think a lot of companies reach that point where they're willing to try something, right, where they're not wanting to wake their engineers up anymore. And so that's when we started trying chaos engineering. And it was helpful from the perspective of helping us understand our culture a little bit more and like who we need to rely on. The hard part was figuring out what to fail, like where to inject chaos and even what to do with the results
Starting point is 00:07:45 afterwards. As a lot of software engineers do, it was kind of like thought of almost as a let's automate it away situation. We can just have a tool running in the background, but that kind of defeated the purpose of chaos engineering. And so when I went to Netflix, I was really excited to join the chaos engineering team there because Netflix had made this percolate and work throughout the company. And I was really excited to be in an organization where it was so widely understood. But as I worked on the team a bit more, and I mean, I was programming probably like 80% of my time at Netflix and as was the rest of the chaos engineering team. But when I went to check who was actually using the chaos engineering tools on our team, it was mostly the four of us building these tools,
Starting point is 00:08:33 which was fine from a certain perspective, but we weren't getting enough ROI out of it. Like the whole purpose of chaos engineering is to actually learn where your weak spots are so that you can be a bit more proactive to them. And we were focusing very much on the injecting failure and we were really focused on mitigating the blast radius so we could do it safely. We were doing some very fascinating things technically, and we were doing some really great stuff with distributed systems in general and, you know, working with other teams.
Starting point is 00:09:05 But what we weren't super focused on was creating the experiment and what to do with the results. And so it was usually us nudging people to create experiments. Some folks would put them in their, you know, continuous deployment pipelines and stuff, which can add a little bit more of the benefit. But actually sitting and taking the time to think about where you want to experiment and what you want to do with the results causes, I think, a bit more ROI from doing chaos engineering. And so I realized those were problem areas at Netflix. And so I started analyzing incidents to try to make a catalyst for like, okay, here's where we should create chaos experiments. And here are the areas where we need to do a lot of stuff with the results. Basically, I started looking at incidents to try to help my chaos tools a little bit. And then I realized there was so much more to
Starting point is 00:09:55 learning from incidents than that. I was finding things like, wow, we bring in this particular engineer all the time. Like they are a knowledge island in this organization or this team is severely underwater right now. Like maybe we should staff them up. And so, yes, it was helpful in informing like where we should chaos experiment and what we should do with those results. Like if we should prioritize them in our action items, but it was also helpful
Starting point is 00:10:19 in a number of things in the business. And we started writing incident reports that were getting read by folks all over the company. And people were learning more about the system because we had taken a deeper look at analyzing these incidents. And when I say analyzing these incidents, I mean looking through the chat transcripts, understanding who got paged, figuring out what team they were on, what tenure they have, things like that. And so that was clearly beneficial. And it was stuff I started doing at Slack too, but it was a lot of manual work. And I can't imagine that most companies would invest that time doing that manual work. And so
Starting point is 00:10:56 we wanted to help do some of that for them. And so basically Jeli gives you shoulders to stand on with your incidents so that you're not coming in at zero. Like we're directing your attention towards places that could use your attention organizationally. And so that this postmortem that you're doing is not a chore or a checklist item just because it's part of this process you ingrained five years ago. It's actually something that's useful for you. So we're showing you places that deserve more attention, maybe like an engineer that got brought in that we hadn't planned on being there and understanding what specific knowledge that they had or understanding that with Kubernetes incidents, like we don't do a great job as an organization figuring out the
Starting point is 00:11:42 right folks to get in the room, or we throw out a lot of theories before we actually figure out what's going on. These are the things that we're showing you. We're helping you get to those places faster so that you can do your postmortems faster, but we're also enhancing the quality of the output at the same time. So when I look back to my operations days, dealing with incidents was always obnoxious. Let me walk you through a minor example of one, and then you can figure out, I guess, well, the audience can figure out more easily what is wrong with the places that I've worked. So things are breaking. Getting the right people on the call is important and almost impossible. So you wind up with your great-great-grandboss on and you have the CEO breathing down your neck.
Starting point is 00:12:27 Is it fixed? Is it fixed? Is it fixed? And then it finally comes up and cool. Now it's time to do a post-mortem, but we don't call them that. So it's going to be an incident retrospective. Right. And you're sitting in the room and it's a blameless post-mortem.
Starting point is 00:12:40 Cool. And you say, great. So that engineer over there screwed this up. It's like, whoa, whoa, whoa, blameless. Okay. So an unnamed engineer screwed this up and it becomes an iterative process. And invariably it almost feels like it's a justify why you're still good at your job exercise in a lot of these places. Help. Yeah, it is. How do I fix that? It's what people know. If your incident is hitting Twitter or you have customers calling, that's when someone from your C-suite is probably going to jump in. And they're probably doing more harm than good. We're actually giving you tools to show you where some of that is hurting. Like
Starting point is 00:13:18 if certain folks jumping in are hurting or helping the situation, we're allowing you to analyze that a little bit better so that you can reduce your costs of coordination during these incidents. And I think costs of coordination are not something that folks tend to look at. I think a lot of companies look at what, quote unquote, caused the incident, and they look at, quote unquote, how to prevent it from ever happening again, when really they should also be looking at how they worked together in that moment. Did this involve teams that had never spoken to each other prior to this event? Well, not politely anyway. Yeah.
Starting point is 00:13:52 Had they been in an incident before? Did the CTO actually hurt, you know, jumping in or were they helpful? Who knew the right people to bring in the room? How many hops did it take to get to those right people? And even like, imagine a world where you can understand who exactly has the information that you're looking for in a particular moment and allowing the incident to go much more smoothly and also just showing you how it didn't go smoothly. I think postmortems and incident reviews, and I don't like the word postmortem either. I tend to use incident review, but if that's a word that you want to hold onto,
Starting point is 00:14:30 you can hold onto it, but I recommend making like little changes at a time. It's already a super charged event. Removing the word postmortem can make it a little less charged. And I think having a tool to help you during that event to point to areas can also make it a little less awkward of a situation where it doesn't feel as blamey and finger-pointy, even if you are using the term blameless postmortem to describe what that meeting is. It can still sometimes feel like that. Whenever you talk about something like a tool in this space, I start getting flashbacks to a number of, I don't want to say failed attempts, but that's really what they were. Looking at previous patterns and then making AI or machine learning driven suggestions about what the outage is likely to be, which generally means they're trying to swindle someone. If you're not sure who it is,
Starting point is 00:15:21 it's probably you. And it looks at previous things and it pops up, if you ever get it to this level of development, with exceedingly unhelpful things. Like last week, a disk filled up. Maybe this time it's a disk that filled up when it is very clearly not that. And with anything that's driven with suggestion-oriented or machine learning-based, it feels like two or three bad suggestions in a row means that no one will trust anything it has to say forever, even if it improves. I mean, take a look at the various digital assistants we have floating around. When you ask Siri to do something and it doesn't work the way you expect it to, you feel a bit dumb for having asked in the first place. Nevermind the fact that a week later it does that thing. You won't go back and try it again.
Starting point is 00:16:02 Yeah, I completely agree. And I am so skeptical of AI ops and anything that's automating everything for you. I think about where we need to go as a software engineering industry, and, like, you know, some insights are helpful, and bubbling those things up for you is super helpful. You know, I think about different things that G Suite does sometimes, you know, like some of the automated responses are useful. Some of them are not, you know, setting up the Zoom meetings accordingly. But I think the tools that work the best are the ones that you treat like a member of your team almost. It's not something that's doing something for you. It's something that you're working with to achieve
Starting point is 00:16:39 the best outcome. And they're still putting something on the table. And that's the mindset that we're building Jelly with. You know, we are showing you some insights, but you still have to do some work on your own. What we're really giving you is a playground to play with your incident a little bit more. Something that's dedicated and built to help you understand this incident a little bit more. To the point where, like, if you signed on to the incident, we've given you some areas to direct your attention, but you still need to put in some time to understand those areas as you would any postmortem. We want to help you facilitate that discussion so that it's productive and people leave the discussion feeling like, wow, this was really a good discussion for us. It's not like we are telling you which questions to ask or which things to fix, like the disuses stuff. We are just giving you focus areas so that you can
Starting point is 00:17:32 see themes over time. And I think how we're really different too is we are really focusing on the people and how to best enable and help them. I've done this sort of pattern analysis and incident analysis at a number of organizations and it is useful and it can provide a lot of recommendations. And I don't want to just automate exactly what I was doing at these organizations, but I do want to automate the beginning stages of that. And that's what we're doing with Jelly is letting you not start from scratch with a postmortem. Because let's face it, when you get assigned a postmortem, you're like, I have to remember how to do this. Okay, let me open up a Google Doc. Okay,
Starting point is 00:18:13 let me pull up 15 Chrome tabs. Okay, I need a DM, so and so. Or it's the other side where you're so used to it, it's habit, and it winds up auto-completing automatically. And that doesn't feel great either. No, it doesn't. And so we're helping that be productive. We're helping you look better. We're helping your organization be a bit more collaborative in these events and just feel more confident about your incidents. This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the enterprise, not the starship.
Starting point is 00:18:42 On-prem security doesn't translate well to cloud or multi-cloud environments, and that's not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35% faster, and helps you act immediately. Ask for a free trial of detection and response
Starting point is 00:19:04 for AWS today at extrahop.com slash trial. So help me understand, who's your target customer for something like this? Is it going to be the hyperscale companies who already have attained a certain level of operational maturity? Is it a brand new startup that has just committed their first line of code yesterday and won't figure out until the end of the month that it has a disastrous effect on their AWS bill? Or is it folks in between? I mean, who is your ideal customer these days? We're working with a number of different companies right now
Starting point is 00:19:34 that are getting value out of Jeli in very different ways. We're working with a Series B company that has about 100 people. We are working with a company that has 10,000 people that's been around for a while. We're working with a company that can measure the exact cost of their incidents at this point. The primary criterion is that it's companies that have incidents right now, and they want something a little bit better. I think everyone in the industry doesn't feel great about their post-incident process.
Starting point is 00:20:02 I think that's a pretty common thread and wants it to be a little bit better. And so we want to help that be a more delightful and less kind of awkward experience for folks that they're actually getting value out of and that it's helping with their internal relationships, it's helping with their customer relationships, it's helping with a number of things. So framing it a little bit differently, if I'm an engineer and I'm consistently frustrated by the way that incidents seem to always happen, turn into blame festivals, et cetera, is that something Jeli can help with? In other words, what is the pain that I have that is going to transform into you jumping up and saying, yes, yes, that's what we fix.
Starting point is 00:20:45 What is the symptom that lets me know as I walk through the world that I'm a prospective fit for what Jelly's doing? So the pain that engineers are experiencing right now, or anyone that has to write this post-incident document, is a pain around creating a timeline and copying and pasting items and figuring out what to focus on in their meeting and the time it takes to do a good job doing this. So we reduce that time for you. We make that faster for you. And we enhance the quality of the output.
Starting point is 00:21:17 And so the target audience, the target pain point is folks that have pain putting this together today and feel like it's a chore and feel like it's seldom a great experience. We want to help make that faster for you and we want you to have a better and different experience. So that's really the pain, which is why we're working with an array of companies
Starting point is 00:21:40 because no matter where you are, you are having incidents if you're having customers, right? And so it's a matter of how you're addressing those. Well, not if you ignore them sufficiently. That's true. But if you do that, they become not customers anymore. And that sort of solves the problem, but not in the way that anyone really wanted it to. Yeah.
Starting point is 00:21:57 So you've been a software engineer. You've been a senior technical leader at a number of different companies. And now you're a founder. What has changed for you or surprised you the most as you went through that path? Yeah, it's an interesting question. A lot of things. I mean, I'm definitely a software engineer at heart, so I still love architecting and writing code. And it's definitely been a big shift to enabling the folks around me to build in this vision too and add to it. I think that's been, it's not really a surprise because it's, you know, I think we have a really great team, but it's been amazing. It's exactly what I want to be doing right now. I can't imagine doing
Starting point is 00:22:44 anything else right now. It's just, I kept having this itch at every company I was at that there has to be something better around incidents. And after seeing these patterns at a number of places, I got the urge to go build it. And there's a lot of folks in the industry that are feeling this pain too, and are dedicated towards making a better solution. And I think that's been a lot of fun. It's hard to go back on some level. Once you started a company, the autonomy, it's scary and it's hard. And it's one of those, I don't ever see a future where I go back to what I used to be. It's sort of a one-way door that you never
Starting point is 00:23:19 really realize it when you're going through it. Yeah, absolutely. So something I've noticed about every company, no matter what it does, I mean, for my own, where I fix Amazon bills, people have hilarious misunderstandings about it. In my case, it's, oh, great, how can I save money on socks? And the answer is, I don't really have a good answer, except I actually kind of do. If you get their prime credit card, it knocks 5% off, but don't quote me on that. And even if it's something relatively straightforward, people don't always get it. What are the most hilarious misapprehensions you've seen so far about what Jeli does? I think some of what you alluded to earlier, it's the AI ops kind of thing. We're certainly providing insights for folks, but you're also participating in the insights. And it's not this AI-focused engine.
Starting point is 00:24:10 I think folks are not used to understanding the value that they can get from looking at the chat transcript, but there is so much in there that is just kind of waiting to be analyzed. And I get it. I don't want to go read a Slack conversation after it occurs. And so we're making it easier for you to do that and glean those insights so that you can get the most value out of them. But yeah, assuming you can, that alone is valuable. People have always been saying,
Starting point is 00:24:39 oh, the chat logs will become super valuable just as soon as we learn how to work with them. And people have been saying that for 15 years, but I've yet to see it really become valuable. I don't find myself scrolling back to look at how conversations unfolded. I search for specific terms. Oh, there's the URL I was looking for. There's the image. Getting more signal than that seems inevitable, but I don't see people doing much with it yet. No, and what I was doing at various companies I was at
Starting point is 00:25:05 is I was reading the chat transcript. I would sometimes print them out, go to my desk, highlight them, write notes on them, figure out who the people were, figure out what teams they were on, figure out who was getting paged. And I just, you know, I ended up having- D minus, you can do better than this.
Starting point is 00:25:20 Please see me after class and mail it to someone. I ended up having a desk that looked like a crime scene where I had sticky notes everywhere and I had yarn, you know, attached to different sticky notes, just trying to connect all the pieces. Like an investigation was unfolding, because that's exactly what it was. And what we're doing- Someone who didn't know better would think you were trying to put together Google's messaging strategy. But at Jeli, we're aggregating all of that for folks. So you can have a more comprehensive picture about how people were coordinating in that moment so that you can reduce those coordination costs in the future. And no one's really looking at that today because it's not easy to do.
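Aggregating that coordination data, who showed up to an incident, from which team, and how often, is the kind of counting that can start as a few lines of analysis before it ever needs a product. The incident records here are made up for illustration:

```python
from collections import Counter

# Hypothetical incident records: who responded, and from which team.
incidents = [
    {"id": "INC-101", "responders": [("alice", "storage"), ("dana", "infra")]},
    {"id": "INC-102", "responders": [("dana", "infra"), ("bob", "payments")]},
    {"id": "INC-103", "responders": [("dana", "infra"), ("alice", "storage")]},
]

# Count appearances per person and per team across all incidents.
person_counts = Counter(p for inc in incidents for p, _ in inc["responders"])
team_counts = Counter(t for inc in incidents for _, t in inc["responders"])

# A candidate "knowledge island": someone present in nearly every incident.
most_paged, n = person_counts.most_common(1)[0]
print(most_paged, n)        # dana shows up in all three incidents here
print(team_counts["infra"])
```

A count like this is only a pointer for attention, the same "focus area" framing used throughout this conversation, not a verdict on any person or team.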
Starting point is 00:25:58 But there's so much data in there that could actually help you really, really improve and make incidents not such a stressful, time-consuming experience. So one other thing that you've been involved with that I wanted to make sure that we got to talk about was you are also the founder of learningfromincidents.io. Is it a community? Is it a movement? I'm not entirely clear, but it sounds directly aligned with what you're doing now, what you have been doing, and what Jeli is setting out to solve for. What is the relationship between learning from incidents as an entity and Jeli as an entity? Yeah. While I was at Netflix and I was getting more deeply into incident analysis,
Starting point is 00:26:40 I kind of had this thought like, wow, I really want to talk to folks from other organizations that are also looking at incidents under a deep lens. Like surely there are more folks. And I, you know, I posted something on Twitter and I think I got like hundreds and hundreds of DMs that night. And I just, you know, I kind of wanted to get like-minded folks together so that we could share our experiences and learn from each other in kind of like a safe space. And so I started a Slack community around that and I got some great people in it. And as we talked for over a year together, we kind of wanted to open source some of these learnings. As I was mentioning earlier, it would be so helpful if companies talked a little bit more openly about their incidents and understood that they can do so without revealing proprietary business information.
Starting point is 00:27:29 And so we started open sourcing some of our learnings on the learningfromincidents.io website. And so it's a community of folks that want to use incidents as a catalyst for helping their organization and helping their businesses. But it's also a place to open source some of those learnings and get stories from folks that are doing it as well. And so, yeah, it's a movement. It's definitely a community. It's both of those things. And I think it's a new take on how the software industry is progressing in the reliability world
Starting point is 00:28:03 is kind of taking a more human-centered and learning-focused approach because ultimately it can be really good for your business. So we've covered an awful lot of ground over the course of this episode. What are the next steps? What should people who are interested in what you're up to do next if they want to learn more or figure out whether they're potentially a fit for some of this stuff that you're talking about and offering solutions to very real painful problems? Yeah. So for learningfromincidents.io, definitely go to the webpage and read some of the posts. There are posts from brilliant folks in that community that are actually doing real
Starting point is 00:28:40 things and chopping the wood and carrying the water. And my focus with that website was, I don't want to talk about the theory. I actually want to do this stuff at companies and have folks talk about how that worked out. And that's what that website is. And I think if your organization is not feeling like you're getting a lot out of your incidents right now and wants a boost and you're interested in Jeli, you can use the contact us form on our webpage right now and we'll reach out to you to set something up. Excellent. We will, of course, include links to that in the show notes. Nora, thank you so much for taking the time to speak with me today. I really appreciate it. Thanks, Corey. Nora Jones, founder and CEO of Jeli. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed
Starting point is 00:29:25 this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with a lengthy comment arguing about exactly whose fault it is. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold. This has been a HumblePod production. Stay humble.
