Screaming in the Cloud - A Chaos Engineering & Jeli Sandwich with Nora Jones
Episode Date: February 11, 2021

About Nora
Nora is the founder and CEO of Jeli. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent to share her experiences helping organizations large and small reach crucial availability with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analysis from reliability incidents across various organizations, and the business impacts of doing so.

Links:
Jeli main webpage: https://www.jeli.io/
Chaos Engineering Book: https://www.amazon.com/Chaos-Engineering-System-Resiliency-Practice/dp/1492043869
Learning From Incidents: https://www.learningfromincidents.io/
Jeli contact us form: https://www.jeli.io/contact-us/
Transcript
Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Nora Jones, who, despite having a storied history,
is probably best known these days for being the founder and CEO of Jeli. Nora, welcome to the
show. Thank you, Corey. So that's Jeli, J-E-L-I. I'll avoid the various jam puns we could wind up
going with. Let's start at the very beginning. What the heck is Jeli? So first of all, please never stop using the puns. We all use puns internally. We call ourselves
Jeli Beans, so it's complete pun-dom in our Slack. But Jeli is an incident analysis platform.
So we've built the first incident analysis platform that allows companies to not only
learn from their incidents, but address
everything they can from them so that they're actually understanding what's contributing to
some of their major failure modes, what they're doing well at versus what they think they're
doing well at and exposing the delta between those two worlds and honestly using incidents as a
catalyst for helping orgs understand themselves better so that they can
make better decisions. This can lead to things like helping them with their OKRs, helping them
with their headcount on teams. It can lead to a number of things. Really, Jelly is using the
incident as a catalyst for helping you understand how you think your org works versus how it actually
works. Okay, let's back up a little bit. You have a history as a software engineer. You were at Jet,
and then you wound up at Netflix. You quite literally wrote the book on chaos engineering.
Then you went to Slack, and now you founded a company of your own that's aimed at this.
What's the common thread there?
So I was actually in hardware prior to Jet, and I was working on reliability there. And
I've been focused on developer productivity, reliability, honestly, my entire career. And
I was seeing some of almost the exact same flavor of incidents happening at some of the companies I
was working at. It's always DNS, a disk fills up, that sort of thing?
Or are you talking about something beyond that?
Honestly, even with certain tools.
Like, I love Consul.
I think it's a really great mechanism.
But I was seeing the same types of Consul incidents
at certain companies, like, five or six years apart
from working at them.
And I thought that was kind of incredible
how folks were using it in the same
unintended ways that were leading to certain failure modes. And I just started thinking like,
wow, it would be really helpful if we as an industry even started sharing this stuff with
each other a little bit more and also even giving these companies tools to understand it more.
When I was at Jet, we were having incidents
pretty regularly. Our marketing team was crushing it, and we were just growing really fast. And it
was a trade-off at the time, and we had amazing engineers. It was a lot of speed. And so we were
trying to figure out what to do with incidents at that time. Well, when you say a lot of incidents,
I mean, I've worked in shops where that means different things:
so how many? What is a lot? And everyone's going to have questions about that. From my perspective, it's, oh, we've had eight incidents. Oh, when? And of course your company isn't at that. No,
no, this morning. And it comes down to at that point when everything's an incident, nothing is,
and everyone feels sort of trapped into wherever they are because of architectural decisions,
because of business requirements, et cetera. And it's easy to sit here without running an infrastructure myself and have these
conversations. But when you're in the middle of it, it feels exhausting, never-ending, and the rest.
And that's such a good point, Corey: what is an incident at our company? And I've
challenged a few orgs that I've worked with in the past to ask, you know, I'll say, if I ask five different people at this company, what is an incident here?
How many different answers am I going to get?
And I'll usually get the answer five.
Some folks will be cheeky and say, you know, you'll get 10 answers.
But that's the problem at some of these organizations: when everything's an incident, nothing is. And so there need to be
some key business metrics. And that has to be something that people can consistently answer,
from legal all the way to engineering, to marketing. Everyone should be able to have
a consistent answer on what an incident means. And I don't mean a five-page document that I have to pull
up to figure out whether this is an incident and what level of incident it is. I'm trying to find the document. I've wasted 10 minutes at this point,
and it's two in the morning, and I'm really tired. None of that should happen. This should just
be kind of a consistent, KPI-level metric that you can grok and people can pull on. And I realize
that's different with different products, but at Jet, we were having a lot of incidents. And when
I say incidents, I mean, we were having incidents every day. And I worked with amazing folks there
that were working around the clock to pull things back together and some of the best and brightest
engineers I've ever seen. But when I went to Netflix, I noticed that everyone kind of knew
what an incident was at Netflix. Everyone knew what the key business metrics were and how they impacted things.
And it was just, there was this alignment.
And so it was never a question on whether something was an incident, on whether something
was worth waking up at two in the morning for.
It was just understood and baked into the fabric of the culture pretty early on.
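To make that idea of a single, grokkable incident definition concrete, here is a minimal Python sketch of what an incident test tied to key business metrics might look like. The metric names and thresholds are purely illustrative assumptions, not Netflix's, Jet's, or Jeli's actual criteria.

```python
# A purely illustrative sketch: one shared, grokkable rule for "is this an
# incident?", tied to a couple of key business metrics instead of a
# five-page severity document. Metric names and thresholds are made up.

KEY_BUSINESS_METRICS = {
    # metric -> (threshold, direction that counts as a breach)
    "checkout_success_rate": (0.995, "below"),
    "api_p99_latency_seconds": (2.0, "above"),
}

def is_incident(observed: dict) -> bool:
    """An incident is declared when any key business metric is out of range."""
    for metric, (threshold, direction) in KEY_BUSINESS_METRICS.items():
        value = observed.get(metric)
        if value is None:
            continue
        if direction == "below" and value < threshold:
            return True
        if direction == "above" and value > threshold:
            return True
    return False

# A checkout success rate of 98.7% breaches the agreed floor, so: incident.
print(is_incident({"checkout_success_rate": 0.987}))  # True
```

The point of the sketch is the shape of the rule: anyone from legal to engineering can answer "is this an incident?" by looking at a short list of agreed metrics, rather than hunting for a document at two in the morning.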
Unfortunately, this kind of doesn't help in some respects
because it feels like it's just another example of,
oh, well, Netflix is, of course, otherworldly
and far beyond what any other mortal company
could wind up doing.
And I don't know that that's necessarily true,
but it also feels reminiscent of chaos engineering
insofar as getting buy-in to fix things by breaking them on purpose is often a
very heavy lift for folks who can't get to a point of stability. Similarly, it feels like learning
from incidents is going to be very hard with respect to finding the time to do it when you're
buried in them. It almost feels like you have to educate your customer before you can help them.
Is that at all accurate, or am I misunderstanding something dramatic?
No, it's a really interesting point, Corey.
And when I was at Jet, when we were having all of those incidents, we were kind of reaching a point where we were like, let's just try a few different things.
And I think a lot of companies reach that point where they're willing to try something, right, where they're not wanting to wake their engineers up anymore. And so that's when we started trying chaos engineering. And it
was helpful from the perspective of helping us understand our culture a little bit more and like
who we need to rely on. The hard part was figuring out what to fail, like where to inject
chaos and even what to do with the results
afterwards. As a lot of software engineers do, we kind of thought of it almost as a
let's-automate-it-away situation. We can just have a tool running in the background, but that kind of
defeated the purpose of chaos engineering. And so when I went to Netflix, I was really excited to
join the chaos engineering team there because Netflix had made this percolate and work throughout the company.
And I was really excited to be in an organization where it was so widely understood.
But as I worked on the team a bit more (and I mean, I was programming probably 80% of my time at Netflix, as was the rest of the chaos engineering team), when I went to check who was actually using
the chaos engineering tools on our team,
it was mostly the four of us building these tools,
which was fine from a certain perspective,
but we weren't getting enough ROI out of it.
The whole purpose of chaos engineering
is to actually learn where your weak spots are
so that you can be a bit more proactive about them. And we were focusing very much on injecting failure, and we were really focused on
mitigating the blast radius so we could do it safely. We were doing some very fascinating things
technically, and we were doing some really great stuff with distributed systems in general and,
you know, working with other teams.
But what we weren't super focused on was creating the experiment and what to do with the results.
And so it was usually us nudging people to create experiments.
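For readers who haven't run one, the experiment-design step Nora is describing usually pairs a steady-state hypothesis with a deliberately small blast radius. Here is a minimal, hypothetical Python sketch; the service name, the metric function, and the injection function are placeholders rather than any real chaos tool's API.

```python
# A minimal, hypothetical sketch of designing a chaos experiment:
# state a steady-state hypothesis up front, keep the blast radius small,
# and decide explicitly what to do with the result.
import random

def steady_state_metric() -> float:
    # Placeholder: e.g. successful requests per second from your monitoring.
    return 1000 + random.uniform(-20, 20)

def inject_latency(service: str, canary_fraction: float) -> None:
    # Placeholder for injecting failure into a small slice of traffic.
    print(f"Injecting 300ms latency into {canary_fraction:.0%} of {service} traffic")

def run_experiment() -> None:
    hypothesis = ("Adding 300ms of latency to the ratings service does not "
                  "reduce successful requests by more than 1%.")
    baseline = steady_state_metric()
    inject_latency("ratings-service", canary_fraction=0.01)  # small blast radius
    observed = steady_state_metric()
    held = observed >= baseline * 0.99
    # The part that is easy to skip: record the outcome and follow up on it,
    # rather than leaving the tool running unattended in the background.
    print(f"Hypothesis: {hypothesis}")
    print(f"Baseline {baseline:.0f}/s, observed {observed:.0f}/s, "
          f"hypothesis {'held' if held else 'disproved: create a follow-up'}")

run_experiment()
```

The loop itself (hypothesis, baseline, small injection, comparison, explicit follow-up) is what takes the sitting-down time Nora mentions, and it is the part that a tool running silently in a pipeline tends to leave out.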
Some folks would put them in their, you know, continuous deployment pipelines and stuff, which can add a little bit more benefit. But actually sitting and taking the time to think about where you want to experiment and what you want to do
with the results yields, I think, a bit more ROI from doing chaos engineering. And so I realized
those were problem areas at Netflix. And so I started analyzing incidents to try to make a
catalyst for like, okay, here's where we should create chaos experiments. And here are the areas
where we need to do a lot of stuff with the results. Basically, I started looking at incidents
to try to help my chaos tools a little bit. And then I realized there was so much more to
learning from incidents than that. I was finding things like, wow, we bring in this particular
engineer all the time. Like they are a knowledge island in this organization or this team is severely underwater right now.
Like maybe we should staff them up.
And so, yes, it was helpful in informing
like where we should chaos experiment
and what we should do with those results.
Like if we should prioritize them in our action items,
but it was also helpful
in a number of things in the business.
And we started writing incident reports
that were getting read by folks all over the company. And people were learning more about the system because we had
taken a deeper look at analyzing these incidents. And when I say analyzing these incidents, I mean
looking through the chat transcripts, understanding who got paged, figuring out what team they were on,
what tenure they have, things like that. And so that was clearly
beneficial. And it was stuff I started doing at Slack too, but it was a lot of manual work.
And I can't imagine that most companies would invest that time doing that manual work. And so
we wanted to help do some of that for them. And so basically Jeli gives you shoulders to stand
on with your incidents so that you're not coming in at zero. Like we're directing your attention towards places that could use your attention
organizationally. And so that this postmortem that you're doing is not a chore or a checklist item
just because it's part of this process you ingrained five years ago. It's actually something
that's useful for you. So we're showing you places
that deserve more attention, maybe like an engineer that got brought in that we hadn't
planned on being there and understanding what specific knowledge that they had or understanding
that with Kubernetes incidents, like we don't do a great job as an organization figuring out the
right folks to get in the room, or we throw out a lot of theories before we actually figure out what's going on. These are the things that we're
showing you. We're helping you get to those places faster so that you can do your postmortems faster,
but we're also enhancing the quality of the output at the same time.
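As a concrete illustration of the analysis Nora describes (reading the chat transcript, seeing who got paged, and mapping people to teams and tenure), here is a small, hypothetical Python sketch. The message format, the bot name, and the people directory are invented for illustration; a real tool would read these from a chat export or API.

```python
# A hedged sketch of the manual incident analysis described above: read an
# incident channel's transcript, pull out who was paged, who participated,
# and what team and tenure each person has.
from collections import Counter
from datetime import date

people = {
    "alice": {"team": "payments", "hired": date(2016, 3, 1)},
    "bob": {"team": "infra", "hired": date(2020, 9, 15)},
}

transcript = [
    {"author": "pagerbot", "text": "paging @alice for checkout-errors"},
    {"author": "alice", "text": "looking now, Consul health checks are flapping"},
    {"author": "bob", "text": "rolling the canary back"},
]

# Who got paged, based on the paging bot's messages.
paged = [msg["text"].split("@")[1].split()[0]
         for msg in transcript
         if msg["author"] == "pagerbot" and "@" in msg["text"]]

# Who actually participated, and how much.
participation = Counter(msg["author"] for msg in transcript
                        if msg["author"] != "pagerbot")

print("Paged:", paged)
for person, messages in participation.items():
    info = people.get(person, {})
    tenure_years = (date.today() - info["hired"]).days / 365 if info else None
    tenure = f"~{tenure_years:.1f} years" if tenure_years else "unknown tenure"
    print(f"{person}: team={info.get('team', 'unknown')}, {tenure}, {messages} messages")
```

Even this toy version surfaces the kinds of signals mentioned above: which engineer keeps getting pulled in, which team they sit on, and how long they have been around.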
So when I look back to my operations days, dealing with incidents was always obnoxious.
Let me walk you through a minor example of one, and then you can figure out, I guess, well, the audience can figure out more easily what is wrong with the places that I've worked.
So things are breaking.
Getting the right people on the call is important and almost impossible.
So you wind up with your great-great-grandboss on the call and you have the CEO breathing down your neck.
Is it fixed? Is it fixed? Is it fixed?
And then it finally comes up and cool.
Now it's time to do a post-mortem,
but we don't call them that.
So it's going to be an incident retrospective.
Right.
And you're sitting in the room
and it's a blameless post-mortem.
Cool. And you say, great.
So that engineer over there screwed this up.
It's like, whoa, whoa, whoa, blameless. Okay. So an unnamed engineer screwed this up and it becomes an iterative
process. And invariably it almost feels like it's a justify why you're still good at your job
exercise in a lot of these places. Help. Yeah, it is.
How do I fix that? It's what people know. If your incident is hitting Twitter or you have customers calling,
that's when someone from your C-suite is probably going to jump in. And they're probably doing more
harm than good. We're actually giving you tools to show you where some of that is hurting. Like
if certain folks jumping in are hurting or helping the situation, we're allowing you to analyze that
a little bit better so that you can reduce your costs of coordination during these incidents. And I think costs of coordination are
not something that folks tend to look at. I think a lot of companies look at what, quote unquote,
caused the incident, and they look at, quote unquote, how to prevent it from ever happening
again, when really they should also be looking at how they worked together in that moment.
Did this involve teams that had never spoken to each other prior to this event?
Well, not politely anyway.
Yeah.
Had they been in an incident before?
Did the CTO jumping in actually hurt, you know, or were they helpful?
Who knew the right people to bring in the room?
How many hops did it take to get to those right people? And even, like,
imagine a world where you can understand exactly who has the information that you're looking
for in a particular moment, allowing the incident to go much more smoothly, and also just
showing you how it didn't go smoothly. I think postmortems and incident reviews, and I don't like the word
postmortem either. I tend to use incident review, but if that's a word that you want to hold onto,
you can hold onto it, but I recommend making like little changes at a time.
It's already a super charged event. Removing the word postmortem can make it a little less charged.
And I think having a tool to help you during that event to point to areas can also make it a little less awkward of a situation where it doesn't feel as blamey and finger-pointy, even if you are using the term blameless postmortem to describe what that meeting is.
It can still sometimes feel like that.
Whenever you talk about something like a tool in this space, I start getting flashbacks to a number of,
I don't want to say failed attempts, but that's really what they were. Looking at previous
patterns and then making AI or machine learning driven suggestions about what the outage is likely
to be, which generally means they're trying to swindle someone. If you're not sure who it is,
it's probably you. And it looks at previous things and it pops up, if you ever get it to this level of development, with exceedingly unhelpful things.
Like last week, a disk filled up. Maybe this time it's a disk that filled up when it is very clearly
not that. And with anything that's suggestion-oriented or machine learning-based,
it feels like two or three bad suggestions in a row mean that
no one will trust anything it has to say ever again, even if it improves. I mean, take a look at the
various digital assistants we have floating around. When you ask Siri to do something and it doesn't
work the way you expect it to, you feel a bit dumb for having asked in the first place.
Never mind the fact that a week later it does that thing. You won't go back and try it again.
Yeah, I completely agree. And I am so skeptical
of AIOps and anything that's automating everything for you. In terms of where we need to go
as a software engineering industry, you know, some insights are helpful, and bubbling those
things up for you is super helpful. You know, I think about different things that G Suite does
sometimes, you know, like some of the automated responses are useful. Some of them are not,
you know, setting up the Zoom meetings accordingly. But I think the tools
that work the best are the ones that you treat like a member of your team almost. It's not
something that's doing something for you. It's something that you're working with to achieve
the best outcome. And they're still putting something on the table. And that's the mindset
that we're building Jeli with. You know, we are showing you some insights, but you still have to do some work on your own.
What we're really giving you is a playground to play with your incident a little bit more.
Something that's dedicated and built to help you understand this incident a little bit more.
To the point where, like, if you signed on to the incident, we've given you some areas to direct your attention, but you still need to put in some time to understand those areas as you would any postmortem.
We want to help you facilitate that discussion so that it's productive and people leave the discussion feeling like, wow, this was really a good discussion for us.
It's not like we are telling you which questions to ask or
which things to fix, like the disk issues stuff. We are just giving you focus areas so that you can
see themes over time. And I think how we're really different too is we are really focusing on the
people and how to best enable and help them. I've done this sort of pattern analysis and
incident analysis at a number of organizations and it is useful and it can provide a lot of
recommendations. And I don't want to just automate exactly what I was doing at these organizations,
but I do want to automate the beginning stages of that. And that's what we're doing with Jeli:
letting you not
start from scratch with a postmortem. Because let's face it, when you get assigned a postmortem,
you're like, I have to remember how to do this. Okay, let me open up a Google Doc. Okay,
let me pull up 15 Chrome tabs. Okay, I need to DM so-and-so.
Or it's the other side where you're so used to it, it's habit, and it winds up auto-completing
automatically. And that doesn't feel great either.
No, it doesn't. And so we're helping that be productive.
We're helping you look better.
We're helping your organization be a bit more collaborative in these events and just feel more confident about your incidents.
This episode is sponsored by ExtraHop.
ExtraHop provides threat detection and response for the enterprise, not the starship.
On-prem security doesn't translate well
to cloud or multi-cloud environments,
and that's not even counting IoT.
ExtraHop automatically discovers everything
inside the perimeter, including your cloud workloads
and IoT devices, detects these threats up to 35% faster,
and helps you act immediately.
Ask for a free trial of detection and response
for AWS
today at extrahop.com slash trial. So help me understand, who's your target customer for
something like this? Is it going to be the hyperscale companies who already have attained
a certain level of operational maturity? Is it a brand new startup that has just committed their
first line of code yesterday and won't figure out until the end of the month that it has a disastrous effect on their AWS bill?
Or is it folks in between?
I mean, who is your ideal customer these days?
We're working with a number of different companies right now
that are getting value out of Jeli in very different ways.
We're working with a Series B company
that has about 100 people.
We are working with a company that has 10,000 people
that's been around for a while.
We're working with a company that can measure the exact cost of their incidents at this point.
The primary criterion is that they're companies that have incidents right now, and they want something a little bit better.
I think everyone in the industry doesn't feel great about their post-incident process.
I think that's a pretty common thread, and everyone wants it
to be a little bit better. And so we want to help that be a more delightful and less kind of awkward
experience for folks that they're actually getting value out of and that it's helping with their
internal relationships, it's helping with their customer relationships, it's helping with a number of things. So framing it a little bit
differently, if I'm an engineer and I'm consistently frustrated by the way that incidents seem to always
happen, turn into blame festivals, et cetera, is that something Jeli can help with? In other words,
what is the pain that I have that is going to transform into you jumping up and saying, yes,
yes, that's what we fix.
What is the symptom that lets me know as I walk through the world that I'm a prospective fit for
what Jeli's doing? So the pain that engineers are experiencing right now, or anyone that has
to write this post-incident document, is a pain around creating a timeline and copying and pasting
items and figuring out what to focus on in their meeting
and the time it takes to do a good job doing this.
So we reduce that time for you.
We make that faster for you.
And we enhance the quality of the output.
And so the target audience, the target pain point
is folks that have pain putting this together today
and feel like it's a chore
and feel like it's seldom a great experience.
We want to help make that faster for you
and we want you to have a better and different experience.
So that's really the pain,
which is why we're working with an array of companies
because no matter where you are,
you are having incidents if you're having customers, right?
And so it's a matter of how you're addressing those.
Well, not if you ignore them sufficiently.
That's true.
But if you do that, they become not customers anymore.
And that sort of solves the problem, but not in the way that anyone really wanted it to.
Yeah.
So you've been a software engineer.
You've been a senior technical leader at a number of different companies.
And now you're a founder. What has changed for you or surprised you the most as you went through
that path? Yeah, it's an interesting question. A lot of things. I mean, I'm definitely a software
engineer at heart, so I still love architecting and writing code. And it's definitely been a big shift to enabling the
folks around me to build in this vision too and add to it. I think that's been,
it's not really a surprise because it's, you know, I think we have a really great team,
but it's been amazing. It's exactly what I want to be doing right now. I can't imagine doing
anything else right now.
It's just, I kept having this itch at every company I was at that there has to be something
better around incidents. And after seeing these patterns at a number of places, I got the urge
to go build it. And there's a lot of folks in the industry that are feeling this pain too,
and are dedicated towards making a better
solution. And I think that's been a lot of fun. It's hard to go back on some level. Once you've
started a company, the autonomy, it's scary and it's hard. And it's one of those things, I don't ever
see a future where I go back to what I used to be. It's sort of a one-way door that you never
really realize you're going through until you're through it. Yeah, absolutely. So something I've noticed about every company, no matter what it does, I mean, for my own,
where I fix Amazon bills, people have hilarious misunderstandings about it. In my case, it's,
oh, great, how can I save money on socks? And the answer is, I don't really have a good answer,
except I actually kind of do. If you get their Prime credit card, it knocks 5% off, but don't quote me on that. And even if it's something relatively
straightforward, people don't always get it. What are the most hilarious misapprehensions
you've seen so far about what Jeli does? I think some of what you alluded to earlier,
it's the AI ops kind of thing. We're certainly providing insights for folks, but you're also participating in the insights.
And it's not this AI-focused engine.
I think folks are not used to understanding the value
that they can get from looking at the chat transcript,
but there is so much in there
that is just kind of waiting to be analyzed.
And I get it. I don't want to
go read a Slack conversation after it occurs. And so we're making it easier for you to do that
and glean those insights so that you can get the most value out of them.
But yeah, assuming you can, that alone is valuable. People have always been saying,
oh, the chat logs will become super valuable just as soon as we learn how to work with them.
And people have been saying that for 15 years, but I've yet to see it really become valuable.
I don't find myself scrolling back to look at how conversations unfolded.
I search for specific terms.
Oh, there's the URL I was looking for.
There's the image.
Getting more signal than that seems inevitable, but I don't see people doing much with it yet.
No, and what I was doing at various companies I was at
is I was reading the chat transcript.
I would sometimes print them out,
go at my desk, highlight them, write notes on them,
figure out who the people were,
figure out what teams they were on,
figure out who was getting paged.
And I just, you know, I ended up having-
D minus, you can do better than this.
Please see me after class and mail it to someone.
I ended up having a desk that looked like
a crime scene, where I had sticky notes everywhere and I had yarn,
you know, attached to different sticky notes, just trying to connect all the pieces.
Like an investigation was unfolding, because that's exactly what it was. And what we're doing...
Someone who didn't know better would think you were trying to put together Google's messaging strategy.
But at Jeli, we're aggregating all of that for folks.
So you can have a more comprehensive picture about how people were coordinating in that moment so that you can reduce those coordination costs in the future. And no one's really looking at that today because it's not easy to do.
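One small, hypothetical example of the coordination-cost signals mentioned earlier (teams that had never worked together before, how many groups ended up in the room): the Python sketch below counts the distinct teams in an incident channel and flags team pairs collaborating for the first time. The roster, team assignments, and collaboration history are invented for illustration.

```python
# A hypothetical sketch of one coordination-cost signal: how many distinct
# teams ended up in the incident channel, and which pairs of teams had never
# worked an incident together before.
from itertools import combinations

team_of = {"alice": "payments", "bob": "infra", "carol": "mobile", "dan": "payments"}

incident_participants = ["alice", "bob", "carol", "dan"]
# Pairs of teams that have worked an incident together before (made-up history).
previously_collaborated = {frozenset({"payments", "infra"})}

teams_involved = {team_of[p] for p in incident_participants}
first_time_pairs = [pair for pair in combinations(sorted(teams_involved), 2)
                    if frozenset(pair) not in previously_collaborated]

print(f"{len(teams_involved)} teams involved: {sorted(teams_involved)}")
print("Team pairs coordinating for the first time:", first_time_pairs)
```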
But there's so much data in there that could actually help you really, really improve and make incidents not such a stressful,
time-consuming experience. So one other thing that you've been involved with that I wanted to
make sure that we got to talk about was that you are also the founder of learningfromincidents.io.
Is it a community? Is it a movement? I'm not entirely clear, but it sounds directly aligned
with what you're doing now,
what you have been doing, and what Jeli is setting out to solve for. What is the relationship
between Learning From Incidents as an entity and Jeli as an entity?
Yeah. While I was at Netflix and I was getting more deeply into incident analysis,
I kind of had this thought like, wow, I really want to talk to folks from other
organizations that are also looking at incidents under a deep lens. Like surely there are more
folks. And I, you know, I posted something on Twitter and I think I got like hundreds and
hundreds of DMs that night. And I just, you know, I kind of wanted to get like-minded folks together
so that we could share our experiences and learn from each other in kind of like a safe space. And so I started a Slack community around that and I got some great people
in it. And as we talked for over a year together, we kind of wanted to open source some of these
learnings. As I was mentioning earlier, it would be so helpful if companies talked a little bit
more openly about their incidents and understood that they can do so without revealing proprietary business information.
And so we started open-sourcing some of our learnings on the learningfromincidents.io website.
And so it's a community of folks that want to use incidents as a catalyst for helping their organization and helping their businesses. But it's also a place to open source some of those learnings
and get stories from folks that are doing it as well.
And so, yeah, it's a movement.
It's definitely a community.
It's both of those things.
And I think it's a new take on how the software industry
is progressing in the reliability world:
taking a more human-centered
and learning-focused approach, because ultimately it can be really good for your business.
So we've covered an awful lot of ground over the course of this episode. What are the next steps?
What should people who are interested in what you're up to do next if they want to learn more
or figure out whether they're potentially a fit for some
of this stuff that you're talking about and offering solutions to very real, painful problems?
Yeah. So for learningfromincidents.io, definitely go to the webpage and read some
of the posts. There's posts from brilliant folks in that community that are actually doing real
things and chopping the wood and carrying the water. And my focus with that website was, I don't want to talk about the theory. I actually want to do this stuff at companies
and have folks talk about how that worked out. And that's what that website is.
And I think if your organization is not feeling like you're getting a lot out of your incidents
right now and wants a boost and you're interested in Jeli, you can use the contact us form on our webpage right
now and we'll reach out to you to set something up. Excellent. We will, of course, include links
to that in the show notes. Nora, thank you so much for taking the time to speak with me today.
I really appreciate it. Thanks, Corey. Nora Jones, founder and CEO of Jeli. I'm cloud economist
Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed
this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you
hated this podcast, please leave a five-star review on your podcast platform of choice,
along with a lengthy comment arguing about exactly whose fault it is.
This has been this week's episode of Screaming in the Cloud. You can also find more Corey at
screaminginthecloud.com or wherever fine snark is sold.
This has been a HumblePod production.
Stay humble.