CoRecursive: Coding Stories - Tech Talk: Test in Production and being On-Call with Charity Majors
Episode Date: August 31, 2018. Tech Talks are in-depth technical discussions. "Metrics and Dashboards can die in a fire and every software engineer should be on-call" - Charity Majors. Today's interview is with Charity Majors. We talk about how to make it easier to debug production issues in today's world of complicated distributed systems. A warning: there is some explicit language in this interview. I originally saw a talk by Charity where she said something like fuck your metrics and dashboards, you should test in production more. It was a pretty hyperbolic statement, but backed up with a lot of great insights. I think you'll find this interview similarly insightful. Charity and her company are probably best known for popularizing the concept that observability is the key to being able to debug issues in production. Also, if you are a talented developer with functional programming skills, I've got a job offer for you. My employer Tenable is hiring. Tenable is a pretty great place to work. Here is a job link. Show notes: Facebook Scuba Observability Talk, the-engineer-manager-pendulum, HoneyComb.io, Show Link
Transcript
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development.
I am Adam, your host.
Metrics and dashboards can all die in a fire.
And every software engineer should be on call.
Hey, today's interview is with Charity Majors. We talk about
how to make it easier to debug production issues in today's world of complicated distributed systems.
A warning: there is some explicit language in this interview.
I originally saw Charity give a talk where she said something like,
fuck your metrics and dashboards, you should just test in production more.
It was a pretty hyperbolic statement, but she ended up backing it up with a lot of great insights.
And I think you'll find this interview similarly insightful.
If you are a talented developer with functional programming skills, I've got a job offer for you.
My employer, Tenable, is hiring.
Hit me up via email to find out more.
I'll also drop a link in the show notes.
Tenable is a pretty great place to work, if you ask me.
Okay, I'm hearing a little.
Can you actually hear me?
Oh, I can hear you now.
Oh, really?
Yeah.
Okay, let's call this the beginning. Cool. Charity, you are the CEO of Honeycomb.io.
Accidental CEO.
Accidental CEO. Well, thanks for joining me on the podcast.
Yeah, my pleasure. It's really nice to be here. Thanks.
So I used to be able to debug production issues.
Like something would go wrong, some ops person would come and get me,
and then we'd look at things and we'd find out whatever.
There's some query that's running on this database that's just causing a bunch of issues,
and we'll knock it out, or, okay, we need to turn off this feature and add some caching in front of it
And, you know, I always felt like a hero. It mostly works.
Yeah, yeah.
And now I've woken up into this dark future where, first of all, now, like, I get paged before the ops person
sometimes, and then, like, things are just crazy complicated. There's like more databases than people, it seems.
And like every, every product that Amazon...
10 microservices per developer.
Yeah.
Yeah.
So that's why I wanted to have you on, because I feel like, I'm
hoping that maybe you have an answer for all this mess.
Oh yeah. Oh, yeah.
Well, I do.
The answer is everything is getting harder.
And we need to approach this not as an afterthought as an industry,
but as something that we invest in,
that we expect to spend money on,
that we expect to spend time on.
And we have to upgrade our tools.
Like the model that you described,
where you have your trusty, crusty ops buddy
who kind of guides you through the subterranean passages,
that worked because our systems used to be tractable.
You could have a dashboard, you could glance at it,
you could pretty much know at a glance
where the problem was, if not what the problem was,
and you could go fix it, right?
Whether it's launching GDB or poking around,
or, like, pairing somebody with a bunch of infrastructure knowledge,
query sniffing, whatever.
Finding the component that was at fault was easy.
And so you just needed localized knowledge in that code base
or technology or whatever.
As you mentioned, this basically doesn't work anymore for any
moderately complex system.
The systems often
loop back into themselves. Then there are platforms:
like, when you're a platform, you're inviting
all your users' chaos to come live on your servers
with you, and you just have to make it
work and make sure it doesn't hurt
any of your other users.
Complex co-tenancy problems like that.
There's ripple effects.
There's thundering herds.
There's God knows how many programming languages and how many storage systems.
And databases, don't even get me started.
I come from databases, right?
So I am, yeah.
Anyway, the way I have been thinking of it is like we're just kind of hitting,
everyone is hitting a cliff where suddenly, and it's pretty sudden,
all of your tools and your
tactics that have gotten you to this point
no longer work.
And so
this was exactly what happened to us.
So my co-founder,
Christine and I are from Parse, which was
the mobile backend as a service
acquired by Facebook.
And I was there. I was the first infrastructure engineer.
And Parse is a beautiful product.
We just made promises, you know?
This is the best way to build a mobile app.
You don't need to worry about the backend.
You don't need to worry about the storage model or anything.
We make it work for you.
It's magic.
Which you can translate as: a lot of work on our end.
And around the time we got acquired by Facebook,
I think we were serving about
60,000 mobile
developers, which is
not trivial.
And this is also when I was coming to think, with dawning horror, that we had built a system that was effectively undebuggable by some of the best engineers in the world.
Like, both of our backend teams were spending, like, all of our time tracking down one-offs, which is the kiss of death if you're a platform.
So they'd be like, Parse is down, every day.
And I'd be like, Parse is not down, dude.
Behold, my wall full of dashboards.
They're all green.
Check your Wi-Fi.
Arguing with your users is always a great strategy.
But I'd dispatch an engineer
to go try and figure out what was wrong.
It could be anything.
We let them write their own queries and upload them.
We just had to make them work.
We let them write their own JavaScript and upload it. We just had to make it work.
So we could spend more time than we had just tracking down these one-offs,
and it was just failing.
I'll fast-forward through all the things I tried.
The one thing that finally helped us make a dent, helped us get ahead of our problems, was
this janky-ass,
unloved tool at Facebook called
Scuba that they had used to debug their
MySQL databases a few years ago.
It's aggressively hostile to users,
but it just lets you slice and dice on any dimensions,
in basically real time,
and they can all be high-cardinality fields.
And high cardinality didn't mean
anything to me at the time. But, like, we got a handle on our shit, and then I moved on, right?
Because I'm awesome like that, on to the next fire. And it wasn't until I was leaving Facebook that
I kind of went, wait a minute, I no longer know how to engineer without the stuff that we've built
and Scuba. Why is that? Like, how did it worm its way into
my soul to the point where I'm like, this is how
I understand what's happening in my production systems?
It's like
getting glasses and then being
told you can't have them anymore.
How am I even going to know how to navigate in the world?
We've been thinking
about this for a couple years now and
I'll pause for breath here in a second.
But I don't
want to say that, like, Honeycomb is the only way to do this. Honeycomb is the result of all of this trauma
that we have endured when our systems hit this cliff of complexity. And we really thought
at first it was just platforms that were going to hit this, and then we realized, no, everyone's hitting
it, because it's a function of the complexity of the systems. You can't hold it in your brain anymore.
You have to reason about it
by putting it in a tool where you and others
can navigate the information.
So how did Scuba...
Is that what it was called? Scuba? Yeah.
So what did it...
What did it consume? Structured
data. It's
agnostic about it. Mostly logs.
But it was just
the fact that it was fast. There was no, like,
you know, having to construct a query and walk away to get coffee and come back. Because
when you're debugging, you're asking lots of small questions as quickly as you can,
right? You're following the cookie crumbs instead of crafting one big question that you know will give
you the answer, because you don't know it's going to be the answer. You don't even know what the
question is, right? Also, high cardinality.
When I say that, I mean,
imagine you have a table with 100 million users.
High cardinality fields
are going to be, the highest will be anything that's
unique ID, social security number.
Very high cardinality would be
last name and first name. Very low would be
gender and species. I assume
it's the lowest of all.
The reason, I was laughing
when you said, fuck metrics.
I've said that many times. The reason that I
hate metrics so much, and this is what
20 years of operations software
is built on, is the metric, right? Well, the metric
is just a number. Then you can
append tags to it to help you group them.
You're limited in cardinality to the number of
tags you can have, generally, which is 100
to 300. So you can't have more than 300 unique IDs to group those by,
which is incredibly limiting.
Some newer things like Prometheus,
like you put key value pairs in there, which is a little bit better,
but bottom line, it's very limited,
and you have to think, you have to structure your question,
your data just right.
Like all the advice you can get online
about how to try not to have these problems,
which, when you think about it, is stupid,
because all of the interesting
information that you're ever going to care about is high cardinality.
Request ID,
raw query, you know?
You just
need this information so desperately.
And so that
I think was the killer thing for Scuba. It was the first
time I'd ever gotten to work with a data system
that just let you have arbitrary...
So imagine a common thing that companies will do as they grow is,
well, they have some big customers who are more important to them.
So they pre-generate all the dashboards for those customers
because they don't have the ability to just break down
by that 1 in 10 million user IDs,
and then any combination of anything else.
When you can just do that, so many things get so simple.
To make sure I understand it.
So, like, with metrics, I have, like, Datadog, right?
And I have a Datadog metric where, like,
basically I'll measure this request on my microservice or whatever, like how long does this normally take, right? So it has
just the time that it takes from start to end on that. And I can put it on
a graph or whatever. So high cardinality, if I understand it, is saying,
let's not just count this single number. Let's count everything.
What's the user that requested it?
It's more like,
so every metric is a number
that is devoid of context, right?
It's just a number with some text.
But the way that Scuba and Honeycomb work
is we work on events,
arbitrarily wide events.
You can have hundreds;
well-instrumented services
usually have 300 to 500 dimensions.
So all of that data for
the request is in one event.
The way we typically will instrument
is that you initialize an event
when the request enters the service.
We pre-populate with some useful stuff, but then
throughout, while the request is
being served, you just toss in whatever you
think might possibly be interesting someday.
Any IDs, any shopping cart information,
any raw queries, any normalized queries,
any timing information,
every hop to any other
microservice. And then when the request
is going to exit or error, you just ship
that event off to us or to Scuba.
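To make that pattern concrete, here is a minimal sketch of the one-wide-event-per-request idea in Go, using only the standard library and printing JSON to stdout in place of a real client; the service name, fields, and handler are illustrative assumptions, not Honeycomb's actual SDK.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"
)

// WideEvent accumulates everything interesting about one request.
// A real setup would use a vendor client; this just writes JSON to stdout.
type WideEvent map[string]interface{}

func (e WideEvent) Send() {
	_ = json.NewEncoder(os.Stdout).Encode(e)
}

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	// Initialize the event when the request enters the service,
	// pre-populated with some useful context.
	ev := WideEvent{
		"service":    "checkout", // illustrative service name
		"request_id": r.Header.Get("X-Request-ID"),
		"user_id":    r.URL.Query().Get("user_id"), // high cardinality is fine
		"path":       r.URL.Path,
	}
	// Ship the single wide event when the request exits (or errors).
	defer func() {
		ev["duration_ms"] = time.Since(start).Milliseconds()
		ev.Send()
	}()

	// Throughout the request, toss in anything that might matter someday.
	ev["cart_items"] = 3                                // hypothetical field
	ev["db_raw_query"] = "SELECT * FROM carts LIMIT 1" // hypothetical field

	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point is just that everything known about the request travels together as one arbitrarily wide event, rather than as many unrelated metrics or log lines.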
And then you have
all this information that is all tied together.
The context is
what makes this incredibly powerful.
A metric has zero context.
So like when, you know,
over the course of your request in your service,
you might fire off like 20 or 30 different metrics, right?
Counters, gauges, whatever,
but those aren't tied to each other.
So you can't reason about them
as all of these things are connected to this one request.
This is so powerful
because so much of debugging is looking for
outliers, right? You want to know which of your requests failed, and then you want to look for
what they have in common. Was it that, you know, there is a, you know, some of the TCP statistics
were overflowing only on those? Or was it that those are the ones making a particular call to a
host or to a version of the software?
Like, just being able to slice and dice and figure that out at a glance is why our time to resolve these issues went from hours or days or God knows what to seconds or minutes, just repeatedly.
Because you can just ask the questions.
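As a rough illustration of that slice-and-dice workflow, here is a hedged Go sketch that takes a pile of wide events, keeps only the failures, and groups them by whatever field you name, however high its cardinality. The event shape and field names are made up for the example, and a real columnar store would do this server-side.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is one wide, per-request event as described above.
type Event map[string]interface{}

// breakdown counts error events grouped by any field you name,
// no matter how high its cardinality (host, build_id, user_id, ...).
func breakdown(events []Event, field string) map[string]int {
	counts := map[string]int{}
	for _, ev := range events {
		if status, ok := ev["status"].(int); !ok || status < 500 {
			continue // only keep the failures
		}
		counts[fmt.Sprint(ev[field])]++
	}
	return counts
}

func main() {
	events := []Event{
		{"status": 500, "host": "api-7", "build_id": "abc123"},
		{"status": 500, "host": "api-7", "build_id": "abc123"},
		{"status": 200, "host": "api-3", "build_id": "abc122"},
	}
	counts := breakdown(events, "host")

	// Print the groups with the most failures first, to see what the
	// failing requests have in common.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for _, k := range keys {
		fmt.Printf("%s: %d errors\n", k, counts[k])
	}
}
```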
So I would say that to summarize, the thing that makes
it powerful is the fact that you have all that
context, and you have a way to link all
of these numbers together, and the fact that
you can ask questions
no matter how high the cardinality is.
So you can combine them, right?
You want to look at the combination of this unique
request ID, this
query from this host
at this time, or whatever. And it's, it's like precision.
It sounds like what I normally do with, like, logs, like I have them all gathered somewhere
in Splunk or something and I'm searching for things.
It's much more, like... because logs are just, what, typically unstructured events? They're just strings, right?
And if you're structuring your logs,
then you're already way ahead of most people. If you're structuring your logs,
then I would say,
I would encourage you to structure them very widely.
Not to issue lots of log lines per request,
but to bundle all that stuff together
so that you get the additional power of having it all at once.
Otherwise, you kind of have to reconstitute it. Give me all the log
lines for this one request ID, and you have to do stuff. If you just pack them
together, it's much more convenient. And then that's basically what Honeycomb
is, plus the columnar store that we wrote in order to do the exploration. You can also think
of it like BI for systems.
BI for systems?
Business intelligence for systems. Because, like, you were talking in the beginning about
debugging with an ops team and a dashboard, right? And the ops person was just kind of, like, along
for the ride, filling in all of this intuition, all of this, you know, scar tissue, all of the, you know...
You weren't able to explore that information because it wasn't in a tool.
It was in someone else's brain.
This is why, like, the best debugger on the team
is almost always the person who's been there the longest, right?
Because they've amassed the most context
built up in their brains, which is,
I love being that hero.
Like, I love being the person who just gazes
at a dashboard and goes, it's red.
Like, I just feel it in my bones.
But it's not actually good for us as teams.
I can't take a vacation.
Nobody else can explore the data.
And I've now had the experience three times
where the best debugger on the team
was not the person who'd been there the longest.
This was at Parse, at Facebook, and at Honeycomb.
Because when you've taken all of that data about the system
and put it into a place where people can explore it with a tool,
then the best debugger is going to be the person
who's the most curious and persistent.
I like what you said about the intuition.
And I find that, like, you know,
I described that problem of debugging something.
And I know that there's a person on my team, John,
and I feel like he just has a really good model
of how the system works in his head.
Yeah, yeah.
The problem is that systems are getting too large
and too complicated and changing too quickly
and they're overflowing our heads across the board.
But what you just said there is another thing
that I'm so, so excited about,
which is our tools as developers,
they have not treated us like human beings.
They have treated us like automatons.
How many Vim sequences do you
know by heart? Way too many. I know way too
many. It's like this point of
pride, which is kind of stupid.
So the thing that we're
really passionate about, this is all just table stakes.
The stuff that we're really passionate about is building
for teams, looking for ways to bring everyone up to the level of the best debugger
or the person with the most context and most information about every corner of your systems
Right? Because, like, if I get paged about something and I'm like, uh, shit, this is about Cassandra, I
don't know fuck all about Cassandra, but Christine does. And didn't we have
an outage that was like five or six weeks ago, and I think she was on call then?
I'm just going to go look at what she did.
I want to like, what questions did she ask?
What did she think was meaningful enough to publish to Slack?
What got tagged as part of a postmortem?
What comments did she leave for herself?
You know, I just want to, because I learned Linux
by reading other people's bash history files
and just trying all the commands.
I love, you know, tapping into that sense of curiosity, almost
that snoopiness that we have.
When people are really good at their jobs, we just want to
go see how they do things.
I'm so excited
about tapping into the
social... Once we've gotten
the information out of our heads, then how do we help
people explore it? How do we make it fun? How do we make
it approachable?
And how do we make it so that
we forget less stuff? Because when I go to debug a gnarly problem, I'm going to do a deep dive and
I'm going to know everything about it for like a week. And then it starts to decay, right?
And ask me two or three months later, and I'm just, like, back to zero.
But if I can just have access to how I interacted with the system, what columns did I query?
What notes did I leave for myself?
What were the most useful things that I did?
And if I and my team can access that information,
then we've forgotten a lot less.
And that's nice.
I find we have a bunch of dashboards
that somebody has kindly made and painstakingly
put together, and they have helped me before, but not that much.
Yeah.
And fundamentally you're consuming very passively.
You're not actually interrogating the system.
You're not throwing a hypothesis or asking a question.
Um, and the best way to actually
get good at systems is to force yourself
to ask some questions.
To predict what the answer might
be.
Every time you look at someone else's
dashboard, or even your own dashboard from a past
outage, it's like an artifact.
You're not exploring it.
It's a very passive consumption.
And because it's so passive,
we often miss when
a data source isn't there
anymore. Or when
it's like the dog that didn't bark.
I can't even count the number of times that I've been...
There's a spike and I'm just looking
through my dashboards, looking for
the root cause and realizing that
oh, we forgot to
set up the graphing software on that one.
Or, oh, it stopped sending, you know, or just something like that.
Because you're not actually actively asking a question.
You're just kind of skimming with your eyeballs, just like scanning, eyes getting tired.
Agreed.
You said something, I was saying, I watched this talk of yours and you said something
about how we should be doing more testing in production
or something like that.
What does that mean?
I think what I'm trying to say is that we
do test in production, whether we want to
or not, whether we admit it or not.
Every config change,
even if you devote a lot of resources
to try and keep staging in sync with production,
assuming it's even possible with your security
conditions and blah blah blah, blah.
It's never exactly the same.
Your config files are different.
Every unique combination of
deploy plus the software you use to deploy
plus the environment you're deploying to plus the code
itself is unique.
There's literally no way, as anyone who's ever
typoed production knows,
there's some small
amount of it that is a test because you're doing it for the
first time. And I feel like most teams, because there's this whole, you can't test in production,
we don't do anything in production that isn't tested. They're just not admitting reality.
And that causes them to pour too many of their very scarce engineering cycles
into trying to make staging perfect. When those cycles would be better used
making guardrails for production,
investing in things like good canary deploys
that automatically roll back if bad
and promote if good.
That part of the industry is starved for resources.
And I think it's because we don't have
unlimited resources.
And the right place to take
it from is staging. I think because staging is just fragile and full of... It's just not a good
use of time. I think that... And I'm not saying we shouldn't test before production. Obviously,
we should run tests, but those are for your known unknowns. The known unknowns are not really the
hardest problems or even the most frequent problems that we have. It's all about these
unknown unknowns, which is a way, I think, of talking about this cliff that we're all going off.
You know, it used to be known unknowns. You'd get paged, you'd look at it, you'd kind of know what
it was, you'd go poke around and you'd solve it. Now it's like, when you get paged, you should
honestly just be like, uh, what is this? You
know, I haven't seen this before. This is new. Or you don't really know where to start. Partly
because of the sheer complexity and probably just because there are so many more possible outcomes
or possible root causes. You just need a different, you need to stress resiliency in the modern world, not perfection.
And I think that I'm sort of joking
and trying to push people's buttons
when I say I test in production,
but also sort of not.
I mean, it's for real.
Like that outage that Google Cloud Platform
just had last week,
what did they do?
It's a config change.
Worked great in staging.
They pushed it to prod,
took the whole thing down.
You can't test everything.
So you have to invest in catching things.
Failure should be boring, right?
That's why we test in prod.
And you could say experiment in prod.
I don't know, whatever.
But I think that like for the known unknowns,
you should test before production.
But there are so many things that you just can't.
And so we should be investing more
into tools that let us test.
And I think that a really key part of that has been observability.
We haven't actually, it's easy to ship things to production.
It's much harder to tell what impact it has had.
And that's why I feel like something like Honeycomb,
where you can just poke around, is necessary.
Like, I think that, I hope that we look back in a couple of years
at the bad old days
when we used to just ship code
and wait to get paged.
Like, how fucking crazy is that?
That's insane that we used to just like
wait for something bad to happen
at a high enough threshold to wake us up.
We should have the muscle memory as engineers
that if, like, what causes things to break?
Well, usually it's us changing something.
So whenever you change something in prod,
you should have the muscle memory to just go look
at whether what you expected to happen actually happened.
Did anything else obvious happen at the same time?
Like there's something so satisfying,
so much dopamine that you can get straight to your brain
just by going and looking and finding something
and fixing it before anyone notices or complains.
So we have like in the real world,
we have a fixed amount of resources.
And if we're trying to decide
like what percentage of effort should go towards
like recovering from production issues
and what should go towards preventing them?
Oh, this is such a personal question.
It's based entirely on your business case, right?
Like how much appetite do you have for failure?
It's going to be different for a bank than for, you know,
how old are you?
Who are your customers?
You know, startups have way more appetite for risk
than companies that are serving banks.
You know, it's very, very... there's no answer
that's exactly the same for any two companies, I think.
But it sounds like what you're saying to me is that we should put
a lot of effort into, into recovering from production issues.
Into resiliency. Yeah. Into early detection and mitigation. Recovery is an interesting word.
Often I think it's just understanding.
There are many changes you have to make.
Say you're rolling out a version of the code
that is going to increase the RAM footprint.
And it's not a leak, you know it,
but you don't actually know how much
because you run it in staging.
And again, you're not going to have the same kind of traffic, the same variance.
So you don't actually know.
So I'm arguing that you need to roll things out.
You need to have the tooling to make this a very mundane operation, right?
It should roll out to 10%, get promoted, run for a while, get promoted 20%, 30%,
and be able to watch it
so that you know if it's about to hit
an out of bounds or something.
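Here is a hedged sketch of that kind of canary guardrail in Go: roll out in stages, watch an error-rate signal, promote when healthy, roll back when not. The stage percentages, threshold, and fake error-rate source are assumptions standing in for real deploy tooling and real telemetry.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errorRate would query your observability tooling for the canary's
// current error rate; here it is faked with a random number.
func errorRate(percentRolledOut int) float64 {
	return rand.Float64() * 0.02
}

// canary walks the rollout through increasing percentages, promoting
// if the error rate stays in bounds and rolling back otherwise.
func canary(stages []int, maxErrorRate float64, soak time.Duration) error {
	for _, pct := range stages {
		fmt.Printf("rolling out to %d%%\n", pct)
		time.Sleep(soak) // let it run for a while before judging it

		if rate := errorRate(pct); rate > maxErrorRate {
			fmt.Printf("error rate %.3f over budget, rolling back\n", rate)
			return errors.New("canary failed, rolled back")
		}
		fmt.Printf("%d%% looks healthy, promoting\n", pct)
	}
	fmt.Println("rolled out to 100%")
	return nil
}

func main() {
	if err := canary([]int{10, 20, 50, 100}, 0.01, time.Second); err != nil {
		fmt.Println(err)
	}
}
```

The design point is that promotion and rollback are boring, automated decisions made against observed behavior, not judgment calls made after a page.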
Because I think it's important
actually, well I think it's just as a developer
it gives confidence when you can actually
just roll back. But not everything is
not everything can be rolled back I guess.
Yeah, especially
the closer you get to laying bits
down on disk,
the more things are roll-forward only.
Then you start to get sweaty palms.
I don't know.
It depends, but I've seen some
hair-raising database migrations.
Oh, God.
I come from databases.
I have done things with databases
that would turn your hair white.
So you mentioned earlier that you built your own database.
Oh, no, no, no.
I've spent my entire career telling people not to write a database.
So I'd like to be very clear on this point.
We have written a storage engine.
That's my story, and I'm sticking to it.
Tell me about your storage engine.
It's as dead simple as we could possibly make it.
It's a columnar store that is really freaking fast.
We target one second for the 95th percentile of all queries.
Why did you need your own data store?
Well, that's a great question.
Believe me, we tried everything
out there. So the operations
engineering community for 20 years has
been investing in time-series databases
built on metrics, right?
And we knew that this was just not
a data model that was going to enable
the kind of interactive,
fast
kind of
interaction that we wanted to support.
And furthermore, we knew that we wanted to have these really wide,
arbitrarily wide events.
And we knew that because we're dealing with unknown unknowns,
we knew that we didn't want to have any schemas.
Because anytime you have to think about what information you might want to
capture and fit it into a schema,
it introduces friction in a really bad way. And then you don't deal with indexes. You know, one of the problems
with every log tool is you have to pick which indexes to support; some of them even
charge by indexes, I think. But then if you need to ask a question about something that
isn't indexed, well, you're back to, oh, I'm going to go get coffee while I'm waiting for this
query to run, right? And then if you didn't ask the right question, you've got to go for another walk.
It's not interactive. It's not exploratory.
So we tried everything out there.
Druid came a little close, but it still didn't have the kind of richness.
Yeah, we knew what we wanted, and so we had to write it.
We wrote it as simply as possible.
We were using Golang.
It is descended from Scuba at Facebook, for sure.
Scuba was just like 10,000 lines of C++.
It was entirely in memory
because they didn't have SSDs when they wrote it.
And it shells out to rsync for replication.
It's janky as fuck.
But the architecture is nice.
It's distributed.
So there's a fan-out model where
a query comes in, fans out to
five nodes, does a column
scan on all five, aggregates,
pushes them back up, and if there's too much to
aggregate, then it fans out
again to another five nodes and
repeats. So it's very
scalable; we can handle very, very high
throughput just by adding more nodes.
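That fan-out shape maps naturally onto goroutines. Below is a hedged, in-memory sketch of the idea; the per-node column scan and the aggregation step are stand-ins for the real distributed storage engine, not its actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// node holds one shard's values for a single column.
type node struct{ column []float64 }

// scan is the per-node column scan: here, a partial sum and count.
func (n node) scan() (sum float64, count int) {
	for _, v := range n.column {
		sum += v
	}
	return sum, len(n.column)
}

// query fans out to every node concurrently, then aggregates the
// partial results back up into a single answer (a mean, in this sketch).
func query(nodes []node) float64 {
	var wg sync.WaitGroup
	results := make(chan [2]float64, len(nodes))

	for _, n := range nodes {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			s, c := n.scan()
			results <- [2]float64{s, float64(c)}
		}(n)
	}
	wg.Wait()
	close(results)

	var sum, count float64
	for r := range results {
		sum, count = sum+r[0], count+r[1]
	}
	return sum / count
}

func main() {
	nodes := []node{
		{column: []float64{12, 40, 7}},
		{column: []float64{90, 3}},
		{column: []float64{15, 22, 31, 5}},
	}
	fmt.Printf("mean duration_ms: %.2f\n", query(nodes))
}
```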
So you're saying it doesn't have any indexes or it indexes everything?
Well, columns are effectively indexes, right?
Yeah.
So everything is equally fast, basically.
It's sort of like index everything because everything's a column.
Yes.
Yeah.
And you can have arbitrarily wide events;
we use a file per column,
basically. So up to the
Linux open file handle limit,
which is just like 32k or something.
It becomes not tractable for
humans long before then.
I like this idea that there
is this very
janky tool at Facebook that changed the world.
Oh, they can't kill it. It's too useful, but it has been not invested in.
And so it is horribly hard to understand.
It's aggressively hostile to users.
It does everything it can to get you to go away, but people just can't let it go.
Do you think that, like, more people should kind of embrace the chaos and have more of a startup focus?
Yeah,
I do.
Yeah.
I did.
I thought you were going a different direction with that question,
but yes,
that too.
Which way did you think I was going?
Oh,
I thought you were going to ask if more people should build tools based on
events instead of metrics.
And yes, I truly, truly am. You're opening the door for me here. We've given talks
and we built our storage engine. As an industry,
we have to make the jump from
very limited...
The thing about metrics is also that
they are always looking at the aggregate.
And the older
they are, the less fine-grained they are, right?
That's how they drop data: by aggregating over time.
We drop data by sampling instead, because it is really, really powerful to have those raw events.
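To illustrate sampling instead of aggregating, here is a small hedged Go sketch: keep roughly one in N raw events and tag each kept event with its sample rate so totals can be re-weighted later. The rate and the field name are assumptions for the example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampler keeps roughly 1 in N raw events instead of pre-aggregating.
// Each kept event records its sample rate, so downstream tooling can
// weight it back up when computing totals (an assumption about usage).
type sampler struct{ rate int }

func (s sampler) keep(ev map[string]interface{}) bool {
	if rand.Intn(s.rate) != 0 {
		return false // dropped, but nothing was aggregated away
	}
	ev["sample_rate"] = s.rate // this event stands in for `rate` events
	return true
}

func main() {
	s := sampler{rate: 10}
	kept := 0
	for i := 0; i < 1000; i++ {
		ev := map[string]interface{}{"request_id": i}
		if s.keep(ev) {
			kept++
		}
	}
	fmt.Printf("kept %d of 1000 raw events at sample_rate=%d\n", kept, s.rate)
}
```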
This is a shift that I think the entire industry has to make.
And I almost don't care if it's us or not.
That's a lie.
I totally care if it's us or not.
But there needs to be more of us, right?
This needs to be a shift that the entire industry makes because it's the only way to understand these systems. It's the only way I've ever seen. We should talk about tracing real quick.
Because tracing is just a different way of visualizing events. Tracing is the only other
thing that I know of that is oriented around events. Oh, what I was starting to say was that
metrics are great for describing the health of the system, right?
But they don't
tell you anything about the event because they're not fine-grained
and they lack the context. And as a developer,
we don't care about
the health of the system. If it's up and
serving my code, I don't give
a shit about it. What I care about
is every request, every event, and I care
about all of the details from the perspective
of that event, right?
And we spend so much time trying to work backwards
from these metrics to the questions
we actually want to ask.
And that bridge right there
is what is being filled by all of this intuition,
you know, and jumping around between tools,
like jumping from the metrics
and aggregates for the system
and jumping into your logs
and trying to grep for the string
that you think might, you know,
shed some light on it.
Everything becomes so much easier when you can just ask questions
from the perspective of the event.
Tracing is interesting because tracing
is just like Honeycomb, except for
it's depth-first, is how I think of it.
Honeycomb is breadth-first, where you're slicing
and dicing between events,
trying to isolate the ones that
have characteristics that you're looking for.
And tracing is about, okay, now I've found one of those events.
Now tell me everything about it from start to finish.
And we just released our tracing product.
And what's really freaking cool about it is you can go back and forth, right?
You can start with, all right, I don't know the question.
All I have is a vague problem report.
So I'm going to go try and find something, find an outlier,
find an error that matches
this user ID query, whatever.
Okay, cool. I found it. Now
show me a trace. Trace everything
that hits this hop or this
query or whatever. And then once you
have been like, oh, cool.
I found one. Then you can zoom back
out and go, okay, now show me everyone else
who was affected by this. Show me everyone else
who has experienced this?
We've been debugging our own storage engine this way
for about three or four months now.
It is mind-blowing just how easy
it makes problems.
Yeah, that sounds powerful for sure.
I guess we're kind of getting the tools
back that we lost when we
split up into a million different services
in some ways.
Yeah, totally. It's kind of like distributed GDB.
So I don't talk to too many ops people on this podcast.
I wanted to ask you,
what do you think developers have to learn
from an ops culture or mindset?
Oh, that's such a great question.
First of all, I heard you say that you're on call
and you get paged before the ops person.
Well, bless you.
This is the model of the future.
And I want to make it clear that I don't want to...
Ops has had a problem with masochism
for as long as I've been alive.
And the point of putting software engineers on call is not to invite them
into masochism with us. It's to raise our standards for everyone and the amount of sleep that we
should expect. I think I feel very strongly about this. The only way to write and support good code
is by shortening those feedback loops and putting the same people who write software on call for it.
So it's just necessary.
In the glorious future, which is here for many of us, we are all distributed systems engineers.
And one thing about distributed systems is that it has a very high operational cost, right?
Which is why software engineers are now having to learn ops. I'll often say that I feel like the first wave of DevOps
was all about yelling at ops people to learn how to write code.
And we did. Cool. We did.
And I feel like now for just the last year or two,
the pendulum's been swinging the other way.
And now it's all about, okay, software engineers, it's your turn.
It's your turn to learn to build operable services.
It's your turn to learn to instrument really well,
to make the systems explain themselves back to you. It's your turn to pick up the ownership
side of the software that you've been developing. And I think that this is great. I think this is
better for everyone. It is not saying that everyone needs to be equally expert in every area.
It's just saying that what we have learned about shipping good software is that
everyone should write code
and everyone should support
code. Something like 70-80% of our
time is spent maintaining and extending
and debugging, not greenfield
development, which is
fundamentally, software engineers
do more ops than software engineering.
So I think it makes
sense to acknowledge that.
I think it makes sense to reward people for that.
I am a big proponent of
you should never make someone a senior engineer.
Don't promote them if they don't know
how to operate their services,
if they don't show good operational hygiene.
You have to show that this is what you value in an org,
not what you kick around and deprioritize.
And people pay attention to signals like promotions and pay grades and who thinks they're too good for what work.
Definitely.
I think there is an ops culture.
Maybe it's just my perception.
You mentioned masochism.
I don't know where the causation and correlation go, or whether developers are going to become more like that. But there's a certain attitude sometimes, and I don't think there's anything wrong with this, of, you know, call me when something's on fire. That's when I'm alive, right? When things are breaking.
Yeah, believe me, I'm one of those people. I love it. I'm the person you want to call in a crisis. If the database is down and we're not sure if it's ever coming up again and the company might be screwed, I am the person that you want at your side. And I've spent my career working myself out of a job, repeatedly. I guess that's why I'm a startup CEO now. But that aside, you can both enjoy something and recognize that too much of it is not good for you.
I enjoy drinking.
But I do
try to be
responsible about it.
Yeah, I don't know.
I think that
the things that you call out and praise in your
culture are the things that are going to get repeated. And if you praise people for firefighting
and heroics, you're going to get more of that. And if you treat it as an embarrassing episode
that we're, you know, yeah, glad we got through it together. We privately thank people or whatever,
but you don't call it out and praise it. And you make clear that this is not something you value,
that it was an accident and you take it seriously, you know,
and you give people enough time to execute on all the tasks that came out of
the postmortem instead of, you know, having a retrospective,
coming up with all this shit and then, like, deprioritizing it
and going on to feature work.
That doesn't say, yes, we value your time
and we don't want to see more firefighting.
I think that these organizational things are really the
responsibility of any senior management and senior engineers.
It's a tricky problem.
I wanted to ask you, you are now CEO. Do you
still get to work as an individual contributor? Do you still get to
fight fires
and get down in the trenches?
I'm not. Well, I am fighting fires, but not of the technical
variety. I wanted to be CTO, that's what I was shooting for, but circumstances... I don't know.
I mean, I believe in this mission. I've seen it change people's lives. I've seen it make healthier teams
and I am going to see it through.
I really miss sitting down
in front of a terminal every morning.
I really, really, really do.
But I've always been highly motivated
by what has to be done.
I don't play with technology for fun.
I get in the morning
and I look at what needs to be done.
So I guess this is just another variation on that. This is what needs to be done. I spent a year trying to get
someone else to be CEO. I'm done. I can't find someone. That's fine. I'm in it for now. I'll
just take it as far as I can.
It's a very pragmatic approach. I always worry, like, you know, that if they take my
text editor away from me, like, I'll never get it back, just because I've seen it happen
to other people, for sure.
I've written a blog post about the engineer-manager pendulum, because I believe that the best technologists that I've ever gotten to work with were people who had gone back and forth a
couple of times. Because the best tech leads are the ones who have spent time in management.
They're the ones with the empathy and the knowledge for how to motivate people and how to
connect the business to technology and explain it to people in a way that motivates them.
And the best managers I've ever had, line managers, were never more
than two or three years removed from writing
code, doing hands-on work themselves.
I feel like it's a real shame
that it's often a one-way path.
I think it doesn't have to be if we're assertive about
knowing that what we want is to
go back and forth. Certainly what I hope for
for myself.
There doesn't seem to be a lot of precedent
for switching back and forth
There isn't. But since I wrote that piece, I still get contacted by people every day
just saying, thank you, this is what I wanted, I didn't know it was possible, I'm totally going to
do it now. I actually wrote it for a friend of mine at Slack who was considering going through
that transition. I was just like, yeah, you should do it. And I wrote the post for him, and he went back to being an IC, and he is so much happier.
So he went back to being a contributor rather than a management role?
Yeah, he was a director.
And he's having an immense amount of impact in his senior IC role
because he's been there for so long.
He knows everything.
He can do these really great industry moving projects.
Oh, that's awesome. How are you doing for time? Do you have to run?
Um, I don't know. Let me see. Oh no, I can leave you my slate.
So do you like dashboards or not?
Um, I think that some dashboards are inevitable.
Like you would need a couple of just top-level...
All right, I think a couple are inevitable,
but they're not a debugging tool, right?
They're a state-of-the-world tool.
As soon as you have a question about it,
you want to jump in and explore and ask questions.
And I don't call that a dashboard.
Some people do.
But I think it's too confusing.
Interactive dashboards are fine.
But you do have to ask that question.
Ask those questions.
You need to support, you know, what about this?
What about for that user?
What about, you know, I don't care what you call it as long as you can do that.
I do.
I also think that, like, a huge pathology right now in these complex systems is that we're overpaging ourselves.
And we're overpaging ourselves because we don't actually trust our tools to let us ask any question and isolate the source of the problem quickly.
So we rely on these clusters of alerts to give us a clue as to what the source is. And if you actually have a tool with this data that you trust,
I think that the only paging alerts you really need are
requests per second, errors, latency, maybe saturation.
And then you just need a dashboard that at a top level,
at a high level, shows you that.
And then whenever you actually want to dig in and understand something,
then you jump into more of a debugging framework.
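As a rough sketch of paging only on those top-level signals, here is a hedged Go example; the threshold values are illustrative placeholders, not recommendations from the interview.

```go
package main

import "fmt"

// Signals are the only things worth paging on, per the discussion above:
// request rate, errors, latency, and maybe saturation.
type Signals struct {
	RequestsPerSec float64
	ErrorRate      float64 // fraction of requests erroring
	P99LatencyMS   float64
	Saturation     float64 // fraction of capacity in use
}

// shouldPage applies coarse, top-level thresholds (illustrative values);
// everything finer-grained belongs in an exploratory debugging tool,
// not in a pager.
func shouldPage(s Signals) (bool, string) {
	switch {
	case s.ErrorRate > 0.01:
		return true, "error rate over 1%"
	case s.P99LatencyMS > 2000:
		return true, "p99 latency over 2s"
	case s.Saturation > 0.9:
		return true, "saturation over 90%"
	case s.RequestsPerSec < 1:
		return true, "traffic fell off a cliff"
	}
	return false, ""
}

func main() {
	page, why := shouldPage(Signals{RequestsPerSec: 350, ErrorRate: 0.002, P99LatencyMS: 180, Saturation: 0.4})
	fmt.Println(page, why)
}
```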
So these issues you talked about before, like a specific user,
like you would never get paged for that.
How would that come to your attention?
That is a great question.
That is a great point.
So many of the problems that show up in these systems will never show up in your alerts or else you're over-alerting.
Because they're localized. This is another thing that's different
about the systems that we have now versus the old style of systems.
It used to be that everyone shared the same pools.
They shared a tier for the web, for the app, for the database.
And so they all had roughly the same experience. With these new
distributed systems,
say you had a 99.5%
reliability in your old system.
That meant that everyone's erroring
0.5% of the time.
On the new systems,
it more likely means that
the system is 100%
up for almost everyone.
But everyone whose last name starts with S-H-A
who happens to be on this shard,
they're 100% down, right?
You get the same,
like if you're just getting the top level percentages,
your paging alerts are not going to be reflective
of the actual experience of your users.
So then you're like, well, okay,
you can generate alerts for every single combination and blah, blah,
blah, blah, blah. And then you're just going to have flappy alerts
over time.
Honestly,
a lot
of the problems that we are going to see are going to
come to us through support or through
users reporting problems.
And over time, as you interact with the system, you'll learn
what the most common high signals
are. Maybe you'll want to have an end-to-end check
that traverses every shard, right?
Hits every shard or something like that.
It's different for every architectural type.
But I don't remember what the question was.
Oh, I was just talking about the difference in systems.
Yeah, you can't...
There are so many ways that systems will break
but only affect a few people now.
So it makes the high cardinality questions
even more important.
And like you were mentioning,
developers should be able to operate their systems.
I think, actually,
developers should spend time doing support.
It's horrible.
It's not fun.
Oh, God, yes.
No, but it really builds empathy
for your users.
Yeah. And so the issue, like what you said, users with the last name S-H, like,
that'll come in as a support ticket. And if I'm busy and I'm a
developer, I'll be like, that doesn't make sense, are you sure? But if I actually
have to, if I'm the one who has to deal with this ticket, you know?
Yeah. No, totally. Yeah. We're big fans of, you know,
having everyone rotate through, you know, on call,
rotate through support triaging. It doesn't even have to be that often,
you know, maybe once a quarter or so is enough to keep you very grounded.
It's like an empathy factor, I think.
It really is.
Yeah.
And one of the hardest things,
one of the things that separates
good senior engineers from me
is that they know how to spend their time
on things that have the most impact, right?
Business impact.
Well, what does that mean?
Well, often it means things
that actually materially affect
your user's experience.
And there's no better way than just having to be on a support rotation.
Because if you don't,
if you aren't feeding your intuition with the right inputs,
you're going,
your sense of what has the most impact is going to be off,
right?
I like to think of the intuition as being something you have to kind of
cultivate with the right experiences, and the right
shared experiences, right? You want a team to kind of have the same idea of what makes important things important.
As a team... like, I feel like there's healthy teams and unhealthy teams. But I mean,
some teams really gel, and I always feel like the ops people tend to be more cohesive than other groups.
I think so too.
A lot of it is because of...
It's like the band of brothers effect, right?
You go to war together.
You have each other's backs.
Getting woken up in the middle of the night.
There's just a...
Every team...
Every place I've ever worked,
the ops team has been the one
that just has the most identity, I guess.
The most character and identity, the most in-jokes, usually
very ghoulish, graveyard humor.
But
I think that
the impact of a good
on-call rotation is that there is this
sense of shared sacrifice.
And I would liken that to salt in
food. A teaspoon makes
your meal amazing.
A cup of it means that you're crying, you know?
Like a teaspoon of shared sacrifice
really pulls a team together.
Yeah, I can see that.
You don't want it to be like
the person can't sleep at night.
No, no.
But, like, if one of the people on your team has a baby, then
everybody just immediately volunteers, because they're not going to let them get woken up by the
baby and the pager. They're just going to fill in for them for the next year, you know?
That type of thing, like lowering the barrier, should just be assumed: that you want to
have each other's backs, that nobody should be too impacted. You know, as an ops manager, whenever somebody got paged in the middle of the night,
I would encourage them not to come in, or to sleep in,
or I would take the pager for them for the next night or something like that.
Just looking out for each other's welfare and well-being is the thing that binds people, I think.
Definitely.
Well, it's been great to talk to you.
Was there anything else? I don't know. I liked your
controversial statements, about, you know, fuck metrics. What else you got?
Metrics and dashboards can all die in a fire. And every software engineer should be on call. Boom.
All right, there's the title for the episode.
There you go. I'm going to make a lot of friends here.
All right, that's the show. Thank you for listening to the CoRecursive podcast.
I'm Adam Bell, your host. If you like the show, please tell a friend.