CoRecursive: Coding Stories - Tech Talk: Test in Production and being On-Call with Charity Majors

Episode Date: August 31, 2018

Tech Talks are in-depth technical discussions. "Metrics and Dashboards can die in a fire and every software engineer should be on-call" - Charity Majors Today's interview is with Charity Majors. We talk about how to make it easier to debug production issues in today's world of complicated distributed systems. A warning: there is some explicit language in this interview. I originally saw a talk by Charity where she said something like fuck your metrics and dashboards, you should test in production more. It was a pretty hyperbolic statement, but backed up with a lot of great insights. I think you'll find this interview similarly insightful. Charity and her company are probably best known for popularizing the concept that observability is the key to being able to debug issues in production. Also, if you are a talented developer with functional programming skills, I've got a job offer for you. My employer Tenable is hiring. Tenable is a pretty great place to work. Here is a job link. Show notes: Facebook Scuba Observability Talk, the-engineer-manager-pendulum, HoneyComb.io

Transcript
Starting point is 00:00:00 Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. Metrics and dashboards can all die in a fire, and every software engineer should be on call. Hey, today's interview is with Charity Majors. We talk about how to make it easier to debug production issues in today's world of complicated distributed systems. A warning: there is some explicit language in this interview. I originally saw Charity, uh, give a talk where she said something like,
Starting point is 00:00:45 fuck your metrics and dashboards, you should just test in production more. It was a pretty hyperbolic statement, but she ended up backing it up with a lot of great insights. And I think you'll find this interview similarly insightful. If you are a talented developer with functional programming skills, I've got a job offer for you. My employer, Tenable, is hiring. Hit me up via email to find out more. I'll also drop a link in the show notes. Tenable is a pretty great place to work, if you ask me.
Starting point is 00:01:19 Okay, I'm hearing a little. Can you actually hear me? Oh, I can hear you now. Oh, really? Yeah. Okay, let's call this the beginning. Cool. Charity, you are the CEO of, uh, Honeycomb.io. Accidental CEO. Accidental CEO. Well, thanks for joining me on the podcast. Yeah, my pleasure. It's really nice to be here. Thanks. So I used to, I used to be able to debug production issues.
Starting point is 00:01:48 Like something would go wrong, some ops person would come and get me, and then we'd look at things and we'd find out whatever. There's some query that's running on this database that's just causing a bunch of issues, and we'll knock it out, or, okay, we need to turn off this feature and add some caching in front of it. And, you know, I always felt like a hero. It mostly works. Yeah, yeah. And now, um, now I've woken up into this dark future where, first of all, now, like, I get paged before the ops person sometimes. And then, uh, like, things are just crazy complicated. There's like more databases than people, it seems. And, like, every, every product that Amazon...
Starting point is 00:02:31 10 microservices per developer. Yeah. Yeah. So, uh, that's why, that's why I wanted to have you on, because I feel like, maybe, I'm hoping that maybe you have an answer, uh, for all this mess. Oh yeah. Oh, yeah. Well, I do. The answer is everything is getting harder.
Starting point is 00:02:49 And we need to approach this not as an afterthought as an industry, but as something that we invest in, that we expect to spend money on, that we expect to spend time on. And we have to upgrade our tools. Like the model that you described, where you have your trusty, rusty ops buddy who kind of guides you through the subterranean passages,
Starting point is 00:03:10 but also our systems used to be tractable. You could have a dashboard, you could glance at it, you could pretty much know in a glance where the problem was, if not what the problem was, and you could go fix it, right? Whether it's launching GDB or looking for a spike or, like, pairing with somebody with a bunch of infrastructure knowledge, query sniffing, whatever.
Starting point is 00:03:32 Finding the component that was at fault was easy. And so you just needed localized knowledge in that code base or technology or whatever. As you mentioned, this basically doesn't work anymore for any moderately complex system. The systems often loop back into themselves. They're platforms. Like, when you're a platform, you're inviting
Starting point is 00:03:54 all your users' chaos to come live on your servers with you, and you just have to make it work and make sure it doesn't hurt any of your other users. Complex co-tenancy problems like that. There's ripple effects. There's thundering herds. There's God knows how many programming languages and how many storage systems.
Starting point is 00:04:11 And databases, don't even get me started. I come from databases, right? So I am, yeah. Anyway, the way I have been thinking of it is like we're just kind of hitting, everyone is hitting a cliff where suddenly, and it's pretty sudden, all of your tools and your tactics that have gotten you to this point no longer work.
Starting point is 00:04:33 And so this was exactly what happened to us. So my co-founder, Christine and I are from Parse, which was the mobile backend as a service acquired by Facebook. And I was there. I was the first infrastructure engineer. And Parse is a beautiful product.
Starting point is 00:04:48 We just made promises. You can write, this is the best way to build a mobile app. You don't need to worry about the backend. You don't need to worry about the storage model or anything. We make it work for you. It's magic. Which you can translate as a lot of gracious work.
Starting point is 00:05:02 And around the time we got acquired by Facebook, I think we were serving about 60,000 mobile developers, which is not trivial. And this is also when I was coming to realize, with
Starting point is 00:05:13 dawning horror that we had built a system that was effectively undebuggable by some of the best engineers in the world.
Starting point is 00:05:19 Like, both of our backend teams were spending like all of our time tracking down one-offs, which is the kiss of death if you're a platform. So they'd be like, Parse is down, every day.
Starting point is 00:05:28 And I'd be like, Parse is not down, dude. Behold, my wall is full of dashboards. They're all green. Must be your Wi-Fi. Arguing with your users is always a great strategy. But I'd dispatch an engineer. I'd go try and figure out what was wrong. It could be anything.
Starting point is 00:05:44 We let them write their own queries and upload them. We just had to make them work. We let them write their own JavaScript and upload it. We just had to make it work. So we could spend more time than we had just tracking down these one-offs, and it was just failing. I'll fast-forward through all the things I tried. The one thing that finally helped us make a dent, helped us get ahead of our problems, was this janky-ass
Starting point is 00:06:08 unloved tool at Facebook called Scuba that they had used to debug their MySQL databases a few years ago. It's aggressively hostile to users. It just lets you slice and dice on any dimensions in basically real time, and they can all be high-cardinality fields, and
Starting point is 00:06:24 this didn't mean anything to me, so I was, um, but like, we got, we got to handle our shit, and then I moved on, right? Because I'm awesome like that, onto the next fire. And it wasn't until I was leaving Facebook that I kind of went, wait a minute, I no longer know how to engineer without the stuff that we've built and Scuba. Why is that? Like, how did it worm its way into my soul to the point where I'm like, this is how I understand what's happening in my production systems? It's like
Starting point is 00:06:52 getting glasses and then being told you can't have them anymore. How am I even going to know how to navigate in the world? We've been thinking about this for a couple years now, and I'll pause for breath here in a second. But I don't want to say that, like, Honeycomb is the only way to do this. Honeycomb is the result of all of this trauma
Starting point is 00:07:10 that we have endured, like, when our systems hit this cliff of complexity. And we really thought at first it was just platforms that were going to hit this, and then we realized, no, everyone's hitting it, because it's a function of the complexity of the systems. You can't hold it in your brain anymore. You have to reason about it by putting it in a tool where you and others can navigate the information. So how did Scuba... Is that what it was called? Scuba? Yeah.
Starting point is 00:07:34 So what did it... What did it consume? Structured data. It's agnostic about it. Mostly logs. But it was just the fact that it was fast. There was no, like, you know, having to construct a query and walk away to get coffee and come back, you know. Because when you're debugging, it has to be, you're asking lots of small questions as quickly as you can,
Starting point is 00:07:55 right? You're following cookie crumbs instead of crafting one big question that you know will give you the answer, because you don't know it's going to be the answer. You don't even know what the question is, right? Um, also, high cardinality. When I say that, I mean, imagine you have a table with 100 million users. High-cardinality fields are going to be, the highest will be anything that's a unique ID, social security number.
Starting point is 00:08:16 Very high cardinality would be last name and first name. Very low would be gender, and species, I assume, is the lowest of all. The reason, I was laughing when you said, fuck metrics. I've said that many times. The reason that I hate metrics so much, and this is what
Starting point is 00:08:31 20 years of operations software is built on, is the metric, right? Well, the metric is just a number. Then you can append tags to it to help you group them. You're limited in cardinality to the number of tags you can have, generally, which is 100 to 300. So you can't have more than 300 unique IDs to group those by, which is incredibly limiting.
Starting point is 00:08:51 Some newer things like Prometheus, like you put key value pairs in there, which is a little bit better, but bottom line, it's very limited, and you have to think, you have to structure your question, your data just right. Like all the advice you can get online about how to try not to have these problems, which, when you think about it, is stupid, question, your data just right. All the advice you can get online about how to try not to have these problems, which
Starting point is 00:09:05 because all of the interesting information that you're ever going to care about is high cardinality. Request ID, raw query, you know? You just need this information so desperately. And so that
Starting point is 00:09:21 I think was the killer thing for Scuba. It was the first time I'd ever gotten to work with a data system that just let you have arbitrary... So imagine a common thing that companies will do as they grow is, well, they have some big customers who are more important to them. So they pre-generate all the dashboards for those customers because they don't have the ability to just break down by that 1 in 10 million user IDs,
Starting point is 00:09:46 To make sure I understand it. So, like, with metrics, I have Datadog, right? And I have a Datadog metric, and basically I'll measure, like, this request on my microservice or whatever, like, how long does this normally take, right? So it has, um, just the time that it takes from start to end, and I can put it on a graph or whatever. So high cardinality, if I understand it, is saying,
Starting point is 00:10:21 like, let's, let's not just count this single number. Let's count everything. What's the user that requested it? It's more like, so every metric is a number that is devoid of context, right? It's just a number with some text. But the way that Scuba and Honeycomb work is we work on events,
Starting point is 00:10:39 arbitrarily wide events. You can have hundreds; well-instrumented services usually have 300 to 500 dimensions. So all of that data for the request is in one event. The way we typically will instrument is that you initialize an event
Starting point is 00:10:53 when the request enters the service. We pre-populate with some useful stuff, but then throughout, while the request is being served, you just toss in whatever you think might possibly be interesting someday. Any IDs, any shopping cart information, any raw queries, any normalized queries, any timing information,
Starting point is 00:11:09 every hop to any other microservice. And then when the request is going to exit or error, you just ship that event off to us or to Scuba. And then you have all this information that is all tied together. The context is what makes this incredibly powerful.
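To make that instrumentation pattern concrete, here is a minimal sketch in Go of "initialize an event on entry, enrich it while serving, ship it on exit." This is not the Honeycomb SDK; the Event type, the emit function, and every field name are hypothetical stand-ins:

```go
// Sketch of per-request "wide event" instrumentation, under the assumptions above.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Event is one arbitrarily wide bag of key/value context for a single request.
type Event map[string]interface{}

// withWideEvent initializes one event when the request enters the service,
// lets the handler add fields while it is served, and ships it on exit.
func withWideEvent(next func(http.ResponseWriter, *http.Request, Event)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ev := Event{ // pre-populate some useful context
			"method":     r.Method,
			"path":       r.URL.Path,
			"remote_ip":  r.RemoteAddr,
			"start_time": time.Now().Format(time.RFC3339Nano),
		}
		start := time.Now()
		next(w, r, ev) // handler tosses in whatever might be interesting someday
		ev["duration_ms"] = time.Since(start).Milliseconds()
		emit(ev) // ship the single event as the request exits
	}
}

// emit would send the event to a backend; here it just logs one JSON line.
func emit(ev Event) {
	b, _ := json.Marshal(ev)
	log.Println(string(b))
}

func main() {
	http.HandleFunc("/cart", withWideEvent(func(w http.ResponseWriter, r *http.Request, ev Event) {
		ev["user_id"] = r.URL.Query().Get("user") // high-cardinality ID
		ev["cart_items"] = 3                      // hypothetical business context
		ev["raw_query"] = r.URL.RawQuery
		w.Write([]byte("ok"))
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note that one wide, structured event per request is also exactly the "structure your logs very widely" advice that comes up a little later in the conversation.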
Starting point is 00:11:25 A metric has zero context. So like when, you know, over the course of your request in your service, you might fire off like 20 or 30 different metrics, right? Counters, gauges, whatever, but those aren't tied to each other. So you can't reason about them as all of these things are connected to this one request.
Starting point is 00:11:41 This is so powerful because so much of debugging is looking for outliers, right? You want to know which of your requests failed, and then you want to look for what they have in common. Was it that, you know, some of the TCP statistics were overflowing only on those? Or was it that those are the ones making a particular call to a host or to a version of the software? Like, just being able to slice and dice and figure that out at a glance is why our time to resolve these issues went from like hours or days or God knows, to like seconds, just seconds or minutes, just repeatedly. Because you can just ask the questions.
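A toy version of that "filter to the failures, then see what they share" move, with fabricated events. In a real tool the slicing is interactive, but the idea reduces to counting value frequencies per field among the failing requests:

```go
// Find what the failed requests have in common: for every field, count how
// often each value appears among errors. All data here is made up.
package main

import "fmt"

type Event map[string]string

func main() {
	events := []Event{
		{"status": "500", "host": "api-3", "version": "v2.1", "user_id": "u17"},
		{"status": "200", "host": "api-1", "version": "v2.0", "user_id": "u42"},
		{"status": "500", "host": "api-3", "version": "v2.1", "user_id": "u99"},
		{"status": "200", "host": "api-2", "version": "v2.0", "user_id": "u17"},
		{"status": "500", "host": "api-3", "version": "v2.0", "user_id": "u03"},
	}

	// field -> value -> count, over failing events only
	common := map[string]map[string]int{}
	for _, ev := range events {
		if ev["status"] != "500" {
			continue
		}
		for field, value := range ev {
			if common[field] == nil {
				common[field] = map[string]int{}
			}
			common[field][value]++
		}
	}

	// Every failure shares host=api-3: that's the outlier to chase.
	for field, values := range common {
		for value, n := range values {
			fmt.Printf("%s=%s appeared in %d of the failures\n", field, value, n)
		}
	}
}
```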
Starting point is 00:12:23 So I would say that to summarize, the thing that makes it powerful is the fact that you have all that context, and you have a way to link all of these numbers together, and the fact that you can ask questions no matter how high the cardinality is. So you can combine them, right? You want to look at the combination of this unique
Starting point is 00:12:39 request ID, this query from this host at this time, or whatever. And it's, it's like precision. It sounds like what I normally do with, like, logs. Like, I have them all gathered somewhere in Splunk or something, and I'm searching for things. It's much more like, because logs are just, what, typically unstructured events. They're just strings, right? And if you're structuring your logs, then you're already way ahead of most people. If you're structuring your logs,
Starting point is 00:13:11 then I would say, I would encourage you to structure them very widely. Not to issue lots of log lines per request, but to bundle all that stuff together so that you get the additional power of having it all at once. Otherwise, you kind of have to reconstitute it. Give me all the log lines for this one request ID, and you have to do stuff. If you just pack them together, it's much more convenient. And then that's basically what Honeycomb
Starting point is 00:13:35 is, plus the columnar store that we wrote in order to do the exploration. You can also think of it like BI for systems. BI for systems? BI, business intelligence, for systems. Like, because, like, you were talking in the beginning about debugging with an ops team and a dashboard, right? And the ops person was just kind of, like, along for the ride, filling in all of this intuition, all of this, you know, past scar tissue. You weren't able to explore that information, because it wasn't in a tool.
Starting point is 00:14:08 It was in someone else's brain. This is why, like, the best debugger on the team is almost always the person who's been there the longest, right? Because they've amassed the most context built up in their brains, which is, I love being that hero. Like, I love being the person who just gazes at a dashboard and goes, it's red.
Starting point is 00:14:24 Like, I just feel it in my bones. But it's not actually good for us as teams. I can't take a vacation. Nobody else can explore the data. And I've now had the experience three times where the best debugger on the team was not the person who'd been there the longest. This was at Parse, at Facebook, and at Honeycomb.
Starting point is 00:14:42 Because when you've taken all of that data about the system and put it into a place where people can explore it with a tool, then the best debugger is going to be the person who's the most curious and persistent. I like what you said about the intuition. And I find that, like, you know, I described that problem of debugging something. And I know that there's a person on my team, John,
Starting point is 00:15:04 and I feel like he just has a really good model of how the system works in his head. Yeah, yeah. The problem is that systems are getting too large and too complicated and changing too quickly and they're overflowing our heads across the board. But what you just said there is another thing that I'm so, so excited about,
Starting point is 00:15:22 which is our tools as developers, they have not treated us like human beings. They have treated us like automatons. How many Vim sequences do you know by heart? Way too many. I know way too many. It's like this point of pride, which is kind of stupid. So the thing that we're
Starting point is 00:15:40 really passionate about, this is all just table stakes. The stuff that we're really passionate about is building for teams, looking for ways to bring everyone up to the level of the best debugger or the person with the most context and most information about every corner of your systems, right? Because, like, if I get paged about something and I'm like, uh, shit, this is about Cassandra, I don't know fuck-all about Cassandra. Um, but Christine does. And, like, didn't we have an outage that was like five or six weeks ago, and I think she was on call then? I'm just going to go look at what she did.
Starting point is 00:16:07 I want to like, what questions did she ask? What did she think was meaningful enough to publish to Slack? What got tagged as part of a postmortem? What comments did she leave for herself? You know, I just want to, because I learned Linux by reading other people's bash history files and just trying all the commands. I love, you know, tapping into that sense of curiosity, almost
Starting point is 00:16:25 that snoopiness that we have. When people are really good at their jobs, we just want to go see how they do things. I'm so excited about tapping into the social... Once we've gotten the information out of our heads, then how do we help people explore it? How do we make it fun? How do we make
Starting point is 00:16:41 it approachable? And how do we make it so that we forget less stuff? Because when I go to debug a gnarly problem, I'm going to do a deep dive, and I'm going to know everything about it for like a week. And then it starts to decay, right? And ask me two or three months later, and I'm just, like, back to zero. But if I can just have access to how I interacted with the system, what columns did I query? What notes did I leave for myself? What were the most useful things that I did?
Starting point is 00:17:10 And if I and my team can access that information, then we've forgotten a lot less. And that's nice. I find we have a bunch of dashboards that somebody has kindly made and painstakingly put together, and, um, they have helped me before, but not that much. Yeah. And fundamentally, you're consuming very passively.
Starting point is 00:17:37 You're not actually interrogating the system. You're not throwing a hypothesis or asking a question. Um, and the best way to actually get good at systems is to force yourself to ask some questions. To predict what the answer might be. Every time you look at someone else's
Starting point is 00:17:56 dashboard, or even your own dashboard from a past outage, it's like an artifact. You're not exploring it. It's a very passive consumption. And because it's so passive, we often miss when a data source isn't there anymore. Or when
Starting point is 00:18:13 it's like the dog that didn't bark. I can't even count the number of times that I've been... There's a spike and I'm just looking through my dashboards, looking for the root cause and realizing that oh, we forgot to set up the graphing software on that one. Or, oh, it stopped sending, you know, or just something like that.
Starting point is 00:18:30 Because you're not actually actively asking a question. You're just kind of skimming with your eyeballs, just like scanning, eyes getting tired. Agreed. So, I watched this talk of yours, and you said something about how we should be doing more testing in production, or something like that. What does that mean? I think what I'm trying to say is that we
Starting point is 00:18:52 do test in production, whether we want to or not, whether we admit it or not. Every config change, even if you devote a lot of resources to try and keep staging in sync with production, assuming it's even possible with your security conditions and blah blah blah, blah. It's never exactly the same.
Starting point is 00:19:08 Your config files are different. Every unique combination of deploy plus the software you use to deploy plus the environment you're deploying to plus the code itself is unique. There's literally no way, as anyone who's ever typoed production knows, there's some small
Starting point is 00:19:23 amount of it that is a test because you're doing it for the first time. And I feel like most teams, because there's this whole, you can't test in production, we don't do anything in production that isn't tested. They're just not admitting reality. And that causes them to pour too many of their very scarce engineering cycles into trying to make staging perfect. When those cycles would be better used making guardrails for production, investing in things like good canary deploys that automatically roll back if bad
Starting point is 00:19:56 if bad and promote if good. That part of the industry is starved for resources. And I think it's because we don't have unlimited resources, and the right place to take it from is staging. I think because staging is just fragile and full of... It's just not a good use of time. I think that... And I'm not saying we shouldn't test before production. Obviously, we should run tests, but those are for your known unknowns. In the future, known unknowns are not really the
Starting point is 00:20:25 hardest problems or even the most frequent problems that we have. It's all about these unknown unknowns, which is a way, I think, of talking about this cliff that we're all going off. You know, it used to be known unknowns. You'd get paged, you'd look at it, you'd kind of know what it was, you'd go poke around and you'd solve it. Now it's like, when you get paged, you should honestly just be like, uh, what is this? You know, I haven't seen this before. This is new. Or you don't really know where to start. Partly because of the sheer complexity and probably just because there are so many more possible outcomes or possible root causes. You just need a different, you need to stress resiliency in the modern world, not perfection.
Starting point is 00:21:06 And I think that I'm sort of joking and trying to push people's buttons when I say I test in production, but also sort of not. I mean, it's for real. Like that outage that Google Cloud Platform just had last week, what did they do?
Starting point is 00:21:19 It's a config change. Worked great in staging. They pushed it to prod, took the whole thing down. You can't test everything. So you have to invest in catching things. Failure should be boring, right? That's why we test in prod.
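As a sketch of that "guardrails for production" idea, here is roughly what a staged canary promotion loop can look like in Go. The promote, rollback, and errorRate hooks are hypothetical; you would wire them to your own deploy tooling and telemetry:

```go
// Minimal canary guardrail sketch: roll out in stages, promote only while
// the canary's error rate stays in bounds, otherwise roll back.
package main

import (
	"fmt"
	"time"
)

const maxErrorRate = 0.01 // 1% budget; pick what your business can tolerate

func main() {
	stages := []int{10, 25, 50, 100} // percent of traffic on the canary
	for _, pct := range stages {
		promote(pct)
		time.Sleep(5 * time.Minute) // soak: let real traffic hit it
		if r := errorRate("canary"); r > maxErrorRate {
			fmt.Printf("canary at %d%%: error rate %.3f, rolling back\n", pct, r)
			rollback()
			return
		}
		fmt.Printf("canary healthy at %d%%, promoting\n", pct)
	}
}

// Stubs: in a real pipeline these would call your deploy system and query
// recent events or metrics for the canary's error rate.
func promote(pct int) { fmt.Printf("routing %d%% of traffic to canary\n", pct) }
func rollback()       { fmt.Println("reverting to previous release") }

func errorRate(deploy string) float64 { return 0.002 }
```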
Starting point is 00:21:33 And you could say experiment in prod. I don't know, whatever. But I think that like for the known unknowns, you should test before production. But there are so many things that you just can't. And so we should be investing more into tools that let us test. And I think that a really key part of that has been observability.
Starting point is 00:21:48 We haven't actually, it's easy to ship things to production. It's much harder to tell what impact it has had. And that's why I feel like something like Honeycomb, where you can just poke around, is necessary. Like, I think that, I hope that we look back in a couple of years at the bad old days when we used to just ship code and wait to get paged.
Starting point is 00:22:08 Like, how fucking crazy is that? That's insane, that we used to just, like, wait for something bad to happen at a high enough threshold to page us. We should have the muscle memory as engineers that, like, what causes things to break? Well, usually it's us changing something. So whenever you change something in prod,
Starting point is 00:22:24 you should have the muscle memory to just go look: did what you expected to happen actually happen? Did anything else obvious happen at the same time? Like, there's something so satisfying, so much dopamine that you can get straight to your brain, just by going and looking and finding something and fixing it before anyone notices or complains.
Starting point is 00:22:47 we have a fixed amount of resources. And if we're trying to decide like what percentage of effort should go towards like recovering from production issues and what should go towards preventing them? Oh, this is such a personal question. It's based entirely on your business case, right? Like how much appetite do you have for failure?
Starting point is 00:23:11 It's going to be different for a bank than for, you know, how old are you? Who are your customers? You know, startups have way more appetite for risk than companies that are serving banks. You know, it's very, very, there's no answer that's exactly the same for any two companies, I think. But it sounds like what you're saying to me is that we should put
Starting point is 00:23:32 a lot of effort into, into recovering from production issues. Into resiliency. Yeah. Into early detection and mitigation. Recovery is an interesting word. Often I think it's just understanding. There are many changes you have to make. Say you're rolling out a version of the code that is going to increase the RAM footprint. And it's not a leak, you know it, but you don't actually know how much,
Starting point is 00:24:03 because you ran it in staging. And again, you're not going to have the same kind of traffic, the same variance. So you don't actually know. So I'm arguing that you need to roll things out. You need to have the tooling to make this a very mundane operation, right? It should roll out to 10%, get promoted, run for a while, get promoted to 20%, 30%, and be able to watch it so that you know if it's about to hit
Starting point is 00:24:28 an out-of-bounds or something. Because I think it's important, actually. Well, I think, just as a developer, it gives confidence when you can actually just roll back. But not everything, not everything can be rolled back, I guess. Yeah, especially, the closer you get to laying bits
Starting point is 00:24:44 down on disk, the more things are rolled forward only. Then you start to get sweaty palms. I don't know. It depends, but I've seen some hair-raising database migrations. Oh, God. I come from databases.
Starting point is 00:24:59 I have done things with databases that would turn your hair white. So you mentioned earlier that you built your own database. Oh, no, no, no. I've spent my entire career telling people not to write a database. So I'd like to be very clear on this point. We have written a storage engine. That's my story, and I'm sticking to it.
Starting point is 00:25:23 Tell me about your storage engine. It's as dead simple as we could possibly make it. It's a columnar store that is really freaking fast. We target one second for the 95th percentile of all queries. Why did you need your own data store? Well, that's a great question. Believe me, we tried everything out there. So the operations
Starting point is 00:25:48 engineering community for 20 years has been investing in time-series databases built on metrics, right? And we knew that this was just not a data model that was going to enable the kind of interactive, fast kind of
Starting point is 00:26:03 interaction that we wanted to support. And furthermore, we knew that we wanted to have these really wide, arbitrarily wide events. And we knew that because we're dealing with unknown unknowns, we knew that we didn't want to have any schemas. Because anytime you have to think about what information you might want to capture and fit it into a schema, it introduces friction in a really bad way. And then, you don't deal with indexes. You know, like, one of the problems
Starting point is 00:26:29 with every log tool is you have to pick which indexes you want to support. Some of them even charge by index, I think. Um, but then if, like, if you need to ask a question about something that isn't indexed, well, you're back to, like, oh, I'm going to go get coffee while I'm waiting for this query to run, right? And then if you didn't ask the right question, you've got to go for another walk. It's not interactive. It's not exploratory. So we tried everything out there. Druid came a little close, but it still didn't have the kind of richness. Yeah, we knew what we wanted, and so we had to write it.
Starting point is 00:27:01 We wrote it as simply as possible. We were using Golang. It is descended from Scuba at Facebook, for sure. Scuba was just like 10,000 lines of C++. It was entirely in memory because they didn't have SSDs when they wrote it. And it shells out to rsync for replication. It's janky as fuck.
Starting point is 00:27:21 But the architecture is nice. It's distributed. So there's a fan-out model where a query comes in, fans out to five nodes, does a column scan on all five, aggregates, pushes them back up. If there's too much to aggregate, then it fans out
Starting point is 00:27:37 again to another five nodes and repeats. So it's very scalable; we can handle very, very high throughput just by adding more nodes. So you're saying it doesn't have any indexes, or it indexes everything? Well, columns are effectively indexes, right? Yeah. So everything is equally fast, basically.
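A minimal sketch of that fan-out shape in Go, with each "node" reduced to an in-memory column segment. Real engines of this style keep a file per column on disk; everything here, including the query itself, is illustrative:

```go
// Toy fan-out aggregation: a query goes to N nodes, each scans its own
// column segments and returns a partial aggregate, and the root merges them.
package main

import (
	"fmt"
	"sync"
)

// partial is one node's local aggregate for a query like
// "count and sum duration_ms where status == 500".
type partial struct {
	count int
	sum   int
}

// scan is what each leaf node does: a straight column scan, no indexes,
// filtering and aggregating in one pass.
func scan(status, duration []int) partial {
	var p partial
	for i := range status {
		if status[i] == 500 {
			p.count++
			p.sum += duration[i]
		}
	}
	return p
}

func main() {
	// Five nodes, each holding a segment of the status and duration_ms columns.
	type segment struct{ status, duration []int }
	nodes := []segment{
		{[]int{200, 500, 200}, []int{12, 340, 9}},
		{[]int{500, 500}, []int{410, 290}},
		{[]int{200, 200, 200}, []int{8, 11, 14}},
		{[]int{500}, []int{380}},
		{[]int{200, 500, 200}, []int{10, 333, 7}},
	}

	results := make(chan partial, len(nodes))
	var wg sync.WaitGroup
	for _, n := range nodes {
		wg.Add(1)
		go func(n segment) { // fan out
			defer wg.Done()
			results <- scan(n.status, n.duration)
		}(n)
	}
	wg.Wait()
	close(results)

	var total partial // merge the partial aggregates back up
	for p := range results {
		total.count += p.count
		total.sum += p.sum
	}
	fmt.Printf("errors: %d, mean duration: %dms\n", total.count, total.sum/total.count)
}
```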
Starting point is 00:27:58 It's sort of like index everything, because everything's a column. Yes. Yeah. And you can have arbitrarily wide events; we use a file per column, basically. So up to the Linux open file handle limit, which is just like 32k or something.
Starting point is 00:28:14 It becomes not tractable for humans long before then. I like this idea that there is this very janky tool at Facebook that changed the world. Oh, they can't kill it. It's too useful, but it has been not invested in. And so it is horribly hard to understand. It's aggressively hostile to users.
Starting point is 00:28:37 It does everything it can to get you to go away, but people just can't let it go. Do you think that like more, more people should, should kind of embrace the chaos and have more of a startup focus? Yeah, I do. Yeah. I did. I thought you were going a different direction with that question, but yes,
Starting point is 00:28:54 that too. Which way did you think I was going? Oh, I thought you were going to ask if more people should build tools based on events instead of metrics. And yes, I'm truly...
Starting point is 00:29:05 You're very... I'm opening the door. We've given talks and we built our storage engine. As an industry, we have to make the jump from very limited... The thing about metrics is also they are always looking at the aggregate. The aggregate and the older
Starting point is 00:29:21 they are, the less fine-grained they are, right? That's how they drop data: by aggregating at write time. We drop data by sampling instead, because it is really, really powerful to have those raw events. This is a shift that I think the entire industry has to make. And I almost don't care if it's us or not. That's a lie. I totally care if it's us or not.
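The sampling idea can be sketched in a few lines of Go: keep one in N raw events, stamp each kept event with its sample rate, and reweight counts at query time. The Event shape here is made up for illustration:

```go
// Drop data by sampling instead of aggregating: each kept event still
// carries full context, and counts can be re-estimated from SampleRate.
package main

import (
	"fmt"
	"math/rand"
)

type Event struct {
	Fields     map[string]string
	SampleRate int // this event stands in for SampleRate original events
}

// sample keeps an event with probability 1/rate.
func sample(fields map[string]string, rate int) (Event, bool) {
	if rand.Intn(rate) != 0 {
		return Event{}, false // dropped
	}
	return Event{Fields: fields, SampleRate: rate}, true
}

func main() {
	kept := []Event{}
	for i := 0; i < 100000; i++ {
		if ev, ok := sample(map[string]string{"path": "/health"}, 100); ok {
			kept = append(kept, ev)
		}
	}
	// Reweight: each kept event counts for SampleRate originals.
	estimated := 0
	for _, ev := range kept {
		estimated += ev.SampleRate
	}
	fmt.Printf("kept %d raw events, estimating %d total\n", len(kept), estimated)
}
```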
Starting point is 00:29:43 But there needs to be more of us, right? This needs to be a shift that the entire industry makes because it's the only way to understand these systems. It's the only way I've ever seen. We should talk about tracing real quick. Because tracing is just a different way of visualizing events. Tracing is the only other thing that I know of that is oriented around events. Oh, what I was starting to say was that metrics are great for describing the health of the system, right? But they don't tell you anything about the event because they're not fine-grained and they lack the context. And as a developer,
Starting point is 00:30:12 we don't care about the health of the system. If it's up and serving my code, I don't give a shit about it. What I care about is every request, every event, and I care about all of the details from the perspective of that event, right? And we spend so much time trying to work backwards
Starting point is 00:30:28 from these metrics to the questions we actually want to ask. And that bridge right there is what is being filled by all of this intuition, you know, and jumping around between tools, like jumping from the metrics and aggregate for the system and jumping into your logs
Starting point is 00:30:41 and trying to grep for the string that you think might, you know, shed some light on it. Everything becomes so much easier when you can just ask questions from the perspective of the event. Tracing is interesting because tracing is just like Honeycomb, except for it's depth-first, is how I think of it.
Starting point is 00:30:56 Honeycomb is breadth-first, where you're slicing and dicing between events, trying to isolate the ones that have characteristics that you're looking for. And tracing is about, okay, now I've found one of those events. Now tell me everything about it from start to finish. And we just released our tracing product. And what's really freaking cool about it is you can go back and forth, right?
Starting point is 00:31:16 You can start with, all right, I don't know the question. All I have is a vague problem report. So I'm going to go try and find something, find an outlier, find an error that matches this user ID query, whatever. Okay, cool. I found it. Now show me a trace. Trace everything that hits this hop or this
Starting point is 00:31:34 query or whatever. And then once you have been like, oh, cool. I found one. Then you can zoom back out and go, okay, now show me everyone else who was affected by this. Show me everyone else who has experienced this? We've been debugging our own storage engine this way for about three or four months now.
Starting point is 00:31:49 It is mind-blowing just how easy it makes problems. Yeah, that sounds powerful for sure. I guess we're kind of getting the tools back that we lost when we split up into a million different services in some ways. Yeah, totally. It's kind of like distributed GDB.
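A hypothetical sketch of why events and traces are the same data seen two ways: give each event a trace ID and a parent span ID, and the depth-first view is just a walk over the parent links. The span fields and values below are invented:

```go
// Assemble a trace tree from flat events that share a trace ID.
package main

import "fmt"

type Span struct {
	TraceID, SpanID, ParentID, Name string
	DurationMs                      int
}

// printTree walks the parent links recursively, indenting by depth.
func printTree(spans []Span, parent string, depth int) {
	for _, s := range spans {
		if s.ParentID == parent {
			fmt.Printf("%*s%s (%dms)\n", depth*2, "", s.Name, s.DurationMs)
			printTree(spans, s.SpanID, depth+1)
		}
	}
}

func main() {
	trace := []Span{
		{"t1", "a", "", "GET /checkout", 412},
		{"t1", "b", "a", "auth-service", 30},
		{"t1", "c", "a", "cart-service", 350},
		{"t1", "d", "c", "mysql query", 320}, // the slow hop pops right out
	}
	printTree(trace, "", 0)
}
```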
Starting point is 00:32:08 So I don't talk to too many ops people on this podcast. I wanted to ask you, what do you think developers have to learn from an ops culture or mindset? Oh, that's such a great question. First of all, I heard you say that you're on call and you get paged before the ops person. Well, bless you.
Starting point is 00:32:32 This is the model of the future. And I want to make it clear that I don't want to... Ops has had a problem with masochism for as long as I've been alive. And the point of putting software engineers on call is not to invite them into masochism with us. It's to raise our standards for everyone and the amount of sleep that we should expect. I think I feel very strongly about this. The only way to write and support good code is by shortening those feedback loops and putting the same people who write software on call for it.
Starting point is 00:33:08 So it's just necessary. In the glorious future, which is here for many of us, we are all distributed systems engineers. And one thing about distributed systems is that it has a very high operational cost, right? Which is why software engineers are now having to learn ops. I'll often say that I feel like the first wave of DevOps was all about yelling at ops people to learn how to write code. And we did. Cool. We did. And I feel like now for just the last year or two, the pendulum's been swinging the other way.
Starting point is 00:33:36 And now it's all about, okay, software engineers, it's your turn. It's your turn to learn to build operable services. It's your turn to learn to instrument really well, to make the systems explain themselves back to you. It's your turn to pick up the ownership side of the software that you've been developing. And I think that this is great. I think this is better for everyone. It is not saying that everyone needs to be equally expert in every area. It's just saying that what we have learned about shipping good software is that everyone should write code
Starting point is 00:34:07 and everyone should support code. Something like 70 to 80% of our time is spent maintaining and extending and debugging, not greenfield development, which means, fundamentally, software engineers do more ops than software engineering. So I think it makes
Starting point is 00:34:24 sense to acknowledge that. I think it makes sense to acknowledge that. I think it makes sense to reward people for that. I am a big proponent of you should never make someone a senior engineer. Don't promote them if they don't know how to operate their services, if they don't show good operational hygiene. You have to show that this is what you value in an org,
Starting point is 00:34:42 And people pay attention to signals like promotions and pay grades and who thinks they're too good for what work. Definitely. I think there is an ops culture. Maybe it's just my perception. You mentioned masochism. I don't know where the causation and correlation go, uh, that there's a, I don't know if developers are going to become more, um, yeah, there's a certain, uh, there's a certain attitude
Starting point is 00:35:12 sometimes, and I don't think there's anything wrong with this, but of, you know, like, you know, call me when something's on fire. You know, that's, that's when I'm alive, right? Is when things are breaking. And yeah, believe me, I'm one of those people. I love it. I'm the person you want to call in a crisis. If, if we're not, if the database is down, we're not sure if it's ever coming up again, the company might be screwed, like, I am the person that you want at your side. And I've spent my career, like, working myself out of a job, um, repeatedly. I guess that's why I'm a startup CEO now. But that aside, you can both enjoy something and recognize
Starting point is 00:35:49 that too much of it is not good for you. I enjoy drinking, but I do try to be responsible about it. Yeah, I don't know. I think that the things that you call out and praise in your
Starting point is 00:36:05 culture are the things that are going to get repeated. And if you praise people for firefighting and heroics, you're going to get more of that. And if you treat it as an embarrassing episode that we're, you know, yeah, glad we got through it together. We privately thank people or whatever, but you don't call it out and praise it. And you make clear that this is not something you value, that it was an accident and you take it seriously, you know, and you give people enough time to execute on all the tasks that came out of the postmortem instead of, you know, having a retrospective, coming up with all this shit and then like deprioritizing it,
Starting point is 00:36:37 going on to feature work. That doesn't say, yes, we value your time, and we don't want to see more firefighting. I think that these organizational things are really the responsibility of any senior management and senior engineers. It's a tricky problem. I wanted to ask you, you are now CEO. Do you still get to work as an individual contributor? Do you still get to
Starting point is 00:37:04 fight fires and get down in the trenches? I'm not, I'm not. Well, I am fighting fires, but not of the technical variety. Um, I wanted to be CTO, that's what I was shooting for, um, but circumstances... Um, I don't know. I mean, I believe in this mission. I've seen it change people's lives. I've seen it make healthier teams, and I am going to see it through. I really miss sitting down in front of a terminal every morning. I really, really, really do.
Starting point is 00:37:34 But I've always been highly motivated by what has to be done. I don't play with technology for fun. I get up in the morning and I look at what needs to be done. So I guess this is just another variation on that. This is what needs to be done. I spent a year trying to get someone else to be CEO. I'm done. I can't find someone. That's fine. I'm in it for now. I'll just take it as far as I can. It's a very pragmatic approach. I always worry, like, you know, that if they, if they take my
Starting point is 00:38:07 whatever, my text editor away from me, like, I'll never get it back, just because I've seen it happen to other people. For sure. I've written a blog post about the engineer-manager pendulum, um, because I believe that the best technologists that I've ever gotten to work with were people who had gone back and forth a couple of times. Because the best tech leads are the ones who have spent time in management. They're the ones with the empathy and the knowledge for how to motivate people and how to connect the business to technology and explain it to people in a way that motivates them. And the best managers I've ever had, line managers, were never more than two or three years removed from writing
Starting point is 00:38:48 code, doing hands-on work themselves. I feel like it's a real shame that it's often a one-way path. I think it doesn't have to be if we're assertive about knowing that what we want is to go back and forth. Certainly what I hope for for myself. There doesn't seem to be a lot of precedent
Starting point is 00:39:04 for switching back and forth? There isn't, but, um, since I wrote that piece, I still get contacted by people every day just saying, thank you, this is what I wanted, I didn't know it was possible, I'm totally going to do it now. I actually wrote it for a friend of mine at Slack who was considering going through that transition. I was just like, yeah, you should do it. And I wrote the post for him, and he went back to being an IC, and he is so much happier. So he went back to being a contributor rather than a management role? Yeah, he was a director. And he's having an immense amount of impact in his senior IC role
Starting point is 00:39:37 because he's been there for so long. He knows everything. He can do these really great industry moving projects. Oh, that's awesome. How are you doing for time? Do you have to run? Um, I don't know. Let me see. Oh no, I can leave you my slate. So do you like dashboards or not? Um, I think that some dashboards are inevitable. Like you would need a couple of just top-level...
Starting point is 00:40:07 All right, I think a couple are inevitable, but they're not a debugging tool, right? They're a state-of-the-world tool. As soon as you have a question about it, you want to jump in and explore and ask questions. And I don't call that a dashboard. Some people do. But I think it's too confusing.
Starting point is 00:40:25 Interactive dashboards are fine. But you do have to ask that question. Ask those questions. You need to support, you know, what about this? What about for that user? What about, you know, I don't care what you call it, as long as you can do that. I do. I also think that, like, a huge pathology right now in these complex systems is that we're overpaging ourselves.
Starting point is 00:40:49 And we're overpaging ourselves because we don't actually trust our tools to let us ask any question and isolate the source of the problem quickly. So we rely on these clusters of alerts to give us a clue as to what the source is. And if you actually have a tool with this data that you trust, I think that the only paging alerts you really need are on requests per second, errors, latency, maybe saturation. And then you just need a dashboard that, at a top level, at a high level, shows you that. And then whenever you actually want to dig in and understand something, then you jump into more of a debugging framework.
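Those few top-level signals can be derived from the same raw events, which is the point: page on the aggregate, then slice the events to debug. A toy Go version with fabricated numbers:

```go
// Derive rate, errors, and latency from a window of wide events.
package main

import (
	"fmt"
	"sort"
)

type Event struct {
	Status     int
	DurationMs float64
}

func main() {
	// One minute of made-up traffic, one wide event per request.
	window := []Event{
		{200, 12}, {200, 15}, {500, 900}, {200, 11}, {200, 300}, {200, 14},
	}

	errors := 0
	latencies := make([]float64, 0, len(window))
	for _, ev := range window {
		if ev.Status >= 500 {
			errors++
		}
		latencies = append(latencies, ev.DurationMs)
	}
	sort.Float64s(latencies)
	idx := int(float64(len(latencies)) * 0.95) // p95 index, clamped below
	if idx >= len(latencies) {
		idx = len(latencies) - 1
	}

	fmt.Printf("requests/sec: %.2f\n", float64(len(window))/60)
	fmt.Printf("error rate:   %.1f%%\n", 100*float64(errors)/float64(len(window)))
	fmt.Printf("p95 latency:  %.0fms\n", latencies[idx])
}
```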
Starting point is 00:41:28 So these issues you talked about before, like a specific user, like, you would never get paged for that. How would that come to your attention? That is a great question. That is a great point. So many of the problems that show up in these systems will never show up in your alerts, or else you're over-alerting. Because they're localized. This is another thing that's different about the systems that we have now versus the old style of systems.
Starting point is 00:41:55 It used to be that everyone shared the same pools. They shared a tier for the web, for the app, for the database. And so they all had roughly the same experience. With these new distributed systems, say you had a 99.5% reliability in your old system. That meant that everyone's erroring 0.5% of the time.
Starting point is 00:42:15 On the new systems, it more likely means that the system is 100% up for almost everyone. But everyone whose last name starts with S-H-A who happens to be on this shard, they're 100% down, right? You get the same,
Starting point is 00:42:33 like if you're just getting the top level percentages, your paging alerts are not going to be reflective of the actual experience of your users. So then you're like, well, okay, you can generate alerts for every single combination and blah, blah, blah, blah, blah. And then you're just going to have black alerts over time. Honestly,
Starting point is 00:42:51 a lot of the problems that we are going to see are going to come to us through support or through users reporting problems. And over time, as you interact with the system, you'll learn what the most common high signals are. Maybe you'll want to have an end-to-end check that traverses every shard, right?
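Here is a hedged sketch of that kind of end-to-end shard check in Go. checkShard is a placeholder for a real probe against your own service; the point is that a single global success rate can read 99.5% while one shard is 100% down for everyone on it:

```go
// Probe every shard directly instead of trusting one top-level number.
package main

import "fmt"

// checkShard would make a real request against one shard; faked here so the
// example runs: shard-sha is hard down, everything else is healthy.
func checkShard(shard string) bool {
	return shard != "shard-sha"
}

func main() {
	shards := []string{"shard-a", "shard-b", "shard-c", "shard-sha"}
	healthy := 0
	for _, s := range shards {
		if checkShard(s) {
			healthy++
		} else {
			fmt.Printf("%s: DOWN, page someone\n", s)
		}
	}
	// A single global availability percentage would hide exactly this.
	fmt.Printf("%d of %d shards healthy\n", healthy, len(shards))
}
```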
Starting point is 00:43:09 Hits every shard or something like that. It's different for every architectural type. But I don't remember what the question was. Oh, I was just talking about the difference in systems. Yeah, you can't... There are so many ways that systems will break but only affect a few people now. So it makes the high cardinality questions
Starting point is 00:43:29 even more important. And like you were mentioning, developers should be able to operate their systems. I think, actually, developers should spend time doing support. It's horrible. It's not fun. Oh, God, yes. No, but it really builds empathy for your users.
Starting point is 00:43:44 Yeah. And so the, the issue, like, with, whatever you said, users with the last name S-H, like, that'll come in, you know, that'll come in as a support ticket. And if, if I'm busy and I'm a developer, I'll be like, that doesn't make sense, are you sure? And then, like, but if I actually have to, if I'm the one who has to deal with this ticket, you know? Yeah. No, totally. Yeah. We're big fans of, you know, having everyone rotate through, you know, on call, rotate through support triaging. It doesn't even have to be that often,
Starting point is 00:44:18 you know, maybe once a quarter or so is enough to keep you very grounded. It's like an empathy factor, I think. It really is. Yeah. And one of the hardest things, one of the things that separates good senior engineers from the rest is that they know how to spend their time
Starting point is 00:44:33 on things that have the most impact, right? Business impact. Well, what does that mean? Well, often it means things that actually materially affect your user's experience. And there's no better way than just having to be on a support rotation. Because if you don't,
Starting point is 00:44:51 if you aren't feeding your intuition with the right inputs, your sense of what has the most impact is going to be off, right? I like to think of, like, the intuition as being something you have to kind of cultivate with the right experiences, um, and the right shared experiences, right? You want a team to kind of have the same idea of what makes important important, as a team. Like, I feel like there's healthy teams and unhealthy teams, um, but I mean,
Starting point is 00:45:19 some teams, uh, really, uh, gel, and I always feel like, uh, the ops people tend to be more cohesive than other groups. I think so too. A lot of it is because of... It's like the band of brothers effect, right? You go to war together. You have each other's backs. Getting woken up in the middle of the night. There's just a...
Starting point is 00:45:37 Every team... Every place I've ever worked, the ops team has been the one that just has the most identity, I guess. The most character and identity, the most in-jokes, usually very ghoulish, graveyard humor. But
Starting point is 00:45:51 I think that the impact of a good on-call rotation is that there is this sense of shared sacrifice. And I would liken that to salt in food. A teaspoon makes your meal amazing. A cup of it means that you're crying, you know?
Starting point is 00:46:10 Like a teaspoon of shared sacrifice really pulls a team together. Yeah, I can see that. You don't want it to be like the person can't sleep at night. No, no. But like if one of the people on your team has a baby then everybody just like immediately volunteers because they're not going to let them get woken up by the
Starting point is 00:46:29 baby and the pager. They're just going to fill in for them for the next year, you know, that type of thing. Like, lowering the barrier, it should just be assumed that, you know, you want to have each other's backs, that nobody should be too impacted. That, you know, as an ops manager, whenever somebody got paged in the middle of the night, I would encourage them not to come in, or to sleep in, or I would take the pager for them for the next night, or something like that. Just looking out for each other's welfare and well-being is the thing that binds people, I think. Definitely. Well, it's been great to talk to you.
Starting point is 00:47:02 Was there anything... I don't know. I liked your, uh, I liked your controversial statements, about, you know, fuck metrics. What else you got? Metrics and dashboards can all die in a fire. Yeah. And every software engineer should be on call. Boom. All right, there's the title for the, uh, the episode. There you go. I'm going to make a lot of friends here. All right, that's the show. Thank you for listening to the CoRecursive podcast. I'm Adam Bell, your host. If you like the show, please tell a friend.
