Screaming in the Cloud - Shifting from Observability 1.0 to 2.0 with Charity Majors

Episode Date: April 2, 2024

This week on Screaming in the Cloud, Corey is joined by good friend and colleague, Charity Majors. Charity is the CTO and Co-founder of Honeycomb.io, the widely popular observability platform.... Corey and Charity discuss the ins and outs of observability 1.0 vs. 2.0, why you should never underestimate the power of software to get worse over time, and the hidden costs of observability that could be plaguing your monthly bill right now. The pair also shares secrets on why speeches get better the more you give them and the basic role they hope AI plays in the future of computing. Check it out!

Show Highlights:

(00:00) - Reuniting with Charity Majors: A Warm Welcome
(03:47) - Navigating the Observability Landscape: From 1.0 to 2.0
(04:19) - The Evolution of Observability and Its Impact
(05:46) - The Technical and Cultural Shift to Observability 2.0
(10:34) - The Log Dilemma: Balancing Cost and Utility
(15:21) - The Cost Crisis in Observability
(22:39) - The Future of Observability and AI's Role
(26:41) - The Challenge of Modern Observability Tools
(29:05) - Simplifying Observability for the Modern Developer
(30:42) - Final Thoughts and Where to Find More

About Charity:

Charity is an ops engineer and accidental startup founder at honeycomb.io. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O'Reilly's Database Reliability Engineering, and loves free speech, free software, and single malt scotch.

Links:

https://charity.wtf/
Honeycomb Blog: https://www.honeycomb.io/blog
Twitter: @mipsytipsy

Transcript
Starting point is 00:00:00 And a lot of it has to do with the 1.0 versus 2.0 stuff. You know, like the more sources of truth that you have and the more you have to dance between, the less value you get out because the less you can actually correlate, the more the actual system is held in your head, not your tools. Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am joined after a long hiatus from the show by my friend and yours, Charity Majors, who probably needs no introduction, but we can't assume that. Charity is and remains the co-founder and CTO of Honeycomb.io.
Starting point is 00:00:38 Charity, it's been a few years since we spoke in public. How are you? It has been a few years since we bantered publicly. I'm doing great, Corey. This episode is sponsored in part by my day job, The Duckbill Group. Do you have a horrifying AWS bill?
Starting point is 00:00:54 That can mean a lot of things. Predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com.
Starting point is 00:01:14 Remember, you can't duck the duck bill bill. And my CEO informs me that is absolutely not our slogan. We are both speaking in a few weeks from this recording at SRECon San Francisco, which is great. And I think we both found out after the fact that, wait a minute, this is like a keynote plenary opening session. Oh dear. That means I won't be able to phone it in. Given that you are at least as peculiar when it comes to writing processes as I am, I have to ask: have you started building your talk yet? Because I haven't. That's funny. No, I have not. Good, good. I always worry it's just me. There was some Twitter kerfuffle years ago about how it's so rude when speakers get up and say they didn't build their talk until recently. Like,
Starting point is 00:02:01 we're all sitting here listening to you. Like, we deserve better than that. And that doesn't mean it hasn't been weighing on me and I haven't been thinking about it for months, but I'm not going to sit down and write slides until I'm forced to. I've been angsting about it. I've had nightmares. Does that count? I feel like that should count for something.
Starting point is 00:02:19 You know, I feel like that gets lumped in the same bucket as speakers should never give the same talk more than once. That's just rude. We are paying to be here. And it's like, calm down, kids. Like, okay, everyone's a critic, but like, shit is a lot of work. We also have our day jobs and we all have a process that works for us. And you know, if you don't like the product that I am here delivering to you, you don't have to come see me ever again. That's cool. You don't have to invite me to your conference ever again. I would completely understand if you made those choices, but leave me to my life choices. My presenter notes are generally bullet points of general topics to talk about. So I don't think I've ever given the same talk twice. Sure, the slides might be the
Starting point is 00:03:04 same, but I'll at least try and punch up the title a smidgen. Well, I've given the same talk, not verbatim, but several times. And actually, I think it gets better every time because you lean into the material, you learn it more, you learn how an audience likes to interact with it more. I've had people request that I give the same talk again.
Starting point is 00:03:22 They're like, oh God, I love this one. And I'm like, oh cool, I've got some updated material there. And they're like, I really want my team to see this. And so, you know, I think you can do it in a lazy way. But just because you're doing it doesn't mean you don't care, which I think is the root of the criticism. Oh, you don't care. Yeah, this is a new talk for me. It's about the economics of on-prem versus cloud, which I assure you I've been thinking about for a long time and answering questions on for the past seven years. But I never put it together into a talk. And I'm somewhat annoyed as I'm finally starting to put it together that I hear reports that VMware slash Broadcom is now turning someone's $8 million renewal into $100 million renewal.
Starting point is 00:04:00 It's like, well, suddenly that just throws any nuanced take out the window. Yeah, when you're like 11x-ing your bill to run on-prem, yeah, suddenly move to the cloud. You can do it the dumbest possible way and still come out financially ahead. Not something I usually get to say. Well, you know, the universe provides. You have been talking a fair bit lately around the concept of going from observability 1.0 to observability 2.0. It's all good. Well, if nothing else, at least you people are using decent forms of semantic versioning. Good for you. But what does that mean here for the rest of us who are not drowning in the world of understanding what our applications are doing at any given moment? You know, it was kind of an offhand comment that I made. You know, you spit a bunch of shit out into the world, and every now and then people pick
Starting point is 00:04:50 up on a strand, and they start really pulling on it. This is a strand that I feel like people have been really pulling on. And the source is, of course, the mass confusion and delusion that has been everyone's idea of what observability means over the past 10 years. I feel like, you know, Christine and I started talking about observability in 2016, and we were lucky enough to be the only ones at the time. And we had this whole idea that there was a technical definition of observability, high cardinality, high dimensionality.
Starting point is 00:05:19 It's about unknown unknowns. And it lasted for a couple of few years. But 2020-ish, like all the big players started paying attention and flooding their money in. And they're like, well, we do observability too. There's three pillars. Now it's like, it can mean literally anything.
Starting point is 00:05:34 So the definition of observability that I've actually been kind of honing in on recently is it's a property of complex systems, just like reliability. You can improve your observability by adding metrics if you don't have any. You can improve your observability by instrumenting your code. You can improve your observability by educating your team or sharing some great dashboard links, right? But
Starting point is 00:05:55 it remains the fact that there's kind of a giant sort of discontinuous step function in capabilities and usability and a bunch of other things that we've experienced and our users report experiencing. And so we kind of need some way to differentiate between the three pillars world and what I think and hope is the world of the future. And they're very discontinuous because it starts at the very bottom. It starts with how you collect and store data. With observability 1.0, you've got three pillars, at least.
Starting point is 00:06:23 You're probably collecting and storing your data way more than three times. You've got your APM, you've got your web monitoring stuff. You might be collecting and storing your telemetry half a dozen different times, paying every time, and nothing really connects them except you sitting in the middle eyeballing these different graphs and trying to correlate them in your head based on past scar tissue for the most part. And 2.0 is based on a single source of truth. These arbitrarily wide structured data blobs, you can call them canonical logs, you can call them spans, traces, but you can derive metrics from those, you can derive traces from those, you can derive logs from those. And the thing that connects them is it's data.
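A minimal sketch of what one of those arbitrarily wide events might look like in practice, purely as an illustration: the field names, the fake request handler, and the print-to-stdout emit step are all invented for this example, not Honeycomb's SDK or any required schema. The point is simply that everything known about a request lands in one structured record, which is why metrics, traces, and logs can all be derived from it later.

    import json
    import time
    import uuid


    def handle_request(request):
        # One wide, structured event per request: a "canonical log" line.
        # Every field name here is made up for illustration; the point is
        # that it all lives in one record.
        event = {
            "timestamp": time.time(),
            "trace_id": request.get("trace_id") or str(uuid.uuid4()),
            "span_id": str(uuid.uuid4()),
            "service": "checkout",
            "endpoint": request["endpoint"],
            "build_id": "2024-03-29.3",                 # slice by build
            "feature_flags": ["new_cart_flow"],          # slice by flag
            "device_type": request.get("device_type", "unknown"),
            "customer_id": request.get("customer_id"),   # high cardinality is fine
            "http_status": 200,
        }
        start = time.monotonic()
        try:
            pass  # the actual work of the handler would go here
        finally:
            event["duration_ms"] = (time.monotonic() - start) * 1000
            # Emit once, at the end of the request. Counts, percentiles,
            # traces, and plain old logs can all be derived from this later.
            print(json.dumps(event))


    handle_request({"endpoint": "/export", "device_type": "ios", "customer_id": "c_123"})

Because the whole record travels together, "break down by build ID" or "break down by feature flag" is just a group-by over these fields rather than a new metric someone had to predict in advance.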
Starting point is 00:07:08 So you can slice and dice. You can dive down. You can zoom out. You can treat it just like fucking data. And so we've been starting to refer to these as observability 1.0 and 2.0. And I think a lot of people have found this very clarifying. What is the boundary between 1.0 and 2.0? Because, you know, with vendors doing what vendors are going to do, if the term observability 2.0 catches on, they're just going to drape themselves in its trappings.
Starting point is 00:07:34 But what is it that does the differentiation? There's a whole bunch of things that sort of collect around both. Like for 1.0, you know, the observability tends to be about MTTR, MTTD, reliability. And for 2.0, it's very much about what underpins the software development lifecycle. But the filter that you can apply to tell whether it's 1.0 or 2.0 is: how many times are you storing your data? If it's greater than one, you're in observability 1.0 land. But the reason that I find this so helpful is, like, so there's a lot of stuff: in 1.0, you tend to be paging yourself a lot because you're
Starting point is 00:08:08 paging on symptoms, because you rely on these page bombs to help you debug your systems. And in 2.0 land, you typically have SLOs. So I feel like to get observability 2.0, you need both a change in tools and a sort of change in mentality and practices, because it's about hooking up and making these tight feedback loops, so that you're instrumenting your code as you go, you get it into production, and then you look at your code through the lens of the instrumentation you just wrote, and you're asking yourself,
Starting point is 00:08:39 is it doing what I expected it to do? And does anything else look weird? And you can get those loops so tight and so fast that you're reliably finding things before your users find them in production. You're finding bugs before your users find them. And you can't do that without the practices, and you can't do that without the tools either, because if you're in 1.0 land, you know, you're trying to predict which custom metrics you might need, you're logging out a crapload of stuff, but there's a lot of guesswork involved, there's a lot of
Starting point is 00:09:09 pattern matching involved. And in 2.0 land, it doesn't require a lot of knowledge up front about how your system's going to behave, because you can just slice and dice. You can break down by build ID, break down by feature flag, break down by device type, break down by, you know, canary, like anything you need. You can just explore and see exactly what the data tells you every time. You don't need all this prior knowledge of your systems. And so you can ask yourself, is my code doing what I expect it to do?
Starting point is 00:09:34 Does anything else look weird? Even if you honestly have no idea what the code is supposed to be doing. It's that whole known unknowns versus unknown unknowns thing. Yeah, I've been spending the last few months running Kubernetes locally. I built myself a Kubernetes of my very own, given that I've been fighting against that tide for far too long. And I have a conference talk coming up where I committed to making fun of it. And it turns out that's not going to be nearly as hard as I was worried it would be. It's something of a target
Starting point is 00:10:02 rich environment. But one of the things I'm seeing is that this cluster isn't doing a whole heck of a lot, but it's wordy. And figuring out why a thing is happening requires ongoing levels of instrumentation in strange and different ways. Some things work super well because of what they're designed for and what their imagined use cases are. For example, when I instrumented it, this was Honeycomb, among other things, I've yet to be able to get it to reliably spit out the CPU temperature of the nodes, just because that's not something OTel usually thinks about in a cloud-first world.
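For anyone wanting to close that particular gap themselves, here is a rough sketch of one way it might be done with the OpenTelemetry Python SDK plus psutil. This is an illustration rather than anything discussed in the episode: the meter and metric names are invented, the console exporter is a stand-in for wherever you actually ship telemetry, and psutil.sensors_temperatures() only works on Linux hosts that expose thermal sensors.

    import time

    import psutil  # sensors_temperatures() is Linux-only and may return {}
    from opentelemetry import metrics
    from opentelemetry.metrics import CallbackOptions, Observation
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import (
        ConsoleMetricExporter,
        PeriodicExportingMetricReader,
    )

    # Print readings every 10 seconds; swap in an OTLP exporter to send them
    # to a real backend instead of stdout.
    reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=10_000)
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
    meter = metrics.get_meter("node-temperature")  # name is arbitrary


    def read_temperatures(options: CallbackOptions):
        # Yield one observation per sensor the kernel reports.
        for chip, sensors in psutil.sensors_temperatures().items():
            for sensor in sensors:
                yield Observation(
                    sensor.current,
                    {"chip": chip, "sensor": sensor.label or "unlabeled"},
                )


    meter.create_observable_gauge(
        "hw.cpu.temperature",  # made-up name, not an official semantic convention
        callbacks=[read_temperatures],
        unit="Cel",
        description="CPU/package temperature as reported by psutil",
    )

    if __name__ == "__main__":
        time.sleep(30)  # let a few export cycles run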
Starting point is 00:10:36 I've also been spending fun times figuring out why the storage subsystem would just eat itself from time to time, with what appears to be no rhyme or reason to it. And I was checking with Axiom, where I was just throwing all of my logs from the beginning on this thing. And in less than two months, it's taken 150 gigs of data. I'm thinking of that in terms of just data egress charges. That is an absurd amount of data slash money for a cluster that frankly isn't doing anything. So it's certainly wordy about it, but it's not doing anything. Yeah, this is why the logs topic is so
Starting point is 00:11:11 fraught. You can spend so much money doing this, you know. And the log vendors are like, I always get the Monty Python "every sperm is sacred" when they're like, every log line must be kept. I'm like, yeah, I bet it does, you know, because people are so afraid of sampling. But this is the shit you sample, the shit that means absolutely nothing, right? Or health checks. You know, in a microservices environment, 25% of your requests might be health checks. So when we come out and say sample, we're not saying sample your billing API requests. It's like, sample the bullshit that's just getting spat out for no reason, and log with intent the stuff that you care about
Starting point is 00:11:51 and keep that. But the whole logging mindset is spammy and loud and it's full of trash, frankly. Like, when you're just emitting everything you might possibly think about, you know, then you can't really correlate anything. You can't really do anything with that data; it doesn't really mean anything. But when you take sort of the canonical logging slash tracing approach, you know, you can spend very little money but get a lot of very rich and meaningful data. I also find that spewing out these logging events in a bunch of different places.
Starting point is 00:12:26 I have no idea where it's storing any of this internally to the cluster. I'm sure I'll find out if a disk fills up, if that alarm can get through or anything else. But the painful piece that I keep smacking into is that all of this wordiness, all this verbosity is occluding anything that could actually be important signal. And there have been some of those during some of the experiments I'm running. I love the fact, for example, that by default, you can run kubectl events or kubectl get events, and those are not the same thing, because why would they be? And kubectl get events loves to just put them in apparently non-deterministic order.
Starting point is 00:12:59 And the reasoning behind that is, well, if we've seen something a bunch of times, do we show it at the beginning or at the end of that list? It's a hard decision to make. It's like, that's great. I've had a bunch of things happen in the last 30 seconds. Why is that all hidden by stuff that happened nine days ago? It's obnoxious. That's a great question. Boy, you should be a Kubernetes designer, Corey. No, because apparently I still have people who care what I have to say and I want people to think well of me. That's apparently a non-starter
Starting point is 00:13:28 for doing these things. It's awful. I don't mean to attack people for the quality of their work, but the output is not good. Things that I thought Kubernetes was supposed to do out of the box, like, all right, let's fail a node.
Starting point is 00:13:40 I'll yank the power cord out of the thing because why not? And it just sort of sits there forever because I didn't take additional extra steps on a given workload to make sure that it would time out itself and then respawn on something else. I wasn't aware every single workload needed to do that. The fact that it does is more than a little disturbing. Yes, it is. So things are great over here in Kubernetes land as best I can tell, because I've been avoiding it for a decade. And I'm coming here and looking at all of this and it's, what have you people exactly been doing? Just because this seems like the basic problems that I was dealing with back when I worked on servers, when we were just running VMs and pushing those around in the days before containers. Now, I keep thinking there's something I'm missing, but I'm getting more and more concerned that I'm not sure that there is. You know, never underestimate software's ability to get worse.
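Tying Corey's chatty cluster back to Charity's point a moment ago about sampling the spam and keeping what you log with intent, and to the sample-rate math she describes a little later, here is a rough sketch of what that can look like. Everything in it is illustrative: the routes, the rates, and the idea of stamping each kept event with its sample rate so a backend can multiply counts back up; real pipelines do this with far more nuance.

    import random

    # Keep 1 in N per route. The routes and numbers are invented for illustration;
    # the idea is "sample the noise hard, never sample the stuff that matters."
    SAMPLE_RATES = {
        "/healthz": 100,      # health-check spam: keep 1 in 100
        "/metrics": 100,
        "/api/billing": 1,    # keep every billing request
    }
    DEFAULT_RATE = 10


    def maybe_keep(event):
        # Head-sampling decision: return the event (stamped with its sample
        # rate) if kept, or None if dropped.
        rate = SAMPLE_RATES.get(event.get("endpoint"), DEFAULT_RATE)
        if random.randrange(rate) != 0:
            return None
        event["sample_rate"] = rate  # lets the backend weight it back up
        return event


    def estimated_total(kept_events):
        # Each kept event stands in for `sample_rate` originals, which is the
        # multiplication the UI is described as doing for you.
        return sum(e.get("sample_rate", 1) for e in kept_events)


    if __name__ == "__main__":
        traffic = [{"endpoint": "/healthz"}] * 10_000 + [{"endpoint": "/api/billing"}] * 50
        kept = [e for e in (maybe_keep(dict(e)) for e in traffic) if e]
        print(f"kept {len(kept)} of {len(traffic)} events, "
              f"estimated original volume: {estimated_total(kept):.0f}")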
Starting point is 00:14:30 I will say that instrumenting it with Honeycomb was a heck of a lot easier than it was when I tried to use Honeycomb to instrument a magically bespoke architectured serverless thing running on Lambda and some other stuff. Because, oh, you're actually, you know, it turns out when you're running an architecture that a sane company might actually want to deploy,
Starting point is 00:14:51 then, yeah, okay, it turns out that suddenly you're back on the golden path of where most observability folks are going. A lot less of solve it myself. And let's be fair, you folks have improved the onboarding experience, the documentation begins to make a lot more sense now, and what it says it's going to do is generally what happens in the environment. So gold star for you. That is high praise. Thank you, Corey. Yeah, we put some real muscle
Starting point is 00:15:16 grease into Kubernetes last fall, right before KubeCon, because like you said, it's the golden path. It's the path everyone's going down. And for a long time, we kind of avoided that because, honestly, we're not an infrastructure tool. We are for understanding your applications from the perspective of your application for the most part. But a very compelling argument was made that, you know, Kubernetes is kind of an application of itself. It's your distributed system, so it actually does kind of matter
Starting point is 00:15:44 when you need to, like, pull down an artifact or you need to do a rolling, you know, restart or when all these things are happening. So we tried to make that, you know, pretty easy. And I'm glad to hear things have gotten better. You mentioned the cost thing. I'd like to circle back to that briefly. I recently wrote an article about observability, the cost crisis in observability is what it's called, because a lot of people have been kind of hot under the collar about their, we won't name specific vendors, but their bills lately when it comes to observability. The more I listened, the more I realized they aren't actually upset about the bill itself. I mean, maybe they're upset about the bill itself. What they're really upset about is the fact that the value that they're getting out of these tools has become radically decoupled
Starting point is 00:16:29 from the amount of money that they're paying. And as their bill goes up, the value they get out does not go up. In fact, as the bill goes up, the value they get out often goes down. And so I wrote a blog post about why that is. And a lot of it has to do with the 1.0 versus 2.0 stuff. The more sources of truth that you have and the more you have to dance between, the less value you get out. Because the less you can actually correlate, the more the actual system is held in your head, not your tools. With logs, as you just talked about, the more you're logging, the more span you have, like the higher your bill is, the harder it is to get shit out. The slower your full text search has become, the more you have to know what you're looking for before you can search for the thing
Starting point is 00:17:14 that you need to look for. And where they live. I will name a vendor name. CloudWatch is terrible in this sense. 50 cents per gigabyte ingest, though at reInvent they just launched a 25 cent gigabyte ingest with a lot less utility. Great. And the only way to turn some things off from logging to CloudWatch at 50 cents a gigabyte ingest is to remove the ability
Starting point is 00:17:35 for what it's doing to talk to CloudWatch. That is absurd. That is one of the most ridiculous things I've seen. And I've got to level with you, CloudWatch is not that great of an analysis tool. It's just not. I know they're trying with CloudWatch log insights and
Starting point is 00:17:49 all the other stuff, but they're failing. Wow. Yeah. I mean, you can't always just solve it at the network level. That's a solution. You can't always, can't always reach for it. You can also solve it at the power socket level, but most of us prefer other levels of solving our distributed systems problems. It's awful. It's, it's one of those areas where I have all these data going all these different places. Even when I trace it, it still gets very tricky to understand what that is. When I work on client environments,
Starting point is 00:18:14 there's always this question of, okay, there's an awful lot of cross AZ traffic, an awful lot of egress traffic. What is that? And very often the answer is tied to observability in some form. Yeah, yeah, yeah. No, for sure, for sure. In our systems nowadays, the problem is very rarely
Starting point is 00:18:30 debugging the code. It's very often where in the system is the code that I need to debug? It's the murder mystery aspect to it. It's the murder mystery for sure. Here at the Duckbill Group, one of the things we do with, you know, my day job is we help negotiate AWS contracts. We just recently crossed $5 billion of contract value negotiated. It solves for fun problems such as how do you know that your contract that you have with AWS is the best deal you can get? How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth?
Starting point is 00:19:11 To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com. But like, metrics are even worse than logs when it comes to this shit, because, number one, there are all kinds of things you can do where this is not visible in your bill. You have to really take a microscope to it. And I was repeating this to someone last Friday. He was like, oh, I wish it only cost us $30,000 a month. He's like, over the weekend, some folks deployed some metrics
Starting point is 00:19:54 and they were costing us $10,000 apiece over the weekend. And I was just like, oh my God. And there's no way to tell in advance before you deploy one of these. You just have to deploy and hope that you're watching closely. And that isn't even the worst of it. You have to predict in advance every single custom metric that you need to collect. Every single combination, permutation of attributes, you have to predict in advance and capture it. And then you can never
Starting point is 00:20:20 connect any two metrics again, ever. You've discarded all of that at write time. You can't correlate anything from one metric to the next. You can't tell if this spike in this metric is the same as that spike in that metric. You can't tell. You have to predict them up front. You have no insight into how much they're going to cost. You have barely any insight into how much they did cost. And you have to constantly sort of reap them, because your cost goes up at minimum linearly with the number of new custom metrics that you create, which means there's a real hard cap on the number that you're willing to pay for. And some poor fucker has to just sit there manually combing through them every day and picking out which ones they think that they can afford to sacrifice. And when it comes to using metrics, all you have, what it feels like, honestly, is on the command line using
Starting point is 00:21:07 grep and bc. You can do math on the metrics, and you can search on the tags, but never the twain shall meet. And you can't combine them, you can't use any other data types, and it's just like, this is a fucking mess. So, talking about the bridge from observability 1.0 to 2.0, that bridge is just the log.
Starting point is 00:21:24 So while I have historically said some very mean things about logs, the fact is that that is the continuum that people need to be building the future on. Metrics are the past. Metrics are so hobbled. Like they got us here, right? As my therapist would say,
Starting point is 00:21:41 they're what got us here, but they won't get us to where we need to go. Logs will get us to where we need to go. As we structure them, as we make them wider, as we make them less messy, as we stitch together all of the things that happen over the course of a run, as we add IDs so that we can trace using them and use spans using them, that is the bridge that we have to the future. And what the cost model looks like in the future is very different. It's not exactly cheap. I'm not going to lie and say that it's ever cheap.
Starting point is 00:22:11 But the thing about it is that as you pay more, as your bill goes up, the value you get out of it goes up too. Because, well, for Honeycomb at least, I can't speak for all observability 2.0 vendors, but it doesn't matter how wide the event is. If it has hundreds of dimensions per request, we don't care. We encourage you to, because the wider your events are, the wider your logs are, the more valuable they will be to you.
Starting point is 00:22:35 And if you instrument your code more, adding more spans, presumably that's because you have decided that those spans are valuable to you to have in your observability tool. And when it comes to really controlling costs, your levers are much more powerful because you can do dynamic sampling. You can say like, okay, all the health checks, all the Kubernetes spam, sample that at a rate of one to a hundred. So one of the things we do at Honeycomb is you
Starting point is 00:23:03 can attach a sample rate to every event, so we do the math for you in the UI. So if you're sampling at one to a hundred, we multiply every event by a hundred when it comes to counting it, so you can still see what the actual traffic rate is. That's the problem: people view this as an all-or-nothing, where you've either got to retain everything, or, as soon as you start sampling, people think you're going to start sampling, like, transaction data. And that doesn't work. So it requires a little bit of tuning
Starting point is 00:23:30 and tweaking at scale. Yes, but it doesn't require constant tuning and tweaking the way it does in Observability 1.0. You don't have to sit there and scan your list of metrics after every weekend, you know? And you don't have to do dumb sampling, like the sampling story in Observability 1.0. And I understand why people hate it
Starting point is 00:23:47 because you're usually dealing with consistent hashes or log levels, or you're just commenting out blocks of code. Yeah, I agree that sucks a lot. And now I have to ask the difficult question here. Do you think there's an AI story down this path? Because on some level, when something explodes back in Observability 1.0 land, for me at least, it's, okay, at three o'clock, everything went nuts. What spiked or changed right around then that might be a potential leading indicator of
Starting point is 00:24:16 where to start digging? And naively, I feel like that's the sort of thing that something with a little bit of smart behind it might be able to dive into. But again, math and generative AI don't seem to get along super well yet. You know, I do think that there are a lot of really interesting AI stories to be told, but I don't think we need it for that. You've seen BubbleUp? If there's anything on any of your graphs,
Starting point is 00:24:38 any of your heat maps that you think is interesting with Honeycomb, you draw a little bubble around it, and it computes for all dimensions inside the bubble and outside, baseline, source, and distance. You're like, ooh, what's this? You draw a little thing and it's like, oh, these are all requests that are going to slash export. They're all with a two megabit
Starting point is 00:24:55 blob size coming from these three customers. They're going to this region with this language pack, this build ID, and with this feature flag turned on. And so much of debugging is that. Here's the thing I care about because it's paging me. How is it different from everything else I don't care about? And like, and you can do this with SLOs too. This is how people start debugging with Honeycomb.
Starting point is 00:25:12 You get paged about an SLO violation. You go straight to the SLO and it shows you, oh, the events that are violating SLOs are all the ones to this cluster, to this read replica, using, you know, this client. Like this is easy shit. We're still in place.
Starting point is 00:25:29 People need to do this. People probably do need to use AI in observability 1.0 land just because they're dealing with so much noise. And again, the connective tissue that they need to just get this out of the data, they don't have anymore, because they threw it away at write time. But when you have
Starting point is 00:25:45 the connective tissue, you can just do this with normal numbers. Now, I do think there are a lot of interesting things we can do with AI here. I hesitate to use it for a core debugging task, because false positives are so expensive and so common, and you're changing your system every day if you're deploying it.
Starting point is 00:26:01 I just think it's a bit of a mismatch. But we use generative AI for, like, we have a query system. If you've shipped some code and you want to use natural language to ask, did something break? Did something get slow? You just ask using English. Did something break? Did something get slow? And it pops you into the query builder with that question.
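As a toy illustration of the inside-the-bubble versus baseline comparison Charity described above (not Honeycomb's actual algorithm, just the general shape of it, with invented events and attributes): for each attribute value, ask how much more common it is among the selected events than in everything else, and surface the outliers.

    from collections import Counter


    def bubble_up(selected, baseline, min_lift=2.0):
        # For every attribute value seen in the selected events, compare its
        # frequency inside the selection against the baseline and keep the
        # biggest outliers. A crude stand-in: plain frequency ratios.
        results = []
        for key in {k for event in selected for k in event}:
            inside = Counter(event.get(key) for event in selected)
            outside = Counter(event.get(key) for event in baseline)
            for value, count in inside.items():
                sel_frac = count / len(selected)
                base_frac = outside.get(value, 0) / max(len(baseline), 1)
                lift = sel_frac / base_frac if base_frac else float("inf")
                if lift >= min_lift:
                    results.append((key, value, sel_frac, lift))
        return sorted(results, key=lambda row: row[3], reverse=True)


    # Invented data: the slow requests all share an endpoint and a build ID.
    slow = [{"endpoint": "/export", "build_id": "124", "region": "us-east-1"}] * 8
    rest = [{"endpoint": "/home", "build_id": "123", "region": "us-east-1"}] * 90 \
         + [{"endpoint": "/export", "build_id": "123", "region": "us-east-1"}] * 10

    for key, value, frac, lift in bubble_up(slow, rest):
        print(f"{key}={value}: {frac:.0%} of selected events, {lift:.1f}x the baseline")

The endpoint and build ID jumping out of the noise is the same "here's the thing I care about, how is it different from everything I don't" move described a moment ago, done with nothing smarter than counting.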
Starting point is 00:26:19 I think there are really interesting things that we can do when it comes to using AI to, like... oh, God, the thing that I really can't wait to get started with is, like, so Honeycomb, I have to keep explaining Honeycomb, and I apologize, because I don't like to go to the centers. But one of the things that we do with Honeycomb is you can see your history, all the queries that you've run. So often when you're debugging, you run into a wall and you're like, oh, lost the plot. You need to rewind to where you last had the plot, right? So you can see, you know, all the queries you've run and the shapes of them and everything. So you scroll back to where you last had the plot.
Starting point is 00:26:53 But you also have access to everyone else's history. So if I'm getting paged about MySQL, maybe I don't know fuck all about MySQL, but I know that Ben and Emily on my team really know MySQL well. And I feel like they were debugging a problem like this last Thanksgiving. So I just go and I ask, like, what were Ben and Emily doing around MySQL last Thanksgiving? What did they think was valuable enough to put in a postmortem? What did they put in, you know, a clip doc? What did they put a comment in? What did they post to Slack? And then I can jump to, like, and that's the shit where I can't get to,
Starting point is 00:27:19 I can't wait to get generative AI, like, processing the shit that, what is your team talking about? What is your team doing? How are the experts, because in a distributed system, everybody's an expert in their corner of the world, right? But you have to debug things that span the entire world. So how can we derive wisdom from everybody's deep knowledge of their part of the system and expose that to everyone else? That's the shit I'm excited about. I think it's hella basic to have to use AI to just see what paged you in the middle of the night. And yet somehow it is.
Starting point is 00:27:54 There's no good answer here. It's the sort of thing we're all stuck with on some level. It's a chicken and egg problem, right? But the thing is that that's because people are used to using tools that were built for known unknowns. They were built in the days of the LAMP stack era, when all of the complexity was really
Starting point is 00:28:09 bound up inside the code you were writing. So if all else failed, you could just jump into GDB or something. But your system failed in predictable ways. You can size up a LAMP stack and go, okay, I know how to write exactly 80% of all the monitoring checks that are ever going to go off, right? Queues are going to fill up, you know, things are going to start 500ing, fine. But those days are long gone, right? Now we have these vast, far-flung distributed architectures. You're using, you know, yes, you're using containers, you're using Lambda, you're using third-party platforms, you're using all this shit, right?
Starting point is 00:28:43 And that means that every time you get paged, honestly, it should be something radically different. We've also gotten better at building resilient systems. We've gotten better at fixing the things that break over and over. And every time you get paged, it should be something radically different. But our tools are still built for this world where you have to kind of know what's going to break before it breaks. And that's really what Honeycomb was built for, is a world where you don't have prior knowledge. Every time it breaks, it is something new. And you should be able to follow the trail of breadcrumbs from something
Starting point is 00:29:16 broke to the answer every time in a very short amount of time without needing a lot of prior knowledge. And the thing is, you know, I was on a panel with the amazing Miss Amy Tobey a couple of weeks ago. And at the end, they asked us, what's the biggest hurdle that people have, you know, in their observability journey? And I was like, straight up, they don't believe a better world is possible. They think that the stuff we're saying is just vendor hype, because all vendors hype their shit. The thing about Honeycomb is, we have never been able to adequately describe to people the enormity of how their lives are going to change.
Starting point is 00:29:50 Like, we underpitch the ball because we can't figure out how, or people don't believe us, right? Like, better tools are better tools and they can make your life so much better. And the other thing is that people, not only do they not believe a better world is possible, but when they hear us talk about that world, they're like, oh, God, yeah, that sounds great.
Starting point is 00:30:08 But, you know, it's probably going to be hard. It's going to be complicated. You know, we're going to have to shift everyone's way of thinking. Instrumenting all of your code manually has historically been the way that you wind up getting information out of this. It's like, that's great. I'll just do that out of the unicorn paddock this weekend. It doesn't happen.
Starting point is 00:30:24 It doesn't happen. But in fact, it's so much easier. It's easier todock this weekend. It doesn't happen. It doesn't happen. But in fact, it's so much easier. It's easier to get data in. It's easier to get data out. I'm liking a lot of the self-instrumentation stuff. Oh, we're going to go ahead and start grabbing the easy stuff, and then we're going to figure out ourselves where to change some of that. It's doing the right things.
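For the curious, the "grab the easy stuff automatically, then enrich it yourself" pattern Corey is describing might look roughly like this with OpenTelemetry's Flask instrumentation. The Flask app, the console exporter, and the custom attributes are all invented for the sketch; a real setup would point the exporter at an actual backend and hang far more context off each span.

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # The "easy stuff": every inbound request gets a span without hand-writing any.
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)


    @app.route("/export")
    def export():
        # The part you figure out yourself: enrich the auto-created span with
        # the fields you actually care about (these names are made up).
        span = trace.get_current_span()
        span.set_attribute("app.customer_id", "c_123")
        span.set_attribute("app.feature_flag.new_cart_flow", True)
        return "ok"


    if __name__ == "__main__":
        app.run(port=5000)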
Starting point is 00:30:40 The trend line is positive. Yeah, when you don't have to predict in advance every single custom metric that you're going to have to be able to use and everything, it's so much easier. It's like the difference between using the keyboard and using the mouse, which is not something that you and I, people like us typically go to,
Starting point is 00:30:56 but having to know every single Unix command line versus drag and drop, it's that kind of leap forward. Yeah, we're not going to teach most people to wind up becoming command line wizards. And frankly, in 2024, we should not have to do that. No, we should not have to do that. I remember inheriting a curriculum for a LAMP course I was teaching about 10 years ago. And the first half of it was how to use vi. It's like, step one, rip that out. We're using nano today, the end. And it just, 'cause I don't want to, people should
Starting point is 00:31:22 not have to learn a new form of text editing in order to make changes to things. These days, VS Code is the right answer for almost all of it. But I digress. Especially in this, this is really true for observability, because nobody is sitting around just like, oh wait, okay. On every team, there's like one dude, one person who's like, okay, I'm a huge geek about graphs. I've never been that person. Typically when people are using these tools, it's because something else is wrong and they are on a hot path to try and figure it out. And that means that every brain cycle that they have to spare to their tool is a brain cycle that they're not devoting to the actual problem that they have. And that's a huge problem. It really is. I want to thank you for taking the time to speak with
Starting point is 00:32:02 me today about what you've been up to and how you're thinking about these things. If people want to learn more, where's the best place for them to go find you? I write periodically at charity.wtf and the Honeycomb blog. We actually write a lot about OpenTelemetry and stuff that isn't super Honeycomb related. And I still am on Twitter at mipsytipsy. And we will, of course, include links to all of those things in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it.
Starting point is 00:32:29 Thanks for having me, Corey. It's always a joy and a pleasure. Charity Majors, CTO and co-founder of Honeycomb. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice. That platform, of course, being the one that is a reference customer for whatever observability vendor you work for.
