The Pragmatic Engineer - Observability: the present and future, with Charity Majors

Starting point is 00:00:00 the three pillars model. What is that? The famous phrase goes, I think it was coined by Peter Bourgon back in 2017, that observability has three pillars: metrics, logs, and traces. And a lot of vendors glommed onto this because, cynically speaking, they have a metrics product to sell, a logging product to sell, and a tracing product to sell. But it's actually kind of worse than that, right? Like, every request that enters your system, historically people have stored it in, sure, like metrics storage,

Starting point is 00:00:24 and dashboards, and structured logs, and unstructured logs, in a tracing tool, in a profiling tool, in an analytics tool, you know, again and again and again and again, and what connects them? Not much. You, the engineer, sitting in the middle, going, well, that shape looks like that shape, so they're probably the same thing, or maybe copy-pasting IDs, and the cost multiplier is obscene. Charity Majors is an observability expert and the author of the book Observability Engineering. She worked as a software engineer at Parse, then at Facebook, and then co-founded Honeycomb, an observability startup. She believes that observability tools should be a lot easier and faster than they are today

Starting point is 00:01:01 and that every engineer should have the kind of magical observability experience that devs at Meta have with their internal tools. In today's episode we go into: what is observability, and what do things like the three-pillar model and cardinality mean? What is observability 2.0, and why is everyone so excited about it? Things engineering teams frequently get wrong about observability, and so many more things worth knowing about modern observability practices. If you enjoy the show and would like to support it,

Starting point is 00:01:27 please subscribe to the podcast on any podcast platform and on YouTube. Charity, welcome to the podcast. Thank you so much for having me. You and I go way back, and I feel like we don't get to talk enough, so this is a real treat for me. We go way back. In fact, the way we started talking was so funny.

Starting point is 00:01:45 We both were on our blogs, we were writing advice columns. Oh my gosh, someone sent the same advice question to both of us, and we both answered, and we were like, hey, hi there, friend. Yeah, so someone asked something about, I think, measuring developer productivity actually, and then I think you wrote something and I wrote something and we didn't know about each other, and then we discovered it later and I read your article and I'm like, oh, I could have written the same, and I realized we were a little bit common souls.

Starting point is 00:02:15 Yeah, exactly. That was funny. This episode was brought to you by Sonar, the creators of SonarQube Server, Cloud, IDE, and Community Build. Sonar helps prevent bugs, code quality, and security issues from reaching production, amplifies developers' productivity in concert with AI assistants, and improves the developer experience with streamlined workflows. Sonar analyzes all code regardless of who writes it,

Starting point is 00:02:40 your internal team or Gen AI, resulting in more secure, reliable, and maintainable software. Combining Sonar's AI code assurance capability in SonarQube with the power of AI coding assistants like GitHub Copilot, Amazon Q Developer, and Google Gemini Code Assist boosts developer productivity and ensures that code meets rigorous quality and security standards. Join over 7 million developers from organizations like IBM, NASA, Barclays, and Microsoft who use Sonar. Trust your developers, verify your AI-generated code.

Starting point is 00:03:11 Visit sonarsource.com slash pragmatic to try SonarQube for free today. That is sonarsource.com slash pragmatic. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your governance, risk, and compliance program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Vanta can help you start or scale your security program by connecting you with auditors and

Starting point is 00:03:41 experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back, so you can focus on building your company. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. With Vanta, they centralize security workflows, complete questionnaires up to five times faster, and proactively manage vendor risk.

Starting point is 00:04:07 Join over 9,000 global companies to manage risk and prove security in real time. For a limited time, my listeners get $1,000 off Vanta at vanta.com slash pragmatic. That is V-A-N-T-A dot com slash pragmatic for $1,000 off. So to kick things off, can you tell us how you first got exposed to observability? Beforehand, when we talked, I remember that there was something about the startup that you worked at, Parse, and then it got acquired and you worked at Facebook, and then something happened there. How did this go? Well, so around the time that Parse was getting acquired, you know, we were, this was in the, like, free money days.

Starting point is 00:04:46 you know, rocket ship growth. And like we were, we were taking off. And we had like 60-some thousand mobile apps when we got acquired. By the time I left, we had over a million mobile apps. And the traffic was so unpredictable. Like a different app would hit the top 10 in iTunes like every day, you know? And just be like. And this was back in 2013, 2014, right?

Starting point is 00:05:07 Yeah, exactly. Like 2012, 2013. And we had built this on the Ruby on Rails stack, which was a smart enough idea at the time, but, you know, fixed pool of workers. And so any app takes off and suddenly, boom, Parse goes down because all of the in-flight workers get caught up on threads. Any of the backends get slow, boom, Parse goes down. And I'm like the infrastructure lead, right?

Starting point is 00:05:30 And as a reliability engineer, this is just professionally humiliating for me because it's constantly going down. And I tried everything, you know, every tool out there. And the first glimmer of light that we got was when we got to Facebook. And I was really cynical about the, you know, because all of the tools had been built in-house for in-house workloads. You have the luxury of forcing people to use your stuff, you know, and I'm like, this is not built for us. But we started feeding some data sets into a tool there called Scuba.

Starting point is 00:05:59 And suddenly we're able to slice and dice in real time on high-cardinality dimensions. We were able to, instead of just going, oh, God, because like, Parse goes down, we're looking through the logs. And, like, it might be full of a particular request, but that doesn't necessarily mean that that's the app that caused it to go down. There might just be lots of backing up, you know? It might have a few very slow requests that are taking, like, you just don't know. And having the ability to break down and slice by things like app ID, user ID, raw query, normalized query. The amount of time it took for us to just pinpoint the cause and do something about it dropped like a rock, like from hours. Sometimes it would just recover and

Starting point is 00:06:39 we'd never actually know what happened, to like seconds. Like it wasn't even an engineering problem anymore, it was like, it was like a support problem, right? And this made, this, this, like, this was life-changing, right? And so when, when Christine, my co-founder, so I'm coming at this from, like, an operational lens. And Christine, my co-founder, had written the Parse analytics product, and she had built it on top of Cassandra. And she was constantly being, like, professionally humiliated by the fact that with this product she had written for our users to understand their applications, you had to pre-define in advance how to capture the data and what questions you'd be able to ask.
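To make the "slice and dice" idea concrete, here is a minimal Python sketch. It is not how Scuba or Honeycomb is implemented; it only assumes that each request is kept as one structured event, so the grouping field can be chosen at query time rather than defined in advance. The field names (`app_id`, `user_id`, `duration_ms`) and the `breakdown` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical request events; the field names are illustrative assumptions.
events = [
    {"app_id": "app-1", "user_id": "u-17", "duration_ms": 12},
    {"app_id": "app-2", "user_id": "u-42", "duration_ms": 950},
    {"app_id": "app-2", "user_id": "u-42", "duration_ms": 1200},
    {"app_id": "app-3", "user_id": "u-99", "duration_ms": 15},
]


def breakdown(events, field):
    """Group raw events by any field and total the time spent per value."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for event in events:
        bucket = totals[event[field]]
        bucket["count"] += 1
        bucket["total_ms"] += event["duration_ms"]
    # Sort so the value consuming the most time comes first.
    return sorted(totals.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)


# The same raw data answers two different questions, chosen at query time.
by_app = breakdown(events, "app_id")
by_user = breakdown(events, "user_id")
```

Because nothing is pre-aggregated, asking a new question only means passing a different field name, which is the contrast with having to define the questions up front.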

Starting point is 00:07:15 And they'd be like, but I need to ask this new question. And she'd be like, oh, how embarrassing. She'd go and look it up by hand in Scuba and reply to them. And she'd just be like, ah. So we had both had this experience from very different sides of the stack. And we were just like, so when we left Facebook,

Starting point is 00:07:34 looking back on how ill-prepared we were to start. I had never heard the words product-market fit, right? Deep in infrastructure. We didn't know that categories existed or Gartner existed. So, like, we were not setting out to build a category. We were setting out to try and explain this, this, like, life-changing experience we had. Like, we knew it wasn't monitoring. We knew it wasn't logging, you know, and which is why it took so much longer and was so

Starting point is 00:08:02 much harder than it needed to be. And then just to get a sense of, so what was Scuba at Facebook, or how would you describe it? Because it sounds like it's a mix of like just listening to, like, events, but also monitoring and querying. Scuba is this weird beast. It's kind of a columnar store, but it was in memory. They wrote a white paper about it. One of my favorite things was that replication was, it's a C++ binary and it shelled out to rsync to do replication. It was quick and dirty.

Starting point is 00:08:33 And the evolution of Scuba is actually an interesting one. They developed it about a decade earlier when they were trying to get a handle on their MySQL issues. Like there was a time when, like, PHP and MySQL, you know, were just crashing all the time. Facebook was being professionally humiliated, right? Which is why I think it's so interesting. Like the genesis of Scuba and thus Honeycomb is, like, not connected to the entire 20-, 30-year history of telemetry and monitoring and three pillars. Like it came out of complete left field. It's much closer to, like, the history of business analytics, I think.

Starting point is 00:09:11 Mm-hmm. So that's where it started. Yeah. And now, jumping a little bit more back to basics, what does, you've been in the observability field for a long time now, like we can say now it's called that, but what does observability mean to you from a software engineer's point of view? How would you define it? Yeah.

Starting point is 00:09:31 You know, there are so many definitions floating around. I'm probably responsible for more than my fair share of them. But really, it's about understanding your software. It's about understanding the intersection of your code, your systems, and your users, right? That's all it is. And I feel like for a long time, the observability field was obsessed with errors and bugs and outages and nines and crashes and all these things.

Starting point is 00:10:01 And I feel like one of the directions that we're starting to pull the industry in is that it's not just about problems. Understanding, the need to understand impact of what you're putting out into the world is so much bigger and more interesting than that, which is part of why I feel like, you know, it's not just an operational tool, right? This is the, this is the tool that underpins your development feedback loops. Yeah, and I guess maybe not just development, but at some level product as well, right? Like how your stuff works. Exactly.

Starting point is 00:10:33 And ultimately, like, Christine and I just did this exercise to refresh our mission and vision. And the phrase that we landed on was, we're here to help engineers explain and understand our software in the language of the business. I feel like, I feel like there's this void. If you look at all of the C-level roles out there, the only one that has no template or definition is CTO. It's all over the fucking map. Right. And if you look at, like, VPs and executive teams, engineering VPs have traditionally been kind of like the junior varsity league, right? They're like not really in the key, the core group, right? Why is that? It's because for so long, engineers have been kind of

Starting point is 00:11:16 the artists of the company. You just trust us, right? We can't really justify it. You know, we have really, because what is it about executive teams? Like the point of having an executive team is that you're each other's first team and you co-own the most fundamental decisions about where companies invest their resources, right? Which means you have to, you have to co-understand them, right? Everyone needs to understand enough about marketing to be like, in the upcoming year, our priorities are to move the needle in these ways, so we're going to allocate these. You know, for engineering, it's just like, we need 20% more people. Why? Well, we just do. You know, like, that's about as deep as it

Starting point is 00:11:58 goes. And so I feel like over the next five years or so, really helping engineers get that, engineering, product, even design, get that first-class seat at the table. It all comes down to being able to explain and understand our work in the same language as everyone else, which is money. So you're saying that, let's say, the CFO, so the finance team, the CMO, the marketing, then the COO, the operations team, they can all explain, here's what we do. Here's how that translates to the business. Here's why we are important. Here's the things that we're moving for the business. If I get 20% more people, I can do this for the business, right?

Starting point is 00:12:37 And then you're saying that engineering at a lot of startups, if I understand, doesn't really have that ability, and observability, or understanding, you know, how our stuff connects to the business, gets us there. It's a translation layer. Yeah. Being able to understand and reason about the work that you're doing and tie it back to top-level goals, you know, I don't think this is just a startup problem either. Like I talk to big company execs and, you know, nobody wants to say this in public. But there's a lot of, it's hard. It's really freaking hard. Yeah.

Starting point is 00:13:07 So going back to observability as a concept, it is an industry and you've written a large book about it. Can we go through some kind of common terms where if I'm a software engineer, I, you know, I heard about observability. What are some things that I should probably know about? And you kind of mentioned just before you mentioned high cardinality, normalized query, but there's stuff like traces, instrumentation, sampling. What are some of the basics that you think, look, if you're an engineer, figure these

Starting point is 00:13:36 things out, like look it up, read a book, talk to people, and then you'll be able to start with. Yeah. I think it's important to know the difference between metrics. Like, there's small-m metrics and big-M metrics, right? We use, we use the term metrics a lot. So you say small-m metrics. Small m. Like the generic term for telemetry, where we're just like, oh, the metrics are blah, blah, blah. And then there's the metric, which is a number with some tags appended, right? And I think this causes a lot of confusion with folks.

Starting point is 00:14:06 The metric is a very small, fast, efficient data type, but it's supremely limited because it doesn't store any contextual relationship data, right? I think one of the big shifts in the industry right now is away from the sort of three pillars model, where every request that enters your system, you store it in a bunch of different places, to a model where you have unified storage. So I think understanding the difference between the metric and structured data is pretty key. I think sampling is emerging as a really important lever. And it scares a lot of people, in large part because big logging vendors have put decades into telling people,

Starting point is 00:14:48 every log is sacred, don't drop a log line, you know, which always reminds me of the sort of Monty Python "Every Sperm Is Sacred" sketch. And it's like, eh, eh, anyway. Yeah, so I think sampling is important. I think understanding the data structures is important. You mentioned something that I think is also kind of a given in the industry, the observability industry, or for people who do it: the three pillars model. What is that?

Starting point is 00:15:14 Yeah. So the famous phrase goes, I think it was coined by Peter Bourgon back in 2017, that observability has three pillars: metrics, logs, and traces. And a lot of vendors glommed onto this because, cynically speaking, they have a metrics product to sell, a logging product to sell, and a tracing product to sell. But it's actually kind of worse than that, right? Like, every request that enters your system, historically people have stored it in, sure, like metrics storage, in dashboards, and structured logs, and unstructured logs, in a tracing tool, in a profiling tool, in an

Starting point is 00:15:47 analytics tool, and, you know, again and again and again and again. And what connects them? You, the engineer sitting in the middle going, well, that shape looks like that shape, so they're probably the same thing. Or maybe copy-pasting IDs. You know, and some of the bigger, fancier tools like Datadog have built sort of bridges you can predefine, like this metric here ties into that log line over there.

Starting point is 00:16:11 But again, you're in the situation where you're having to define in advance what information is going to be important and where you're going to need to connect it. And the cost multiplier is obscene. Right? For some people, every request that enters their system, they're storing it in 15 different tools for 15 different use cases. And the more of them you get, the more expensive it gets, the more strong it gets, and the harder it gets to correlate everything. So, like, the 1.0 to 2.0 shift. Like, data is data. There are a lot of different ways to, like, climb the mountain.
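As a rough illustration of the alternative to storing each request in many tools, the Python sketch below keeps one structured record per unit of work and derives two different views from it on read: a metric (an error rate) and a trace (spans grouped under their parent). It is a simplified sketch under stated assumptions, not any vendor's storage model; the field names (`trace_id`, `span`, `parent`, `status`) and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical wide events: one record per unit of work, with trace context.
events = [
    {"trace_id": "t1", "span": "api", "parent": None, "duration_ms": 120, "status": 200},
    {"trace_id": "t1", "span": "db", "parent": "api", "duration_ms": 90, "status": 200},
    {"trace_id": "t2", "span": "api", "parent": None, "duration_ms": 40, "status": 500},
]


def error_rate(events):
    """A metric derived on read: share of root spans that ended in a 5xx."""
    roots = [e for e in events if e["parent"] is None]
    errors = [e for e in roots if e["status"] >= 500]
    return len(errors) / len(roots)


def trace(events, trace_id):
    """A trace view derived from the same records: span names keyed by parent."""
    children = defaultdict(list)
    for e in events:
        if e["trace_id"] == trace_id:
            children[e["parent"]].append(e["span"])
    return dict(children)


rate = error_rate(events)  # computed from the events, not stored separately
t1 = trace(events, "t1")   # same records, viewed as a parent/child tree
```

Both views read the same three records, so there is one copy to pay for and nothing to correlate by hand between tools.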

Starting point is 00:16:41 But, like, fundamentally, it's about moving from many sources of truth to unified storage. Okay. So just so I get that right, observability 1.0 is this, like, three pillars: the metrics, logs, and traces, which, you know, it started like that. And apparently it is really good to sell a bunch of services, and you can sell expensive services with it as well. So what is 2.0? Because you mentioned unified storage, but can you expand on, you know, why that, it just sounds a bit too simple. I mean, you know, it sounds like people could have come up with this, like, earlier, right? Well, they have. Like, they've had nice things on the business side for decades. They've been extremely expensive, but we're very much, like, the cobbler's children have no shoes on the software side. It's just like, you know, well, we can make it work with our duct tape and baling wire. Like, I remember, Vertica came out like 15, 20 years ago and, you know, the business, anyway, a lot of folks are talking about unified observability, but in a lot of situations, when you look under the hood, what they're talking about is either a unified bill or at best a unified

Starting point is 00:17:44 visualization. And the beautiful thing about having unified storage is you have no dead ends, right? You click on a log. You can turn it into a trace, right? You can visualize it over time. You can derive your metrics from it. You can derive your SLOs from it. You can take your SLO data, click on it, jump into it, see exactly which events are, you know, violating your SLO and why and what's different about them. But you're right, there is a lot more to it. I just, I try to emphasize that because when I started writing about this I started to notice all these other like Observability 2.0 articles popping

Starting point is 00:18:19 up all over the place where people are like, oh yeah, we do that too, because blah blah blah blah blah, and it's like, all right. Some of the other things that I think are associated with 2.0, and by the way, I'm not trying to police this and be like, only Honeycomb does this. I don't think that at all. In fact, one of the most exciting things about this last year to me

Starting point is 00:18:34 was that I feel like the batch of baby observability startups that we're seeing are no longer looking like cheaper Datadogs, they're looking more like cheaper Honeycombs, right? They're built on ClickHouse. They're built on columnar stores. They're OTel-native.

Starting point is 00:18:50 They're using wide structured events organized around units of work. And I am so ecstatic. Like, this is a better world for the industry, and I'm excited. But some other, like, things that I think have in common with this are, I think it really does parallel the shift from observability being an ops tool, you know, organized around errors and downtime and outages and crashes, and towards being something that underpins the whole development cycle. It's what underpins your feedback loops and allows you to, you know, one of the most exciting

Starting point is 00:19:24 use cases, I think, is in CI/CD, like being able to visualize it as a trace, see where your tests are breaking, where your time is going. Because keeping that time between when you're building it and when it's in production and you're looking at it, keeping that as small as possible is like, it's like the most fundamental part of building great software and great teams. So when you say it underpins development, like, are you saying that, you know, these new tools, either that exist today or that will exist, you're kind of envisioning it as, I am coding my stuff. And as I push to just development, to CI/CD and, you know, there's some, there's a crash. Something went wrong. I should be able to look at either, like, you know, if it's the build crash, I should be able to look

Starting point is 00:20:06 at it there. If the app crashes, I can just look at this, you know, like magical dashboard, where, you know, in the past I had to look at the logs or, if I had a trace, look at the trace. So you're saying that I can just use this to, like, actually develop faster, like me as a dev? Basically, yeah. I mean, it can be hard to generalize because there are so many different ways people do this and everything. I think that having observability in the pre-production environment is good. But I really think so much of it is about accelerating time to value, accelerating time to insights, accelerating, getting it into production.

Starting point is 00:20:43 Like some of the most exciting things that I think we're doing right now are about getting things. I mean, I usually have my "test in prod or live a lie" shirt on; today I have my database shirt on. But like getting things into production immediately, right? And safely, like using progressive deployment, using feature flags, you know. And the combination of observability, the 2.0 framework, plus things like progressive deployments, canaries, feature flags, it's like it's greater than the sum of its parts. Because when you can slice and dice and break down, like say you get your code out within 15, 20 minutes, right?

Starting point is 00:21:20 It's so fresh in your head. You know what you did, why you did it, how you did it. And you're looking. But it's not like you're blasting it out to everyone immediately. Of course. Of course not. You ship it, you know, behind a feature flag. You ship it to a canary.

Starting point is 00:21:33 You ship it to 10% and then you flip it on, you send some test requests. You know, you do it very controlled. It's like using a scalpel, right? You have this precision tooling that gives you confidence in moving fast. So am I hearing correctly that what you're saying is, when you have, like, these modern and helpful engineering practices, things like feature flags, things like being able to test in production because you have, like, some separation of tenancies, for example, all of those things, if you have those things, and then you kind of invest in observability,

Starting point is 00:22:08 you know, trying to have it better. Like that is just a much bigger win than let's say just, I don't know, let's get into this whatever observability 2.0 is. You know, let's modernize it. But by itself, it's not going to be as big of a win, right? Correct. Correct. I use this metaphor a lot.

Starting point is 00:22:22 I'm super blind. So it's like putting on your glasses before you go barreling down the freeway, right? When you're driving, you don't want to be just, like, veering. It shouldn't feel like you're always course correcting or driving it. It should feel like you're just driving, right? And when you're building, you should have these feedback loops that are so intuitive and so fast and so integrated that it feels like you're just building, right? Like that's the dream.

Starting point is 00:22:48 Yeah. Now, one other question I have related to, we talk about developers and observability, but how do you think o11y, observability, the short form, how does it relate to SRE, DevOps, and other roles? Because I feel that there are some companies where they're saying, oh, observability, that is owned by our SRE team or our DevOps team. What's your view about this? I think that, you know, I think, obviously, any vendor relationship, any product

Starting point is 00:23:31 has to have an owner. I think that's not necessarily a terrible idea. I think that a lot of the center of gravity is moving to platform teams in a lot of places because it's like platform, their remit is really managing that, that thin line, the fuzzy line between our code and infrastructure code, right? It's permeable. There's always, you know, things that sort of somebody's got to own that line. But the other thing I like about the platform engineering model is that your customers are internal, right? It's a product-focused development organization. Your customers are internal.

Starting point is 00:24:04 And I like that model because some of the historical flaws of, you know, the sort of SRE DevOps, whatever models have been that, oh, they own it. They own monitoring. And the platform model is so explicitly, you own your code and we're here to help you with that. And that is, I think, such a critical change. I could not agree more. It's interesting how it's a platform between suddenly it's not they, but they kind of either reflect and it's you know that you have skin in the game and you cannot expect them to, for example,

Starting point is 00:24:37 go on call for your platform. You just don't do that, right? They will build the tooling for you, but. And to be clear, like, you know, as someone, ops is deep in my DNA, and it's not like SREs are going anywhere. Like decisions are only getting harder and more complex, and there's an area for expertise, and like, you know, the consultative model is, I think, a great one. DevOps, if I might go on a little side rant here, I do feel like, you know,

Starting point is 00:25:02 is there any term in computing that's been more contested than the word DevOps? I'm not sure. But like, whereas I feel like the DevOps philosophy of, you know, being very, you know, you work together, empathy, you know, collaboration is eternal and not restricted to software, I do feel like we're sort of in the waning days of the DevOps movement, because it's no longer considered a good thing to do to spin up a dev team and an ops team to then collaborate, right? Increasingly, there are only engineers who write code and own their code in production. And I think this is really exciting.

Starting point is 00:25:43 I think it's, you know, we can understand why dev versus ops evolved, but it was always kind of a crazy idea that half your engineers could build the software and the other half would understand and operate it. Like, that's just not a great way to break down. It doesn't lead to excellence in either domain. Yeah, I feel it's a little bit similar to, like, waterfall, where people still talk about waterfall, but there is no waterfall. Waterfall used to be literally three- or four-year-long projects, but they don't exist. And so when people talk about a waterfall, they talk about a two-month project. That is not waterfall. If it's one or two months, that's, you can call it whatever you want.

Starting point is 00:26:21 And so, which is a good thing, right? It's a good thing that we don't have waterfall projects anymore. Even, like, government projects no longer take, I mean, most of them, thankfully, don't take that long. Yeah. Yeah. And to be clear, like, I get just as mad as anyone else when people are like, DevOps is dead.

Starting point is 00:26:35 Like, you. No, it's not. Like, that's the wrong, that's the wrong takeaway. It's like any good movement eventually fulfills its purpose, right? And I feel like the movement of DevOps is in the fulfillment stage. I agree with that. So I have a question I've been wanting to ask you. Why do you think observability is so darn hard?

Starting point is 00:27:02 Every single developer has their story of why observability is hard, if they actually did it wherever they sit. First of all, what was your kind of first time where you just realized, like, this is just really hard even though it shouldn't be? You know, if I can be honest, I actually always hated observability and monitoring. Story, like, I would do anything to weasel out of being the one to own it, like, including three companies in a row.

Starting point is 00:27:35 I hired my friend Ben Hartshorne, who loves this stuff. I'd be like, Ben, please come work with me because I hate this shit. And he would come in and he would build all the graphs and I would bookmark them. Like, I have always hated this. Oh, my gosh. Look where that got you. I know, right? Anyway, you know,

Starting point is 00:27:52 It's hard. It's hard because software is hard, right? And like, observability is like the first line of defense. And it's like, not only do you have to build the software, but you also have to have, like, this sort of meta thread that's watching what you're doing and going, what is future me going to need to know? Or what is future me going to need to understand at 2 a.m.? And that's just a muscle that takes a long time to build, right? And I also think that, like, historically we've had a lot of tools that were so, like, some of my, some of my gripes about, you know, the past generation of tooling is just that, like, you really expected people to master at least two disciplines. Like, so many of the tools for so long required you to convert the code that you were writing into, like, physical resources. Like, what is this doing to my CPU and my RAM? Like, that's just like, okay, this is a little too much to ask of your average software engineer, you know? Okay, I'll ask you an easier question. One of the biggest frustrations, I think, you know, like non-technical people like a CEO or CFO have is, they look at the bill of an organization. Number one is going to be the cloud cost, or infrastructure, depending on if you're self-hosting.

Starting point is 00:29:04 Number two is always, almost always, observability. And it just feels like people feel that it gets out of hand. And, you know, when you ask engineers, obviously engineers are told, you know, optimize the bill or make it lower. Why do you think the costs just get out of hand in so many cases? I think if they don't get out of hand, that is the exception. What makes it so darn expensive? I mean, what makes it so darn expensive is the complexity of our systems and our high bar. Like, the easiest way to slash your observability bill is to give fewer shits about your customer experience.

Starting point is 00:29:41 You know? And to be clear, like, there's a wide range. Like, there are companies who, every single request, delivery companies, banks, right? You don't really have a choice. You have to understand every single request. So there's a certain built-in. And then there are, like, advertising companies, which are more spray and pray, you know,

Starting point is 00:29:57 and you can get by in, like, buckets. And so I think that's a legit angle to look at. The second one that I would call out is the thing I mentioned before, which is just the multiplier effect. How many times are you storing a record of this data for every request that enters your system? I think that, you know, at Honeycomb, we've been building this for nine years as of January 1st.

Starting point is 00:30:21 Wow. I know. We were so small for several years, and in the last couple of years, things have just taken off. But a big part of the driver from 1.0 to 2.0 is that suddenly money's not free anymore, and people are taking a look at the multiplier effect, and it's just unsustainable. No matter how tightly you're trying to keep control of costs, it's unsustainable if you're storing 15 different copies of every single request. The third one that I'll point out is the one that personally bothers me the most, which is cardinality. If you look at actual observability engineering teams using your traditional three pillars model, they usually, and this is true, spend an outright majority of their time trying to govern cardinality. You know, you can go to bed Friday night, have a $200,000 a month Datadog bill, make no code changes over the weekend, and wake up Monday with a $2 million a month Datadog bill

Starting point is 00:31:16 just because the cardinality changed out from under you. Can we pause here? I don't think every software engineer will actually understand what cardinality is, especially if they've not worked with observability. Can we break it down? Like, what does it mean? Because it's wicked important. I know, but the first time you meet it, it's not an obvious concept,

Starting point is 00:31:40 is it? Yeah. No, not at all. Cardinality refers to, I think the mathematicians call it, the number of unique items in a set. And basically it means, you know, if you've got a collection of 100 million users, any unique ID is going to be the highest possible cardinality. So like request ID or, in America, social security numbers, right? If you have 100 million users, you have 100 million values, right?

Starting point is 00:32:05 And then the lowest possible cardinality would be a field with just one value, like species equals human, right? Yeah. Now, the point of big-M metrics tools is they are built to handle low cardinality data, full stop. And there's this very traditional experience that everyone who uses these tools has, which is they start using it, they're happy with it, they append something like hostname, right? And then they get to have more than a hundred hosts and suddenly it all breaks or it becomes absolutely unaffordable, and they're just like, what the hell? And just to explain the behind the scenes of why it gets so expensive: it's because you have to

Starting point is 00:32:48 store, there's no relational data, right, in time series databases. Every unique combination of name and value, you have to store again and again. So, like, the term custom metrics, I always thought that that meant, oh, a custom metric that you've defined in your code. It's a line of code. No, it refers to unique combinations of metric values. So basically every single unique combination will take up more space. And this is, you know, as you said, a rookie mistake I've seen when people use one of these many products. You just literally, you're like,

Starting point is 00:33:29 okay, I don't know, I want to query for which city or country this event happened in. And then they add IP address. And IP address is unique to everyone. And suddenly it just adds so much to your bill. Yeah, it can, like, 100x overnight. And so world-class observability engineering teams end up spending an outright majority of their time just trying to govern cardinality. Because the irony is that it is the most valuable data, right? The more unique the data is, the more identifying it is, which means the easier it is to debug your systems and understand what's happening.
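
To make the cardinality arithmetic above concrete, here is a minimal Python sketch. It assumes the simplest possible model, in which a metrics backend stores one time series per unique combination of tag values, so the worst-case series count for a metric is the product of the number of distinct values of each tag. The tag names and counts are made up for illustration, and real vendors bill in their own ways.

```python
def series_count(distinct_values_per_tag: dict) -> int:
    """Worst-case number of time series for one metric:
    the product of the distinct value counts of every tag."""
    total = 1
    for count in distinct_values_per_tag.values():
        total *= count
    return total


# Hypothetical tags on a single request-latency metric.
base = {"endpoint": 20, "status_code": 5, "region": 4}
with_host = {**base, "host": 100}             # add a tag with 100 values
with_ip = {**base, "client_ip": 1_000_000}    # add a near-unique tag

print(series_count(base))       # 400
print(series_count(with_host))  # 40000
print(series_count(with_ip))    # 400000000
```

Nothing in the code changed between the three cases except one extra tag, which is the "no code changes over the weekend" effect: the bill tracks how many distinct values show up, not how many lines of instrumentation exist.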

Starting point is 00:34:05 And it's not just storing the data. It's also being able to use it to slice and dice and break down and group by and explore. And you just can't do this with stuff that's built on the metric data type. It's actually impossible. And this is what we talk about as observability 1.0. So how does it change, right? This feels a little bit to me like, if I had to compare, COBOL, the programming language from the 60s, where you actually had to worry about where your program,

Starting point is 00:34:36 like you need to tell where it goes on a tape, goes back and, you know, you structured your code accordingly. And it kind of feels to me that if you're using these observability products, when you're thinking, I want to log stuff, you need to be thinking about how expensive this will be. If I add this field, which is like an IP or a website, oh, I shouldn't add that. I should transform it. Which, I mean, in 2024, that sounds a bit silly. It does. Because you're kind of optimizing machine time, even though machine time is cheap, but I guess storage is expensive. And so what's the solution? The solution is we have to move away from tools that are backed by big-M metrics.

Starting point is 00:35:16 We have to move towards tools that use structured data, where you can have high cardinality, where you can store lots of context. I've often said the bridge from 1.0 to 2.0 is logs, right? It's emitting fewer logs, but wider logs. The wider your log, the more context you're attaching to each event. And the more context you have, the better your ability to identify outliers and correlate things. Like, we have this thing called BubbleUp, which is just, on any graph you have, you can be like, what's that?

Starting point is 00:35:54 Draw a little bubble around it. And we'll compute it for all the dimensions inside the bubble versus the baseline outside the bubble. So you're like, what's this little spike? And you're like, oh, I see. All of these are for requests coming from Android devices, from this region, going to the secondary, with the batch size of blah, blah, blah, and it's taking this long. And so much debugging boils down to, here's the thing I care about. Why?
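
The inside-the-bubble versus baseline comparison can be sketched in a few lines of Python. This is not Honeycomb's BubbleUp implementation, which the conversation does not describe; it is a simplified illustration that assumes events are wide dictionaries and that "different" just means a value is more common in the selected events than in the rest. The function name and the sample fields are hypothetical.

```python
from collections import Counter


def compare_dimensions(selected, baseline):
    """For every field/value seen in `selected`, compare its share of the
    selected events with its share of the baseline events.
    Returns (field, value, share_selected, share_baseline) tuples,
    most over-represented first."""
    results = []
    fields = sorted({key for event in selected for key in event})
    for field in fields:
        in_selected = Counter(event.get(field) for event in selected)
        in_baseline = Counter(event.get(field) for event in baseline)
        for value, count in in_selected.items():
            share_sel = count / len(selected)
            share_base = in_baseline.get(value, 0) / len(baseline) if baseline else 0.0
            results.append((field, value, share_sel, share_base))
    results.sort(key=lambda row: row[2] - row[3], reverse=True)
    return results


# Hypothetical wide events: the "spike" versus everything else.
spike = [
    {"platform": "android", "region": "eu"},
    {"platform": "android", "region": "eu"},
    {"platform": "android", "region": "us"},
]
rest = [
    {"platform": "ios", "region": "eu"},
    {"platform": "android", "region": "us"},
    {"platform": "ios", "region": "us"},
    {"platform": "web", "region": "eu"},
]

top = compare_dimensions(spike, rest)[0]
print(top[0], top[1])  # platform android
```

The comparison only works because each event carries many fields at once. With pre-aggregated metrics, the per-request context needed to ask "what do these have in common?" has already been thrown away.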

Starting point is 00:36:16 What are all the ways it's different? And gathering up your data in this way, you asked why it's so expensive, and it's because the model does not fit the needs that we have for this data. And do I understand correctly? Because, you know, software engineering is always trade-offs. I think it's easy enough to understand that observability 1.0 optimizes for this: when it stores the data, you can immediately query it and you will get it almost immediately. You don't need too much computation.

Starting point is 00:36:44 When you're doing the opposite of, okay, I can just store whatever, it's not going to expand my storage costs. I'm assuming there's a tradeoff with, let's say, compute, right? Like you're going to compute it later or you'll have post-processing or something, right? Like, you don't get anything for free. No? You're shaking your head. So you're right. There are always trade-offs, but it might not be exactly the trade-offs that you think.

Starting point is 00:37:03 So this is made possible by the falling costs of storage and compute and all these things. Absolutely. Metrics were optimized for a world where all of your resources were so expensive, right? You're just like, I can't afford to store this nginx log. I'm going to derive some metrics and store that, right? Yeah, no, you should be able to slice and dice in real time, to interact. Another aspect of this is I don't think you can really move to a 2.0 world

Starting point is 00:37:34 without taking your wide structured data and feeding it into a columnar store. Because if you're using a traditional relational database, you have to define in advance the indexes, the schemas, you know, all these things. You want it to be helpful. You want to be able to just drop in, oh, this might be useful someday. Drop it in and immediately.

Starting point is 00:37:50 Well, yeah, for logging, you shouldn't have to like, it feels a royal pain to do it. It feels a little bit like, you know, like typically what happens when you don't have good logging or good practices you have a bug, something crashed with the customer, you realize you have zero logs, you have no way to look at it. So what do you do? You ship something to production, you know, back in our mobile app, you add logs. And then you tell the customer, we now have logs.

Starting point is 00:38:13 Can you please update the app and try it again? And it's like, I mean, it's doable, but it's pretty darn embarrassing, right? Yeah, the goal is to be capturing enough rich telemetry all the time that you can go back and you can just be like, oh, what did they do? Right. Yeah. So a common worry about any observability these days, I mean, you could build your own observability, but unless you're Facebook or Google, you probably shouldn't do it, and they do it already. But there's this worry about vendor lock-in. Thinking, okay, whatever I choose, I might be locked in, or should I try to choose a vendor that doesn't lock me in. Now, you work for a vendor, so you're on one side and you're going to be biased.

Starting point is 00:38:56 You often speak truth to power. How big of a deal do you think vendor lock-in is? Can you avoid it? Should you even want to avoid it? Is it even possible? This is a rare spot of good news in the landscape. Historically, this has been a huge problem. OpenTelemetry is changing everything. The goal with OpenTelemetry has always been that you instrument your code with OTel and you can basically take your fire hose and point it to whatever vendor you want. There's a little bit more to it, but it is 90, 95% true. This is a game changer. Forcing vendors to compete for your business based on being excellent and responsive and a good value, instead of keeping you locked in their ecosystem. And honestly, this is the first year where I've really seen this start to come true. And it's really exciting. It's interesting because my next note was exactly this. I've made a note,

Starting point is 00:39:51 OpenTelemetry. I've heard about it. Again, don't forget, I'm a bit of an outsider to observability, right? Like, I know software, but not the details of this. What is OpenTelemetry? You said how great it is, but what is it and why should we care about it? Yeah, I mean, it's the inheritor to, you know, Google had, like, OpenCensus, and OpenTracing, and both kind of flopped. Ben Sigelman and the folks at Lightstep actually spec'd this out. And get this, this is also the year where it overtook Kubernetes as the number one CNCF project. OpenTelemetry is now the top project in terms of commits and committers.

Starting point is 00:40:31 Yeah, it's huge. It's amazing. And, you know, it's a few years old now. I, it gets, it gets critiqued a bit for being kind of big and bloated. It does the job it needs to do. I think what most people need to understand is this. It does a lot of jobs. It does them well.

Starting point is 00:40:51 And increasingly, you don't need to understand everything about it to get the value. Wow. Because for a long time it was kind of funky and you had to really invest in understanding it. Increasingly, it's getting to the point where it just accelerates. A lot of the value is that you get your data, even if you don't think about OTel at all. If you get your data into an OTel-enabled pipeline, it gets consistent meaning, consistent structure. You know, there's semantic conventions and stuff. And what this means is then when it gets to the server side, your vendors can do amazing

Starting point is 00:41:25 shit with it. Like, we can intuit, we can derive, we can compute, we know what the data is, right? So we can do a lot of really exciting things with it. And I think you're going to see a lot more of that in the next couple of years. And so just to get a sense of what exactly it is, I'm on the website and it says that OpenTelemetry is a collection of APIs, SDKs and tools. It's built to make telemetry portable and effective. And it says that you can use it to instrument, generate, collect, and export telemetry data. So do I understand correctly that a lot of languages or frameworks have, I don't know, APIs that you can use, and then you can kind of plug in vendors underneath it? Or how do the vendors come into the picture here?

Starting point is 00:42:05 You know, some vendors support their own collectors. A lot of them are contributing back to the core project. But basically the idea is, however you do your instrumentation, it's trying to provide a framework for consistency, right? It's a little bit too big to just generalize. Like, if you're asking, does OTel do this? The answer is probably yes. But basically it's just getting people's telemetry into a consistent format, consistent naming and, you know, semantic conventions, which means that we can do a lot of great stuff with it.
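
The "instrument once, point the fire hose wherever you want" idea can be sketched in plain Python. To be clear about the assumptions: this does not use the OpenTelemetry SDK or its API. `Telemetry`, `list_exporter`, and `handle_checkout` are hypothetical names invented for this illustration, and the attribute keys are only styled after OpenTelemetry's semantic conventions. The point is the shape: application code emits consistently named events, and the backend is swappable configuration.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Exporter = Callable[[Event], None]


class Telemetry:
    """Hypothetical facade: application code only ever calls emit()."""

    def __init__(self, exporters: List[Exporter]) -> None:
        self.exporters = exporters

    def emit(self, name: str, attributes: Event) -> None:
        # One consistently shaped event, handed to every configured backend.
        event = {"name": name, **attributes}
        for export in self.exporters:
            export(event)


def list_exporter(sink: List[Event]) -> Exporter:
    """Stand-in for a vendor backend: appends events to an in-memory list."""
    return sink.append


def handle_checkout(telemetry: Telemetry, user_id: str) -> None:
    # Instrumented once; it does not know which vendor receives the data.
    telemetry.emit(
        "checkout",
        {
            "http.request.method": "POST",
            "http.response.status_code": 200,
            "user.id": user_id,
        },
    )


vendor_a: List[Event] = []
vendor_b: List[Event] = []

# Switching vendors changes only the exporter list, not handle_checkout.
handle_checkout(Telemetry([list_exporter(vendor_a)]), "u-123")
handle_checkout(Telemetry([list_exporter(vendor_b)]), "u-123")

print(vendor_a == vendor_b)  # True
```

In the real project the equivalent roles are played by the language SDKs and exporters (or the Collector), and the shared naming comes from the published semantic conventions rather than from ad hoc strings like the ones above.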

Starting point is 00:42:36 Okay. So is it safe to say that if I'm working now at, like, a mid-sized company or on a project that I know will need observability or already needs it, it's kind of a safe bet to look at OpenTelemetry and see if I can make at least parts of my pipeline adhere to it? Because, A, this will hopefully make it a bit better. And then B, if the question of portability comes up, it'll be way easier to move vendors if it ever comes up, right?

Starting point is 00:43:01 Because we know that a lot of companies will never move, but I think, you know, like knowing that you could move. Well, it also helps. I mean, I'm kind of talking a little bit against like possibly your business. But as any business, the best negotiation is saying we could move if we wanted to. So let's talk about do you want to give us, should we do a longer term commitment? Do you have new features that you ship that are unique and your competitors don't have it and we need it? Like, you know, that's, and I kind of have the same thing.

Starting point is 00:43:30 It's forced vendors to compete on the territory that they should be competing on. What are some common things that you've seen engineering teams get wrong about putting observability in place? Some of the most common ones. Oh, boy. Feeling like they don't need to start, they don't need to have any until it's in production and things start breaking. Like, you really want to be developing with it. You want to get in a habit of understanding your software. Like what?

Starting point is 00:43:57 You're going to attach, like, GDB to it? Like, no, you want to debug your software the way you're going to debug it in production. Shift left, shift right, whatever the fuck you want to call it. You want to do that early. Other areas, you know, I think a lot of folks get really attached to their dashboards, and unless your dashboard is dynamic and allows you to ask questions, I feel like it's a really poor view into your software. You want to be like, and then what, and then what, and then what? You want to be interacting with your data.

Starting point is 00:44:38 If all you're doing is looking at static dashboards, I think it really limits your ability to develop a rich mental model of your software. And it means that there are things you don't think to ask or graph for dashboards, so you don't see them. Yeah, it's interesting because I built so many dashboards. And whenever you walk into an office where there's a team, usually they have a dashboard and it looks cool. And all of the dashboards we've had always looked cool. And I'll be honest, they were kind of useless. I mean, they were good. Like, you know, when someone important walked by, a director or VP, they're like, oh, this is a team.

Starting point is 00:45:15 Here are their stats. And after a while, because we realized that's what most people used it for, we made sure that it showed nice numbers and it was always green. It sounds silly to say it. No, I get it. Like a public dashboard. So then we had private ones where we actually had the real stuff. Relatedly, I think a habit that more and more teams are picking up, which I think is super important, is using SLOs as their entry point instead of using dashboards as their entry point. Oh. So use it as an entry point for what? For understanding, debugging, interacting.

Starting point is 00:45:53 You know, like, SLOs, I think I have this sticker that I made. It's like SLOs are the APIs for engineering teams. And like an SLO is your agreement. You're like, we will provide a level of service that we all agree internally and externally is good. This means we have a budget, right? We can use what's left over in our budget for running chaos engineering experiments, for, you know, one of the things that we did at Honeycomb a few years ago was

Starting point is 00:46:16 We had Kafka nodes that kept just like vanishing on us. And it was really frustrating. And so we took some of our SLO budget and we started killing Kafka machines every day and working on the automated recovery process, right? So that they would, you know, and to this day, every Monday we kill the oldest Kafka node. We just shoot it in the head. So we're always testing the bootstrapping process, right? Got us out of a lot of, you stopped getting pages in the middle of the night because of Kafka nodes.

Starting point is 00:46:46 right? Because we're constantly testing this thing. But if you have SLOs, it's also the greatest hedge against micromanagement, I think. Because if you're meeting your obligations, then how you spend your time as a team is, like, below the fold. Nobody should care about that if you are meeting your obligations. It's also a way for you to negotiate and be like, hey, we're not meeting our obligations. So we have to put a hold on this feature work because we have to do this reliability work, because we have to get ourselves back to a place where, you know, we can deliver on our obligation.
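
The budget idea is simple arithmetic, and a short Python sketch may help. It assumes the most basic kind of SLO, a request success ratio over some window; the target and the request counts are invented for illustration, and the function name is hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (allowed_failures, remaining_failures, fraction_of_budget_left)
    for a success-ratio SLO over one window."""
    allowed_failures = total_requests * (1.0 - slo_target)
    remaining = allowed_failures - failed_requests
    fraction_left = remaining / allowed_failures if allowed_failures > 0 else 0.0
    return allowed_failures, remaining, fraction_left


# Hypothetical month: 99.9% target, one million requests, 400 failures so far.
allowed, remaining, fraction = error_budget(0.999, 1_000_000, 400)
print(round(allowed), round(remaining), round(fraction, 2))  # 1000 600 0.6
```

With 60% of the budget left, a team could argue that there is room for something like the deliberate Kafka node kills described above. A negative remainder is the opposite signal: the basis for pausing feature work in favor of reliability work.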

Starting point is 00:47:17 So there's just so many ways that I think, and an anti-pattern, I think, is when your SLOs are not derived from the same data that you're using to debug. When it's like something that's out there in a satellite, it's not connected, it's like, well, now you have one more problem, right? It's like, you really want it to be like, here's my SLOs. Ooh, I don't understand that, you know, my budget is like going down faster. Like, click on it and see why. I really like you saying that SLOs can be a way to avoid micromanagement, and it should be.

Starting point is 00:47:50 Because, you know, no one wants micromanagement. I think it's kind of fair, right? Like, I think micromanagement is warranted when you're doing a terrible job. Then, you know, the manager or tech lead or director or whoever should come and look at you, but otherwise they should leave you alone. And I think that's kind of fair, right? It's a good way to think of it. I wish more teams would do this, and hopefully they can take this as inspiration.

Starting point is 00:48:11 I think it's picking up steam. I see a lot of people really dealing with SLOs, and I feel like that wasn't true for a long time. So with Honeycomb, it's an observability startup that you're building. What was a major exciting or interesting engineering challenge that you had to solve to actually build this product? I mean, we had to write our own database. Your own database. Really? You're kidding.

Starting point is 00:48:35 Why? Because, you know, when we started in 2016, and to be clear, I spent my entire career telling people never write a database. Don't do it. Just never write a database. If you think you want to write a database, trust me, you don't want to write a database. But Christine and I got started and we were just like, well, shit, you know, ClickHouse wasn't around. And ironically, I guarantee, if ClickHouse was around, we would have used it. And I'm now really grateful that it wasn't and we didn't. Because the data model is so custom to us, right? And being able to iterate on it, you know, add traces, add all these things, has been a real force multiplier for us. I will say it's why we lost our earliest investor when I was CEO. I met with him a year in. I was like, well, we're starting to get some interest, you know, but we're going to need more money. He's like, well, if you were going to succeed, you would have succeeded by now. And I'm not giving you any more money. And I'm

Starting point is 00:49:39 like, well, and he's like, well, you know what? You shouldn't have spent all this time writing your own fucking database. You should have found product market fit first and then written a database. And the thing is, as snotty as he was, he's right. He's absolutely right. That is the common wisdom. Like, 99 times out of 100, that is the right, smart thing to do. We were too dumb to know better. So we accidentally did the right thing. And so how did your database help you? And what's it even called? It's internal to you, right?

Starting point is 00:50:11 You never opened it up. Yeah, yeah, yeah. It's called Retriever. So all of our services are called after dogs. We have Dalmatian and Poodle and, yeah, Retriever and Basset and Hound and all these things. And so, yeah, how has it helped us? I mean, it's, I mean, every people. What kind of database is it?

Starting point is 00:50:31 It's a columnar store. It's been through a few different evolutions. And actually, Sam Stokes gave a great talk at Strange Loop in, I think, 2018 about this, about the internals. People can go and look it up. And then a couple of years later, at the very last Strange Loop, Jessitron gave a talk about how we've evolved it. Because at some point, around 2020 or so, we actually serverlessed our database. We were like, okay, so, initially, when we were using the columnar store,

Starting point is 00:51:01 it was all on, you know, local SSDs on EC2. But the vast majority of data that gets written to disk never gets queried by anyone, right? And it was really expensive. And we're just like, you can't build a business on this, where 99% of data never gets queried and yet we have to pay to store it. And so Ian Wilkes, one of our principal engineers who's been here since the very beginning, he actually moved the query planner to Lambda jobs.

Starting point is 00:51:27 And shortly after the data gets laid down on the SSDs, we actually age it out to S3, and then we do this massive fan-out and merge at query time. All right. That's pretty crazy that you built that. I mean, congrats. Yeah. No, it's, you know, I still tell people to never write a database,

Starting point is 00:51:50 but there's like an asterisk. Once in a while, you really can't. Well, and what I also like about this is that there are startups that succeed because they don't follow the beaten path. In fact, some of the more interesting ones I talk with, we covered some in the news, and I'll link it in the show notes below, but

Starting point is 00:52:10 there's, you know, like Figma, for example, which ignored the wisdom of launching in six months. They took three or four years to build their first version. You know, again, they burned a lot of money, but it was the right thing to do. There was another company, Antithesis, which

Starting point is 00:52:26 built an advanced debugging tool for four years. I just talked to the CEO of Antithesis a few days ago. They also built their own database, but they did something wild, which is they didn't write any tests, and they're having their platform test it. It's wild. But again, he told me the same thing. People say don't write your own thing.

Starting point is 00:52:46 So I think it's kind of a little bit reassuring that when you know where you want to go, I mean, you know, there's a chance that you might run out of money, whatnot. But when you know, just do it. I mean, you know, like take all the advice, but you don't need to control C, control V. to succeed. Yeah. No, 100%. You know, I'm a big fan of the innovation token metaphor. You know, you've got like two or three innovation tokens as a startup, so spend them wisely. And we definitely spent two on our internal storage engine. Of the three that you had in total back in the time. But it pays off. Like every time we go to write a feature, like we're not fighting our storage engine. We can build our storage engine to do what we want to do. And it's actually an incredible force multiplier.

Starting point is 00:53:30 Okay, so let's jump into a pretty interesting topic, which is observability and AI. LLMs are a super hot topic these days, and they're everywhere. How do you think about observability and AI systems? Yeah, it's such a good question. So I feel like there are three places where AI really intersects with observability. Number one is when you're building and training a model. Number two is when you're developing with LLMs. And number three is, the everyone problem of, we're all now dealing with this influx of software of unknown origin.

Starting point is 00:54:09 Like, it used to be you could pretty much guarantee that someone somewhere understood the software at some point. And you can no longer take that for granted. And that is, I think, so funny because it really harkens back to the origin story of Honeycomb at Parse. Like, we had developers all over the world just writing snippets of JavaScript and uploading them. We just had to make it work, right? MongoDB queries, they'd write them and upload them, and we just had to make it work. And so many of the things we sort of forged in fire to understand this unknown software, it's like, oh, now this is a worldwide problem that everyone is having.

Starting point is 00:54:43 So it's a little fun. I wrote a blog post about observability and AI last week. Oh, cool. Yeah, I don't think it's super mind-blowing or anything. But a couple of the conclusions that we're coming to are, you know, basically, first of all, if you can compute the answer, you should probably compute the answer. We see a lot of observability vendors out there who are like, AI, this and that. And it's like, yeah, but actually it's a guess, right? And if you had gathered the context, then you could have computed it and it would have been faster, cheaper, easier, and better.

Starting point is 00:55:17 But instead, you're like, we put AI on it and it's a guess. That's not better. It's not better just because it has AI on it. Yeah, AI can make things worse, too, you know? There are certain problems where AI is the right tool for the job. When it comes to calculation and computation, that's not one of them. Another thing that we're really seeing a lot of, and unfortunately, we actually have some customers who are really sophisticated, like model builders and stuff, but everybody in the AI community is so tight-lipped.

Starting point is 00:55:48 Like they don't want to talk about anything, you know, and it's like, guys, come on. But I feel like one of the early lessons for Philip and me has been that you can't have good AI observability in isolation. You have to have it embedded in good software observability, right? There are all these startups out there that are raising just buckets of cash to solve the problem of AI observability. And they're all focusing on like the sort of self-contained models. And it's like, but like the inputs come from all these different services and data and humans and stuff. Like it's a trace, right? It's a trace shaped problem.

Starting point is 00:56:30 You have to be able to trace it all the way from all of these inputs up here in software land through the model to the human feedback. It's a classic hyper now. And by AI observability, you mean that the problem is like, okay, I have an LLM or a system that uses LLMs in my software. and I want to add observability to it, right, to understand how it works. And you're saying that a lot of the startups are like focus on, all right, let's just, you know, wrap the model into observability, inputs and outputs, and we'll see it. Whereas like this thing, it actually, like, you know, it's built around other parts of your software. You want to see the user interface.

Starting point is 00:57:05 You want to, you know, connect to like, like customer support tickets, that kind of stuff. Yeah, absolutely. It's a trace shaped problem. Yeah, it's a trace shape problem. it's a high cardinality shaped problem, and it's a high dimensionality-shaped problem. Like, these are just, this is a software problem with non-deterministic elements.

Starting point is 00:57:25 It's not an AI problem, is how I think of it. Yeah, and well, there's going to be a lot of money in it for sure, just because how much money there is in AI. But I really liked your third point, which is, because you said that the three buckets are, number one is observability for these LLM models, or, you know, when companies are writing AI, Number two was observability for developers who were writing code using LLMs.

Starting point is 00:57:48 And number three, which I think is the most important, is observability for this code generated by AI, which we know is, you know, hard to tell if it's good quality or not, but there's going to be just more of it. And this is what you said, that back at Parse, you were used to just adding observability for basically like all sorts of code, like JavaScript and whatever that people uploaded to power their mobile apps. Yep, exactly, exactly. You know, production is where code meets reality.

Starting point is 00:58:15 It's where it doesn't matter how pretty it looked. Doesn't matter how, you know, great. Like, you don't know if your code is good or not until you've watched it run in production. So I feel like, like so many things in the age of AI, this is not new. It's just a really intensified version of what we're actually dealing with. Yeah.

Starting point is 00:58:34 And what it tells me is like any company that wants to use these AI agents, you know, maybe these new AI agents or just more AI code, probably the prerequisite of doing that and not like burning is have good observability so that you know when stuff breaks. And from there on, you know, like it might be a next step of figuring out, can there be a feedback loop? Can I actually allow some of this AI to push production and all of those things? But without that, you're going to be flying blind,

Starting point is 00:59:01 which is just stupid to do with something as unreliable as an LLM. Exactly. A different topic, but also a pretty important one. Every company these days has a choice of building, buying, or using open source. Now, in the case of observability, usually this comes down to should we buy or should we use open source, because again, building from scratch doesn't really make sense. All of these have upsides. Do you predict either of these two will gain more momentum based on what you're seeing? Do you see more companies going to use open source and try to host it themselves, or maybe more of them are giving up

Starting point is 00:59:36 and going to vendors and maybe using OpenTelemetry, those kinds of things. You know, I think insofar as OpenTelemetry is open source, I think it has a really bright future, and I'm so relieved to see it. When it comes to using open source, you know, running your own versus using vendors, the main trend that I'm seeing in the space is consolidation. I think it's part of people just trying to get a handle on their bills, which totally makes sense, right? If you're paying five different vendors and each of those vendors is going for like your 15%.

Starting point is 01:00:10 Like I feel like a reasonable like benchmark for observability spend is like 15 to 20% of your cloud spend depending on the type of business or whatever. I think it's a good rule of thumb. But if you're paying for five different vendors and each of them is gunning for that 15 to 20% of your cloud spend. That's just like, it's not so, right? You can't do that. So there's a big consolidation in the industry. The only open source vendor that's really involved is Grafana, right?
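The arithmetic behind that point can be sketched in a few lines. This is only an illustration of the rule of thumb quoted above (15 to 20% of cloud spend, five vendors); the function name `combined_share` is hypothetical, and the numbers are the ones mentioned in the conversation, not measured data.

```python
def combined_share(vendor_count: int, share_per_vendor: float) -> float:
    """Fraction of cloud spend consumed if every vendor gets its target share."""
    return vendor_count * share_per_vendor


# Five vendors, each aiming for the 15 to 20% rule-of-thumb budget.
low = combined_share(5, 0.15)
high = combined_share(5, 0.20)
print(f"Combined observability spend: {low:.0%} to {high:.0%} of cloud spend")
```

Under those assumptions the vendors together would claim most or all of the cloud budget, which is why the pressure runs toward consolidation.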

Starting point is 01:00:39 And Grafana is doing very well. They just raised a huge round, right? I think, you know, different models. Although I guess you should count Prometheus too, but Prometheus is like, I think of Prometheus and Datadog as the last best Capital M metrics products that have ever, will ever be built. Like, nobody's ever going to try and launch another big project. Like, what's the point?

Starting point is 01:01:05 Like, these are mature. They're great, honestly. And they're just very mature technologies. So, you know, and metrics, for all the shit talking that I do, metrics have a place in the ecosystem, right? It's just not, right now it's like 80% of what people use is metrics and 20% is structured data and they need to invert that. It needs to be 80% structured data and 20% metrics.

Starting point is 01:01:27 But metrics are still great for, you know, cheaply plotting trends over long periods of time, right? Or for, you know, counters, right? Counters are a very essential metric thing. Or at a certain scale, which is a lot higher than a lot of people think, structured data is too expensive and you should use metrics. So there are niche use cases for metrics. And so Prometheus, I think will continue to be a contender. I think Grafana is great. I think that, you know, Datadog, et cetera. But in general, most people are using vendors and most people don't want to have to deal with this stuff under the hood.
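The difference between a counter and structured data can be sketched with plain Python. This is a minimal illustration, not any vendor's data model: the event fields (`endpoint`, `status`, `user_id`, `duration_ms`) and the helper `users_hitting_errors` are hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical request events; field names are illustrative only.
events = [
    {"endpoint": "/login", "status": 200, "user_id": "u1", "duration_ms": 12},
    {"endpoint": "/login", "status": 500, "user_id": "u2", "duration_ms": 480},
    {"endpoint": "/feed", "status": 200, "user_id": "u1", "duration_ms": 35},
    {"endpoint": "/login", "status": 500, "user_id": "u2", "duration_ms": 510},
]

# Metric-style view: a pre-aggregated counter keyed by a few low-cardinality
# tags. Cheap to store and plot over long periods, but the per-request
# context (which user, how slow) is gone once it is aggregated.
requests_total = Counter((e["endpoint"], e["status"]) for e in events)


def users_hitting_errors(events, endpoint):
    """Event-style query: uses a high-cardinality field the counter dropped."""
    return sorted(
        {e["user_id"] for e in events
         if e["endpoint"] == endpoint and e["status"] >= 500}
    )


print(requests_total[("/login", 500)])
print(users_hitting_errors(events, "/login"))
```

The counter can always be derived from the events, but not the other way around, which is one way to read the suggestion to make structured data the default and keep metrics for counters and long-term trends.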

Starting point is 01:02:06 When everything breaks at 2 a.m., you don't want to also be dealing with a broken observability tool on top of your other broken software. So I think it's, I mean, it makes sense to me, and I'm not saying that just because I'm a vendor. I mean, I would say that, wouldn't I? Yeah. A question I got actually from a reader on social media.

Starting point is 01:02:28 What about front end and mobile observability? Because when we talk about it, you know, it feels like so much of it is about the back end. Yes, yes. So we actually launched our RUM replacement this year for front end, trying to do the same thing for RUM that we have done for the back end. What does RUM stand for? Real user monitoring. And it's basically the front end. It's organized around browser sessions and user sessions instead of like backend requests. But I think it's so critical that I feel like so often the borders of tools are what creates silos. So if you've got one team over here that's using a completely different view into their software

Starting point is 01:03:08 than the other teams, it's like you spend more time arguing about the nature of reality than trying to solve the problem together. When you have a common, again, back to that unified storage, right? Many different views, many different entry points, but a unified view all the way from your mobile device or your browser to the database and back, it's such a powerful thing. Mobile is a different beast, and we've started dipping our toe into it this year. It's no surprise that mobile is kind of the only standalone solution out there that's sort of left, and I think a lot of folks are trying to figure out how. I mean, you come from a mobile background,

Starting point is 01:03:49 so you probably know all the reasons for this better than I do. Well, I mean, right, there was Crashlytics, which was acquired. It's such a weird thing because what I've heard is, Crashlytics around like 2012 or something was acquired by Twitter and for a while it was kept alive and then it was destroyed and two things happened there. I think they bought it for about $300 million and apparently it was such a low return that the VCs really got spooked and they never invested in another company that was doing this because they said if this is the biggest exit that was there. And it always feels that mobile has kind of been left by itself. There are some tools but none of them are really first class. None of them really come from the

Starting point is 01:04:29 vendors that are doing, you know, the proper kind of back-end observability monitoring. And then it just feels, I think everyone thinks it's such a small market, which I kind of disagree with, honestly. It's not small. But I do have a theory for why this is. And it's because the build pipeline is so alien and different. Because you can't CI/CD, right? You've got like the Apple store gating or you've got like the Android diaspora and the inability

Starting point is 01:04:56 to like fold it into the best practices that software development has, I think that's why it's out on an island by itself. No, mobile remains a little surprisingly archaic. And a lot of it has to do with Apple and then, you know, Google going along, not allowing binary code to be shipped directly without their permission. Obviously, companies are, you know, like every team is going around it. They're doing like feature flags and there's some JavaScript here, JavaScript there. But it's all hush-hush under the hood.

Starting point is 01:05:26 Apple kind of knows, but it's different. It's a fun, fun world. Yeah. Every area has its own challenges. Every area does, yeah. And the last question I got from, again, someone on social media, from past nine, this person wrote, I'm starting a new company today. What is the right time to start investing in observability and how can I design for it upfront, you know, startup, fresh idea? I think it's as integral to building software as tests. I do. I think it's the same sort of, the best time to instrument is the best time to write tests, the best time to instrument is while you're writing

Starting point is 01:06:10 the code. Anything that you try to slap on after the fact is not going to get at that original intent of what you were trying to do when you had it in your head. And I think that done correctly, it actually accelerates your development. It doesn't hold you back. It accelerates your ability to get stuff out, to understand, to keep moving. And so, yeah, I would say, as soon as the code that you're writing is real, you know, something you intend to put in front of a user, as soon as you start thinking about writing tests, you should be thinking about writing observability. So it sounds like to me we're kind of saying, like, when you're prototyping and, you know,
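As a rough sketch of what "instrument while you're writing the code" can look like, here is a small Python example that emits one structured event per unit of work and records intent-level fields at the moment the logic is written. Everything in it is an assumption made for illustration: the `instrumented` helper, the `apply_discount` function, and the field names are hypothetical, and it uses only the standard library rather than any particular observability SDK.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def instrumented(name, sink, **fields):
    """Emit one structured JSON event per unit of work (hypothetical helper)."""
    event = {"name": name, **fields}
    start = time.perf_counter()
    try:
        # The caller can attach more fields while the work runs.
        yield event
    except Exception as exc:
        event["error"] = type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        sink(json.dumps(event))


def apply_discount(cart_total, code, sink=print):
    """Business logic written together with its instrumentation."""
    with instrumented("apply_discount", sink, code=code) as event:
        rate = {"SAVE10": 0.10}.get(code, 0.0)
        # Record the intent while it is still fresh: did the code apply?
        event["discount_applied"] = rate > 0
        return round(cart_total * (1 - rate), 2)


if __name__ == "__main__":
    apply_discount(100.0, "SAVE10")
```

Because the event sink is just a callable, the same function can be exercised in a test by passing a list's `append` method and asserting on the emitted fields, which keeps the instrumentation and the tests growing together.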

Starting point is 01:06:47 you're doing throwaway stuff and, again, you wouldn't write any tests. Don't worry about it. But when you're like, okay, this might go out. I really like the test analogy. And you know, the interesting thing with tests is, there are two sides. When you say, oh, tests will speed you up in the long term, people who haven't written tests or haven't seen it, they're like, that's BS, it takes time, it can only slow me down, I'm just going to skip it. And then people who've seen it, they're like, no, no, no, you don't

Starting point is 01:07:12 understand. And now you see, you know, the two startup founders, one of them is starting by writing the tests and they actually do get sped up, where the other one says it's silly. You know, there's a comic about pushing the car on the square wheel, someone brings a round one saying, oh, let's change it. No, no, no, it will take too much time. So real. I feel this might be the case with observability. Once you've seen it, you probably cannot unsee it.

Starting point is 01:07:41 Exactly. So let's close with some rapid questions. I'll just shoot a question and you tell me whatever comes to mind. People told me I need to ask something around management for you. So I'll ask, do you like being an engineer or a manager more? I love being an engineer. I love it. One of the hardest things about, you know, the early years of Honeycomb were really rough. I didn't expect to have to be CEO and I wasn't a very good one. But so much of my identity came from being an engineer and it was really, really hard for me to kind of move past that. I will say that when I was an engineering manager, I hated it. But now I look back and I'm like, oh, I kind of miss that, which is why I actually have this open calendar link where people can set up time to like just kind of bring their problems to me and talk about them and I get such a kick out of it. Like I miss those aspects of doing engineering management but I think being an engineer is just you get paid to solve puzzles all day. That

Starting point is 01:08:37 dopamine hit of just like figuring things out, making things work. Yeah, I would just go around feeling high all day. It was really fun. Yeah. What is a controversial thing that you believe is true? Oh, absolutely nothing. Gergely, how can you say such a thing to me? My God. What am I thinking? What are you thinking? A controversial thing that I believe to be true. You know, I actually wrote an article yesterday. I got out something about founder mode.

Starting point is 01:09:10 And I, so I'm not sure if this counts as controversial or not. But I think it counts as controversial in the Silicon Valley, like YCE. I do not think that it's a good idea for the CEO to have to approve everything that goes out. I think that is egotistical. I think it is wrong. I think it hobbles good decision-making and judgment in other parts of the organization. I wouldn't want to work for someone like that. You know, there's this group of folks who just idolize Steve Jobs, you know, Johnny.

Starting point is 01:09:41 And I think that Steve Jobs, obviously, very successful person. And I think in large part, despite the fact that he was a raging control-freak asshole. I think he was successful despite that, not because of that. And it makes me really sad to see so many bright people be like, oh, well, I should also be an asshole and a control freak. Think of how many brilliant, wonderful people probably left because they couldn't stand to be micromanaged in that way. Well, I think it also comes to show there's just so many ways to succeed that there's no one way.

Starting point is 01:10:11 And also, to be fair, so many personalities, right? So I'm going to just like, you know, do a wild stab. You don't operate like that, right? Oh, I don't. I, for better or for worse, I've been a worker and an employee much longer than I've been a founder or at a C-level. And I don't know, it's so clear to me that the way that you bring out greatness in people is by supporting them and empowering them and giving them agency and giving them control.

Starting point is 01:10:38 And yes, be in the details. Like the advice of hire great people and get out of their way is terrible advice because you need to be doing the work to create alignment, to make sure that you have a shared view of what good looks like, what great looks like, so that you can course correct early when you start to diverge. So yes, be in the details, but you don't take people's agency away from them. So if you are not building an observability startup, what would you be doing?

Starting point is 01:11:05 I would be a staff engineer someplace. In fact, that's what I plan to do next. So just go be an engineer in someone's company, build stuff. At 5 p.m. I go home and I turn my brain off and it's like, it's your problem now, bud. It's going to be amazing. Oh, you know, for some reason, I want to believe you, but I don't think you're going to be that. I'm leaving my family. then later.

Starting point is 01:11:26 The turning of the brain at 5 p.m., I mean specifically. Oh, that part. Yeah, yeah, maybe not. I mean it, though. Like, I plan on going and being an engineer for a while. I think you can only, like, I hate the term thought leader. It just makes me a little nauseous. And I really think that, like, the farther detached you get from the work, the,

Starting point is 01:11:46 just the lower quality, I don't know. Like, for me, at some point, I need to circle back and do something with my hands in order to not cringe when I hear myself speaking. I absolutely hear you. So you're a big fan of whiskey. What is the current favorite? You know, for the longest time, I really liked peaty Scotches. And it's, I'm still kind of coming to terms with my identity as being more of a bourbon person now. Oh. Yeah, no, interesting. I really like WhistlePig. And I really like the rye WhistlePig, too. My favorite ever is impossible to find now, though. It's called George T. Stagg. And it is like 190 proof. It is so good.

Starting point is 01:12:32 You can't find it. But they started making something called Stagg Jr., which is like 80% as good. So if you haven't tried it, Stagg Jr. All right. And what's a nonfiction book you'd recommend for software engineers? The book is called Fluke by Brian Klaas. Fluke: Chance, Chaos, and Why Everything We Do Matters. I'm not a religious person, but I do like really look for ways of understanding, having meaning in my life. And like, the takeaway from it is that like everything we do really does matter because you don't know when the thing, like a stray comment that you drop in someone's presence sets off something in their head that changes their life, you know, or causes

Starting point is 01:13:15 them to happen or causes them to get sober or just, you know, all these ripple effects happen. And like the things that we do and say, yeah, 90% of them, you know, don't maybe 95 don't actually like set off. But things that we do, like really do matter. It's called Fluke. I've read it three times in the past year. Cannot recommend it highly enough. That is a very strong recommendation. All right.

Starting point is 01:13:39 Now I'm going to get it. Thank you. Well, Charity, this was a really interesting and fun conversation as usual. Thank you so much. It's so nice to see you. I wish we lived closer so we could get together more often. and it's always a delight getting to spend a little bit of time just shooting the shit. Thank you to Charity for sharing all these interesting details.

Starting point is 01:13:57 Charity is a prolific writer and you can read more from her on her blog at charity.wtf, which is linked in the show notes below. For a deep dive on how to build an observability startup and for details on how a scale-up managed to have a $65 million observability bill for a single year, see the Pragmatic Engineer deep dives linked in the show notes below. If you enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. Thank you and see you in the next one.
