The Changelog: Software Development, Open Source - Observability is for your unknown unknowns (Interview)

Episode Date: August 7, 2019

Christine Yen (co-founder and CEO of Honeycomb) joined the show to talk about her upcoming talk at Strange Loop titled “Observability: Superpowers for Developers.” We talk practically about observability and how it delivers on these superpowers. We also cover the biggest hurdles to observability, the cultural shifts needed in teams to implement observability, and even the gains the entire organization can enjoy when you deliver high-quality code and you’re able to respond to system failure with resilience.

Transcript
Starting point is 00:00:00 Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at ChangeLog because of Rollbar. Check them out at Rollbar.com. And we're hosted on Linode cloud servers. Head to Linode.com slash ChangeLog. This episode is brought to you by DigitalOcean, the simplest cloud platform out there. And we're excited to share they now offer dedicated virtual droplets.
Starting point is 00:00:28 And unlike standard droplets, which use shared virtual CPU threads, their two performance plans, general purpose and CPU optimized, they have dedicated virtual CPU threads. This translates to higher performance and increased consistency during CPU intensive processes. So if you have build boxes, CI, CD, video encoding, machine learning, ad serving, game servers, databases, batch processing, data mining, application servers, or active front-end web servers that need to be full-duty CPU all day, every day, then check out DigitalOcean's dedicated virtual CPU droplets. Pricing is very competitive, starting at $40 a month. Learn more.
Starting point is 00:01:01 Get started for free with a $50 credit at do.co slash changelog. Again, do.co slash changelog. From Changelog Media, you're listening to the Changelog, a podcast featuring the hackers, the leaders, and the innovators of software development. I'm Adam Stachowiak, Editor-in-Chief here at Changelog. On today's show, we're talking with Christine Yen, co-founder and CEO of Honeycomb. We're talking about her upcoming talk at Strange Loop titled, Observability, Superpowers for Developers.
Starting point is 00:01:37 We talk practically about observability and how it delivers on these superpowers. We also cover the biggest hurdles to observability, the cultural shifts needed in teams to implement observability, and even the gains the entire organization can enjoy when you deliver high quality code and you're able to respond to system failure with resilience.
Starting point is 00:01:54 And by the way, Strange Loop is happening in St. Louis, Missouri, September 12th through 14th. We hope to see you there. Tickets are available right now. Head to thestrangeloop.com to learn more and register. Christine, in your Twitter bio, it says that you miss writing software. And that makes me kind of sad. I am kind of sad about it.
Starting point is 00:02:14 Write some software. You got to write some software. I know. I know. At this point, it's like, okay, well, what weekend art projects can I use to get some software written, or have some excuse to do that? Or I mean, I think this is the nature of early stage startups, right? The sorts of things that made sense when you were two people and you needed to get something off the ground no longer make sense when you're 25.
Starting point is 00:02:37 And you need to start thinking about how the business is doing. And, you know, delegation is a skill. It's a learned skill lots of times. It absolutely is a learned skill. It does not come naturally, especially to perfectionists. A lot of times people writing software are like, that's my code, it's my software, it's my thing.
Starting point is 00:02:55 It's tough to let go of that and trust other people. One of the best arguments our director of engineering was like, Christine, you need to stop rogue fixing bugs at night. And I was like, but why? She was like, because because when you do it it doesn't let us improve our process to make sure the things you care about get fixed that is a great argument that i can get behind successful teams are often built on successful systems and processes so you definitely have to give room for that to take place. Otherwise, if you're just fixing the problems and the problems aren't fixed by the team,
Starting point is 00:03:29 then it's kind of hard to build a strong culture, which I'm sure is important in those contexts. Definitely. So we're excited for your talk at Strangeloop. We're excited for Strangeloop because we've been trying to get to Strangeloop for a couple of years. I think I offhanded say that to you a lot, Adam. Like, hey, let's do Strangeloop this year. been trying to get to Strangeloop for a couple of years. I think I offhanded say that to you a lot, Adam, like, hey, let's do Strangeloop this year. It's like four years now. Yeah, I think there was an OzCon conversation we had a couple years back on the show where I even
Starting point is 00:03:52 said, we should go to Strangeloop. And then we still haven't gotten there. Well, we're going to be there. We're going to be there this year, September 12th through the 14th. Adam, check my work there. I believe that's correct on the dates. And we are working with Strangeloop to invite everybody out. Come. It looks like an excellent conference. We'll be there. Come see us. Come say hi. And I was looking at some of the sessions and yours jumped out to me, Christine. Observability, superpowers for developers. And that just makes you stop and think, hey, I want superpowers. Who doesn't want superpowers, right? Yeah. I, um, I had a lot of fun coming up with that title.
Starting point is 00:04:31 Um, because it really seemed to capture a lot of the thoughts that have been kicking around my head in our honeycomb life for the last few years, namely in that observability is the thing that so many people associate with like something from ops people or SREs, you know, the hardcore people that carry pagers. And I mean, it's true observability is something that those folks care about, but that it's actually more powerful for the folks who are writing the code, the people who are sometimes in ops folks' minds causing the problems. But it's, it's taking this thing that is associated with
Starting point is 00:05:08 fighting fires and being like, well, but what if we bring it earlier in the process and what can it supercharge? What can people do that they couldn't do before because now they have the ability to see into their systems? And it helps that I'm a huge Marvel and superhero genre fan. Yeah. And I, I've never actually done a talk before that with like, I was able to pull in so many pop culture references and end up, you know, what would have taken like normally an hour to sit down and work on something would end up taking like two hours. Cause I'd get sidetracked on Wikipedia and Google image search rabbit hole looking for that
Starting point is 00:05:48 perfect image. Anyway, it should be fun. Well, let's get to the superpowers in a minute because you mentioned something there, which we've read is something you care deeply about and something that Adam and I have talked about a little bit and touched on, but haven't gone deep onto. You talk about the ops folks and the dev folks and how those are different folks lots of times and there's a cultural divide it seems maybe not always but generally speaking between those sets of folks on teams or inside of businesses because of one of the things you said there which is like the ops people are causing the problems, or the devs are causing the problems that the ops people have to deal with,
Starting point is 00:06:28 and one group is on pager duty and the other group isn't necessarily, and there's a gap there, which is something that needs to be addressed because that's not a good way to be on a team. That's like us versus them. So this is something you care about. Will you share thoughts on that divide and what we can do about it absolutely so to zoom out a little bit um i'm working on a company called honeycomb with my
Starting point is 00:06:56 friend and co-founder charity majors um and in many ways she embodies the ops stereotype she's been in ops for many years she's carried she says she's carried a pager since she's been on call since she was 17. That's a long time. A long time. Whereas I have much more of a product development background where I'm like, oh, I want to build stuff that users touch and feel and improves their lives. And before Honeycomb, we actually worked together at a company called Parse, where it was a mobile backend as a service.
Starting point is 00:07:27 And one of the tenets of our engineering culture was that everyone did a day of support. You rotated through, and no matter what, you were the one in the email inbox answering questions for customers. What this meant is that the people who were writing the software, like me, were always super aware of ways the software sucked, or ways that were unclear, things that were confusing. And there was a really tight feedback loop between what users saw and the work that we did, whether we were actively writing the code or we were the folks maintaining the systems. And even there, though, there was this element of, okay, I was building the analytics part of the product, right? This new, exciting part of the system. We talked to ops and sort of planned out how I'd want it to scale, and then we hit go live. And inevitably people would come knocking on my door being like, Hey,
Starting point is 00:08:26 Christine, something happened to the, you know, write throughput on our Mongo cluster or Cassandra cluster. Like, what do you know about it? And I'd kind of look around and be like, uh, I don't know. Write throughput on Mongo. Hmm.
That's a great question.
Starting point is 00:08:42 Um, and eventually we'd work together and track down what had happened with the code. And I think through that experience, again, because we had this feeling of always being on the same side and working to support our customers, I became very aware of the different types of skills
Starting point is 00:09:02 that go into building a system that is resilient for your customers and how much better things were when we were on the same, looking at the same information. And I think of those days of seeing right throughput is up as sort of the past and actually our post-acquisition time as the future in that when we got acquired by Facebook, we were exposed to an internal tool called Scuba,
Starting point is 00:09:34 which was sort of the predecessor of Honeycomb, which allowed for a lot more flexibility in interpreting impact on the system in the terms that I as a developer understood. So instead of, you know, hey, Christine, WTF, you just, something you did change the write throughput on this database, it would be, hey, Christine, latency for serving this particular type of request
Starting point is 00:10:00 went up on this endpoint for our largest customer. Does this sound familiar? And those are things, those are the entities and those are the nouns that made it a lot easier for me to understand how the code that I wrote impacted production. And really that's the sort of thing that ops folks have innately, that developers have to almost learn, right? Especially in a world where boundaries between dev and ops are blurring. Developers aren't going to, can't start to adopt that ops sensibility
Starting point is 00:10:37 until they see cause and effect. Oh, when I write this kind of code for this type of production workload, this is what happens. These are the signals to look for. These are the things I can start to work to prevent or watch out for in my code. One of the taglines that we've played with
Starting point is 00:10:55 or one of the phrases that we've liked, especially in this realm of observability, is that it allows you to test in production, which I know means a lot of things for, you know, future flag folks are using that and dah, dah, dah. But I like it in the context of observability because what are you doing when you test? You compare actual versus expected. And a lot of ops folks in their monitoring, you know,
Starting point is 00:11:22 with their monitoring setups, that's kind of what they're doing. I expect CPU to be within this threshold. Actually, it's over here. And the more those signals can be framed in the same way to be like, well, I expect latency to be around here. I expect to be able to handle 2,000 requests per second for this customer compared to actual and tie it back to the code that I write. Boy, that's a really good feedback loop
Starting point is 00:11:50 and really virtuous cycle for developers being able to ship better code in the first place. Are there a lot of devs out there that aren't in the know? I mean, is it common for developers to just not see that side of things? I think it's so easy. Yeah? I think it's so easy to write code based on what you think is normal or what should be true without actually verifying it
Starting point is 00:12:09 to some Adam. I know you said you're, you have background in product management. Um, and this is to say this nicely, uh, being able to verify for yourself what is happening in production almost lets sometimes developers sidestep that product management intuition. Or it lets you develop your own intuition based on reality and not, or helps you supplement the more qualitative research product management perspective with, but this is actually happening, right? This largest customer is actually sending us this volume of data. Or we assume that people
Starting point is 00:12:47 send us payloads of this type but are instead sending us payloads of another even just talking to folks at various tech conferences there's lots of developers who are like oh well you know i write code according to spec and i write my tests and i ship it and when things go wrong it's just something in the infrastructure not my code i think that that that's a mindset that's slowly changing over time. So that's coming at it from the developer's perspective, right? So they have, that's a technological solution in terms of showing them, right, observability into the way that this will perform
Starting point is 00:13:16 in production or the way it does perform in production in real life, allowing them to tie back to their code. What about from the ops perspective? Is there anything, is there, because you're bringing basically the developer closer to the ops side. Is there any effort to bringing the ops people closer to the code in terms of why can't the infrastructure person get the stack trace, or not stack trace and error, but go back to the lines of code that are affecting this
Starting point is 00:13:43 and then analyze that? Is there movement in that direction, or am I out in left field? I think there's some movement in that direction. I actually think of the movement from ops over to dev as being something that is almost part of the broader DevOps transformation movement, migration. Getting ops folks to get more comfortable with automation and code as a way to do their work is something that I feel like has been happening over the last five or 10 years already. And to some extent that it makes folks, ops folks,
Starting point is 00:14:18 SREs more willing to get their hands into the code itself. But on a team of a certain size, there's always going to be folks who are a little bit more comfortable, or they're going to be the folks who are largely producing the code, which is the folks who sometimes stick their hands in to make sure there's instrumentation in place or that to test something.
Starting point is 00:14:38 And certainly from my perspective, I'm more interested in calling the folks who are focused on shipping to be like, well, okay, ship faster, but also be aware of what you're shipping and how what you're shipping is behaving. Can we actually break down what observability means? It's like this buzzword, good term. I got log files, right? Everybody has log files.
Starting point is 00:15:00 There you go. You look at your logs. Yeah, exactly. Just look at your log files. What exactly is observability in the context of these superpowers and Honeycomb and this context? Yeah. I define observability as the ability to ask new questions of your systems, ideally without deploying new code. I'll break that down. Being able to ask new questions.
Starting point is 00:15:21 What this means is, if you look at traditional monitoring systems, often you're defining some sort of dashboard and you're saying, I want to know what the average latency of my system is, or the total throughput of requests. So you take that and you put it in a dashboard, you put it on your wall and it just stays there. And that is a question that you've asked. And that's the answer to the question, and it very rarely changes. Part of the reason observability has grown in popularity the last five years is really that our systems are now evolving to a point where you can't just predict the one or two questions that will be important and put them on a wall and have that be enough. You need to be able to ask questions like, okay, well, I have average latency up there, but what is the P95 of latency for customers fitting this profile? Or that one customer over there? Or what is average latency if I remove requests that touch this database I know is slow? The ability to ask these freeform questions
Starting point is 00:16:21 is becoming more and more critical to being able to support these more and more complex systems we've been building. And the reason we've found ourselves drawn towards this new word is that there is almost a, there's this
Starting point is 00:16:39 split between the things that are stable enough to monitor, right? CPU utilization, maybe are stable enough to monitor, right? CPU utilization. Maybe it's nice to know. It's not going to change that much. That's the sort of question you can monitor. You can put it on the dashboard, whatever.
Starting point is 00:16:56 Things like, well, what's happening for this customer? Why does our service look down for them? That's a much broader question. That's much fuzzier where the answer to that might be different, whether I'm looking today, tomorrow, or next week. Someone came up with a phrase that I really like. I can't remember who it is right now and I'll hit you notes afterwards. But what they said is, if testing is for known knowns, where you're trying to capture known behavior and immortalize it, and testing is for known knowns, where you're trying to capture known behavior and immortalize it, and monitoring is for known unknowns, you know you might care about CPU,
Starting point is 00:17:31 but you don't know what it is at this point. Observability is for unknown unknowns. And I love that because this idea of unknown unknowns really does, again, provide the perfect flip side to testing something that's unknown. Observability, you're like, well, something will go wrong. Something will go wrong in my system. I just have no idea what it is or where to start looking, and I need a tool that will work with that uncertainty and work with that flexibility rather than kind of hemming me in to the questions
Starting point is 00:17:59 that I thought to ask ahead of time. That last part of the definition of observability where I tacked on a, um, without deploying new code is important to include because lots of folks can say, well, I can ask like any question I want in my monitoring system. I just, you just add a new metric and then deploy it and then it's gone. Then it's there. Um, but that whole act of having to add that new metric and deploy it, like it's too late it's too late sometimes it's not even scalable right if you have a hundred thousand customers you just can't track a hundred
Starting point is 00:18:30 thousand metrics easily um you know caveats you throw money or hardware or something at it maybe it'll work um but it's there's there's an element of okay something is happening now and i need to sort it out now that i think we really now are able to capture in this concept of observability is an ability to do this thing, not a type of data, not a specific tool. So is it just collect all the data all the time kind of thing? Or is it collect all the things and then ask questions because you have, you know, you've collected all the data.
Starting point is 00:19:07 Essentially, you've monitored, you've logged every possible thing to enable yourself to ask those questions, the unknown unknowns of the future. I think it is a lot in line with collect all the data all the time. But, you know, we being engineers, we know that that's a recipe for something that is itself unfeasible and unscalable. Certainly, what we at Honeycomb like to talk about is capture the data that you think will be important to your business. Capture the data that are going to be helpful in tracking down the issue. There's a couple of things here, and I'll break that down. First, when I say capture all the data or capture data that is necessary, I mean capture all of the context around things that are happening in
Starting point is 00:19:51 your system. This is again, in contrast to more traditional metrics and monitoring. Metrics monitoring, it's very common to be like, okay, well, let's just increment this counter when requests come through. On the observability, from the observability perspective, we say, oh, man, if you're only capturing a counter, you're losing all of this interesting context and useful metadata around what sort of requests they were and who issued them and what the requests were trying to do and how long they took and then how long they spent in the database and how long they spent rendering and how long they spent doing these other things. So context plays a big part because those are the bits that are going to be necessary
Starting point is 00:20:27 for the unknown unknowns for tracking down the things that went wrong. Another dimension on the capture everything all the time. All the time does not necessarily mean you should be capturing information about every single request. I think for many folks, especially for folks who are coming from a logging world, sampling is a little bit of a dirty word. And you're like, oh, no, you can't sample. And how am I ever going to capture the kind of low frequency events that are important? And you're asking me to throw away data?
Starting point is 00:21:01 No, I can't. And while, yes, storage has gotten much cheaper and we could store everything we wanted, ultimately, the model of using logs to capture a historical record of everything that happened made sense when logs were human scale or software systems were human scale. And it made sense to have a human with their eyeballs reading through log
Starting point is 00:21:27 lines of what happened. One of the truths is that our systems are no longer like that. Logs are no longer human scale. They're machine scale. And as a result, we can start to do things like sample intelligently and capture just enough to gain a sketch of what's happening in our system in real time. When I say things like sample intelligently, I mean things like, okay, if you care a lot more about errors than successful requests, capture 1% of all successful requests and a hundred percent of anything that hit a four or 500. Maybe you have certain customers
Starting point is 00:22:06 that you care about or certain customers that you know that you don't care about because they're high volume. Great. Sample that down. And we all have, all of our tools now are capable of doing this sort of statistical analysis and statistical compensation for these more complex sampling rules. And they can allow us to manage the volume of overall data while not having to miss out on that rich context that actually allows us to answer questions and solve problems in our system. This episode is brought to you by GoCD. With native integrations for Kubernetes and a Helm chart to quickly get started, GoCD is an easy choice for cloud-native teams.
Starting point is 00:22:59 This episode is brought to you by GoCD. With native integrations for Kubernetes and a Helm chart to quickly get started, GoCD is an easy choice for cloud-native teams. With GoCD running on Kubernetes, you define your build workflow and let GoCD provision and scale build infrastructure on the fly for you. GoCD installs as a Kubernetes native application, which allows for ease of operations. Easily upgrade and maintain GoCD using Helm. Scale your build infrastructure elastically with a new Elastic Agent that uses Kubernetes conventions to dynamically scale GoCD agents. GoCD also has first-class integration with Docker registries. Easily compose, track, and visualize deployments on Kubernetes. Learn more and get started at gocd.org slash kubernetes. Again, gocd.org slash kubernetes. So you explained that sampling is logical. It's also counterintuitive, because you have all the people like,
Starting point is 00:23:56 well, if I sample the wrong thing, I'm going to miss something. And as you described, observability is for the unknown unknowns. Well, that's the hardest thing to know about, right? Because you don't know about it. De facto, do not know what you don't know. And so what are some of the heuristics or ways that you can decide what's important, what's not important? Because like you said in the first segment, tracking all the things doesn't really scale
Starting point is 00:24:19 well for most businesses. And so these decisions have to be made. And yet you don't want to miss something that you may need. So you mentioned maybe an important customer or maybe an error, you want to track more. But tell us more on these decisions and help folks decide what do I need to observe and what don't I care about? This is a great question. One of the things, one of the principles I really like to have in my head is that with any of these data tools, the data tool is only going to be as good as the data that you're getting into it. Put garbage in, you're going to get garbage out. And so these questions around, well, what do I sample and where do I capture data from are so important to always be aware of. I think that there's a perception.
Starting point is 00:25:07 Well, first, if observability is strategy, it's the high-level thing that you're working towards, instrumentation and figuring out where to capture data from is the tactic to get right. And a lot of people think about instrumentation, and they're like, oh, my gosh, it seems like so much work having to go in and say that I want to capture data from this. Can't I just... Don't you just have an integration I can plug in out of the box and have it work? All my APM tools just work out of the box.
Starting point is 00:25:35 And I think that it is awesome when things work out of the box. But ultimately, you know your system best. You know your system best. You know what your business cares about. You know what tends to go wrong in your infrastructure. You know what is even wrong to the application. Those APM vendors may not. And so out of the box, getting something up and running might be helpful for making sure you don't miss any of the common bits. But ultimately, thinking through
Starting point is 00:26:04 what are the sort of entities I care about when breaking things down for my business? I like to talk about Intercom, one of Honeycomb's longest and oldest customers. For a long time before they found Honeycomb, we're not able to break down by app. Being a B2B company, they needed to be able to say, well, this customer or this app is doing this thing and this other customer is doing this other thing. And that was just something that was important to their business that previously had not been able to be translated to their engineering tools. And that's the sort of thing that only your engineering team is going to be able to go in and be like, okay, well, here's this entity. I'm going to shove this into our metadata of our data tools
Starting point is 00:26:47 so that I can ask questions that incorporate this piece of metadata. We talk to folks about getting started with observability or doing that first pass of instrumentation. There can be a lot of these questions about what matters to your business, what matters to your infrastructure. For example, for us, we use Kafka pretty heavily. It tends to matter which partition things get written to. So that's a piece of metadata that gets captured in all of our dog food instrumentation. Back to what I said earlier,
Starting point is 00:27:16 there's this perception that instrumentation is this big lift, this big thing that you have to get right. And it's a lot of work. And that I say, it doesn't have to be, it's something that's iterative. It's something that evolves along with the code that you're writing. And the same way documentation and comments tend to evolve or tests evolve as the logic underneath changes. So should your instrumentation. And with that frame of thinking, it's almost like you start off capturing a baseline of things that you think will be useful. If you have a basic web server, you probably care about handle this request
Starting point is 00:27:54 and it returns this HTTP status and maybe came in from this user or customer ID. And as your understanding of the system evolves, as your understanding of the questions that you might want to ask evolve, you can just add new fields, add new pieces of metadata. And the schemas that you're capturing
Starting point is 00:28:14 or the types of bits of data that you have to work with end up changing and growing and sometimes shrinking if you're pulling out stale fields. A lot of people don't like this answer because it requires some thinking right it requires them to like sit be like oh well what does matter to me and a lot of people you know no one likes to be told what do you tell those people what if i said i don't like that answer from what it depends on
Starting point is 00:28:39 how whether i'm wearing my honeycomb hat or not if i'm wearing my honeycomb hat or not. If I'm wearing my honeycomb hat, the answer is usually, cool. Well, good luck. We'll talk to you in a couple months. So take your honeycomb hat off and answer that question. Yeah, with the honeycomb hat off, it's a little bit more like, how much have your underlying system technologies changed? Are you playing with microservices? Are you playing with containers and orchestration?
Starting point is 00:29:05 If yes, chances are your practices around supporting that are going to have to change also. The idea that we can change how we deliver and host software without changing our thought patterns about how we ensure that those pieces of new technology are working the way that we expect is kind of mind-blowing, right? Logging tools and metrics tools really came into being like 25 years ago
Starting point is 00:29:34 when we only had grep and we had counters. And APM tools kind of came into being at some point along that path in order to bridge the gap between, okay, well, I want this, I want these graphs, but I also want some flexibility and being able to get down to the raw data. And those tools are struggling, especially the ones that have been around for a while are struggling to keep up
Starting point is 00:29:57 with a containerized world. Things that rely on stable host names tend to not be so happy when you have like a hundred nodes that you've spun up and spun down three times over the course of the day as you're experimenting with something. This increased attention being paid to how am I capturing the information that I need from this more complex system to answer these more complex questions, I think is a good thing. And there are lots of patterns and good practices that you can use to make sure to minimize the amount of work that you have to do and to make sure you're on the right path. But ultimately, all of the custom logic, all of the things that matter to your business bottom line, I think that are only going to be inside your head.
Starting point is 00:30:52 Yeah, it seems that as the trends in software architecture move towards microservices and towards serverless components, observability trends alongside those as moving from a place where it's a subset of context in which it's worth the effort to instrument the correct things. I was about to say instrument all the things, but not almost all the things. And set up these circumstances in which you can ask questions about your unknown unknowns towards a place where it's more broadly, like everyone's going to need this if we're going to continue to move into this more nebulous, cloudy, apologize for the pun, circumstance of serverless and microservices. Because we don't have, we just aren't as close to the
Starting point is 00:31:39 quote-unquote metal as we used to be. Like you said, when we used to have just grep encounters, things are, things are changing as we move in that direction. It seems like observability becomes more and more paramount. Yeah. I mean, I think that serverless is a great part of this also, right? Again, instrumentation doesn't have to be this big hole heavy lift. It's just a question of, well, what actually matters? Oh, you know, for Honeycomb, it's that our customers,
Starting point is 00:32:18 if they write a payload, they can query it in under a second. Okay. So let's start our instrumentation in order to capture what the user is seeing. Let's find a way to capture the API layer and the query layer in order to ensure this experience. And then as we need to, we can go deeper into the stack, we can go deeper into the code, add the instrumentation for what happened at the merge step inside our query engine, da-da-da. But when you're first starting out, leave that level of detail out until you know that you need it.
Starting point is 00:32:46 Some things it sounds like would be hard to observe and some things it seems like would be easy to observe. So if you take our completely self-centered circumstances, there are certain things about podcasting where it's hard to observe. Our listeners, for example, we don't know very much about them. Adam and I happen to not care too much about that. But as an unknown unknown unknown perhaps we might want to know something about that or more on the infrastructure side of the question that's more on like the uh maybe the advertising infrastructure like how fast are they able to download all of our episodes and how do we observe those things that's a little bit easier for us to track so
Starting point is 00:33:25 what are some things that are traditionally hard to observe or maybe people think they're hard to observe and they really aren't that hard or on the converse what are some things that people think are easy actually are hard or slice and dice that question however you like hmm this is an interesting question um i might come at it from another angle. There are really interesting parallels between this burgeoning observability trend in the ops and engineering and infra space and business intelligence folks and almost data science. People talk, Honeycomb will go out there and be like, oh, you can do these things with your data and you can answer these questions. And there's someone out there sitting on their giant Tableau instance being like, I haven't been able to do that since I don't know how long.
Starting point is 00:34:17 The most interesting man in the world. Right. That reminds me of those commercials. And, you know, I'm gonna take the um actual differences between honeycomb and tableau aside set them on a shelf won't get into them here um and and just point out how silly it is sometimes that there are these divisions between closely related disciplines where yeah i mean business folks, data scientists have been dealing with unknowns and unknowns forever, right? They've been dealing with this question of like,
Starting point is 00:34:50 oh man, why did profits go up last quarter? From a completely different context though, not from an ops perspective. Totally. But it's almost the same like actions, right? Thinking about this observability movement, it's exciting to me because it means that maybe engineers and operators and technical folks will be able to not purely think about, well, I have this data. What can I do with the data? But instead start thinking about what are the questions that I need to answer in order to ensure a good experience for our users? What if you found out tomorrow that the changelog wasn't being downloaded? It wasn't accessible to anyone in France. And a whole geo was just unable to access it because something in the infrastructure. Like these are, these are the sorts of things that I think because we're technical folks, because we are engineers, we're so accustomed to looking at what we have to work with and then figuring out what we can do with it.
Starting point is 00:35:40 Then starting to think about what our tools can help us achieve and then setting up the data that we need to achieve those goals. It's almost like literally like a hacker, where a hacker has to think about how to infiltrate and circumvent a system. You almost have to dream how your system will fail or problems that will come up or things that will essentially defunct your user experience that you desire, whether it's throughput, speed, etc., you almost have to dream of what could happen and then monitor the data from that. Kind of. I think that is the middle step. But even
Starting point is 00:36:16 you can even go higher and be like, what would get you out of bed? You'd be like, I would get out of bed in the middle of the night. Well, in the middle of the night. In an alarming way. Well, I guess. What would make you go get the coffee? Okay. Well, and it's users.
Starting point is 00:36:32 If you're Shopify, it's like users not being able to check out. Yeah. Oh, crap. That is the problem. Okay. Well, now let's think about, okay, yeah, what are the ways things might go wrong? What are the pieces things might go wrong? What are the pieces of metadata that we might need in order to quickly isolate what users aren't able to check out to users aren't able to check out because they're unable to talk to that database?
Starting point is 00:37:07 Being able to think of it from the perspective of whether your customers are able to achieve their goals is how, frankly, I feel like all software should be written or thought about which is a little bit a little bit of a harder um harder sell so we we tend to focus on observability and the technical things that can be achieved but is that a starting point though the what would get you out of bed you know is that sort of how you approach the necessary pieces you would want to capture to query the unknown unknowns of the future? You know, like, is it just simply that question or other questions that because to me, that's a great question to ask. Like what would get you out of bed to fix something, you know, page duty, et cetera. Yeah, I think that that specific question, people have been great. Like people have associated that too tightly with what is true in their present day right there's definitely people who would be like i would get out of bed
Starting point is 00:37:49 if um disk space was over 90 which is certainly an answer but doesn't quite carry the same end user impact that we want to inform um i think that it's more how do you know that something is actually broken or is actually impacting your business or your customers? What are they experiencing? Set alerts or set your pager on that. reducing alert fatigue and burnout and over, over monitoring that I will not go down over the course of this podcast, but there are a lot of smarter folks who've said things about on that front where again, it's asking the right question or thinking about the signals that actually
Starting point is 00:38:38 matter is something that can really improve an engineering team's lives, culture, et cetera, on a whole bunch of different levels. Observability is just a really great opportunity to start asking those questions. So if there's somebody out there that's like, great, I'm sold. Observability rocks. I want to implement it. I want to bring it into my organization. What are the steps? Who has to be convinced or sold the idea of it? And what are the tooling? You know, obviously, Honeycomb is one of them, but you mentioned APM earlier. You mentioned other tooling out there. What kind of tools or steps would somebody go through or take to start to chisel away at observability for the organization?
Starting point is 00:39:20 I think that tools are a catalyst for conversation, but rarely that first step. I think that first step is always going to have to be, oh, man, let's take a step back and think about whether we can answer the questions that our organization needs. Do our current tools or do our current practices support looking at things from the customer's perspective? Do they support being able to break down by FID or shopping cart ID, if those are the most important things? From there, folks can then start to try things like, okay, well, we have this data tool, we don't really want to swap it out, but I want to add this new field, or I want to add the ability to compare this customer versus that customer. Great. Let's try that. And, I think again, as technical people, we want technical answers, like,
Starting point is 00:40:10 ah, just use this technology and buy a Kubernetes and then it'll, it'll fix your problems. I was hoping there was an easy one, but it seems like there's not. But I think starting these conversations can at least keep a lot of this at a human level and identifying those questions and those pieces of information that you want to be able to starting these conversations can at least keep a lot of this at a human level and identifying those questions and those pieces of information that you want to be able to interact with in your data tool is the first step from there then it's a question of okay well can your tool support that does your does your rest of your tool chain or how you're instrumenting support, being able to answer these questions. Okay, well, if not, then that core set of questions works well for both, let's take this set of questions
Starting point is 00:40:54 and go figure out which tool makes sense for us, as well as arguing upwards, saying, hey, these are important questions to the business. We need to be able to ask these questions. Hey, Mr. or Ms. VP, I want a little bit of time or budget to explore this better way that my team can support their software. We have, these are very abstract things. On Honeycomb's site, we have a white paper section, in particular where MicroFounder Charity and Liz Fong-Jones have recently published a framework towards an observability maturity model that provide a number of these questions around, can your team do this?
Starting point is 00:41:40 Can your team do that? These are signs of your tools not being able to help you minimize tech debt. And I think that document in particular provides a great way to start thinking about and evaluating your organization's current observability practices or to start mapping out a way to improve improve them. This episode is brought to you by cross-browser testing of SmartBear, the innovator behind the tools that make it easier for you to create better software faster. If you're building a website and don't know how it's going to render across different browsers or even mobile devices, you'll want to give this tool a shot. It's the only all-in-one testing platform that lets you run automated visual and manual UI tests across thousands of real desktop and mobile browsers. Make sure every experience is perfect
Starting point is 00:42:38 for everyone who uses your site and it's easy and completely free to try. Check it out at crossbrowsertesting.com slash changelog. Again at crossbrowsertesting.com slash changelog. Again, crossbrowsertesting.com slash changelog. So, Christine, let's imagine a software developer and this person is interested in superpowers. And you have promised said developer superpowers if they will just adopt observability. So give us that pitch.
Starting point is 00:43:20 What does a superpower look like in this context? What do I get out of it from the dev side? And I'm going to adopt the concepts and try to get the metrics going, and I want to observe my system. What do I get out of that? What are some superpowers? Great. Well, first, let's think about the sorts of things that a developer has to do throughout the software development cycle. Maybe you're deciding first what to build, either because something is broken and you need to fix it, or because a product manager is handing you a spec. You need to figure out how you're going to build it. It's the architecture review
Starting point is 00:43:54 or the feasibility assessment. Then you need to figure out, you need to make sure that it works, local testing. You need to figure out that it, ideally, you make sure that it works kind of in a broader sense. Sometimes it's DCI. Sometimes you're pushing something behind a feature flag. And then often you're responsible for maintenance.
Starting point is 00:44:14 This is making sure it doesn't throw exceptions in production or what have you. My thesis is that observability can impact all of these. It can improve your ability and supercharge your ability to do any of these, not just that last one. And so I'll throw a couple of stories and examples at you. I think my favorite one is the how to build something. Because a lot of people are like, okay, well, I have a spec. How do I do it? Let me just kind of come up with something that I think will work locally.
Starting point is 00:44:46 Or let me come up with something that if you're a TDD shop, maybe you write your test first and then you're like, well, now I just have the right code that will satisfy this use case. How do you even know that that's the right use case? Or how do you even know that that use case is going to be representative of what your code will encounter in production?
Starting point is 00:45:02 And the way that observability comes in is it lets you actually verify your assumptions. I think that my code will have to handle workloads of this sort, payloads of this size, things like that. It will actually let me make sure that the code that I'm writing will behave well. An example from our very, very early days, at its core, Honeycomb has an API
Starting point is 00:45:27 that just accepts a whole bunch of JSON. And we were trying to decide, we had this ticket that was like, okay, well, we should unroll nested JSON, flatten it. Go, okay, great. Like the correct thing to do is obviously to do this by default
Starting point is 00:45:41 so that folks sort of just get this better experience. And the engineer who was working on it was like, wait a minute, let's double check this first. Let's find out who would be impacted and let's find out, let's make sure that if we do this, it will have the intended effect, which is our users being happier rather than being unhappier. And so what that engineer did, his name is Ben, is he made the two-line code
Starting point is 00:46:08 that would have unrolled the JSON or figured out how deep the JSON blob was. And instead of deploying the change right away to actually do the unrolling, captured something in our instrumentation that said, if we had unrolled, it would have added these new fields as a result
Starting point is 00:46:30 of the JSON blob being nested with a depth of three or five. And he was able to find out that something like a third of our customers were actually relying on things not being unrolled. So he was like, oh, okay, well, thus, the correct thing to do is to have it be an option.
Starting point is 00:46:49 That is the sort of thing where if he hadn't checked it ahead of time, if he hadn't actually verified in production that a third of the customers were relying on a certain type of behavior, he could have just blindly shipped this quote-unquote improvement and made a bunch of people unhappy, maybe cause some incidents down the road. So that is an example of how even just the, like how to build something can be improved. Another, how did he check that in production?
Starting point is 00:47:15 I might've missed, like, was there the metrics were already in place to check how much, all right. Yeah. If you remember the details. No, as the metrics weren't in place.
Starting point is 00:47:26 But as he was writing it, he was like, oh, well, while I work on the full pull request and while I write the tests for the code that I would want to ship, I'm going to prepare the smaller PR to just look at a payload as it comes in and alongside our, like, oh, I'm handling this request.
Starting point is 00:47:46 capture a bit that tells us how deep that JSON payload is. Gotcha. So he added the... he made it observable. Yeah, he made it observable. And I feel like one of the keys to working with a tool that supports this whole workflow is having the tool be not even just tolerant of, but totally fine with and welcoming of, new fields being added as necessary. One of the principles we try to build Honeycomb to is that adding a new line of instrumentation should feel as easy as adding a comment to your code.
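A rough sketch of that kind of change: measure how deep each incoming JSON payload is and record it on the request's event, without actually flattening anything, so you can learn who relies on nesting before changing any defaults. The helper and field names are invented for illustration and are not the actual Honeycomb code.

```python
def json_depth(value):
    """Nesting depth of a decoded JSON value: 0 for scalars, 1 for flat containers."""
    if isinstance(value, dict):
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return 0
    return 1 + max((json_depth(child) for child in children), default=0)

def handle_ingest(payload, event):
    # Don't unroll anything yet; just observe what unrolling *would* touch.
    event["payload.depth"] = json_depth(payload)
    event["payload.would_unroll"] = event["payload.depth"] > 1
    # ... the existing ingest path continues unchanged ...

event = {"customer.id": "cust_42"}
handle_ingest({"user": {"address": {"city": "St. Louis"}}}, event)
print(event)  # {'customer.id': 'cust_42', 'payload.depth': 3, 'payload.would_unroll': True}
```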
Starting point is 00:48:20 It should be lightweight. It should be something that developers do because they have this new question. They want to see what happens rather than some like big, hairy process that involves lots of ops people stroking their chins to figure out whether we should do it or not. And so the developer in this case was able to just ask this question almost in parallel with the code that he was writing. A kind of more concrete and more fully baked version of this. It's a company called GeckoBoard in London, and they are a very data-driven company.
Starting point is 00:48:55 And at one point, they wanted to build a new feature that part of it reduced down to the bin packing problem, right? NP-complete problem. Their engineers probably could have spent quite a bit of time coming up with a perfect implementation of this NP-complete problem. Their engineers probably could have spent quite a bit of time coming up with a perfect implementation of this NP-complete problem. And their PM was like, well, let's just test a couple of quick implementations against our real production workload, capture the results, don't expose it to customers, and run it for a day. And then we can see which implementation of this algorithm performed best. And they did it. And they were able to pick one and throw the other two away and then move
Starting point is 00:49:33 forward. And by running this sort of experiment in production, by making production not feel like, Oh, that's what happens when the code is fully baked, but is instead part of the development process, they were able to move much faster and be more confident that the implementation they eventually went with is one that would serve their needs.
Starting point is 00:49:55 It seems, too, as you draw back the observability, the superpower is having more eyes on the data, or in this case case an experiment around assumptions right like it's you're no longer a lone ranger uh isolated you now have your entire team's eyes on the same data the same data set and you're no longer alone yeah yeah it's more eyes it's smarter eyes it's eyes that can see deeper into the code. And I think previously, you know, rewind five or 10 years,
Starting point is 00:50:30 you had the ops people watching graphs and the developers shipping code. And with these efforts around observability, around making these tools able to talk about the terms that developers care about, you're able to invite developers over to this part of the room, invite developers to watch and think and be like, oh, I noticed this thing.
Starting point is 00:50:51 I, as a developer, have context that you, as an ops person, doesn't. Great. Now we can improve this. Now we can react faster or know better or take this learning from production and feed it into our whole development team. One of the things that, um, you know, at a simple level, a lot of monitoring tools can't handle well is being able to break metrics down by build ID as a developer, knowing whether my change from, you know,
Starting point is 00:51:20 knowing for sure whether my change was included in a specific change or, or drop in a graph, that is the most useful thing because that tells me whether I need to care or not. And don't get me. My, my commit wasn't in there. Yeah. All right. I mean, that's, that's the like, not my problem version of it, but it's like that. That's how you start to really directly attribute like, Oh, okay. What I did had an impact. And often that's a good start to really directly contribute like oh okay what i did had
Starting point is 00:51:45 an impact and and often that's a good thing right like oh great performance thing i shipped did do what i expected it to do otherwise you're like okay well did the build go out at this time i think times line up or all the machines on this new build like there's some uncertainty there and it's it's about being able to see and understand really what your code is doing rather than just sort of guessing at abstract signals and our team's ability to execute on those desires and potentially even create something that can make money. So it's a multifaceted job. I might think that observability might even be a superpower for a product manager or somebody in charge of engineering because you now have more resilient code. You have less issues, and that means it's more cost effective to actually run your team and your code.
Starting point is 00:52:50 So it's a business problem more than just simply a developer's superpower. This is very true. That particular angle tends to go across less well at developer conferences. But certainly, that's the appeal, right? It's the recognizing that. And this is why I get so excited about framing the question from the perspective of how does it impact customers? What is the business impact of it?
Starting point is 00:53:14 Because that's what gets other people in the organization looking and paying attention and supporting it. Product managers, product analytics are their whole own thing. And product managers need very advanced tools to make funnels, retention, and all that sort of things. But there is so much power in them being able to share and understand and ask questions in the same playground and using the same tools that the engineers do. At Honeycomb, admittedly, we're a small team,
Starting point is 00:53:42 but our support folks use the same tools that engineering does to verify, oh, yeah, this customer is saying they're seeing this thing. They are seeing this thing. It looks like this. Oh, hey, engineering, this thing is happening. And now that handoff is able to be a lot more informed and educated. Our product managers are able to ask questions like, okay, if we make this improvement, which customers is it going to impact right away? And I think that there are a lot of things that I think of
Starting point is 00:54:14 as something that really benefits engineers. Running queries, being able to feel fast and iterative, those qualities really benefit anyone who's adjacent to the product development process, whether you're a product manager or a support person. The ability to ask questions of your systems in production is not constrained to engineering disciplines at all. It's people who care about how that software is behaving. We even, you know, we have a couple of nonprofits who are using us where they have these honeycomb to spit out some graphs that their chief donation officer cares about,
Starting point is 00:54:53 because they just happen to be able to incorporate the entities that the chief donation officer cares about donors or donation amounts in with the same data that they use to assess operational stability. Can you imagine if you're running a donation platform and you can say things like, you were able to get, you were able to tease apart some correlation between things that, donations that were slow, donations that are large, and you can literally quantify the business value immediately of an engineering work that you're doing. And that sort of thing I feel like is the holy grail of different parts of an engineering organization
Starting point is 00:55:30 being able to really understand their impact rather than just, I made this thing faster because I wanted to. Right. There was actually some true effect on the business. And I would even dare say the users too because they obviously got more excited about whatever they're doing in terms of donating and they're able to do it. I'm pulling a little quote from your white paper that you referenced earlier, the white paper on this framework.
Starting point is 00:55:57 I'm pulling a little quote from your white paper that you referenced earlier, the white paper on this framework. It says the acceleration of complexity in production systems means that it's not a matter of if your organization will need to invest in building your observability practice, but when and how. And, you know, systems are getting more and more complex. And as we just said before, there's business-case value in having some of this instrumentation in place to capture this data and give more than just one set of eyes the ability to see a problem. It's not a matter of if, it's a matter of when, because most things are moving to the cloud. Most things are becoming more and more distributed. There doesn't seem to be a downside in regards to the data collection
Starting point is 00:56:38 like there is on the business intelligence side. Just thinking back to that dichotomy of, we're doing the same things in different areas. Whereas on the business intelligence side, you have the creepy factor of tracking people and doing too much. Maybe there is even on the observability side, on infrastructure. Maybe you can speak to that, Christine. But it seems like, aside from the scalability problems of collecting too much data,
Starting point is 00:57:06 you don't have the privacy and security concerns that you would on the front end. Do you think that's a fair statement? Or are there still concerns with regards to the privacy and security of your customers in the server-side analytics that maybe happen here with observability? I think that there's still some risk there. If you're Stripe or something, at some point in the code, you probably do have some variable that holds some sensitive PII. And I think that there are a number of different laws
Starting point is 00:57:44 as well as internal practices that allow people to protect that data. And certainly, with great power comes great responsibility. And when you make it very easy for developers to capture any metadata that they might find interesting, that tends to be something that organizations need to keep an eye on as well. Hey, let's make sure not to send personal addresses in plain text somewhere. Yeah, that's fair.
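As a rough illustration of the "don't send personal addresses in plain text" point, here is one possible approach, again a hypothetical Python sketch rather than a prescribed Honeycomb feature; the SENSITIVE_FIELDS list and the scrub helper are assumptions. The idea is to hash or drop known-sensitive fields in code the team controls, before an event ever leaves the application.

```python
# Hypothetical sketch: scrub known-sensitive fields from an event before it
# is sent to any observability backend.
import hashlib

# Assumed, illustrative list of fields this app treats as sensitive.
SENSITIVE_FIELDS = {"email", "street_address", "card_number"}


def scrub(event: dict) -> dict:
    # Replace sensitive values with a truncated hash so events can still be
    # grouped per user without shipping the raw value.
    cleaned = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS:
            cleaned[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    raw = {"name": "donation.create", "email": "donor@example.com", "amount_cents": 5000}
    print(scrub(raw))  # the email value is now an opaque token
```

Hashing is only one option; for low-entropy values like email addresses, a salted hash or dropping the field entirely may be more appropriate. The point is that the scrubbing happens before the data reaches any vendor.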
Starting point is 00:58:12 Well, coming to a Strange Loop near you, right? All this and more. Super excited about it. I'm excited about this. You mentioned this is your first time at Strange Loop. This is Jerod and I's first time at Strange Loop. We have something tentative on the stage. We're still not sure what that is, but if you're listening to this and you're going to Strange Loop, then, hey, you might see us on stage
Starting point is 00:58:32 and you can see something live. We're thinking about some sort of fireside chats. We're still working through the details, but it's a lot of fun. You can definitely see Christine live as she gives her talk, Observability: Superpowers for Developers. Let's finish on a really tough question: favorite superhero, roundtable style.
Starting point is 00:58:49 Oh boy. We'll let Christine go first. You're the guest. Favorite superhero? It's got to be Storm. Storm, nice. Halle Berry version or comic book version? I think comic book version. I like Halle Berry, but... I don't know if I
Starting point is 00:59:06 can get behind... the Storm in the new reboot generation is pretty cool; she's got her mohawk. I can get behind that, maybe. Nice. Adam, how about you? I don't even... I've known you for a long time, but I've never asked you this very personal question: favorite superhero. I'm going to go super OG, super obscure. Okay, I'm going to say Spawn. Oh. And the reason I'm going to say Spawn is because I'm a huge fan of Todd McFarlane. He was responsible for the reigniting of Spider-Man. He had once, you know, drawn for Marvel.
Starting point is 00:59:41 And so a lot of the modern look of Spider-Man can be attributed to Todd. And there was this sort of revolution, so to speak, in the comic world. And he and some others from Marvel broke off and created Image, and Image was the brand under which Spawn was. I just love Todd's art. He's amazing. Yeah, the art style of Spawn was awesome. So Spawn isn't really the best character,
Starting point is 01:00:12 but I think he was well done. The first 20 issues of Spawn were amazing. And as a matter of fact, I own all of them. Mint in their jackets, all that good stuff. What do you do with them? Do you have them up on a bookshelf somewhere? Do you observe them? Tying it back into observability here.
Starting point is 01:00:31 Observability on them. No, they're actually just in a shoebox, tucked away in a closet, dark, away from all the elements. But I've got that plus a ton of other comics that I used to collect. But Spawn is my favorite. Awesome, I never knew that about you. I'm glad I asked. Well, I'll go super boring slash super mainstream slash Superman. Sorry, little Superman. I always have. That's probably just the first superhero that I ever learned about as a child, and he's got all the skills, he's got everything, you know. And yet somehow they still inject drama into the shows and the stories, because he's gotta choose. He's always
Starting point is 01:01:10 gotta choose who you're gonna save. So, just can't... I also like Batman quite a bit, so I'm pretty boring. But Superman. Very cool. Well, since you said Batman, I can say that I'm a huge fan of the most recent Batman trilogy. I think that was probably the best of all Batman, in my opinion. Well, we may be able to save that conversation for an episode of Backstage, as we're now completely ignoring our guest and just talking about movies. No,
Starting point is 01:01:36 no, I will. Since we're going down that road: if anyone on this, either of you or anyone listening to this, hasn't watched Spider-Man Into the Multiverse... oh my gosh.
Starting point is 01:01:47 Love it. It was like, what a great translation of comics to movie. What a great way to tell a story that I was not excited to watch again because I'm like, how many Spider-Mans do we need? But literally they take that and they play with it.
Starting point is 01:02:01 And that was a lot of fun. And in many ways, the inspiration for the title of this talk. So, plus that, but I'll also add to it, because this is append-only, not take away. I will... I'll add, because you said you weren't excited about seeing another Spider-Man. I will agree, until I watched Homecoming. Spider-Man: Homecoming was actually really good. I loved the fact that... and this is super Backstage, so this is extended, but whatever. I loved the fact that they kind of remade the story
Starting point is 01:02:32 with Peter Parker that was a part of the Avengers, and I might be spoiling some of it, but just this whole new aspect that sort of brought it into the Avengers story and kind of gave it more of the bigger-universe Spider-Man appeal than just simply Spider-Man alone. It's crazy to me, like, whenever I meet someone who just hasn't been following the MCU. They do such a good job. You know, with the MCU, you're either all in or you're all out. And the way that they tie the stories together,
Starting point is 01:03:00 they've made it so rewarding for people who have watched all the movies and things. And I was less of a fan of Homecoming, I think, than you are, but definitely still appreciated it. And yeah, I love it too. It wasn't like... I'm so glad that this was just definitely good. But it's kind of weird that we were all kind of over Spider-Man, and then they released back-to-back Spider-Mans, both of which were good. And the one of which, I think it was the Sony production, was the Multiverse one. To me, it was groundbreaking. It was like, this is so amazing and very impressive. Well, now that we've officially turned into Backstage in the ChangeLog, Christine, thank
Starting point is 01:03:40 you so much for sharing your wisdom here and for the work you're doing at Honeycomb. Can't wait to see you at Strange Loop. Looking forward to the talk. Thanks for sharing your time here today. And we appreciate it. Thanks so much for having me. All right. Thank you for tuning into this episode of the Change Log. Hey, guess what? We have discussions on every single episode now. So head to changelog.com to discuss this episode. And if you want to help us grow this show, reach more listeners, and influence more developers,
Starting point is 01:04:10 do us a favor and give us a rating or review in iTunes or Apple Podcasts. If you use Overcast, give us a star. If you tweet, tweet a link. If you make lists of your favorite podcasts, include us in it. Also, thanks to Fastly, our bandwidth partner, Rollbar, our monitoring service,
Starting point is 01:04:26 and Linode, our cloud server of choice. This episode is hosted by myself, Adam Stachowiak, and Jerod Santo, and our music is done by Breakmaster Cylinder. If you want to hear more episodes like this, subscribe to our master feed at changelog.com slash master, or go into your podcast app and search for Changelog Master. You'll find it. Thank you for tuning in this week. We'll see you again soon. Hey, guess what? Brain Science is officially launched. Episode number one is on the feed right now. So head to changelog.com slash brain science to listen, to subscribe, and to join us on this journey of exploring the human mind.
Starting point is 01:05:22 Once again, changelog.com slash brain science or search for brain science in your favorite podcast app.
