Screaming in the Cloud - Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Episode Date: January 15, 2026

When AWS has a major outage, what actually happens behind the scenes? Ben Hartshorne, a principal engineer at Honeycomb, joins Corey Quinn to discuss a recent AWS outage and how they kept customer data safe even when their systems couldn't fully work. Ben explains why building services that expect things to break is the only way to survive these outages. Ben also shares how Honeycomb used its own tools to cut their AWS Lambda costs in half by tracking five different things in a spreadsheet and making small changes to all of them.

About Ben Hartshorne: Ben has spent much of his career setting up monitoring systems for startups and now is thrilled to help the industry see a better way. He is always eager to find the right graph to understand a service and will look for every excuse to include a whiteboard in the discussion.

Show highlights:
(02:41) Two Stories About Cost Optimization
(04:20) Cutting Lambda Costs by 50%
(08:01) Surviving the AWS Outage
(09:20) Preserving Customer Data During the Outage
(13:08) Should You Leave AWS After an Outage?
(15:09) Multi-Region Costs 10x More
(18:10) Vendor Dependencies
(22:06) How LaunchDarkly's SDK Handles Outages
(24:40) Rate Limiting Yourself
(29:00) How Much Instrumentation Is Too Much?
(34:28) Where to Find Ben

Links:
LinkedIn: https://www.linkedin.com/in/benhartshorne/
GitHub: https://github.com/maplebed

Sponsored by: duckbillhq.com

Transcript
Starting point is 00:00:00 For all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks. I'll give you the story of: we use LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag default in code. And if we can't reach our service, we'll go ahead and use those. And if we can reach our service, great, we'll update them. And if we can update them once, that's great.
Starting point is 00:00:37 If we can connect to the streaming service, even better. And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that in the event of a service unavailability, things will continue to work made the recovery process all that much better. And even when their service was unavailable and ours was still running,
Starting point is 00:01:07 the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream, suddenly I can't give you an answer anymore. No, the SDK is built with that idea of local caching so that it can continue to serve the correct answer, so far as it knew, from whenever it lost its connection. Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is one of those folks that I am disappointed I have not had on the show until now, just because I assumed I already had. Ben Hartshorne is a principal engineer at Honeycomb, but oh so much more than that. Ben, thank you for deigning to join us.
Starting point is 00:01:50 It's lovely to be here this morning. This episode is sponsored in part by my day job, Duck Bill. Do you have a horrifying AWS bill? That can mean a lot of things. predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillhq.com. Remember, you can't duck the duck bill bill, which my CEO reliably informs me is absolutely not our slogan. So you gave a talk about roughly a month ago at the inaugural FinOps Meetup in San Francisco.
Starting point is 00:02:38 Give us the high level. What did you talk about? Well, I got to talk about two stories. I love telling stories. I got to talk about two stories of how we used honeycomb and instrumentation to help optimize our cloud spending. A topic near and dear to your heart
Starting point is 00:02:54 is what brought me there. We got to look at the overall bill and say, hey, where are some of the big things coming from? Obviously, it's people sending us data and people asking us questions about those data. And if they would just stop both of those things, your bill would be so much better. It would be so much smaller. So at my salary, unfortunately. So we wanted to reduce some of those costs, but it's a problem that's hard to get into just from like a general perspective. You need to really get in and look at all the details to find out
Starting point is 00:03:28 what you're going to change. So I got to tell two stories of reducing costs. One, by switching from Andy to Arm architecture for Amazon. That's the Graviton chip set, which is fantastic. And the other was about the amazing power of spreadsheets. As much as I love graphs, I also love spreadsheets. I'm sorry, it's a personal failing, perhaps. It's wild to me how many tools out there. Do all kinds of business adjacent things, but somehow never bother to realize that if you can just export and CSV, suddenly you're speaking kind of the language of your ultimate user, play up with pandas a little bit more and spit out an actual Excel file, and now you're cooking with gas? So the second story is about doing that with honeycomb, taking a number of different
Starting point is 00:04:21 graphs and looking at five different attributes of our Lambda costs and what was going into them and making changes across all of them in order to accomplish an overall cost reduction about 50%, which is really great. So the story, it does combine my love of graph because we got to see the three lines go down, the power of spreadsheets, and also this idea that you can't just look for one answer to find the solution to your problems around, well, anything really. but especially around reducing costs, it's going to be a bunch of small things
Starting point is 00:05:03 that you can put together into one place. There's a lot that's valuable when we start going down that particular path of starting to look at things through a lens of a particular kind of data that you otherwise wouldn't think to. I maintain that you remain, the only customer we have found so far
Starting point is 00:05:23 that uses honeycomb to completely instrument their AWS bill. We had not seen that before or since. It makes sense for you to do it that way, absolutely. It's a bit of a heavy lift for, shall we say, everyone else. And it actually is a bit of a lift for us to say we've instrumented the entire bill is a wonderful thing to assert. And as we've talked about, we use the power of spreadsheets too.
Starting point is 00:05:55 So there are some aspects. There is that. There's some aspects of our ATABus spending and actually really dominant ones that lend themselves very easily to be described using honeycomb. The best example is Lambda, because Lambda is charged on a per millisecond basis and our instrumentation is collecting spans, traces, about your compute on a per millisecond basis. There's a very easy translation there. And so we can get really good insight into which customers are spending how much, or rather, which customers are causing us to spend how much in order to provide our product to them and understand how we can balance our development resources to both provide new features
Starting point is 00:06:44 and also understand when we need to shift and spend our attention managing costs instead. There's a continuum here. And I think that it tends to follow a lot around company ethos and company culture here, where folks have varying degrees of insight into the factors that drive their cloud spent. You are clearly an observability company. You have been observing your AWS bill for, I would argue, longer than it would have made sense to on some level. In the very early days, you were doing this.
Starting point is 00:07:16 And your AWS bill was not the limiting factor to your, company's success back in those days. But you did grow into it. Other folks, even at very large enterprise scale, more or less do this based on vibes. And most folks, I think, tend to fall somewhere in the middle of this, but it's not evenly distributed. Some teams tend to have a very deep insight into what they're doing. And others are Amazon bill. You mean the books? Again, most tend to fall somewhere center of that. It's law of large numbers. Everything starts to revert to a mean past a certain point. Well, I mean, you wouldn't have a job if they didn't make it a bit of a challenge to do so.
Starting point is 00:07:56 Or I might have a better job, depending. But we'll see. I do want to detour a little bit here because as we record this, it is the day after AWS's big significant outage. I could really mess with the conspiracy theorists and say it is their first major outage of October of 2025. And then people like, wait, what do you mean? What do you mean? This is World War I? Like, same type of approach.
Starting point is 00:08:20 But these things do tend to cluster. How was your day yesterday? Well, it did start very early. Our service has presence in multiple regions, but we do have our main U.S. instance in U.S. East One. And so as things stopped working, a lot of our service stopped working too. Not all. I mean, the outage was significant but wasn't pervasive.
Starting point is 00:08:52 There were still some things that kept functioning. And amazingly, we actually preserved all of the customer telemetry that made it to our front door successfully, which is a big deal because we hate dropping data. Yeah, it is. That took some work in engineering. And I have to imagine this was also not an accident. It was not an accident. Now, their ability to query that data during the outage that suffered. I'm going to push back on you on that for a second there.
Starting point is 00:09:20 When AWS is U.S. East 1, where you have a significant workload, is impacted to this degree, how important is observability? I know that when I've dealt with outages in the past, there's the first thing you try and figure out of, is it my shitty, shitty code or is it a global issue? That's important. And once you establish it's a global issue, then you can begin the mitigation part of that process. And yes, observability becomes extraordinarily important there. for some things. But for others, it's, there's also, at least with the cloud being as big as it is now, there's some reputational headline risk protection here in that no one is talking about your site going down in some weird ways yesterday. Everyone's talking about AWS going down. They own
Starting point is 00:10:05 the reputation of this. Yeah. That's true. And also, when a business's customers are asking them, which parts of your service are working. I know AADOAS is having a thing. How bad is it affecting you? You want to be able to give them a solid answer. So our customers were asking us yesterday, hey, are you dropping our data? And we wanted to be able to give them a reasonable answer,
Starting point is 00:10:31 even in the moment. So, yes, we're able to deflect a certain amount of the reputational harm. But at the same time, there are people that have come back and say, well, I mean, shouldn't you have done? better. It's important for us to be able to rebuild our business and to move region to region. And we need you to help us do that too. Oh, absolutely. I actually encountered a lot of this yesterday when I, early in the morning, tried to get a, what was it, a Halloween costume? And Amazon site was not working properly for some strange reason. Now, if I read some of the relatively out of touch
Starting point is 00:11:07 analyses in the mainstream press, that's billions and billions of dollars lost. Therefore, I either went to go get a Halloween costume from another vendor, or I will never wear a Halloween costume this year, better luck in 2026. Neither of those is necessarily true. And that's really exactly why we're, we were focused on preserving successfully storing our customers data in the moment, because then when the time comes afterwards, they're like, okay, now we, we said what we said in the moment. Now they're asking us, okay, what really happened? That data is invaluable in helping our customers piece together which parts of their services were working and which weren't at what times. Did you see a drop in telemetry during the outage? Yeah, for sure. Is that because people's
Starting point is 00:11:55 systems were down or is that because their systems could not communicate out? Both. Excellent. We did get some reports of, from our customers that they're specifically, the open telemetry collector that was gathering the data from their application, was unable to successfully send it to Honeycomb. At the same time, we were not rejecting it. So clearly there were challenges in the path between those two things, whether that was an AWS network, in some other network unable to get to AWS, I don't know. So we definitely saw there were issues of reachability, and so undoubtedly there was some data
Starting point is 00:12:37 dropped there. That's completely out of our control. So the only part we could say is once the data got to us, we were able to successfully store it. So the question is, was it customers' apps going down? Absolutely. Many of our customers were down, and they were unable to send us on eCellometry because their app was offline. But the other side is also true. The ones that were up were having trouble getting to us because of our location in U.S. East. Now, to continue reading what the mainstream press had to say about this, does that mean that you are now actively considering evacuating AWS entirely to go to a different provider that can be more reliable, probably building your own data centers?
Starting point is 00:13:23 Yeah, you know, I've heard people say that's the thing to do these days. Now, I have helped build data centers in the past. As have I. There's a reason that both of us have a job that does not involve that. There is. The data centers I built were not as reliable as any of the data centers that are available from our big public cloud providers. I would have said, unless you worked at one of those companies building the data centers, and even back then, given the time you've been at Honeycomb, I can say with a certainty, you are not as good at running data centers as they are, because effectively no one is. This is something that you get to learn about at significant scale.
Starting point is 00:13:56 The concern is I see it as one of consolidation, but I've seen too many folks try and go multi-cloud for resilience reasons. And all they've done is, they added a second single point of failure. So now they're exposed to everyone's outage. And when that happens, their site continues to fall down in different ways, as opposed to being more resilient, which is a hell of a lot more than just picking multiple providers. There is something to say, though, of looking at a business and saying,
Starting point is 00:14:20 okay, what is the cost for us to be, you know, single region versus what is the cost to be fully, you know, multi-region where we can fail over an instant and nobody notices? those costs differences are huge. And for most businesses... Of course, it's a massive investment, at least 10x. Yeah. So for most businesses, you're not going to go that far. My newsletter publication is entirely bound within U.S. West 2.
Starting point is 00:14:48 Because if that goes down, that just happened to be for latency purposes, not reliability reasons. But if the region is hard down and I need to send an email newsletter and it's down for several days, I'm writing that one by hand because I've got a different story to tell that week. I don't need it to do the business as usual thing. And that's a reflection of architecture and investment decisions reflecting the reality of my business. Yes. And that's exactly where to start.
Starting point is 00:15:11 And there are things you can do within a region to increase a little bit of resilience to certain services within that region suffering. So as an example, I don't remember how many years ago it was, but Amazon had an outage in KMS, the key management service. and that basically made everything stop. You can probably find out exactly when it happened. Yes, I'm pulling that up now. Please continue. I'm curious now.
Starting point is 00:15:38 They provide a really easy way to replicate all of your keys to another region and a pretty easy way to fail over accessing those keys from one region to another. So even if you're not going to be fully multi-region, you can insulate against individual services that might have an incident and prevent those one services from having an outsized impact on your application. We don't need their keys most of the time, but when you do need them,
Starting point is 00:16:05 you kind of need them to start your application. So if you need to scale up or do something like that and it's not available, you're really out of luck. So the thing is, I don't want to advocate that people try and go fully multi-region, but that's not to say that we advocate all responsibility for insulating our application from having transient outages in our dependencies.
Starting point is 00:16:27 Yeah. To be clear, they did not do a formal write-up on the KMS issue on their basically kind of not terrific list of outpost-event summaries. Things have to be sort of noisy for that to hit. I'm sure yesterdays will wind up on that list once they have. They're probably got that up before this thing publishes. But, yeah, they did not put the KMS issue there. You're completely correct. It's a, this is the sort of thing of what is the, what is the blast radius of these issues?
Starting point is 00:16:58 And I think that there's this sense that before we went in the cloud, everything was more reliable, but just the opposite is true. The difference was, is that if we were all building our data centers, today, my shitty stuff at Duckville is down as it is every, you know, every random Tuesday. And tomorrow, Honeycomb is down because, oops, it turns out you once again are forgotten to replace a bad hard drive. Cool. But those are not happening at the same time.
Starting point is 00:17:25 When you start with this centralization story, suddenly a disproportionate swath of the world is down simultaneously, and that's where things get weird. It gets even harder, though, because you can test your durability and your resilience as much as you want, but it doesn't account for the challenge of third party providers on your critical path. You obviously need to make sure that in order to honeycomb to work. Honeycomb itself has to be up. That's sort of step one. But to do that, AWS itself. has to be up in certain places. What other vendors factor into this?
Starting point is 00:17:59 You know, that was, I think, the most interesting part of yesterday's challenge, bringing the service back up, is that we do rely on an incredible number of other services. There's some list of all of our vendors that is hundreds of long. Now, those are obviously very different parts of the business. They involve, you know, companies we contract with for marketing outreach and for business and for all of that. Right. We use Dropbox here.
Starting point is 00:18:24 And if Dropbox is down, that doesn't necessarily impact our ability to wind up serving our customers. But it does mean I need to find a different way, for example, to get the recorded file from this podcast over to my editing team. Yeah. So there's a very long list. And then there's the much, much shorter list of vendors that are really in the critical past. And we have a bunch of those too. We use vendors for feature flagging and for sending email and for some other forms of telemetry that are destined for other spots. For the most part, when we get that many vendors all relying on each other, and they're all down at once, there's this bootstrapping problem, where they're all trying to come back, but they all sort of rely on each other in order to come back
Starting point is 00:19:12 successfully. And I think that's part of what made yesterday morning's outage move from roughly what like midnight to 3 a.m. Pacific all the way through the rest of the day and still have issues with with some companies up until 5, 6, 7 p.m. This episode is sponsored by my own company, Duck Bill. Having trouble with your AWS bill, perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duck Bill comes in to help. Remember, you can't duck the Duck Bill bill, which I am reliably informed by my business partner,
Starting point is 00:19:56 is absolutely not our motto. To learn more, visit DuckbillHQ.com. The Google SRE book talked about this, oh, geez, when was it? 15 years ago now, damn near, that at some point when a service goes down and then it starts to recover, everything that depends on it will often basically pummel it. back into submission trying to talk to the thing. It's a, like, I remember back when I worked at, as a senior systems engineer at Media Temple in the days before GoDaddy bought and then ultimately killed them.
Starting point is 00:20:29 They, I was torn the data center my first week. We had, we had three different facilities. I was in one of them. And I asked, okay, great. I just trip over things and hit the emergency power off switch. Great. And kill the entire data center. There is an order that you have to bring things back up in the event of those catastrophic
Starting point is 00:20:45 outages. Is there a runbook? Of course there was. Great. Where is it? Oh, it's not confluence. Terrific. Where's that?
Starting point is 00:20:50 Oh, in the rack over there. And I looked at the data center manager. And she was delightful and incredibly on her point. And she knew exactly where I was going to print that out right now. Excellent. Excellent. Like that's why you ask. It's someone who has never seen it before but knows how these things were going through
Starting point is 00:21:06 that because you build dependency on top of dependency. And you never get the luxury of taking a step back and looking at it with fresh eyes. But that's what our industry has done. Like you have your vendors that have their own critical. dependencies that they may or may not have done as good a job as you have of identifying those and so on and so forth. The end of a very long chain that does kind of eat itself at some point. Yeah, there are two things that that brings to mind. First, we absolutely saw exactly what you're describing yesterday in our traffic patterns where the volume of incoming traffic would sort of come along and then it would
Starting point is 00:21:36 drop as their services went off and then it's quiet for a little while. And then we get this huge spike as they're trying to like, you know, bring everything back on all at once. Thankfully, those were sort of spread out across our customers. So we didn't have like just one enormous spike hit all of our servers. But we did see them on a per customer basis. It's a very real pattern. But the second one, for all of these dependencies, there are clearly several who have built their system with this challenge in mind
Starting point is 00:22:09 and have a series of different fallbacks. and I'll give you the story of, we used LaunchDarkly for our feature flagging. Their service was also impacted yesterday. One would think, oh, we need our feature flags in order to boot up. Well, their SDK is built with the idea that you set your feature flag default in code. And if we can't reach our service, we'll go ahead and use those.
Starting point is 00:22:36 And if we can reach our service, great. We'll update them. And if we can update them once, that's great. If we can connect. to the streaming service even better. And I think they also have some more bridging in there, but we don't use the more complicated infrastructure. But this idea that they designed the system with the expectation that in the event of a service unavailability,
Starting point is 00:23:01 things will continue to work, made the recovery process all that much better. And even when their service was unavailable and ours was still running, the SDK still answers questions in code for the status of all of these flags. It doesn't say, oh, I can't reach my upstream. Suddenly, I can't give you an answer anymore. No, the SDK is built with that idea of local caching so that it can continue to serve the correct answer so far as it knew from whenever it lost its connection.
Starting point is 00:23:32 But it means that if they have a transient outage, our stuff doesn't break. And that kind of design really makes recovering from these. like interdependent outages, feasible in a way that the strict ordering you were describing just is really difficult. At least in my case, I have the luxury of knowing these things just because I'm old. And I figured this out before it was SRE common knowledge or SRE was a widely acknowledged thing, where, okay, you have a job server that runs cron jobs every day. And when it turns out that, oh, and you found it missed a cron job,
Starting point is 00:24:08 oops-doosy, that's a problem for some of those things. So now you start building in error checking and the rest. And then you do a restore for three days ago from backup for that thing. And it suddenly thinks it missed all the cron jobs and runs them all. And then hammers some other system to death when it shouldn't. And you learn iteratively of, oh, that's kind of a failure mode. Like when you start externalizing and hardening APIs, you learn very quickly, everything needs a rate limit. And you need a way to make bad actors stop hammering your endpoints.
Starting point is 00:24:38 And not just bad actors, naive ones. And rate limits are a good example because that is one of the things that did happen yesterday as people were coming back. We actually wound up needing to rate limit ourselves. We didn't have to rate limit our customers, but because, so brief digression here, honeycomb uses honeycomb in order to build honeycomb. We are our own observability vendor. Now, this leads to some obvious challenges in architecture. how do we know we're right? Well, in the beginning, we did have some other services that we'd use to checkpoint
Starting point is 00:25:16 our numbers and make sure they were actually correct. But our production instance sits here and serves our customers, and all of its telemetry goes into the next one down the chain. We call that dog food because we are, you know, the whole phrase of eating your own dog food, drinking your own champagne is the other more pleasing version. So from our production, it goes to dog food. Well, what's dog food made of? It's made up of kibble. So our third environment is called kibble. So the dog food telemetry, it goes into this third environment. And that third environment, well, we need to know if it's working too. So it feeds back into our production instance. Each of these instances is emitting telemetry. And we have our rate limiting, I'm sorry, our tail sampling proxy called refinery that helps us reduce volume. So it's not a positively amplifying cycle.
Starting point is 00:26:08 But in this incident yesterday, we started emitting logs that we don't normally emit. These are coming from some of our SDKs that were unable to reach their services. And so suddenly we started getting two or three or four log entries for every event we were sending and did get into this kind of amplifying cycle. So we put a pretty heavy rate limit on the kibble environment in order to squash that traffic and disrupt the cycle, which made it difficult to ensure that dog food was working correctly, but it was. And that let us make sure that the production instance is working all right. But this idea of rate limits being a critical part of maintaining an interconnected stack
Starting point is 00:26:59 in order to suppress these kind of wave-like, formations, the oscillations that start growing on each other and amplifying themselves, can take any infrastructure down and being able to put in just the right point, a little, a couple of switches and say, nope, suppress that signal, really made a big difference in our ability to bring back all of the services. I want to pivot to one last topic. We could talk with this outage for days and hours. But there's something that you mentioned you wanted to go into that I wanted to pick a fight
Starting point is 00:27:33 with you over, was how to get people to instrument their applications for observability so they can understand their applications, their performance, and the rest. And I'm going to go with the easy answer because it's a pain in the ass, Ben. Have you tried instrumenting an application that already exists without having to spend a week on it? I have. And you're not wrong. It's a pain in the and it's getting better. There's lots of ways to make it better. There are packages that do auto instrumentation. Oh, absolutely.
Starting point is 00:28:07 For my case, yeah, it's Claude Codd Codd's problem. Now I'm getting another drink. You know, you say that in jest, and yet they are actually getting really good. Yeah. No, that's what I've been doing. It works super well. You test it first, obviously, but yeah.
Starting point is 00:28:23 You know, YOLO slammed that into production, but yeah. The LLMs are actually getting pretty good at understanding where instrumentation can be useful. I say understanding. I put that in their quotes. They're good at finding code that represents a good place to put instrumentation
Starting point is 00:28:37 and adding it to your code in the right place. Yeah, I need to take another try one of these days. The last time I played with Honeycomb, I instrumented my home Kubernetes cluster, and I exceeded the limits of the free tier based on ingest volume by the second day of every month. And that led to either.
Starting point is 00:28:55 You have really unfair limits, which I don't believe to be true, or the more insightful question, what the hell is my Kubernetes cluster doing that's that chatty? So I rebuilt the whole thing from scratch, so it's time for me to go back and figure that out. Yeah, so I will say a lot of instrumentation is terrible. A lot of instrumentation is based on this idea
Starting point is 00:29:17 that every single signal must be published all the time. And that's not relevant to you as a person, and running the Kubernetes cluster. Do you need to know every time a local pod checks in to see whether it needs to be evicted? No, you don't. What you're interested in are the types
Starting point is 00:29:42 of activities that are relevant to what you need to do as an operator of that cluster. And the same is true of an application. If you just put in the tracing language, put a span on every single function call,
Starting point is 00:29:58 you will not have useful traces because it doesn't map to a useful way of representing your user's journey through your product. So there's definitely some nuance to getting the right level of instrumentation. And I think the right level, it's not a single place. It's a continuously moving spectrum based on what you're trying to understand about what your application is doing. So at least at honeycomb, we add instrumentation all the time and we remove instrumentation all the time. Because what's relevant to me now as I'm building out this feature is different from what I need to know about that feature once it is fully built and stable and running in a regular workload. Furthermore, as I'm looking at a specific problem or question, we talked about pricing for Lambda's at the beginning of this. There was a time when we really wanted to understand pricing for S3.
Starting point is 00:30:56 And part of our model, it's a struggle. Part of our storage model is that we store our customer's telemetry in S3 in many files. And we put instrumentation around every single S3 access in order to understand both the volume and the latency of those to see like, okay, should we bundle them up or resize it like this? And how does that influence? So it's so on. And it's incredibly expensive to do that kind of experiment. And it's not just expensive in dollars.
Starting point is 00:31:26 Adding that level of instrumentation does have an impact on the overall performance of the system. When you're making 10,000 calls to S3 and you add a span around every one, it takes a bit more time. So once we understood the system well enough to make the change we wanted to make, we pulled out that back out. So for your Kubernetes cluster, you know, maybe it's interesting at the very beginning. to look at every single connection that any process might make. But if it's your home cluster, that's not really what you need to know as an operator. So finding the right balance there of instrumentation that lets you fulfill the needs of the business, that lets you understand the needs of the operator in order to best be able to provide the service
Starting point is 00:32:15 that this business is providing to its customers. it's a place somewhere there in the middle and you're going to need some people to find it. And that's easier said than done for a lot of folks. But you're right, it is getting easier to instrument these things. It is something that is iteratively getting better all the time. To the point where now, this is an area where AI is surprisingly effective. It doesn't take a lot to wrap a function call with a decorator.
Starting point is 00:32:43 It just takes a lot of doing that over and over and over again. You do a lot of them, and you see what it looks like, and then you see, okay, which ones of these are actually useful for me now and take out some others, and that's going to change. And we want to be open to that changing and willing to understand that this is an evolving thing. And this does actually tie back to one of the core operating principles of modern SaaS architectures, the ability to deploy your code quickly. because if you're in this cycle of adding instrumentation
Starting point is 00:33:18 or removing instrumentation, you see a bug. It has to be easy enough to add a little bit more data to get insight into that bug in order to resolve it. And if it's not, you're not going to do it and the whole business is going to suffer for it. What is quickly to you? I'd like to see it in between I need to make this change
Starting point is 00:33:40 and it's visible in my test environment, a couple of minutes. I need to make this change and have it visible running in production. It depends on how much the, how frequent the bug comes, but I'm actually okay with it being about an hour for that kind of turnaround. I know a lot of people say you should have your code running in 15 minutes. That's great. I know that's out of reach for a lot of people and a lot of industries.
Starting point is 00:34:06 So I'm not a hardliner on how quickly it has to be. But it can't be a week. it can't be a day. That's just like, you're going to want to do this two or three times in the course of resolving a bug. And so if it's something too long, you're just really pushing out any ability to respond quickly to a customer. I really want to thank you for taking the time to speak with me about all this. If people want to learn more, where's the best place for them to go? You know, I have backed off of almost all of the platforms in which people carry on conversations in the internet.
Starting point is 00:34:42 Everyone seems to have done this. I did work for Facebook for two and a half years. And someday I might forgive you. Someday I might forgive myself. It was a really different environment. And I could see the allure of the world they're trying to create. And it doesn't match. Oh, I interviewed there in 2009.
Starting point is 00:35:06 It was incredibly compelling. It doesn't match the view that I see of the world. in. And so I have a presence at Honeycomb. I do have accounts on all of the major platforms. So you can find me there. There will be links afterwards, I'm sure. But LinkedIn, Blue Sky, I don't know, GitHub. Is that a social media platform now? They wish. We'll put all this in the show notes. Problem solved for us. Thank you so much for taking the time to speak with me. I appreciate it. It's a real pleasure. Thank you. Ben Hartzhorn is the principal engineer at Honeycomb.
Starting point is 00:35:45 One of the possibly might have more than one. It seems to be something you can scale, unlike my nonsense, as Chief Cloud Economist at the Doc Bill Group. And this is screaming in the cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice,
Starting point is 00:36:05 along with an insulting comment that won't work because that platform is down and not accepting comments at this moment. I'm just
