Screaming in the Cloud - Building Systems That Work Even When Everything Breaks with Ben Hartshorne
Episode Date: January 15, 2026

When AWS has a major outage, what actually happens behind the scenes? Ben Hartshorne, a principal engineer at Honeycomb, joins Corey Quinn to discuss a recent AWS outage and how they kept customer data safe even when their systems couldn't fully work. Ben explains why building services that expect things to break is the only way to survive these outages. Ben also shares how Honeycomb used its own tools to cut their AWS Lambda costs in half by tracking five different things in a spreadsheet and making small changes to all of them.

About Ben Hartshorne: Ben has spent much of his career setting up monitoring systems for startups and now is thrilled to help the industry see a better way. He is always eager to find the right graph to understand a service and will look for every excuse to include a whiteboard in the discussion.

Show highlights:
(02:41) Two Stories About Cost Optimization
(04:20) Cutting Lambda Costs by 50%
(08:01) Surviving the AWS Outage
(09:20) Preserving Customer Data During the Outage
(13:08) Should You Leave AWS After an Outage?
(15:09) Multi-Region Costs 10x More
(18:10) Vendor Dependencies
(22:06) How LaunchDarkly's SDK Handles Outages
(24:40) Rate Limiting Yourself
(29:00) How Much Instrumentation Is Too Much?
(34:28) Where to Find Ben

Links:
LinkedIn: https://www.linkedin.com/in/benhartshorne/
GitHub: https://github.com/maplebed

Sponsored by: duckbillhq.com
Transcript
For all of these dependencies, there are clearly several who have built their system with this challenge in mind and have a series of different fallbacks.
I'll give you the story of how we used LaunchDarkly for our feature flagging.
Their service was also impacted yesterday.
One would think, oh, we need our feature flags in order to boot up.
Well, their SDK is built with the idea that you set your feature flag defaults in code.
And if we can't reach our service, we'll go ahead and use those.
And if we can reach our service, great, we'll update them.
And if we can update them once, that's great.
If we can connect to the streaming service, even better.
And I think they also have some more bridging in there,
but we don't use the more complicated infrastructure.
But this idea that they designed the system with the expectation that in the event of
a service unavailability, things will continue to work,
made the recovery process all that much better.
And even when their service was unavailable
and ours was still running,
the SDK still answers questions in code
for the status of all of these flags.
It doesn't say, oh, I can't reach my upstream.
Suddenly, I can't give you an answer anymore.
No, the SDK is built with that idea of local caching
so that it can continue to serve the correct answer
so far as it knew from whenever it lost its connection.
Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is one of those folks that I am disappointed I have not had on the show until now, just because I assumed I already had. Ben Hartshorne is a principal engineer at Honeycomb, but oh so much more than that. Ben, thank you for deigning to join us.
It's lovely to be here this morning.
This episode is sponsored in part by my day job, Duckbill. Do you have a horrifying AWS bill? That can mean a lot of things:
predicting what it's going to be, determining what it should be, negotiating your next long-term
contract with AWS, or just figuring out why it increasingly resembles a phone number, but
nobody seems to quite know why that is. To learn more, visit duckbillhq.com. Remember, you can't duck
the Duckbill bill, which my CEO reliably informs me is absolutely not our slogan. So you gave a talk
about roughly a month ago
at the inaugural FinOps
Meetup in San Francisco.
Give us the high level.
What did you talk about?
Well, I got to talk about two stories.
I love telling stories.
I got to talk about two stories
of how we used Honeycomb and instrumentation
to help optimize our cloud spending.
A topic near and dear to your heart
is what brought me there.
We got to look at the overall bill
and say, hey,
where are some of the big things coming from? Obviously, it's people sending us data and people
asking us questions about those data. And if they would just stop both of those things,
your bill would be so much better. It would be so much smaller. So would my salary, unfortunately.
So we wanted to reduce some of those costs, but it's a problem that's hard to get into
just from like a general perspective. You need to really get in and look at all the details to find out
what you're going to change. So I got to tell two stories of reducing costs. One, by switching
from AMD to Arm architecture for Amazon. That's the Graviton chipset, which is fantastic.
And the other was about the amazing power of spreadsheets. As much as I love graphs, I also
love spreadsheets. I'm sorry, it's a personal failing, perhaps.
It's wild to me how many tools out there do all kinds of business-adjacent things, but somehow never bother to realize that if you can
just export a CSV, suddenly you're speaking kind of the language of your ultimate user.
Play with pandas a little bit more and spit out an actual Excel file, and now you're cooking
with gas.
So the second story is about doing that with Honeycomb, taking a number of different
graphs and looking at five different attributes of our Lambda costs and what was going
into them and making changes across all of them in order to accomplish an overall cost
reduction of about 50%, which is really great.
So the story, it does combine my love of graphs, because we got to see the three lines go down,
the power of spreadsheets, and also this idea that you can't just look for one answer
to find the solution to your problems around, well, anything really,
but especially around reducing costs.
It's going to be a bunch of small things
that you can put together into one place.
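A rough sketch of the CSV-and-spreadsheet workflow described here, with the caveat that the file name, column names, and cost field are all invented for illustration rather than any actual Honeycomb export format, and that pandas plus an Excel engine such as openpyxl are assumed to be installed:

```python
# Illustrative sketch: roll a CSV export of per-customer Lambda cost data up
# into an Excel workbook. File and column names are invented for illustration.
import pandas as pd

df = pd.read_csv("lambda_costs_by_customer.csv")  # e.g. query results exported as CSV

summary = (
    df.groupby("customer")
      .agg(
          invocations=("invocations", "sum"),
          total_duration_ms=("duration_ms", "sum"),
          avg_memory_mb=("memory_mb", "mean"),
          estimated_cost_usd=("estimated_cost_usd", "sum"),
      )
      .sort_values("estimated_cost_usd", ascending=False)
)

# Write something the rest of the business can open directly.
with pd.ExcelWriter("lambda_cost_summary.xlsx") as writer:
    df.to_excel(writer, sheet_name="raw", index=False)
    summary.to_excel(writer, sheet_name="by_customer")
```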
There's a lot that's valuable
when we start going down that particular path
of starting to look at things
through a lens of a particular kind of data
that you otherwise wouldn't think to.
I maintain that you remain
the only customer we have found so far
that uses Honeycomb
to completely instrument their AWS bill.
We have not seen that before or since.
It makes sense for you to do it that way, absolutely.
It's a bit of a heavy lift for, shall we say, everyone else.
And it actually is a bit of a lift for us, too. To say we've instrumented the entire bill
is a wonderful thing to assert.
And as we've talked about, we use the power of spreadsheets too.
So there are some aspects.
There is that.
There are some aspects of our AWS spending, and actually really dominant ones, that lend themselves very easily to being described using Honeycomb.
The best example is Lambda, because Lambda is charged on a per millisecond basis and our instrumentation is collecting spans, traces, about your compute on a per millisecond basis.
There's a very easy translation there.
And so we can get really good insight into which customers are spending how much, or rather,
which customers are causing us to spend how much in order to provide our product to them
and understand how we can balance our development resources to both provide new features
and also understand when we need to shift and spend our attention managing costs instead.
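Ben doesn't walk through the arithmetic here, but the per-millisecond translation he's gesturing at looks roughly like the sketch below; the GB-second rate is an illustrative figure that varies by region and architecture, and the function name is made up.

```python
# Illustrative sketch: estimating Lambda compute cost from span duration.
# The GB-second rate is a placeholder; check current pricing for your region
# and architecture (Graviton/arm64 is billed at a lower rate than x86).
GB_SECOND_RATE_USD = 0.0000166667

def estimated_lambda_cost(duration_ms: float, memory_mb: int) -> float:
    """Rough cost of one invocation, ignoring the per-request charge."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * GB_SECOND_RATE_USD

# e.g. a 1,536 MB function that ran for 250 ms:
print(f"${estimated_lambda_cost(250, 1536):.8f}")
```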
There's a continuum here.
And I think that it tends to follow a lot around company ethos and company culture here,
where folks have varying degrees of insight into the factors that drive their cloud spend.
You are clearly an observability company.
You have been observing your AWS bill for, I would argue,
longer than it would have made sense to on some level.
In the very early days, you were doing this.
And your AWS bill was not the limiting factor to your
company's success back in those days. But you did grow into it. Other folks, even at very
large enterprise scale, more or less do this based on vibes. And most folks, I think, tend to fall
somewhere in the middle of this, but it's not evenly distributed. Some teams tend to have a very
deep insight into what they're doing. And others are, "Amazon bill? You mean the books?"
Again, most tend to fall somewhere in the center of that. It's the law of large numbers. Everything starts to revert
to a mean past a certain point.
Well, I mean, you wouldn't have a job if they didn't make it a bit of a challenge to do so.
Or I might have a better job, depending.
But we'll see.
I do want to detour a little bit here because as we record this, it is the day after AWS's big significant outage.
I could really mess with the conspiracy theorists and say it is their first major outage of October of 2025.
And then people are like, wait, what do you mean?
What do you mean, first?
Like, this is World War I?
Same type of approach.
But these things do tend to cluster.
How was your day yesterday?
Well, it did start very early.
Our service has presence in multiple regions, but we do have our main U.S. instance in
U.S. East One.
And so as things stopped working, a lot of our service stopped working too.
Not all.
I mean, the outage was significant but wasn't pervasive.
There were still some things that kept functioning.
And amazingly, we actually preserved all of the customer telemetry that made it to our front door successfully, which is a big deal because we hate dropping data.
Yeah, it is.
That took some work in engineering.
And I have to imagine this was also not an accident.
It was not an accident.
Now, their ability to query that data during the outage, that suffered.
I'm going to push back on you on that for a second there.
When AWS's US East 1, where you have a significant workload, is impacted to this degree, how important is observability?
I know that when I've dealt with outages in the past, there's the first thing you try and figure out of, is it my shitty, shitty code or is it a global issue?
That's important.
And once you establish it's a global issue, then you can begin the mitigation part of that process.
And yes, observability becomes extraordinarily important there.
For some things. But for others, at least with the cloud being as big as it is now,
there's also some reputational headline-risk protection here, in that no one is talking about your
site going down in some weird ways yesterday. Everyone's talking about AWS going down. They own
the reputation of this. Yeah. That's true. And also, when a business's customers are asking them,
which parts of your service are working.
I know AWS is having a thing.
How bad is it affecting you?
You want to be able to give them a solid answer.
So our customers were asking us yesterday,
hey, are you dropping our data?
And we wanted to be able to give them a reasonable answer,
even in the moment.
So, yes, we're able to deflect a certain amount of the reputational harm.
But at the same time, there are people that have come back and say,
well, I mean, shouldn't you have done
better? It's important for us to be able to rebuild our business and to move region to region.
And we need you to help us do that too. Oh, absolutely. I actually encountered a lot of this
yesterday when I, early in the morning, tried to get a, what was it, a Halloween costume, and Amazon's
site was not working properly for some strange reason. Now, if I read some of the relatively out-of-touch
analyses in the mainstream press, that's billions and billions of dollars lost. Therefore, I either went to go get a
Halloween costume from another vendor, or I will never wear a Halloween costume this year,
better luck in 2026. Neither of those is necessarily true. And that's really exactly why
we were focused on successfully storing our customers' data in the moment,
because then the time comes afterwards: we said what we said in the
moment, and now they're asking us, okay, what really happened? That data is invaluable in helping our
customers piece together which parts of their services were working and which weren't at what
times. Did you see a drop in telemetry during the outage? Yeah, for sure. Is that because people's
systems were down or is that because their systems could not communicate out? Both. Excellent.
We did get some reports from our customers that, specifically, the OpenTelemetry
collector that was gathering the data from their application was unable to successfully send it
to Honeycomb.
At the same time, we were not rejecting it.
So clearly there were challenges in the path between those two things, whether that was in
AWS's network or in some other network unable to get to AWS, I don't know.
So we definitely saw there were issues of reachability, and so undoubtedly there was some data
dropped there. That's completely out of our control. So the only part we could say is once the data
got to us, we were able to successfully store it. So the question is, was it customers' apps going
down? Absolutely. Many of our customers were down, and they were unable to send us any telemetry
because their app was offline. But the other side is also true. The ones that were up were having
trouble getting to us because of our location in U.S. East.
Now, to continue reading what the mainstream press had to say about this, does that mean that
you are now actively considering evacuating AWS entirely to go to a different provider
that can be more reliable, probably building your own data centers?
Yeah, you know, I've heard people say that's the thing to do these days.
Now, I have helped build data centers in the past.
As have I. There's a reason that both of us have a job that does not involve that.
There is. The data centers I built were not as reliable as any of the data centers that are
available from our big public cloud providers. I would have said, unless you worked at one of those
companies building the data centers, and even back then, given the time you've been at Honeycomb,
I can say with certainty, you are not as good at running data centers as they are,
because effectively no one is. This is something that you get to learn about at significant scale.
The concern, as I see it, is one of consolidation, but I've seen too many folks try and go multi-cloud
for resilience reasons. And all they've done is
add a second single point of failure.
So now they're exposed to everyone's outage.
And when that happens, their site continues to fall down in different ways,
as opposed to being more resilient,
which takes a hell of a lot more than just picking multiple providers.
There is something to say, though, of looking at a business and saying,
okay, what is the cost for us to be, you know, single region versus what is the cost to be
fully, you know, multi-region where we can fail over in an instant and nobody notices?
Those cost differences are huge.
And for most businesses...
Of course, it's a massive investment, at least 10x.
Yeah.
So for most businesses, you're not going to go that far.
My newsletter publication is entirely bound within US West 2.
That just happened to be for latency purposes, not reliability reasons.
But if the region is hard down and I need to send an email newsletter and it's down for several days,
I'm writing that one by hand because I've got a different story to tell that week.
I don't need it to do the business as usual thing.
And that's a reflection of architecture and investment decisions reflecting the reality of my
business.
Yes.
And that's exactly where to start.
And there are things you can do within a region to increase a little bit of resilience
to certain services within that region suffering.
So as an example, I don't remember how many years ago it was, but Amazon had an outage in
KMS, the key management service,
and that basically made everything stop.
You can probably find out exactly when it happened.
Yes, I'm pulling that up now.
Please continue. I'm curious now.
They provide a really easy way to replicate all of your keys to another region
and a pretty easy way to fail over accessing those keys from one region to another.
So even if you're not going to be fully multi-region,
you can insulate against individual services that might have an incident
and prevent those individual services from having an outsized impact
on your application.
We don't need their keys most of the time,
but when you do need them,
you kind of need them to start your application.
So if you need to scale up or do something like that
and it's not available, you're really out of luck.
So the thing is,
I don't want to advocate that people try and go fully multi-region,
but that's not to say that we abdicate all responsibility
for insulating our application from transient outages
in our dependencies.
Yeah.
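A hedged sketch of the kind of insulation Ben describes, using KMS multi-Region keys through boto3; the key ID, regions, and failover logic below are placeholders for illustration, not Honeycomb's actual setup.

```python
# Sketch: replicate a KMS multi-Region key and fall back to the replica region
# if the primary region's KMS endpoint is unavailable. IDs and regions are placeholders.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"
FALLBACK_REGION = "us-west-2"
MRK_KEY_ID = "mrk-EXAMPLE1234567890"  # multi-Region keys keep the same ID in each region

# One-time setup: create the replica (commented out so the sketch has no side effects).
# boto3.client("kms", region_name=PRIMARY_REGION).replicate_key(
#     KeyId=MRK_KEY_ID, ReplicaRegion=FALLBACK_REGION)

def decrypt_with_failover(ciphertext: bytes) -> bytes:
    """Try the primary region first; fall back to the replica if KMS is unreachable."""
    last_error = None
    for region in (PRIMARY_REGION, FALLBACK_REGION):
        kms = boto3.client("kms", region_name=region)
        try:
            return kms.decrypt(KeyId=MRK_KEY_ID, CiphertextBlob=ciphertext)["Plaintext"]
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # try the next region
    raise RuntimeError("KMS unavailable in all configured regions") from last_error
```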
To be clear, they did not do a formal write-up on the KMS issue on their, basically, kind of not-terrific list of post-event summaries.
Things have to be sort of noisy for that to hit.
I'm sure yesterday's will wind up on that list once they have it written up.
They've probably got that up before this thing publishes.
But, yeah, they did not put the KMS issue there.
You're completely correct.
This is the sort of thing of, what is the blast radius of these issues?
And I think that there's this sense that before we went to the cloud, everything was more
reliable, but just the opposite is true.
The difference is that if we were all building our own data centers, today my shitty
stuff at Duckbill is down, as it is every, you know, every random Tuesday.
And tomorrow, Honeycomb is down because, oops, it turns out you once again forgot
to replace a bad hard drive.
Cool.
But those are not happening at the same time.
When you start with this centralization story, suddenly a disproportionate swath of the world is down simultaneously, and that's where things get weird.
It gets even harder, though, because you can test your durability and your resilience as much as you want, but it doesn't account for the challenge of third party providers on your critical path.
You obviously need to make sure that, in order for Honeycomb to work,
Honeycomb itself has to be up.
That's sort of step one.
But to do that, AWS itself
has to be up in certain places.
What other vendors factor into this?
You know, that was, I think, the most interesting part of yesterday's challenge,
bringing the service back up, is that we do rely on an incredible number of other services.
There's some list of all of our vendors that is hundreds of entries long.
Now, those are obviously very different parts of the business.
They involve, you know, companies we contract with for marketing outreach and for business
and for all of that.
Right.
We use Dropbox here.
And if Dropbox is down, that doesn't necessarily impact our ability to wind up serving our customers.
But it does mean I need to find a different way, for example, to get the recorded file from this podcast over to my editing team.
Yeah. So there's a very long list. And then there's the much, much shorter list of vendors that are really in the critical path.
And we have a bunch of those too. We use vendors for feature flagging and for sending email and for
some other forms of telemetry that are destined for other spots.
For the most part, when we get that many vendors all relying on each other,
and they're all down at once, there's this bootstrapping problem,
where they're all trying to come back, but they all sort of rely on each other in order to come back
successfully. And I think that's part of what made yesterday morning's outage move from
roughly, what, like midnight to 3 a.m. Pacific all the way through the rest of the day and
still have issues with some companies up until 5, 6, 7 p.m.
This episode is sponsored by my own company, Duckbill.
Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them.
Maybe you're just wondering how to predict what's going on in the wide world of AWS.
Well, that's where Duckbill comes in to help.
Remember, you can't duck the Duckbill bill, which I am reliably informed by my business partner,
is absolutely not our motto.
To learn more, visit DuckbillHQ.com.
The Google SRE book talked about this, oh, geez, when was it?
15 years ago now, damn near, that at some point when a service goes down and then it starts to
recover, everything that depends on it will often basically pummel it
back into submission trying to talk to the thing.
It's, like, I remember back when I worked as a senior systems engineer at Media
Temple in the days before GoDaddy bought and then ultimately killed them.
I was touring the data center my first week.
We had, we had three different facilities.
I was in one of them.
And I asked, okay, great.
Say I just trip over something and hit the emergency power-off switch.
Great.
And kill the entire data center.
There is an order that you have to bring things back up in the event of those catastrophic
outages.
Is there a runbook?
Of course there was.
Great.
Where is it?
Oh, it's in Confluence.
Terrific.
Where's that?
Oh, in the rack over there.
And I looked at the data center manager.
And she was delightful and incredibly on point.
And she knew exactly where I was going with it: "I'm going to print that out right now."
Excellent.
Excellent.
Like that's why you ask.
It takes someone who has never seen it before, but knows how these things go, working through
that, because you build dependency on top of dependency.
And you never get the luxury of taking a step back and looking at it with fresh eyes.
But that's what our industry has done.
Like, you have your vendors that have their own critical
dependencies, and they may or may not have done as good a job as you have of identifying those, and so on
and so forth. It's the end of a very long chain that does kind of eat itself at some point. Yeah, there are
two things that that brings to mind. First, we absolutely saw exactly what you're describing yesterday
in our traffic patterns where the volume of incoming traffic would sort of come along and then it would
drop as their services went off and then it's quiet for a little while. And then we get this huge spike
as they're trying to like, you know, bring everything back on all at once. Thankfully, those were
sort of spread out across our customers.
So we didn't have like just one enormous spike hit all of our servers.
But we did see them on a per customer basis.
It's a very real pattern.
But the second one, for all of these dependencies,
there are clearly several who have built their system with this challenge in mind
and have a series of different fallbacks.
and I'll give you the story of,
we used LaunchDarkly for our feature flagging.
Their service was also impacted yesterday.
One would think, oh, we need our feature flags in order to boot up.
Well, their SDK is built with the idea
that you set your feature flag defaults in code.
And if we can't reach our service, we'll go ahead and use those.
And if we can reach our service, great. We'll update them.
And if we can update them once, that's great.
If we can connect
to the streaming service, even better.
And I think they also have some more bridging in there,
but we don't use the more complicated infrastructure.
But this idea that they designed the system with the expectation
that in the event of a service unavailability,
things will continue to work,
made the recovery process all that much better.
And even when their service was unavailable and ours was still running,
the SDK still answers questions in code for the status of all of these flags.
It doesn't say, oh, I can't reach my upstream.
Suddenly, I can't give you an answer anymore.
No, the SDK is built with that idea of local caching so that it can continue to serve
the correct answer so far as it knew from whenever it lost its connection.
But it means that if they have a transient outage, our stuff doesn't break.
And that kind of design really makes recovering from these
interdependent outages feasible, in a way that the strict ordering you were describing
just makes really difficult.
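The LaunchDarkly SDK internals aren't shown here; the following is only a minimal sketch of the general pattern Ben describes, defaults in code plus a last-known-good cache, with every name invented for illustration.

```python
# Minimal sketch of the "defaults in code + local cache" pattern (not the real
# LaunchDarkly SDK): refresh flags when the flag service is reachable, and keep
# serving the last values seen when it isn't.
import time

class FlagClient:
    def __init__(self, fetch_remote_flags, defaults, refresh_seconds=30):
        self._fetch = fetch_remote_flags      # callable returning {flag_name: value}
        self._defaults = dict(defaults)       # hard-coded fallbacks baked into the app
        self._cache = dict(defaults)          # last known good values
        self._refresh_seconds = refresh_seconds
        self._last_attempt = 0.0

    def _maybe_refresh(self):
        if time.monotonic() - self._last_attempt < self._refresh_seconds:
            return
        self._last_attempt = time.monotonic()
        try:
            self._cache.update(self._fetch())  # service reachable: update the cache
        except Exception:
            pass                               # outage: keep answering from the cache

    def variation(self, flag_name, default=False):
        self._maybe_refresh()
        return self._cache.get(flag_name, self._defaults.get(flag_name, default))

# Usage: boots and answers even if fetch_remote_flags() raises during an outage.
# flags = FlagClient(fetch_remote_flags=fetch_from_service, defaults={"new-ui": False})
# if flags.variation("new-ui"): ...
```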
At least in my case, I have the luxury of knowing these things just because I'm old.
And I figured this out before it was SRE common knowledge or SRE was a widely acknowledged thing,
where, okay, you have a job server that runs cron jobs every day.
And when it turns out that, oh, you found it missed a cron job,
oopsie-doozy, that's a problem for some of those things.
So now you start building in error checking and the rest.
And then you do a restore from three days ago from backup for that thing.
And it suddenly thinks it missed all the cron jobs and runs them all.
And then hammers some other system to death when it shouldn't.
And you learn iteratively of, oh, that's kind of a failure mode.
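One common guard against that particular failure mode, sketched under the assumption that the scheduler can see the original scheduled time of each missed run; the cutoff value and helper names are arbitrary.

```python
# Sketch: skip catch-up runs that are too stale to be worth replaying, so a
# restored job server doesn't hammer downstream systems with old work.
from datetime import datetime, timedelta, timezone

MAX_CATCHUP = timedelta(hours=1)  # arbitrary cutoff for illustration

def should_run(scheduled_for: datetime) -> bool:
    """Run a missed job only if its scheduled time is recent enough to still matter."""
    return datetime.now(timezone.utc) - scheduled_for <= MAX_CATCHUP

# A restored scheduler walking its backlog of missed runs might then do:
# for run_time in missed_runs:
#     if should_run(run_time):
#         run_job(run_time)
#     else:
#         log_skipped(run_time)  # don't replay stale work against live systems
```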
Like when you start externalizing and hardening APIs, you learn very quickly, everything needs a rate limit.
And you need a way to make bad actors stop hammering your endpoints.
And not just bad actors, naive ones.
And rate limits are a good example because that is one of the things that did happen yesterday as people were coming back.
We actually wound up needing to rate limit ourselves.
We didn't have to rate limit our customers. But, so, brief digression here: Honeycomb uses Honeycomb in order to build Honeycomb.
We are our own observability vendor.
Now, this leads to some obvious challenges in architecture:
how do we know we're right?
Well, in the beginning, we did have some other services that we'd use to checkpoint
our numbers and make sure they were actually correct.
But our production instance sits here and serves our customers,
and all of its telemetry goes into the next one down the chain.
We call that dog food because we are, you know,
the whole phrase of eating your own dog food,
drinking your own champagne is the other more pleasing version.
So from our production, it goes to dog food.
Well, what's dog food made of? It's made up of kibble. So our third environment is called kibble. So the dog food telemetry, it goes into this third environment. And that third environment, well, we need to know if it's working too. So it feeds back into our production instance. Each of these instances is emitting telemetry. And we have our rate limiting, I'm sorry, our tail sampling proxy called refinery that helps us reduce volume. So it's not a positively amplifying cycle.
But in this incident yesterday, we started emitting logs that we don't normally emit.
These are coming from some of our SDKs that were unable to reach their services.
And so suddenly we started getting two or three or four log entries for every event we were sending
and did get into this kind of amplifying cycle.
So we put a pretty heavy rate limit on the kibble environment in order to squash that traffic and disrupt the cycle,
which made it difficult to ensure that dog food was working correctly, but it was.
And that let us make sure that the production instance is working all right.
But this idea of rate limits being a critical part of maintaining an interconnected stack
in order to suppress these kinds of wave-like
formations, the oscillations that start growing on each other and amplifying themselves and
can take any infrastructure down, and being able to put in, at just the right point,
a couple of switches and say, nope, suppress that signal, really made a big difference in our
ability to bring back all of the services.
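A minimal token-bucket sketch of the kind of "suppress that signal" switch Ben is describing; the numbers are made up, and Honeycomb's actual mechanism isn't specified here.

```python
# Sketch: a token bucket that caps how fast an internal environment may emit
# telemetry, damping the amplifying feedback loop described above.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec       # tokens refilled per second
        self.capacity = burst          # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                   # over the limit: drop or defer the event

# e.g. allow roughly 500 events/second with a burst of 1,000:
# limiter = TokenBucket(rate_per_sec=500, burst=1000)
# if limiter.allow():
#     send(event)
```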
I want to pivot to one last topic.
We could talk about this outage for days and hours.
But there's something that you mentioned you wanted to go into that I wanted to pick a fight
with you over, which was how to get people to instrument their applications for observability so they can
understand their applications, their performance, and the rest. And I'm going to go with the easy
answer because it's a pain in the ass, Ben. Have you tried instrumenting an application that already
exists without having to spend a week on it? I have. And you're not wrong. It's a pain in the ass,
and it's getting better.
There's lots of ways to make it better.
There are packages that do auto instrumentation.
Oh, absolutely.
For my case, yeah, it's Claude Code's problem.
Now I'm getting another drink.
You know, you say that in jest,
and yet they are actually getting really good.
Yeah.
No, that's what I've been doing.
It works super well.
You test it first, obviously, but yeah.
You know, YOLO slammed that into production, but yeah.
The LLMs are actually getting pretty good at understanding
where instrumentation can be useful.
I say understanding.
I put that in air quotes.
They're good at finding code
that represents a good place
to put instrumentation
and adding it to your code in the right place.
Yeah, I need to take another try one of these days.
The last time I played with Honeycomb,
I instrumented my home Kubernetes cluster,
and I exceeded the limits of the free tier
based on ingest volume
by the second day of every month.
And that led to either:
you have really unfair limits,
which I don't believe to be true,
or the more insightful question,
what the hell is my Kubernetes cluster doing that's that chatty?
So I rebuilt the whole thing from scratch,
so it's time for me to go back and figure that out.
Yeah, so I will say a lot of instrumentation is terrible.
A lot of instrumentation is based on this idea
that every single signal must be published all the time.
And that's not relevant to you as a person
running the Kubernetes cluster.
Do you need to know every time
a local pod checks in to see
whether it needs to be evicted?
No, you don't. What you're interested
in are the types
of activities that are relevant
to what you need to do as an
operator of that cluster. And the same is true
of an application. If you
just, to put it
in the tracing language,
put a span on every single
function call,
you will not have useful traces because it doesn't map to a useful way of representing your user's journey through your product.
So there's definitely some nuance to getting the right level of instrumentation.
And I think the right level, it's not a single place.
It's a continuously moving spectrum based on what you're trying to understand about what your application is doing.
So at least at Honeycomb, we add instrumentation all the time and we remove instrumentation all the time.
Because what's relevant to me now as I'm building out this feature is different from what I need to know about that feature once it is fully built and stable and running in a regular workload.
Furthermore, as I'm looking at a specific problem or question, we talked about pricing for Lambdas at the beginning of this.
There was a time when we really wanted to understand pricing for S3.
And part of our model, it's a struggle.
Part of our storage model is that we store our customers' telemetry in S3 in many files.
And we put instrumentation around every single S3 access in order to understand both the volume
and the latency of those to see like, okay, should we bundle them up or resize it like this?
And how does that influence?
So it's so on.
And it's incredibly expensive to do that kind of experiment.
And it's not just expensive in dollars.
Adding that level of instrumentation does have an impact on the overall performance of the system.
When you're making 10,000 calls to S3 and you add a span around every one, it takes a bit more time.
So once we understood the system well enough to make the change we wanted to make, we pulled that back out.
So for your Kubernetes cluster, you know, maybe it's interesting at the very beginning
to look at every single connection that any process might make.
But if it's your home cluster, that's not really what you need to know as an operator.
So finding the right balance there of instrumentation that lets you fulfill the needs of the business,
that lets you understand the needs of the operator in order to best be able to provide the service
that this business is providing to its customers,
it's a place somewhere there in the middle,
and you're going to need some people to find it.
And that's easier said than done for a lot of folks.
But you're right, it is getting easier to instrument these things.
It is something that is iteratively getting better all the time.
To the point where now, this is an area where AI is surprisingly effective.
It doesn't take a lot to wrap a function call with a decorator.
It just takes a lot of doing that over and over and over again.
You do a lot of them, and you see what it looks like,
and then you see, okay, which ones of these are actually useful for me now
and take out some others, and that's going to change.
And we want to be open to that changing and willing to understand that this is an evolving thing.
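As a concrete example of the decorator approach mentioned above, here is a small sketch using the OpenTelemetry Python API; the span and attribute names are invented, and setting up a tracer provider and exporter is omitted.

```python
# Sketch: wrapping a function in a span with a decorator. Names are illustrative;
# configuring a TracerProvider and exporter is left out for brevity.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("render_invoice")  # one span per call to this function
def render_invoice(customer_id: str) -> str:
    span = trace.get_current_span()
    span.set_attribute("app.customer_id", customer_id)  # something you might query on later
    # ... the actual work happens here ...
    return f"invoice for {customer_id}"
```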
And this does actually tie back to one of the core operating principles of modern SaaS architectures,
the ability to deploy your code quickly,
because if you're in this cycle of adding instrumentation
or removing instrumentation, you see a bug.
It has to be easy enough to add a little bit more data
to get insight into that bug in order to resolve it.
And if it's not, you're not going to do it
and the whole business is going to suffer for it.
What is "quickly" to you?
I'd like to see, in between
I need to make this change
and it's visible in my test environment,
a couple of minutes.
For "I need to make this change and have it visible running in production,"
it depends on how frequently the bug comes up,
but I'm actually okay with it being about an hour for that kind of turnaround.
I know a lot of people say you should have your code running in 15 minutes.
That's great.
I know that's out of reach for a lot of people and a lot of industries.
So I'm not a hardliner on how quickly it has to be.
But it can't be a week.
It can't be a day.
That's just like, you're going to want to do this two or three times in the course of resolving a bug.
And so if it's too long, you're just really pushing out any ability to respond quickly to a customer.
I really want to thank you for taking the time to speak with me about all this.
If people want to learn more, where's the best place for them to go?
You know, I have backed off of almost all of the platforms in which people carry on conversations in the internet.
Everyone seems to have done this.
I did work for Facebook for two and a half years.
And someday I might forgive you.
Someday I might forgive myself.
It was a really different environment.
And I could see the allure of the world they're trying to create.
And it doesn't match.
Oh, I interviewed there in 2009.
It was incredibly compelling.
It doesn't match the view that I have of the world I want to live
in. And so I have a presence at Honeycomb. I do have accounts on all of the major platforms. So you can
find me there. There will be links afterwards, I'm sure. But LinkedIn, Blue Sky, I don't know,
GitHub. Is that a social media platform now? They wish. We'll put all this in the show notes.
Problem solved for us. Thank you so much for taking the time to speak with me. I appreciate it.
It's a real pleasure. Thank you.
Ben Hartshorne is a principal engineer at Honeycomb.
Or one of them; they possibly might have more than one.
It seems to be something you can scale,
unlike my nonsense, as Chief Cloud Economist at the Duckbill Group.
And this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five-star review
on your podcast platform of choice,
along with an insulting comment that won't work
because that platform is down and not accepting comments at this moment.
I'm just
