Software at Scale - Software at Scale 7 - Charity Majors: CTO, Honeycomb
Episode Date: January 25, 2021

Charity Majors is the CTO of Honeycomb, a platform that helps engineers understand their own production systems through observability. Honeycomb is very different from traditional monitoring tools like Wavefront: it is built for data with high cardinality and high dimensionality, which can instantly speed up debugging of many problems.

Apple Podcasts | Spotify | Google Podcasts

NOTE: This episode has some explicit language.

We talk about observability, monitoring, building your own database for a particular use case, starting a developer tool startup, having the right on-call culture, getting to fifteen-minute deployments, and more.

Highlights (notes are italicized)

05:00 - High cardinality and high dimensionality in Honeycomb. Data retention in Honeycomb: 60 days. Many monitoring systems, like Dropbox's Vortex, downsample data after two weeks.
13:00 - Observability-driven development. The impact of deploying code within 15 minutes of it being merged. Synchronous and asynchronous engineering workflows.
19:00 - Setting up on-call rotations. What the size of a rotation should be.
21:00 - How often should someone on a 24/7 on-call rotation be woken up? Once or twice a year, with exceptions. The impractical nature of some of the Google SRE book's "Being Oncall" chapter. On-call for managers.
31:00 - Why are monitoring tools so ubiquitous compared to observability tools?
36:00 - Observability and tracing. What the future of observability infrastructure might look like.
40:00 - What will the job of an SRE look like in the future? The split of roles in software engineering organizations.
43:00 - Shipping code faster makes engineers happier. How do you ensure your engineering organization is healthy, and which metrics to use. Learned helplessness in engineering organizations, and leadership failures.
51:00 - Building internal tools in-house vs. using external tools. The large impact that designers at Honeycomb have had on the product.
58:00 - The story of starting Honeycomb. Creating a "Minimum Lovable Product". A description of Honeycomb's internal architecture. Dealing with tail latencies.
71:00 - Continuous deployment and releasing code quickly. Use calendly.com/charitym if you want to chat with Charity about continuous deployment best practices or anything else.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Thanks, folks, for joining me on an episode of the Software at Scale podcast.
We're joined here today with Charity Majors, and in her own words, Charity is an ops engineer and an accidental startup founder at
Honeycomb IO. And before this, she worked at Parse, Facebook, Linden Lab on infra and developer
tools. And she always seems to wind up running the databases. Always. And she's also the co-author
of O'Reilly's Database Reliability Engineering, loves free speech, free software and single
malt scotch. And I'll talk a little bit about Honeycomb IO.
So it's an observability system.
It helps you understand your systems better.
And maybe, Charity, if you just want to get started
with telling us about what Honeycomb is.
Sure.
Honeycomb was really the first observability project.
We kind of developed the language of observability by,
you know, we had had this experience, my co-founder, Christina, and I had this experience
at Facebook of using and, you know, collaborating on a tool that was just radically different from
anything that I'd ever used in the past. And like, and I've been on call since I was 17, right? So
I've used all the tools. But, you know, when Parse was going through our really rapid growth spurt back in 2012 or so, we had 60,000 mobile apps. By the time we sold to Facebook, when I left, there were over a million mobile apps. And every day a different app would hit the iTunes top 10, or it would take off; I think the first one was this Swedish death metal band. It just came out of nowhere. All of my tactics for monitoring systems, predicting how they would fail, then writing monitoring checks, making dashboards, post-morteming, and creating documentation so that we could find it immediately the next time, were all basically useless, because these things weren't breaking in a patterned way.
And I had tried everything.
And we were going down multiple times a day.
It was a really rocky and stressful time.
But we started to get a handle on it.
The first crack of hope I got was when we started feeding some data sets into this tool at Facebook called Scuba, which is butt ugly, like aggressively hostile to users. It does one thing well, which is letting you slice and dice on high-cardinality dimensions in near real time. And, you know, we started feeding some data into there, and the time it took us to identify and pinpoint the cause of today's problems just dropped like a rock, from days, open-ended, who knows, maybe we'll get lucky, right, down to seconds. Not even minutes. It wasn't even an engineering problem anymore, it was a support problem. We just break down by the app ID, which is something you can't do in any monitoring product, right? Break down by the app ID, you know, and just follow the trail of breadcrumbs, and it would lead you to the answer every time.
And this made a huge impact on my life. You know, suddenly I could sleep again. We had a system that I felt proud of again. So when I was leaving Facebook, I kind of stopped short and went, oh shit, I don't know how to engineer anymore without this stuff that we've built. It's become so core to my experience of software. It's not even that it's a tool anymore; it's my five senses for production. And the idea of going back to not having it is like flying blind. My ego couldn't take it, you know? So we decided to build something to approximate that experience.
But right from the beginning, we could tell that the language that existed was very inadequate. You know, it wasn't a monitoring tool, because it wasn't reactive. And then like six months in, I happened to google the definition of observability, which nobody was really using at the time. And I looked it up and I read the definition, which comes from mechanical engineering and control systems theory. It's about how well can you understand the inner workings of your system
just by looking at it from the outside?
And I just had fireworks going on.
I'm like, oh my God, this is what we're trying to build, right?
We're trying to build something that will let us ask any question from the outside,
like ask any combination of questions, new questions, whatever,
to describe some novel system state that we've never encountered before, we have no prior knowledge of, and we can't write any custom
code to describe it because that would imply that we could predict it in advance, right?
And so it's like, it's the unknown unknowns problem. So we started talking about observability,
we started, you know, building Honeycomb to that spec. And what was the question?
Here we are today.
Yeah.
Yeah.
So that's the gist of Honeycomb, right?
And the key insight seems to be high cardinality, right?
Most monitoring.
It's one of them.
Okay.
Another one is high dimensionality.
You know, because whenever you're pinpointing a problem in modern systems,
it's not, it's high cardinality, absolutely.
Everything's a high cardinality dimension.
But it's also being able to string together as many of those as you want. Like maybe the bug is only triggered for iOS devices running a particular version using this particular firmware, this language pack, this version of the app ID, this and this region, you know, it's just like every single one of those, right? And when you've
got metrics, you can't do that because you discarded all of that connective tissue when
you wrote that data out at the beginning. You can't ever recreate that, right? Which is like
the source of truth for observability is these arbitrarily wide structured data blobs that you
could just slice and dice and combine and recombine as much as you want.
It's just a different level of flexibility.
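(To make that concrete, here is a minimal sketch of an "arbitrarily wide" structured event and of slicing and dicing over a set of them. The field names and the helper function are hypothetical illustrations, not Honeycomb's actual schema or API.)

```python
# One wide structured event per request: as many dimensions as you like,
# any of which can be high cardinality. Field names here are hypothetical.
from collections import Counter

event = {
    "timestamp": "2021-01-25T12:00:00Z",
    "service": "api",
    "endpoint": "/classes/GameScore",
    "duration_ms": 238.4,
    "status_code": 500,
    "app_id": "b9f2c1d4",          # high-cardinality: millions of values
    "user_id": "u_18472934",       # high-cardinality
    "device_os": "iOS 14.3",
    "firmware": "18C66",
    "language_pack": "sv_SE",
    "region": "eu-north-1",
}

def break_down(events, group_by, **filters):
    """Filter on any combination of dimensions, then group by any other one.
    No pre-declared schema or index is required."""
    matching = (e for e in events if all(e.get(k) == v for k, v in filters.items()))
    return Counter(e.get(group_by) for e in matching)

# e.g. "which app IDs account for the 500s in eu-north-1?"
# break_down(all_events, "app_id", status_code=500, region="eu-north-1")
```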
So you get high cardinality and high dimensionality,
and especially with monitoring,
you have to think of all of these different things beforehand.
And it's also at least some of these time series databases
that back monitoring systems,
they're not built for this high cardinality.
No, they're not built for it.
Exactly.
And I don't, I'm not talking shit about them.
They're built for different use cases that are really super valid.
They're built for, you know, counts and aggregates and dials, and for storing lots of fine detail in a very compressed and space-sensitive way, because as it ages out, it aggregates, right? Which is really great when you're trying to plot trends over time or something, but it makes it useless when what you're trying to do is pinpoint what that user's behavior was like.
Yeah, that definitely gets aggregated away. And I've seen that at my workplace, and I wish I'd had something like Scuba or Honeycomb.
It's transformational.
Yes. Like in our work, for example, trying to determine whether one IP address is just spamming us a lot.
It's just another version of the same problem, right? You end up doing so much guesswork and intuition and just brute force, and it's just an open-ended amount of time. When you have a tool like this, it's literally just slice, dice, there it is.
So then what's the catch? How is this stuff implemented internally? I can see why monitoring systems don't deal with cardinality well, right? They're not meant for that. How do you implement something like Scuba or Honeycomb that's different from a monitoring system?
Well, part of what needed to happen was hardware had to get a lot cheaper, right? Because the early versions of all this kind of software were written to store everything in RAM, right?
Now SSDs are fast enough.
And actually we use S3 to back our files.
But it's also important that, for observability's sake, you can't define indexes or schemas.
Because whenever you have a schema, you're predicting again.
You're like, these are the only dimensions that I'm ever going to need to have, right?
When instead, you need to be incentivizing people to just throw shit in whenever it occurs
to them that it might be useful.
And stop throwing it in whenever it stops being useful. You know, it just has to be much more fluid.
And anytime that you're dealing with indexes, again, you're predicting in advance, which,
which dimensions need to be queryable in a short amount of time, which you don't know what questions you're going to need to ask. So, you know, the solution that we arrived upon was a
columnar store, which is basically just every dimension is an index, effectively.
And it's distributed, so it can grow very elastically.
You know, we just, it gets distributed across partitions.
And then, so we do aggregate, but instead of aggregating at write time, we aggregate at read time, which means that we can combine and recombine it as much as we want.
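(Below is a toy illustration of the idea described here: store every dimension as its own column, so every dimension is effectively an index, and aggregate at read time rather than write time. It is a sketch of the concept only, not Honeycomb's actual storage engine.)

```python
from collections import defaultdict

class ColumnStore:
    """Toy column store: arbitrarily wide events in, read-time aggregation out."""

    def __init__(self):
        self.columns = defaultdict(list)   # dimension name -> list of values
        self.num_rows = 0

    def write(self, event):
        # Accept arbitrarily wide events; new dimensions just become new columns.
        for dim in set(self.columns) | set(event):
            col = self.columns[dim]
            # Pad newly-seen dimensions so every column stays row-aligned.
            col.extend([None] * (self.num_rows - len(col)))
            col.append(event.get(dim))
        self.num_rows += 1

    def query(self, group_by, agg, value_dim, **filters):
        # Read-time aggregation: scan only the columns the query touches.
        rows = range(self.num_rows)
        for dim, wanted in filters.items():
            col = self.columns[dim]
            rows = [r for r in rows if col[r] == wanted]
        groups = defaultdict(list)
        for r in rows:
            groups[self.columns[group_by][r]].append(self.columns[value_dim][r])
        return {key: agg(values) for key, values in groups.items()}

# e.g. store.query(group_by="app_id", agg=max, value_dim="duration_ms", status_code=500)
```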
Okay.
But then how do you prevent, like,
the explosion of data that you're going to get eventually?
Why would I want to prevent that?
It's great. It's fantastic.
Or in terms of, you know,
you'll have so much data for like many months.
Is there something where like you cut off like,
okay, you can't look at data from six months ago
or something like that?
You know, the trade-offs that we make for this kind of a data storage problem
are very different.
Like there's a reason that this database we had to write from scratch,
you know, it doesn't exist on the market because it makes no sense
for almost any other use case.
Because where else would it be where you want your data to be fast and mostly right, right? You just don't want those trade-offs, right? But we make them, and it's quite fast. We give all our users, even the free tier, 60 days of storage for free.
Amazing.
Yeah, and you can store it even longer. You know, the first version of our database was just columnar stores, you know, and nodes.
This last year, we rewrote significant parts of it so that the queries are actually being run as Lambda jobs, Lambda queries.
And a few hours or days afterwards, it ages the data out to S3, which we thought would have a huge performance hit. And it turns out not to. It has a different set of performance characteristics
than having it all in local SSDs, but it's not overall slower. So as you can imagine,
that opens just like massive vistas. Yeah, that's pretty surprising and pretty awesome.
Just the fact that you get 60 days of data to debug,
that's more than enough for...
You know, disks are fucking cheap now.
And I feel like this is a thing that most vendors
either haven't really woken up to
or they're unwilling to let go of the exorbitant prices
that they've been charging people.
But storage is not, like, this is not a commodity business.
You know, we're not selling at Amazon's price plus, you know, because the value in our service is not the storage.
The value is in you, the user, the user experience, the interface, you know, nudging you, like, guiding you, you know.
Any computer can detect a spike, but it's about helping humans attach meaning to the data that they're seeing.
That's what we're charging for.
Yeah.
I think the value is pretty clear to me.
In terms of the VC pitch, there is a technology risk.
Can you write a database that gets this working?
But there's no market risk.
Once you have this working, I think people want to buy it.
I would certainly want to use it if I had a high-scale startup.
Yeah. And the thing is, it's not just if you have a high-scale startup,
like there are boundaries where this sort of thing becomes, you have to have it or you die.
But having something like this from the beginning, it makes it so that you never have to dig those holes, right? You never have to build a system that is just a hairball that the cat coughed up, that nobody's ever understood. You're shipping new shit every day that nobody's ever understood, you know? And it's just like, this is why people are afraid to be on
call. They don't want to touch it, right? But like your systems don't have to be that way. And like,
it is easier and better to have observability from the beginning. It is easier to develop if
you can see what you're doing. It's faster to find bugs. You know, you end up just being in this very intimate
conversation with your code as it's running in production. And I think that almost everyone
who's had this experience can't imagine going back. So yeah, what this enables you to do is
worry less about shipping code that might be buggy, because you can easily catch it.
Yes, yes. Observability-driven development is kind of what I've been calling it, where, as you're writing code, you're instrumenting, you know, with a thought to yourself an hour from now. And then you go look and say, is it doing what I expected? Does anything else look weird? And you will catch upwards of 80% of all bugs right there, if you're just looking at it while it's fresh in your mind.
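(A minimal sketch of what "instrumenting as you write the code" can look like: record one wide event per unit of work, adding any field that might be useful to your future self an hour from now. The emit_event function and the example workload are hypothetical stand-ins, not any particular vendor's SDK.)

```python
import time

def emit_event(fields):
    """Stand-in for whatever telemetry client you actually use."""
    print(fields)

def resize_image(path, width, height):
    """The code you just wrote, instrumented as you go."""
    event = {"op": "resize_image", "path": path, "width": width, "height": height}
    start = time.monotonic()
    try:
        # ... the real work would happen here ...
        event["outcome"] = "ok"
    except Exception as exc:
        event["outcome"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit_event(event)

resize_image("/tmp/cat.png", 640, 480)
```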
And that brings me to another conversation that we've had recently, around deploying your code within 15 minutes of it being landed in the code base or being merged.
Yeah.
Why do you think most organizations think it's not possible to do something like that?
Or many organizations think that.
Because they've never seen it.
And so it becomes a self-fulfilling prophecy.
Well, if it was possible, other people would have done it.
I would have seen it before.
Right. And I don't want to downplay how difficult it can be to work your way up out of the pit once you're in the pit. It can be hard and scary once you're in the pit. Conversely, if you just never fall into the pit, it is way easier. When I set up our auto-deploy stuff, it was just a bash script, you know, one that looks for an artifact and deploys it every 15 minutes or something. And I did that in like week three, right?
And so we've just, we've grown up with this.
And anytime you merge, you just know your code's going to be out in a few minutes, and it's just never gotten to be hard, right?
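(The auto-deploy she describes was "just a bash script" that looks for a new build artifact and deploys it every 15 minutes. Here is a sketch of that shape of loop; the file paths and deploy command are hypothetical stand-ins, not Honeycomb's actual tooling.)

```python
import subprocess
import time

POLL_INTERVAL_SECONDS = 15 * 60
LATEST_BUILD_FILE = "/srv/builds/LATEST"        # hypothetical: written by CI
DEPLOYED_VERSION_FILE = "/srv/deploys/CURRENT"  # hypothetical: written on deploy

def read_version(path):
    with open(path) as f:
        return f.read().strip()

def deploy(version):
    # Stand-in: pull the artifact for `version` and restart the service.
    subprocess.run(["/srv/bin/deploy.sh", version], check=True)  # hypothetical script
    with open(DEPLOYED_VERSION_FILE, "w") as f:
        f.write(version)

if __name__ == "__main__":
    while True:
        newest = read_version(LATEST_BUILD_FILE)
        if newest != read_version(DEPLOYED_VERSION_FILE):
            deploy(newest)
        time.sleep(POLL_INTERVAL_SECONDS)
```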
Like at Honeycomb, for the last couple years, we've had about nine or ten people writing code for everything from the database, the query planner, the application, the integrations, the security, the proxy, you know, everything. There's no way that we could have done
that if we didn't have that tight coupling between it's merged, it's deployed, and you look at it,
right? Because you can see how, you know,
if that interval becomes elongated, all these pathologies just proliferate, you know,
people have lost their place, you know, they've paged it out, they've forgotten what they were doing. Somebody else deploys something that has your changes and a bunch of other changes all bundled up, and then spends the rest of the afternoon trying to untangle it, and to git bisect and figure out whose thing broke it, you know. I've been doing some back-of-the-envelope calculations, and checking them with my intuition to see if this sounds right, but I think the number of engineers that it takes you to write and support software, if your code is automatically deployed within 15 minutes, let's call that N, right? That's the number it inherently takes. If it takes more than that, if it takes hours, I think you need twice as many engineers. And if it takes longer, if it takes like days, I think you need twice as many again. And if it takes like weeks, I think you need twice as many again. So that's doubling it three times.
And past that, I actually have no experience.
So I don't know if it's true or not, but like, that's incredibly costly.
It's incredibly costly because, you know, it's the mythical man-month all over again. You add people, you're also adding friction. You're also adding communication. You're adding specialists. It's amazing to be able to just execute on what must be done with a small, nimble team. And, you know, we've got some experienced engineers, but we very explicitly did not just hire all our friends from Google and Facebook; we brought some intermediate engineers on, and, you know, we have a pretty diverse team. And this is why I feel so strongly that, you know,
the main factor that defines how quickly you can move is not your knowledge of data structures and
algorithms. It's the system that exists around you that supports you or hobbles you or inhibits you
or whatever. You will rise or fall to the level of your ability to execute within the system
within a few weeks of joining the system. And people are always like, we have the best team.
We hired the best engineers. But they're spending way less time and money on this, because it is an investment. It's not on your feature list, right? And yet you have to carve out continuous dedicated time towards maintaining this sociotechnical system that surrounds your engineers, that takes their code and delivers it to users, and makes sure that there aren't bugs, or alerts them in a timely way, and doesn't bother more people than it needs to, you know, all of this stuff. The question you asked me was why is it that people don't do this. I think it's because it hasn't really sunk into them how important it is, and that this isn't just a thing for elites, this is for everyone. It is literally easier to write code this way than any other way. So that's why I'm
ranting about it.
Yeah. The way I think about it is it goes from being an asynchronous workflow, where you're waiting and checking, has my code been deployed or not?
Yes.
You submit your code, you wait, you see that it works fine, and then you move on with your day. So you're not thinking about this in the back of your mind.
We spend so much time just waiting on each other.
Yeah, it reduces the amount of communication exactly that you need to do with other people.
So you mentioned that it's easier to like avoid going into like a bad place
in the first place rather than trying to fix it once it's already really bad.
So what are some other things like that a company can do
if they're starting off, if they have five or 10 people that,
that can help them avoid that problem in the first place?
Yeah. I mean, I think that job number one is just: pay attention to that interval between when you write the code and when it's live, and keep it small, because it's at the very beginning of the stream, right? So any growth there is only going to be exponentially multiplied later on in the stream.
You can never really recover from it.
Having a good CI/CD pipeline, which, like you were just saying, frees you up to not have to think about so many things, because that's software doing what software does well, so that humans can do what humans do well, right? I think putting all engineers on call is really important when you're small like that, and having it be very egalitarian. When you're small and you've got five developers, there's no excuse for anyone not to know how to deploy their own code, right? Or how to debug it when something breaks, or, you know, how to follow a request from end to end and just debug it if it's broken.
Also, and I know that this can be costly because, you know, if you spread much more thinly,
you can cover more ground and move more quickly, you know, if you have only one person who knows
each thing. But I think that that's a really false and fragile form of speed.
And you really, it's like running with RAID 0, right, instead of RAID 5.
Like, yeah, it might be a little faster and cheaper, but you're going to pay.
It's not a question of if, it's a question of when.
Yeah, that's a great analogy.
In terms of on-call, at what point do you think it makes sense to separate people who are on call from, like, maybe people who are just worried about shipping features? Or does it ever make sense to make that separation?
So I do think there is an upper bound to the size of an on-call rotation. You really don't want it to be more than like seven people, you know, because otherwise you're going to forget too much of how the system works. It needs to be a regular thing. You know, it shouldn't be too life-impacting, right?
I'm a big proponent of all engineers being on call, but also this is management's job. Like
you have one job, make it not suck, right?
Like people shouldn't dread it.
It shouldn't be impacting your life.
But yeah, I don't think you can really have
an effective rotation that's significantly higher
than seven, maybe 10 people
if you're doing primary, secondary.
So once you're bigger than that,
then you got to split into two.
And I think that is around the point
that a lot of teams start to have
some natural specialization, whether it's the iOS and Android engineers, or the really front-end people who start to have a lot of custom assets and Sass and I don't know what else.
And there tend to be more back and front and whatever.
So I think that that tends to happen fairly organically.
It's like the two pizza team.
Yeah.
You don't want it to be too big. And for the upper bound, is the reason why you think it shouldn't be too big because people just lose context on what's there?
If you're only on call once every six months,
it's going to be a big fucking deal, right?
And you're going to forget
and have to learn all over again.
What's changed since you were last on call?
Because every rotation,
it should shift subtly, right?
Something should change.
Something should be fixed.
I also really strongly believe that when you're on call, that's your job.
And you shouldn't be held, like the product managers should factor this into their planning
that you should not be expected to get any work done on features.
Because that's how you budget in this ongoing care for the system. And I've really,
I have seen this done at many places so well that people actually look forward to being on call
because it's a neat break in routine. All you have to do is whatever the hell you want that
you think might fix it, right? All the annoying things are just kind of bugging you. You have
carte blanche, you know, the system isn't on fire and your tickets are low, you know, then you go do it, do whatever you want,
right? And I think that that's liberating. I think that's healthy because engineers want to do good
work. People want to work on their system. They want to improve them. They just often aren't given
the time and the leeway to do so. And so I think that making it clear that, you know, on-call time is not project time.
It's a nice little breather for the engineer.
It's a good reminder for whoever's doing the planning that, you know, you have to bake
some flex into the system.
Yeah.
Okay.
And another point you mentioned is that, yeah, on-call shouldn't suck when you're on-call,
right?
So one part of that definitely means you need to have tools that are good enough that they make it easy to debug problems. Are there other parts to on-call not sucking?
Yeah. If someone is on call for a 24/7 startup or service, you know, I think it's reasonable to ask them to be woken up once or twice a year for their service.
Once or twice a year, but not more?
Once or twice a year. More than that, and it's going to very rapidly become incompatible with many people's lives.
And the only exception I would make for that is if you have an infant, if you have a child who is not yet trained to sleep through the night, then you're not on call. Only one of those at a time, please, right? But yeah, I think once or twice a year. And as with anything, this is a human system; there will be exceptions. There was one time when I had a guy on my team who was game, like, he didn't want to shirk his duty, he wanted to be on call, he wanted to be like the cool kids. But his body just wouldn't let him, you know; he had so much anxiety that he could never go to sleep. And it wasn't because he was getting paged all the time. So he gave it a try, and I was like, let's see if we can make this work. He tried, you know, a couple of rounds, and it just wasn't getting better. So we found a different thing for him to do, right?
Like instead of that, he owned the CI/CD pipeline bugs for half the time. It was something that was equally drudge-worthy, but it wasn't something that would tweak his anxiety.
Like teams, you know, should be pretty understanding of this stuff.
As long as everyone wants to pull together and do their part, people are usually pretty
generous about this sort of thing.
I also, I really like to do primary secondary instead of having one person on call,
because I really think it's healthy to like lower the barrier to ask when you have a buddy who,
you know, you're used to working with. And in the early days, you could often do this by staggering
front end, back end, front end, back end. So you've always got a buddy who knows the part of
the stack that you don't, right? And so if you're like, I want to go to a movie for a couple hours, you know, hey, can you take it for the afternoon?
I think it's really healthy to incentivize people to just ask freely. Right.
Because it's part of making it not impact your life. And as a manager, I would often there would be times in the ebb and flow of the team where, although I took myself,
I think it's really healthy for line managers to be in the on-call rotation if possible.
If not possible, I think they should be very vocally offer themselves up as the
alt of first resort, right?
Anytime that you want to like, if you got paged and you need to take, if you need to
sleep tomorrow night, I'll take the pager,
right? Like I would proactively offer. And if somebody had to go out of town for the weekend, I would offer, you know, and I would repeatedly make it clear that I wanted to be on call at least
a couple of few times a month. And because it's, it starts, the goal is for it not to feel like a
trap, right? The goal is for it not to feel like a tether that impacts your life. It's just a responsibility, right? You carry the pager in your laptop around for a week.
And if it's inconvenient to do that, then you let one of the people who
are happy to take over for the, you know, it's not that big of a deal.
Yeah. Another thing that you mentioned was having at least seven people on the rotation.
And I find it a little ironic and it makes sense that, you know, the Google SRE book says like, oh, you should have at least eight people.
And as someone who's never been on call on a team so huge that it could have eight people...
Well, Google SRE, they're talking about their SLAs. They have a two or three minute response time, and an outage will make the cover of the New York Times. That is the peak maximal pressure, right? We should all be pretty realistic about how much it matters and how quickly. So for most of us, five minutes, 10 minutes, not that big a deal.
Even 30 minutes, okay, text somebody just like,
I'm in a bind, I can't get there.
Can you take it?
And they'll get it.
It's fine.
We've got life.
The system should not be going up and down so much that this is a thing, right? It should be pretty rare that you get alerted outside of hours.
Yeah, what I'm annoyed by is the fact that since there's this one company that's written this book, it's often treated as gospel.
Yeah, it's ridiculous.
So are there any other things from that book that you can think of, just off the top of your head, which don't really apply to regular, non-Google companies?
I'm going to confess that I haven't actually read it.
Okay, yeah. I think it's not too bad.
I feel like I know everything that's in it just because of, you know, the conversations, but I've never felt it. I've been having people tell me what Google does all my life. I don't feel like I need to read it.
Yeah, for sure. So yeah, I think that's one of those things. It also says try to have a European counterpart.
Oh yeah, yeah. Follow the sun. Yeah, in my dreams, I would love to have a London chapter, you know?
Okay.
Yeah, I wish there was
some kind of resource for...
I was thinking of writing a blog
about this,
like healthy on-call rotations
for companies that don't have
money printing machines.
I love that title.
The last blog post I wrote
about on-call was...
It was on-call for managers.
Like here are my expectations.
This is what it means to have a rotation that doesn't suck.
And here's how to do it.
I love that.
On call for companies without money printing machines.
That's pretty much it.
Yeah.
But another point that reminded me of was,
you know,
managers should go on call as well.
I've rarely ever seen that happen.
Yeah, that's unfortunate.
It's unfortunate, kind of, but there's such a wide spectrum of situations here, right? I always insisted on being in the on-call rotation until there was a point, this was at Facebook, when I realized it was hurting my team for me to be on call, because I was in meetings. I'd be halfway across the campus, you know, in some meeting, and I couldn't make it back to my laptop in time. So I would be pinging someone back at the desk, going, oh, could you cover for me? And so, you know, I was just bothering them when they weren't on call. And that's
when I realized I can't, I shouldn't do this anymore. And I will instead become like the first pinch hitter of last resort. But I think this is fairly common when you've got a manager who kind of grew up from the inside and then transitioned into management locally. And it's very rare when you hire in a manager from the outside, or when, you know, they're more of the professional managerial class. But, you know, the leaders that excite me the most tend to be the very hands-on sort
of pendulum ones who go back and forth every couple of few years and, you know, who make it
a point of pride to stay sharp and to stay, you know, embedded in their team's experience.
Yeah, I think your point on meetings is like a good one. I
thought that was, it was a great idea to try convincing my manager after this, but maybe I'm
just going to hold back on that looking at his calendar. Yeah, it's, you know, I think that
I think that you can kind of split the baby by asking them to be on call during non-business hours.
And I've even done this with other people who really struggled with the overnight aspects.
They would be on call during the days and I would just take the nights, right?
Because it grounds them in reality in a way that I think is really healthy, while also, you know...
Yeah, at bigger companies like Dropbox, he's just going to have a hard time doing it during the day.
Yeah, one more thing, because there are so many topics that I could go on about. Why do you think... observability is definitely catching on. Facebook had a tool like Scuba because somebody thought it was a good idea. But at, you know, smaller companies, like the Silicon Valley mid-sized companies where I have friends, things like Scuba or Honeycomb, like observability tools, they haven't caught on as much as, you know, building a monitoring tool. So why do you think that is, and do you think that's going to change?
Well, you know, we have 30 years of writing monitoring tools, you know, like Big Brother; the metric was born in like the 80s. We've just had a long time, and there's some path dependency here, right? Like we just figured out the metrics thing.
databases.
And for a long time, I think that paradigm worked pretty
well because most of the variable complexity
was bound up inside the application.
And you had the app and the database and the network.
And it was never very hard to figure out which component was having a problem.
And if it was in the app, you just have to attach a debugger
or do something complicated there.
But everything kind of started to shift when microservices came around, right?
Because now so much of the complexity inside the application is now hopping the network.
It has been exposed to like the operational side of things.
And the hardest part of the problem is now where in my system is this problem coming
from, you know?
And now we've got polyglot persistence.
We've got half a dozen different kinds of databases and they're all sharded, you know?
And it used to be that, you know, if your uptime was 99.1%, let's say, then 0.9% of the time somebody was having a bad experience, but it was pretty evenly distributed. Now it's probably more likely that, you know, everybody whose last name starts with J-I-L thinks you're 100% down, because that shard is down, right? But everyone else is fine. It's likely to be localized and extreme, because we've done all these things for resiliency's sake to just partition and
shard and distribute and everything. So this is just kind of the natural part of the trade-off
that we've made by embracing more complexity for the sake of resiliency. And so observability is only really four years old. I would say that's when I came up with what I felt was a reasonable technical definition for what technical things you need to be able to ask these kinds of questions and support these kinds of things. So I think that the excitement... I mean, yes, there's been a lot of marketing buzz in the material.
Everybody, literally everybody from like five adjoining industries is like, we're doing observability too now.
Which, if I take the long view, I think this is a good thing.
Because I do think that they're adopting the marketing stuff, the language faster than the features.
But I think that they are all scrambling on the back end
to implement those features.
And I think that a year or two from now maybe,
I mean, these are not trivial migrations to undertake.
But I do think that the clamor from the community
has been so great that they want these, they need these.
I do think that many more companies
are going to make that leap to observability tooling over the next couple years. Because monitoring is really
only the right tool for infrastructure. It's the right tool for the software that you have to run
in order to support the software that is your core differentiator, right? Because the stuff that is
not your code, you know, think about it: you upgrade it on the timeline of, like, dist-upgrade or something, a couple of times a year, maybe, right? It's a black box to you, and you care about it in terms of its capacity, you know, trends over time, the disk space. You know, you care about it, but you care about it in aggregate.
That's what monitoring is for.
That's what metrics are for, right?
So to the extent you have infrastructure, you need it.
The difference is that now, you know, because of third-party platforms and everything, a greater percentage of each company's engineering team is actually devoted to its differentiators, like its code, like the code that you write every day,
the code where it's your responsibility, what your user's experience is, right? And so like,
that is where I, you know, I think that observability becomes really necessary.
And, you know, for the longest time, people were just told that it was impossible, that it was just an impossible problem. You know, you couldn't have high cardinality.
And now that, you know, their bluff has been called on that and it's clear that it's not impossible.
I think it's just a race to implement.
I think that makes sense.
So software complexity is basically increasing a lot.
And the second part is that, yeah, monitoring is more for infrastructure, things like making sure that your service latency is normal or your disk isn't filling up and you probably want to have an alert on that.
But trying to debug why is this user having a particular problem?
That's a problem.
That's something that people need to find out.
And there is a new class of tools required.
And I think then it's also reflected in the demand or at least the hype around things like service meshes
and like Istio and all that.
Because people want more from the existing tools that they have.
So maybe a basic question for you. With microservices, there's also tracing, like distributed tracing. So how do something like Honeycomb and tracing differ?
Oh, yeah, I see tracing as an absolutely necessary component of observability, but it's just taking those same events and visualizing them by time, right? We didn't actually build in tracing
in the early days of Honeycomb. We
just built in the slicing and dicing. And then one day, one of our engineers was like, well,
if we just added an ID and propagated it, then we could, and it was like, oh yeah, you're right.
You know, we could. And so we did. And yeah, I understand why people treat it like a different thing,
but it frustrates me when I see people shelling out to store their data for tracing yet again,
right? So they're paying to store their data as logs, they're paying to store their data as traces,
they're paying to store their data as, you know, metrics. And it's just like, how many times do
you want to store this, right? Because the thing is that you can get all three of those data types
from the arbitrarily wide structured data blobs
and you can't go in reverse.
You can't go to the structured data blob
from the metrics or the logs or the traces.
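(A small sketch of that point: from one wide structured event you can derive a log line, a metric increment, and a trace span, but you can't reconstruct the wide event from any of those. The field names here are hypothetical.)

```python
import json

event = {
    "timestamp": "2021-01-25T12:00:00Z",
    "service": "api",
    "endpoint": "/classes/GameScore",
    "status_code": 500,
    "duration_ms": 238.4,
    "app_id": "b9f2c1d4",
    "trace_id": "7f3a9c",        # propagate one ID and you have tracing
    "span_id": "01",
    "parent_span_id": None,
}

# Derived log line: a flattened, human-readable projection of the event.
log_line = json.dumps(event)

# Derived metric: a pre-aggregated counter; app_id and trace_id are discarded,
# which is exactly the "connective tissue" you can never get back.
metric = ("api.request.errors", 1, {"service": "api", "endpoint": "/classes/GameScore"})

# Derived trace span: the same event, keyed by trace/span IDs and visualized by time.
span = {k: event[k] for k in ("trace_id", "span_id", "parent_span_id",
                              "service", "timestamp", "duration_ms")}
```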
So I think it's an artifact of us being sort of midway
through this transformation.
Like ultimately, you know,
you should have one source of truth
from which you derive your dashboards and your traces and, you know, do all of your exploration, but it should really just
feel like two sides of the same coin. You know, you're just like slice and dice, visualize over
time, you know, back and forth. What observability lets you do, too, is go from very high level, you know, there's a spike, right? Down to very low level: which requests were different from the others, and in which ways? What do these errors have in common that is different from all the other
requests around them?
And then, you know, so with, when you include traces, what that means is like, you can,
you know, you can see your error spike and then, you know, trace one of those for me,
right?
Or trace the median one of those for me. And then when you find the place in the trace that has the problem, you can kind of zoom back out and go, what else is impacted by this, right? And that sort of back and forth, like down and up and in and out, really, to me, defines the experience of just
understanding your systems with observability. Yeah, that sounds like an awesome debugging experience.
Like figure out from like an observability tool, like, okay, this app ID, for example,
has a lot of errors.
Drill down to that, find one request ID that has those errors, follow it through the system.
Yeah.
Okay, it looks like this particular database.
Yeah.
So yeah, that does sound like something I want to implement at some point.
Yeah, no, once you've used it, it's impossible to live without it again.
Yeah. And I think there is that trend of software companies just getting smaller and smaller and caring less about, you know, their infrastructure. And I guess they should, right? Like people should care only about what differentiator they're shipping to their customers.
It's a good thing.
And for people like me that come from ops, like we shouldn't find this threatening at
all.
Like there's never going to be a lack of jobs for our skill set.
Like I promise.
But there is like a specialization that's going on.
And, you know, there's a drift. And if you want to work on infrastructure, you should probably work on an infrastructure tool, or on a product whose differentiator is that it does infrastructure better than other people can, right? Like we're doing infrastructure so that other people don't have to run their own observability tools. So it will always be a differentiator for us.
Yeah.
Someone was just asking me recently, like, should I switch into SRE?
Like, will this job exist?
And it will probably exist, but it will exist in like maybe a different form than it does today.
Like maybe you won't be embedded in it.
Yeah.
Yeah.
So I think there's kind of a split, right? Like we've always kind of jammed together infrastructure, you know, back-end engineers and ops people into this one role. It's like you're an optimizer of socio-technical systems, often by, you know, scrutinizing the release engineering path, the deploy path, the CI/CD path, and just looking for ways to optimize it, ways to toolsmith it. So I think in the future, if you don't want to do infrastructure, almost every engineering team that has more than five people is going to want someone whose job it is to make sure that they're all performing at their peak ability. Because the difference between, you know, systems that are working well and are well-maintained and the ones that are just neglected, it's like, that's what we were talking about in the beginning, right? You need literally three times as many engineers, and it's just not cost-effective, and it's frustrating and slow. And SREs have the systems knowledge and experience to be incredible force multipliers there.
So the split is basically people who work on features that end users see, and then the
people who work on the internal developer platform, in a sense.
Your focus is not maybe just making systems more reliable, but you're force multiplying
all the other engineers by making sure they don't have to worry about all of these other
aspects. It's still kind of about reliability?
So I've always disliked the term SRE, because I'm like, I don't just make shit reliable, I build systems, right? There is a reliability angle to it, though, just in that it's about reliably shipping your code to users and detecting problems and alerting the right people. So, you know, it still fits in the same tent, I guess.
But that also enables your other engineers to move faster
because they can find out about issues faster.
Radically faster.
Yeah.
And it's not just about moving faster either.
It's about living a better life.
It's about, you know, spending.
And, you know, when I talk about, you know, you know, how
teams are wasting 50% of their day and everything.
I want to be clear that I'm not advocating people like working harder and longer, filling
every minute of the day with productivity.
It's the opposite of that.
It's like, you know, you can really only do like maybe four hours a day of really intense cognitive labor.
That's just all you've got in you.
But let's free you up to fucking do that and not spend all your time waiting on people and start, stop, switching around.
If you can just free people up to just focus and just produce, then they can go home at three or four and,
you know, live their lives.
Like butts in seats are not an important metric to me, right?
But like making sure the engineer's time is not wasted and frittered away is my focus
here.
Yeah.
And one more thing I think I read from one of your tweets is also people who ship more
or can ship more efficiently, they're also happier.
They're so much happier.
It makes engineers so much happier. You know, producing more is not what burns engineers out. It's not being able to; it's producing less.
It's being, you know, tied down.
It's being frustrated.
It's being, you know, spread too thin.
It's never seeing your work actually meet your users, right?
It's foundering that burns people out.
And then you combine that with like a tight deadline
and you haven't invested enough
in making your developers really productive.
And then you're pushing them.
You're just like, go faster, go faster.
But they're like, they're strapped in, man.
They're like, they can't go faster
than your system will let them.
Yeah.
That gives a lot of food for thought. Like, you want to make your engineers as effective as possible through some kind of investment in the developer platform. How do you know your investment is enough? As you said, there have to be some guardrail metrics. People's code should be deployed within 15 minutes. What else? How do I know that my engineering organization is healthy?
I think the right place to start is with the DORA metrics, the four DORA metrics that Jez and Gene and them, they wrote the whole book Accelerate about this, right? Which I hope everyone has read. The four key metrics are, you know: how often do you deploy? How long does it take before your code is live? How many deploys fail? And how long does it take to recover? Right? And I really think
that any team could just start by measuring those things.
First of all, whenever you measure something,
it tends to get better, right?
But just knowing where you stand,
you know, and you can plot yourself.
You can see where you measure up next to,
you know, the other teams that they surveyed.
And that's really motivating.
And as you make those metrics better,
your team begins devoting more of its time
to more productive things.
Now, that's not the whole story.
But I think that most people who are just starting out here,
that's a really good place to start.
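(Those four metrics need nothing fancier than a list of deploy and incident records. Here is a minimal sketch of computing them; the record shapes and numbers are hypothetical, just to show that any team can start measuring tomorrow.)

```python
from datetime import datetime, timedelta

deploys = [
    # (merged_at, deployed_at, failed)
    (datetime(2021, 1, 4, 10, 0), datetime(2021, 1, 4, 10, 12), False),
    (datetime(2021, 1, 4, 14, 3), datetime(2021, 1, 4, 14, 15), True),
    (datetime(2021, 1, 5, 9, 30), datetime(2021, 1, 5, 9, 41), False),
]
incidents = [
    # (started_at, resolved_at)
    (datetime(2021, 1, 4, 14, 20), datetime(2021, 1, 4, 14, 55)),
]
period_days = 7

deploy_frequency = len(deploys) / period_days                                  # deploys per day
lead_time = sum(((d - m) for m, d, _ in deploys), timedelta()) / len(deploys)  # merge -> live
change_failure_rate = sum(1 for _, _, failed in deploys if failed) / len(deploys)
time_to_recover = sum(((r - s) for s, r in incidents), timedelta()) / len(incidents)

print(deploy_frequency, lead_time, change_failure_rate, time_to_recover)
```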
Yeah, yeah.
So that makes sense.
Like you take this set of metrics,
which you know that the industry has like,
people have done research and figured out.
Yes, exactly.
There's been like experimental evaluation.
And you try to just compare your organization, then you can make a case to somebody to say, you know what, we should make these metrics better.
Yeah, I feel like right now we're in this interesting, like I straight up, I definitely see this as a failure of leadership.
This is a failure on management's part and leadership's part because by and large,
engineers are already sold on this and they're dying to spend time working on this and they're
not being allowed to do so. And I think that we're in this unfortunate, you know, sort of valley here
where, yeah, every organization is a technical, every company is to some extent, you know, an engineering organization, right? But the leadership at the top is generally not made of engineers, right? And there's some packet loss somewhere between, you know, the engineers and the engineering managers who generally know and can make a case for this. And, you know, maybe the director or VPs who somewhere in there, this message is getting lost that this is how you make it better.
This is how you make your customers happier. You know, and I get that it's a very abstract
technical argument, but it's not particularly controversial or difficult. I really think that,
you know, while I blame the leadership for this, I, because,
you know, I, I also am just someone who feels like rather than complaining about what we don't
have, we should do what we can with what we do have, right? I think that, you know, as engineering
managers, leaders, we need to start being much more vocal about this, being much louder about
it, being much more, try different ways of making it, put it in front of different people's eyes,
like make the case in a few different ways. Try converting it into people years and people
hours, dollars. Anything that you're just talking about in abstract engineering terms is not going
to resonate with them. But if you convert it even in a very messy back of the envelope way to dollars,
that'll probably get their attention. If nothing else, think about retaining your best employees
because they're increasingly chafing at the idea of working at teams that are not
really investing in this stuff, because they know how much of their life is being wasted.
So my hope is that over the next few years, these organizations will mature, the ones that maybe never understood that they were starting an engineering organization, but now they've got one, right? And they've got to make them happy if they want to compete, and this is how you compete.
Yeah. And I'm also thinking there could be a case where people have just been in such a bad situation for so long, they forget that it's bad.
There's so much learned helplessness. And there's also this attitude that makes me really sad
where you talk to people sometimes and they're like,
yeah, I know, but that's for Silicon Valley companies.
You know, they're just like, we don't get nice things.
You know, that's not for us.
We're not good enough.
And you just want to be like, dude,
the engineers here are not better than the engineers there.
They are privileged enough to work in better systems by and large.
Sometimes, sometimes not.
Like some of the shittiest work that I've ever seen has also been in Silicon Valley.
This is not an elitism thing.
This is a very accessible, realistic thing that anyone who's capable of shipping code
is capable of making their systems better.
Yeah. And it's also sad that in leadership there is that packet loss, as you mentioned, but I wonder how much of leadership is just unaware that this is, like, best practice, you know?
Yeah, I think that, you know, we as engineers, and we as people, we always ascribe so much more knowledge and intention than actually exists. You know, what's the quote? Like, nine tenths of the time, what you think is malice can be explained by ignorance, or something like that. Like, they just don't know. And we'll be like, but we told them. And it's like,
well, yeah, two years ago, you mentioned it once during lunch, but like, you know,
telling something to upper management requires a campaign and a consistent campaign of multiple
people kind of coordinating your messaging, right? Making sure that all of the key people are on board. You have to think about it like changing the trajectory of an ocean liner, right? Like, it can be done, but it takes some planning and it takes some coordination, and just mentioning it once does not count.
Yeah. And if you do that coordination and you try it and it doesn't work, you should leave.
Yeah, you should definitely vote with your feet. There's only so much you can do, and you shouldn't reward people who refuse to change. They're hurting their own people, and you shouldn't reward them with your labor and your presence, which is very valuable.
And a flip side of this whole discussion is that there's more and more space for developer tools startups to grow, because there are so many things that could be improved.
So many.
And for sure, not every company should be building its own Honeycomb.
No, almost nobody should be building their own. It should be seen as a failure whenever, you know, a company decides to in-house and build some other tool that is not, you know, what they do for a living. It's sometimes inevitable, but much less often than people think.
So I want to ask about your experience going from starting a company to... I think I saw the Honeycomb team page; it has more than 60 people now. I don't know how up to date that is, but that's a pretty big team.
Yeah, we just basically doubled the size of engineering, and we've gone from zero designers to seven designers. We're really making a big bet on design over the next year or two, which I'm pretty excited about. It's a little strange, but yeah.
Yeah, that's seven more designers than on most developer tools, right?
I know, right? Yeah, no, I'm all in. In fact, I've kind of had a come-to-Jesus moment here, which is just that, you know, there are things where we've built this feature, you know, and built it again, and we've engineered the crap out of it. And why isn't it... why aren't people using it? And now I'm like, because that's not an engineering problem, is it? It's a design problem.
Is there anything specific you can talk about that?
Because I'd love to know, like, because I would think that, you know, Honeycomb was built by engineers for engineers.
Like, what was the design problem there?
Oh my goodness.
Oh, so many.
So for example,
ever since the beginning,
I've always seen this as being,
it should be a pretty intensely collaborative
and social experience, I think.
Because when you're debugging part of your complex system, you know your corner of the system intimately, like deeply, very well, but you don't know the rest of the system that intimately, right? But when you're debugging a problem, you need to be able to see the entire span. And so you should be able to lean on your coworkers and their intimate knowledge of their parts of the system when trying to ask questions.
So, like, you know, you have your full query history in Honeycomb.
And you also have access to, you know, your team's query history.
And, you know, I feel like if you get paged in the middle of the night and it's like, oh, this looks like a MySQL problem. I don't know fuck all about MySQL.
But I know that the experts on the team are Emily and Ben.
And I feel like we had an outage like this last Thanksgiving.
I'm going to go, I think Ben was on call.
I'm going to go look at what he did.
Like what questions did he ask?
How did he interact with the system?
You know, what was useful enough that it got, you know, run 50 times?
You know, what got attached to a postmortem document or tagged? You know, just leaning on each other, looking for ways to bring everyone
up to the level of the best debugger in every corner of the system. Because then, you know,
Ben could go out on vacation or on his honeymoon. And, you know, the remnants of how he interacted
with the stuff he was building are still there. So if we have a problem with it while he's out,
like we can just go see what would Ben do, right? Like social and collaborative stuff like that.
You know, because honestly,
query builders are intensely challenging, off-putting. Like, for most people in the world, especially when they're under pressure, time pressure, an outage or whatever, the last thing they want to do is try and compose a new query from scratch.
It's just really hard, right?
You have to switch from your flow brain to, OK,
let me understand this tool, which is bad.
And what we did at Facebook was we
would pass around these shortcut URLs, just in, like, notepads. Any time we had an outage or something, I'd be like, oh, that's an interesting query, I'm going to yoink it, add it with a couple comments, you know, maybe "Thanksgiving outage", "sharding", whatever. And then when I'm trying to debug something, the first thing I do is not build a query, it's go to my notepad and go, oh yeah, it was the indexing job, and then paste that in and then tweak it. Anyone can tweak a query, right? It's very easy to tweak it once it's there. But finding your way through this massive system to that area, that requires shortcuts and leaning on the social part of our brain, which is much easier than trying to rely on the computing, like, quantitative part of our brain, which is very expensive.
Yeah, that makes sense. And I think the way we work around that is to just have, like, that custom dashboard, and you keep on adding stuff. Because I can totally see, like, you know, I'm not sure whether I built the right query, am I passing in the right task parameter or something, and what have I missed? So you just make it easy to share queries that, you know, your coworkers made.
So that makes a lot of sense.
And I never thought about that use case.
Yeah.
And just incentivizing people to add text, like, to describe, you know, what they're looking at or what they did.
Or, like, apply, you know, maybe, like, look at your past history over the past couple of days and then just, like, yoink, yoink, yoink, yoink, apply a set of tags.
Like, this was useful to me, right?
Like out of all the things that I did
while I was debugging and interacting,
like these were the useful ones.
Let me just put those in my collection with a few comments,
you know, because, you know, you've got this rich,
you know, historical, you know, memory in your brain,
but it's indexed with little scraps, you know,
just like this word or this tag or this thing, you know?
And just like, once you bring that out of people's heads and you put it in a tool,
you're democratizing that information. You're making it so that, you know, we really want to just become this outsourced brain, right? In the same way that, like, I don't memorize phone numbers anymore. I know where they are and I can look them up much more quickly, right? That's kind of what we want Honeycomb to be like for engineering teams.
Yeah, so you make it really easy to create, like, a debugging playbook on the fly.
Yeah.
And now I'm thinking of more ideas, like, what if you could automatically connect it to, like, a PagerDuty incident?
Oh yeah, totally, all that stuff. And, you know, we have markers where you can overlay, across this time period, maybe the entire system was impacted. You just put a span with a comment, and when you're querying, that should show up too.
I have a bash snippet where whenever I run sudo, it draws a line that's like, this command was run as root on the system at this time. Yeah, there's just so many cool things that you can do visually.
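For a sense of what that marker flow could look like, here is a small hand-written sketch in Go that posts a marker, for example from a deploy script or a sudo wrapper. It uses Honeycomb's Markers API as I understand it; the endpoint, field names, dataset name, and the API-key environment variable are assumptions and should be checked against the docs.

```go
// Sketch: post a marker to annotate a point in time on Honeycomb graphs.
// Endpoint, fields, dataset, and env var are assumptions, not verified API docs.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

func postMarker(dataset, message, markerType string) error {
	body, _ := json.Marshal(map[string]interface{}{
		"message":    message,
		"type":       markerType, // e.g. "deploy", "sudo"
		"start_time": time.Now().Unix(),
	})
	req, err := http.NewRequest("POST",
		"https://api.honeycomb.io/1/markers/"+dataset, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("X-Honeycomb-Team", os.Getenv("HONEYCOMB_API_KEY"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("marker API returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical usage: drop a line on the graph whenever sudo runs.
	if err := postMarker("production", "sudo run on web-3", "sudo"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```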
And, you know, these are super design problems, right? Like, they're engineering problems too, but how to make this feel intuitive to other people who are not you? Oh my God, this is deep design.
Yeah, that makes sense. Like, how do you know whether a query made by a coworker made sense, or they were just playing with something?
Yeah, exactly. It's actually a pretty hard problem, and you have to think quite a lot about it. And it's not really an engineering problem. It's not an engineering problem at all.
Yeah. But what was it like going from, you know, you had this idea that I should start this company, with your co-founder? Did you go validate with customers first? Did you just start building?
None of that. I was just like, I don't know, in retrospect,
I'm like, that seems very dumb and arrogant, but like, I just,
honestly, I was just so sure that we were going to fail that I was just like,
you know, that didn't bother me. After a couple of years at Facebook,
I really kind of wanted to just sit in a corner and
write Go code for a year or two.
And, you know, people were offering me some money and I was like, cool, I don't want to
live without this tool.
We'll take the money, we'll build it and then we'll fail and I'll open source it and I won't
have to live without this tool.
Right.
Like, perfect plan.
We just keep not failing.
Yeah, that's a great story. I don't know if you would give that advice, though.
No, it's terrible advice. No, you should definitely start by validating your customer and everything.
So I'm guessing at some point you built out, like, an MVP-type thing, and then you started showing it to other people. What was the whole story?
For the first year, year and a half, we couldn't do much but write the database.
And I don't, because I was never one of those kids who's like, I'm going to start a company.
You know, I really kind of despise the founder industrial complex.
But, you know, I just kind of accidentally became a founder, you know.
But there's a reason that nobody had done this before, because you have to start with a custom database. And that means that you don't even get to start, you know, building a product that anybody can say anything about for a while. And, you know, our investors were getting all, you know, they were like, can't you just use something off the shelf until you've validated the market? And I was just like,
no, I can't. Because if I do that, it will look and feel exactly like every other tool out there.
And this has to be radically different in these ways or there's no point.
Like it's already been reinvented like too many times.
And now I understand that they were right in a way, but also I was right in a way.
And we're honestly just very lucky that we managed to survive long enough to like see
it all pay off.
There were several points in the first four years where it
looked very dicey. And I was like, well, now it's over. But we did get pretty lucky. And we had some
investors who really stuck with us through thick and thin. And now it's taking off. And everybody's
like, oh, this is obvious. And it's like, it was not obvious. It was not obvious that this was a
wise decision. And in fact, I made a secret series of very poor decisions. And I'm just lucky that we're still
around.
Yeah, for, like, the minimum lovable product for something like Honeycomb, the latency has to be low enough for somebody.
The 90th percentile for our queries is, like, under a second.
That's amazing.
Yeah, because without that, you would not be impressed.
Without that, it has to be a state of flow, right?
You have to be debugging
and just like asking questions iteratively
without stopping and waiting for your system
to like compute, right?
It's absolutely key.
And this is something that I always forget
is a differentiator for us
because it seems so obvious when it's fast,
when it's easy.
But then I go back and use any other tool out there, where it's just running minutes behind, you know. With Honeycomb, it's like, as soon as you ship the event, it shows up. And we get alerted if there's more than, like, five or six seconds of delay, and it never alerts, because it's never that delayed. It's just always there. And I forget how unusual that is, because it's just one of those things that's invisible when it works, right?
Yeah. And a few questions, and if you don't want to share, it's totally fine. I'm curious about how you shard Honeycomb when you deploy it as a service. Like, do you shard it per customer or something?
Yeah. When we provision a customer, so we've got a bunch of pairs of nodes, right? When data talks to the API, you send some events through the API. The API is a very thin, reliable layer: all it does is accept events, apply some filters or whatever, protect the service, and drop them into Kafka. And it looks up which Kafka partitions to drop them into based on the user ID. Each topic partition has a pair of nodes consuming from it, so one can go down and that's fine. And then a few hours later, it gets aged out to S3.
So basically, when a new user gets deployed or provisioned,
I think it provisions it automatically
across two or three shards.
And if we want to, you know, raise that, we just say, shard it across six or ten shards or whatever. The data is all immutable, you know; it just drops it onto an immutable column store, so we just write it across more shards from there on out. And then the read aggregator does all the math of, like, well, past this date it was only across two shards, after this date it's across five shards or whatever. And so it all looks natural to the users.
But yeah, it's dead simple.
It's dead simple.
It's not even fair to call it a database.
It's really a storage engine.
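A minimal sketch of that write path, under my own naming (this is not Honeycomb's code): each event is routed to one of the customer's assigned Kafka partitions by hashing, so the assignment can later be widened without rewriting old data, and the read side only needs to know which date ranges map to which shard count.

```go
// Sketch of routing events to a customer's assigned Kafka partitions by hashing.
package main

import (
	"fmt"
	"hash/fnv"
)

// Dataset records which partitions a customer's data is written to.
// Field names are illustrative, not Honeycomb's schema.
type Dataset struct {
	CustomerID string
	Partitions []int // e.g. provisioned across 2-3 partitions, later widened
}

// partitionFor picks a partition for an incoming event by hashing a routing
// key (customer ID plus a per-event key) over the assigned set.
func partitionFor(d Dataset, eventKey string) int {
	h := fnv.New32a()
	h.Write([]byte(d.CustomerID + "/" + eventKey))
	return d.Partitions[int(h.Sum32())%len(d.Partitions)]
}

func main() {
	d := Dataset{CustomerID: "acme", Partitions: []int{4, 9, 17}}
	fmt.Println("write event to partition", partitionFor(d, "trace-abc123"))
}
```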
Interesting.
And yeah, and you said that most of the work
is done at like read aggregation time rather than write aggregation time.
Yeah, it fans out, and the reads are just a column scan that filters and aggregates the data and returns it to the user.
Have you had any problems with tail latency at all, if it fans out super wide, or not really?
So this is where one of our company values comes in, you know, that fast and close to right is better than slow and perfect, because this is literally what we do in the storage engine. If you're fanning out to five nodes and one isn't responding, after a couple of seconds we cut it off, and there'll just be a little note that, like, 20% of the results are missing, right? Because it's way more valuable to us to have that speed than perfect accuracy, because lots of people are sampling anyway.
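That cutoff is easy to picture in code. Here is a rough sketch, not Honeycomb's implementation: fan a query out to shards concurrently, wait up to a deadline, and return whatever partial results arrived along with a count of what's missing.

```go
// Sketch of "fast and close to right": fan out, cut off stragglers at a deadline.
package main

import (
	"fmt"
	"time"
)

type shardResult struct {
	shard int
	rows  int // stand-in for the real aggregated data
}

// queryShard stands in for the real column scan + aggregate on one node.
func queryShard(s int) int {
	if s == 3 {
		time.Sleep(5 * time.Second) // simulate one unresponsive node
	}
	return 100
}

func queryShards(shards []int, timeout time.Duration) (results []shardResult, missing int) {
	ch := make(chan shardResult, len(shards))
	for _, s := range shards {
		go func(s int) {
			ch <- shardResult{shard: s, rows: queryShard(s)}
		}(s)
	}
	deadline := time.After(timeout)
	for range shards {
		select {
		case r := <-ch:
			results = append(results, r)
		case <-deadline:
			// Cut off the stragglers; report how many shards we're missing.
			return results, len(shards) - len(results)
		}
	}
	return results, 0
}

func main() {
	res, missing := queryShards([]int{1, 2, 3, 4, 5}, 2*time.Second)
	fmt.Printf("got %d shard results, %d missing\n", len(res), missing)
}
```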
Yeah, yeah. People might not even care about this if they're just like trying to see.
Yeah.
That makes a lot of sense. And it's actually pretty simple. It's much simpler than...
Super simple.
Yeah, I thought it would be.
I'm a big fan of simple.
Yeah, for sure. Complex systems get hard to debug no matter how many nice tools you have, right? And if you can't understand it, then there isn't really any point.
Cool. And so you started off building this for, like, a year, you said, just working on this storage system?
Yeah, and towards the end of the first year we were starting to put a super basic, you know, querier on top of it. And I think it was a year and a half or so in that we got our first free user. We strong-armed one of our friends into using us, which is Nylas. Bless their hearts. We love them. They're good friends of ours. They were right around the corner from us at the time, and they were using Loggly, and we were like, Honeycomb? Yeah. It took less than two years
to get our first customer, but not substantially less.
So it's a long time in startup years.
Yeah.
And then did it just like grow from word of mouth or something?
Yeah, to this day, almost all of our leads are inbound. You know, people have been following us on Twitter for a while, or it's word of mouth. Another source of opportunities for us is actually when people leave their jobs and go to other places, and they tend to bring us with them. That's, like, one of the first things that they do, because, you know, it's that useful. It's definitely something where, I think, the reason that we survived and still exist is the strength of our users raving about us to potential investors, the strength of their recommendation. And it's just that they've had the same experience I did, which is, I don't want to live without this ever again.
And are there any tools like that, that you miss or tools that you think should exist,
but nobody's building them or you haven't seen a good one yet.
You know, I haven't really seen a good canary tool yet for a lot of the complexity around... it makes no sense that so many companies still have to write their own fucking deploy tools, you know. Like, Capistrano can fuck off and die. But it feels like something around progressive deployments, canaries, maybe almost extending tests to encompass writing an expectation, like, after this version is deployed, I expect this end-to-end request URL to complete or go green or something, right? That sort of thing, I would love to see a tool like that.
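Something in that spirit could be as small as a post-deploy smoke check. Here is a hypothetical sketch, with the URL, version header, and thresholds made up for illustration, of asserting that an end-to-end request goes green after a version ships.

```go
// Sketch of a post-deploy expectation: the new version must answer an
// end-to-end request within a time budget, or the rollout is considered bad.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func deployLooksGood(url, expectVersion string, within time.Duration) bool {
	deadline := time.Now().Add(within)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			ok := resp.StatusCode == http.StatusOK &&
				resp.Header.Get("X-App-Version") == expectVersion // hypothetical header
			resp.Body.Close()
			if ok {
				return true
			}
		}
		time.Sleep(5 * time.Second) // retry until the deadline
	}
	return false
}

func main() {
	if !deployLooksGood("https://example.com/healthz/e2e", "v1.2.3", 2*time.Minute) {
		fmt.Println("post-deploy check failed; roll back")
		os.Exit(1)
	}
	fmt.Println("post-deploy check passed")
}
```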
Yeah. Maybe the issue is that it might have to be a little customized for every single one.
I think that's probably what it is. I think it's that there's like, there's a threshold of
standardization and then below that it's like custom, custom, custom, custom.
And it gets very hard to impose discipline
across a wild, wild west that's been growing
in so many different ways for so long.
It probably won't change until some other layers get
sort of standardized down there, and then we'll
be able to standardize on top of that.
Yeah.
I've thought about another, like, CI/CD tool, but then so many people are just stuck on Jenkins, and trying to migrate off Jenkins is so much work for everyone.
Yeah, that's the hard part, yeah.
But with Honeycomb, it seems like you can have pretty small libraries across multiple languages.
Yeah. The stuff that the libraries do is literally just,
you know, the way I often describe it to people is like,
you used to be able to strace your process, right?
You could just attach a trace and watch it hop through.
And having all that context around the request
is really valuable.
Now, like, as soon as your request is hopping
from service to service,
you're losing all of that context every time you hop, right? And so the work of the observability client libraries is to package up all of that context, you know, all the parameters that were passed in, all of the environment stuff, all the language parameters, all of the important variables that were set, and just ship it along with the request, so that you can link it by a single ID, so that you don't lose that. Because you need to know what happened earlier, at the beginning of the request, even at the very end, if you're trying to... sorry, I got distracted. So what the libraries do is just, you know, initialize an empty Honeycomb event
and then pre-populate it with parameters,
anything that is known from the past.
And then, while that request is executing inside that service,
you can just do like a printf basically just like,
oh, this might be useful.
Stick it on the blob.
Oh, that might be useful.
You know, any shopping cart ID, user ID,
you know, any high cardinality dimension,
you know, just shove it in.
And then at the end of the request,
when it's about to error or exit,
you just ship that off to Honeycomb
as one arbitrarily wide structured data blob.
And that's all that the libraries need to do
is just like wrap all that context together
and ship it off in one event.
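A hand-rolled sketch of that pattern, not the actual Honeycomb client library: open an event at the start of the request, pre-populate it with whatever is already known, keep adding fields as interesting things happen, and ship the whole wide blob once at the end. All names here are illustrative.

```go
// Sketch of the wide-event pattern: one arbitrarily wide blob per request.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type Event struct {
	fields map[string]interface{}
	start  time.Time
}

func NewEvent() *Event {
	return &Event{fields: map[string]interface{}{}, start: time.Now()}
}

// AddField is the "printf" moment: oh, this might be useful, stick it on the blob.
func (e *Event) AddField(key string, val interface{}) { e.fields[key] = val }

// Send ships the wide structured blob off in one go (here, just stdout).
func (e *Event) Send() {
	e.fields["duration_ms"] = time.Since(e.start).Milliseconds()
	b, _ := json.Marshal(e.fields)
	fmt.Println(string(b))
}

func handleRequest(userID, cartID string) {
	ev := NewEvent()
	// Pre-populate with anything known from upstream (trace ID, parameters, env).
	ev.AddField("trace.trace_id", "abc123")
	ev.AddField("user_id", userID) // high-cardinality dimensions are fine
	ev.AddField("cart_id", cartID)
	defer ev.Send() // ship once, at the end of the request

	// ... do the actual work, adding fields as interesting things happen ...
	ev.AddField("cache_hit", false)
}

func main() { handleRequest("user-42", "cart-9000") }
```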
Yeah, that makes sense.
And have any customers complained about, like, PII getting inadvertently logged?
Yeah.
So we have a solution for that,
which is this secure proxy thing
where if you're running inside a secured network,
you can run this proxy in an AWS ASG and just stream all of your events through the proxy to us.
And the proxy, like, you know, all the secrets are stored by you.
We never get the secrets.
We can never decrypt it.
It stores a mapping of the original event to, like, the encrypted event or to a hash of the event.
And then forwards only the hash of the encrypted stream.
Then you VPN in, and in your browser you point to Honeycomb,
but the JavaScript in your browser knows to take what Honeycomb has sent and
use it to look up from the proxy behind
your secure network to fill in the actual values.
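A toy sketch of that idea under my own assumptions (this is not Honeycomb's actual secure proxy): sensitive field values are replaced with keyed hashes before events leave the network, and the real values stay in a mapping that only clients inside the network can resolve.

```go
// Sketch: scrub sensitive fields into hashes on the way out, resolve them
// back to real values only from inside the customer's network.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ScrubbingProxy holds a customer-held key and a reverse map from hash to original.
type ScrubbingProxy struct {
	key     []byte
	reverse map[string]string
}

func NewScrubbingProxy(key []byte) *ScrubbingProxy {
	return &ScrubbingProxy{key: key, reverse: map[string]string{}}
}

// Scrub replaces sensitive field values with keyed hashes and records the
// mapping locally; only the scrubbed event would be forwarded outside.
func (p *ScrubbingProxy) Scrub(event map[string]string, sensitive []string) map[string]string {
	out := map[string]string{}
	for k, v := range event {
		out[k] = v
	}
	for _, field := range sensitive {
		if v, ok := out[field]; ok {
			mac := hmac.New(sha256.New, p.key)
			mac.Write([]byte(v))
			h := hex.EncodeToString(mac.Sum(nil))[:16]
			p.reverse[h] = v // lookup stays behind the customer's network
			out[field] = h
		}
	}
	return out
}

// Resolve is what the in-browser lookup would call, over the VPN.
func (p *ScrubbingProxy) Resolve(hash string) (string, bool) {
	v, ok := p.reverse[hash]
	return v, ok
}

func main() {
	p := NewScrubbingProxy([]byte("customer-held-secret"))
	scrubbed := p.Scrub(map[string]string{"user_email": "ada@example.com", "status": "500"},
		[]string{"user_email"})
	fmt.Println("forwarded:", scrubbed)
	fmt.Println(p.Resolve(scrubbed["user_email"]))
}
```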
Well, that's such an innovative solution.
I would have never thought about something like that.
Like we constantly see, oh, we logged something that we shouldn't have.
And yeah, okay, now it's like a security incident and we need to clean it up somehow.
But yeah, that's great.
And just as some closing thoughts, is there anything you want to share with listeners about how they should think about observability? I know we've already spoken a bunch about this, but anything that we missed?
Well, first of all, Honeycomb has a free tier that's really generous. You can use it without having to sample at all, you can use it for real workloads. And I think this is kind of something where every person is different in what makes it click for them, right? And I think there's no substitute for just kind of getting your hands dirty and seeing the difference, because it's like a taste of the future, and I think engineers should be stoked by that. If you can't use Honeycomb, that's fine. LightStep is also observability by my definition. Sadly, I don't think anyone else is. I mean, those are kind of your two options right now. But I do think that the big players, you know, the New Relics and Datadogs of the world, are racing towards that, and we'll get there sometime in the next couple of years. You can do your own quick and dirty,
especially if you're small, just by storing those.
In fact, AWS has been doing this, I found out recently, for, like, 10 years.
This is how they store their telemetry,
arbitrarily wide structured data blobs
in like a temporary like file on each node.
And so when they're trying to understand what's happening,
they'll just do the equivalent of a distributed shell out
and just aggregate and slice and dice on that data.
It is different, and it is something
that should be in every engineer's toolkit.
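The quick-and-dirty version really can be that small. Here is a sketch under my own assumptions (the file path and field names are illustrative): append arbitrarily wide JSON event blobs to a local file on each node, then aggregate and slice them later with whatever distributed shell-out or script you like.

```go
// Sketch: quick-and-dirty wide events, one JSON object per line, in a local file.
package main

import (
	"encoding/json"
	"os"
	"time"
)

func logWideEvent(fields map[string]interface{}) error {
	fields["timestamp"] = time.Now().UTC().Format(time.RFC3339)
	f, err := os.OpenFile("/tmp/events.jsonl", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(fields) // Encode appends a newline: JSONL
}

func main() {
	logWideEvent(map[string]interface{}{
		"service": "checkout", "user_id": "user-42", "duration_ms": 87, "status": 200,
	})
}
```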
I think that instead of giving advice for observability,
I will repeat that this should be the year that everyone gets their deployments automated, so that no individual has to be a gatekeeper, so that it happens automatically within a few minutes. Think how much of your life you will reclaim, and your coworkers will reclaim, and all of the cool things that you can spend that time on. I don't care whether it's at work or at play. Stop wasting your life on deploys.
Yeah, I'm excited to see that tweet storm at the end of the year, with all of the replies saying, yeah, we all have automated deployments, why are you even asking, it's just, like, bread and butter now.
Yeah, and honestly, if people are really struggling and they want advice, I actually have a Calendly link, it's calendly.com/charitym, and there's a link there for people if you want. You have to send me 50 bucks to reserve your spot, because I'm sick of flakes, but I will happily talk through your organizational issues, strategize on how to get people to give it a try, technical, social, the whole works.
Yeah, I'll put that in the show notes. But yeah, thanks so much for being a guest. I had a lot of fun. I think it was a really high-bandwidth conversation.
Yes, great. You're great.