Software at Scale - Software at Scale 7 - Charity Majors: CTO, Honeycomb

Episode Date: January 25, 2021

Charity Majors is the CTO of Honeycomb, a platform that helps engineers understand their own production systems through observability. Honeycomb is very different from traditional monitoring tools like Wavefront, as it is built for data with high cardinality and high dimensionality, which can instantly speed up debugging of many problems.

Apple Podcasts | Spotify | Google Podcasts

NOTE: This episode has some explicit language.

We talk about observability, monitoring, building your own database for a particular use case, starting a developer tool startup, having the right oncall culture, getting to fifteen-minute deployments, and more.

Highlights

Notes are italicized

05:00 - High cardinality and high dimensionality in Honeycomb. Data retention in Honeycomb - 60 days. Many monitoring systems, like Dropbox’s Vortex, downsample data in two weeks
13:00 - Observability driven development. The impact of deploying code within 15 minutes of it being merged. Synchronous and asynchronous engineering workflows
19:00 - Setting up oncall rotations. What the size of a rotation should be
21:00 - How often should someone on a 24/7 oncall rotation be woken up? Once or twice a year. But there are exceptions. The impractical nature of some of the Google SRE book’s “Being Oncall” chapter. Oncall for managers
31:00 - Why are monitoring tools so ubiquitous compared to observability tools?
36:00 - Observability & Tracing. What the future of observability infrastructure might look like
40:00 - What will the job of an SRE look like in the future? The split of roles in software engineering organizations in the future
43:00 - Shipping code faster makes engineers happier. How do you ensure your engineering organization is healthy, and the metrics to use. Learned helplessness in engineering organizations, and leadership failures
51:00 - Building internal tools in-house vs. using external tools. The large impact that designers at Honeycomb have had on the product.
58:00 - The story of starting Honeycomb. Creating a “Minimum Lovable Product”. A description of Honeycomb’s internal architecture. Dealing with tail latencies.
71:00 - Continuous Deployment and releasing code quickly. Use calendly.com/charitym if you want to chat with Charity about continuous deployment best practices or anything else.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Thanks, folks, for joining me on an episode of the Software at Scale podcast. We're joined here today with Charity Majors, and in her own words, Charity is an ops engineer and an accidental startup founder at Honeycomb IO. And before this, she worked at Parse, Facebook, Linden Lab on infra and developer tools. And she always seems to wind up running the databases. Always. And she's also the co-author of O'Reilly's Database Reliability Engineering, loves free speech, free software and single malt scotch. And I'll talk a little bit about Honeycomb IO.
Starting point is 00:00:46 So it's an observability system. It helps you understand your systems better. And maybe, Charity, if you just want to get started with telling us about what Honeycomb is. Sure. Honeycomb was really the first observability product. We kind of developed the language of observability. You know, my co-founder, Christine, and I had this experience
Starting point is 00:01:10 at Facebook of using and, you know, collaborating on a tool that was just radically different from anything that I'd ever used in the past. And I've been on call since I was 17, right? So I've used all the tools. But, you know, when Parse was going through our really rapid growth spurt back in 2012 or so, we had 60,000 mobile apps. By the time we sold to Facebook, when I left, there were over a million mobile apps. And every day a different app would hit the iTunes top 10, or it would take off. I think the first one was this Swedish death metal band. It just came out of nowhere. All of my tactics for monitoring systems, predicting how they would fail,
Starting point is 00:01:54 then writing monitoring checks, making dashboards, and post-morteming, and creating documentation so that we could find it immediately the next time. It was all basically useless, because these things weren't breaking in a patterned way. And I had tried everything. And we were going down multiple times a day. It was a really rocky and stressful time. But we started to get a handle on it. The first crack of hope I got was when we had started feeding some data sets into this tool at Facebook called Scuba, which is butt ugly, like aggressively hostile to users. It does one thing well, which is letting you slice and dice high-cardinality dimensions in near real time.
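(A toy sketch of what "slice and dice high-cardinality dimensions" means in practice. The events and the `breakdown` helper below are hypothetical illustrations, not Scuba's or Honeycomb's actual implementation: raw, arbitrarily wide events are kept as-is, so any field, even a high-cardinality one like an app ID, can be grouped on after the fact.)

```python
# Wide events kept raw: nothing is pre-aggregated, so you can group by
# any high-cardinality field (like app_id) long after the data was written.
from collections import defaultdict

events = [
    {"app_id": "app-1", "endpoint": "/login", "duration_ms": 12},
    {"app_id": "app-2", "endpoint": "/sync", "duration_ms": 950},
    {"app_id": "app-2", "endpoint": "/sync", "duration_ms": 870},
    {"app_id": "app-1", "endpoint": "/login", "duration_ms": 15},
]

def breakdown(events, field, value_key):
    """Group raw events by an arbitrary field and aggregate at read time."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event[value_key])
    return {key: max(values) for key, values in groups.items()}

print(breakdown(events, "app_id", "duration_ms"))  # {'app-1': 15, 'app-2': 950}
```

Because the aggregation happens at read time, the same raw events can be broken down by `endpoint`, or any other field, with no schema change.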
Starting point is 00:02:40 And, you know, we started feeding some data into there, and the time it took us to identify and pinpoint the cause of the day's problems just dropped like a rock. From days, open-ended, who knows, maybe we'll get lucky, down to seconds. Not even minutes. It wasn't even an engineering problem anymore; it was a support problem. We'd just break down by the app ID, which is something you can't do in any monitoring product, right? Break down by the app ID and just follow the trail of breadcrumbs, and it would lead you to the answer every time. And this made a huge impact on my life. Suddenly I could sleep again. We had a system that I felt proud of again. So when I was leaving Facebook,
Starting point is 00:03:25 I kind of stopped short and went, oh shit, I don't know how to engineer anymore without this stuff that we've built. It's become so core to my experience of software. It's not even that it's a tool anymore; it's my five senses for production. And the idea of going back to not having it is like flying blind. My ego couldn't take it, you know? So we decided to build something to kind of approximate that experience. But right from the beginning, we could tell that the language that existed was very inadequate. It wasn't a monitoring tool, because it wasn't,
Starting point is 00:04:06 you know, reactive like that. And then, like six months in, I happened to google the definition of observability, which nobody was really using at the time. And I looked it up, and I read the definition, which comes from mechanical engineering and control systems theory. It's about, you know, how well can you understand the inner workings of your system just by looking at it from the outside? And I just had fireworks going on. I'm like, oh my God, this is what we're trying to build, right? We're trying to build something that will let us ask any question from the outside,
Starting point is 00:04:39 like ask any combination of questions, new questions, whatever, to describe some novel system state that we've never encountered before, we have no prior knowledge of, and we can't write any custom code to describe it because that would imply that we could predict it in advance, right? And so it's like, it's the unknown unknowns problem. So we started talking about observability, we started, you know, building Honeycomb to that spec. And what was the question? Here we are today. Yeah. Yeah.
Starting point is 00:05:09 So this is just about Honeycomb, right? And the key insight seems to be high cardinality, right? Most monitoring... It's one of them. Okay. Another one is high dimensionality. You know, because whenever you're pinpointing a problem in modern systems, it's high cardinality, absolutely.
Starting point is 00:05:29 Everything's a high cardinality dimension. But it's also being able to string together as many of those as you want. Like maybe the bug is only triggered for iOS devices running a particular version using this particular firmware, this language pack, this version of the app ID, this and this region, you know, it's just like every single one of those, right? And when you've got metrics, you can't do that because you discarded all of that connective tissue when you wrote that data out at the beginning. You can't ever recreate that, right? Which is like the source of truth for observability is these arbitrarily wide structured data blobs that you could just slice and dice and combine and recombine as much as you want. It's just a different level of flexibility. So you get high cardinality and high dimensionality,
Starting point is 00:06:14 and especially with monitoring, you have to think of all of these different things beforehand. And also, at least some of these time series databases that back monitoring systems, they're not built for this high cardinality. No, they're not built for it. Exactly.
Starting point is 00:06:29 And I don't, I'm not talking shit about them. They're built for different use cases that are really super valid. They're built for, you know, counts and aggregates and dials, and storing lots of fine detail in a very compressed and space-sensitive way. Because as it ages out, it aggregates, right? Which is really great when you're trying to plot trends over time or something, but it makes it useless when what you're trying to do is pinpoint, what was that user's behavior like? Yeah, that definitely gets aggregated away. And I've seen that at my
Starting point is 00:07:05 workplace, and I wish I'd had something like Scuba or Honeycomb. It's transformational. Yes. Like in our work, for example, trying to determine whether one IP address is just spamming us a lot. It's just another version of the same problem, right? You end up doing so much guesswork and intuition and just brute force, and it's an open-ended amount of time. When you have a tool like this, it's literally just slice, dice, there it is. So then what's the catch? How is this stuff implemented internally? I can see why monitoring systems don't deal with cardinality well, right? They're not meant for that. How do you implement something like Scuba or Honeycomb
Starting point is 00:07:46 that's different from a monitoring system? Well, part of what needed to happen was hardware had to get a lot cheaper, right? Because the early versions of all this kind of software were written to keep everything in RAM, right? Now SSDs are fast enough. And actually we use S3 to back our files.
Starting point is 00:08:04 But it's also important that, for observability's sake, you can't define indexes or schemas. Because whenever you have a schema, you're predicting again. You're like, these are the only dimensions that I'm ever going to need, right? When instead, you need to be incentivizing people to just throw shit in whenever it occurs to them that it might be useful, and to stop throwing it in whenever they want. You know, it just has to be much more fluid. And anytime that you're dealing with indexes, again, you're predicting in advance which dimensions need to be queryable in a short amount of time, when you don't know what questions you're going to need to ask. So, you know, the solution that we arrived upon was a
Starting point is 00:08:41 columnar store, which is basically just: every dimension is an index, effectively. And it's distributed, so it can grow very elastically. You know, it just gets distributed across partitions. And then, so we do aggregate, but instead of aggregating at write time, we aggregate at read time, which means that we can combine and recombine it as much as we want. Okay. But then how do you prevent, like,
Starting point is 00:09:05 the explosion of data that you're going to get eventually? Why would I want to prevent that? It's great. It's fantastic. Or in terms of, you know, you'll have so much data for like many months. Is there something where like you cut off like, okay, you can't look at data from six months ago or something like that?
Starting point is 00:09:22 You know, the trade-offs that we make for this kind of data storage problem are very different. There's a reason we had to write this database from scratch; it doesn't exist on the market because it makes no sense for almost any other use case. Because where else would you want your data to be fast and mostly right, right? You just don't want those trade-offs.
Starting point is 00:09:47 But, you know, we make them, and it's quite fast. We give all our users, even the free tier, 60 days of storage for free. Amazing. Yeah. And you can store it even longer, you know. Now, the first version of our database was just columnar stores on nodes. This last year, we rewrote significant parts of it so that the queries are actually being run as Lambda jobs, Lambda queries. And a few hours or days afterwards, it ages the data out to S3, which we thought would have a huge performance hit. And it turns out not: it has a different set of performance characteristics than having it all on local SSDs, but it's not overall slower. So as you can imagine, that opens up massive vistas. Yeah, that's pretty surprising and pretty awesome. Just the fact that you get 60 days of data to debug,
Starting point is 00:10:45 that's more than enough for... You know, disks are fucking cheap now. And I feel like this is a thing that most vendors either haven't really woken up to or they're unwilling to let go of the exorbitant prices that they've been charging people. But storage is not, like, this is not a commodity business. You know, we're not selling at Amazon's price plus, you know, because the value in our service is not the storage.
Starting point is 00:11:11 The value is in you, the user, the user experience, the interface, you know, nudging you, like, guiding you, you know. Any computer can detect a spike, but it's about helping humans attach meaning to the data that they're seeing. That's what we're charging for. Yeah. I think the value is pretty clear to me. In terms of the VC, there is a technology risk. Can you write a database that gets this working? But there's no market risk.
Starting point is 00:11:39 Once you have this working, I think people want to buy it. I would certainly want to use it if I had, like, a high-scale startup. Yeah. And the thing is, it's not just if you have a high-scale startup. There are boundaries where this sort of thing becomes: you have to have it or you die. But having something like this from the beginning makes it so that you never have to dig those holes, right? You never have to build a system that is just a hairball the cat coughed up that nobody's ever understood. You're shipping new shit every day that nobody's ever understood, you know? And it's just like, this is why people are afraid to be on call. They don't want to touch it, right? But your systems don't have to be that way. And like,
Starting point is 00:12:19 it is easier and better to have observability from the beginning. It is easier to develop if you can see what you're doing. It's faster to find bugs. You know, you end up just being in this very intimate conversation with your code as it's running in production. And I think that almost everyone who's had this experience can't imagine going back. So yeah, what this enables you to do is worry less about shipping code that might be buggy, because you can easily catch it. Yes, yes. Observability-driven development is kind of what I've been calling it, where as you're writing code, you're instrumenting, you know, with a thought to yourself an hour from now, and you ask: is it doing what I expected? Does anything else look weird? And you will catch upwards of 80% of all bugs right there, if you're just looking at it while it's fresh in your mind. And that brings me to another conversation that we've had recently, around deploying your code
Starting point is 00:13:19 within 15 minutes of it being landed, or being submitted to the code base, or being merged. Yeah. Why do you think most organizations, or many organizations, think it's not possible to do that? Because they've never seen it. And so it becomes a self-fulfilling prophecy. Well, if it was possible, other people would have done it. I would have seen it before.
Starting point is 00:13:41 Right. And I don't want to downplay how difficult it can be to work your way up out of the pit once you're in the pit. It can be hard and scary. Conversely, if you just never fall into the pit, it is way easier. I think I set up our auto-deploy stuff, it was just a bash script, you know, that looks for an artifact and deploys it every 15 minutes or something. And I did that in like week three, right? And so we've just, we've grown up with this.
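(The auto-deploy loop described here could look something like the sketch below. The artifact and deploy commands are hypothetical placeholders, not Honeycomb's actual scripts; this just shows the shape of a poller that ships the newest build on a fixed interval.)

```python
# Sketch of a minimal auto-deploy poller: on a fixed interval, find the
# newest build artifact and deploy it if it differs from what's running.
import subprocess
import time
from typing import Optional

ARTIFACT_CMD = ["./latest_artifact.sh"]  # hypothetical: prints the newest artifact id
DEPLOY_CMD = ["./deploy.sh"]             # hypothetical: deploys the given artifact

def should_deploy(latest: str, deployed: Optional[str]) -> bool:
    """Deploy only when an artifact exists and differs from the live one."""
    return bool(latest) and latest != deployed

def poll_and_deploy(interval_s: int = 900) -> None:
    deployed = None
    while True:
        latest = subprocess.check_output(ARTIFACT_CMD, text=True).strip()
        if should_deploy(latest, deployed):
            subprocess.run(DEPLOY_CMD + [latest], check=True)
            deployed = latest
        time.sleep(interval_s)  # every 15 minutes by default
```

The point is less the mechanics than the interval: merged code becomes the newest artifact, and nothing sits undeployed for longer than one polling cycle.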
Starting point is 00:14:14 And anytime you merge, you just know your code's going to be out in a few minutes, and it's just never gotten to be hard, right? Like, at Honeycomb, for the last couple of years, we've had about nine or ten people writing code for everything from the database, the query planner, the application, the integrations, the security proxy, you know, everything. There's no way that we could have done that if we didn't have that tight coupling between it's merged, it's deployed, and you look at it, right? Because you can see how, you know,
Starting point is 00:14:51 if that interval becomes elongated, all these pathologies just proliferate. People have lost their place, they've paged it out, they've forgotten what they were doing. Somebody else deploys something that has your changes and a bunch of other changes all bundled up, and then spends the rest of the afternoon trying to untangle it and git bisect and figure out whose thing broke it, you know. I've been doing some back-of-the-envelope calculations, and checking them with my intuition to see if this sounds right. I think the number of engineers that it takes you to write and support software, if your code is automatically deployed within 15 minutes, let's call that N, right?
Starting point is 00:15:26 That's the number it inherently takes. If it takes more than that, if it takes hours, I think you need twice as many engineers. And if it takes longer, if it takes days, I think you need twice as many again. And if it takes weeks, I think you need twice as many again. So that's doubling it three times. And past that, I actually have no experience, so I don't know if it's true or not. But that's incredibly costly. It's incredibly costly because, you know, it's the mythical man-month all over. You add people, you're also adding friction.
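(Her back-of-the-envelope headcount claim, written out as a small calculation. The tier names and the doubling factor are simply a restatement of what she says here, not an established formula.)

```python
# If N engineers suffice when merge-to-deploy takes under 15 minutes,
# each slower tier (hours, days, weeks) roughly doubles the headcount.
DEPLOY_TIERS = ["15 minutes", "hours", "days", "weeks"]

def engineers_needed(n_baseline: int, tier: str) -> int:
    """Double the baseline once for each tier past 15-minute deploys."""
    return n_baseline * 2 ** DEPLOY_TIERS.index(tier)

print(engineers_needed(10, "weeks"))  # 80: doubled three times
```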
Starting point is 00:16:01 You're also adding communication. You're adding specialists. You're adding friction. It's amazing to be able to just execute on what must be done with a small, nimble team. And, you know, we've got some experienced engineers, but we very explicitly did not just hire all our friends from Google and Facebook. We brought some intermediate engineers on, and, you know, we have a pretty diverse team. And this is why I feel so strongly that the main factor that defines how quickly you can move is not your knowledge of data structures and algorithms. It's the system that exists around you that supports you or hobbles you or inhibits you
Starting point is 00:16:46 or whatever. You will rise or fall to the level of your ability to execute within the system, within a few weeks of joining it. And people are always like, we have the best team, we hired the best engineers. But they're spending way less time and money than they should, because it is an investment. It's not on your feature list, right? And yet you have to carve out continuous, dedicated time towards maintaining this sociotechnical system that surrounds your engineers: the one that takes their code and delivers it to users, and makes sure that there aren't bugs, or alerts them in a timely way, and doesn't bother more people than it needs to. All of this stuff. The question you asked me was, why is it that
Starting point is 00:17:30 people don't do this? I think it's because it hasn't really sunk in how important it is, and that this isn't just a thing for elites. This is for everyone. It is literally easier to write code this way than any other way. So that's why I'm ranting about it. Yeah. Certainly, the way I think about it is, it goes from being an asynchronous workflow, where you're waiting and checking, has my code been deployed or not, to: you submit your code, you wait, you see that it works fine, and then you move on with your day. So you're not thinking about it in the back of your mind. We spend so much time just waiting on each other.
Starting point is 00:18:11 Yeah, it reduces the amount of communication exactly that you need to do with other people. So you mentioned that it's easier to like avoid going into like a bad place in the first place rather than trying to fix it once it's already really bad. So what are some other things like that a company can do if they're starting off, if they have five or 10 people that, that can help them avoid that problem in the first place? Yeah. I mean, I think that like job number one is just like, pay attention to that interval between when you write the code and when it's
Starting point is 00:18:36 live, and just keep it small. Because, you know, it's at the very beginning of the stream, right? So any growth there is only going to be exponentially multiplied later on in the stream. You can never really recover from it. Having a good CI/CD pipeline, like you were just saying, frees you up to not have to think about so many things, because that's software doing what software does well, so that humans can do what humans do well, right? I think putting all engineers on call is really important when you're small like that. Having it be very egalitarian, you know. When you're small, you've got five developers, there's no excuse for anyone not to know how to deploy their own code, right? Or how
Starting point is 00:19:22 to debug it when something breaks, or, you know, how to follow a request from end to end and just debug it if it's broken. Also, and I know that this can be costly: if you spread people much more thinly, you can cover more ground and move more quickly, you know, if you have only one person who knows each thing. But I think that that's a really false and fragile form of speed. It's like running with RAID 0, right, instead of RAID 5. Like, yeah, it might be a little faster and cheaper, but you're going to pay. It's not a question of if, it's a question of when.
Starting point is 00:20:01 Yeah, that's a great analogy. In terms of on-call, at what point do you think it makes sense to separate the people who are on call from, say, people who are just worried about shipping features? Or does it ever make sense to make that separation? I do think there is an upper bound to the size of an on-call rotation. You really don't want it to be more than like seven people, you know, because otherwise you're going to forget too much of how the system works. You know, it needs to be a regular thing. It shouldn't be too life-impacting, right? I'm a big proponent of all engineers being on call, but also, this is management's job. Like, you have one job: make it not suck, right?
Starting point is 00:20:45 Like people shouldn't dread it. It shouldn't be impacting your life. But yeah, I don't think you can really have an effective rotation that's significantly higher than seven, maybe 10 people if you're doing primary, secondary. So once you're bigger than that, then you got to split into two.
Starting point is 00:21:00 And I think that is around the point that a lot of teams start to have some natural specialization, whether it's the iOS and Android engineers, or the really front-end people start to have a lot of custom assets and Sass and I don't know what else. And there tend to be more back-end and front-end people and whatever. So I think that tends to happen fairly organically. It's like the two-pizza team. Yeah. You don't want it to be too big. And for upper bounds,
Starting point is 00:21:26 is the reason why you think it shouldn't be too big that people just lose context? If you're only on call once every six months, it's going to be a big fucking deal, right? And you're going to forget and have to learn all over again: what's changed since you were last on call? Because every rotation,
Starting point is 00:21:43 it should shift subtly, right? Something should change. Something should be fixed. I also really strongly believe that when you're on call, that's your job. The product managers should factor this into their planning: you should not be expected to get any work done on features. Because that's how you budget in this ongoing care for the system. And I've really, I have seen this done at many places so well that people actually look forward to being on call
Starting point is 00:22:13 because it's a neat break in routine. All you have to do is whatever the hell you want that you think might fix the annoying things that have just been kind of bugging you. You have carte blanche: the system isn't on fire, your tickets are low, then you go do whatever you want, right? And I think that that's liberating. I think that's healthy, because engineers want to do good work. People want to work on their systems. They want to improve them. They just often aren't given the time and the leeway to do so. And so I think it's about making it clear that, you know, on-call time is not project time. It's a nice little breather for the engineer. It's a good reminder for whoever's doing the planning that, you know, you have to bake
Starting point is 00:22:55 some flex into the system. Yeah. Okay. And another point you mentioned is that on-call shouldn't suck, right? So one part of that definitely means you need tools that are good enough to make it easy to debug problems. Are there other parts of on-call not sucking? Yeah. For a 24/7 startup or service, you know, I think it's reasonable to ask people to be woken up once or twice a year for their service. Once or twice a year, but not more.
Starting point is 00:23:30 Once or twice a year; more than that, and it's going to very rapidly become incompatible with many people's lives. And the only exception I would make to that is if you have an infant, if you have a child who is not yet trained to sleep through the night. Then you're not on call. Only one of those at a time, please, right? But, yeah, I think once or twice a year. And as with anything, this is a human system; there will be exceptions. There was one time when I had a guy on my team who was game. Like, he didn't
Starting point is 00:24:04 want to shirk his duty. He wanted to be on call, he wanted to be like the cool kids. But his body just wouldn't let him, you know? He had so much anxiety that he could never go to sleep. And it wasn't because he was getting paged all the time. So he gave it a try, and I was like, let's see if we can make this work. He tried a couple of rounds, and it just wasn't getting better. So we found a different thing for him to do, right? Instead of that, he owned the CI/CD pipeline bugs for half the time. It was something that was equally drudge-worthy, but it wasn't something that would tweak his anxiety.
Starting point is 00:24:35 Like teams, you know, should be pretty understanding of this stuff. As long as everyone wants to pull together and do their part, people are usually pretty generous about this sort of thing. I also, I really like to do primary secondary instead of having one person on call, because I really think it's healthy to like lower the barrier to ask when you have a buddy who, you know, you're used to working with. And in the early days, you could often do this by staggering front end, back end, front end, back end. So you've always got a buddy who knows the part of the stack that you don't, right? And so if you're like, I want to go to a movie for a couple hours, you know, hey, can you take it for the afternoon?
Starting point is 00:25:09 And as a manager, there would be times in the ebb and flow of the team when I would step in. I think it's really healthy for line managers to be in the on-call rotation if possible. If not possible, I think they should very vocally offer themselves up as the alternate of first resort, right? If you got paged and you need to sleep tomorrow night, I'll take the pager, right? I would proactively offer. And if somebody had to go out of town for the weekend, I would offer, you know. And I would repeatedly make it clear that I wanted to be on call at least
Starting point is 00:25:54 a couple of times a month. Because the goal is for it not to feel like a trap, right? The goal is for it not to feel like a tether that impacts your life. It's just a responsibility, right? You carry the pager and your laptop around for a week. And if it's inconvenient to do that, then you let one of the people who's happy to take it take over; it's not that big of a deal. Yeah. Another thing that you mentioned was having at most seven people on the rotation. And I find it a little ironic, and it makes sense, that the Google SRE book says, oh, you should have at least eight people. And I say that as someone who's never been on call in a team so huge that it could have eight people. Well, Google SRE, they're talking about their SLA. They have a two-or-three-minute response time, and an outage will make the cover of the New York Times. That is the peak of maximal pressure, right? We should all be pretty realistic about how much it matters and how quickly. So for most of us, five minutes, 10 minutes, not that big a deal. Even 30 minutes, okay, text somebody, just like,
Starting point is 00:27:09 I'm in a bind, I can't get there, can you take it? And they'll get it. It's fine. We've got lives. The system should not be going up and down so much that this is a thing, right? It should be pretty rare that you get alerted outside of hours. Yeah, what I'm annoyed by is the fact that since there's this one company that's written this book, it's often treated as, like, gospel. Yeah, it's ridiculous. So are there
Starting point is 00:27:29 any other things from that book that you can think of, just off the top of your head, which don't really apply to regular, non-Google companies? I'm going to confess that I haven't actually read it. Okay. Yeah, it's not too bad. I feel like I know everything that's in it just from the conversations, but I've never read it. I've been having people tell me what Google does all my life; I don't feel like I need to read it. Yeah, for sure. So yeah, I think that's one of those things. It also says try to have a European counterpart. Oh yeah, yeah, follow the sun. Yeah, in my dreams, I would love to have a London chapter,
Starting point is 00:28:07 you know? Okay. Yeah, I wish there was some kind of resource for... I was thinking of writing a blog about this, like healthy on-call rotations for companies that don't have
Starting point is 00:28:18 money printing machines. I love that title. The last blog post I wrote about on-call was... It was on-call for managers. Like here are my expectations. This is what it means to have a rotation that doesn't suck. And here's how to do it.
Starting point is 00:28:31 I love that. On call for companies without money printing machines. That's pretty much it. Yeah. But another point that reminded me of was, you know, managers should go on call as well. I've rarely ever seen that happen.
Starting point is 00:28:47 Yeah. Yeah. That's unfortunate. And it's unfortunate, kind of, but there's such a wide spectrum of situations here, right? I always insisted on being in the on-call rotation until there was a point, this was at Facebook, when I realized it was hurting my team for me to be on call. Because I was in meetings, and I'd be halfway across the campus, you know, in some meeting, and I couldn't make it back to my laptop in time. So I would be pinging someone back at the desk, going, oh, could you cover for me? And so I was just bothering them when they weren't on call. And that's when I realized I shouldn't do this
Starting point is 00:29:26 anymore. And I will instead become like the first pinch hitter of last resort. But I think this is fairly common when you've got a manager who kind of grew up from the inside and then transitioned locally. And it's very rare when you hire in a manager from the outside, or when they're more of a professional managerial class. But, you know, the leaders that excite me the most tend to be the very hands-on sort of pendulum ones who go back and forth every couple of few years, and who make it a point of pride to stay sharp and to stay embedded in their team's experience. Yeah, I think your point on meetings is a good one. I thought it was a great idea to try convincing my manager after this, but maybe I'm
Starting point is 00:30:10 just going to hold back on that looking at his calendar. Yeah, you know, I think that you can kind of split the baby by asking them to be on call during non-business hours. And I've even done this with other people who really struggled with the overnight aspects. They would be on call during the days and I would just take the nights, right? Because it grounds them in reality in a way that I think is really healthy. While also, you know, at bigger companies like Dropbox, yeah, he's just going to have a hard time doing it during the day. Yeah. Another one,
Starting point is 00:30:58 one more thing, because there are so many topics I could go on about. Like, why do you think... observability is definitely catching on. Facebook had a tool like Scuba because somebody thought it was a good idea. But I haven't seen, maybe, you know,
Starting point is 00:31:12 like smaller companies, like, I have friends at all of these Silicon Valley mid-sized companies. But things like Scuba or like Honeycomb,
Starting point is 00:31:22 like observability tools, they haven't caught on as much as, you know, building a monitoring tool. So why do you think that is, and do you think that's going to change? Well, we have 30 years of writing monitoring tools, you know, like Big Brother; the metric was born in, like, the '80s. We've just had a long time, and there's some path dependency here, right? Like we just figured out the metrics thing. Then we built this enormous heritage of time series databases.
Starting point is 00:31:50 And for a long time, I think that paradigm worked pretty well because most of the variable complexity was bound up inside the application. And you had the app and the database and the network. And it was never very hard to figure out which component was having a problem. And if it was in the app, you just have to attach a debugger or do something complicated there. But everything kind of started to shift when microservices came around, right?
Starting point is 00:32:22 Because now so much of the complexity inside the application is hopping the network. It has been exposed to the operational side of things. And the hardest part of the problem is now: where in my system is this problem coming from? And now we've got polyglot persistence. We've got half a dozen different kinds of databases, and they're all sharded, you know? And it used to be that if your uptime was 99.1%, let's say, then 0.9% of the time somebody was having a bad experience, but it was pretty evenly distributed, right? Now it's probably more likely that, you know,
Starting point is 00:32:55 everybody whose last name starts with J-I-L thinks you're 100% down, because that shard is down, right? But everyone else is fine. It's likely to be localized and extreme, because we've done all these things for resiliency's sake, to just partition and shard and distribute everything. So this is just kind of the natural part of the trade-off that we've made by embracing more complexity for the sake of resiliency. And so observability is only really four years old. I would say that's when I came up with what I felt was a reasonable technical definition for: what technical things do you need to ask these kinds of questions and support these kinds of things? So I think that the excitement... I mean, yes, there's been a lot of marketing fuzz in the material. Literally everybody from like five adjoining industries is like, we're doing observability too now.
Starting point is 00:33:55 Which, if I take the long view, I think this is a good thing. Because I do think that they're adopting the marketing stuff, the language faster than the features. But I think that they are all scrambling on the back end to implement those features. And I think that a year or two from now maybe, I mean, these are not trivial migrations to undertake. But I do think that the clamor from the community has been so great that they want these, they need these.
Starting point is 00:34:22 I do think that many more companies are going to make that leap to observability tooling over the next couple of years. Because monitoring is really only the right tool for infrastructure. It's the right tool for the software that you have to run in order to support the software that is your core differentiator, right? Because the stuff that is not your code, think about it: you upgrade it on the timeline of, like, dist-upgrade or something, a couple of times a year, maybe, right? It's a black box to you, and you care about it in terms of its capacity, trends over time, the disk space. You care about it,
Starting point is 00:35:00 but you care about it in aggregate. That's what monitoring is for. That's what metrics are for, right? So to the extent you have infrastructure, you need it. The difference is that now, because of third-party platforms and everything, a greater percentage of each company's engineering team is actually devoted to its differentiators, like its code, the code that you write every day, the code where it's your responsibility what your user's experience is, right? And so like,
Starting point is 00:35:34 that is where I think observability becomes really necessary. And for the longest time, people were just told that it was impossible, that it was just an impossible problem. You know, you couldn't have high cardinality. And now their bluff has been called on that, and it's clear that it's not impossible. I think it's just a race to implement. I think that makes sense. So software complexity is basically increasing a lot.
Starting point is 00:36:01 And the second part is that, yeah, monitoring is more for infrastructure, things like making sure that your service latency is normal or your disk isn't filling up and you probably want to have an alert on that. But trying to debug why is this user having a particular problem? That's a problem. That's something that people need to find out. And there is a new class of tools required. And I think then it's also reflected in the demand or at least the hype around things like service meshes and like Istio and all that. Because people want more from the existing tools that they have.
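That difference between infrastructure-style aggregates and per-user questions can be illustrated with a toy calculation (all numbers hypothetical): the same global error rate can mean everyone is slightly degraded, or that one shard's users are completely down, and only a high-cardinality, per-shard breakdown distinguishes the two.

```python
NUM_SHARDS = 20
REQUESTS_PER_SHARD = 1000

# Scenario A: failures spread evenly across every shard (5% everywhere).
even = {shard: 50 for shard in range(NUM_SHARDS)}

# Scenario B: one shard is completely down; everyone else is fine.
localized = {shard: 0 for shard in range(NUM_SHARDS)}
localized[7] = REQUESTS_PER_SHARD

def aggregate_error_rate(failures_by_shard):
    """The only number a global, pre-aggregated metric would show."""
    return sum(failures_by_shard.values()) / (NUM_SHARDS * REQUESTS_PER_SHARD)

# Both scenarios produce the identical global metric...
assert aggregate_error_rate(even) == aggregate_error_rate(localized) == 0.05

# ...but breaking down by shard immediately exposes scenario B.
worst_shard = max(localized, key=localized.get)
print(worst_shard, localized[worst_shard] / REQUESTS_PER_SHARD)  # shard 7 is 100% down
```

A monitoring dashboard reports the 5% line in both cases; an observability-style query grouped by shard (or last-name prefix, or any other field) is what separates them.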
Starting point is 00:36:39 So maybe a basic question for you. So with microservices, there's also tracing, right, like distributed tracing. So how does something like Honeycomb and tracing differ? Oh yeah, I see tracing as an absolutely necessary component of observability, but it's just taking those same events and visualizing them by time, right? We didn't actually build in tracing in the early days of Honeycomb. We just built in the slicing and dicing. And then one day, one of our engineers was like, well,
Starting point is 00:37:09 if we just added an ID and propagated it, then we could... And it was like, oh yeah, you're right. We could. And so we did. And I understand why people treat it like a different thing, but it frustrates me when I see people shelling out to store their data for tracing yet again, right? So they're paying to store their data as logs, they're paying to store their data as traces, they're paying to store their data as metrics. And it's just like, how many times do you want to store this, right? Because the thing is that you can get all three of those data types from the arbitrarily wide structured data blobs
Starting point is 00:37:47 and you can't go in reverse. You can't go to the structured data blob from the metrics or the logs or the traces. So I think it's an artifact of us being sort of midway through this transformation. Ultimately, you should have one source of truth from which you derive your dashboards and your traces and do all of your exploration, but it should really just
Starting point is 00:38:08 feel like two sides of the same coin. You're just like, slice and dice, visualize over time, back and forth. What observability lets you do, too, is go from very high level, you know, there's a spike, right? Down to very low level: which requests were different from the others, and in which ways? What do these errors have in common that is different from all the other requests around them? And then, when you include traces, what that means is you can see your error spike and then say, trace one of those for me, right?
Starting point is 00:38:44 Or trace the median one of those for me. And then when you find the place in the trace that has the problem, you can kind of zoom back out and go, what else is impacted by this, right? And that sort of back and forth, down and up, in and out, really, to me, defines the experience of understanding your systems with observability. Yeah, that sounds like an awesome debugging experience. Figure out from an observability tool, okay, this app ID, for example, has a lot of errors. Drill down to that, find one request ID that has those errors, follow it through the system. Yeah.
Starting point is 00:39:19 Okay, it looks like this particular database. Yeah. So yeah, that does sound like something I want to implement at some point. Yeah, once you've used it, it's impossible to live without it again. Yeah. And I think there is that trend of software companies just getting smaller and smaller and caring less about, you know, their infrastructure. And I guess they should, right? People should care only about what differentiator they're shipping to their customers.
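A rough sketch of that drill-down loop, using an invented event shape (these field names are illustrative, not Honeycomb's actual schema): the same wide events are sliced into a metric, a failing exemplar is picked, and its trace view is derived from the same store.

```python
# Hypothetical wide events: one per span, with trace/span IDs alongside
# arbitrary high-cardinality fields (app_id, user_id, shard, ...).
events = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "api",
     "app_id": "app-7", "status": 500, "duration_ms": 210},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "db",
     "app_id": "app-7", "status": 500, "duration_ms": 180},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "service": "api",
     "app_id": "app-1", "status": 200, "duration_ms": 35},
]

# "Metric" view: slice the same events into an aggregate (error rate of requests).
root_spans = [e for e in events if e["parent_id"] is None]
error_rate = sum(e["status"] >= 500 for e in root_spans) / len(root_spans)

# Drill down: pick one failing exemplar request...
exemplar = next(e for e in root_spans if e["status"] >= 500)

# ..."trace" view: the same events grouped by trace_id, linked via parent_id.
trace = [e for e in events if e["trace_id"] == exemplar["trace_id"]]
slowest = max(trace, key=lambda e: e["duration_ms"])

print(error_rate, exemplar["trace_id"], slowest["service"])
```

The point of the sketch is the direction of derivation: the metric, the exemplar, and the trace all come from one event store, while a pre-aggregated metric alone could never recover the trace or the app_id.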
Starting point is 00:39:50 It's a good thing. And for people like me that come from ops, we shouldn't find this threatening at all. There's never going to be a lack of jobs for our skill set, I promise. But there is a specialization that's going on, and there's a drift. And if you want to work on infrastructure, you should probably work on an infrastructure tool, or
Starting point is 00:40:11 on a product whose differentiator is that it does infrastructure better than other people can, right? We're doing infrastructure so that other people don't have to run their own observability tools. So it will always be a differentiator for us. Yeah. Someone was just asking me recently, like, should I switch into SRE? Will this job exist? And it will probably exist, but it will exist in maybe a different form than it does today.
Starting point is 00:40:39 Like maybe you won't be embedded in it. Yeah. So I think there's kind of a split, right? We've always kind of jammed together infrastructure, back-end engineers, and ops people into this one role. It's like you're an optimizer of socio-technical systems, often by scrutinizing the release engineering path, the deploy path, the CI/CD path, and just looking for ways to optimize it, ways to toolsmith it. So I think in the future, if you don't want to do infrastructure, almost every engineering team that has more than five people is going to want someone whose job it is to make sure that they're all performing
Starting point is 00:41:35 at their peak ability. Because the difference between systems that are working well and well-maintained and the ones that are just neglected, that's what we were talking about in the beginning, right? You need literally three times as many engineers, and it's just not cost effective, and it's frustrating and slow. And SREs have the systems knowledge and experience to be incredible force multipliers there. So the split is basically people who work on features that end users see, and then the people who work on the internal developer platform, in a sense. Your focus is not maybe just making systems more reliable, but you're force multiplying all the other engineers by making sure they don't have to worry about all of these other aspects. It's still kind of about
Starting point is 00:42:30 reliability. So I've always disliked the term SRE, because I don't just make shit reliable, I build systems, right? There is a reliability angle to it, though, just in that it's about reliably shipping your code to users and detecting problems and alerting the right people. So, you know, it still fits in the same tent, I guess. But that also enables your other engineers to move faster, because they can find out about issues faster. Radically faster. Yeah. And it's not just about moving faster either.
Starting point is 00:42:56 It's about living a better life. And when I talk about how teams are wasting 50% of their day and everything, I want to be clear that I'm not advocating people working harder and longer, filling every minute of the day with productivity. It's the opposite of that. You can really only do maybe four hours a day of really intense cognitive labor.
Starting point is 00:43:27 That's just all you've got in you. But let's free you up to fucking do that, and not spend all your time waiting on people and start-stop context switching. If you can just free people up to focus and produce, then they can go home at three or four and live their lives. Butts in seats are not an important metric to me, right? But making sure the engineer's time is not wasted and frittered away is my focus here. Yeah.
Starting point is 00:43:57 And one more thing I think I read from one of your tweets: people who ship more, or can ship more efficiently, are also happier. They're so much happier. Producing more is not what burns engineers out; it's producing less. It's being, you know, tied down. It's being frustrated.
Starting point is 00:44:17 It's being, you know, spread too thin. It's never seeing your work actually meet your users, right? It's foundering that burns people out. And then you combine that with like a tight deadline and you haven't invested enough in making your developers really productive. And then you're pushing them. You're just like, go faster, go faster.
Starting point is 00:44:36 But they're strapped in, man. They can't go faster than your system will let them. Yeah. And that gives a lot of food for thought. You want to make your engineers as effective as possible through some kind of investment in the developer platform. How do you know your investment is enough? As you said, there have to be some guardrail metrics.
Starting point is 00:45:07 Their code should be deployed within 15 minutes. What else? How do I know that my engineering organization is healthy? I think the right place to start is with the DORA metrics, the four DORA metrics that Jez and Gene and them wrote the whole book Accelerate about, right? Which I hope everyone has read. The four key metrics are: how often do you deploy? How long does it take before your code
Starting point is 00:45:38 is live? How many deploys fail? And how long does it take to recover? Right? And I really think that any team could just start by measuring those things. First of all, whenever you measure something, it tends to get better, right? But just knowing where you stand, you know, you can plot yourself; you can see where you measure up next to the other teams that they surveyed.
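A minimal sketch of computing those four DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore) from a deploy log; the record format here is invented for illustration, not taken from any real tool.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deploy records: when the change merged, when it went live,
# whether it failed in production, and when service was restored.
deploys = [
    {"merged": datetime(2021, 1, 4, 10, 0), "live": datetime(2021, 1, 4, 10, 12),
     "failed": False, "restored": None},
    {"merged": datetime(2021, 1, 5, 9, 30), "live": datetime(2021, 1, 5, 9, 41),
     "failed": True, "restored": datetime(2021, 1, 5, 10, 11)},
    {"merged": datetime(2021, 1, 6, 14, 0), "live": datetime(2021, 1, 6, 14, 15),
     "failed": False, "restored": None},
]

days_observed = 3
deploy_frequency = len(deploys) / days_observed              # deploys per day
lead_time = median(d["live"] - d["merged"] for d in deploys)  # merge -> live
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
time_to_restore = median(d["restored"] - d["live"]
                         for d in deploys if d["failed"])

print(deploy_frequency, lead_time, change_failure_rate, time_to_restore)
```

Even this crude version gives a team a baseline to plot against the surveyed cohorts, which is the "just start by measuring" point being made above.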
Starting point is 00:45:56 And that's really motivating. And as you make those metrics better, your team begins devoting more of its time to more productive things. Now, that's not the whole story. But I think that most people who are just starting out here, that's a really good place to start. Yeah, yeah.
Starting point is 00:46:13 So that makes sense. You take this set of metrics, which you know the industry has done research on and figured out. Yes, exactly; there's been experimental evaluation. And you try to compare your organization; then you can make a case to somebody to say, you know what, we should make these metrics better. Yeah, I feel like right now we're in this interesting spot. I straight up see this as a failure of leadership.
Starting point is 00:46:41 This is a failure on management's part and leadership's part, because by and large, engineers are already sold on this, and they're dying to spend time working on this, and they're not being allowed to do so. And I think that we're in this unfortunate sort of valley here where, yeah, every company is to some extent an engineering organization, right? But the leadership at the top is generally not made of engineers, right? And there's some packet loss somewhere between the engineers and the engineering managers, who generally know and can make a case for this, and the directors or VPs. Somewhere in there, this message is getting lost: this is how you make it better. This is how you make your customers happier. And I get that it's a very abstract technical argument, but it's not particularly controversial or difficult. I really think that, while I blame the leadership for this, I also am just someone who feels like, rather than complaining about what we don't
Starting point is 00:47:51 have, we should do what we can with what we do have, right? I think that, as engineering managers and leaders, we need to start being much more vocal about this, being much louder about it; try different ways of putting it in front of different people's eyes, make the case in a few different ways. Try converting it into people-years and people-hours, dollars. Anything that you're just talking about in abstract engineering terms is not going to resonate with them. But if you convert it, even in a very messy back-of-the-envelope way, to dollars, that'll probably get their attention. If nothing else, think about retaining your best employees, because they're increasingly chafing at the idea of working at teams that are not
Starting point is 00:48:28 really investing in this stuff, because they know how much of their life is being wasted. So my hope is that over the next few years, we will mature more as these organizations that maybe never understood that they were starting an engineering organization. But now they've got one, right? And they've got to make them happy if they want to compete, and this is how you compete. Yeah. And I'm also thinking there could be a case where people have just been in such a bad situation for so long, they forget that it's bad. There's so much learned helplessness. And there's also this attitude that makes me really sad, where you talk to people sometimes and they're like,
Starting point is 00:49:10 yeah, I know, but that's for Silicon Valley companies. You know, they're just like, we don't get nice things. You know, that's not for us. We're not good enough. And you just want to be like, dude, the engineers here are not better than the engineers there. They are privileged enough to work in better systems by and large. Sometimes, sometimes not.
Starting point is 00:49:30 Like some of the shittiest work that I've ever seen has also been in Silicon Valley. This is not an elitism thing. This is a very accessible, realistic thing: anyone who's capable of shipping code is capable of making their systems better. Yeah. And it's also sad that leadership, like, there is that packet loss, as you mentioned, but I wonder how much of leadership is just unaware that this is best practice, you know? Yeah, I think that we as engineers, and we as people, we always ascribe so much more knowledge and intention than actually exists, you know? What's the quote? Like, nine tenths of the time, what
Starting point is 00:50:14 could be, what you think is malice can be explained by ignorance, or something like that. They just don't know. And we'll be like, but we told them. And it's like, well, yeah, two years ago you mentioned it once during lunch. But telling something to upper management requires a campaign, a consistent campaign of multiple people coordinating your messaging, right? Making sure that all of the key people are in. You have to think about it like changing the trajectory of an ocean liner, right? It can be done, but it takes some planning and it takes some coordination, and just mentioning it once does not count. Yeah. And if you do that coordination and you try it and it doesn't work, you should leave. Yeah, then
Starting point is 00:51:02 you should definitely vote with your feet. Yes, there's only so much you can do, and you shouldn't reward people who refuse to change. They're hurting their own people, and you shouldn't reward them with your labor and your presence, which is very valuable. And a flip side of this whole discussion is that there's more and more space for developer tools startups to grow, because there are so many things that could be improved. So many. And for sure, not every company should be building its own Honeycomb. No, almost nobody should be building their own. It should be seen as a failure whenever a company decides to in-house and build some other tool that is not
Starting point is 00:51:46 you know, what they do for a living. It's sometimes inevitable, but much less often than people think. So I want to ask about your experience going from starting a company to... I think I saw the Honeycomb team page; it has more than 60 people now. I don't know how up to date that is, but that's a pretty big team. Yeah, we just basically doubled the size of engineering, and we've gone from zero designers to seven designers. We're really making a big bet on design over the next year or two, which I'm pretty excited about. It's a little strange, but yeah. Yeah, that's seven more designers than on most developer tools, right? I know, right? Yeah, no, I'm all in. In fact, I've kind of had a come-to-Jesus moment
Starting point is 00:52:37 here, which is just that there are things where we've built this feature, and built it again, and we've engineered the crap out of it. And why aren't people using it? And now I'm like, because that's not an engineering problem, is it? It's a design problem. Is there anything specific you can talk about there? Because I'd love to know, because I would think that, you know, Honeycomb was built by engineers for engineers.
Starting point is 00:53:01 Like, what was the design problem there? Oh my goodness. Oh, so many. So for example, ever since the beginning, I've always seen this as something that should be a pretty intensely collaborative and social experience, I think.
Starting point is 00:53:20 Because when you're debugging part of your complex system, you know your corner of the system intimately, like deeply, very well, but you don't know the rest of the system that intimately, right? But when you're debugging a problem, you need to be able to see the entire span. And so you should be able to lean on your coworkers and their intimate knowledge of their parts of the system when trying to ask questions. So, like, you know, you have your full query history in Honeycomb.
Starting point is 00:53:48 And you also have access to, you know, your team's query history. And I feel like, if you get paged in the middle of the night and it's like, oh, this looks like a MySQL problem, and I know fuck all about MySQL, but I know that the experts on the team are Emily and Ben, and I feel like we had an outage like this last Thanksgiving, and I think Ben was on call, I'm going to go look at what he did.
Starting point is 00:54:08 Like, what questions did he ask? How did he interact with the system? What was useful enough that it got run 50 times? What got attached to a postmortem document, or tagged? Just leaning on each other, looking for ways to bring everyone up to the level of the best debugger in every corner of the system. Because then Ben could go out on vacation or on his honeymoon, and the remnants of how he interacted with the stuff he was building are still there. So if we have a problem with it while he's out,
Starting point is 00:54:41 we can just go see, what would Ben do, right? Social and collaborative stuff like that. Because honestly, query builders are intensely challenging, off-putting. For most people in the world, especially while they're under time pressure or an outage or whatever, the last thing they want to do is try and compose a new query from scratch.
Starting point is 00:55:05 It's just really hard, right? You have to switch from your flow brain to, okay, let me understand this tool, which is bad. And what we did at Facebook was we would pass around these shortcut URLs, just like notepads. Any time we had an outage, I'd be like, oh, that's an interesting query, I'm going to yoink it, add it with a couple of comments, you know, maybe 'Thanksgiving outage', 'sharding', whatever. And then when I'm trying to debug something, the first thing
Starting point is 00:55:34 I do is not build a query; it's go to my notepad and go, oh yeah, it was the indexing job, and then paste that in and then tweak it. Anyone can tweak a query, right? It's very easy to tweak it once it's there. But finding your way through this massive system to that area, that requires shortcuts, and leaning on the social part of our brain, which is much easier than trying to rely on the computing, quantitative part of our brain, which is very expensive. Yeah, that makes sense. And I think the way we work around that is to just have that custom dashboard, and you keep on adding stuff. Because I can totally see, like, you know, I'm not sure whether I built the
Starting point is 00:56:15 right query, am I passing in the right task parameter or something, and what have I missed? So you just make it easy to share queries that, you know, your coworkers made. So that makes a lot of sense, and I never thought about that use case. Yeah. And just incentivizing people to add text, like, to describe what they're looking at or what they did. Or, like, maybe look at your past history over the past couple of days and then just, like, yoink, yoink, yoink, apply a set of tags. Like, this was useful to me, right?
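The shared, taggable query history being described might be modeled roughly like this; the field names and the pseudo-query string are made up for illustration, not Honeycomb's actual data model.

```python
from collections import defaultdict

# Each (user, query) pair accumulates a run count plus optional tags/comments.
history = defaultdict(lambda: {"runs": 0, "tags": set(), "comment": ""})

def record(user, query, tags=(), comment=""):
    entry = history[(user, query)]
    entry["runs"] += 1
    entry["tags"].update(tags)
    if comment:
        entry["comment"] = comment

# Ben debugs the Thanksgiving outage and tags the query that helped...
record("ben", "GROUP BY shard WHERE service = 'mysql'",
       tags={"thanksgiving-outage"}, comment="slow shard during the outage")
# ...and keeps re-running it during the incident.
for _ in range(49):
    record("ben", "GROUP BY shard WHERE service = 'mysql'")

def what_would_ben_do(user, min_runs=50):
    """Heavily re-run or tagged queries: a cheap proxy for 'useful'."""
    return [(query, entry) for (u, query), entry in history.items()
            if u == user and (entry["runs"] >= min_runs or entry["tags"])]

for query, entry in what_would_ben_do("ben"):
    print(query, entry["runs"], sorted(entry["tags"]), entry["comment"])
```

The design choice mirrors the conversation: run counts and tags fall out of normal debugging activity, so the "playbook" builds itself instead of asking anyone to write documentation up front.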
Starting point is 00:56:45 Like out of all the things that I did while I was debugging and interacting, these were the useful ones. Let me just put those in my collection with a few comments. Because you've got this rich historical memory in your brain, but it's indexed with little scraps, just like this word or this tag or this thing, you know? And once you bring that out of people's heads
Starting point is 00:57:04 and you put it in a tool, you're democratizing that information. We really want to just become like this outsourced brain, right? In the same way that I don't memorize phone numbers anymore; I know where it is and I can look it up much more quickly, right? That's kind of what we want Honeycomb to be like for engineering teams. Yeah, so you make it really easy to create a debugging playbook on the fly. Yeah. And now I'm thinking of more ideas, like, what if you can automatically connect it to a PagerDuty incident? Oh yeah, yeah, totally, all that stuff. And, like, you know, we
Starting point is 00:57:41 have markers where you can overlay that, across this time period, maybe the entire system was impacted. You just put a span with the comment, and when you're querying, that should show up too. I have a bash snippet so that whenever I run sudo, it draws a line saying this command was run as root on the system at this time. Yeah, there are just so many cool things that you can do visually. And these are super design problems, right? They're engineering problems too,
Starting point is 00:58:13 but like the, how to make this feel intuitive to other people who are not you. Oh my God, this is deep design. Yeah, that makes sense. Like, how do you know whether the query made by like a coworker made sense or they were just like playing with something? Yeah, exactly yeah it's actually a pretty hard problem and you have to think quite a lot about and it's not really an engineering problem it's not an engineering
Starting point is 00:58:31 problem at all yeah yeah but how what was it like going from you know you had this idea that i should start this company with your co-founder and did you like go validate with like customers first? Did you just start building? None of that. I was just like, I don't know, in retrospect, I'm like, that seems very dumb and arrogant, but like, I just, honestly, I was just so sure that we were going to fail that I was just like, you know, that didn't bother me. After a couple of years at Facebook, I really kind of wanted to just sit in a corner and write Go code for a year or two.
Starting point is 00:59:08 And, you know, people were offering me some money and I was like, cool, I don't want to live without this tool. We'll take the money, we'll build it and then we'll fail and I'll open source it and I won't have to live without this tool. Right. Like, perfect plan. We just keep not failing. Yeah, that's a great story i don't know like
Starting point is 00:59:26 if you would give that advice to go that no it's terrible advice no you should definitely start by validating your customer and everything so so i'm guessing at some point you built out like an mvp type thing and then you started showing it to other people and what was the whole story like for the first year year and a half we couldn't do much but write the database. And I don't, because I was never one of those kids who's like, I'm going to start a company. You know, I really kind of despise the founder industrial complex. But, you know, I just kind of accidentally became a founder, you know. But there's a reason that nobody had done this before this, because you have to start with a custom database. And that means that you don't even get to start, you know, building a product that anybody can say anything about for a while. And, you know, our investors are getting all, you know, they're like,
Starting point is 01:00:14 can't you just use something off the shelf until you've validated the mark? And I was just like, no, I can't. Because if I do that, it will look and feel exactly like every other tool out there. And this has to be radically different in these ways or there's no point. Like it's already been reinvented like too many times. And now I understand that they were right in a way, but also I was right in a way. And we're honestly just very lucky that we managed to survive long enough to like see it all pay off. There were several points in the first four years where it
Starting point is 01:00:46 looked very dicey. And I was like, well, now it's over. But we did get pretty lucky. And we had some investors who really stuck with us through thick and thin. And now it's taking off. And everybody's like, oh, this is obvious. And it's like, it was not obvious. It was not obvious that this was a wise decision. And in fact, I made a secret series of very poor decisions. And I'm just lucky that we're still around. Yeah, for like the minimum lovable product for something like Honeycomb, like the latency has to be like low enough for somebody. The 90th percentile for our queries is like under a second. That's amazing. Yeah, because without that, you would not get impressed. Without that, it has to be a state of flow, right?
Starting point is 01:01:28 You have to be debugging and just like asking questions iteratively without stopping and waiting for your system to like compute, right? It's absolutely key. And this is something that I always forget is a differentiator for us because it seems so obvious when it's fast,
Starting point is 01:01:42 when it's easy. But, you know, and then I go back and use any other tool out there where like you it's just running in minutes behind you know and and you're just with honeycomb it's like as soon as you ship the event it shows up you know and and like we get alerted if if there's more than like you know five or six seconds delay and it never alerts because it's never that delayed you know it's just always always there um and i forget how unusual that is because it it's just one of those things that it's invisible when it works right um yeah and a few questions and if you don't if you're if you don't want to share like it's totally fine i'm curious about how do you shard Honeycomb for when you deploy it as a service?
Starting point is 01:02:25 Like do you shard it per customer or something or yeah? Yeah, when we provision a customer, so we've got a bunch of pairs of nodes, right? With the same, so when data, when it talk to the API, right? You send some events through the API. API is a very thin, reliable, all it does is accept and apply some filters or whatever, protect the service, and drops it into Kafka. And it looks up which Kafka partitions to drop it into based on the user ID. Each topic partition has a pair of nodes consuming from it.
Starting point is 01:03:07 So one can go down, and that's fine. And then a few hours later, it gets aged out to S3. So basically, when a new user gets deployed or provisioned, I think it provisions it automatically across two or three shards. And if we want to, you know, raise that, we just say, sharded across six or 10 shards or whatever. And then it starts, the data is all immutable. You know,
Starting point is 01:03:35 it just drops it on an immutable column store. So, you know, we just write it across more from there on out. And then the, the read aggregator does all the math of like, you know, well, past this date, it was only across two shards. After this date, it's across five shards or whatever. And so it all looks natural for the users. But yeah, it's dead simple.
Starting point is 01:03:55 It's dead simple. It's not even fair to call it a database. It's really a storage engine. Interesting. And yeah, and you said that most of the work is done at like read aggregation time rather than write aggregation time. Yeah, it fans out and reads just as a column scan filters and aggregates the data and returns it to the user. Have you had any problems with tail latency at all if it fans out super wide or not really? So, you know, this is where, you know,
Starting point is 01:04:29 one of our company values is, you know, that fast and close to write is better than, you know, slow and perfect because this is literally what we do in the storage engine. You know, if one node isn't, you know, if you're fanning out to five and one isn't responding, after a couple of seconds, we cut it off and there'll just be a little like 20% of the results are missing from this, right? Because it's way faster to us to have that speed than
Starting point is 01:04:49 perfect accuracy, because lots of people are sampling anyway. Yeah, yeah. People might not even care about this if they're just like trying to see. Yeah. That makes a lot of sense. And it's actually pretty simple. It's much simpler than... Super simple. Yeah, I thought it would be. I'm a big fan of simple. Yeah, for sure. Yeah, complex systems get hard to debug no matter like how many nice tools you have right
Starting point is 01:05:11 and and if you can't understand it then there isn't really any point cool and so so so you started off building this for like a year you said just working on this storage system yeah yeah and and like towards the end of the first year we were starting to put a super basic you know query query or on top of it um and i think it was year and a half or so in that we got our first free user um you know we we um strong-armed one of our friends into using us um which is Nylas. Bless their hearts. We love them. They're good friends of ours. They were right around the corner from us at the time and they were using Logly and we were like, honeycomb? Yeah. It took less than two years
Starting point is 01:06:00 to get our first customer, but not substantially less. So it's a long time in startup years. Yeah. And then did it just like grow from word of mouth or something? Yeah, to this day, like almost every, almost all of our leads are inbounds. You know, people have been following us on Twitter for a while or people are word of mouth. Another source of opportunities for us is is actually when
Starting point is 01:06:26 people leave their jobs and go to other places and they tend to bring us with them that's like one of the first things that they do because you know that useful yeah yeah it's it's definitely something that you know i think that the reason that we survived and still exist is on the strength of our users um raving about us to potential investors, like the strength of their recommendation. And it's just that they said, they've had the same experience I did, which is, I don't want to live without this ever again. And are there any tools like that, that you miss or tools that you think should exist, but nobody's building them or you haven't seen a good one yet.
Starting point is 01:07:05 You know, I haven't really seen a good canary yet for, you know, a lot of the complexity around, it makes no sense that so many companies still have to write their own fucking deploy tools, you know, like Capistrano can fuck off and die. But like, it feels like something around just like progressive deployments, canaries, maybe almost extending tests to Encompass, writing an expected, like after this version is deployed,
Starting point is 01:07:41 I expect this end-to-end request URL to complete or go green or something, right? Like that sort of a thing, like feels like I would love to see a tool like that. Yeah. Maybe the issue is that like, it might have to be a little customized for every single. I think that's probably what it is. I think it's that there's like, there's a threshold of standardization and then below that it's like custom, custom, custom, custom. And it gets very hard to impose discipline across a wild, wild west that's been growing in so many different ways for so long.
Starting point is 01:08:17 It probably won't change until some other layers get sort of standardized down there, and then we'll be able to standardize on top of that. Yeah. I've thought about another another like ci cd tool but then so many people are just stuck on jenkins and trying to migrate off jenkins is so much work for everyone yeah the hard part is yeah yeah but honeycomb it seems like you can have pretty small like libraries across like multiple languages. Yeah. Yeah, the stuff that libraries do is literally just, you know, the way I often describe it to people is like,
Starting point is 01:08:51 you used to be able to S-trace your process, right? You could just attach a trace and watch it hop through. And having all that context around the request is really valuable. Now, like, as soon as your request is hopping from service to service, you're losing all of that context every time you hop, right? And so the work of the observability, you know, client libraries is to package up all of that context, you know, all the parameters that
Starting point is 01:09:16 were passed in, all of the environment stuff, all the language parameters, all of the, you know, important variables that were set, and just like, you know, ship it along with it, you know important variables or set and just like you know ship it along with it you know so that you can link link it by a single id um but just to like so that you don't lose that because you need to know what happened earlier at the beginning of the request even at the very end if you if you're like trying to so so what the sorry i got distracted so what the libraries do is just um you know uh initialize an empty honeycomb event and then pre-populate it with parameters, anything that is known from the past.
Starting point is 01:09:52 And then over the course of while that request is executing inside that service, you can just do like a printf basically just like, oh, this might be useful. Stick it on the blob. Oh, that might be useful. You know, any shopping cart ID, user ID, you know, any high cardinality dimension,
Starting point is 01:10:09 you know, just shove it in. And then at the end of the request, when it's about to error or exit, you just ship that off to Honeycomb as one arbitrarily wide structure data blob. And that's all that the libraries need to do is just like wrap all that context together and ship it off in one event.
Starting point is 01:10:25 Yeah, that makes sense. And have any customers complained about like PII getting inadvertently clogged? Yeah. So we have a solution for that, which is it's this secure proxy thing where if you're running inside a secured network, you can run this proxy in an AWS ASG and just stream all of your events through the proxy to us.
Starting point is 01:10:50 And the proxy, like, you know, all the secrets are stored by you. We never get the secrets. We can never decrypt it. It stores a mapping of the original event to, like, the encrypted event or to a hash of the event. And then forwards only the hash of the encrypted stream. Then you VPN in and your browser, you point to Honeycomb, but the JavaScript in your browser knows to take what Honeycomb has sent and
Starting point is 01:11:15 use it to look up from the proxy behind your secure network to fill in the actual values. Well, that's such an innovative solution. I would have never thought about something like that. Like we constantly see, oh, we logged something that we shouldn't have. And yeah, okay, now it's like a security incident and we need to clean it up somehow. But yeah, that's great. And just as some like closing thoughts, like if there's anything you want to share with
Starting point is 01:11:41 like listeners about how should they think about like observability? I know we've already spoken a bunch about this but anything that you know we missed well first of all honeycomb has a free tier that's really generous you can you can use it even without you don't have to sample it all like you can you can use it for real workloads and and i think it's i think it's this is kind of something that i think everything is every person is different what makes it click for them, right? And I think there's no substitute for just kind of getting your hands dirty and seeing like the difference that it, because it's like a taste of the future and I think engineers should be stoked by that. If you can't use Honeycomb, that's fine. Like I think that there are,
Starting point is 01:12:23 LightStep is also observability um by my definition sadly i don't think anyone else is i mean those are kind of your two options right now but i do think that um you know the big players uh you know new relics and um data dogs etc of the world are are racing towards that and we'll get there sometime in the next couple of years. You can do your own quick and dirty, especially if you're small, just by storing those. In fact, AWS has been doing this, I found out recently for like 10 years. This is how they store their telemetry,
Starting point is 01:12:56 arbitrarily wide structured data blobs in like a temporary like file on each node. And so when they're trying to understand what's happening, they'll just do the equivalent of a distributed shell out and just aggregate and slice and dice on that data. It is different, and it is something that should be in every engineer's toolkit. I think that instead of giving advice for observability,
Starting point is 01:13:23 I will repeat that this should be the year that everyone gets their deployments automated so that no individual has to be a gatekeeper so that it happens automatically within a few minutes. Think how much of your life you will reclaim and your coworkers will reclaim and all of the cool things that you can spend that time on. I don't care whether it's at work or at play um stop wasting your life on deploys yeah i'm excited to see that tweet storm at the end of the year saying with all of the replies saying yeah we all have automated deployments it's yeah you are even asking this it's just like bread and butter now yeah and and honestly uh i if if people are really struggling and they want advice i actually have a calendly link where people is calendly.com slash charitium and then there's a
Starting point is 01:14:11 link there for people if you want you have to send me 50 bucks to reserve your spot because i'm sick of flakes um but i will happily like talk through your organizational issues strategize and how to get people to you know give it a try um technical social the whole works yeah i'll put that in the show notes but yeah thanks so much for being a guest i had a lot of fun i think it's a really high bandwidth conversation yes great you're great
