Screaming in the Cloud - Best Practices Don’t Exist with Paul Osman

Episode Date: January 5, 2021

About Paul Osman

Paul Osman is a Software Engineer with 20 years of experience in the industry. He's the Lead Instrumentation Engineer at Honeycomb.io and is passionate about making production... a less scary word. Having spent most of his career in the ill-defined space between software development and operations, Paul spends a lot of time thinking about making on-call experiences better, responding to and learning from incidents, and improving ways for software engineers to share knowledge. Before joining Honeycomb.io, Paul worked in Platform and SRE teams at Under Armour, PagerDuty, and SoundCloud.

Links Referenced:
Honeycomb.io
Follow Paul on Twitter
Paul’s Blog

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. tools, and then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They've designed everything you need in one platform, with pricing that's simple and straightforward, and that means no more counting hosts. You also can get one user and 100 gigabytes per month totally free. To learn more, visit newrelic.com. Observability made simple.
Starting point is 00:01:07 This episode has been sponsored in part by our friends at Veeam. Are you tired of juggling the cost of AWS backups and recovery with your SLAs? Quit the circus act and check out Veeam. Their AWS backup and recovery solution is made to save you money, not that that's the primary goal, mind you, while also protecting your data properly. They're letting you protect 10 instances for free with no time limits, so test it out now. You can even find them on the AWS Marketplace
Starting point is 00:01:40 at snark.cloud slash back it up. Wait, did I just endorse something on the AWS Marketplace? Wonder of wonders, I did. Look, you don't care about backups. You care about restores. And despite the fact that multi-cloud's a dumb strategy, it's also a realistic reality. So make sure that you're backing up data from everywhere
Starting point is 00:01:59 with a single unified point of view. Check them out at snark.cloud slash back it up. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Paul Osman, who is either a lead engineer or the lead engineer at Honeycomb overseeing instrumentation. Paul, welcome to the show.
Starting point is 00:02:22 Thanks so much for having me, Corey. Happy to be here. And which is it? It's all based upon density and the fact that you are never, ever going to float. Absolutely. Especially when you're putting software libraries in your systems. That's what you want to think about. Absolutely. Everyone talks about these lightweight instrumentation frameworks. No, no, you go the opposite. You are the heavyweight instrumentation framework.
Starting point is 00:02:57 We will become the center of gravity in your system. Exactly. Not quite the direction Honeycomb has chosen to go for a variety of reasons, not least among them being that it's a terrible idea. So what do you do? What does instrumentation engineering look like at a company that is fundamentally, well, I'll get in trouble if I call them anything other than an observability company, but instrumentation is kind of what they do. Exactly, yeah. The most succinct way I can think about it is my team works on the tools that help you get data into Honeycomb. So if you think about a system like Honeycomb, you've got the platform, you've got the web UI,
Starting point is 00:03:35 and then you've got everything that runs on a user's or customer's system, and that's my team. So fundamentally, you're in charge of the agents, the embedded SDKs, the libraries that people shove into their systems, depending on how you're orchestrating it these days, the 800 Lambda functions, CloudWatch integrations and whatnot, and run this magic CloudFormation template that instruments all of my AWS accounts to hurl information into your system, that sort of thing? Absolutely. And the list is long, you're right. It turns out that the more that you put into your system architecturally,
Starting point is 00:04:08 the more things there are to monitor. Exactly, and to pull data out of. You mentioned Lambda, and there's a whole bunch of interesting ways that you can get data out of Lambda functions. Who knew? Especially with their new extensions API, which is super interesting. I haven't gone diving into it in any depth yet,
Starting point is 00:04:24 but I like the idea. I really like the idea. I'm a big fan of serverless in general. You could call me a convert because I was honestly skeptical at first. But the idea of creating a platform where you just ship freaking code and you don't worry about anything else,
Starting point is 00:04:38 now having the ability to run processes in parallel, run sidecars in a serverless environment, I think is really, really cool. There's so much capability that's, I guess, fantastic to see. It's amazing to, I guess, look at the complexity of even toy applications. And on the one hand, it's, wow, what an amazing system I've built with all of these different services and everything tied together and the way that it interacts with another. And even if it's well-instrumented, that's great. And then the other side of it is, so what does this application do?
Starting point is 00:05:08 It shows people pictures of cats and that's really it. And at some point it feels like this is painfully overwrought. Now this is not a new problem. It feels like that's a bit of a cyclical thing. Things get so complex, it no longer fits in anyone's head anymore. And then there's a collapsing function of an abstraction layer that winds up becoming broadly adopted. And then the cycle repeats anew. At least that's my impression on this, having been spending the better part of the last two decades in the ops and engineering space. But you have spent two decades in the ops and engineering space. What's your take on it? Yeah, this is something I wrestle with a lot. The idea of complexity, right? You can look at a lot of these sort of architectural guides
Starting point is 00:05:47 and just go, holy crap, there's a lot there. And sometimes that's what you need. So I think straddling or balancing or figuring out the balance between needed complexity and kind of accidental or complexity debt is key there. For a simple thing, you want the simplest thing that could possibly work. Yeah, and there's never any real tacit acknowledgement of that. It always seems that these frameworks and tools
Starting point is 00:06:12 and the rest have, example one is hello world. Example two is hello to the entire world. And great, not all of that stuff is needed for every environment, but you also probably don't want to build something hyperscale on the first example. There has to be some point of complexity where, okay, at this scale, the complexity trade-off is well worth doing, and in fact, it's dangerous to not have it.
Starting point is 00:06:36 That's not everything. Not every system needs to scale globally at all times. Now, that enrages some people when I point it out, but it's true. Oh, yeah. And the type of scaling that you need is also highly dependent on the workloads that you're managing. You know, you mentioned I come from an ops background. I was working as an SRE before I joined Honeycomb. And one of the things I've always tried to stick to, not always successfully, is, you know, the least amount of technology possible. If you're dealing with something that just has to horizontally scale out and, you know, you've got a pretty consistent
Starting point is 00:07:09 workload, maybe you don't need it to be run in a container orchestrator. Maybe you just need an ALB that can do horizontal scaling on CPU usage or something. Yeah, it winds up being a problem when I talk about my philosophy on things as a best practice, because I say other things that tend to fly directly against that. Somewhat recently, I got in trouble again on Twitter, again, for bringing up the idea that setting your database to your local time zone is a terrible idea. Put it in UTC and then let the presentation layer figure it out from there. And the answer legitimately was, look, it's a local payroll app that's only for a one branch company in a single time zone. Why would you ever need to worry about that?
Starting point is 00:07:54 Well, for that kind of story, my position is if you're building the small thing, great. Leave the door open for it to potentially become a big thing. 95% of apps will never hit a point of success where they need to go hyperscale. But for those 5% that do, don't bury landmines you're going to trip over down the road when that time comes.
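The "store it in UTC, let the presentation layer figure it out" advice above can be sketched in a few lines of Python. This is an illustrative sketch only, not something shown in the episode: the payroll timestamp, the `to_storage` and `render_local` helper names, and the two fixed-offset zones are all assumptions chosen for the example. A real application would more likely use named zones (for instance via the standard-library `zoneinfo` module) so that daylight saving rules are handled for it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets, used only to keep the sketch dependency-free.
# They are valid for a winter date; named zones would handle DST properly.
TORONTO_WINTER = timezone(timedelta(hours=-5), "EST")
BERLIN_WINTER = timezone(timedelta(hours=1), "CET")


def to_storage(local_dt: datetime) -> datetime:
    """Normalize a time-zone-aware datetime to UTC before it is persisted."""
    if local_dt.tzinfo is None:
        # A naive datetime carries no zone, so its UTC value would be a guess.
        raise ValueError("refusing to store a naive datetime; attach a time zone first")
    return local_dt.astimezone(timezone.utc)


def render_local(stored_utc: datetime, tz: timezone) -> str:
    """Presentation layer: convert the stored UTC value into the viewer's zone."""
    return stored_utc.astimezone(tz).strftime("%Y-%m-%d %H:%M %Z")


# Hypothetical payroll event entered at a single-branch office.
entered = datetime(2021, 1, 5, 9, 30, tzinfo=TORONTO_WINTER)
stored = to_storage(entered)

print(stored.isoformat())                    # what the database would hold
print(render_local(stored, TORONTO_WINTER))  # what the local office sees
print(render_local(stored, BERLIN_WINTER))   # what a future second branch sees
```

The stored value never changes when a second office opens in another zone; only the `tz` argument passed at display time does, which is the "leave the door open" property being argued for.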
Starting point is 00:08:14 Exactly. One of the things that can be challenging with examples like that is there are defaults. And we're not always aware of the consequences of accepting some of the defaults. And it can be really hard as engineers to think through, what is the reversibility of this setting that I'm accepting or this state that I'm accepting? And if the answer is that it's going to be really hard to reverse,
Starting point is 00:08:37 then maybe you want to think twice before doing that. One of the problems that I keep seeing is that there's a lack of awareness of how to build hyperscale applications. And it occurred to me that part of the reason is that no one knows how they go for their particular workload, for their particular constraints. And this is proven out by the fact that if you talk to any hyperscale company about their application architecture, how things are built, ignore what they say at conferences on stage, pay attention to what they say at conferences in the bar after you pour six beers into them, and they all admit that it's crap. Everything we've done is garbage. We're doing as best we can, but there's a lot of rough edges. It feels like we're always a hair's breadth from disaster. I can't shake the feeling
Starting point is 00:09:30 that we're all just making it up as we go along. I totally think we are. You mentioned earlier best practices, and what the hell are best practices when they're so highly dependent on the specific architectural decisions made, on the traffic patterns, on the social aspects of how an organization works. I've had the good fortune of being part of a few teams
Starting point is 00:09:52 that have had to scale up to hundreds of millions of users, and no story has been correct. This is one of the things that always used to annoy me about, I'm glad to say it doesn't seem to happen as much anymore, but when people would point at specific technologies, like Ruby doesn't scale or something like that, that's a meaningless statement. What does that mean? It certainly has scaled for some people in some environments. It just depends on what you're actually doing. And like you said, there's no blanket advice that seems to work for everybody.
Starting point is 00:10:20 There are principles, I think. And if we worked really hard, we could probably dig out some of those principles. But the idea that there's like a one-size-fits-all pattern, that seems to come from people who are trying to sell you something. Oh, yeah. At the time we were doing this recording, there was recently a great tweet by GitHub's, or Jithub's, depending upon pronunciation, CTO. Well, I'm Canadian, so, you know. The best part of this show is mispronouncing things. It's not Postgres, it's Postgresqueel. I digress. The question that he was asking was, if you're going to start a new company today, what technical stack do you pick? What cloud provider, what language, et cetera, et cetera.
Starting point is 00:10:57 And my response to it is, oh, that's easy. It's the one that the engineers I'm hiring are conversant with and want to work in. Yeah. Because I can look around the landscape and see an awful lot of business failures for a variety of reasons. I'm really hard-pressed to identify any of them as, ah, they picked the wrong technical stack.
Starting point is 00:11:13 Yeah, how many companies have actually been sunk by a decision like that? It literally never happens. You know, and for what it's worth, I completely agree. The right tech stack is the tech stack that you have experience with, the tech stack that you're comfortable with. Way more important, and it's funny because people,
Starting point is 00:11:29 I don't know, sometimes I feel like we talk about this less, but it's how comfortable are you with everything else? Who cares what programming language your code is written in if you're not confident in the way that you actually deploy changes? Or if you're not confident in the way that you configure how traffic is routed to it. You know, that stuff all, I would say, arguably matters a lot more than the actual, you know, expression of business logic that gets converted into machine code. It really is. And that's what I want to ask you about, too,
Starting point is 00:12:01 is that you have exposure to a bunch of different stacks, presumably, because you are the instrumentation engineer who's made of lead. And you wind up building these integrations into every godforsaken stack that all of your customers are going to be using, or any of your customers are going to be using. Which means that you get to touch a lot of different languages, you get to touch a lot of different platforms, presumably. Is that correct? Or am I dramatically overestimating Honeycomb's compatibility with different systems? Oh, no, no. You are absolutely on the nose there. When I was being interviewed by Honeycomb, we have a coding exercise that we send to a lot of candidates. And the only difference with me from an average product or platform engineer at the company was they had me do it in a number
Starting point is 00:12:43 of languages just to see how comfortable I was moving from one platform to another because being on the instrumentation team, that is definitely part of the job. So at this point, it's one of those questions that I always used to ask my parents, am I the favorite or is my brother? And the answer that they gave was, you're my children. I can't stand either one of you. So to that end, what is your favorite stack to integrate with and your least favorite stack? Because, you know, it's not really a podcast unless you enrage people. Right.
Starting point is 00:13:15 Yeah. So 100% based on what we were saying earlier, the ones that I prefer, I'm going to surprise you. They're the ones that I have the most experience working in. And so I've trained my brain to think in a number of different ways, I think fairly well. I'm a really big fan of functional programming, a little. So I like languages that tend to support a little bit of functional programming. I come from a background, accidentally, I ended up doing a lot of Scala at a lot of different companies. And so I'm very happy working there.
Starting point is 00:13:42 But conversely, I also really like working in Go, one of the languages that is often kind of made fun of lovingly for being a very basic language. It's not too fancy in terms of features. I want to be very clear here that my position is that language bigotry is awful. Oh, yeah. It's one of those ways of gatekeeping, and it drives me nuts. It doesn't matter what language you pick.
Starting point is 00:14:03 I can write shitty code in all of them. Absolutely. And I have, and I will. It may not even compile. It's so bad. I mean, personally, I don't get JavaScript to save my life. It does not match my understanding of the world.
Starting point is 00:14:16 Python, conversely, is something that aligns much better with how I see things. And Ruby was also a great devil for me for a while. I was also heavily into Perl for a long time. But again, as an old ops person, my favorite language is and always will be bash scripting. Oh, beautiful, yes. It's funny, I have a very similar experience.
Starting point is 00:14:33 Maybe it's something about us ops people. But JavaScript, I have not trained my brain to work that way. I completely agree with you about language bigotry being awful and a form of gatekeeping. And so my approach is when I see somebody who's proficient in JavaScript and can write wonderful applications in Node or browser applications in React,
Starting point is 00:14:52 I'm in f***ing awe. It's just a way that I haven't managed to make my brain as compatible. The challenge, of course, is that it's your responsibility to fundamentally support all stacks. So how do you approach doing an integration in a language or stack with which you're not familiar?
Starting point is 00:15:08 Yeah, that's a great question. So part of it is you just got to dive in and kind of work through it, which I think if you've worked in enough companies that have different languages and different stacks, you might have some experience doing. I've worked in companies where, I worked in one company once
Starting point is 00:15:24 where we started the whole microservices journey and we regretted this decision, spoiler, but we said everybody can choose whatever language they want to use because it doesn't matter at the end of the day. We're all talking to each other over HTTP and JSON APIs. So that resulted in this Cambrian explosion and surprise, if you wanted to go and work on something on a different team or that a different team had created, it's going to be in a language you may have never even seen before. And so part of it is you just got to kind of dive in and be willing to learn. Where there are real gaps or weaknesses, that's where hiring becomes important. It's funny, I've been a hiring
Starting point is 00:16:00 manager in previous lives, and I've been involved in hiring processes at a bunch of different companies. And I'm very opposed to just hiring based on specific technology or language experience. But sometimes you have to say, ooh, it's a real bonus if this person fills a gap that we don't have on the team at the moment. Oh, absolutely. I think that hiring is one of those hard parts where it's easy to fall into the very common trap of never, ever wanting to hire someone who's weak in something as opposed to, okay, great. Maybe your Python is crappy, but we have three engineers already who are great with it. But if you know Ruby and we don't, cool. That's a strength, not a weakness.
Starting point is 00:16:40 Hire for strengths. Forget the, I can come up with some puzzler problem to put on a whiteboard that'll stump you. Hell with that. Show me what you're best at. I want to see you shine. I don't want to see what it looks like when you're sitting there flailing because you haven't brushed up on your comp sci curriculum in 20 years. Oh God, absolutely. I was very pleasantly surprised, as an aside, when I was interviewing this last round and I joined Honeycomb about a year ago, I did a pretty extensive job hunt and I ended up doing a fair number of onsites. I think it was like six in total, which seems exhausting now just thinking about it. But I was so relieved that no one had asked me.
Starting point is 00:17:16 Six conversations or six different trips to San Francisco to visit them onsite? Three of them were trips, two of them were remote, and one of them was local. Okay, those are actual separate interviews with different folks at different times. Okay, yeah, that's a lot of back and forth. It's a decent amount, but I was so pleasantly surprised to see that nobody asked me one of those whiteboard questions. Not a single thing that would show up in f***ing Cracking the Coding Interview or, you know, LeetCode or whatever other tool you want. Yeah, part of it is also just this, it's this almost corporate hazing sense.
Starting point is 00:17:48 It sounds weird, especially given that, let's be honest here, most of the audience of the show has an engineering background, but I personally find hiring folks who are either engineers or engineering adjacent to be way easier than a lot of other hires. For example, if I'm hiring another cloud economist
Starting point is 00:18:04 who needs to be able to delve into AWS and have some SRE experience and be able to look at this from a financial analysis perspective, great, I've done a lot of that myself. I know exactly what to look for, what to ask, what to uncover. Whereas if I'm hiring for, I don't know,
Starting point is 00:18:20 a product marketer or an accountant or a graphic designer, I have no earthly idea how to even frame the question. Part of the challenge then is that in many cases, if you're not reaching out to experts who are great at this stuff to help with the winnowing and interviewing process, you're probably going to wind up hiring the person who sounds the most confident, which is kind of awful. Right, exactly. I think the only thing I've ever found that can even begin to crack that for me is ask people what they've done and then delve into really, really specific follow-up. If somebody comes in and says, I'm great at X, great, tell me about a time when you used X to
Starting point is 00:18:59 good result. And obviously you're going to run into people who are just really good at self-selling, but I think if you ask enough follow-ups and if you look for things like communication skills, their ability to connect their effort with outcomes and things like that, you can still get pretty good results. This episode is sponsored in part by ChaosSearch. Now their name isn't in all caps, so they're definitely worth talking to. What is ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes, not days or weeks. Click, boom, done. ChaosSearch is for you if you're trying to get a handle on processing multiple terabytes or more of log and event data per day at a disruptive
Starting point is 00:19:44 price. One more thing for those of you who've been down this path to disappointment before. ChaosSearch is a fully managed solution that isn't playing marketing games when they say fully managed. The data lives within your S3 buckets, and that's really all you have to care about. No managing of servers, but also no data movement. Check them out at chaossearch.io and tell them Corey sent you. Watch for the wince when you say my name. That's chaossearch.io.
Starting point is 00:20:13 I think you're probably right. I think that there's a lot to be said for digging into things. What I love is asking open-ended questions in interviews. And at some point, one of us is going to get to a point of, I don't know. I'm either learning something or I'm seeing how people think and what they do when they hit a wall, which, especially for senior roles, is incredibly important. You don't want folks who are going to sit and not go anywhere. It's great. I'm blocked.
Starting point is 00:20:36 How do I resolve this? What do I do? And in my case, it's reach out to people. Look on the internet. Do some searching. But don't sit there and stand at the whiteboard and tear up. It's one of those, yeah, we don't know these things off the top of our heads. No one does.
Starting point is 00:20:51 So ask. That's the point. I want to see people saying that they don't know how to do something. Yeah. And this is one of the hardest things to do. But when you do manage it, and I don't have the perfect answer, but I've seen it. When you get some kind of collaboration happening in the actual interview, right? And you get a sense of like, oh my gosh,
Starting point is 00:21:09 this is what it would be like working with this person because we're actively collaborating on a problem that none of us know the actual answer to. In other words, what we're paid to do day to day. So to that end, I have to ask you, given that you see a lot of this, what makes writing slash shipping slash producing software harder than it needs to be? You know, I think there's a few different things there.
Starting point is 00:21:35 Writing and communicating, I mean, that's hard because you're dealing with human beings. And to our previous discussion about software stacks and tools and tech and processes, there's no perfect answer. And so the hard thing is figuring out what do you actually need to communicate? What do you actually need to do? In terms of shipping software, I think that that comes from making it harder
Starting point is 00:22:01 than it needs to be by creating situations where you're scared to touch anything. My background as an SRE, the thing that always terrifies me the most is the service or the software that people don't touch very often. It's the stuff that maybe, you know, it's harder to find out how it works or how it breaks or whatnot, because frankly, you just never have a need to interact with it. That's the stuff that really scares the crap out of me. From your perspective, what's, I guess, what's the interesting part of software versus what's the part of it that no engineer should ever have to touch or do again? What is the valuable part? What should engineers of the future be building, focusing on, working on? And what should folks never think about again?
Starting point is 00:22:46 I like the fact that you're coming from an engineering perspective, because normally if I ask questions like that, it turns into a sales pitch answer. Yeah, exactly. You know, it's funny because I find myself kind of conflicted here between what I like to do and what I believe to be actually correct. And what I mean there is, I like thinking about all the plumbing that makes software go. You know, I like thinking about infrastructure. And I like thinking about writing tools and helping create things that make it easier for other software developers
Starting point is 00:23:15 to push code to production and help users and delight users and all that sort of thing. And that's exactly what most businesses shouldn't have to worry about. They shouldn't have to employ people like you or I from ops backgrounds who just know how to make the stuff go because that should just be a given. I was talking earlier about serverless and some stuff that I think is hopeful there. The average software developer, I think, who wants to delight users, who wants to create things
Starting point is 00:23:42 that create value for a business and for customers, they don't want to care if it's running on Kubernetes or if it's running on spot instances or things like that. They just want to push it and they want to go. The tricky part comes in when it breaks. And when it breaks, we want something that we have that sort of ability to introspect and debug, even if, you know, it's hidden behind some kind of abstraction. And that's a balance that I don't think we've seen yet in the industry, but I think we're getting closer. Let's see, when I have conversations with folks like you, and we discuss these types of things, and the answers always seem so eminently reasonable. And then I leave the ivory tower of my podcasting studio and go back into the world.
Starting point is 00:24:25 And then I see the nonsense everyone's building instead. It feels like on some level, there's two worlds, the aspirational way that we all want to be doing things and then the messy way that we really are doing things. And I'm starting to despair of ever being able to fully bridge that gap. Oh, interesting. By the ivory tower, what would be an example
Starting point is 00:24:44 of like an ivory tower perspective or point of view? Oh, sure. Any conference talk you've ever seen on any technology under the sun where they talk about how they wind up seamlessly deploying software into production. CI/CD stories, for example, are notorious for this. It's the, you watch these amazing presentations like, wow, I'd love to work at a place that did things like that. And the person next to you says, yeah, me too. And you look at their badge and they work at the company the presenter works at.
Starting point is 00:25:11 It's the myths we tell ourselves. Sometimes individual groups wind up solving these problems within larger companies. Sometimes it's a new thing that they're running in tests, but haven't rolled out everywhere. And let's not kid ourselves, if it touches the payment system, everyone's doing waterfall development, whether they admit it or
Starting point is 00:25:27 not. But there's a broader world out there of folks who want to be doing things the right way. They want to be getting rid of the boilerplate and stop reinventing the wheel and reimplementing the wheel and get onto doing the truly interesting and innovative stuff. And those people right now are all listening to this while going back to code a login page. You never get past it on some level. That's what bugs me. And you know what's super interesting about that? In my experience, which may not be representative, but the places that I've seen that have accomplished the closest to that kind of story, have done it in really simple, almost kludgy ways. And what I mean by that is, I've never personally worked somewhere
Starting point is 00:26:11 where we had this great system that tracked state of all of these different services and made sure that there was traffic going from here to there in a way that was canary testing and everything. That all sounds like a lot of moving parts. The best places I've worked have a freaking cron script that just pushes out changes or has a webhook that kicks off something that pulls down a tarball from an S3 bucket and then ships it to a machine.
Starting point is 00:26:39 Oftentimes this stuff, I think it doesn't make for sexy conference talks, but it's just roll up your sleeves kind of work to get it happening and then move on to something else. I think sometimes we maybe trip ourselves up by wanting it to be more interesting than it actually is. That's part of it. If we were completely honest with people at what we were actually building or working on at any given point in time, the answer would be incredibly depressing and we would just be sadder after explaining our jobs to people. I try not to give talks to classrooms full of schoolchildren anymore on what I do for a living for that specific reason. And yeah, I agree, but isn't it great sometimes that this shit works? If the point is to deliver value to customers quickly and efficiently, maybe investing just enough to make that work repeatably and in a way
Starting point is 00:27:27 that people trust, and frankly, is simple enough that you can also debug when it doesn't do the thing that it's supposed to do, maybe that's actually enough. Maybe we're sometimes over investing in complicated solutions that might fall into that accidental complexity scenario we were talking about earlier. Well, all right, let's take that to its logical extent here. Here's something I know for a fact you have an opinion on. Now, I have opinions on things too, which would surprise no one who listens to this show, but what do you think stops engineers
Starting point is 00:27:56 from wanting to be on call for the service that they work on? Now, there are a couple of answers I have to that. One is the polite public answer, and one's the real answer. But I'd like to hear your answer. Yeah, sure. I want to come back to the difference between the public and the private answer, too. My answer is there's a whole bunch of stuff. One of them is social. And I think this is more common, is that engineers are on call for things and they don't feel like they have necessarily the autonomy to actually react to things the way that they need to. And what I mean by that is, I've certainly seen this and I've done my part to try to fix it or to encourage others to empower people to fix it or whatever the hell you need to do.
Starting point is 00:28:44 But people get paged and they're like, oh, that alert means nothing. Okay, so get rid of that alert. I can't do that. Why not? Just do it. You know, if you're getting paged for something, you have the right to change the system that is alerting you to something. And what you're seeing on the ground level, you know, as the on-call engineer should be gospel. It should be the thing that dictates how the future person experiences that role. And if you don't have that, it's a really shitty experience.
Starting point is 00:29:12 I think the other thing that's technical is this notion of accidental complexity. Is when you have a system that you're responsible for, that you're on call for, and it's just, you know, whether it's because of over-engineering or it's just out of necessity complex, you don't know how to insert
Starting point is 00:29:31 yourself when it f***s up, right? Like you can start to look at it and say, okay, you know, we've got a drop in traffic or we've got a spike in error rate or something like that. But if you're nervous about the actual mechanics that get your changes from your laptop to the production environment, then it can be a really terrifying experience to make changes. And I've been in environments where people just freeze, and it sucks. So that's why I always think of, if you can make it easier to get the changes from your laptop to production, that is the best investment that you can possibly make, technically. I would agree with that sentiment. It feels like when you talk to software developers who are
Starting point is 00:30:10 building these systems and then complaining about a problem in production, here, log into the prod server and see, well, this looks nothing like their IDE. It looks nothing whatsoever like their development environment. And people feel awkward and out of sorts there. I mean, in years past when I was working in ops roles, I made production uncomfortable to work in, intentionally so, because that's not your default place to operate in. But if people are used to using Visual Studio Code,
Starting point is 00:30:36 for example, then, okay, now the only editor we have installed here is vi. So you're going to have to spend some time learning even to look at what's going on here. That's an awful experience. Not to mention that people are never doing these things during the workday, invariably. It's always two in the morning when you're bleary-eyed and have no idea what you're doing. And congratulations, you're being confronted by the puzzle master. It doesn't go well. No, and that's actually a great point that I think is within our control
Starting point is 00:31:03 as engineering teams to change. Yeah, it'll happen at two in the morning, that's for sure. Any 24-7 service that you're on call for, it's going to break at an uncomfortable time, and you're going to have to debug it. But that doesn't have to be the only time you do this stuff. And in fact, when it is the only time you do this stuff, that's terrible. And that's why I'm a big proponent of have, you know, have fire drills, have game days, break the shit that breaks often so that you know how it breaks when everybody's around, you know, resolve those kinds of uncertainties as much as you can, because obviously some things
Starting point is 00:31:33 are just unknowable, but practice those muscles as often as possible. There's a funny thing that we talk about sometimes at Honeycomb and, you know, this sounds like a humble brag, but it's not. It's just that there are periods of time where we don't have as many incidents. And that makes it actually really hard to make sure that people are primed to be on call. And so we're thinking through, what can we do to just make it more comfortable? Like if someone comes on board and their first on-call rotation is quiet, that doesn't really help them, right? So what can we do to kind of force interaction with production as often as possible to make it almost routine and muscle
Starting point is 00:32:09 memory? I talked with companies back when I was looking at various roles where, oh, everyone is on-call. And you hear that during an interview, and having been through many on-call rotations myself, it's, yeah, that's not a strong point, to be perfectly honest with you. That sounds like, if you're not very careful how you position this, that everyone is woken up for every incident and I won't get a whole lot of sleep working here. And not to be unkind, you're not paying significantly more than other folks who don't subject me to that. Yeah, that's terrible. Everybody is on call. That reminds me of the companies that I worked for, I don't know, before a certain time when, I don't know, maybe it was that PagerDuty became a ubiquitous tool
Starting point is 00:32:49 that was used in companies of a certain size. But it was that old time when the first person to respond is really the person who's on call. And that's a terrible environment, and it's a recipe for burnout. You should have a clear escalation path, and you should have clear responsibilities. And every engineer should have a huge chunk of time when they're not on call. And they know that they're not on call,
Starting point is 00:33:11 so they can delete Slack from their phone. They can turn off all of their alerts. And when they're done at the end of the day, they're just done. So yeah, I would also run screaming from a company who said that these days. Anecdotally, there's two questions that I always like to ask companies when I'm looking for jobs and talking to companies. One is, how do you get your code to production? Walk me through as many steps as you're comfortable disclosing in an interview, which hopefully is a lot. And two is, how are people put on call
Starting point is 00:33:39 and what's the last major incident you had and what does it look like? Who was involved? What happened? How did people get the support they needed? All those questions are really, really interesting ones to dive into. I wish you had more opportunities than interviewing to ask other companies this stuff. Oh yeah. My personal favorite way of responding to that, which is why I generally don't get offered jobs a whole lot is, so you have an on-call rotation here. Oh yes. It's absolutely
Starting point is 00:34:02 critical that our site is up all the time. Cool, so why don't you staff multiple shifts of people who are responsible for keeping the site up during those times so that you're not making people wake up in the middle of the night to break things. And suddenly we're in one of those, what I say versus what I do are different territories and that becomes a problem.
Starting point is 00:34:22 Oh, are you talking about like follow the sun rotations? Either follow the sun or having folks who are either night owls who enjoy night shift or something for, and I'm not talking small startups here. I'm talking companies that have 1,500 engineers working there. It's at some point you have multiple offices in various places.
Starting point is 00:34:40 Why are you still waking people up in the primary time zone every week? Sorry, when I say primary time zone, I should be very explicit on this. There's always a time zone hierarchy in every company. It's the headquarters time, and that is how it's going to be, regardless of what companies claim otherwise.
Starting point is 00:34:56 Of course, it's the center of their universe. And invariably, it seems to be Pacific West Coast. Yes, exactly. Yeah, I agree completely. At a certain size, and there are plenty of companies that I think are doing this, but you have the opportunity to let folks in North America time zones
Starting point is 00:35:12 just stop working. And then folks in European time zones will take over. And then folks in certain Asian time zones will take over. And yeah, that is a great way to do things, I think, if you can manage it. So I guess my last question for you, since I've been peppering you with these,
Starting point is 00:35:27 is if people want to learn more, where can they find you? I am on Twitter. I'm not sure how much value you'll get from my tweets, but every once in a while, maybe I'll tweet something that at least provokes some discussion. Paul Osman. And I very occasionally blog at paulosman.me. And I think that's it. And we'll put links to those in the show notes.
Starting point is 00:35:50 Excellent. Paul, thank you so much for taking the time to speak with me today. I really appreciate it. No problem, Corey. I really enjoyed the discussion. Thanks a lot. As did I. Paul Osman, lead engineer of instrumentation at Honeycomb. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice and a comment telling me why your on-call rotation
Starting point is 00:36:20 is different and unique. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold. This has been a HumblePod production. Stay humble.
