Screaming in the Cloud - Best Practices Don’t Exist with Paul Osman
Episode Date: January 5, 2021
About Paul Osman
Paul Osman is a Software Engineer with 20 years of experience in the industry. He's the Lead Instrumentation Engineer at Honeycomb.io and is passionate about making production... a less scary word. Having spent most of his career in the ill-defined space between software development and operations, Paul spends a lot of time thinking about making on-call experiences better, responding to and learning from incidents, and improving ways for software engineers to share knowledge. Before joining Honeycomb.io, Paul worked in Platform and SRE teams at Under Armour, PagerDuty, and SoundCloud.
Links Referenced:
Honeycomb.io
Follow Paul on Twitter
Paul’s Blog
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
... tools, and then we get into the advanced stuff. We all have been there and know that pain,
or will learn it shortly, and New Relic wants to change that. They've designed everything you need
in one platform, with pricing that's simple and straightforward, and that means no more counting
hosts. You also can get one user and 100 gigabytes per month totally free. To learn more, visit newrelic.com. Observability made simple.
This episode has been sponsored in part by our friends at Veeam. Are you tired of juggling the
cost of AWS backups and recovery with your SLAs? Quit the circus act and check out Veeam. Their
AWS backup and recovery solution is made to save you money,
not that that's the primary goal, mind you,
while also protecting your data properly.
They're letting you protect 10 instances for free
with no time limits, so test it out now.
You can even find them on the AWS Marketplace
at snark.cloud slash back it up.
Wait, did I just endorse something on the AWS Marketplace?
Wonder of wonders, I did.
Look, you don't care about backups.
You care about restores.
And despite the fact that multi-cloud's a dumb strategy,
it's also a realistic reality.
So make sure that you're backing up data from everywhere
with a single unified point of view.
Check them out at snark.cloud slash back it up.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Paul Osman,
who is either a lead engineer or the lead engineer
at Honeycomb overseeing instrumentation.
Paul, welcome to the show.
And which is it?
Thanks so much for having me, Corey.
Happy to be here.
It's all based upon density and the fact that you are never, ever going to float.
Absolutely. Especially when you're putting software libraries in your systems.
That's what you want to think about.
Absolutely.
Everyone talks about these lightweight instrumentation frameworks.
No, no, you go the opposite.
You are the heavyweight instrumentation framework.
We will become the center of gravity in your system.
Exactly.
Not quite the direction Honeycomb has chosen to go for a variety of reasons,
not least among them being that it's a terrible idea. So what do you do? What does instrumentation engineering look like at a company that is fundamentally, well, I'll get in trouble if I call them anything other than an observability company, but instrumentation is kind of what they do. Exactly, yeah. The most succinct way I can think about it
is my team works on the tools
that help you get data into Honeycomb.
So if you think about a system like Honeycomb,
you've got the platform, you've got the web UI,
and then you've got everything that runs
on a user's or customer's system,
and that's my team.
So fundamentally, you're in charge of the agents,
the embedded SDKs, the libraries that people shove into their systems,
depending on how you're orchestrating it these days, the 800 Lambda functions, CloudWatch integrations and whatnot,
and run this magic CloudFormation template that instruments all of my AWS accounts to hurl information into your system, that sort of thing?
Absolutely. And the list is long, you're right. It turns out that the more that you put into your system architecturally,
the more things there are to monitor.
Exactly, and to pull data out of, you mentioned Lambda,
and there's a whole bunch of interesting ways
that you can get data out of Lambda functions.
Who knew?
Especially with their new extensions API,
which is super interesting.
I haven't gone diving into it in any depth yet,
but I like the idea.
I really like the idea.
I'm a big fan of serverless in general.
You could call me a convert
because I was honestly skeptical at first.
But the idea of creating a platform
where you just ship freaking code
and you don't worry about anything else,
now having the ability to run processes in parallel,
run sidecars in a serverless environment,
I think is really,
really cool. There's so much capability that's, I guess, fantastic to see. It's amazing to,
I guess, look at the complexity of even toy applications. And on the one hand, it's, wow,
what an amazing system I've built with all of these different services and everything tied
together and the way that it interacts with another. And even if it's well-instrumented,
that's great. And then the other side of it is, so what does this application do?
It shows people pictures of cats and that's really it. And at some point it feels like
this is painfully overwrought. Now this is not a new problem. It feels like that's a bit of a
cyclical thing. Things get so complex, it no longer fits in anyone's head anymore. And then there's a
collapsing function of an abstraction layer that winds up becoming broadly adopted. And then the cycle repeats anew.
At least that's my impression on this, having been spending the better part of the last two
decades in the ops and engineering space. But you have spent two decades in the ops and engineering
space. What's your take on it? Yeah, this is something I wrestle with a lot. The idea of
complexity, right? You can look at a lot of these sort of architectural guides
and just go, holy crap, there's a lot there.
And sometimes that's what you need.
So I think straddling or balancing or figuring out the balance
between needed complexity and kind of accidental
or complexity debt is key there.
For a simple thing, you want the simplest thing that could possibly work.
Yeah, and there's never any real tacit acknowledgement of that.
It always seems that these frameworks and tools
and the rest have, example one is hello world.
Example two is hello to the entire world.
And great, not all of that stuff is needed
for every environment,
but you also probably don't want to build something hyperscale
on the first example. There has to be some point of complexity where,
okay, at this scale, the complexity trade-off is well worth doing,
and in fact, it's dangerous to not have it.
That's not everything.
Not every system needs to scale globally at all times.
Now, that enrages some people when I point it out, but it's true.
Oh, yeah. And the type of scaling that you need is also highly dependent on the workloads that
you're managing. You know, you mentioned I come from an ops background. I was working as an SRE
before I joined Honeycomb. And one of the things I've always tried to stick to, not always
successfully, is, you know, the least amount of technology possible. If you're dealing with
something that just has to horizontally scale out and, you know, you've got a pretty consistent
workload, maybe you don't need it to be run in a container orchestrator. Maybe you just need an
ALB that can do horizontal scaling on CPU usage or something. Yeah, it winds up being a problem
when I talk about my philosophy on things as a best practice, because I say other things that tend to fly directly against that.
Somewhat recently, I got in trouble again on Twitter, again, for bringing up the idea that setting your database to your local time zone is a terrible idea.
Put it in UTC and then let the presentation layer figure it out from there. And the answer legitimately was,
look, it's a local payroll app
that's only for a one branch company in a single time zone.
Why would you ever need to worry about that?
Well, for that kind of story,
my position is if you're building the small thing, great.
Leave the door open for it to potentially become a big thing.
95% of apps will never hit a point of success
where they need to go hyperscale.
But for those 5% that do, don't bury landmines
you're going to trip over down the road
when that time comes.
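The UTC advice above can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the episode: the function names `to_storage` and `to_display` are hypothetical, and fixed offsets stand in for real time zones so the example stays self-contained. A real application would more likely use named zones (for example via the standard `zoneinfo` module) so that daylight saving changes are handled.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used purely for illustration (assumption: no DST handling).
EASTERN = timezone(timedelta(hours=-5), "EST")
PACIFIC = timezone(timedelta(hours=-8), "PST")


def to_storage(local_dt: datetime) -> datetime:
    """Normalize a timezone-aware datetime to UTC before it is stored."""
    if local_dt.tzinfo is None:
        # A naive datetime carries no offset, so it cannot be converted safely.
        raise ValueError("refusing to store a naive datetime")
    return local_dt.astimezone(timezone.utc)


def to_display(stored_utc: datetime, tz: timezone) -> datetime:
    """Convert a stored UTC value into the viewer's zone at presentation time."""
    return stored_utc.astimezone(tz)


# A 9:00 punch-in recorded at a single-branch office in the Eastern zone.
punch_in = datetime(2021, 1, 5, 9, 0, tzinfo=EASTERN)
stored = to_storage(punch_in)

print(stored.isoformat())                       # 2021-01-05T14:00:00+00:00
print(to_display(stored, PACIFIC).isoformat())  # 2021-01-05T06:00:00-08:00
```

If the one-branch company later opens a second office in another zone, only the presentation step changes; the stored values stay valid.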
Exactly.
One of the things that can be challenging
with examples like that is there are defaults.
And we're not always aware of the consequences
of accepting some of the defaults.
And it can be really hard as engineers to think through,
what is the reversibility of this setting that I'm accepting or this state that I'm accepting?
And if the answer is that it's going to be really hard to reverse,
then maybe you want to think twice before doing that.
One of the problems that I keep seeing is that there's a lack of awareness of how to build hyperscale applications. And it occurred to me that part of the reason is that no one knows how they go for their particular workload, for their particular constraints. And this is proven out by the fact that if you talk to any
hyperscale company about their application architecture, how things are built, ignore
what they say at conferences on stage, pay attention to what they say at conferences in
the bar after you pour six beers into them, and they all admit that it's crap. Everything we've
done is garbage. We're doing as best we can, but there's a lot of rough edges.
It feels like we're always a hair's breadth from disaster.
I can't shake the feeling
that we're all just making it up as we go along.
I totally think we are.
You mentioned earlier best practices,
and what the hell are best practices
when they're so highly dependent
on the specific architectural decisions made,
on the traffic patterns, on the social aspects of how an organization works.
I've had the good fortune of being part of a few teams
that have had to scale up to hundreds of millions of users,
and no story has been correct.
This is one of the things that always used to annoy me about,
I'm glad to say it doesn't seem to happen as much anymore,
but when people would point at specific technologies,
like Ruby doesn't scale or something like that, that's a meaningless statement. What does that
mean? It certainly has scaled for some people in some environments. It just depends on what you're
actually doing. And like you said, there's no blanket advice that seems to work for everybody.
There are principles, I think. And if we worked really hard, we could probably dig out some of
those principles. But the idea that there's like a one-size-fits-all pattern, that seems to come from people who are trying to sell you something.
Oh, yeah. At the time we were doing this recording, there was recently a great tweet by GitHub's, or Jithub's, depending upon pronunciation, CTO.
Well, I'm Canadian, so, you know.
The best part of this show is mispronouncing things. It's not Postgres, it's Postgresqueel. I digress.
The question that he was asking was,
if you're going to start a new company today,
what technical stack do you pick?
What cloud provider, what language, et cetera, et cetera.
And my response to it is, oh, that's easy.
It's the one that the engineers I'm hiring are conversant with and want to work in.
Yeah.
Because I can look around the landscape
and see an awful lot of business failures
for a variety of reasons.
I'm really hard-pressed to identify any of them as,
ah, they picked the wrong technical stack.
Yeah, how many companies have actually been sunk
by a decision like that?
It literally never happens.
You know, and for what it's worth, I completely agree.
The right tech stack is the tech stack
that you have experience with,
the tech stack that you're comfortable with.
Way more important, and it's funny because people,
I don't know, sometimes I feel like we talk about this less,
but it's how comfortable are you with everything else?
Who cares what programming language your code is written in
if you're not confident in the way that you actually deploy changes?
Or if you're not confident in the way that you configure how traffic is routed to it.
You know, that stuff all, I would say, arguably matters a lot more than the actual,
you know, expression of business logic that gets converted into machine code.
It really is. And that's what I want to ask you about, too,
is that you have exposure to a bunch of different stacks, presumably,
because you are the instrumentation engineer who's made of lead. And you wind up building these integrations into every godforsaken stack that all of your customers are going to be
using, or any of your customers are going to be using. Which means that you get to touch a lot
of different languages, you get to touch a lot of different platforms, presumably. Is that correct?
Or am I dramatically overestimating Honeycomb's compatibility with different systems?
Oh, no, no. You are absolutely on the nose there. When I was being interviewed by Honeycomb,
we have a coding exercise that we send to a lot of candidates. And the only difference with me
from an average product or platform engineer at the company was they had me do it in a number
of languages just to see how
comfortable I was moving from one platform to another because being on the instrumentation
team, that is definitely part of the job. So at this point, it's one of those questions that I
always used to ask my parents, am I the favorite or is my brother? And the answer that they gave
was, you're my children. I can't stand either one of you.
So to that end, what is your favorite stack to integrate with and your least favorite stack?
Because, you know, it's not really a podcast unless you enrage people.
Right.
Yeah.
So 100% based on what we were saying earlier, the ones that I prefer, I'm going to surprise you.
They're the ones that I have the most experience working in.
And so I've trained my brain to think in a number of different ways, I think fairly well.
I'm a really big fan of functional programming, a little.
So I like languages that tend to support a little bit of functional programming.
I come from a background, accidentally, I ended up doing a lot of Scala at a lot of different companies.
And so I'm very happy working there.
But conversely, I also really like working in Go,
one of the languages that is often kind of made fun of lovingly
for being a very basic language.
It's not too fancy in terms of features.
I want to be very clear here that my position is that language bigotry is awful.
Oh, yeah.
It's one of those ways of gatekeeping, and it drives me nuts.
It doesn't matter what language you pick.
I can write shitty code in all of them.
Absolutely.
And I have, and I will.
It may not even compile.
It's so bad.
I mean, personally,
I don't get JavaScript to save my life.
It does not match my understanding of the world.
Python, conversely,
is something that aligns much better with how I see things.
And Ruby was also a great devil for me for a while.
I was also heavily into Perl for a long time.
But again, as an old ops person,
my favorite language is and always will be bash scripting.
Oh, beautiful, yes.
It's funny, I have a very similar experience.
Maybe it's something about us ops people.
But JavaScript, I have not trained my brain to work that way.
I completely agree with you about language bigotry
being awful and a form of gatekeeping.
And so my approach is when I see somebody who's
proficient in JavaScript and can write
wonderful applications in Node or
browser applications in React,
I'm in f***ing awe.
It's just a way that I haven't managed to make
my brain as compatible.
The challenge, of course, is that it's your
responsibility to fundamentally support all
stacks. So how do you approach
doing an integration in a language or stack with which you're not familiar?
Yeah, that's a great question.
So part of it is you just got to dive in
and kind of work through it,
which I think if you've worked in enough companies
that have different languages and different stacks,
you might have some experience doing.
I've worked in companies where,
I worked in one company once
where we started the whole
microservices journey and we regretted this decision, spoiler, but we said everybody can
choose whatever language they want to use because it doesn't matter at the end of the day. We're all
talking to each other over HTTP and JSON APIs. So that resulted in this Cambrian explosion and
surprise, if you wanted to go and work on something on a different team
or that a different team had created, it's going to be in a language you may have never even seen
before. And so part of it is you just got to kind of dive in and be willing to learn. Where there
are real gaps or weaknesses, that's where hiring becomes important. It's funny, I've been a hiring
manager in previous lives, and I've been involved in hiring processes at a bunch of different companies.
And I'm very opposed to just hiring based on specific technology or language experience.
But sometimes you have to say, ooh, it's a real bonus if this person fills a gap that we don't have on the team at the moment.
Oh, absolutely.
I think that hiring is one of those hard parts where it's easy to fall into the very common trap of never, ever wanting to hire someone who's weak in something as opposed to, okay, great.
Maybe your Python is crappy, but we have three engineers already who are great with it.
But if you know Ruby and we don't, cool.
That's a strength, not a weakness.
Hire for strengths.
Forget the, I can come up with some puzzler problem to put on a whiteboard that'll stump you. Hell with that. Show me what you're best at. I want to see you shine. I don't
want to see what it looks like when you're sitting there flailing because you haven't brushed up on
your comp sci curriculum in 20 years. Oh God, absolutely. I was very pleasantly surprised,
as an aside, when I was interviewing this last round and I joined Honeycomb about a year ago,
I did a pretty extensive job hunt and I ended up doing a fair number of onsites.
I think it was like six in total, which seems exhausting now just thinking about it.
But I was so relieved that no one had asked me.
Six conversations or six different trips to San Francisco to visit them onsite?
Three of them were trips, two of them were remote, and one of them was local.
Okay, those are actual separate interviews with different folks at different times.
Okay, yeah, that's a lot of back and forth.
It's a decent amount, but I was so pleasantly surprised to see that nobody asked me one of those whiteboard questions.
Not a single thing that would show up in f***ing Cracking the Coding Interview or, you know, LeetCode or whatever other tool you want.
Yeah, part of it is also just this,
it's this almost corporate hazing sense.
It sounds weird, especially given that,
let's be honest here,
most of the audience of the show
has an engineering background,
but I personally find hiring folks
who are either engineers or engineering adjacent
to be way easier than a lot of other hires.
For example, if I'm hiring another cloud economist
who needs to be able to delve into AWS
and have some SRE experience
and be able to look at this
from a financial analysis perspective,
great, I've done a lot of that myself.
I know exactly what to look for,
what to ask, what to uncover.
Whereas if I'm hiring for, I don't know,
a product marketer or an accountant or a graphic designer,
I have no earthly idea how
to even frame the question. Part of the challenge then is that in many cases, if you're not reaching
out to experts who are great at this stuff to help with the winnowing and interviewing process,
you're probably going to wind up hiring the person who sounds the most confident, which is
kind of awful. Right, exactly. I think the only thing I've ever found that can even begin to crack that for
me is ask people what they've done and then delve into really, really specific follow-up.
If somebody comes in and says, I'm great at X, great, tell me about a time when you used X to
good result. And obviously you're going to run into people who are just really good at self-selling,
but I think if you ask enough follow-ups and if you look for things like communication skills,
their ability to connect their effort with outcomes and things like that,
you can still get pretty good results. This episode is sponsored in part by
ChaosSearch. Now their name isn't in all caps, so they're definitely worth talking to. What is
ChaosSearch? A scalable log analysis service that lets you add new workloads in minutes,
not days or weeks. Click, boom, done. ChaosSearch is for you if you're trying to get a
handle on processing multiple terabytes or more of log and event data per day at a disruptive
price.
One more thing for those of you who've been down this path to disappointment before.
ChaosSearch is a fully managed solution that isn't playing marketing games when they say fully managed.
The data lives within your S3 buckets, and that's really all you have to care about.
No managing of servers, but also no data movement.
Check them out at chaossearch.io and tell them Corey sent you.
Watch for the wince when you say my name.
That's chaossearch.io.
I think you're probably right.
I think that there's a lot to be said for digging into things.
What I love is asking open-ended questions in interviews.
And at some point, one of us is going to get to a point of, I don't know.
I'm either learning something or I'm seeing how people think and what they do when they hit a wall, which, especially for senior roles, is incredibly important.
You don't want folks who are going to sit and not go anywhere.
It's great.
I'm blocked.
How do I resolve this?
What do I do?
And in my case, it's reach out to people.
Look on the internet.
Do some searching.
But don't sit there and stand at the whiteboard and tear up.
It's one of those, yeah, we don't know these things off the top of our heads.
No one does.
So ask.
That's the point.
I want to see people saying that they don't know how to do something.
Yeah.
And this is one of the hardest things to do.
But when you do manage it, and I don't have the perfect answer, but I've seen it.
When you get some kind of collaboration happening in the actual interview, right?
And you get a sense of like, oh my gosh,
this is what it would be like working with this person
because we're actively collaborating on a problem
that none of us know the actual answer to.
In other words, what we're paid to do day to day.
So to that end, I have to ask you,
given that you see a lot of this, what makes writing
slash shipping slash producing software harder than it needs to be?
You know, I think there's a few different things there.
Writing and communicating, I mean, that's hard because you're dealing with human beings.
And to our previous discussion about software stacks and tools and tech and processes,
there's no perfect answer.
And so the hard thing is figuring out
what do you actually need to communicate?
What do you actually need to do?
In terms of shipping software,
I think that that comes from making it harder
than it needs to be by creating situations
where you're scared to touch
anything. My background as an SRE, the thing that always terrifies me the most is the service or the
software that people don't touch very often. It's the stuff that maybe, you know, it's harder to
find out how it works or how it breaks or whatnot, because frankly, you just never have a need to interact with it. That's the stuff that really scares the crap out of me.
From your perspective, what's, I guess, what's the interesting part of software versus what's
the part of it that no engineer should ever have to touch or do again? What is the valuable part?
What should engineers of the future be building, focusing on, working on? And what should folks never think about again?
I like the fact that you're coming from an engineering perspective, because normally
if I ask questions like that, it turns into a sales pitch answer.
Yeah, exactly.
You know, it's funny because I find myself kind of conflicted here between what I like
to do and what I believe to be actually correct.
And what I mean there is, I like thinking about all the plumbing that makes
software go. You know, I like thinking about infrastructure. And I like thinking about
writing tools and helping create things that make it easier for other software developers
to push code to production and help users and delight users and all that sort of thing.
And that's exactly what most businesses shouldn't have to worry about.
They shouldn't have to employ people like you or I from ops backgrounds who just know how to make the stuff go
because that should just be a given.
I was talking earlier about serverless
and some stuff that I think is hopeful there.
The average software developer, I think,
who wants to delight users, who wants to create things
that create value for a business and for customers,
they don't want to care if it's running on Kubernetes or if it's running on spot instances or things like that. They just want to push it and they want to go.
The tricky part comes in when it breaks. And when it breaks, we want something that we have that
sort of ability to introspect and debug, even if, you know, it's hidden behind some kind of abstraction.
And that's a balance that I don't think we've seen yet in the industry, but I think we're
getting closer. Let's see, when I have conversations with folks like you, and we discuss these types of
things, and the answers always seem so eminently reasonable. And then I leave the ivory tower of
my podcasting studio and go back into the world.
And then I see the nonsense everyone's building instead.
It feels like on some level, there's two worlds,
the aspirational way that we all want to be doing things
and then the messy way that we really are doing things.
And I'm starting to despair of ever being able
to fully bridge that gap.
Oh, interesting.
By the ivory tower, what would be an example
of like an ivory tower perspective or point of view?
Oh, sure. Any conference talk you've ever seen on any technology
under the sun where they talk about how they wind up seamlessly deploying software into production.
CI/CD stories, for example, are notorious for this. It's the, you watch these amazing
presentations like, wow, I'd love to work at a place that did things like that.
And the person next to you says, yeah, me too.
And you look at their badge
and they work at the company the presenter works at.
It's the myths we tell ourselves.
Sometimes individual groups wind up solving these problems
within larger companies.
Sometimes it's a new thing that they're running in tests,
but haven't rolled out everywhere.
And let's not kid ourselves,
if it touches the payment system,
everyone's doing waterfall development, whether they admit it or
not. But there's a broader world out there of folks who want to be doing things the right way.
They want to be getting rid of the boilerplate and stop reinventing the wheel and reimplementing
the wheel and get onto doing the truly interesting and innovative stuff. And those people right now
are all listening to this while going back to code a login page. You never get past it on some level. That's what bugs me.
And you know what's super interesting about that? In my experience, which may not be representative,
but the places that I've seen that have accomplished the closest to that kind of story, have done it in really simple, almost kludgy ways.
And what I mean by that is,
I've never personally worked somewhere
where we had this great system
that tracked state of all of these different services
and made sure that there was traffic going from here to there
in a way that was canary testing and everything.
That all sounds like a lot of moving parts.
The best places I've worked have a freaking cron script that just pushes out changes
or has a webhook that kicks off something that pulls down a tarball from an S3 bucket
and then ships it to a machine.
Oftentimes this stuff, I think it doesn't make for sexy conference talks,
but it's just roll up your sleeves kind of work to get it happening and then move on to something else. I think sometimes we
maybe trip ourselves up by wanting it to be more interesting than it actually is.
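As a rough sketch of how small that kind of deploy step can be, the Python below unpacks a build tarball into a versioned directory and flips a `current` symlink to it. Everything here is an assumption made for illustration: the `deploy` function, the `releases/` and `current` layout, and the symlink switch are not described in the episode, and fetching the tarball from S3 is left out entirely (the function just takes a local path). It also assumes a POSIX filesystem and a tarball produced by your own build.

```python
import os
import tarfile
from pathlib import Path


def deploy(tarball: Path, root: Path, release: str) -> Path:
    """Unpack a trusted build tarball and point `current` at the new release."""
    release_dir = root / "releases" / release
    # Fail loudly rather than overwrite a release that already exists.
    release_dir.mkdir(parents=True, exist_ok=False)

    with tarfile.open(tarball) as archive:
        # Assumes the artifact is trusted; untrusted archives need path checks.
        archive.extractall(release_dir)

    # Create the new symlink under a temporary name, then rename it over
    # `current`. On POSIX the rename is atomic, so anything reading `current`
    # sees either the old release or the new one, never a missing link.
    tmp_link = root / "current.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(release_dir.resolve(), target_is_directory=True)
    os.replace(tmp_link, root / "current")
    return release_dir
```

A cron entry or webhook handler could call `deploy(Path("/tmp/build.tar.gz"), Path("/srv/app"), "2021-01-05-1")` (paths and release name hypothetical). Rolling back is then a matter of pointing `current` at an older directory under `releases/`, which keeps the whole mechanism easy to debug when it misbehaves.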
That's part of it. If we were completely honest with people at what we were actually building or
working on at any given point in time, the answer would be incredibly depressing and we would just
be sadder after explaining our jobs to people. I try not to give talks to classrooms full of schoolchildren anymore on what I do for a
living for that specific reason. And yeah, I agree, but isn't it great sometimes that this
shit works? If the point is to deliver value to customers quickly and efficiently, maybe investing just enough to make that work repeatably and in a way
that people trust, and frankly, is simple enough that you can also debug when it doesn't do the
thing that it's supposed to do, maybe that's actually enough. Maybe we're sometimes over
investing in complicated solutions that might fall into that accidental complexity scenario we
were talking about earlier. Well, all right, let's take that to its logical extent here.
Here's something I know for a fact you have an opinion on.
Now, I have opinions on things too,
which would surprise no one who listens to this show,
but what do you think stops engineers
from wanting to be on call for the service that they work on?
Now, there are a couple of answers I have to that.
One is the polite public answer, and one's the real answer. But I'd like to hear your answer. Yeah, sure. I want to come back
to the difference between the public and the private answer, too. My answer is there's a whole
bunch of stuff. One of them is social. And I think this is more common, is that engineers are on call for things and they don't feel like
they have necessarily the autonomy to actually react to things the way that they need to.
And what I mean by that is, I've certainly seen this and I've done my part to try to fix it or
to encourage others to empower people to fix it or whatever the hell you need to do.
But people get paged and they're like, oh, that alert means nothing. Okay, so get rid of that alert.
I can't do that. Why not? Just do it. You know, if you're getting paged for something, you have
the right to change the system that is alerting you to something. And what you're seeing on the
ground level, you know, as the on-call engineer should be gospel.
It should be the thing that dictates
how the future person experiences that role.
And if you don't have that,
it's a really shitty experience.
I think the other thing that's technical
is this notion of accidental complexity.
Is when you have a system
that you're responsible for,
that you're on call for,
and it's just, you know, whether it's
because of over-engineering or it's just out of necessity complex, you don't know how to insert
yourself when it f***s up, right? Like you can start to look at it and say, okay, you know,
we've got a drop in traffic or we've got a spike in error rate or something like that.
But if you're nervous about the actual mechanics that get your changes from your laptop to the production environment,
then it can be a really terrifying experience to make changes.
And I've been in environments where people just freeze, and it sucks.
So that's why I always think of, if you can make it easier to get the changes from your laptop to production,
that is the best investment that you can possibly make, technically.
I would agree with that sentiment. It feels like when you talk to software developers who are
building these systems and then complaining about a problem in production, here, log into the prod
server and see, well, this looks nothing like their IDE. It looks nothing whatsoever like their
development environment. And people feel awkward and out of sorts there. I mean, I intentionally,
in years past
when I was working in ops roles,
made production uncomfortable to work in, intentionally so,
because that's not your default place to operate in.
But if people are used to using Visual Studio Code,
for example, then, okay,
now the only editor we have installed here is vi.
So you're going to have to spend some time learning
even to look at what's going on here.
That's an awful experience. Not to mention that people are never doing these things during the
workday, invariably. It's always two in the morning when you're bleary-eyed and have no
idea what you're doing. And congratulations, you're being confronted by the puzzle master.
It doesn't go well. No, and that's actually a great point that I think is within our control
as engineering teams to change.
Yeah, it'll happen at two in the morning, that's for sure.
Any 24-7 service that you're on call for, it's going to break at an uncomfortable time, and you're going to have to debug it.
But that doesn't have to be the only time you do this stuff.
And in fact, when it is the only time you do this stuff, that's terrible.
And that's why I'm a big proponent of have, you know, have fire drills, have game days,
break the shit that breaks often so that you know how it breaks when everybody's around,
you know, resolve those kinds of uncertainties as much as you can, because obviously some things
are just unknowable, but practice those muscles as often as possible. There's a funny thing that
we talk about sometimes at Honeycomb and, you know, this sounds like a humble brag, but it's not.
It's just that there are periods of
time where we don't have as many incidents. And that makes it actually really hard to make sure
that people are primed to be on call. And so we're thinking through, what can we do to just
make it more comfortable? Like if someone comes on board and their first on-call rotation is quiet,
that doesn't really help them, right? So what can we do to kind of
force interaction with production as often as possible to make it almost routine and muscle
memory? I talked with companies back when I was looking at various roles where, oh, everyone is
on-call. And you hear that during an interview, and having been through many on-call rotations
myself, it's, yeah, that's not a strong point, to be perfectly honest with you. That sounds like,
if you're not very careful how you position this, that everyone is woken up for every incident and
I won't get a whole lot of sleep working here. And not to be unkind, you're not paying significantly
more than other folks who don't subject me to that. Yeah, that's terrible. Everybody is on call.
That reminds me of the companies that I worked for, I don't know, before a certain time when, I don't know,
maybe it was that PagerDuty became a ubiquitous tool
that was used in companies of a certain size.
But it was that old time when the first person to respond
is really the person who's on call.
And that's a terrible environment, and it's a recipe for burnout.
You should have a clear escalation path,
and you should have clear responsibilities.
And every engineer should have a huge chunk of time and you should have clear responsibilities. And every engineer
should have a huge chunk of time when they're not on call. And they know that they're not on call,
so they can delete Slack from their phone. They can turn off all of their alerts. And when they're
done at the end of the day, they're just done. So yeah, I would also run screaming from a company
who said that these days. Anecdotally, there's two questions that I always like to ask companies
when I'm looking for jobs and talking to companies.
One is, how do you get your code to production?
Walk me through as many steps as you're comfortable
disclosing in an interview, which hopefully is a lot.
And two is, how are people put on call
and what's the last major incident you had
and what does it look like?
Who was involved?
What happened?
How did people get the support they needed? All those questions are really, really
interesting ones to dive into. I wish you had more opportunities than interviewing to ask other
companies this stuff. Oh yeah. My personal favorite way of responding to that, which is why I generally
don't get offered jobs a whole lot is, so you have an on-call rotation here. Oh yes. It's absolutely
critical that our site is up all the time.
Cool, so why don't you staff multiple shifts of people
who are responsible for keeping the site up
during those times so that you're not making people
wake up in the middle of the night to break things.
And suddenly we're in one of those,
what I say versus what I do are different territories
and that becomes a problem.
Oh, are you talking about like follow the sun rotations?
Either follow the sun or having folks who are either night owls who enjoy night shift
or something for, and I'm not talking small startups here.
I'm talking companies that have 1,500 engineers
working there.
It's at some point you have multiple offices
in various places.
Why are you still waking people up
in the primary time zone every week?
Sorry, when I say primary time zone,
I should be very explicit on this.
There's always a time zone hierarchy in every company.
It's the headquarters time,
and that is how it's going to be,
regardless of what companies claim otherwise.
Of course, it's the center of their universe.
And invariably, it seems to be Pacific West Coast.
Yes, exactly.
Yeah, I agree completely.
At a certain size,
and there are plenty of companies that I think are doing this,
but you have the opportunity
to let folks in North America time zones
just stop working.
And then folks in European time zones
will take over.
And then folks in certain Asian time zones
will take over.
And yeah, that is a great way to do things,
I think, if you can manage it.
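As a rough illustration of the follow-the-sun handoff described above, here is a minimal Python sketch. The three regions, the eight-hour UTC boundaries, and the names SHIFTS and region_on_call are all assumptions made for the example; nothing here reflects how any particular company actually runs its rotation.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start_hour_utc, end_hour_utc, region).
# The boundaries are illustrative assumptions, not taken from the episode.
SHIFTS = [
    (0, 8, "Asia-Pacific"),
    (8, 16, "Europe"),
    (16, 24, "North America"),
]


def region_on_call(moment: datetime) -> str:
    """Return the region whose shift covers the given moment."""
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them outright.
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise RuntimeError(f"shift table does not cover hour {hour}")


if __name__ == "__main__":
    # Show which region would hold the pager right now.
    print(region_on_call(datetime.now(timezone.utc)))
```

The point of the table is that each region only carries the pager during its own slice of the day, so nobody in the headquarters time zone is woken up by default.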
So I guess my last question for you, since I've been peppering you with these,
is if people want to learn more, where can they find you?
I am on Twitter.
I'm not sure how much value you'll get from my tweets,
but every once in a while, maybe I'll tweet something
that at least provokes some discussion.
Paul Osman.
And I very occasionally blog at paulosman.me. And I think that's it.
And we'll put links to those in the show notes.
Excellent.
Paul, thank you so much for taking the time to speak with me today. I really appreciate it.
No problem, Corey. I really enjoyed the discussion. Thanks a lot.
As did I. Paul Osman, lead engineer of instrumentation at Honeycomb. I'm cloud
economist Corey Quinn, and this is
Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your
podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star
review on your podcast platform of choice and a comment telling me why your on-call rotation
is different and unique. This has been this week's episode of Screaming in the Cloud.
You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.
This has been a HumblePod production.
Stay humble.