Screaming in the Cloud - Writing the Book on Service Level Objectives with Alex Hidalgo
Episode Date: December 24, 2020

About Alex Hidalgo
Alex Hidalgo is a Site Reliability Engineer and author of the upcoming Implementing Service Level Objectives (O'Reilly Media, September 2020). During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Links Referenced:
Buy Implementing Service Level Objectives on bookshop.org
Buy Implementing Service Level Objectives on Amazon
Follow Alex on Twitter
Alex’s personal site
Corey’s landing page for Implementing Service Level Objectives
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
[...] actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent, not to mention lower. Their
performance kicks the crap out of most other things in this space, and my personal favorite,
whenever you call them for support, you'll get a human who's empowered to fix whatever it is
that's giving you trouble. Visit linode.com slash screaminginthecloud to learn more and get $100 in credit to kick the tires. That's linode.com slash screaminginthecloud.
[...] not days or weeks. No front-end frameworks to figure out or access controls to manage.
Just ship the tools that will move your business forward fast. Okay, let's talk about what this
really is. It's Visual Basic for Interfaces. Say I needed a tool to, I don't know, assemble a whole
bunch of links into a weekly sarcastic newsletter that I send to everyone. I can drag various
components onto a canvas. Buttons, checkboxes, tables,
etc. Then I can wire all of those things up to queries with all kinds of different parameters,
POST, GET, PUT, DELETE, etc. It all connects to virtually every database natively, or you can do
what I did and build a whole crap ton of Lambda functions, shove them behind some API Gateway, and use that instead. It speaks MySQL, Postgres, Dynamo, not Route 53 in a notable oversight, but nothing's perfect.
Any given component then lets me tell it which query to run when I invoke it. Then it lets me
wire up all of those disparate APIs into sensible interfaces, and I don't know front-end. That's the
most important part here. Retool is
transformational for those of us who aren't front-end types. It unlocks a capability I
didn't have until I found this product. I honestly haven't been this enthusiastic about a tool for a
long time. Sure, they're sponsoring this, but I'm also a customer and a super happy one at that. Learn more and try it for free at
retool.com slash lastweekinaws. That's retool.com slash lastweekinaws and tell them Corey sent you
because they are about to be hearing way more from me.
Welcome to Screaming in the Cloud.
I'm Corey Quinn. I'm joined this week by Alex Hidalgo, who's a site reliability engineer and, due to an
escalatingly poor series of life decisions, a recently published author, specifically
the book Implementing Service-Level Objectives.
Alex, welcome to the show.
Thanks, Corey.
So, every person I've talked to who's written a book has given me the thousand-yard stare when I ask if I should write one and then immediately begin screaming, no, never do this.
I've come to the conclusion that nobody actually wants to write a book.
They want to have written a book.
How accurate is that?
I think there's absolutely some truth to that.
It is difficult.
It is tiring. It is emotionally draining. And having to [...] is about people. It sounds like it's
about service level objectives, and I guess it kind of is, but it's mostly about how to use those
to make people's lives better. And I've seen how this process can do that. And so having something
out there that I hope will help people's lives is ultimately rewarding. And there were more
tears of frustration than there were tears of joy, but there were both.
So let's start at the very beginning here.
I know the book is, at the time of this recording, it is in print.
They are starting to ship out.
I have not yet received my copy, but of course I have ordered one.
I keep an eye on whatever O'Reilly releases, and especially when it's people I know, it's a no-brainer.
It will, in all honesty, sit on the shelf and never get opened because I will
do my actual reading on the Kindle. But having something on the shelf is, for something like
this, especially when you know the person that wrote it, is just the right thing to do. But start
at the beginning for me here, because it turns out that I am a white guy in tech, which means my
failure mode is a board seat and a book deal somewhere. Everyone assumes that I know everything
about everything, and I tend to not shatter that illusion very often. But I have no earthly idea
what the hell a service-level objective is. Since it's just you, me, and the thousands of people
listening to this, what is an SLO?
So an SLO, well, it's an objective for your service. You can't be
perfect. The story I like to tell is imagine you're using
a streaming media service of some sort, a Netflix, a Hulu, a Disney Plus, whatever.
And when you're using this, normally when you start a new video, it buffers for a few seconds
and you're fine with that. And that's technically not perfect, but it turns out these services don't
have to be perfect for you because you're fine if it buffers a bit. But by the same token, if it buffers for like 20 seconds, you don't love that, but you're not
going to abandon the service unless it buffers for 20 seconds every single time. Then you may say,
screw this, I'm moving to a competitor. So the idea is find out what your users can tolerate
and make sure you're only failing that often. If you're not losing users, if people are still happy in general with your service,
if you only take 20 seconds to buffer one in 50 times, then aim for that.
Because you're going to spend too many resources, both financially and via your developers and your support engineers,
if you try to make everything 100% all the time.
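As a rough illustration of the buffering example, here is a minimal Python sketch of checking a target like "only one slow start in 50." The function name, the event counts, and the choice to treat an empty window as passing are all assumptions made up for this sketch, not anything described in the episode or the book.

```python
def meets_slo(good_events: int, total_events: int, target: float) -> bool:
    """Return True if the fraction of good events is at or above the target."""
    if total_events == 0:
        # Assumption: with no traffic in the window, nothing was violated.
        return True
    return good_events / total_events >= target


# Hypothetical numbers: 1,000 video starts, 15 of which buffered for 20+ seconds.
slow_starts = 15
total_starts = 1000

# Tolerating one slow start in 50 means aiming for 98% good starts.
target = 1 - 1 / 50

print(meets_slo(total_starts - slow_starts, total_starts, target))
```

With these made-up numbers, 98.5% of starts were fine, so the service is inside its target even though it was not perfect.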
It sounds on some level like it's a derivative of SLA, service level agreement. What's the difference?
So the difference is a service level agreement, and they've definitely been around much
longer. Actually, in some of my research, I've found-
I've seen SLAs in contracts all the time
when negotiating those. I have never seen the phrase service level objective in a contract, which means that lawyers will not know what I'm talking about if I use SLO,
I suspect.
Yep, exactly. An SLA is something you put into a contract and it generally implies that
you owe somebody something, whether it's credit or actual money, if you violate that. SLOs are an approach to thinking about the reliability of your service.
They are promises in some sense,
but definitely not contractual ones.
They're tools.
They're a bit of data that help you make decisions.
Are we buffering too long too often?
Is this page not loading correctly too often?
You know these things are going to happen
and just make sure it's not too much of the time.
It kind of accepts the same thing as an SLA does.
SLAs are generally not 100%, because, again, people realize something will break at some point in time.
SLOs think about things in the same way in that sense, but they're used to help you make decisions.
Do we need to focus on this part of our product? Do we need to focus on this?
So help me understand this in the context of a story
that I've related from time to time
on various forms of podcast and whatnot.
Years ago, I was trying to buy a pair of socks
on amazon.com and I clicked the buy button
and I got one of their error pages,
which of course features dogs.
In all honesty, the dog page is more satisfying than any other page on amazon.com. If I listen to the common wisdom,
that would mean that during that outage that lasted about an hour or so, I would have therefore
gone to a competitor to buy the pair of socks or alternately one day out of the week on the day
that that pair of socks should have been there, I would just go without socks whatsoever. In practice, oh, that's weird. I never see that. Ha ha. I come back an
hour later, I buy the socks and life goes on. There was no loss of revenue in my case for
amazon.com during that outage. However, if every third time I tried to buy something at Amazon,
I got the dog page instead, I'd probably spend a lot more money at Target. So is that a naive storytelling, I guess, understanding of a much more complex concept of
SLOs? Are they related, or is this completely out in the weeds and a boring story we should
make sure we drop on the floor in post-production?
No, that's exactly it. If you're down for an hour,
chances are people are just going to be like, huh, this
happens. Stuff breaks. People are used to it. They're actually mostly okay with it. And you're
probably just going to come back and check in an hour. That's exactly correct. That's the whole
point. You can be down for an hour. You just can't be down for an hour too often. So find out what
that is. Find that percentage. Can you be down for an hour once a month, twice a month? It's going to be
different depending on your service. If you're a specialty retailer and no one else sells your stuff,
then you can probably actually be a little bit more lax than if you're someone like Amazon,
who's also expected to be constantly up because they're the largest company in the world in some
sense. So no, I think you got it right. Exactly. It's just that implementing SLOs, there's a lot of math that goes into it.
There's a lot of discussions you have to have.
It's not easy to just pick a number.
So they're simple to talk about,
not always easy to implement.
That's why there's a whole book about it.
But your story, that's exactly it.
That's exactly the whole point.
B&H, the photo company,
closes for 24 hours for Shabbat every week. And they wind up having their
website up, but they say, yeah, you can't actually make a purchase till Shabbat ends. On some level,
it's kind of their own brand now at this point, and it seems to have worked out reasonably well.
But as you say, it comes down to what your story is as far as approaching the market.
I'll go back an hour later to buy socks because I need socks.
I'm not going to go back to your website an hour later
to click on an ad that wasn't displaying.
Exactly.
So a meaningful SLI is a measurement of how your service is operating
from your user's perspective.
And when people ask me to explain it a bit more,
I'm always like, well, it's the same thing as a KPI
or key performance indicator for the business side.
Or if you were to talk about this
to a product manager,
they would say, oh, it's a user journey.
You've got to take everything into account.
People are going to have
different expectations
for how different parts
of the service work.
As you said, there's a difference
between, you know,
perhaps something like a button
not registering a click on the first try,
but registering a click on the second try.
That's a thing people aren't going to be too upset about. And that can probably fail more often
than just not being able to check out entirely.
When you're looking at SLOs through a lens of
things we should strive to do, how does that keep from becoming a, we'll try our best,
which sounds great, makes everyone feel good, but isn't something that
is easy to represent as either having value or matters at all to the larger business.
So I think you really do just try your best, but I understand, I agree that that doesn't sound
like a great sentence, even though it generally is true. You know, you'll like, you'll try your
best and you'll try to make sure your service is good,
but, you know, it can't always be perfect.
And I get that that kind of language isn't great,
but that's actually exactly why I think
the whole SLI, SLO error budgets,
which we haven't even talked about,
measurement of your SLO over time
as opposed to kind of right now.
This is actually an example where the numbers can help.
You can point to things and say like, yeah, but we were only unreliable for four minutes and 32 seconds last
month or something along those lines. And that's how you can kind of help explain to people that,
yes, in a sense, this is, we're just going to try our best because we cannot try our perfect.
That's not a thing. People get to understand what you actually mean with that.
When you're saying, I'm going to try my best, you're actually saying, well, we're aiming to
be 99.95% reliable. And that translates to X number of minutes per month that we may be
unreliable. And that can often help people understand, oh, huh, maybe trying your best
is actually good enough.
I really wish that more people were explicit about saying trying your best is good enough
because I can't shake the feeling that that is not a well-circulated belief in far too
many places.
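The arithmetic behind "99.95% translates to X number of minutes per month" can be sketched in a few lines of Python. The function name and the 30-day window are assumptions chosen for illustration; real SLO tooling may use rolling windows or count events rather than minutes.

```python
def allowed_downtime_minutes(target_percent: float, days: int = 30) -> float:
    """Minutes of unreliability a time-based SLO target permits over a window."""
    window_minutes = days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


# Compare a few common targets over an assumed 30-day month.
for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.95% target leaves roughly 21.6 minutes a month, which is the kind of concrete number that makes "trying your best" something people can reason about.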
Totally agreed.
But it's the truth.
It's how things actually work.
Things fail.
People fail.
And it turns out people actually know that.
When you're running a business,
the end goal, of course, is to make money and make as much money as possible, or at least for
most businesses, that is. There are absolutely outliers there. And that means, you know,
your executives or whoever owns the most shares or your shareholders, if you've gone public,
that's their goal, right? At some level, they want to make money and they want to make as much money as possible. And therefore they think the way to
do that is to aim for perfection, to aim for a hundred percent, but you're always going to falter
if you do that. And those are the people who can be most difficult to convince that it's just not
prudent to aim for a hundred percent, but those people can get there as well. It can take time. It can take a lot of examples,
but don't let great be the enemy of the good. It's a concept that as humans, we know so well
that we have an idiom for it. And you just got to figure out how to translate that to a business
where, again, they think they want to be 100% because they think they need to do that to make as much money as possible.
But you can actually often save in resources by not trying to be perfect because the amount of money you're going to spend, it ends up almost becoming like a limit approaching infinity.
That curve is going to shoot straight up in how much money it's going to cost for you to try to be as reliable as possible.
You're going to have to run multi-region and you're going to have to pay for quicker replication.
And, you know, you're going to have to hire more engineers
because like, if you want to be up almost all the time,
you're going to have to have people who are on call
who can respond within like 30 seconds.
The only way you can do that
is if you have like a follow the sun rotation
where you have offices all over the world,
because otherwise not everyone's going to wake up at 3 a.m.
So you got to make sure it's someone's 3 p.m. instead.
You can see how quickly this escalates.
It can actually cost you more money.
You can actually make more money by having a more reasonable target.
Think about what your users actually need from you,
and often, well, again, humans expect failure.
They're cool if stuff doesn't work every once in a while,
as long as it doesn't fail too often.
One of the only other places I've seen SLOs discussed
in any serious capacity was the SRE book
that came out of Google.
And there was a lot of good stuff in that book,
but I had a bit of a negative reaction to it
just because that came out right when I was
in the middle of getting an awful lot of,
we're Google, we're smarter than you,
flak on other fronts.
The people who wrote that book, to be very clear, are great. That is not the impression I have of those people. But it's,
oh, good, how to be more like Google, just what I don't want to listen to. So I largely ignored it
for a while. But Liz Fong-Jones, now at Honeycomb, is a big advocate of SLOs. They're one of my
consulting reference clients. And most of our conversations around cost optimization have centered around SLOs.
So what the expectation for their customers is and the commitments that they have made.
And it was a really interesting philosophy that I haven't seen replicated elsewhere yet.
Yeah, it's something that's gaining a lot of traction.
And actually, so I was at Google.
I was on the
customer reliability engineering team with Liz. And one of the things we did was we went out and
we taught some of Google's largest cloud customers how to SRE. That was kind of our goal. And the
beginning of every journey was you need to have SLOs first. This is our common language. This is
how we talk about reliability. Reliability to an SRE is defined by service level objectives.
And so while it's still outside of Google, still kind of a growing discipline, lots of
people are doing it.
In some ways, this book is about the two years I spent at Squarespace.
It's essentially my story there.
I joined and people said, we want to do SLOs and we know that you know SLOs and let's do
it. And I was like, okay, sure. Except I didn't realize what it takes to build this kind of thing
from the ground up because suddenly I wasn't at Google anymore. I didn't have all the tooling
that Google has. I didn't have the cultural buy-in, not just by engineers, but across various
different organizations. And I had to do everything from the bottom up. New tooling had to
be created, new software written, had to drastically change how people measure things, how people think
about things. And that's kind of how the book came to be. I was running a lot of SLO workshops,
just internally, where teams could come and I could spend three, four hours with them and
maybe even hopefully end up with a single defined SLI, SLO pair before they
left the room. And it was just getting to be a lot because I was seeing the same thing over and
over again. And I was talking to my colleague, Gabe, and I explained this to him, you know,
it's just getting tiring. And then I said, I wish there was a whole book about this.
So I could just point people to the book. And Gabe said, you should write it. I said, no, no, no, no. We need like the expert to write it.
And he said, you are the expert.
And so I'm pretty sure my response was,
because I knew I was now going to write a book.
It sounds on some level like writing a book
is something that is,
it used to be this thing that people would do,
this aspirational task of I'm going to write a book.
Increasingly, it's starting to sound
from an awful lot of author folks
that it's more like a dead dog
that has been cast into your yard by one of your neighbors.
And now it's time for you to worry about it.
I mean, I don't know if I go quite that far with a metaphor,
but yeah, absolutely.
I mean, I won't lie.
It's awesome that I wrote a book.
That's neat. As a kid, my dream was to be an author. I thought it'd be like fantasy books and not a
technical manual, but still it's amazing. I got the first physical copy yesterday and I bawled.
I cried. It's really, really neat having done that. And I won't pretend that some of the status
that you get with that, I won't pretend I'm totally ignoring that, but absolutely at the root of it, it's more like, yeah, someone needs to do this.
I am the right person for it.
I write well.
I know a lot of people who can help me with this.
I helped with the second SRE book, so I already understood the process just a little bit.
And I was like, you know, this needs to be out there.
And if so, it may as well be me.
This episode is sponsored in part by our friends at New Relic.
If you're like most environments, you probably have an incredibly complicated architecture,
which means that monitoring it is going to take a dozen different tools.
And then we get into the advanced stuff.
We all have been there and know that pain, or we'll learn it shortly.
And New Relic wants to change that.
They've designed everything you need in one platform,
with pricing that's simple and straightforward,
and that means no more counting hosts.
You also can get one user and 100 gigabytes per month, totally free.
To learn more, visit newrelic.com.
Observability made simple.
I'm glad it's you, because first,
I get to wind up reading a book that I don't have to write, which is great, because I'm never going to write one. [...] angles of this from a whole bunch of different perspectives in a much longer form than a blog
post or this is a tweet thread, tweet one of 487,000 and so on and so forth. It's great to
have a single place for it to go. Also, you tipped me off fairly early on that, talk about bucket list
items, I'm referenced in an Easter egg within the book.
Yeah. So actually got to give props to some co-authors here.
I actually ended up writing only about 60% of the book.
I was always planning on bringing in two or three people
who are like experts at very niche parts of this.
And suddenly I had people volunteering all over the place.
And so the other 40% have a bunch of amazing people
from across the industry.
And Polina Giralt and Blake Bisset
wrote a chapter about data reliability. Data reliability is, you've got to approach it in a very different way
than latency or availability or error rates. So we have a whole chapter about data reliability,
and there's a part where they discuss what even is a database. And they asked the question,
is Route 53 a database? And we were able to get a footnote in that says, hi, Corey.
At first, the editor wanted to take it out, and I was pretty adamant that we should leave it in.
Oh, absolutely. The fact that I am referenced in this now means that I'm getting it framed
and hung on the wall. It's, oh, did you write that book? Absolutely not. One of my stupid jokes
made it into that book. And that's, yeah, oh, I'm absolutely going to steal credit from you
in this sense. I can finally have that as my counterpoint to my business partner's story.
He wrote Practical Monitoring for O'Reilly and has it on display on his bookshelf behind him when he's on Zoom calls.
And it's a fun problem because, as it turns out, when you've written a book, it's very hard to bring up the fact that you have written a book because you're proud of
it. You spent a disturbing portion of your life for a while on writing that book, but you can't
open the sentence with that or people find it pretentious and ridiculous. So my position has
always been that if I know someone's written a book, I will drop that into virtually every
conversation when someone who's talking to them doesn't know that fact. I try and be the
one-person promotional band for stuff like that. So do you know that you've written a book?
It still doesn't feel real, even though I got that physical copy and have paged through it one
by one. I spent almost like an hour, not really reading, but kind of reading. It still doesn't
feel real, to be honest, but I totally hear you. So I've given myself the opportunity to brag occasionally,
to gush, because I'm incredibly proud of this. And this was literally the hardest thing I've
ever done in my entire life. So I tried to give myself the opportunity to occasionally
talk about it. But outside of Twitter, where honestly, I don't care that much. I'm constantly
self-promotional there. In the real world, you're absolutely right.
It's difficult to bring up.
I posted a picture of the book
on an internal Slack channel at Squarespace.
And because I let myself say, okay, cool.
I haven't talked about the book at work for several months.
Like here's my once a quarter permitted
single message about it.
And there were engineers at the company
that still didn't know I was
writing the book. Because you're right, it's difficult to bring up. You don't want to sound
like a bragger. But, you know, I do think you have to try to give yourself permission every
once in a while. There's nothing wrong with promoting the things you've done, especially
the ones that you're very proud of.
Looking back, based upon what you know now,
first, would you write the book again? And secondly, what do you wish you'd
known before you started?
Yes, I'd write the book again. At the end of the day, the positives
outweigh how difficult it was. What would I do differently? I would take more time off.
I'm not writing a book again and also working full-time. I did take a few weeks off in December, but having to do essentially
all of this work on weekends and evenings, that was draining.
It was a lot.
And I should have known that I needed to give myself more time there.
Don't do it when a global pandemic is about to happen.
That part was terrible, especially if you're someone like me. I can't write at home.
I need to be at a coffee shop, at a bar.
It's strangely not very unheard of.
Lots of writers are this way.
And when I didn't have anywhere to go anymore,
that was tough.
That was not easy.
Of course, you can't really predict that,
but I had to throw that out there.
And the final thing I'd do: I had a bunch of co-authors
and they're all wonderful
and every chapter turned out great.
But trying to navigate that many people's schedules and that many people's own commitments was difficult.
And I think I just want to be more sure ahead of time to let these chapter authors know this is going to be difficult.
This is going to take you more time than you think.
Can you take a week or two off? Because otherwise you're really going to be struggling with this. So those are the
things. Just give yourself more time. If you're working with other people, ensure that they're
giving themselves enough time and just try to make sure you're in a good situation in general. That
might be a better way to sum up the whole pandemic thing because of course you can't ever control that. But, you know, make sure you have time, make sure
you're comfortable, make sure you're not going through other life events. Don't do this in the
middle of some other crisis or health problems or something like that. You need to be in the best
possible mental state that you can be in before you embark on something like this.
That's a hard and heavy lift in 2020. And again, there is a production delay between the time that
we record this and the time it goes out. Sorry, listeners: when you download something from your
podcast, I don't quickly get someone on the phone and have that talk live. I know, spoiling the
production magic for you. But it seems like this is getting to be such a weird year that from week to
week it's, wow, they didn't even mention the giant meteor. And well, here we are. It's a hard problem to solve for as far as finding time to write. I've written a few
basic outlines of books I've toyed with writing. And invariably in almost every case, I find
another book that's already been written that aligns closely enough that, oh, I'll just talk
about that thing instead. It's easier.
Yeah, that's, I remember when I was first really
getting into service level objectives at Google, I was on a team that was responsible for the
monitoring and alerting for everything else across Google. So we wanted to have well-defined SLOs so
other internal engineers could look to our definitions and understand how reliable we
were aiming to be so they could ensure that their systems were handling things correctly and knew how to retry when they
had to and things like that.
And suddenly I had this great idea of building an SLO repository, like a centralized place
where everyone can define their SLOs and they get some tooling or dashboarding for
free.
And it was a centralized place for you to discover what your dependencies were aiming for.
So you could set your targets correctly.
And I told one of the staff engineers on my team, and he was ecstatic about it.
He was like, oh, my God, Alex, this is a great idea.
This is going to get you your next promotion.
And I spent a few hours starting to, like, outline what it would look like. And then someone else on my team came to me and was like, I just discovered there is an entire team staffed with 10 people working on this product.
So my great idea immediately went up in smoke because someone else already had that great idea first.
But in this case, I looked.
I really did.
And when you fill out a proposal for O'Reilly, they ask you, what are competing books
to yours? We need to know, what are you comparing yourself against? And I listed the SRE books,
including Seeking SRE, David Blank-Edelman's book, because they at least talk about SLOs.
But really, it was tangential. I was like, what I'm writing is strangely new. It's not just much
more expansive, but it's actually a pretty different take than
how they're described in either of the Google SRE books. So that was one of the reasons I really
felt like I had to do it because I looked and I couldn't find what I thought needed to be out
there.
That's probably the greatest sign it's time to write a book, I would imagine, when no one else
is talking about the thing that you want to talk about, or they are, and they're getting it all wrong across the board. My position has always
been to do a snarky take on Twitter or a sarcastic blog post, but there are times you need to go
deeper than that. And to be honest, I'm very glad that people like you have attention spans.
It's easy to fall into a trap, in my experience, of, in the world of Twitter and things like it, it's easy
to attain relative mastery, or absolutely not, but the appearance of relative mastery in the
confines of 280 characters. But then you see people who are legitimate experts in things and,
oh, it turns out that maybe I shouldn't be reinventing all this stuff from first principles
as if I were suddenly Hacker News come to life. There's a definite value in seeing deep, exhaustive research.
One of the things I find most worrying about my increased attention to short-form social media
is that I'm not reading the long-form stuff that really lets you dive into a topic with anything
approaching the frequency that I used to.
Yeah, and I think it's actually,
it's an interesting time in tech because I think we're actually kind of,
a lot of people are pivoting towards understanding
that we need to have a more in-depth understanding
of how everything works.
People have stopped being experts
or deep experts at individual things.
People are always being asked to be full stack engineers
and you've got to understand everything.
If you try to do that,
then you will only ever have a shallow understanding of anything.
And I think it's a really interesting time
because people are starting to realize that.
And one of the ways I see it manifesting right now
is in the fact that people are starting to look to outside industries.
We've tried to call ourselves engineers for a long time.
There's a lot of debate about whether or not that's applicable. It's just semantics. I don't think it's actually
important outside of the fact that it is in fact a fun debate to have. But what I am seeing is
people realizing, oh, there are other engineering disciplines and they've learned all these lessons
already, especially in my world, people talk about reliability. And unfortunately,
from an industry standpoint, that's mostly come to mean availability. Those things are actually
very different. Reliability means so much more than that. And reliability engineering has been
around since the 1940s. I think it was the late 1940s is when the term was first phrased by the
US military in terms of whether some, I can't remember the exact object it was now,
but you know, like some armament, would it function the way it was intended to?
The term reliability was coined to mean, is this doing what it was designed to do?
And that's a heck of a lot more than just being available.
And we're seeing more people think about that, though.
You're seeing more people getting academic about things. And I remember a few years ago at work, someone was trying to solve a problem with the
fact that there were only very low resolution metrics coming in, only a few API calls per hour.
They're like, how do we alert off of this? Because a single error, which might be totally fine,
could represent 30% of all traffic over this hour window.
But do we want to measure over 24 hours?
Because then we wouldn't alert until 24 hours have passed. And a colleague came over and said, you know, you can just use a binomial distribution to solve that, right?
And everyone was like, huh?
What are you talking about?
And he just broke out Wikipedia and showed us, you know.
And suddenly, the entire team pivoted to understanding, like, holy crap, there are statistical models, some of which were developed centuries ago,
that we can use to help with so much stuff. And again, I think it's neat because after years and
years and years of tech being pretty egocentric and thinking we must solve it or thinking we are
the smartest in the room, and that's definitely not gone entirely, I feel like, and I've been doing
this for a long time. I've been in the industry in some way or another for almost 20 years.
And I personally, I feel like I'm seeing more and more people looking outside saying,
how have other people already solved this problem before?
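The binomial idea from that story can be sketched in a few lines. This is only an illustration under stated assumptions, not what Alex's team actually built: the function names, the 1% baseline error rate, and the 0.01 significance threshold are all hypothetical. It asks how likely it is to see at least this many errors by pure chance if the service is behaving at its normal error rate, and alerts only when that probability is very small.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # Sum the probability mass for k, k+1, ..., n failures out of n calls.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def should_alert(
    total_calls: int,
    errors: int,
    baseline_error_rate: float = 0.01,  # hypothetical "normal" error rate
    significance: float = 0.01,         # hypothetical alerting threshold
) -> bool:
    """Alert only if this many errors would be very unlikely at the baseline rate."""
    if total_calls == 0:
        return False  # no traffic, nothing to judge
    return binomial_tail(total_calls, errors, baseline_error_rate) < significance


# One error in three calls looks like a 33% error rate, but it is
# plausible by chance at a 1% baseline, so it does not alert.
print(should_alert(total_calls=3, errors=1))
# Two errors in three calls is far less likely by chance, so it does.
print(should_alert(total_calls=3, errors=2))
```

The design choice here is to compare against a probability rather than a raw error percentage, which is what makes a single failure in a tiny sample tolerable without waiting 24 hours for more data.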
One of the problems that I've always seen is that there's this tendency to not look for prior art,
instead sit down and dive right into attempting to solve it internally with
the resources you have first. Well, one of those resources is Google. Take a look and see. Maybe
other people have solved this. One problem that I have around this space is that the term
service-level objectives is not discoverable if you don't know that the term exists. How do you get this in front of people
who are absolutely positioned to benefit from this, but don't know what they don't know,
so they don't have the term to look for? I don't know if I have all the answers there,
beyond the fact that, in my opinion, to be a good engineer, you need to understand how to market.
If you have a new idea or a new service...
Well, slow down there, hasty pudding, says AWS. But please continue.
Exactly how to approach this, how to get this
in front of everyone. I don't think I know. And I don't know if I'm the exact right person for that.
But one thing I have learned is you just repeat yourself a lot. If you have a good idea that you
think can help other people, just tell them over and over again, maybe not to the point that you're
actually annoying them because you do want them to listen to you in the future. But if you think you have
a solution, let people know. And they may ignore you at first, or they may think that they still
have a better way to do it. But the one thing I will say, and I come back to this over and over
and over again, people listen to stories. Yeah, sometimes data helps. Sometimes having some
numbers to put in front of someone helps. But overall, people like stories.
Tell them a story about how this worked for you.
Tell them a story about how it made your life easier.
Tell them a story about, you know, how it saved a company a ton of money or helped them
discover some very obscure bug.
And that's what people really connect to.
We've been storytellers for millennia.
And that's the best way to get these kind of things across, I think.
And a lot of the book is written that way.
There are some chapters that are very heavy in math, and there's an entire chapter about statistics.
But even that one has some great stories about dumplings.
And, you know, we tried to frame the whole book that way.
You need a narrative.
That's how you engage people.
That's how you keep people listening.
And, of course, there's always the option of telling the story on wonderful podcasts like this one.
100%. I don't think the medium necessarily matters. You can podcast it, you can write a book, you can write a blog post, you can
go out to conferences and tell people these stories verbally, or, you know, whatever it is. Yeah,
I agree. I don't think that part necessarily matters because different people consume information
and consume narratives in different ways.
Well, this has been an absolutely fantastic experience and incredibly educational, at
least for me.
If people want to hear more about what you're up to and or learn more about SLOs, where
can they find you slash buy your book?
You can buy the book wherever you want.
It's an O'Reilly published book.
It's available widely. Go to your local bookstore if you can. If you're not comfortable currently
leaving the house, go to bookshop.org. That helps support local bookstores. But if you want to order
it off of Amazon because you got Prime, feel free to do that. Just, you know, think about perhaps
supporting small and local businesses. You can find me on Twitter at ahidalgosre. That's
A-H-I-D-A-L-G-O-S-R-E, where I often pontificate about these kinds of things. And I have a website
at alex-hidalgo.com. Given that you do obviously care about various ways to purchase the book,
let's make it easy on people. If you visit snark.cloud slash S-L-O book,
that's S-L-O-B-O-O-K,
we'll drop you onto a site that shows you
how to go about purchasing this
in a variety of different ways.
That's snark.cloud slash S-L-O book.
Alex, thank you so much for taking the time
to speak with me.
I really do appreciate it.
No, thank you, Corey.
This has been a great conversation.
I've had a lot of fun.
Likewise. Alex Hidalgo, site reliability engineer and author.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you hated this podcast,
please leave a five-star review on Apple Podcasts and a published statement
about exactly how many nines we should have had instead of an SLO in the comments.
This has been this week's episode of Screaming in the Cloud. You can also find more Corey
at screaminginthecloud.com or wherever fine snark is sold.
This has been a HumblePod production.
Stay humble.