Screaming in the Cloud - Writing the Book on Service Level Objectives with Alex Hidalgo

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent, not to mention lower. Their performance kicks the crap out of most other things in this space, and my personal favorite, whenever you call them for support, you'll get a human who's empowered to fix whatever it is that's giving you trouble. Visit linode.com slash screaminginthecloud to learn more and get $100 in credit to kick the tires. That's linode.com slash screaminginthecloud. not days or weeks. No front-end frameworks to figure out or access controls to manage.

Starting point is 00:01:26 Just ship the tools. It'll move your business forward fast. Okay, let's talk about what this really is. It's Visual Basic for Interfaces. Say I needed a tool to, I don't know, assemble a whole bunch of links into a weekly sarcastic newsletter that I send to everyone. I can drag various components onto a canvas. Buttons, checkboxes, tables, etc. Then I can wire all of those things up to queries with all kinds of different parameters, post, get, put, delete, etc. It all connects to virtually every database natively, or you can do what I did and build a whole crap ton of Lambda functions, shove them behind some API Gateways, and use that instead. It speaks MySQL, Postgres, Dynamo, not Route 53 in a notable oversight, but nothing's perfect. Any given component then lets me tell it which query to run when I invoke it. Then it lets me

Starting point is 00:02:17 wire up all of those disparate APIs into sensible interfaces, and I don't know front-end. That's the most important part here. Retool is transformational for those of us who aren't front-end types. It unlocks a capability I didn't have until I found this product. I honestly haven't been this enthusiastic about a tool for a long time. Sure, they're sponsoring this, but I'm also a customer and a super happy one at that. Learn more and try it for free at retool.com slash lastweekinaws. That's retool.com slash lastweekinaws and tell them Corey sent you because they are about to be hearing way more from me. Welcome to Screaming in the Cloud, I'm Corey Quinn. I'm joined this week by Alex Hidalgo, who's a site reliability engineer and, due to an

Starting point is 00:03:07 escalatingly poor series of life decisions, a recently published author, specifically the book Implementing Service-Level Objectives. Alex, welcome to the show. Thanks, Corey. So, every person I've talked to who's written a book has given me the thousand-yard stare when I ask if I should write one and then immediately begin screaming, no, never do this. I've come to the conclusion that nobody actually wants to write a book. They want to have written a book. How accurate is that?

Starting point is 00:03:40 I think there's absolutely some truth to that. It is difficult. It is tiring. It is emotionally draining. And having to is about people. It sounds like it's about service level objectives, and I guess it kind of is, but it's mostly about how to use those to make people's lives better. And I've seen how this process can do that. And so having something out there that I hope will help people's lives is ultimately rewarding. And there were more tears of frustration than there were tears of joy, but there were both. So let's start at the very beginning here.

Starting point is 00:04:28 I know the book is, at the time of this recording, it is in print. They are starting to ship out. I have not yet received my copy, but of course I have ordered one. I keep an eye on whatever O'Reilly releases, and especially when it's people I know, it's a no-brainer. It will, in all honesty, sit on the shelf and never get opened because I will do my actual reading on the Kindle. But having something on the shelf is, for something like this, especially when you know the person that wrote it, is just the right thing to do. But start at the beginning for me here, because it turns out that I am a white guy in tech, which means my

Starting point is 00:05:00 failure mode is a board seat and a book deal somewhere. Everyone assumes that I know everything about everything, and I tend to not shatter that illusion very often. But I have no earthly idea what the hell a service-level objective is, since it's just you, me, and the thousands of people listening to this. What is an SLO? So an SLO, well, it's an objective for your service. You can't be perfect. The story I like to tell is imagine you're using a streaming media service of some sort, a Netflix, a Hulu, a Disney Plus, whatever. And when you're using this, normally when you start a new video, it buffers for a few seconds and you're fine with that. And that's technically not perfect, but it turns out these services don't

Starting point is 00:05:41 have to be perfect for you because you're fine if it buffers a bit. But on the same token, if it buffers for like 20 seconds, you don't love that, but you're not going to abandon the service unless it buffers for 20 seconds every single time. Then you may say, screw this, I'm moving to a competitor. So the idea is find out what your users can tolerate and make sure you're only failing that often. If you're not losing users, if people are still happy in general with your service, if you only take 20 seconds to buffer one in 50 times, then aim for that. Because you're going to spend too many resources, both financially and via your developers and your support engineers, if you try to make everything 100% all the time. It sounds on some level like it's a derivative of SLA, service level agreement. What's the difference? So the difference is a service level agreement, and they've definitely been around much

Starting point is 00:06:37 longer. Actually, in some of my research, I've found- I've seen SLAs in contracts all the time when negotiating those. I have never seen the phrase service level objective in a contract, which means that lawyers will not know what I'm talking about if I use SLO, I suspect. Yep, exactly. An SLA is something you put into a contract and it generally implies that you owe somebody something, whether it's credit or actual money, if you violate that. SLOs are an approach to thinking about the reliability of your service. They are promises in some sense, but definitely not contractual ones. They're tools. They're a bit of data that help you make decisions.

Starting point is 00:07:17 Are we buffering too long too often? Is this page not loading correctly too often? You know these things are going to happen and just make sure it's not too much of the time. It kind of accepts the same thing as an SLA does. SLAs are generally not 100%, because, again, people realize something will break at some point in time. SLOs think about things in the same way in that sense, but they're used to help you make decisions. Do we need to focus on this part of our product?

Starting point is 00:07:44 Do we need to focus on this? So help me understand this in the context of a story that I've related from time to time on various forms of podcast and whatnot. Years ago, I was trying to buy a pair of socks on amazon.com and I clicked the buy button and I got one of their error pages, which of course features dogs.

Starting point is 00:08:03 In all honesty, the dog page is more satisfying than any other page on amazon.com. If I listen to the common wisdom, that would mean that during that outage that lasted about an hour or so, I would have therefore gone to a competitor to buy the pair of socks or alternately one day out of the week on the day that that pair of socks should have been there, I would just go without socks whatsoever. In practice, oh, that's weird. I never see that. Ha ha. I come back an hour later, I buy the socks and life goes on. There was no loss of revenue in my case for amazon.com during that outage. However, if every third time I tried to buy something at Amazon, I got the dog page instead, I'd probably spend a lot more money at Target. So is that a naive storytelling, I guess, understanding of a much more complex concept of SLOs? Are they related or is this completely out in the weeds and it's a boring story? We should

Starting point is 00:08:57 make sure we drop on the floor in post-production. No, that's exactly it. If you're down for an hour, chances are people are just gonna be like, huh, this happens. Stuff breaks. People are used to it. They're actually mostly okay with it. And you're probably just going to come back and check in an hour. That's exactly correct. That's the whole point. You can be down for an hour. You just can't be down for an hour too often. So find out what that is. Find that percentage. Can you be down for an hour once a month, twice a month? It's going to be different depending on your service. If you're a specialty retailer, no one else sells your stuff, then you can probably actually be a little bit more lax than if you're someone like Amazon,

Starting point is 00:09:34 who's also expected to be constantly up because they're the largest company in the world in some sense. So no, I think you got it right. Exactly. It's just that implementing SLOs, there's a lot of math that goes into it. There's a lot of discussions you have to have. It's not easy to just pick a number. So they're simple to talk about, not always easy to implement. That's why there's a whole book about it. But your story, that's exactly it.
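
As a rough illustration of the "find that percentage" step (a sketch added for clarity, not something worked through in the conversation), the snippet below converts "down for an hour once or twice a month" into an availability figure. The 30-day month and the function name availability_percent are assumptions made for the example.

```python
# Hypothetical helper: convert downtime within a window into an availability percentage.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: a 30-day window (43,200 minutes)


def availability_percent(downtime_minutes: float,
                         window_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the share of the window, as a percentage, during which the service was up."""
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime must be between 0 and the window length")
    return 100.0 * (1 - downtime_minutes / window_minutes)


# Compare one versus two hour-long outages in the same month.
for outages in (1, 2):
    pct = availability_percent(60 * outages)
    print(f"{outages} one-hour outage(s) per month -> {pct:.3f}% available")
```

Under these assumptions, one hour-long outage a month leaves roughly 99.86% availability and two leave roughly 99.72%. Which of those is acceptable depends on the service, as discussed above.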

Starting point is 00:09:58 That's exactly the whole point. B&H, the photo company, closes for 24 hours for Shabbat every week. And they wind up having their website up, but they say, yeah, you can't actually make a purchase till Shabbat ends. On some level, it's kind of their own brand now at this point, and it seems to have worked out reasonably well. But as you say, it comes down to what your story is as far as approaching the market. I'll go back an hour later to buy socks because I need socks. I'm not going to go back to your website an hour later

Starting point is 00:10:28 to click on an ad that wasn't displaying. Exactly. So a meaningful SLI is a measurement of how your service is operating from your user's perspective. And when people ask me to explain it a bit more, I'm always like, well, it's the same thing as a KPI or key performance indicator for the business side. Or if you were to talk about this

Starting point is 00:10:47 to a product manager, they would say, oh, it's a user journey. You've got to take everything into account. People are going to have different expectations for how different parts of the service work. As you said, there's a difference

Starting point is 00:10:57 in between, you know, perhaps something like a button not registering a click on the first try, but registering a click on the second try. That's a thing people aren't going to be too upset about. And that can probably fail more often than just not being able to check out entirely. When you're looking at SLOs through a lens of things we should strive to do, how does that keep from becoming a, we'll try our best, which sounds great, makes everyone feel good, but isn't something that

Starting point is 00:11:25 is easy to represent as either having value or matters at all to the larger business. So I think you really do just try your best, but I understand, I agree that that doesn't sound like a great sentence, even though it generally is true. You know, you'll like, you'll try your best and you'll try to make sure your service is good, but, you know, it can't always be perfect. And I get that that kind of language isn't great, but that's actually exactly why I think the whole SLI, SLO error budgets,

Starting point is 00:11:56 which we haven't even talked about, measurement of your SLO over time as opposed to kind of right now. This is actually an example where the numbers can help. You can point to things and say like, yeah, but we were only unreliable for four minutes and 32 seconds last month or something along those lines. And that's how you can kind of help explain to people that, yes, in a sense, this is, we're just going to try our best because we cannot try our perfect. That's not a thing. People get to understand what you actually mean with that.

Starting point is 00:12:25 When you're saying, I'm going to try my best, you're actually saying, well, we're aiming to be 99.95% reliable. And that translates to X number of minutes per month that we may be unreliable. And that can often help people understand, oh, huh, maybe trying your best is actually good enough. I really wish that more people were explicit about saying trying your best is good enough because I can't shake the feeling that that is not a well-circulated belief in far too many places. Totally agreed. But it's the truth.
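
To make the "X number of minutes per month" concrete, here is a minimal sketch of the arithmetic. The 30-day window and the function name error_budget_minutes are assumptions for illustration, and pairing the 99.95% target with the "four minutes and 32 seconds" figure is also an assumption, since the conversation mentions them separately.

```python
# Hypothetical helper: turn an SLO target into an error budget expressed in minutes.
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unreliability allowed in the window while still meeting the target."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60  # assumption: a 30-day window by default
    return window_minutes * (100 - slo_percent) / 100


budget = error_budget_minutes(99.95)
used = 4 + 32 / 60  # the "four minutes and 32 seconds" example, in minutes
print(f"budget: {budget:.1f} min, used: {used:.2f} min, remaining: {budget - used:.2f} min")
```

With a 30-day window, a 99.95% target works out to about 21.6 minutes of allowed unreliability, so the four-and-a-half-minute month in the earlier example would sit comfortably inside the budget.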

Starting point is 00:12:58 It's how things actually work. Things fail. People fail. And it turns out people actually know that. When you're running a business, the end goal of course, is to make money and make as much money as possible, or at least for most businesses that are, there are absolutely outliers there. And that means, you know, your executives or whoever owns the most shares or your shareholders, if you've gone public,

Starting point is 00:13:21 that's their goal, right? At some level, they want to make money and they want to make as much money as possible. And therefore they think the way to do that is to aim for perfection, to aim for a hundred percent, but you're always going to falter if you do that. And those are the people who can be most difficult to convince that it's just not prudent to aim for a hundred percent, but those people can get there as well. It can take time. It can take a lot of examples, but don't let great be the enemy of the good. It's a concept that as humans, we know so well that we have an idiom for it. And you just got to figure out how to translate that to a business where, again, they think they want to be 100% because they think they need to do that to make as much money as possible. But you can actually often save in resources by not trying to be perfect because the amount of money you're going to spend, it ends up almost becoming like a limit approaching infinity.

Starting point is 00:14:16 That curve is going to shoot straight up in how much money it's going to cost for you to try to be as reliable as possible. You're going to have to run multi-region and you're going to have to pay for quicker replication. And, you know, you're going to have to hire more engineers because like, if you want to be up almost all the time, you're going to have to have people who are on call who can respond within like 30 seconds. The only way you can do that is if you have like a follow the sun rotation

Starting point is 00:14:37 where you have offices all over the world, because otherwise not everyone's going to wake up at 3 a.m. So you got to make sure it's someone's 3 p.m. instead. You can see how quickly this escalates. It can actually cost you more money. You can actually make more money by having a more reasonable target. Think about what your users actually need from you, and often, well, again, humans expect failure.

Starting point is 00:14:57 They're cool if stuff doesn't work every once in a while, as long as it doesn't fail too often. One of the only other places I've seen SLOs discussed in any serious capacity was the SRE book that came out of Google. And there was a lot of good stuff in that book, but I had a bit of a negative reaction to it just because that came out right when I was

Starting point is 00:15:18 in the middle of getting an awful lot of, we're Google, we're smarter than you, flak on other fronts. The people who wrote that book, to be very clear, are great. That is not the impression I have of those people. But it's, oh, good, how to be more like Google, just what I don't want to listen to. So I largely ignored it for a while. But Liz Fong-Jones, now at Honeycomb, is a big advocate of SLOs. They're one of my consulting reference clients. And most of our conversations around cost optimization have centered around SLOs. So what the expectation for their customers is and the commitments that they have made.

Starting point is 00:15:53 And it was a really interesting philosophy that I haven't seen replicated elsewhere yet. Yeah, it's something that's gaining a lot of traction. And actually, so I was at Google. I was on the customer reliability engineering team with Liz. And one of the things we did was we went out and we taught some of Google's largest cloud customers how to SRE. That was kind of our goal. And the beginning of every journey was you need to have SLOs first. This is our common language. This is how we talk about reliability. Reliability to an SRE is defined by service level objectives.

Starting point is 00:16:28 And so while it's still outside of Google, still kind of a growing discipline, lots of people are doing it. In some ways, this book is about the two years I spent at Squarespace. It's essentially my story there. I joined and people said, we want to do SLOs and we know that you know SLOs and let's do it. And I was like, okay, sure. Except I didn't realize what it takes to build this kind of thing from the ground up because suddenly I wasn't at Google anymore. I didn't have all the tooling that Google has. I didn't have the cultural buy-in, not just by engineers, but across various

Starting point is 00:17:00 different organizations. And I had to do everything from the bottom up. New tooling had to be created, new software written, had to drastically change how people measure things, how people think about things. And that's kind of how the book came to be. I was running a lot of SLO workshops, just internally, where teams could come and I could spend three, four hours with them and maybe even hopefully end up with a single defined SLI, SLO pair before they left the room. And it was just getting to be a lot because I was seeing the same thing over and over again. And I was talking to my colleague, Gabe, and I explained this to him, you know, it's just getting tiring. And then I said, I wish there was a whole book about this.

Starting point is 00:17:40 So I could just point people to the book. And Gabe said, you should write it. I said, no, no, no, no. We need like the expert to write it. And he said, you are the expert. And so I'm pretty sure my response was, because I knew I was now going to write a book. It sounds on some level like writing a book is something that is, it used to be this thing that people would do, this aspirational task of I'm going to write a book.

Starting point is 00:18:07 Increasingly, it's starting to sound from an awful lot of author folks that it's more like a dead dog that has been cast into your yard by one of your neighbors. And now it's time for you to worry about it. I mean, I don't know if I go quite that far with a metaphor, but yeah, absolutely. I mean, I won't lie.

Starting point is 00:18:23 It's awesome that I wrote a book. That's neat. As a kid, my dream was to be an author. I thought it'd be like fantasy books and not a technical manual, but still it's amazing. I got the first physical copy yesterday and I bawled. I cried. It's really, really neat having done that. And I won't pretend that some of the status that you get with that, I won't pretend I'm totally ignoring that, but absolutely at the root of it, it's more like, yeah, someone needs to do this. I am the right person for it. I write well.

Starting point is 00:18:51 I know a lot of people who can help me with this. I helped with the second SRE book, so I already understood the process just a little bit. And I was like, you know, this needs to be out there. And if so, it may as well be me. This episode is sponsored in part by our friends at New Relic. If you're like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff.

Starting point is 00:19:18 We all have been there and know that pain, or we'll learn it shortly. And New Relic wants to change that. They've designed everything you need in one platform, with pricing that's simple and straightforward, and that means no more counting hosts. You also can get one user and 100 gigabytes per month, totally free. To learn more, visit newrelic.com. Observability made simple.

Starting point is 00:19:42 I'm glad it's you, because first, I get to wind up reading a book that I don't have to write, which is great, because I'm never going to write one. angles of this from a whole bunch of different perspectives in a much longer form than a blog post or this is a tweet thread, tweet one of 487,000 and so on and so forth. It's great to have a single place for it to go. Also, you tipped me off fairly early on that talk about bucket list items that I'm referenced in an Easter egg within the book. Yeah. So actually got to give props to some co-authors here. I actually ended up writing only about 60% of the book. I was always planning on bringing in two or three people who are like experts at very niche parts of this.

Starting point is 00:20:35 And suddenly I had people volunteering all over the place. And so the other 40% have a bunch of amazing people from across the industry. And Polina Giralt and Blake Bisset wrote a chapter about data reliability. Data reliability is, you've got to approach it in a very different way than latency or availability or error rates. So we have a whole chapter about data reliability, and there's a part where they discuss what even is a database. And they asked the question, is Route 53 a database? And we were able to get a footnote in that says, hi, Corey.

Starting point is 00:21:04 At first, the editor wanted to take it out, and I was pretty adamant that we should leave it in. Oh, absolutely. The fact that I am referenced in this now means that I'm getting it framed and hung on the wall. It's, oh, did you write that book? Absolutely not. One of my stupid jokes made it into that book. And that's, yeah, oh, I'm absolutely going to steal credit for you in this sense. I can finally have that as my counterpoint to my business partner's story. He wrote Practical Monitoring for O'Reilly and has it on display on his bookshelf behind him when he's on Zoom calls. And it's a fun problem because, as it turns out, when you've written a book, it's very hard to bring up the fact that you have written a book because you're proud of it. You spent a disturbing portion of your life for a while on writing that book, but you can't

Starting point is 00:21:51 open the sentence with that or people find it pretentious and ridiculous. So my position has always been that if I know someone's written a book, I will drop that into virtually every conversation when someone who's talking to them doesn't know that fact. I try and be the one person promotional band for stuff like that. So do you know that you've written a book? It still doesn't feel real, even though I got that physical copy and have paged through it one by one. I spent almost like an hour, not really reading, but kind of reading. It still doesn't feel real, to be honest, but I totally hear you. So I've given myself the opportunity to brag occasionally, to gush, because I'm incredibly proud of this. And this was literally the hardest thing I've

Starting point is 00:22:31 ever done in my entire life. So I tried to give myself the opportunity to occasionally talk about it. But outside of Twitter, where honestly, I don't care that much. I'm constantly self-promotional there. In the real world, you're absolutely right. It's difficult to bring up. I posted a picture of the book on an internal Slack channel at Squarespace. And because I let myself say, okay, cool. I haven't talked about the book at work for several months.

Starting point is 00:22:57 Like here's my once a quarter permitted single message about it. And there were engineers at the company that still didn't know I was writing the book. Because you're right, it's difficult to bring up. You don't want to sound like a bragger. But, you know, I do think you have to try to give yourself permission every once in a while. There's nothing wrong with promoting the things you've done, especially the ones that you're very proud of. Looking back, based upon what you know now,

Starting point is 00:23:21 first, would you write the book again? And secondly, what do you wish you'd known before you started? Yes, I'd write the book again. At the end of the day, the positives outweigh how difficult it was. What would I do differently? I would take more time off. I'm not writing a book again and also working full-time. I did take a few weeks off in December, but having to do essentially all of this work on weekends and evenings, that was draining. It was a lot. And I should have known that I needed to give myself more time there. Don't do it when a global pandemic is about to happen.

Starting point is 00:23:57 That part was terrible, especially if you're someone who can't write at home. I need to be at a coffee shop, at a bar. It's strangely not very unheard of. Lots of writers are this way. And when I didn't have anywhere to go anymore, that was tough. That was not easy. Of course, you can't really predict that,

Starting point is 00:24:16 but I had to throw that out there. And the final thing I'd do is, I had a bunch of co-authors and they're all wonderful and every chapter turned out great. But trying to navigate that many people's schedules and that many people's own commitments. And I think I just want to be more sure ahead of time to let these chapter authors know this is going to be difficult. This is going to take you more time than you think.

Starting point is 00:24:45 Can you take a week or two off? Because otherwise you're really going to be struggling with this. So those are the things. Just give yourself more time. If you're working with other people, ensure that they're giving themselves enough time and just try to make sure you're in a good situation in general. That might be a better way to sum up the whole pandemic thing because of course you can't ever control that. But, you know, make sure you have time, make sure you're comfortable, make sure you're not going through other life events. Don't do this in the middle of some other crisis or health problems or something like that. You need to be in the best possible mental state that you can be in before you embark on something like this. That's a hard and heavy lift in 2020. And again, there is a production delay between the time that

Starting point is 00:25:27 we record this and the time it goes out. Sorry, listeners. When you download something from your podcast, I don't quick get someone on the phone and have that talk live. I know, spoiling the production magic for you. But it seems like this is getting to be such a weird year from week to week. It's, wow, they didn't even mention the giant meteor. And well, here we are. It's a hard problem to solve for as far as finding time to write. I've written a few basic outlines of books I've toyed with writing. And invariably in almost every case, I find another book that's already been written that aligns closely enough that, oh, I'll just talk about that thing instead. It's easier. Yeah, that's, I remember when I was first really getting into service level objectives at Google, I was on a team that was responsible for the

Starting point is 00:26:12 monitoring and alerting for everything else across Google. So we wanted to have well-defined SLOs so other internal engineers could look to our definitions and understand how reliable we were aiming to be so they could ensure that their systems were handling things correctly and knew how to retry when they had to and things like that. And suddenly I had this great idea of building an SLO repository, like a centralized place where everyone can define their SLOs and they get some tooling or dashboarding for free. And it was a centralized place for you to discover what your dependencies were aiming for.

Starting point is 00:26:47 So you could set your targets correctly. And I told one of the staff engineers on my team, and he was ecstatic about it. He was like, oh, my God, Alex, this is a great idea. This is going to get you your next promotion. And I spent a few hours starting to, like, outline what it would look like. And then someone else on my team came to me and was like, I just discovered there is an entire team staffed of 10 people working on this product. So my great idea immediately went up in smoke because someone else already had that great idea first. But in this case, I looked. I really did.

Starting point is 00:27:20 And when you fill out a proposal for O'Reilly, they ask you, what are competing books to yours? We need to know, what are you comparing yourself against? And I listed the SRE books, including Seeking SRE, David Blank-Edelman's book, because they at least talk about SLOs. But really, it was tangential. I was like, what I'm writing is strangely new. It's not just much more expansive, but it's actually a pretty different take than how they're described in either of the Google SRE books. So that was one of the reasons I really felt like I had to do it because I looked and I couldn't find what I thought needed to be out there. That's probably the greatest sign it's time to write a book, I would imagine, when no one else

Starting point is 00:28:01 is talking about the thing that you want to talk about, or they are, and they're getting it all wrong across the board. My position has always been to do a snarky take on Twitter or a sarcastic blog post, but there are times you need to go deeper than that. And to be honest, I'm very glad that people like you have attention spans. It's easy to fall into a trap, in my experience, of, in the world of Twitter and things like it, it's easy to attain relative mastery, or absolutely not, but the appearance of relative mastery in the confines of 280 characters. But then you see people who are legitimate experts in things and, oh, it turns out that maybe I shouldn't be reinventing all this stuff from first principles as if I were suddenly Hacker News come to life. There's a definite value in seeing deep, exhaustive research.

Starting point is 00:28:49 One of the things I find most worrying about my increased attention to short-form social media is that I'm not reading the long-form stuff that really lets you dive into a topic with anything approaching the frequency that I used to. Yeah, and I think it's actually, it's an interesting time in tech because I think we're actually kind of, a lot of people are pivoting towards understanding that we need to have a more in-depth understanding of how everything works. People have stopped being experts

Starting point is 00:29:18 or deep experts at individual things. People are always being asked to be full stack engineers and you've got to understand everything. If you try to do that, then you will only ever have a shallow understanding of anything. And I think it's a really interesting time because people are starting to realize that. And one of the ways I see it manifesting right now

Starting point is 00:29:36 is in the fact that people are starting to look to outside industries. We've tried to call ourselves engineers for a long time. There's a lot of debate about whether or not that's applicable or not. It's just semantics. I don't think it's actually important outside of the fact that it is in fact a fun debate to have. But what I am seeing is people realizing, oh, there are other engineering disciplines and they've learned all these lessons already, especially in my world, people talk about reliability. And unfortunately, from an industry standpoint, that's mostly come to mean availability. Those things are actually very different. Reliability means so much more than that. And reliability engineering has been

Starting point is 00:30:15 around since the 1940s. I think it was the late 1940s is when the term was first phrased by the US military in terms of whether some, I can't remember the exact object it was now, but you know, like some armament, would it function the way it was intended to? The term reliability was coined to mean, is this doing what it was designed to do? And that's a heck of a lot more than just being available. And we're seeing more people think about that, though. You're seeing more people getting academic about things. And I remember a few years ago at work, someone was trying to solve a problem with the fact that there were only very low resolution metrics coming in, only a few API calls per hour.

Starting point is 00:30:57 They're like, how do we alert off of this? Because a single error, which might be totally fine, could represent 30% of all traffic over this hour window. But do we want to measure over 24 hours? Because then we wouldn't alert until 24 hours have passed. And a colleague came over and said, you know, you can just use a binomial distribution to solve that, right? And everyone was like, huh? What are you talking about? And he just broke out Wikipedia and showed us, you know. And suddenly, the entire team pivoted to understanding, like, holy crap, there are statistical models, some of which were developed centuries ago,

Starting point is 00:31:29 that we can use to help with so much stuff. And again, I think it's neat because after years and years and years of tech being pretty egocentric and thinking we must solve it or thinking we are the smartest in the room, it's definitely not gone entirely. I feel like, and I've been doing this for a long time. I've been in the industry in some way or another for almost 20 years. And I personally, I feel like I'm seeing more and more people looking outside saying, how have other people already solved this problem before? One of the problems that I've always seen is that there's this tendency to not look for prior art, instead sit down and dive right into attempting to solve it internally with

Starting point is 00:32:05 the resources you have first. Well, one of those resources is Google. Take a look and see. Maybe other people have solved this. One problem that I have around this space is that the term serverless-level objectives is not discoverable if you don't know that the term exists. How do you get this in front of people who are absolutely positioned to benefit from this, but don't know what they don't know, so they don't have the term to look for? I don't know if I have all the answers there, beyond the fact that, in my opinion, to be a good engineer, you need to understand how to market. If you have a new idea or a new service... Well, slow down there, hasty pudding, says AWS. But please continue. Exactly how to approach this, how to get this in front of everyone. I don't think I know. And I don't know if I'm the exact right person for that.

Starting point is 00:32:55 But one thing I have learned is you just repeat yourself a lot. If you have a good idea that you think can help other people, just tell them over and over again, maybe not to the point that you're actually annoying them because you do want them to listen to you in the future. But if you think you have a solution, let people know. And they may ignore you at first, or they may think that they still have a better way to do it. But the one thing I will say, and I come back to this over and over and over again, people listen to stories. Yeah, sometimes data helps. Sometimes having some numbers to put in front of someone helps. But overall, people like stories. Tell them a story about how this worked for you.

Starting point is 00:33:29 Tell them a story about how it made your life easier. Tell them a story about, you know, how it saved a company a ton of money or helped them discover some very obscure bug. And that's what people really connect to. We've been storytellers for millennia. And that's the best way to get these kind of things across, I think. And a lot of the book is written that way. There are some chapters that are very heavy in math, and there's an entire chapter about statistics.

Starting point is 00:33:53 But even that one has some great stories about dumplings. And, you know, we tried to frame the whole book that way. You need a narrative. That's how you engage people. That's how you keep people listening. And, of course, there's always the option of telling the story on wonderful podcasts like this one. 100%. I don't think the medium necessarily matters. You can podcast it, you can write a book, you can write a blog post, you can go out to conferences and tell people these stories verbally, or, you know, whatever it is. Yeah,

Starting point is 00:34:20 I agree. I don't think that part necessarily matters because different people consume information and consume narratives in different ways. Well, this has been an absolutely fantastic experience and incredibly educational, at least for me. If people want to hear more about what you're up to and/or learn more about SLOs, where can they find you slash buy your book? You can buy the book wherever you want. It's an O'Reilly published book.

Starting point is 00:34:44 It's available widely. Go to your local bookstore if you can. If you're not comfortable currently leaving the house, go to bookshop.org. That helps support local bookstores. But if you want to order it off of Amazon because you got Prime, feel free to do that. Just, you know, think about perhaps supporting small and local businesses. You can find me on Twitter at ahidalgosre. That's A-H-I-D-A-L-G-O-S-R-E, where I often pontificate about these kinds of things. And I have a website at alex-hidalgo.com. Given that you do obviously care about various ways to purchase the book, let's make it easy on people. If you visit snark.cloud slash S-L-O book, that's S-L-O-B-O-O-K,

Starting point is 00:35:28 we'll drop you onto a site that shows you how to go about purchasing this in a variety of different ways. That's snark.cloud slash S-L-O book. Alex, thank you so much for taking the time to speak with me. I really do appreciate it. No, thank you, Corey.

Starting point is 00:35:42 This has been a great conversation. I've had a lot of fun. Likewise. Alex Hidalgo, site reliability engineer and author. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts. Whereas if you hated this podcast, please leave a five-star review on Apple Podcasts and a published statement about exactly how many nines we should have had instead of an SLO in the comments. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever Fine Snark is sold.

Starting point is 00:36:33 This has been a HumblePod production. Stay humble.
