Screaming in the Cloud - Reliability Starts in Cultural Change with Amy Tobey

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by our friends at Vultr, spelled V-U-L-T-R, because they're all about helping save money, including on things like, you know, vowels.

Starting point is 00:00:40 So what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing. And when they say that, they mean that it's less money. Sure, I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have 19 global locations and scale things elastically, not to be confused with openly, which is apparently elastic and open. They can mean the same thing sometimes. They have had over a million users. Deployments take less than 60 seconds across 12 pre-selected operating systems,

Starting point is 00:01:25 or if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr Cloud Compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something of the scale on their

Starting point is 00:01:46 own. Try Vultr today for free by visiting vultr.com slash screaming, and you'll receive $100 in credit. That's v-u-l-t-r dot com slash screaming. Finding skilled DevOps engineers is a pain in the neck, and if you need to deploy a secure and compliant application to AWS without such things, forget about it. But that's where DuploCloud can help. Their comprehensive no-code-slash-low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks while automating the full DevSecOps lifecycle. Get started with DevOps as a service from DuploCloud, and your cloud configurations will be done right the first time. Tell them I sent you, and your first two months are free. To learn more, visit snark.cloud slash duplocloud. That's snark.cloud slash d-u-p-l-o-c-l-o-u-d.

Starting point is 00:02:50 Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I catch up with someone that it feels like I've known for ages, and I realize somehow I have never been able to line up getting them on this show as a guest. Today is just one of those days. And my guest is Amy Tobey, who has been someone I've been talking to for ages, even in the before times, if you can remember such a thing. Today, she's a senior principal engineer at Equinix. Amy, thank you for finally giving in to my endless wheedling. Thanks for having me. You mentioned the before times. I remember it was right before the pandemic. We had beers in San Francisco, wasn't it? There was Ian there and a couple other people.

Starting point is 00:03:34 It was a really great time. I think I remember beer. Yeah. And then the world ended. Oh my God, yes. It's still March of 2020, right? As far as I know. I haven't checked in a couple years. So you do an awful lot, and it's always a difficult question to ask someone. So can you just encapsulate your entire existence in a paragraph? It's awful. So I'd like to give a bit more structure to it. Let's start with the introduction. You are a senior principal engineer.

Starting point is 00:04:14 We know it's high level because of all the adjectives that get put in there. And none of those adjectives are associate or beginner or junior or all the other diminutives that companies like to play games with to justify paying people less. And you're at Equinix, which is a company that is a bit unlike most of the, shall we say, traditional cloud providers. What do you do over there, both as a company and as a person? So as a company, Equinix, what most people know about is that we have a whole bunch of data centers all over the world. I think we have the most of any company. And what we do is we lease out space in those data centers, and then we have a number of other products that people don't know as well, one of which is Equinix Metal, which is what I specifically work on,

Starting point is 00:04:44 where we rent you bare metal servers. None of that fancy stuff that you get in the other clouds on top of it. There's things you can get that are partner things that you can add on like storage and other things like that. But we just deliver you bare metal servers with really great networking.

Starting point is 00:05:00 So what I work on is the reliability of that whole system. All of the things that go into provisioning the servers, making them come up, making sure that they get delivered to the server, making sure the API works right, all of that stuff. So you're on the Equinix cloud side of the world more so than you are on the building data centers by the sweat of your brow, as they say. Correct, yeah. Software side. Excellent. Yeah, I spent some time in data centers in the early part of my career before cloud ate that. That was sort of contemporaneous with the discovery that I'm the hardware destruction bunny and I should go to great pains to keep my aura away from anything expensive and important, like, you know, the SAN. Right, yeah. So, yeah, companies moving out of data centers and me getting out was a great thing.

Starting point is 00:05:39 But the thing about SANs, though, is like, it might not be you. They're just kind of cursed from the start, right? They just always were kind of fussy and easy to break. Oh, yeah. I used to think, and I kid you not, that I had a limited upside to my career in tech because I sometimes got sloppy and I was fairly slow at crimping Ethernet cables. That is very similar to growing up in third grade when it became apparent that I was going to have problems in my career because my handwriting was sloppy. Yeah, it turns out the future doesn't look like we predicted it would. Oh gosh, are we going to talk about, like, neurological development now? That's the thing I struggle with too, right? I started typing as soon as they would let me, in fact, before they would let me. I remember in high school,

Starting point is 00:06:25 I had teachers who would grade me down for typing a paper out. They wanted me to handwrite it, and I would go, cool, go ahead and take a grade off, because if I handwrite it, you're going to take two grades off my handwriting. So I'm cool with this deal. Yeah, it was pretty easy early on.

Starting point is 00:06:40 I don't know when the actual shift was, but it became more and more apparent that more and more things are moving towards a world where you could type. And I was five when I started working on that stuff. And that really wound up changing a lot of aspects of how I started seeing things. One thing I think that you're probably fairly well known for is incidents. I want to be clear when I say that. You are not the root cause. So why are things broken? It's Amy again. What have you gotten into this time? It does happen, but not all the time.

Starting point is 00:07:12 Exactly. It's a learning experience. Right. You've also been deeply involved with SREcon and a lot of aspects of what I will term, and please don't yell at me for this, SRE culture, which is sometimes a challenging thing to wind up describing or putting a definition around. The one that I've always been somewhat partial to is SRE is DevOps, except you work at Google for a while. I don't know how necessarily accurate that is, but it does rile people up. Yeah, it does. Dave Stanke actually did a really great talk at SREcon San Francisco

Starting point is 00:07:45 just a couple weeks ago about the DORA report. And the new DORA report, they split SRE out into its own function and kind of is pushing against that old model, which actually comes from Liz Fong-Jones. I think it's from her or older, about like class SRE implements DevOps, which is kind of this idea that like SREs make DevOps happen. Things have evolved since then. Things have evolved since Google released those books. The world has figured out what works and what doesn't a little bit. And so it's not that we're implementing DevOps so much. In fact, it's that ops stuff that kind of holds us back from the really high-impact work that SREs, I think, should be doing that aren't just like fixing the problems, the symptoms

Starting point is 00:08:25 down at the bottom layer, like what we did as sysadmins 20 years ago. You know, a lot of people are SREs that came out of the sysadmin world and still think in that mode where it's like, well, I set up the systems and when things break, I go and I fix them. And why do the developers keep writing crappy code? Why do I always have to get up in the middle of the night because this thing crashed? And it turns out that the work we need to do to make things more reliable, there's a ceiling to how far the platform can take us, right? Like we can have the best platform in the world with redundancy and, you know, nine-way replicated data storage and all this crazy stuff. And still, if we put crappy software on top, it's going to be unreliable. So how do we make less crappy software?

Starting point is 00:09:07 And for most of my career, people would be like, well, you should test it. So we started doing that, and we still have crappy software. So what's going on here? We still have incidents. So we write more tests, and we still have incidents. We had a QA group. We still have incidents. We send the developers to training, and we still have incidents.

Starting point is 00:09:21 So what is the thing we need to do to make things more reliable? And it turns out that most of it is culture work. My perspective on this stems from being a grumpy old sysadmin. And at some point I started calling myself a systems engineer or DevOps or production engineer or SRE. It was all from my point of view, the same job. But you know, if you call yourself a sysadmin, you're just asking for a 40% pay cut off the top. But I still tended to view the world through that lens. I tended to be very good at Linux systems internals, for example, understanding system calls and the rest. But increasingly, as the DevOps wave or SRE wave or Googlization of the internet wound up being more and more of a thing. I found myself increasingly in job interviews where, great, now can you go wind up implementing a sorting algorithm on the

Starting point is 00:10:10 whiteboard? It's what on earth? No, like my lingua franca is shitty bash. And no one tends to write that without a bunch of tab completions and a quick checking with man pages, die.net or whatnot on the fly as you go down that path. And it was awful. And I felt like my skillset was increasingly eroding. And it wasn't honestly until I started this place where I really got into writing a fair bit of code to do different things because it felt like an orthogonal skillset, but in the fullness of time, it seems like it's not.

Starting point is 00:10:43 And it's a re-skilling, and it made me wonder, does this mean that the areas of technology that I focused on early in my career, was that all a waste? And the answer is, not really. Sometimes, sure. In that I don't spend nearly as much time worrying about inodes, for example, as I once did. But every once in a while, I'll run into something and I look like a wizard from the future, but instead I'm a wizard from the past. Yeah, and I find that a lot in my work now is sometimes things I did 20 years ago come back and it's like, oh yeah, I remember I did it. I did all that threading work in 2002 in Perl and I learned everything the very, very, very hard way. And then, you know, this January I did some work

Starting point is 00:11:22 and some threading work to fix some stability issues. And all of it came flooding back, right? The experience is really more than the code or the learning or the text and stuff. It's more of the just, like, "this feels like thread f***ery" as a diagnostic thing that sometimes we have to say. And the people are like, can you prove it? And I'm like, not really, because it's literally thread f***ery. Like the definition of it is that there's weird stuff happening that we can't figure out why it's happening.

Starting point is 00:11:50 There's something acting in the system that isn't synchronized, that isn't connected to other things. It's happening out of order from what we expect. And if we had a clear signal, we would just fix it. But we don't. We just have like weird stuff happening over here and then over there and then over there and over there. And like that tells me there's just something happening at that layer and then have to go and dig into that, right? And like just basically charge through.

Starting point is 00:12:13 My colleagues are like, well, maybe we should look at this and go look at the database, the things that they're used to looking at and that their experiences inform. Whereas then I bring that ancient toiling-through-the-threading-mines experience back and go, oh, yeah, so let's go find where this is happening, where people are doing dangerous things with threads, and see if we can spot something.

Starting point is 00:12:34 But that came from that experience. There's so much that just repeats itself, and history rhymes. The challenge is that do you have 20 years of experience, or do you have one year of experience repeated 20 times? And as the tide rises, doing the same task by hand, it really is just a matter of time before your full-time job winds up being something a piece of software does. An easy example is, oh, what's your job? I manually place containers onto specific hosts. Well, I've got news for you and you're not gonna like it at all. Yeah, yeah.

Starting point is 00:13:07 I think that we share a little bit. I'm allergic to repeated work. I don't know if allergic's the right word, but if I sit and I do something once, fine. Like I'll just crank it out. It's this form or it's this data file I gotta write. And I'll, fine, I'll type it in and do the manual labor. The second time, the difficulty goes up by 10, right? Like just

Starting point is 00:13:26 mentally, I just, just to do it, be like, I've already done this once. Doing it again is anathema to everything that I am. And then sometimes I'll get through it. But after that, like writing a program is so much easier because it's like almost exponential growth in difficulty. You know, the third time I have to do the same thing, that's like just typing the same stuff, like look over here, read this thing and type it over here. I'm out. I can't do it. You know, I got to find a way to automate.

Starting point is 00:13:50 And I don't know, maybe normal people aren't driven to live this way, but it's kept me from getting stuck in those spots too. It was weird because I spent a lot of time as a consultant going from place to place, and it led to some weird changes. For example, oh, thank God I don't have to think about that whole messaging queue thing. Sure, next engagement, it's message queue time. Fantastic. I found that repeating myself drove me nuts, but you also have to be very sensitive not to wind up stealing IP from the people that you're working with. But what I loved about the sysadmin side of the world is that the vast majority of stuff that I've taken with me lives

Starting point is 00:14:26 in my shell config. And what I mean by that is, there's nothing in there that's proprietary, but when you have a weird problem of trying to figure out the best way to figure out which Ruby process is stealing all the CPU, great. Turns out that you can chain seven or eight different shell commands together

Starting point is 00:14:42 through a bunch of pipes. I don't want to remember that forever. So that's the sort of thing I would wind up committing as I learned it. I don't remember what company I picked that up at, but it was one of those things that was super helpful. I have a sarcastic, it's a one-liner, except no sane editor setting

Starting point is 00:14:58 is going to show it in any less than three lines, of a whole bunch of Perl piped into du, piped into the rest, that tells you what are the largest consumers of files in a given part of the system, and it rates them with stars, and it winds up doing some neat stuff. I would never sit down and reinvent something like that today, but the fact that it's there means that I can do all kinds of neat tricks when I need to. It's making sure that as you move through your career on some level, you're picking up skills that are repeatable and applicable beyond one company. Skills and tooling. Yeah. Right. Like you just described

Starting point is 00:15:31 a tool. Another SREcon talk was John Allspaw and Dr. Richard Cook talking about above the line, below the line. And they started with these metaphors about tools, right? Showing all the different kinds of hammers. And if you're a blacksmith, a lot of times you craft specialized hammers for very specific jobs. And that's one of the properties of a tool that they were trying to get people to think about, is that tools get crafted to the job. And what you just described is a bespoke tool that you had created on the fly that kind of floated under the radar of intellectual property. So let's not tell the security or IP people, right? Like, cause there's probably billions and billions of dollars of technically like made up IP value.
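As an illustrative aside to the disk-usage tool described above: the sketch below is not the Perl-and-du one-liner from the conversation, which is never shown. It is a minimal Python approximation of the same idea, assuming "largest consumers" means the total size of regular files under each immediate child of a directory, and that the star rating is scaled against the largest entry. The function names `dir_sizes` and `rate_with_stars` and the ten-star scale are hypothetical choices made for the example.

```python
import os


def dir_sizes(root: str) -> dict[str, int]:
    """Total bytes of regular files under each immediate child of root.

    Assumption for this sketch: symlinks are skipped rather than followed.
    """
    sizes: dict[str, int] = {}
    for entry in os.scandir(root):
        if entry.is_symlink():
            continue
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
        elif entry.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        if not os.path.islink(path):
                            total += os.path.getsize(path)
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            sizes[entry.name] = total
    return sizes


def rate_with_stars(sizes: dict[str, int], max_stars: int = 10) -> list[tuple[str, int, str]]:
    """Sort entries largest-first and attach a star bar scaled to the largest."""
    if not sizes:
        return []
    biggest = max(sizes.values())
    rows = []
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        stars = round(max_stars * size / biggest) if biggest else 0
        rows.append((name, size, "*" * stars))
    return rows
```

Calling `rate_with_stars(dir_sizes("."))` and printing each row would give a rough "who is eating this directory" view. Keeping the rating separate from the filesystem walk means the scoring can be checked without touching a disk.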

Starting point is 00:16:11 I'm doing air quotes with my fingers. You know, that's just basically people's shell profiles and my God, the Emacs automation that people have done. If you've ever really seen somebody who's amazing at Emacs and has 10, 20, 30, maybe 40 years of experience encoded in their Emacs settings, it's a wonder to behold. I'd look at it and I'd go, man, I wish I could do that. It's like listening to a really great guitar player and be like, wow, I wish I could play like them. You see them just flying through stuff, but all that IP in there is both that person's

Starting point is 00:16:42 collection of wisdom and experience and working with that code, but also encodes that stuff like you described, right? Which is all these little systems tricks and little fiddly commands and things we don't want to remember. And so we encode them into our tool set. Oh, yeah. Anything I wound up taking, I always would share it with people internally, too. I mentioned, yeah, I'm keeping this in my shell files because I just disclosed it, which solves a lot of the problem. And also, none of it was even close to proprietary or anything like that. I'm sorry, but the way that you wind up figuring out how much of a disk is being eaten up and where in a more pleasing way is not a competitive advantage. It just isn't. It isn't to you or me, but back at the beginning of our

Starting point is 00:17:21 careers, people thought it was worth money and should be proprietary. Like, oh, that disk checking script is a competitive advantage for our company because there were only a few of us doing this work. It was actually being able to actually manage your servers was a competitive advantage. Now it's kind of commodity. Let's also be clear that the world has moved on. I wound up buying DaisyDisk a while back for Mac, which I love. It is a fantastic,

Starting point is 00:17:48 pretty, effective "where's all the stuff on your disk going?" tool. It does a scan and you can drag and collect things and delete them when you're trying to clean things out. I was using it the other day, so it's top of mind at the moment, but it's way more polished than that crappy Perl three-liner. And I see, I see both sides.

Starting point is 00:18:04 Truly, I do. The trick also, for those wondering in their own career, like, where is the line? It's super easy. Disclose what you're doing in those scenarios. In the event someone says no, because they believe that finding the right man page section for something is somehow proprietary, great. When you go home that evening in a completely separate environment, build it yourself from scratch to solve the problem, reimplement it, and save that, and you're done. There are lots of ways to do this. Don't steal from your employer, but your employer employs you. They don't own you.

Starting point is 00:18:33 And the way that you think about these problems, every person I've met who has had a career that's longer than 20 minutes has a giant doc somewhere on some system of all of the scripts that they wound up putting together, all of the one-liners, the notes on "next time you see this, this is the thing to check." Yeah. The cheat sheet or the notebook with all the little commands or, again, the Emacs config sometimes for some people or shell profiles. Here's the awk one-liner that I put that automatically spits out from an Apache log file. What, sorry, HTTPD log file that just tells me

Starting point is 00:19:07 what are the most frequent talkers and what are the- You should probably let go of that one. You know, like, I think that one's lifetime is kind of past, Corey. Maybe you just- I just have to get to work with Nginx and we're good to go. Oh yeah, there you go. Or S3 access logs, perish the thought. But yeah, like, what are the five most high volume talkers

Starting point is 00:19:24 and what are those relative to each other? Huh, that one thing seems super crappy and it's coming from Russia, but that's, hmm, one starts to wonder, maybe it's time to dig back in. So one of the things that I have found is that a lot of the people talking about SRE seem to have descended from an ivory tower somewhere.
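As an illustrative aside to the log one-liner mentioned above: the awk command itself is never shown in the conversation, so the following is only a Python sketch of the same question, "who are the five highest-volume talkers, and how do they compare to each other?" It assumes each log line starts with the client address followed by whitespace, as in the common and combined access log formats; other layouts would need a different field. The function name `top_talkers` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def top_talkers(lines: Iterable[str], n: int = 5) -> list[tuple[str, int, float]]:
    """Return (client, hits, share_of_all_requests) for the n busiest clients.

    Assumption for this sketch: the client address is the first
    whitespace-separated field on each line.
    """
    counts: Counter[str] = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:  # ignore blank lines
            counts[fields[0]] += 1
    total = sum(counts.values())
    return [(client, hits, hits / total) for client, hits in counts.most_common(n)]
```

Passing an open file works directly, for example `top_talkers(open(path))` for some log path of your choosing. The share column is what makes "relative to each other" visible: one client at 0.75 of all traffic stands out immediately.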

Starting point is 00:19:42 And they're talking about how some of the best in class companies out there, renowned for their technical cultures, at least externally, are doing these things. But there's a lot more folks who are not there. And honestly, I consider myself one of those people who is not there. I was a competent engineer, but never a terrific one. And looking at the way this was described, I often came away thinking, okay, was the purpose of this conference talk just to reinforce how smart people are and how I'm not? And/or, well, there are the 18 cultural changes you need to make to your company, and then you can do something kind of like we were

Starting point is 00:20:20 just talking about on stage. It feels like there's a combination of problems here. One is making this stuff more accessible to folks who are not themselves in those environments. And two, how to drive cultural change as an individual contributor, if that's even possible. And I'm going to go out on a limb and guess you have thoughts on both aspects of that, and probably some more. Hit me, please. So the ivory tower, right? Let's just be straight up. Like the ivory tower is Google. I mean, that's where it started.

Starting point is 00:20:50 We get it from the other large companies that want to do conference talks about what this stuff all means and what it does. What I've kind of come around to in the last couple of years is that those talks don't really reach the vast majority of engineers. They don't really apply to a large swath of the enterprise, especially, which is like

Starting point is 00:21:09 where a lot of the bulk of our industry sits, right? We spend a lot of time talking about the darlings out here on the West Coast and in high tech culture and startups and so on. But like we were talking about before we started the show, right? Like the interior of even just America is filled with all these like insurance and banks and all of these companies that are cranking out tons of code and servers and stuff. And they're trying to figure out these same problems. But they're structured in companies where their tech arm is still, in most cases, considered a cost center. Often is bundled under finance for that's a whole show

Starting point is 00:21:46 of itself about that historical blunder. And so the tech cultures tend to be very, very different from what we experience in, what do we call it anymore? Like, I don't even want to say West Coast anymore because we've gone remote, but, like, high-tech culture, we'll say. And so, like, thinking about how to make SRE and all this stuff more accessible comes down to, like, thinking about who those engineers are that are sitting at the computers writing all the code that runs our banks, all the code that makes sure that... I'm trying to think of examples that are more enterprisey, right? Or shoot, buying clothes online. You go to Macy's, for example. They have a whole bunch of servers that run their online store and stuff. They have internal IT-ish people who keep all this stuff running and write that code and probably integrating open source stuff, much like we all do.

Starting point is 00:22:32 But when you go to try to put in a reliability program that's based on the current SRE models, like SLOs, you put in SLOs and you start doing this incident management program that's like, you have a form you fill out after every incident and then you make developers write retros. And it turns out that those things are very high-level skills and capabilities in an organization. And so when you have this kind of IT mindset or the enterprise mindset, bringing the culture together to make those things work often doesn't happen because, you know, they'll go with the prescriptive model and say like, OK, we're going to implement SLOs. We're going to start measuring SLIs on

Starting point is 00:23:09 all of the services, and we're going to hold you accountable for meeting those targets. If you just do that, you're just doing more gatekeeping and policing of your tech environment. My bet is reliability almost never improves in those cases. That's been my experience too. Why I get charged up about this is, if you just go slam in these practices, people end up miserable. The practices then become tarnished because people experience the worst version of them. And then... With the remote explosion as well, it turns out that changing jobs basically means that your company sends you a different Mac and the next Monday you wind up signing into a different

Starting point is 00:23:43 Slack team. Yeah. So the culture really matters, right? You can't cover it over with foosball tables and great lunch. You actually have to deliver tools that developers want to use. And you have to deliver a software engineering culture that brings out the best in developers instead of demanding the best from developers. I think that's a fundamental business shift that's kind of happening. That's if I'm putting on my wizard hat and looking into the future and dreaming about what might change in the world, right? Is that there's kind of a change in how we do leadership

Starting point is 00:24:13 and how we do business that's shifting more towards that model where we look at what people are capable of and we trust in our people and we get more out of them, the knowledge work model. If we want more knowledge work, we need people to be happy and to feel engaged in their community. And all of a sudden, we start to see these kind of generational, bigger pie kind of things start to happen.

Starting point is 00:24:33 But how do we get there? It's not SLOs. It maybe is a little bit starting with incidents. That's where I've had the most success. And you asked me about that. So getting practical, incident management is probably- Right, well, as I see it, the problem with SLOs across the board is it feels like it's a very insular community so far. And communicating it to engineers seems to be the focus of where the community has been.

Starting point is 00:24:55 But from my understanding of it, you absolutely need buy-in at significantly high executive levels to, at the very least, buy you air cover while you're doing these things and making these changes, but also to help drive that cultural shift. None of this is something I have the slightest clue how to do. Let's be very clear. If I knew how to change a company's culture, I'd have a different job. Yeah. The biggest omission in the Google SRE books was Erz.

Starting point is 00:25:20 There's a guy at Google named Erz who owns availability for Google. And when anything is like in dispute and bubbles up the management chain, it goes to Erz and he says, thou shalt, right? Makes the call. And that's why it works, right? Like that's, it's not just that one person, but that system of management where the whole leadership team, there's a large, very well-funded team with a lot of power in that organization that can drive availability. And they can say, this is how you're going to do metrics for your service. And this is the system that you're in. And it's kind of, yeah, sure, it works for them because they have all the organizational support in place. What I was saying to my team just the

Starting point is 00:25:59 other day, because we're in the middle of our SLO rollout, is that really, I think an SLO program isn't about the engineers at all until late in the game. At the beginning of the game, it's really about getting the leadership team on board to say, hey, we want to put in SLIs and SLOs to start to understand the functioning of our software system. But if they don't have that curiosity in the first place, that desire to understand how well their teams are doing,

Starting point is 00:26:27 how healthy their teams are, don't do it. It's not going to work. It's just going to make everyone miserable. It feels like it's one of those difficult to sell problems as well, in that it requires some tooling changes, absolutely. It requires cultural change and buy-in and whatnot. But in order for that to happen, there has to be a painful problem that a company recognizes and is willing to pay to make go away. The problem with stuff like this is that once you pay, there's a lot of extra work that goes on top of it as well that does not have a perception, rightly or wrongly, of contributing to feature velocity, of hitting the next milestone. It's really, so we're going to be spending how much money to make engineers happier? They should get paid an awful lot and they're

Starting point is 00:27:10 still complaining and never seem happy. Why do I care if they're happy other than the pure mercenary perspective? Otherwise they'll quit. I'm not saying that it's not worth pursuing. It's not a worthy goal. I am saying that it becomes a very difficult thing to wind up selling as a product. Well, as a product, for sure, right? Because, gosh, I have friends in this space who work on these tools, and I want to be careful. Of course. Nothing but love for all of those people, let's be very clear. But a lot of them, you know, they're pulling metrics from existing monitoring systems. They are doing some interesting math on them. But what you get at the end is a nice service catalog and dashboard, which are things we've been trying to land as products in this industry for as long as I can remember. And we've got it this time, though.

Starting point is 00:27:54 This time we'll crack it up. Yeah. Get off the island, Gilligan. And then the other risky thing, right, is the other part that makes me uncomfortable about SLOs and why I will often tell folks that I talk to out in the industry that are asking me about this, like one-on-one, should I do it here? And it's like, you can bring the tool in. And if you have a management team

Starting point is 00:28:12 that's just looking to have metrics to drive productivity instead of trying to drive better knowledge work, what you get is just a fancier version of more Taylorism, which is basically scientific management, this idea that we can like drive workers to maximum efficiency by measuring random things about them and driving those numbers. It turns out that doesn't really work very well, even in industrial scale. It just happened to work because, you know, we have a bloody enough society that we push people into it. But the reality is, is if you implement SLOs badly, you get more really bad Taylorism

Starting point is 00:28:46 that's bad for your developers. And my suspicion is that you will get worse availability out of it than you would if you just didn't do it at all. This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O. It means I reveal. Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to basically those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes

Starting point is 00:29:38 but isn't limited to talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up screening all of their talent on English ability as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States. So you can hire full-time remote engineers who share most of the workday with your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io slash screaming to get 20% off your

Starting point is 00:30:33 first three months. That's R-E-V-E-L-O dot I-O slash screaming. That is part of the problem is in some cases to drive some of these improvements, you have to go backwards to move forwards. And it's one of those great, so we spent all this effort and money and the rest and now things are by Gene Kim, has been the fact that companies had these problems and actively cared enough to change it. In my experience, that feels a little on the rare side. Yeah. And I think that's actually the key, right? Is for the culture change and for like, if you're really looking to be like, do I want to work at this company? Am I investing myself in here? Is look at the leadership team and be like, do these people actually give a crap? Are they looking just to punt another number down the road?

Starting point is 00:31:29 That's the real question, right? Like the technology and stuff, at the point where I'm at my career, I just don't care that much anymore. I just, fine, use Kubernetes, use Postgres, MySQL. I don't care. I just don't. Like Oracle, I might have to ask, you know, go to finance and be like,

Starting point is 00:31:42 hey, can we spend 20 million for a database? But like, nobody really asks for that anymore. So. As with us, I will say that I mostly agree with you, but a technology that I found myself getting excited about, given the time of the recording on this is fun. I spent a bit of time yesterday from when we're recording this, teaching myself just enough Go to wind up beating together a binary that I needed to do something actively ridiculous for my camera here. And I found myself coming away

Starting point is 00:32:11 deeply impressed by a lot of things about it, how prescriptive it was for one, how self-contained for another. And after spending far too many years of my life writing shitty Perl and shitty Bash and worse Python, et cetera, et cetera, the prescriptiveness was great. The fact that it wound up giving me something I could just run, I could cross-compile for anything I needed to run it on, and it just worked. It's been a while since I found a technology that got me this interested in exploring further.

Starting point is 00:32:40 Go is great for that. You mentioned one of my two favorite features of Go. One is usually when a program compiles, at least the way I code in Go, it usually works. I've been working with Go since about 0.9, just a little bit before it was released as 1.0. And that's what I've noticed over the years of working with it, is that most of the time, if you have a pretty good data structure design and you get the code to compile, usually it's going to work, unless you're doing weird stuff. The other thing I really love about Go and that maybe you'll discover over time is the malleability of it. And the reason why I think about that more than

Starting point is 00:33:14 probably most folks is that I work on other people's code most of the time. And maybe this is something that you probably run into with your business too, right? Where you're working on other people's infrastructure. And the way that we encode business rules and things in the languages, in our programming language or our config syntax and stuff, has a huge impact on folks like us and how quickly we can come into a situation, assess, figure out what's going on, figure out where things are laid out, and start making changes with confidence. Forget other people for a minute there. Looking at what I built out three or four years ago here myself, like I look at past me, it's like, what was that rat bastard thinking? This is awful.

Starting point is 00:33:51 And it's forget other people's code. Hell is your own code on some level too. Once you've, once it's slipped out of the mental stack and you have to re-explore it and, oh, well, thank God I defensively wound up not including any comments whatsoever explaining what the living hell this thing was. It's terrible. But you're right. The other people's shell scripts are finicky and odd.

Starting point is 00:34:12 I started poking around for help when I got stuck on something by looking at GitHub and a bit of searching here and there. Even these large, complex, well-used projects started making sense to me in a way that I very rarely find. It's, what the hell is that thing, is my most common refrain when I'm looking at other people's code, and Go, for whatever reason, avoids that. I think because it is so prescriptive about formatting, about how things should be done, about the vision that it has. Maybe I'm romanticizing it and I'll hate it in a week from now and I'll want to go back and remove this recording. The size of the language helps a lot, but probably my favorite, it's more of a convention, which is actually funny the way I'm going to talk about this, because the two languages I work on

Starting point is 00:34:51 the most right now are Ruby and Go. And I don't feel like two languages could really be more different. Syntax-wise, they share some things, but really like the mental models are so very, very different. Ruby is all the way in on object-oriented programming and the actual real kind of object-oriented with messaging and stuff. And the whole language kind of springs from that. And it kind of requires you to understand all of these concepts very deeply to be effective in large programs. So what I find is when I approach a Ruby code base, I have to load all this crap into my head and remember, okay, so yeah, there's this convention when you do this kind of thing in Ruby, or especially Ruby on Rails is even worse because they go deep into convention over configuration. But what that's code for is this code is accessible

Starting point is 00:35:35 to people who have a lot of free cognitive capacity to load all this convention into their heads and keep it in their heads so that the code looks pretty. Right. And so that's the trade-off is you've said, okay, my developers have to be these people with all these spare brain cycles to understand like why I would put the code here in this place versus this place. And all these like things that are in the code, like very compact, dense concepts. And then you go to something like Go, which is like, nah, we're not going to do lambdas. Nah, we're not doing all this fancy stuff. So everything is there on the page. This drives some people crazy, right? Is that there's all this boilerplate, boilerplate, boilerplate, but

Starting point is 00:36:17 the reality is I can read most Go files from top to the bottom and understand what the hell it's doing, whereas I can go sometimes look at like a Ruby thing or sometimes Python and even Perl is just common all the time, right? Is there's so much indirection? And it'd just be like, what the fuck is going on? This is so dense. I'm gonna have to sit down and write it out in longhand so I can understand what the developer was even doing here.

Starting point is 00:36:38 And- Well, that's why I got the Mac Studio for when I'm not doing AV stuff with it. That means that I'll have one core that I can use for front-end processing and the rest, and the other 19 cores can be put to work, failing to build Nokogiri and Ruby yet again. I remember the

Starting point is 00:36:52 travails of working with Ruby, and the problem, I have similar problems with Python, specifically, in that, I don't know if I'm special like this. It feels like it's an SRE, DevOps style of working, but I am grabbing random crap off of GitHub constantly and running it, like small scripts other people have built. And let's be clear, I run them on my test AWS

Starting point is 00:37:12 account that is nothing important because I'm not a fool and I read most of it before I run it. But I also, it wants a different version of Python every single time. It wants a whole bunch of other things too. And okay, so I use asdf as my version manager for these things, which for whatever reason does not work for the way that I think about this ergonomically. Okay, great. And I wind up with detritus scattered throughout my system. It's, hey, can you make this reproducible on my machine? Almost certainly not, but thank you for asking. It's like step 17, master the wolf level of instructions. And I think Docker generally papers over the worst of it, right? Is when we built all this stuff in the aughts,

Starting point is 00:37:52 you know, CPAN. Dev containers and VS Code are very nice. Yeah, yeah. You know, like we had CPAN back in the day. I was doing chroots, I think in like 04 or 05, you know, to solve this problem, right? Which is basically, I just, screw it. I will compile an entire distro into a directory with a Perl and all of its dependencies so that I can isolate it from the other things I want to run on this machine and not screw up and not have these interactions. And I think that's kind of what you're talking about

Starting point is 00:38:16 is like the old model, when we deployed servers, there was one of us sitting there and we'd log into the server and be like, okay, I'm going to install the Perl. I'll, you know. I'll compile it into slash opt, slash perl, 5.8, whatever. And then I'll CPAN all the stuff in. I'll give it over to the

Starting point is 00:38:32 developer, tell him to set the shebang to that, and everything just works. And now we're in a mode where it's like, okay, you've got to set up a thousand of those. Okay, well, I'll make a tarball. But it's still like... DevOps is about making the dev closer to ops. You're interrelating all the time.

Starting point is 00:38:47 Yeah, and then Docker comes along and dev's like, well, here's the container. Good luck, asshole. And it feels like it's been cast into your yard to worry about. Yeah, well, I mean, that's just kind of business or just, I'm not sure if it's business or capitalism or something like that,

Starting point is 00:39:02 but just the idea that, you know, if I can hand off the shitty work to some other poor schlub, why wouldn't I? I mean, that's most folks, right? Like, just be like, well, I got it working. Like, my part is done. I did what I was supposed to do. And now there's a lot of folks out there. That's how they work, right?

Starting point is 00:39:18 I hit done. I'm done. I shipped it. Sure, it's an old-ass Ubuntu. Sure, there's a bunch of shell scripts that rip through things. Sure, you know, like, I've worked on repos where there's hundreds of things that need to be addressed. And passing it to someone else is fine.

Starting point is 00:39:32 I'm thrilled to do it. Where I run into problems with it is where people assume that, well, my part was the hard part, and anything you schlubs do is easy. Well, that's the underclass. Forget engineering for a second. I throw things to the people over in the finance group here at the Duckbill Group because those people are wizards at solving for this thing, and that's how we want to do things. Yeah, specialization

Starting point is 00:39:54 works. But we have this, it's probably more cultural. I don't want to pick, like, capitalism to beat on, because this is really like a human cultural thing, and it's not even really particularly Western, is the idea that, like, if I have an underclass, why would I give a shit what their experience is? And this is why I say, like, ops teams, like, get out of here, because most ops teams, the extant ops teams that are still called ops, and a lot of them have been renamed SRE but they still do the same job, are an underclass. And I don't mean that those people are below us. People are treated as an underclass and they shouldn't be. Absolutely not. Because the idea is that, well, I'm a fancy person who writes code up in my ivory tower and then it all flows down and those people, just faceless people, do the deployment stuff that's beneath me. Bad attitude is the most toxic thing, I think,

Starting point is 00:40:44 in tech orgs to address. Like if you're trying to be like, well, our reliability is bad. We have security problems. People won't fix their code. And go look around and you will find people that are treated as an underclass that are given code thrown over the wall at them. And then they just have to toil through and make it work. I've worked on that a number of times in my career. And I think just like saying underclass, right, or a caste system is what I found is the most effective way to get people actually thinking about what the hell is going on here. Because most people are just like, well, this is just the way things are. This is how we've always done it. The developers write the code, they give it

Starting point is 00:41:16 to the sysadmins, the sysadmins deploy the code. Isn't that how it always works? You'd really like to hope, wouldn't you? Not me. Again, the way I see it is, in theory, in theory, sysadmins, ops, all that should not exist. People should theoretically be able to write code as developers that just works, the end, and is correct the first time, and never have to change it again. Yeah. There's a reason that I always like to call staging environments in places I work "theory," because it works in theory, but not in production. And that is fundamentally, that entire job role is the difference between theory and practice. over multiple strands of glass and digital transcodings and things right now, right? Like we are detached from the physical reality. You mentioned earlier working in data centers, right? The thing I miss about it is like the physicality of it.

Starting point is 00:42:11 Like actually like I held a server in my arms and put it in the rack and slid it into the rails. I plugged in the power myself. I pushed the power button myself. There's a server there. I physically touched it. Developers who don't work in production, we talk about empathy and stuff, but really I think the big problem is when they work out in their idea space and just writing code, they write their unit tests. If we're very lucky, they'll write a functional test, and then they hand that wad off to some poor ops group. They're detached from the reality of operations.

Starting point is 00:42:42 It's not even about accountability. It's about experience. The ability to see all of the weird crap we deal with, right? You know, like, well, we pushed the code to that server, but there were three bit flips, so we had to do it again. And then the other server, the disk failed. And on the other server, you know, there's all this weird crap that happens.

Starting point is 00:43:00 These systems are so complex that they're always doing something weird. And if you're a developer that just spends all day in your IDE, you don't get to see that. And I can't really be mad at those folks as individuals for not understanding our world. I have to figure out how to help them. And the best thing we've come up with so far is like, well, we start giving them some responsibility in the production environment so that they can learn that. People do that again is another one that can be done wrong where it turns into kind of a forced empathy. I actually really hate that mode

Starting point is 00:43:28 where it's like, we're forcing all the developers online, whether they like it or not, you know, on call, whether they like it or not, because they have to learn this. And it's like, you know, maybe slow your roll, little buddy,

Starting point is 00:43:38 because the stuff is actually hard to learn. Again, minimizing how hard ops work is, oh, we'll just put the developers on it. They'll figure it out, right? They're software engineers. They're probably smarter than you sysadmins is the unstated thing when we do that, right? When we throw them in the pit and be like, yeah, they'll get it. And that was my problem with being asked to do the interview stuff.

Starting point is 00:43:58 It was in the write code on a whiteboard. It's, look, I understood how the system fundamentally worked under the hood. Being able to power my way through to get to an outcome, even in a language I don't know, is sort of part and parcel of the job. But this idea of doing it in an artificially constrained environment in a language I'm not super familiar with off the top of my head, it took me years to get to a point of being able to do it with a bash script, because who ever starts with an empty editor and starts getting to work in a lot of these scenarios, especially in an ops world where we're not building something from scratch? That's the interesting thing, right? In the majority of tech work today,

Starting point is 00:44:30 maybe 20 years ago, we did it more because we were literally building the internet we have today. But today, most of the engineers out there working, most of us working stiffs, are working on stuff that already exists. We're making small incremental changes, which is great if that's what we're doing. And we're dealing with old code. We're gluing APIs together, and that's fine. I really want to thank you for taking so much time to talk to me about how you see all these things. If people want to learn more about what you're up to, where's the best place to find you? I'm on Twitter every once in a while as MissAmyTobey.

Starting point is 00:45:01 M-I-S-S-A-M-Y-T-O-B-E-Y. I have a blog I don't write on enough, and there's a couple things on the Equinix Metal blog that I've written. So if you're looking for that, otherwise, mainly Twitter. And those links will, of course, be in the show notes. Thank you so much for your time. I appreciate it.

Starting point is 00:45:17 I had fun. Thank you. As did I. Amy Tobey, Senior Principal Engineer at Equinix. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, or on the YouTubes, smash the like and subscribe buttons, as the kids say. Whereas if you've hated this episode, same thing, five-star review, all the

Starting point is 00:45:41 platforms, smash the buttons, but also include an angry comment telling me that you're about to wind up subpoenaing a copy of my shell script because you're convinced that your intellectual property and secrets are buried within. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
