Screaming in the Cloud - Using SRE to Solve the Obvious Problems with Laura Nolan
Episode Date: December 12, 2023

Laura Nolan, Principal Software Engineer at Stanza, joins Corey on Screaming in the Cloud to offer insights on how to use SRE to avoid disastrous and lengthy production delays. Laura gives a rich history of her work with SREcon, why her approach to SRE is about first identifying the biggest fire instead of toiling with day-to-day issues, and why the lack of transparency in systems today actually hurts new engineers entering the space. Plus, Laura explains to Corey why she dedicates time to work against companies like Google who are building systems to help the government (inefficiently) select targets during wars and conflicts.

About Laura

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal Engineer at Stanza, where she is building software to help humans understand and control their production systems. Laura also serves as a member of the USENIX Association board of directors. In her copious spare time after that, she volunteers for the Campaign to Stop Killer Robots, and is half-way through the MSc in Human Factors and Systems Safety at Lund University. She lives in rural Ireland in a small village full of medieval ruins.

Links Referenced:
Company Website: https://www.stanza.systems/
Twitter: https://twitter.com/lauralifts
LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
My guest today is someone that I have been low-key annoying to come on to this show for years.
And finally, I have managed to wear her down. Laura Nolan is a principal software
engineer over at Stanza. At least, that's what you're up to today, last I've heard. Is that right?
That is correct. I'm working at Stanza, and I don't want to go on and on about my startup,
but I'm working with Niall Murphy and Joseph Brunas and Matthew Girard and a bunch of other people who've more recently
joined us. We are trying to build a load management SaaS service. So we're interested in
service observability out of the box, knowing if your critical user journeys are good or bad
out of the box, and being able to prioritize your incoming requests by what's most critical
in terms of visibility to your customers. So an emerging space, not in the Gartner Group magic circle yet, but I'm sure at some point.
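The request-prioritization idea Laura describes can be sketched as a toy load shedder: when capacity is tight, admit the most business-critical requests first and shed the rest. This is an illustration only, not Stanza's implementation; the journey names, priority tiers, and capacity number are all invented.

```python
# Toy priority-based load shedder. Lower priority number = more critical.
# The journeys and tiers here are invented for illustration.
PRIORITY = {"checkout": 0, "search": 1, "analytics": 2}

def admit(requests, capacity):
    """Return (served, shed): the most critical requests up to capacity."""
    ranked = sorted(requests, key=lambda r: PRIORITY.get(r["journey"], 99))
    return ranked[:capacity], ranked[capacity:]

served, shed = admit(
    [{"journey": "analytics", "id": 1},
     {"journey": "checkout", "id": 2},
     {"journey": "search", "id": 3}],
    capacity=2,
)
# checkout and search are served; analytics is shed
```

In a real system the priority would come from which critical user journey a request belongs to, and the capacity from live observations of the downstream service, which is roughly the "out of the box" visibility being described.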
It is surreal to me to hear you talk about your day job, because for, it feels like,
the better part of a decade now, Laura, Laura, oh, you mean Usenix, Laura, because you are on
the Usenix board of directors. And in my mind, that is what is
always shorthanded to what you do. It's, oh, right. I guess that isn't your actual full-time job.
It's weird. It's almost like seeing your teacher outside of the elementary school. You figure that
they fold themselves up in the closet there when you're not paying attention. I don't know what
you do when SRECon is not in process. I assume you just sit there and wait for the next one, right?
Well, no, we've run four of them in the last year,
so there hasn't been very much waiting, I'm afraid.
Everything got a little bit smooshed up together during the pandemic,
so we've had a lot of events coming quite close together.
But no, I do have a full-time day job.
The work I do with Usenix is just as a volunteer,
so I'm on the board of directors, as you say,
and I'm on the steering committee for all of the global SREcon events, and typically serve on the program committee as
well. And I serve there annoying the chairs to, hey, do your thing on time. Very much like
an elementary school teacher, as you say. I've been a big fan of Usenix for a while.
One of the best interview processes I ever saw was closely aligned with evaluating candidates
along the
Usenix SAGE levels to figure out what level of seniority they're at in different areas. And it was
always viewed through the lens of in what types of consulting engagements will the candidates shine
within, not the idea of, oh, are you good or are you crap? And spoiler, if I'm asking the question,
I'm of course defaulting myself to good and you to crap. Like the terrible bespoke artisanal job interview process that so many companies do. I love how this company had
built this out. And I asked them about it. Oh yeah, it comes back to the Usenix Sage things.
That was one of my first encounters with what Usenix actually did. And the more I learned,
the more I liked. How long have you been involved with the group?
Relatively short period of time. I think I first got involved with Usenix in around 2015, going to LISA, and then going on to SREcon. It was all by accident, of course. I fell onto the SREcon program committee somehow because I was around, and then because I was still around and doing stuff, I eventually, you know, got co-opted into chairing and onto the steering committee and so forth. And, you know, it's like everything volunteer.
I mean, people who stick around and do stuff tend to be kept around. But Usenix is quite
important to me. We have an open access policy, which is something that I would like to see a
whole lot more of. You know, we put everything right out there for free as soon as it is ready.
And we are constantly plagued by
people saying, hey, where is my SREcon video? The conference was like two weeks ago. And we're like,
no, we're still processing the videos. We'll be there. They'll be there. We've had people
like literally offered to pay extra money to get the videos sooner. But we're like, we are open
access. We are not keeping the videos away from you. We just aren't ready yet. So I love the open access policy. And I think that what I like about it more than anything else is
the fact that we are staunchly non-vendor. We're non-technology specific and we're non-vendor.
So it's not like, say, AWS reInvent, for example, or any of the big cloud vendor conferences, we are picking vendor-neutral content by quality.
And as well, anyone who's ever sponsored SREcon
or any of the other events will also tell you
that that does not get you a talk in the conference program.
So the content selection is completely independent.
And in fact, we have a complete Chinese wall
between the sponsorship organization
and the content organization.
So, I mean, I really like how we've done that. I think as well, it's for a long time been one of
the family of conferences or organizations of conferences that has had the best diversity.
Not perfect, but certainly better than it was. Although very, very unfortunately, I see
conference diversity everywhere going down after the pandemic, which is particularly gender diversity, which is a real shame.
I've been a fan of the SREcon conferences since the SREcon EMEA talk that I co-presented with John Looney,
which was fun because he and I met in person for the first time three hours beforehand,
beat together our talk, then showed up to an hour beforehand, found there would be no
confidence monitor, went away for the next 45 minutes and basically loaded it all into
short-term cache and gave a talk that we could not repeat if we had to for a million dollars, just because it
was so, you're throwing the ball to your partner on stage and really hoping they're going to be
able to catch it. And it worked out. It was an anger subtext translator skit for a bit, which
was fun. All the things that your manager says, but actually means, you know, the fun sort of
approach. It was zany, ideally had some useful takeaways to it,
but I loved the conference.
That was one of the only SREcons
that I found myself not surprised to discover
was coming to town the next week.
Because for whatever reason,
there's presumably a mailing list that I'm not on somewhere
where I get blindsided by,
oh yeah, hey, didn't you know SREcon is coming up?
There's probably a notice somewhere
that I really should be paying attention to, but on the plus side, I get to be delightfully surprised every
time. Indeed. And hopefully you'll be delightfully surprised in March 2024. I believe it's the 18th
to the 20th when SRECon will be coming to town in San Francisco, where you live.
So historically, in addition to the work with Usenix, which is, again, not your primary
occupation most days, you spent over five years at Google, which of course means that you have
strong opinions on SRE. I know that that is a bit dated, where the gag was always, it's only called
SRE if it comes from the Mountain View region of California. Otherwise, it's just sparkling DevOps.
But for the initial take of
a lot of the SRE stuff was, here's how to work at Google. It has progressed significantly beyond
that to the point where companies who have SRE groups are no longer perceived incorrectly as,
oh, we just want to be like Google or we hired a bunch of former Google people.
But you clearly have opinions to this. You've contributed to multiple books on SRE. You have spoken on it at length. You have enabled others to speak on it
at length, which in many ways is by far the better contribution. You can only go so far
scaling yourself, but scaling other people, that has a much better multiplier on it, which feels
almost like something an SRE might observe. It is indeed something an SRE might observe.
And also, you know, good catch,
because I really felt you were implying there
that you didn't like my book contributions.
Ah, the shock.
No, to be clear, I meant it,
because I was going to say that strictly speaking,
books are also a great one-to-many multiplier,
because it turns out you can only shove so many people
into a conference hall,
but books have this ability to just carry your words
beyond the room that you're in, in a way that video just doesn't seem to.
Ah, but open access video that is published on YouTube, like six weeks later. That scales.
I wish. People say they want to write a book and I think they're all lying. I think they want to
have written a book. That's my philosophy on it. I do not understand people who've written a book.
Like, so what are you going to do now? I'm going to write another book. Okay. I'm going to smile, not take my eyes
off you for a second and back away slowly. Cause I do not understand your philosophy on that,
but you've worked on multiple books with people. I actually enjoy writing. I enjoy the process of
it because I always, I always learn something when I write. In fact, I learned a lot of things when
I write and I enjoy that crafting. I will say I do not enjoy having written things because for me,
any achievement, once I have achieved it, is completely dead. I will never think of it again
and I will think only of my excessively lengthy to-do list. So I clearly have problems here.
But nevertheless, it's exactly the same with programming projects, by the way.
But back to SRE.
We were talking about SRE.
SRE is 20 now.
SRE can almost drink alcohol in the US.
And that is crazy.
So 2003 was the founding of it then?
Yes.
Yeah, I can do simple arithmetic in my head still.
I wondered how far my math skills had atrophied.
Yes, good job.
Yes, apparently invented in roughly 2003.
So the, I mean, from what I understand,
Google's publishing of the 20 years of SRE at Google, they have, in the absence of an actual
definite start date, they've simply picked Ben Treynor's start date at Google as the start date
of SRE. But nevertheless, discipline about 20 years old. So is it all grown up? I mean,
I think it's become heavily commodified. My feeling about SRE is that it's always been this, I mean, you said it earlier, like it's about, you know, how do
I scale things? How do I optimize my systems? How do I intervene in systems to solve problems, to
make them better, to see where we're going to be in pain in six months and work to prevent that.
That's kind of SRE work to me, you know, figure out where the problems are, figure out good ways to intervene and to improve. But there's a lot of SRE as bureaucracy around at
the moment where people are like, well, we're an SRE team. So, you know, you will have your
SLO golden signals and you will have your production readiness checklists, which will
be the things that we say, no matter how different your system is from what we designed this checklist for. And that's it. We're doing SRE now. It's great. So I think we miss a lot there.
My personal way of doing SRE is very much more about thinking not so much about the day-to-day
SLO excursion type things, because not that they're not important, they are important,
but they will always be there. I always tend to spend more time thinking about how do we avoid the risk of, you know, a giant production fire that will take
you down for days or, God forbid, more than days. You know, the sort of big Roblox fire or the time
that Meta nearly took down the internet in late 2021, that kind of thing. So I think that modern SRE misses quite a lot of that. It's a little bit like,
so when BP, when they had the Deepwater Horizon disaster, on that same, very same day,
they received an award for minimizing occupational safety risks in their environment. So, you know,
it's just things like people tripping. Must've been fun the next day. Yeah, we're going to need that back.
People tripping and falling
and, you know,
hitting themselves with a hammer.
They got an award
because they were so safe.
They had very little of that.
And then this thing goes boom.
And now they've tried to pivot
into an optimization award
for efficiency.
Like we just decided to flash fry
half the sea life in the Gulf at once.
Yes, extremely efficient.
So, you know,
I worry that we're doing sre a little bit
like bp we're doing it back before deepwater horizon i should disclose that i started my
technical career as a grumpy old unix sysadmin because it's not like you ever see one of those
who's happy or young didn't matter that i was 23 years old i was grumpy and old. And I have viewed the evolution since then of I going from
calling myself a sysadmin to a DevOps engineer, to an SRE, to a platform engineer, to whatever
we're calling it this week. I still view it as fundamentally the same job in the sense that
the responsibility has not changed. And that is keep the site or environment up.
But the tools, the processes, and the techniques we apply to it have evolved.
Is that accurate? Does it sound like I'm spouting nonsense?
You're far closer to the SRE world than I ever was.
But I'm curious to get your take on that perspective.
And please, feel free to tell me I'm wrong.
No, no, I think you're completely right.
And I think one of the ways that I think has shifted, and it's really interesting, but when you and I were, when we
were young, we were, we could see everything that was happening. We were deploying on some sort of
Linux box or other sort of Unix box somewhere, most likely. And we, if we wanted, we could go
and see the entire source code of everything that our software was running on. And kids these days, they're coming up and
they are deploying their stuff on RDS and ECS. And, you know, how many layers of abstraction
are sitting between them? I run Kubernetes. That means I don't know where it runs and neither does
anyone else. It's great. Yeah. So there's no transparency anymore in what's happening. So it's very easy. You get to a point where sometimes you hit a problem
and you just can't figure it out
because you do not have a way to get into that system
and see what's happening.
Even at work, we ran into a problem
with Amazon-hosted Prometheus.
We were like, this will be great.
We'll just do that.
And we could not get some particular type
of remote write operation to work. We just
could not. Okay. So we'll have to do something else. So one of the many, many things I do when
I'm not trying to run the SREcon conference or do actual work or definitely not write a book,
I'm studying at Lund University at the moment. I'm doing this master's degree in human factors
and system safety. And one of the things I've realized since doing that
program is in tech, we missed this whole 1980s and 1990s discipline of cognitive systems theory,
cognitive systems engineering. This is what people were doing. They were like, how can people in the
control room, in nuclear plants and in the cockpit, in the airplane, how can they get along with their
systems and build a good
mental model of the automation and understand what's going on? We missed all that. We came of
age when safety science was asking questions like, how can we stop organizational failures
like Challenger and Columbia, where people are just not making the correct decisions?
And that was a whole different sort of focus. So we've missed all of this 1980s and 1990s
cognitive system stuff. And there's this really interesting idea there where you can build two
types of systems. You can build a prosthesis, which does all your interaction with the system
for you. And you can see nothing, feel nothing, do nothing. It's just this black box. Or you can
have an amplifier, which lets you do more stuff than you could do just by yourself, but lets you
still get into the details. And we build mostly prostheses; we do not build amplifiers. We're
hiding all the details, we're building these very, very opaque abstractions, and I think it's to our
detriment. I mean, it makes our life harder in a bunch of ways, but I think it also makes life
really hard for systems engineers coming up, because they just can't get into the systems as easily anymore
unless they're running them themselves.
I have to
confess that I have a
certain aversion
to aspects of SRE
and I'm feeling echoes of it
around a lot of the human factor stuff
that's coming out of that Lund program.
And I think I know what it is.
And it's not a problem with either of those things, but rather a problem with me.
I have never been a good academic.
I have an eighth grade education because school is not really for me.
And what I loved about being a systems administrator for years was the fact that it was solving puzzles every day.
I got to do interesting things.
I got to chase down problems and firefight all the time. And what SRE has represented is a step
away from that to being more methodical, to taking on keeping the site up as a discipline rather than
an occupation or a task that you're working on. And I think that a lot
of the human factors stuff plays directly into it. It feels like the field is becoming a lot
more academic, which is a luxury we never had when, holy crap, the site is down. We're going
to go out of business if it isn't back up immediately. Panic mode. I'm going to confess
here. I have three master's degrees. Three. I have problems, like I said before. I get what you
mean. You don't like when people are speaking in generalizations and sort of being all theoretical
rather than looking at the actual messy details that we need to deal with to get things done,
right? And I know what you mean. I feel it too. And I've talked about the human factors stuff
and theoretical stuff a fair bit at conferences. And what I always try to do is I always try and illustrate with the details because I think it's very easy to get away from the actual problems and spend too much time in the models and in the theory.
And I like to do both.
I will confess I like to do both.
And that means that the luxury I miss out on is mostly sleep.
But here we are. I am curious as far as what you've seen,
as far as the human factors adoption in this space, because every company for a while
claimed to be focused on blameless postmortems, but then there'd be issues that quickly turned
into a blame Steve postmortem instead. And it really feels, at least from a certain point of view, that there was a time
where it seemed to be gaining traction, but that may have been a zero interest rate phenomenon,
as weird as that sounds. Do you think that the idea of human factors being tied to
keeping systems running in a computer sense has demonstrated staying power? Are you seeing a
recession? It could be I'm just looking at headlines too much. It's a good question. There's still a lot of people interested in it.
There was a conference in Denver last February that was decently well attended for, you know,
a first initial conference that was focusing on this issue and this very vibrant Slack community,
the LFI and the learning from incidents and software community. I will say everything is
a little bit stretched at the moment in industry,
as you know, with all the layoffs and a lot of people are just,
there's definitely a feeling that people want to hunker down and do the basics
to make sure that they're not, you know,
not seen as doing useless stuff and on the line for layoffs.
But the question is, is this stuff actually useful or not?
I mean, I contend that it is.
I contend that we can learn from failures,
we can learn from what we're doing day to day,
and we can do things better.
Sometimes you don't need a lot of learning
because what's the biggest problem is obvious, right?
You know, in that case, yeah,
your focus should just be on solving your big obvious problem, for sure.
It feels there's a hierarchy of needs here on some level.
Step one, is the building currently on fire? Maybe solve that before thinking about the longer term
context and what this does to corporate culture. Yes, absolutely. And I've gone into teams before
where people are like, oh, well, you're an SRE, so obviously you wish to immediately introduce
SLOs. And I can look around me and go, nope, not the biggest problem right now. Actually,
I can see a bunch of things are on fire. We should fix those specific things. I actually personally think that if you
want to go in and start improving reliability in a system, the best thing to do is to start a weekly
production meeting if the team doesn't have that. Actually create a dedicated space and time for
everyone to be able to get together, discuss what's been happening, discuss concerns and risks,
and get all that stuff out in the open.
I think that's very useful, and you don't need to spend however long it takes to formally sit down
and start creating a bunch of SLOs,
because if you're not dealing with a perfectly spherical web service
where you can just use the golden signals,
and if you start getting into any sorts of thinking about data integrity or backups
or any sorts of asynchronous processing, these sorts of things, they need SLOs that are a lot
more interesting than your standard error rate and latency. Error rate and latency gets you so far,
but it's really just very cookie cutter stuff. But people know what's wrong with their systems,
by and large. They may not know everything that's wrong with their systems,
but they'll know the big things for sure.
Give them space to talk about it.
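The "error rate and latency get you so far" point rests on standard SLO arithmetic, which is easy to make concrete. This is a generic error-budget calculation for a plain availability SLO, not anything specific to the conversation; the target and request counts are invented for illustration:

```python
def error_budget(slo_target, total_requests, failed_requests):
    """For a simple availability SLO (e.g. 99.9% of requests succeed),
    return the number of allowed failures and the fraction of the
    error budget already burned."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0, float("inf")  # a 100% target leaves no budget at all
    return allowed_failures, failed_requests / allowed_failures

# With a 99.9% target over 1,000,000 requests, 1,000 failures are allowed,
# so 250 failures burns a quarter of the budget.
allowed, burned = error_budget(0.999, 1_000_000, 250)
```

This kind of cookie-cutter math works fine for a request-serving web service; the argument above is that data integrity, backups, and asynchronous pipelines need SLOs that this shape of calculation doesn't capture.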
Speaking of bigger things and turning into the idea of these things escaping beyond pure tech,
you have been doing some rather interesting work
in an area that I don't see a whole lot of people that I talk to communicating
about. Specifically, you're volunteering for the Campaign to Stop Killer Robots, which 10 years ago
would have made you sound ridiculous. And now it makes you sound like someone who is very
rationally and reasonably calling an alarm on something that is on our doorstep. What are you
doing over there? Well, I mean, let's be real. It sounds ridiculous because it is ridiculous. I mean,
who would let a computer fly around in the sky and choose what to shoot at? But it turns out that
there are, in fact, a bunch of people who are building systems like that. So yeah, I've been
volunteering with the campaign for about the last five years, since roughly around the time that I
left Google, in fact, because I got interested in that around about the time that Google was doing the Project Maven work,
which was when Google said, hey, wouldn't it be super cool if we took all of this DoD video
footage of drone video footage and, you know, did a whole bunch of machine learning analysis on it
and figured out where people are going all the time. Maybe we could click on this house and see like a whole timeline of people's comings and goings and which other people they are sort of
in a social network with. So I kind of said, maybe I don't want to be involved in that. And I left
Google. I found out that there was this campaign and this campaign was largely lawyers and
disarmament experts, people of that nature, philosophers, but also
a few technologists. And for me, having run computer systems for a large number of years
at this point, the idea that you would want to rely on a big distributed system running over
some janky network with a bunch of 18-year-old kids running it to
actually make good decisions about who should be targeted in a conflict seems outrageous. And I
think almost every software operations person or in fact software engineer that I've spoken to
tends to feel the same way. And yeah, there is this big practical debate about this in
international relations circles, but luckily there has just been a resolution in the UN just in the last day or two as we record this. The first committee has,
by a very large majority, voted to try and do something about this. So hopefully we'll get
some international law. The specific interventions that most of us in this field think would be good
would be to limit the amount of force that an
autonomous weapon, or in fact, an entire set of autonomous weapons in a region, would be able to
wield. Because there's a concern that should there be some bug or problem or sort of weird factor
that triggers these systems to... It's an inevitability that there will be. That is not
up for debate. Of course it's going to break. In 2020, the template slide deck that AWS sent out for reInvent speakers
had a bunch of clip art, and one of them was a line art drawing of a ham with a bone in it.
So I wound up taking that image, slapping it on a t-shirt, captioning it AWS ham bone,
and selling that as a fundraiser for 826 National.
Now, what happened next is that for a while, anyone who tweeted the phrase AWS Hambone
would find themselves banned from Twitter for the next 12 hours due to some weird algorithmic
thing where it thought that was doxing or harassment or something.
And people on the other side of the issue that you
are talking about are straight-facedly suggesting that we give that algorithm and ban tool a gun.
Or many guns. I'm sorry, what? Absolutely. Yes. Or missiles or let's build a whole bunch of them
and turn them loose with no supervision, just like we do with junior developers.
Exactly. Yes. So many people think this is a great idea, or at least they purport to think
this is a great idea, which is not always the same thing. I mean, there's a lot of
different vested interests here. Some people who are proponents of this will say, well,
actually, we think that this will make targeting more accurate. Less civilians will actually
die as a result of this.
And the question there that you have to ask is, there's a really good book called A Theory of the Drone by Grégoire Chamayou. And he says that there's actually three meanings to accuracy. So are
you hitting what you're aiming at is one thing. And that's a solved problem in military
circles for quite some time. You've got, you know got laser targeting, very accurate. Then the other
question is, how big is the blast radius? So that's just a matter of how big an explosion
are you going to get? That's not something that autonomy can help with. The only thing that
autonomy could even conceivably help with in terms of accuracy is better target selection. So instead
of selecting targets that are not valid targets, selecting more valid targets. But I don't think
there's any good reason to think that computers can solve that problem. I mean, in fact, if you
read stuff that military experts write on this, and I've got, you know, lots of academic handbooks
on military targeting processes, they all tell you it's very hard and there's a lot of gray areas,
a lot of judgment. And that's exactly what computers are pretty bad at. Although, mind you,
I'm amused by your Hambone story, and I want to ask if AWS Hambone is a database.
Anything is a database if you hold it wrong. It's fun. I went through a period of time where I,
just for fun, I would ask people to name an AWS service, and I would talk about how you could use
it incorrectly as a database. And then someone mentioned, what about AWS Neptune, which is their
graph database,
which absolutely no one understands. And the answer there is I give up. It's impossible to
use that thing as a database, but everything else can be like, you know, the tagging system. Great.
That has keys and values. It's a database now. Welcome aboard. And I didn't say it was a great
database, but it is a free one and it scales to a point. Have fun with it.
All I'll say is this, you can put labels on anything.
Exactly.
We missed you at the most recent SREcon EMEA.
There was a talk about Google's internal chubby system
and how people started using it as a database.
And I did summon you in Slack, but you didn't show up.
No, sadly, I've gotten a bit out of the SRE space. Also, frankly, I've gotten out of the community space for a little while when it comes to conferences.
And I have a focused effort starting in 2024 to change that.
I am submitting CFPs left and right.
My biggest fear is that a conference will accept one of these because a couple of them are aspirational.
Here's how I built a thing with generative AI, which, spoiler, I have done no such thing yet, but by God, I will by the time I get
there. I have something similar around Kubernetes, which I've never used in anger, but soon will if
someone accepts the right conference talk. This is how I learned Git. I shot my mouth off in a CFP
and I had four months to learn the thing. It was effective, but I wouldn't say it was the best
approach. You shouldn't feel bad about lying about having built things in Kubernetes and with LLMs because everyone has, right?
Exactly. It'll be true enough by the time I get there.
Why not? I'm not submitting for a conference next week. We're good.
Yeah, future Corey is going to hate me.
Have it build you a database system.
I like that.
I really want to thank you for taking the time to speak with me today.
If people want to learn more, where's the best place for them to find you these days?
I'm sort of homeless on social media since the whole Twitter implosion,
but you can still find me there. I'm Laura Lifts on Twitter and I have the same tag on Blue Sky,
but haven't started to use it yet. Yeah, socials are hard at the moment. I'm on LinkedIn.
Please feel free to follow me there if you wish to message me as well.
And we will of course put links to that in the show notes. Thank you so much for taking the time
to speak with me. I appreciate it. Thank you for having me.
Laura Nolan, Principal Software Engineer at Stanza. I'm cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of
choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform
of choice, along with an angry, insulting comment that soon, due to me screwing up a database system,
will be transmogrified into a CFP submission for an upcoming SREcon.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.