Screaming in the Cloud - Creating A Resilient Security Strategy Through Chaos Engineering with Kelly Shortridge

Episode Date: May 30, 2023

Kelly Shortridge, Senior Principal Engineer at Fastly, joins Corey on Screaming in the Cloud to discuss their recently released book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly explains why a resilient strategy is far preferable to a bubble-wrapped approach to cybersecurity, and how developer teams can use evidence to mitigate security threats. Corey and Kelly discuss how the risks of working with complex systems are perfectly illustrated by Jurassic Park, and Kelly also highlights why it's critical to address both system vulnerabilities and human vulnerabilities in your development environment rather than pointing fingers when something goes wrong.

About Kelly

Kelly Shortridge is a senior principal engineer at Fastly in the office of the CTO and lead author of "Security Chaos Engineering: Sustaining Resilience in Software and Systems" (O'Reilly Media). Shortridge is best known for their work on resilience in complex software systems, the application of behavioral economics to cybersecurity, and bringing security out of the dark ages. Shortridge has been a successful enterprise product leader as well as a startup founder (with an exit to CrowdStrike) and investment banker. Shortridge frequently advises Fortune 500s, investors, startups, and federal agencies and has spoken at major technology conferences internationally, including Black Hat USA, O'Reilly Velocity Conference, and SREcon. Shortridge's research has been featured in ACM, IEEE, and USENIX, spanning behavioral science in cybersecurity, deception strategies, and the ROI of software resilience. They also serve on the editorial board of ACM Queue.

Links Referenced:

Fastly: https://www.fastly.com/
Personal website: https://kellyshortridge.com
Book website: https://securitychaoseng.com
LinkedIn: https://www.linkedin.com/in/kellyshortridge/
Twitter: https://twitter.com/swagitda_
Bluesky: https://shortridge.bsky.social

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Have you listened to the new season of Traceroute yet? Traceroute's a tech podcast that peels back the layers of the stack
Starting point is 00:00:37 to tell the real human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is Kelly Shortridge, who's a senior principal engineer over at Fastly, as well as the lead author of the recently released Security Chaos Engineering, Sustaining Resilience in Software and Systems. Kelly, welcome to the show.
Starting point is 00:01:18 Thank you so much for having me. So I want to start with the honest truth that in that title, I think I know what some of the words mean. But when you put them together in that particular order, I want to make sure we're talking about the same thing. Can you explain it like I'm five as far as what your book is about? Yes, I'll actually start with an analogy I make in the book, which is imagine you were trying to rollerblade to some destination. Now, one thing you could do is wrap yourself in a bunch of bubble wrap and become the bubble person. And you can waddle down the street trying to make it to your destination on the rollerblades. But if there's a gust of wind or a dog barks or something, you're going to flop over.
Starting point is 00:01:59 You're not going to recover. However, if you instead don what everybody does, which is, you know, knee pads and other things that keep you flexible and nimble, the gust, you know, there's a gust of wind. You can kind of be agile, navigate around it. If a dog barks, you just roller skate around it. You can reach your destination. The former, the bubble person, that's a lot of our cybersecurity today. It's just keeping us very rigid, right? And then the alternative is resilience, which is the ability to recover from failure and adapt to evolving conditions. I feel like I am about to torture your analogy to death because back when I was in school in 2000, there was an annual tradition at
Starting point is 00:02:36 the school I was attending before failing out where a bunch of us would paint ourselves green every year and then bike around the campus naked. It was the green bike ride. So one year I did this on rollerblades. So if you wind up looking at it, there's a bubble wrap, there's the safety gear, and then there's wearing absolutely nothing, which feels kind of like the startup approach to InfoSec. It's like, it'll be fine. What's the worst that happens? And you're super nimble, super flexible until suddenly, oops, now I really wish I'd done things differently. Well, there's a reason why I don't say rollerblade naked, which other than it being rather visceral, what you described is what I've called YOLOsec before, which is not what you want
Starting point is 00:03:14 to do. Because the problem when you think about it from a resilience perspective, again, is you want to be able to recover from failure and adapt. Sure, you can oftentimes move quickly, but you're probably going to erode software quality over time. So at a certain point, there's going to be some big incident and suddenly you aren't fast anymore. You're actually pretty slow. So there's this kind of happy medium where you have enough, I would like security by design. We can talk about that a bit if you want, where you have enough of this security by design baked in. And you can think of it as guardrails that you're able to withstand and recover from any failure. But yeah, going naked, that's a recipe for not being able to rollerblade like ever again, potentially. I think on some level, the correct
Starting point is 00:03:57 dialing in of security posture is going to come down to context in almost every case. I'm building something in my spare time in the off hours, does not need the same security posture, mostly, as we're a bank. It feels like there's a very wide gulf between those two extremes. Unfortunately, I find that there's a certain tone deafness coming from a lot of the security industry around, oh, everyone must have security as their number one thing ever. It's, I mean, with my clients who are, I fix their AWS bills, I have to care about security contractually,
Starting point is 00:04:31 but the secrets that I hold are boring. How much money certain companies pay another very large company? Yes, I'll get sued into oblivion if that leaks, but nobody dies. Nobody is having their money stolen as a result. It's slightly embarrassing in the tech press for a cycle, and then it's over and done with. That's not the same thing as a brief stint I did running tech ops at Grindr 10 years ago, where leak that database and
Starting point is 00:04:59 people will die. There's a strong difference between those threat models. And on some level, being able to act accordingly has been one of the more eye-opening approaches to increasing velocity, in my experience. Does that align with the thesis of your book, since my copy has not yet arrived for this recording? Yes, the book. I am not afraid to say it depends on the book. And you're right, it depends on context. I actually talk about this resilience potion recipe that you can check out if you want these ingredients so we can sustain resilience. A key one is defining your critical functions. Just what is your system's reason for existence? And that is what you want to make sure can recover and still operate under adverse conditions. Like you said, another example I give all the time is most SaaS
Starting point is 00:05:43 apps have some sort of reporting functionality. Guess what? That's not mission critical. You don't need the utmost security on that for the most part. But if it's processing transactions, yeah, probably you want to invest more security there. So yes, I couldn't agree more that it's context dependent. And oh my God, does the security industry ignore that so much of the time? And it's been my grape for, I feel like, as long as I've been in the industry. I mean, there was a great talk that Netflix gave years ago where they mentioned in passing that all developers have root in production. And that's awesome. And the person next to me was super excited. And I looked at their badge and holy hell, they worked at an actual bank. That seems like a bad plan. But talking to the Netflix speaker after the fact,
Starting point is 00:06:23 Dave Hahn, something that I found that was extraordinarily insightful is that, yeah, but we just isolate off the PCI environment. So the rest and the sensitive data lives in its own compartmentalized area. So at that point, yeah, you're not going to be able to break much in that scenario. It's like that would have been helpful context to put it the talk, which I'm sure he did. But my attention span had tripped out and I missed that. But that's on some level constraining blast radius and not having compliance and regulatory issues extending to every corner of your environment really frees you up to do things appropriately. But there are some things where you do need to care about this stuff regardless of how small the surface area is. Agreed. And I introduced the concept of the effort investment portfolio in the book, which is basically that is where does it matter to invest effort and where can you kind of like maybe save some resources up? I think one thing you touched on though is we're really talking about isolation. And I actually think people don't think about isolation
Starting point is 00:07:20 as detailed or maybe as expansively as they could, because we want both temporal and logical and spatial isolation. What you talked about is, yeah, there are some cases where you want to isolate data, you want to isolate certain subsystems, and that could be containers. It could also be AWS security groups. It could take a bunch of different forms. It could be something like RLbox in WebAssembly land. But I think that's something that I really try to highlight in the book is there's actually a huge opportunity for security engineers, starting from the design of a system to really think about how can we infuse different forms of isolation to sustain resilience. It's interesting that you use the word investment.
Starting point is 00:08:02 Fixing AWS bills for a living, I've learned over the last almost seven years now of doing this, that cost and architecture in cloud are fundamentally the same thing. And resilience is something that comes with a very real cost, particularly when you start looking at what the architectural choices are. I mean, one of the big reasons that I only ever work on a fixed fee basis is because if I'm charging for a percentage of savings or something, it inspires me to say really uncomfortable things like backups are for cowards. And when's the last time you saw an entire AWS availability zone go down for so long that it mattered? You don't need to worry about that. And it does cut off an awful lot of cost issues at the price of making the
Starting point is 00:08:42 environment more fragile. That's where one of the context things starts to come in. In many cases, if AWS is having a bad day in a given region, well, does your business need that workload to be functional? For my newsletter, I have a publication system that's single-homed out of the Oregon region. If that whole thing goes down for multiple days, I'm writing that week's issue by hand because I'm going to have something different to talk about anyway. For me, there's no value in making that investment. But for companies, there absolutely is. But there also seems to be a lack of awareness around how much is a reasonable investment in that area. When do you start making that investment? And most critically, when do you stop? I think that's a good point. And luckily, what's on my side is the fact that there's a lot of just profligate spending in cybersecurity. And that's really what I'm focused on is how can we spend those investments better?
Starting point is 00:09:36 And I actually think there's an opportunity in many cases to ditch a ton of cybersecurity tools and focus more on some of the stuff you talked about. I agree, by the way, that I've seen some threat models where it's like, well, AWS, all regions go down. I'm like, at that point, we have like a severe, bigger than whatever you're thinking about problem, right? Right. So does your business continuity plan account for every one of your staff suddenly quitting on the spot because there's a whole bunch of companies with very expensive consulting-like problems that I'm going to go work for a week and then buy a house in cash. It's one of those areas where, yeah, people are not going to care about your environment more than they are about their families and other things that are going on.
Starting point is 00:10:14 Plan accordingly. People tend to get so carried away with these things, with the tabletop planning exercises. And then, of course, they forget little things like, I overwrote the database by dropping the wrong thing. Turns out that was production. Remembering for a me there. Precisely. And a lot of the chaos experiments that I talk about in the book are a lot of those like let's validate some of those basics. Right. That's actually some of the best investments you can make.
Starting point is 00:10:37 Like if you do have backups, I can totally see your argument about backups are for cowards. But if you do have them, like maybe conduct experiments to make sure that they're available when you need them. And the same thing, even on the social side. No one cares about backups, but everyone really cares about restores suddenly, right after they really should have cared about backups. Exactly. So I think it's looking at those experiments where it's like, okay, you have these basic assumptions in place that you assume to be invariants or assume that they're going to bail you out if something goes wrong. Let's just verify. That's a great place to start because I can tell you, I know you've been to the RSA hall floor,
Starting point is 00:11:08 how many cybersecurity teams are actually assessing the efficacy and actually experimenting to see if those tools really help them during incidents. It's pretty few. Oh, vendors do not want to do those analyses. They don't want you to do those analyses either. If you do, for God's sake, shut up about it. They're trying to sell things here, mostly firewalls. Yeah, cybersecurity vendors aren't necessarily happy about my book and what I talk about
Starting point is 00:11:31 because I have almost this ruthless focus on evidence. And it turns out cybersecurity vendors kind of thrive on a lack of evidence. There's so much fear, uncertainty, and doubt in that space. And I do feel for them. It's a hard market to sell in without having to talk about, here's the thing that you're defending against. In my case, it's easy to sell the AWS bill is high because if I don't have to explain why more or less setting money on fire is a bad thing, I don't really know what to tell you. I'm going to go look for a slightly different customer profile. That's not really how it works in security. I'm sure there are better go-to-market approaches, but they're
Starting point is 00:12:09 hard to find, at least ones that work holistically. There are. And one of my priorities with the book was to really enumerate how many opportunities there are to take software engineering practices that people already know, let's say something like type systems even, and how those can actually help sustain resilience. Even things like integration testing or infrastructure as code. There are a lot of opportunities just to extend what we already do for systems reliability to sustain resilience against things that aren't attacks. And just make sure that we cover a few of those cases as well. A lot of it should be really natural to software engineering teams. Again, security vendors don't like that because it turns out software engineering teams don't particularly like security vendors.
Starting point is 00:12:46 I hadn't noticed that. I do wonder, though, that for those who are unaware, chaos engineering started off as breaking things on purpose, which I feel like one person had a really good story and thought about it super quickly when they were about to get fired. Like, no, no, it's called chaos engineering. Good for them. It's now a well-regarded discipline. But I've always heard of it in the context of reliability of, oh, you think your site is going to work if the database falls over? Let's push it over and see what happens. How does that manifest in a security context? So I will clarify, I think that's a slight misconception. It's really about fixing things in production. And that's the end goal. I think we should not break things just to break them. Right. But I'll give a simple example, which I know it's based on what Aaron Reinhart conducted at United Health Group, which is, OK, let's inject a misconfigured port as an experiment and see what happens end-to-end. In their case, the firewall only detected the misconfigured port
Starting point is 00:13:45 60% of the time. So 60% of the time it works every time. But it was actually the cloud, the very common cloud configuration management tool that caught the change and alerted responders. So it's that kind of thing where we're still trying to verify those assumptions that we have about our systems and how they behave again end-to-end. In a lot of cases, again, with security tools, they are not behaving as we expect. But I still argue security is just a subset of software quality. So if we're experimenting to verify again our assumptions and observe system behavior, we're benefiting software quality and security is just a subset of that. Think about C code, right? It's not like there's
Starting point is 00:14:21 a healthy memory corruption. So it's bad for both equality and security reason. One problem that I've had in the security space for a while is, let's put on this to AWS for a second, because that is the area in which I spend the most of my time, which probably explains a lot about my personality challenges. But the problem that I keep smacking into is if I go ahead and configure everything the way that I should, according to best practices and the rest, I wind up with a firehose torrent of information in terms of CloudTrail logs, etc. And it's expensive in its own right. But then to sort through it or to do a lot of things in security, there are basically two options. I can either buy a vendor's product, which generally tends to start around $12,000 a year and goes up rapidly from there. I'm at my current $6,000 a year bill. So okay, twice as
Starting point is 00:15:12 much as the infrastructure for security monitoring. Okay. Or alternately, find a bunch of different random scripts and tools on GitHub of wildly diverging quality and sort of hope for the best on that. It feels like there's nothing in between. And the reason I care about this is not because I'm cheap, but because when you have an individual learner who is either a student or a career switcher or someone just trying to experiment with this, you want them to begin as you want them to go on. And things that are no money for an enterprise are all the money to them. They're going to learn to work with the tools that they can afford. That feels like it's a big security swing and a miss.
Starting point is 00:15:51 Do you agree? Disagree? What's the nuance I'm missing here? No, I don't think there's nuance you're missing. I think security observability, for one, isn't a buzzword that particularly exists. I've been trying to make it a thing, but I'm solely one individual screaming into the void. But observability just hasn't been a thing. We haven't really focused on, okay, so what? We get data and what do we do with it? And I think, again, from a software
Starting point is 00:16:13 engineering perspective, I think there's a lot we can do. One, we can just avoid duplicating efforts. We can treat observability, again, of any sort of issue as similar, whether that's an attack or a performance issue. I think this is another place where security or any sort of issue is similar, whether that's an attack or a performance issue. I think this is another place where security or any sort of chaos experiment shines, though, because if you have an idea of here's an adverse scenario we care about, you can actually see how does it manifest in the logs and you can start to figure out like what signals do we actually need to be looking for, what signals matter to be able to narrow it down, which again is it involves time and effort. But also I can attest when you're buying the security vendor tool and in theory, absolving some of that time and effort, it's maybe, maybe not because it can be hard to understand what the outcomes are or what the
Starting point is 00:16:59 outputs are from the tool. And it can also be very difficult to tune it and to be able to explain some of the outputs. It's kind of like trading upfront effort versus long-term overall overhead, if that makes sense. It does. On that note, the title of your book includes the magic key phrase, sustaining resilience. I have found that security effort and investment tends to resemble a fire drill in an awful lot of places where we care very much about security, says the company, right after they very clearly failed to care about security. And I know this because I'm reading it in an email about a breach that they've just sent me.
Starting point is 00:17:37 And then there's a whole bunch of running around and hair on fire moments. But then there's a new shiny that always comes up, a new strategic priority, and it falls to the wayside again. What do you see that drives that sustained effort and focus on resilience in a security context? I think it's really making sure you have a learning culture, which sounds very draw the owl, but things again like experiments can help just because when you do simulate those adverse scenarios and you see how your system behaves, it's almost like running an incident and you can use that as very fresh kind of like collective memory. And I even strongly recommend starting off with prior incidents and simulating those just to see like, hey, did the
Starting point is 00:18:16 improvements we make actually help? If they didn't, that can be a kind of another fire under the butt, so to speak, to continue investing. So definitely in practice, and there's some case studies in the book, it can be really helpful just to kind of like sustain that memory and sustain that learning and keep things feeling a bit fresh. It's almost like prodding the nervous system a little, just so it doesn't go back to that complacent, inconvenient feeling. It's one of the hard problems because I'm sure I'm going to get castigated for this by some of the listeners, but computers are easy, particularly compared to the people. There are deterministic ways to solve almost any computer problem, but people are always going to be a little bit
Starting point is 00:18:56 different and getting them to perform the same way today that they did yesterday is an exercise in frustration. Changing the culture, changing the approach and the attitude that people take toward a lot of these things feels, from my perspective, like something of an impossible job. Cultural transformations are things that everyone talks about, but it's rare to see them succeed. Yes. And that's actually something that I very strongly weave throughout the book is that if your security solutions rely on human behavior, they're going to fail. We want to either reduce hazards or eliminate hazards by design as much as possible. So my view is very much, again, like, can you make processes more repeatable?
Starting point is 00:19:36 That's going to help security. I definitely do not think that if anyone takes away from my book that they need to have like a thousand hours of training to change hearts and minds, then they have completely misunderstood most of the book. The idea is very much like what are practices that we want for other outcomes anyway, again, reliability or faster time to market? And how can we harness those to also be improving resilience or security at the same time? It's very much trying to think about those opportunities rather than trying to drill into people's heads like thou shalt not or thou shall. Way back in 2018, you gave a keynote at some conference or another, and you built the entire thing on the story of Jurassic Park, specifically Ian Malcolm as one of your favorite fictional heroes. And you tied it into security in a bunch of different ways. You hadn't written this book
Starting point is 00:20:29 then unless the authorship process is way longer than I think it is. So I'm curious to get your take on what Jurassic Park can teach us about software security. Yes. So I talk about Jurassic Park as a reference throughout the book frequently. I've loved that book since I was a very young child. Jurassic Park is a great example of a complex system they didn't anticipate there could be more in the count. Like there, there's so many different factors that influenced it. You can't actually blame just like human error point fingers at one thing. So that's a beautiful example of how things go wrong in our software systems. Cause like you said, there's this human element and then there's also how the humans interact and how the software components interact. But with Jurassic Park 2, I think the great thing is dinosaurs are going to do dinosaur things like eating people. And there are also equivalents in software like C code. C code is going to do C code things, right? It's not a memory safe
Starting point is 00:21:34 language. So we shouldn't be surprised when something goes wrong. We need to prepare accordingly. How could this happen again? Right. And a certain point, it's like there's probably no way to sufficiently introduce isolation for dinosaurs unless you put them in a bunker where no one can see them. And it's the same thing sometimes with things like Seacoat. There's just no amount of effort you can invest, and you're just kind of investing for a really unclear and generally not fortuitous outcome. So I like it as kind of this analogy to think about, okay, where do our effort investments make sense? And where is it sometimes like we really just do need to refactor
Starting point is 00:22:09 because we're dealing with dinosaurs here. When I was a kid, that was one of my favorite books too. The problem is, is I didn't realize I was getting a glimpse of my future at a number of crappy startups that I worked at. Because you have John Hammond, who was the owner of the park, talking constantly about how we spared no expense, But then you look at what actually happened and he spared every freaking expense. You have one IT person who is so criminally underpaid that smuggling dinosaur embryos off the island becomes a viable strategy for this. He went, oh, we couldn't find the right DNA. So we're just going to splice some other random stuff in there. It'll be fine. Then you have the massive overconfidence, because it sounds very much like he had this almost Muskian desire to fire anyone who disagreed
Starting point is 00:22:54 with him. And yeah, there was a certain lack of investment that could have been made, despite loud protestations to the contrary. I'd say that he is the root cause. He is the proximate reason for the entire failure of the park. But I'm willing to entertain disagreement on that point. I think there are other individuals like Dr. Wu, if you recall, deciding to defrag DNA and not thinking that maybe something could go wrong. I think there was a lot of overconfidence, which you're right. We do see a lot in software. So I think that's actually another very important lesson is that incentives matter and incentives are very hard
Starting point is 00:23:28 to change, kind of like what you talked about earlier. It doesn't mean that we shouldn't include incentives in our threat models. Like in the book I talk about, our threat models should include things like maybe, yeah, people are underpaid or there is a ton of pressure to deliver things quickly or do things as cheaply as possible. That should be just as much of our threat models as all of the technical stuff, too. I think that there's a lot that was in that movie that was flat out wrong. For example, one of the kids, I forget her name, it's been a long time, was logging in and said, oh, this is Unix. I know Unix. And having learned Unix is my first basically professional operating
Starting point is 00:24:05 system. No, you don't. No one knows Unix. They get very confused at some point. The question is just how far down what rabbit hole it is. I feel so sorry for that kid. I hope she wound up seeking therapy when she was older to realize that she, no, you don't actually know Unix. It's not that you're bad at computers. It's that Unix is user hostile, actively so. Like the Raptors. That's the better metaphor when everything winds up shaking out. Yeah. I don't disagree with that. The movie definitely takes many liberties. I think what's interesting, though, is that Michael Crichton, specifically when he talked about writing the book, I don't know how many people know this, dinosaurs were just a mechanism. He knew people would want to read it in an airport. What he cared about was communicating really the danger of complex systems and how if you don't
Starting point is 00:24:48 respect them and respect that interactivity and that it can baffle and surprise us, like things will go wrong. So I actually find it kind of beautiful in a way that the dinosaurs were almost like an afterthought. What he really cared about was exactly what we deal with all the time in software is when things go wrong with complexity. Like one of his other books, Airframe, talked about an air disaster with a bunch of contributing factors and the rest. And for some reason, that did not receive the wild acclaim that Jurassic Park did to become a cultural phenomenon that we're still talking about, what, 30 years later? Right. Dinosaurs are very compelling.
Starting point is 00:25:22 They really are. I have to ask, though, this is the joy of having a kid who's almost six. What is your favorite dinosaur? Not a question most people get asked very often, but I am going to trot that one out. No. Oh, that is such a good question. Maybe a Deinonychus. Oh, because they get so angry they spit and kill people? That's amazing. Yeah, and I like the kind of nimble, smarter ones, and also the fact that most of the smaller ones allegedly had feathers, which I just love this idea of featherful murder machines.
Starting point is 00:25:55 I have the classic nerd kid syndrome, though, where I've read all these dinosaur names as a kid and I've never pronounced them out loud, so I'm sure there are others that I would just word salad. But honestly, it's hard to go wrong with choosing a favorite dinosaur. Oh yeah. I'm sure some paleontologist is sitting out there in the field and a dig somewhere listening to this podcast, just getting very angry at our pronunciation of things. For God's sake, I call the database post-grasqueal, get in line. There's a lot of that out there. We're looking at complex system failures and different contributing factors and the rest. Make stuff, that's what makes things interesting.
Starting point is 00:26:28 I think that there's this, the idea of a root cause is almost always incorrect. It's not, okay, who tripped over the buried landmine is not the interesting question. It's who buried the thing. What were all the things that wound up contributing to this? And you can't even frame it that way
Starting point is 00:26:43 in the blaming context, just because you start doing that and people clam up and good luck figuring out what really happened. Exactly. But that's so much of what the cybersecurity industry is focused on is how do we assign blame? And it's, you know, the marketing person clicked on a link. It's like they do that thousands of times, like a month. And the one time suddenly they were stupid for doing it, that doesn't sound right.
Starting point is 00:27:05 So I'm a big fan of, yes, vanquishing root cause, thinking about contributing factors. And in particular, in any sort of incident review, you have to think about, was there a designer process problem? You can't just think about the human behavior. You have to think about where are the opportunities for us to design things better, to make the secure way more of the default way. When you talk about resilience and reliability and big notable outages, most forward-thinking companies are going to go and do a variety of incident reviews and disclosures around everything that happened to it, depending upon levels of trust and whether you're NDA'd or not and how much gets public is going to vary from place to place. But from a
Starting point is 00:27:44 security perspective, that feels like the sort of thing that companies will clam up about and never say a word. Because I can wind up pouring a couple of drinks into people and get the real story of outages or the AWS bill. But security stuff, they start to wonder if I'm a state actor on some level. When you were building all of this, how did you wind up getting people to talk candidly and forthrightly about issues that if it became tied to them, that they were talking about this in public would almost certainly have negative career impact for them? Yes. So that's almost like a trade secret, I feel like. A lot of it is, yes, over the years, talking with people over generally at a conference where, you know, things are tipsy. I never want to betray confidentiality, to be clear, but certainly pattern matching across people's stories.
Starting point is 00:28:30 We're both in positions where if they're even a hint of they can't be trusted enters the ecosystem. I think both of our careers explode and never recover. Exactly. Yeah. Oh, yeah. They fly fast and loose with secrets is never the reputation you want as a professional. No, no, definitely not. So it's much more pattern matching and trying to generalize. But again, a lot of what can go wrong is not that different when you think about a developer
Starting point is 00:28:54 being really tired and making a bunch of mistakes versus an attacker. A lot of times they're very much the same. So luckily there's commonality there. I do wish the security industry was more forthright and less clandestine because, frankly, all of the public postmortems that are out there about performance issues are just such a boon for everyone else to improve what they're doing. So that's the change I wish would happen. So I have to ask, given that you talk about security, chaos engineering, and resilience,
Starting point is 00:29:21 and of course, software and systems, all in the title of the O'Reilly book. Who is the target audience for this? Is it folks who have the word security featured three times in their job title? Is it folks who are new to the space? Where does your target audience start and stop? Yes. So I have kept it pretty broad and it's anyone who works with software, but I'll talk about the software engineering audience because that is honestly probably out of anyone who I would love to read the book the most because I firmly believe that there's so much that software engineering teams can do to sustain resilience and security, and they don't have to be security experts. So I've tried to demystify security, make it much less arcane, even down to like how attackers, you know, they have their own development life cycle.
Starting point is 00:30:03 I try to demystify that too. So it's very much for any team, especially like platform engineering teams, SREs, to think about, hey, what are some of the things maybe I'm already doing that I can extend to cover, you know, the security cases as well. So I would love for every software engineer to check it out to see like, hey, what are the opportunities for me to just do things slightly differently and have these great security outcomes? I really want to thank you for taking the time to talk with me about how you view these things.
Starting point is 00:30:30 If people want to learn more, where's the best place for them to find you? Yes, I have all of the social media, which is increasingly fragmented, I feel like. But I also have my personal site, kellyschorch.com. The official book site is securitychaosenge.com as well, but otherwise find me on LinkedIn, Twitter, Mastodon, Blue Sky. I'm probably blanking on the others. It's probably already a new one while we've spoken. Yeah, Blue Ski is how I insist on pronouncing it as well while we're talking about fun house pronunciation on things. I like it. Excellent. And we will of course put links to all of those things in the show notes. Thank you so much for being so generous with your time. I really appreciate it.
Starting point is 00:31:09 Thank you for having me and being a fellow dinosaur nerd. Kelly Shortridge, Senior Principal Engineer at Fastly. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment about how our choice of dinosaurs
Starting point is 00:31:36 is incorrect. Then put the computer away and struggle to figure out how to open a door. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying.
Starting point is 00:31:58 The Duck Bill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
