Screaming in the Cloud - Benchmarking Security Attack Response Times in the Age of Automation with Anna Belak

Episode Date: January 4, 2024

Anna Belak, Director of the Office of Cybersecurity Strategy at Sysdig, joins Corey on Screaming in the Cloud to discuss the newest benchmark for responding to security threats, 5/5/5. Anna describes why it was necessary to set a new benchmark for responding to security threats in a timely manner, and how the Sysdig team did research to determine the best practices for detecting, correlating, and responding to potential attacks. Corey and Anna discuss the importance of focusing on improving your own benchmarks towards a goal, as well as how prevention and threat detection are both essential parts of a solid security program.

About Anna
Anna has nearly ten years of experience researching and advising organizations on cloud adoption with a focus on security best practices. As a Gartner Analyst, Anna spent six years helping more than 500 enterprises with vulnerability management, security monitoring, and DevSecOps initiatives. Anna's research and talks have been used to transform organizations' IT strategies, and her research agenda helped to shape markets. Anna is the Director of Thought Leadership at Sysdig, using her deep understanding of the security industry to help IT professionals succeed in their cloud-native journey. Anna holds a PhD in Materials Engineering from the University of Michigan, where she developed computational methods to study solar cells and rechargeable batteries.

Links Referenced:
Sysdig: https://sysdig.com/
Sysdig 5/5/5 Benchmark: https://sysdig.com/555

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn.
Starting point is 00:00:33 I am joined again for another time this year on this promoted guest episode brought to us by our friends at Sysdig. Returning is Anna Belak, who is their Director of the Office of Cybersecurity Strategy. Anna, welcome back. It's been a hot second. Thank you, Corey. It's always fun to join you here. Last time we were here, we were talking about your report that you folks had come out with, the cybersecurity threat landscape for 2022. And when I saw you were doing another one of these to talk about something, I was briefly terrified. Oh, wow. Please tell me we haven't gone another year and the cybersecurity threat landscape is moving that quickly. And it sort of is, sort of isn't.
Starting point is 00:01:16 You're here today to talk about something different, but it also, to my understanding, distills down to just how quickly that landscape is moving. What have you got for us today? Exactly. For those of you who remember that episode, one of the key findings in the threat report for 2023 was that the average length of an attack in the cloud is 10 minutes. To be clear, that is from when you are found by an adversary to when they have caused damage to your system. And that is really fast. Like we talked about how that relates to on-prem attacks or other sort of averages from other organizations reporting how long it takes to attack people. And so we went from weeks or days to minutes, potentially seconds. And so what we've done is we looked at all that data and then we went and talked to our
Starting point is 00:02:01 amazing customers and our many friends at analyst firms and so on to kind of get a sense for if this is real, like if everyone is seeing this or if we're just seeing this, because I'm always like, oh God, like, is this real? Is it just me? And as it turns out, everyone's not only, I mean, not everyone's seeing it, right? Like there's not really been proof until this year,
Starting point is 00:02:19 I would say, because there's a few reports that came out this year, but lots of people sort of anticipated this. And so when we went to our customers and we asked for their SLAs, for example, they were like, oh yeah, my SLA for P0 in cloud is like 10, 15 minutes. And I was like, oh, okay. So what we set out to do is actually set a benchmark essentially to see how well are you doing? Like, are you equipped with your cloud security program to respond to the kind of attack that a cloud security attacker is going to, sorry, an anti-cloud security, I guess, attacker is going to perpetrate against you. And so the benchmark is, drumroll,
Starting point is 00:02:51 5-5-5. You have five seconds to detect a signal that is relevant to potentially some attack in the cloud, hopefully more than one such signal. You have five minutes to correlate all such relevant signals to each other so that you could have a high-fidelity detection of this activity. And then you have five more minutes to initiate an incident response process to hopefully shut this down or at least interrupt the kill chain before your environments experience any substantial damage.
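As a minimal sketch of the benchmark itself, assuming hypothetical timestamp names and treating the three windows as sequential budgets measured from T0 (the moment the event actually happened), a check like the following captures the arithmetic in Python:

    from datetime import datetime, timedelta

    # 5/5/5: five seconds to detect, five minutes to correlate, five more
    # minutes to initiate a response, all counted from T0, when the event
    # actually happened, not when a log finally showed up.
    DETECT_BUDGET = timedelta(seconds=5)
    CORRELATE_BUDGET = timedelta(minutes=5)
    RESPOND_BUDGET = timedelta(minutes=5)

    def meets_555(event_at: datetime, detected_at: datetime,
                  correlated_at: datetime, response_started_at: datetime) -> bool:
        """True if this incident's timeline fits inside the 5/5/5 benchmark."""
        return (detected_at - event_at <= DETECT_BUDGET
                and correlated_at - detected_at <= CORRELATE_BUDGET
                and response_started_at - correlated_at <= RESPOND_BUDGET)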
Starting point is 00:03:17 And to be clear, that is from a T0, a starting point. Stopwatch begins, the clock starts, when the event happens: not when the event shows up in your logs, not once someone declares an incident, but from J. Random Hackerman effectively refreshing the button and getting the response from your API. That's right, because the attackers don't really care how long it takes you to ship logs to wherever you're mailing them to. And that's why it is such a short time frame, because we're talking about: they got in, you saw something, hopefully. And it may take time, right?
Starting point is 00:03:48 Like, some of the, which we'll describe a little later, some of the activities that they perform in the early stages of the attack are not necessarily detectable as malicious right away. Which is why your correlation has to occur kind of in real time. Like, things happen, and you're immediately adding them, sort of like, to increase the risk of this detection, right? To say, hey, this is actually
Starting point is 00:04:04 something, as opposed to, you know, three weeks later, I'm parsing some logs and being like, oh, wow, that's not good. The number five seemed familiar to me in this context. So I did a quick check and sure enough, allow me to quote from chapter and verse from the CloudTrail documentation over in AWS land. CloudTrail typically delivers logs within an average of about five minutes of an API call. This time is not guaranteed. So effectively, if you're waiting for anything that's CloudTrail driven to tell you that you have a problem,
Starting point is 00:04:32 it is almost certainly too late by the time that pops up, no matter what that notification vector is. That is unfortunately or fortunately true. I mean, it is kind of a fact of life. I guess there is a little bit of a veiled rebuke at our cloud provider friends, because really they have to do better ultimately. But the flip side to that argument is CloudTrail or your cloud log source of choice cannot be your only source of data for detecting security events, right? So if you are operating purely on the basis of, hey, I have information in CloudTrail that is my security information, you are going to have a bad time.
Starting point is 00:05:05 Not just because it's not fast enough, but also because there's not enough data in there, right? Which is why part of the first kind of benchmark component is that you must have multiple data sources for these signals, and they, ideally, all will be delivered to you within five seconds of an event occurring or a signal being generated. Give me some more information on that, because I have my
Starting point is 00:05:25 own alerter up, specifically. It's a ClickOps detector. Whenever someone in one of my accounts does something in the console that has a write aspect to it rather than just a read component, which, again, look at whatever you want in the console, that's fine. If you're changing things that are not being managed by code, I want to know that it's happening. It's not necessarily bad, but I want to at least have visibility into it. And that spits out the principal, the IP address it comes from, and the rest. I haven't had a whole lot where I need to correlate those between different areas. Talk to me more about the triage step. Yeah, so I believe that the correlation step is the hardest, actually. Correlation step. My apologies.
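A rough sketch of the kind of ClickOps alerter Corey describes, assuming CloudTrail management events are forwarded to a small Lambda through an EventBridge rule; the field names here follow the CloudTrail record format, but treat the exact shape as something to verify against your own events:

    import json

    def handler(event, _context):
        """Flag console-driven write calls. Assumes an EventBridge rule forwards
        CloudTrail management events, so the CloudTrail record sits under
        event["detail"]."""
        detail = event.get("detail", {})

        is_write = detail.get("readOnly") is False
        from_console = detail.get("sessionCredentialFromConsole") == "true"

        if is_write and from_console:
            alert = {
                "principal": detail.get("userIdentity", {}).get("arn"),
                "sourceIp": detail.get("sourceIPAddress"),
                "action": f"{detail.get('eventSource')}:{detail.get('eventName')}",
            }
            # Wire this to Slack, PagerDuty, or a queue; print keeps the sketch small.
            print(json.dumps(alert))
            return alert
        return None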
Starting point is 00:06:05 Triage is fine. Triage, correlation. The words we use matter on these things. Dude, we argued about the words on this for so long, you couldn't even imagine. Yeah, triage, correlation, detection, you name it. We are looking at multiple pieces of data. We're going to connect them to each other meaningfully,
Starting point is 00:06:22 and that is going to provide us with some insight about the fact that a bad thing is happening and we should respond to it. Perhaps automatically respond to it, but we'll get to that. So a correlation. Okay. The first thing is, like I said, you must have more than one data source because otherwise, I mean, you could correlate information from one data source. You actually should do that, but you are going to get richer information if you can correlate multiple data sources and if you can access, for example, like through an API, some sort of enrichment for that information. Like I'll give you an example for Scarlet Eel, which is an attack we describe in a threat report.
Starting point is 00:06:49 And we actually described it before this. We're, like, on Scarlet Eel, I think, version three now, because there's so much; this particular threat actor is very active. And they have a better versioning scheme than most companies I've spoken to, but that's neither here nor there. Right.
Starting point is 00:07:01 So one of the interesting things about Scarlet Eel is you could eventually detect that it had happened if you only had access to CloudTrail, but you wouldn't have the full picture ever. In our case, because we are a company that relies heavily on system calls and machine learning detections, we are able to connect the system call events to the CloudTrail events. And between those two data sources, we're able to figure out that there's something more profound going on than just what you see in the logs. And I'll actually tell you how, for example, these are being detected. So in Scarlet Eel, one thing that happens
Starting point is 00:07:35 is there's a crypto miner. And a crypto miner is one of these events where you're like, oh, this is obviously malicious, because as we wrote, I think two years ago, it costs $53 to mine $1 of Bitcoin in AWS. So it is very stupid for you to be mining Bitcoin in AWS, unless somebody else is paying the cloud bill. Yeah, in someone else's account.
Starting point is 00:07:52 Absolutely. Yeah. So if you are a sysadmin or a security engineer and you find a crypto miner, you're like, obviously, just shut that down. Great. What often happens is people see them and they think, oh, this is a commodity attack. Like, people are just throwing crypto miners wherever. I shut it down. I'm done. But in the case of this attack, it was actually a red herring. So they deployed the miner to see if they could. They could. Then they determined,
Starting point is 00:08:13 presumably this is me speculating that, oh, these people don't have very good security because they let random idiots run crypto miners in their account in AWS. So they probed further. And when they probed further, what they did was some reconnaissance. So they type in commands, listing, you know, like list accounts or whatever. They try to list all the things they can list that are available in this account. And then they reach out to an EC2 metadata service to kind of like see what they can do, right? And so each of these events, like each of the things that they do, like reaching out to EC2 metadata service, assuming a role, doing a recon, even the lateral movement is by itself not necessarily a scary, big red flag, malicious thing. Because there are lots of legitimate reasons for someone to perform those actions.
Starting point is 00:08:56 Reconnaissance, for one example, is you're looking around the environment to see what's up, right? So you're doing things like listing things, enumerating things, whatever. But a lot of the graphical interfaces of security tools also perform those actions to show you what's there. So it looks like reconnaissance when your tool is just listing all the stuff that's available to you to show it to you in the interface, right? So anyway, the point is, when you see them independently, these events are not scary.
Starting point is 00:09:22 They're like, oh, this is useful information. When you see them in rapid succession, right? Or when you see them alongside a crypto miner, then your tooling and/or your process and/or your human being who's looking at this should be like, oh, wait a minute. Like, just the enumeration of things is not a big deal. The enumeration of things after I saw a miner,
Starting point is 00:09:43 and you try and talk to the metadata service, suddenly I'm concerned. And so the point is, how can you connect those dots as quickly as possible and as automatically as possible, so a human being doesn't have to look at every single event, because there's an infinite number of them?
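As a toy illustration of that correlation step, here is an in-memory sketch of "individually benign signals become a detection when they cluster." The signal names and the per-principal grouping are assumptions for the example; a real pipeline would do this in a streaming detection engine rather than a Python dict:

    from collections import defaultdict
    from datetime import datetime, timedelta

    # Individually benign signals that together look like an active intrusion.
    SUSPICIOUS_COMBO = {"crypto_miner", "recon_enumeration", "imds_access"}
    WINDOW = timedelta(minutes=5)  # the correlation budget from the benchmark

    class Correlator:
        def __init__(self):
            self.recent = defaultdict(list)  # principal -> [(timestamp, signal)]

        def ingest(self, principal: str, signal: str, at: datetime):
            """Record a signal; return a high-fidelity detection if the combo
            shows up for one principal inside the window, else None."""
            events = [(t, s) for t, s in self.recent[principal] if at - t <= WINDOW]
            events.append((at, signal))
            self.recent[principal] = events
            seen = {s for _, s in events}
            if SUSPICIOUS_COMBO <= seen:
                return {"principal": principal, "signals": sorted(seen),
                        "verdict": "likely active intrusion"}
            return None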
Starting point is 00:09:59 I guess the challenge I've got is that, in some cases, you're never going to be able to catch up with this. Because if it's an AWS call to one of the APIs that they manage for you, they explicitly state there's no guarantee of getting information in this until the show's all over, more or less. So how is there, like, how is there hope? I mean, there's always forensic analysis, I guess, for all the things you failed to respond to. Basically, we're doing an after-action thing, because humans aren't going to react that fast. We're just assuming it happened.
Starting point is 00:10:28 We should know about it as soon as possible. On some level, just because something is too late doesn't necessarily mean there's not value added to it. But I'm just trying to turn this into something other than a, yeah, they can move faster than you, and you will always lose, the end, have a nice night. Like, that tends not to be the best narrative vehicle for these things.
Starting point is 00:10:45 You know, if you're trying to inspire people to change. Yeah. I mean, I think one clear point of hope here is that sometimes you can be fast enough, right? And a lot of this,
Starting point is 00:10:53 I mean, first of all, you're probably not going to, sorry, cloud providers, you're not going to get that level of performance from just the cloud provider defaults. You are going with some sort of third-party tool. On the,
Starting point is 00:11:03 I guess, bright side, that tool can be open source. Like there's a lot of open source tooling available now that is fast and free. For example, this is our favorite, of course, Falco, which is looking at system calls on endpoints and containers and can detect things within seconds of them occurring
Starting point is 00:11:16 and let you know immediately. There is other eBPF-based instrumentation that you can use out there from various vendors and or open source providers. And there's, of course, network telemetry. So if you're into the world of service mesh, there is data you can get off the network also very fast. So the bad news or the flip side to that is you have to be able to manage all that information, right? So that means, again, like I said, you're not expecting a SOC analyst to look at thousands of system calls and thousands of,
Starting point is 00:11:44 you know, network packets or flow logs or whatever you're looking at and just magically know that these things go together. You are expecting to build, or have built for you by a vendor or the open source community, some sort of detection content that is taking this into account and then is able to deliver that alert at the speed of 555. When you see the larger picture stories playing out, as far as what customers are seeing and what the actual impact is, what gave rise to the five-minute number around this? Just because that tends to feel like it's both too long and also too short on some level. I'm just wondering how you wound up there. What is this based on? Man, we went through so many numbers.
Starting point is 00:12:27 So we started with larger numbers and then we went to smaller numbers and then we went back to medium numbers. We align ourselves with the time frames we're seeing for people. Like I said, a lot of folks have an SLA of responding to a P0 within 10 or 15 minutes because their point basically,
Starting point is 00:12:42 and there's a little bit of bias here into our customer base because our customer base is A, fairly advanced in terms of cloud adoption and in terms of security maturity. And also they're heavily in, let's say financial industries and other industries that tend to be early adopters
Starting point is 00:12:56 of new technology. So if you are kind of a laggard, like, you probably aren't as close to meeting this benchmark as you are if you are, say, financial, right? So we asked them how they operate, and they basically pointed out to us that knowing 15 minutes later
Starting point is 00:13:09 is too late because I've already lost some number of millions of dollars if my environment is compromised for 15 minutes, right? So that's kind of where the 10 minutes comes from. We took our real research data, and then we went around and talked to folks to see what they're experiencing and what their own expectations are for their incident response and SOC teams.
Starting point is 00:13:26 And 10 minutes is sort of where we landed. Got it. When you see this happening, I guess, in various customer environments, assuming someone has missed that five minute window, is it game over? Effectively, how should people be thinking about this? No. So, I mean, it's never really game over, right? Like until your company is ransomed to bits and you have to close your business, you still have many things that you can do, hopefully to save yourself. And also I want to be very clear that 555 as a benchmark is meant to be something aspirational, right? So you should be able to meet this benchmark for,
Starting point is 00:14:01 let's say, your top use cases if you are a fairly high maturity organization in threat detection specifically, right? So if you're just beginning your threat detection journey, like tomorrow, you're not going to be close. Like you're going to be not at all close. The point here though, is that you should aspire to this level of greatness and you're going to have to create new processes and adopt new tools to get there. Now, before you get there, I would argue that if you can do like 10, 10, 10, or like whatever number you start with, you're on a mission to make that number smaller, right? So if today you can detect a crypto miner in 30 minutes, that's not great because crypto miners are pretty detectable these days. But give yourself a goal of like getting
Starting point is 00:14:40 that 30 minutes down to 20 or getting that 30 minutes down to 10, right? Because we are so obsessed with like measuring ourselves against our peers and all this other stuff that we sometimes lose track of what actually is improving our security program. So yes, compare yourself first. But ultimately, if you can meet the 5-5-5 benchmark, then you are doing great. You are faster than the attackers in theory. So that's the dream. So I have to ask, and I suspect I might know the answer to this, but given that it seems very hard to move this quickly, especially at scale, is there an argument to be made that effectively prevention obviates the need for any of this?
Starting point is 00:15:15 Where if you don't misconfigure things in ways that should be obvious, if you practice defense in depth to a point where you can effectively catch the things that the first layer misses with successive layers, as opposed to, well, we have a firewall, and once they're inside of there, well, it's game over for us. Is prevention sufficient in some ways to obviate this? I think there are a lot of people that would love to believe that that's true. Oh, I sure would. It's such a comforting story. And I think one of my opening points on this, in the benchmark description actually, is that we've done a pretty good job of advertising prevention in cloud as an important thing and getting people to actually
Starting point is 00:15:53 like start configuring things more carefully, or like checking how those things have been configured and then changing that configuration should they discover that it is not compliant with some mundane standard that everyone should know, right? So we've made great progress in thinking about cloud prevention, but as usual, like, prevention fails, right? Like, I still have smoke detectors in my house, even though I have done everything possible to prevent it from catching fire and I don't plan to set it on fire, right? But, like, threat detection is one of these things that you're always going to need, because no matter what you do, A, you will make a mistake, because you're a human being
Starting point is 00:16:25 and there are too many things and you'll make a mistake. And B, the bad guys are literally in the business of figuring ways around your prevention and your protective systems. So I am full-on on defense in depth. I think it's a beautiful thing. We should all obviously do that.
Starting point is 00:16:41 And I do think that prevention is your first step to a holistic security program. Otherwise, what even is the point? But threat detection is always going to be necessary. And like I said, even if you can't go 5-5-5, you don't have threat detection at that speed, you need to at least be able to know what happened later so you can update your prevention system. This might be a dangerous question to get into, but why not? That's what I do here.
Starting point is 00:17:06 It's potentially an argument against cloud, by which I mean that if I compromise someone's cloud account in any of the major cloud providers, once I have access of some level, I know where everything else in the environment is as a general rule. I know that you're using S3 or its equivalent
Starting point is 00:17:22 and what those APIs look like and the rest. Whereas as an attacker, if I am breaking into someone's crappy data center hosted environment, everything is going to be different. Maybe they don't have a SAN at all, for example. Maybe they have one that hasn't been patched in five years. Maybe they're just doing local disk for some reason. There's a lot of discovery that has to happen that is almost always removed from cloud. I mean, take the open S3 bucket problem that we've seen as a scourge for five, six, seven years now, where it's not that S3 itself is insecure, but once you make a configuration mistake,
Starting point is 00:17:54 you are now in line with a whole bunch of other folks who may have much more valuable data living in that environment. Where do you land on that one? This is the leave cloud to rely on security through obscurity argument. Exactly, which I'm not a fan of, but it's also hard to argue against from time to time. My other way of phrasing it is the attackers are moving up the stack argument.
Starting point is 00:18:16 Yeah, so there's some sort of truth in that, right? Part of the reason that attackers can move that fast, and I think we say this a lot when we talk about the threat report data too, because we literally see them execute this behavior, right? Is they know what the cloud looks like, right? They have access to all the API documentation. They kind of know what all the constructs are that you're all using. And so they literally can practice their attack and create all these scripts ahead of time to perform their reconnaissance because they know exactly what they're looking at, right? On premise, you're right. Like they're going to get into, even if they get through my firewall, whatever,
Starting point is 00:18:46 they're getting into my data center. They do not know what disaster I've configured, what kinds of servers I have, where, and like what the network looks like. They have no idea, right? In cloud, this is kind of all gifted to them because it's so standard, which is a blessing and a curse.
Starting point is 00:18:58 It's a blessing because, well, for them, I mean, because they can just programmatically go through this stuff, right? It's a curse for them because it's a blessing for us in the same way, right? The defenders, A, have a much easier time knowing what they even have available to them, right?
Starting point is 00:19:11 The days of there's a server in a closet I've never heard of are kind of gone, right? You know what's in your cloud account because, frankly, AWS tells you. So I think there is a trade-off there. The other thing is the moving up the stack thing, right? No matter what you do, they will come after you
Starting point is 00:19:27 if you have something worth exploiting you for, right? So by moving up the stack, I mean, listen, we have abstracted all the physical servers, all the, like, stuff we used to have to manage the security of because the cloud just does that for us, right? Now we can argue about whether or not they do a good job, but I'm going to be generous to them and say they do a better job than most companies did before.
Starting point is 00:19:48 So in that regard, we say thank you and we move on to fighting this battle at a higher level of the stack, which is now the workloads and the cloud control plane and, you name it, whatever's going on after that. So I don't actually think you can sort of trade apples for oranges here. It's just bad in a different way. Do you think that this benchmark is going to be used by various companies who learn about it? And if so, how do you see that playing out? I hope so. My hope when we created it was it would sort of serve as a goalpost or a way to measure. It won't just be marketing words on a page and never mentioned again anywhere.
Starting point is 00:20:25 That's our dream here. Right. I was bored, so I wrote some. I had a word minimum I needed to get out the door. So there we are. It's how we work. Right. As you know, I used to be a Gartner analyst.
Starting point is 00:20:38 So my desire is always to create things that are useful for people to figure out how to do better in security. And my tenure at the vendor is just a way to fund that more effectively. I keep forgetting you're ex-Gartner. Yeah, it's one of those fun areas of, oh, yeah, we just want to basically talk about all kinds of things because we have a chart to fill out here. Let's get after it. I did not invent an acronym, at least.
Starting point is 00:20:59 Yeah, so my goal was the following. People are always looking for a benchmark or a goal or a standard to be like, hey, am I doing a good job? Whether I'm like a SOC analyst or director, and I'm just looking at my little SOC empire, or I'm a full-on CISO and I'm looking at my entire security program to kind of figure out risk, I need some way to know whether what is happening in my organization is like sufficient or on par or anything. Is it good? Is it bad? Happy face? Sad face? Like I need some benchmark, right? So normally the Gartner answer to this typically is like, you can only come up with benchmarks that are like, only you know what is right for your company, right? It's like, you know, standard, it depends answer, which is true, right? Because I can't say that like, oh,
Starting point is 00:21:38 a huge multinational bank should follow the same benchmark as like a donut shop, right? Like that's unreasonable. So this is also why I say that our benchmark is probably more tailored to the more advanced organizations that are dealing with kind of high maturity phenomena and are more cloud native. But the donut shops should kind of strive in this direction, right? So I hope that people will think of it this way, that they will kind of look at their process and say, hey, like, what are the things that would be really bad if they happened to me in terms of threat detection? Like, what are the threats I'm afraid of,
Starting point is 00:22:08 where if I saw this in my cloud environment, I would have a really bad day? And can I detect those threats in 555? Because if I can, then I'm actually doing quite well. And if I can't, then I need to set like some sort of roadmap for myself on how I get from where I am now to 555, because that implies you would be doing a good job. So that's sort of my hope for the benchmark is that people think of it as something to aspire to. And if they're already able to meet it, then they'll tell us how exactly they're achieving it, because I really want to be friends with them. Yeah, there's a definite lack of reasonable ways to think about these things, at least in ways that can be communicated to folks outside the bounds of the security team.
Starting point is 00:22:46 I think that's one of the big challenges currently facing the security industry is that it is easy to get so locked into the domain-specific acronyms, philosophies, approaches, and the rest that even coming from, well, I'm a cloud engineer who ostensibly needs to know about these things. Yeah, wander around the RSA floor with that as your background, and you get lost very quickly.
Starting point is 00:23:09 Yeah, I think that's fair. I mean, it is a very, let's say, dynamic and rapidly evolving space. And by the way, like, it was really hard for me to pick these numbers, right? Because I very much am on that whole it depends bandwagon of, I don't know what the right answer is. Who knows what the right answer is? I say 5-5-5 today, like tomorrow the attack takes five minutes and now it's two and a half, two and a half, right? Whatever, you have to pick a number and go for it. So I think to some extent we just have to try to make sense of the insanity and choose some
Starting point is 00:23:40 best practices to anchor ourselves in or some kind of sound logic to start with and then go from there. So that's sort of what I go for. So as I think about the actual reaction times needed for 555 to actually be realistic, people can't reliably get a hold of me on the phone within five minutes. So it seems like this is not something
Starting point is 00:23:58 you can have humans in the loop for. How does that interface with the idea of automating things versus giving automated systems too much power to take your site down as a potential failure mode? Yeah, I don't even answer the phone anymore, so that wouldn't work at all. That's a really, really good question. And probably the question that gives me the most, I don't know, I don't want to say lost sleep at night, because it's actually, it's very interesting to think about, right? I don't think you can remove humans
Starting point is 00:24:23 from the loop in the SOC. Like, certainly there will be things you can auto-respond to, to some extent, but there better be a human being in there, because there are too many things at stake, right? Some of these actions could take your entire business down for far more hours or days than whatever the attacker was doing before. And that trade-off of, like, is my response to this attack actually hurting the business more than the attack itself, is a question that's really hard to answer, especially for most of us technical folks who don't necessarily know the business impact of any given thing. So first of all, I think we have to embrace auto-response actions. Back to our favorite crypto miners, right? There is no reason to not automatically shut them down. There is no reason, right? Just build in a detection and an
Starting point is 00:25:04 auto-response every time you see a crypto miner, kill that process, kill that container, kill that node. I don't care, kill it. Like, why is it running? This is crazy, right? I do think it gets nuanced very quickly, right?
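For that unambiguous case, a blunt auto-responder can be tiny. A sketch using the official Kubernetes Python client, assuming the namespace and pod name come from whatever detection fired; in an immutable, declaratively managed environment the controller simply reschedules a clean replacement:

    from kubernetes import client, config

    def kill_mining_pod(namespace: str, pod_name: str) -> None:
        """Blunt response for a confirmed crypto miner: delete the pod.
        Relatively safe precisely because the workload is declarative; its
        controller will recreate it from the known-good spec."""
        config.load_incluster_config()  # assumes this runs as an in-cluster responder
        client.CoreV1Api().delete_namespaced_pod(name=pod_name, namespace=namespace)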
Starting point is 00:25:14 So again, in Scarlet Eel, there are essentially like five or six detections that occur, right? And each of them theoretically has a potential auto-response that you could have executed, depending on your sort of appetite for that level of intervention. Right. Like when you see somebody assuming a role, that's perfectly normal activity most of the time. In this case, I believe they actually assumed a machine role, which is less normal.
Starting point is 00:25:38 Like, that's kind of weird. And then what do you do? Well, you can just, like, remove the role. You can remove that person's ability to do anything or remove that role's ability to do anything. But that could be very dangerous, because we don't necessarily know what the full scope of that role is as this is happening. So you could take a more mitigated auto-response action and add a restrictive policy to that role, for example, to just prevent activity from that IP address that you just saw. Because we're not sure about this IP address, but we're sure about this role, right? So you have to get into these sort of risk-tiered response actions where you say, okay, this is always okay to do automatically.
Starting point is 00:26:10 And this is like sometimes okay, and this is never okay. And as you develop that muscle, it becomes much easier to do something rather than doing nothing and just kind of like analyzing it in forensics and being like, oh, what an interesting attack story, right? So that's step one is just start taking these different response actions. And then step two is more long-term, and it's that you have to embrace the cloud-native way of life, right? Like this immutable, ephemeral, distributed religion that we've been selling.
Starting point is 00:26:37 It actually works really well if you go all in on the religion. I sound like a real cult leader. If you just go all in, it's going to be great. But it's true, right? So if your workloads are immutable, that means they cannot change as they're running, then when you see them drifting from their original configuration, you know that it's bad. So you can immediately know that it's safe to take an auto-response,
Starting point is 00:26:56 well, it's relatively safe to take an auto-response action to kill that workload because you are 100% certain it is not doing the right things, right? And then, furthermore, if all of your deployments are defined as code, which they should be, then it is approximately, though not entirely trivial, to get that workload back, right? Because you should push a button,
Starting point is 00:27:14 and it just generates that same Kubernetes cluster with those same nodes doing all those same things, right? So in the on-premise world, shooting a server was potentially a fireable offense, because if that server was running something critical and you couldn't get it back, you were done. In the cloud, this is much less dangerous, because there is an infinite quantity of servers that you could bring back. And hopefully, infrastructure as code and configuration as code are in some wonderful registry, version controlled, for you to rely on to rehydrate all that stuff, right? So again, to sort of TLDR, get used to doing auto-response actions, but do this carefully. Define a scope for those actions that makes sense, not just like something bad happened, burn it all down, obviously.
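For the "sometimes okay" tier, the restrictive-policy idea Anna mentions earlier, denying a suspect role's activity from one source IP rather than deleting the role, might be sketched with boto3 like this; the policy shape is illustrative, so review its scope before wiring it into automation:

    import json
    import boto3

    def quarantine_role_from_ip(role_name: str, attacker_ip: str) -> None:
        """Middle-tier response: leave the (possibly load-bearing) role in place,
        but deny everything it attempts from the attacker's source IP."""
        deny_from_ip = {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"IpAddress": {"aws:SourceIp": f"{attacker_ip}/32"}},
            }],
        }
        boto3.client("iam").put_role_policy(
            RoleName=role_name,
            PolicyName="quarantine-suspected-attacker-ip",
            PolicyDocument=json.dumps(deny_from_ip),
        )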
Starting point is 00:28:03 And then as you become more cloud native, which sometimes requires refactoring of entire applications, by the way, so this could take years, just embrace the joy of everything is code. That's a good way of thinking about it. I just, I wish there were an easier path to get there for an awful lot of folks who otherwise don't find a clear way to unlock that. There is not, unfortunately. I mean, again, the upside on that is like, there are a lot of people that have done it successfully, I have to say. I couldn't have said that to you like six, seven years ago when we were just getting started on this journey. But especially for those of you who were just at KubeCon however long ago, before this airs, you see a pretty robust ecosystem around Kubernetes, around containers, around cloud in general. And so even if you feel like your organization's behind, there are a lot of folks you can reach out to, to learn from, to get some help, to just sort of start joining the masses of cloud native types. So it's not nearly as hopeless as before. And also one thing I like to
Starting point is 00:28:57 say always is almost every organization is going to have some technical debt and some legacy workload that they can't convert to the religion of cloud. And so you're not going to have a 555 threat detection SLA on those workloads. Probably. I mean, maybe you can, but probably you're not. And you may not be able to take autoresponse actions. You may not have all the same benefits available to you. But that's okay.
Starting point is 00:29:18 That's okay. Hopefully, whatever that thing is running is worth keeping alive. But set this new standard for your new workload. So when your team is building a new application or if they're refactoring an application for the new world, set the standard on them and don't torment the legacy folks because it doesn't necessarily make sense.
Starting point is 00:29:38 They're going to have different SLAs for different workloads. I really want to thank you for taking the time to speak with me yet again about the stuff you folks are coming out with. If people want to learn more, where's the best place for them to go? Thanks, Corey. It's always a pleasure to be on your show. If you want to learn more about the 555 benchmark, you should go to sysdig.com slash 555. And we will, of course, put links to that in the show notes.
Starting point is 00:30:01 Thank you so much for taking the time to speak with me today. As always, it's appreciated. Anna Belak, Director of the Office of Cybersecurity Strategy at Sysdig. I'm cloud economist Corey Quinn, and this has been a promoted guest episode brought to us by our friends at Sysdig. If you've enjoyed this podcast, please leave a five-star review in your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that I will read nowhere even approaching within five minutes. If your AWS bill keeps rising and your blood pressure is doing the same,
Starting point is 00:30:39 then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
