Screaming in the Cloud - Benchmarking Security Attack Response Times in the Age of Automation with Anna Belak
Episode Date: January 4, 2024
Anna Belak, Director of the Office of Cybersecurity Strategy at Sysdig, joins Corey on Screaming in the Cloud to discuss the newest benchmark for responding to security threats, 5/5/5. Anna describes why it was necessary to set a new benchmark for responding to security threats in a timely manner, and how the Sysdig team did research to determine the best practices for detecting, correlating, and responding to potential attacks. Corey and Anna discuss the importance of focusing on improving your own benchmarks towards a goal, as well as how prevention and threat detection are both essential parts of a solid security program.
About Anna
Anna has nearly ten years of experience researching and advising organizations on cloud adoption with a focus on security best practices. As a Gartner Analyst, Anna spent six years helping more than 500 enterprises with vulnerability management, security monitoring, and DevSecOps initiatives. Anna's research and talks have been used to transform organizations' IT strategies, and her research agenda helped to shape markets. Anna is the Director of Thought Leadership at Sysdig, using her deep understanding of the security industry to help IT professionals succeed in their cloud-native journey. Anna holds a PhD in Materials Engineering from the University of Michigan, where she developed computational methods to study solar cells and rechargeable batteries.
Links Referenced:
Sysdig: https://sysdig.com/
Sysdig 5/5/5 Benchmark: https://sysdig.com/555
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I am joined again for another time this year on this promoted guest episode brought to us by our friends at Sysdig.
Returning is Anna Belak, who is their Director of the Office of Cybersecurity
Strategy at Sysdig. Anna, welcome back. It's been a hot second. Thank you, Corey. It's always fun to
join you here. Last time we were here, we were talking about your report that you folks had come
out with, the cybersecurity threat landscape for 2022. And when I saw you were doing another one
of these to talk about
something, I was briefly terrified. Oh, wow. Please tell me we haven't gone another year
and the cybersecurity threat landscape is moving at that quickly. And it sort of is, sort of isn't.
You're here today to talk about something different, but it also, to my understanding,
distills down to just how quickly that landscape is moving. What have you got for us today?
Exactly. For those of you who remember that episode, one of the key findings in the threat report for 2023 was that the average length of an attack in the cloud is 10 minutes. To be clear,
that is from when you are found by an adversary to when they have caused damage to your system.
And that is really fast. Like we talked about how that relates to on-prem attacks or other sort of averages from other
organizations reporting how long it takes to attack people.
And so we went from weeks or days to minutes, potentially seconds.
And so what we've done is we looked at all that data and then we went and talked to our
amazing customers and our many friends at analyst firms and so on
to kind of get a sense for if this is real,
like if everyone is seeing this or if we're just seeing this,
because I'm always like, oh God, like, is this real?
Is it just me?
And as it turns out, everyone's not only,
I mean, not everyone's seeing it, right?
Like there's not really been proof until this year,
I would say, because there's a few reports
that came out this year,
but lots of people sort of anticipated this.
And so when we went to our customers
and we asked for their SLAs, for example, they were like, oh yeah, my SLA for P0 in cloud is
like 10, 15 minutes. And I was like, oh, okay. So what we set out to do is actually set a benchmark
essentially to see how well are you doing? Like, are you equipped with your cloud security program
to respond to the kind of attack that a cloud-focused attacker is going to perpetrate against you. And so the benchmark is, drumroll,
5-5-5. You have five seconds to detect a signal that is relevant to potentially some attack in
the cloud, hopefully more than one such signal. You have five minutes to correlate all such
relevant signals to each other so that you could have a high-fidelity detection of this activity.
And then you have five more minutes
to initiate an incident response process
to hopefully shut this down
or at least interrupt the kill chain
before your environments experience any substantial damage.
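To make those three budgets concrete, here is a minimal sketch of scoring a single incident against them; the Incident fields and helper name are invented for illustration and are not any particular product's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The three 5/5/5 budgets: 5 seconds to detect, 5 minutes to correlate,
# 5 more minutes to initiate a response.
DETECT_BUDGET = timedelta(seconds=5)
CORRELATE_BUDGET = timedelta(minutes=5)
RESPOND_BUDGET = timedelta(minutes=5)

@dataclass
class Incident:
    event_at: datetime            # T0: when the malicious activity actually happened
    detected_at: datetime         # first relevant signal surfaced
    correlated_at: datetime       # signals stitched into a high-fidelity detection
    response_started_at: datetime # incident response process initiated

def meets_555(i: Incident) -> bool:
    """True if this incident was detected, correlated, and responded to within budget."""
    return (
        i.detected_at - i.event_at <= DETECT_BUDGET
        and i.correlated_at - i.detected_at <= CORRELATE_BUDGET
        and i.response_started_at - i.correlated_at <= RESPOND_BUDGET
    )
```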
And to be clear, that is from a T0, a starting point.
Stopwatch begins, the clock starts,
when the event happens, not when the event shows
up in your logs, not once someone declares an incident, from J. Random Hackerman effectively
refreshing the button and getting the response from your API. That's right, because the attackers
don't really care how long it takes you to ship logs to wherever you're mailing them to. And that's
why it is such a short time frame, because we're talking about: they got in, you saw
something, hopefully. And it may take time, right?
Like, some of the activities that they perform in the
early stages of the attack, which we'll describe
a little later, are not necessarily detectable
as malicious right away. Which is why
your correlation has to occur kind of in real time.
Like, things happen, and you're immediately adding
them, sort of like, to increase the risk
of this detection, right? To say, hey, this is actually
something, as opposed to, you know,
three weeks later, I'm parsing some logs and being like, oh, wow, that's not good.
The number five seemed familiar to me in this context.
So I did a quick check and sure enough, allow me to quote from chapter and verse
from the CloudTrail documentation over in AWS land.
CloudTrail typically delivers logs within an average of about five minutes of an API call.
This time is not guaranteed.
So effectively, if you're waiting for anything that's CloudTrail driven to tell you that you have a problem,
it is almost certainly too late by the time that pops up, no matter what that notification vector is.
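As a rough way to see that delay for yourself, a small polling sketch like the one below (assuming boto3 credentials with permission to call LookupEvents; the 30-minute window and 30-second polling interval are arbitrary) reports how stale each CloudTrail record is when it first becomes visible:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")
seen: set[str] = set()  # EventIds already observed

while True:
    window_start = datetime.now(timezone.utc) - timedelta(minutes=30)
    response = cloudtrail.lookup_events(StartTime=window_start, MaxResults=50)
    now = datetime.now(timezone.utc)
    for event in response["Events"]:
        if event["EventId"] in seen:
            continue
        seen.add(event["EventId"])
        lag = now - event["EventTime"]  # how long after the call we could first see it
        print(f"{event['EventName']}: visible ~{lag.total_seconds():.0f}s after it occurred")
    time.sleep(30)
```

Note that LookupEvents only covers management events and is rate limited, so a sketch like this is for measurement, not production detection.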
That is unfortunately or fortunately true.
I mean, it is kind of a fact of life.
I guess there is a little bit of a veiled jab at our cloud provider friends, because really they have to do better ultimately.
But the flip side to that argument is CloudTrail or your cloud log source of choice cannot be your
only source of data for detecting security events, right? So if you are operating purely on the basis
of, hey, I have information in CloudTrail that is my security information, you are going to have a
bad time.
Not just because it's not fast enough,
but also because there's not enough data in there, right?
Which is why part of the first kind of benchmark component
is that you must have multiple data sources for these signals,
and they, ideally, all will be delivered to you
within five seconds of an event occurring or a signal being generated.
Give me some more information on that, because I have my
own alerter up, specifically. It's a ClickOps detector. Whenever someone in one of my accounts
does something in the console that has a write aspect to it rather than just a read component,
which again, look at whatever you want in the console. That's fine. If you're changing things that is not
being managed by code, I want to know that it's happening. It's not necessarily bad,
but I want to at least have visibility into it. And that spits out the principal, the IP address
it emits from, and the rest. I haven't had a whole lot where I need to correlate those between
different areas.
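For illustration, a minimal sketch of that kind of ClickOps detector, written as a Lambda-style handler for CloudTrail events delivered via EventBridge; the console-origin check is a heuristic built from commonly seen record fields, not an exhaustive test:

```python
def handler(event, context):
    """Flag console-originated write calls in a CloudTrail record delivered via EventBridge."""
    detail = event.get("detail", {})

    is_write = detail.get("readOnly") is False
    from_console = (
        detail.get("sessionCredentialFromConsole") == "true"
        or "console.amazonaws.com" in detail.get("userAgent", "")
    )

    if is_write and from_console:
        principal = detail.get("userIdentity", {}).get("arn", "unknown")
        print(
            "ClickOps detected:", detail.get("eventName"),
            "by", principal,
            "from", detail.get("sourceIPAddress"),
        )
```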
Talk to me more about the triage step.
Yeah, so I believe that the correlation step is the hardest, actually.
Correlation step. My apologies.
Triage is fine.
Triage, correlation.
The words we use matter on these things.
Dude, we argued about the words on this for so long,
you couldn't even imagine.
Yeah, triage, correlation, detection, you name it.
We are looking at multiple pieces of data.
We're going to connect them to each other meaningfully,
and that is going to provide us with some insight about the fact that a bad thing is happening
and we should respond to it. Perhaps automatically respond to it, but we'll get to that.
So a correlation. Okay. The first thing is, like I said, you must have more than one data source
because otherwise, I mean, you could correlate information from one data source. You actually
should do that, but you are going to get richer information if you can correlate multiple data
sources and if you can access, for example, like through an API, some sort of enrichment for that
information. Like I'll give you an example for Scarlet Eel,
which is an attack we describe in a threat report.
And we actually described before this,
we're like on Scarlet Eel, I think version three now,
because there's so much,
this particular threat actor is very active.
And they have a better versioning scheme
than most companies I've spoken to,
but that's neither here nor there.
Right.
So one of the interesting things about Scarlet Eel
is you could eventually detect that it had happened if you only had access to CloudTrail, but you wouldn't have the full picture ever.
In our case, because we are a company that relies heavily on system calls and machine learning detections, we are able to connect the system call events to the CloudTrail events. And between those two data sources,
we're able to figure out that there's something
more profound going on than just what you see in the logs.
And I'll actually tell you which events, for example,
are being detected.
So in Scarlet Eel, one thing that happens
is there's a crypto miner.
And a crypto miner is one of these events
where you're like, oh, this is obviously malicious
because as we wrote, I think two years ago,
it cost $53 to mine $1 of Bitcoin in AWS.
So it is very stupid for you to be mining Bitcoin in AWS, unless somebody else is paying
the cloud bill.
Yeah, in someone else's account.
Absolutely.
Yeah.
So if you are a sysadmin or a security engineer and you find a crypto miner, you're like,
obviously just shut that down.
Great.
What often happens is people see them and they think, oh, this is a commodity attack.
Like people are just throwing crypto miners wherever.
I shut it down. I'm done. But in the case of this attack, it was actually a red herring. So they deployed the miner to see if they could; they could. Then they determined,
presumably this is me speculating that, oh, these people don't have very good security because they
let random idiots run crypto miners in their account in AWS. So they probed further. And when
they probed further, what they did was some reconnaissance. So
they type in commands, listing, you know, like list accounts or whatever. They try to list all
the things they can list that are available in this account. And then they reach out to an EC2
metadata service to kind of like see what they can do, right? And so each of these events, like each
of the things that they do, like reaching out to EC2 metadata service, assuming a role, doing a recon, even the lateral movement is by itself not necessarily a scary, big red flag, malicious thing.
Because there are lots of legitimate reasons for someone to perform those actions.
Reconnaissance, for one example, is you're looking around the environment to see what's up, right?
So you're doing things like listing things, integrating things, whatever.
But a lot of the graphical interfaces of security tools also perform those actions to show you what's there.
So it looks like reconnaissance when your tool
is just listing all the stuff that's available to you
to show it to you in the interface, right?
So anyway, the point is, when you see them independently,
these events are not scary.
They're like, oh, this is useful information.
When you see them in rapid succession, right?
Or when you see them alongside a crypto miner,
then your tooling and or your process
and or your human being who's looking at this
should be like, oh, wait a minute.
Like just the enumeration of things is not a big deal.
The enumeration of things after I saw a miner
and you try and talk to the metadata service,
suddenly I'm concerned. And so
the point is, how can you connect
those dots as quickly as possible and
as automatically as possible so a human being doesn't
have to look at every single event because there's
an infinite number of them?
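A toy version of that correlation logic might look like the following; the signal names, the five-minute window, and the idea of keying on a single identity are simplifications for illustration:

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

# Individually weak signals that, together, look like Scarlet Eel-style activity.
SUSPICIOUS_COMBO = {"crypto_miner", "enumeration", "imds_access"}

recent = defaultdict(list)  # identity -> [(timestamp, signal_type), ...]

def ingest(identity: str, signal_type: str, ts: datetime) -> bool:
    """Record a signal; return True once the combination crosses the threshold."""
    recent[identity].append((ts, signal_type))
    # Drop anything that has aged out of the correlation window.
    recent[identity] = [(t, s) for t, s in recent[identity] if ts - t <= WINDOW]
    return SUSPICIOUS_COMBO <= {s for _, s in recent[identity]}
```

An enumeration call on its own returns False here; the same identity hitting the metadata service a few minutes after a miner fired crosses the threshold.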
I guess the challenge I've got is that
in some cases, you're never going
to be able to catch up with this. Because if it's
an AWS call to one of the APIs that they manage for you, they explicitly state there's no guarantee of getting information in this until the show's all over, more or less.
So where is there, like, where is there hope?
I mean, there's always forensic analysis, I guess, for all the things you failed to respond to.
Basically, we're doing an after-action thing,
because humans aren't going to react that fast.
We're just assuming it happened.
We should know about it as soon as possible.
On some level, just because something is too late
doesn't necessarily mean there's not value added to it.
But I'm just trying to turn this into something other than a,
yeah, they can move faster than you,
and you will always lose in the end.
Have a nice night.
Like, that tends not to be the best narrative vehicle for these things.
You know,
if you're trying to inspire people to change.
Yeah.
I mean,
I think one clear point of hope here is that sometimes you can be fast
enough,
right?
And a lot of this,
I mean,
first of all,
you're probably not going to,
sorry,
cloud providers,
you're not going to get that level of performance
from just the cloud provider defaults.
You are going with some sort of third-party tool.
On the, I guess, bright side,
that tool can be open source.
Like there's a lot of open source tooling available now that is fast and free.
For example, this is our favorite, of course, Falco,
which is looking at system calls on endpoints and containers
and can detect things within seconds of them occurring
and let you know immediately.
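As one hedged example of consuming that signal: Falco can emit alerts as JSON when json_output is enabled, and a small reader like this can hand them to whatever correlation logic you have; running Falco itself requires appropriate privileges on the host:

```python
import json
import subprocess

# Read Falco's alert stream; assumes Falco is installed and run with enough
# privileges, and that json_output is enabled so each alert is one JSON line.
proc = subprocess.Popen(
    ["falco", "-o", "json_output=true"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:
    try:
        alert = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip startup banners and other non-JSON output
    if alert.get("priority") in ("Critical", "Error", "Warning"):
        # Hand the alert off to your correlation logic here.
        print(alert.get("time"), alert.get("rule"), alert.get("output_fields", {}))
```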
There is other eBPF-based instrumentation
that you can use out there from various vendors
and or open source providers.
And there's, of course, network telemetry. So if you're into the world of service mesh,
there is data you can get off the network also very fast. So the bad news or the flip side to
that is you have to be able to manage all that information, right? So that means, again, like I
said, you're not expecting a SOC analyst to look at thousands of system calls and thousands of,
you know, network packets or flow logs or
whatever you're looking at and just magically know that these things go together, you are
expecting to build or have built for you by a vendor or the open source community,
some sort of detection content that is taking this into account and then is able to deliver that alert
at the speed of 555. When you see the larger picture stories playing out,
as far as what customers are seeing, what the actual impact is, what gave rise to the five
minute number around this? Just because that tends to feel like it's both too long and also
too short on some level. I'm just wondering how you wound up there. What is this based on? Man, we went through so many numbers.
So we started with larger numbers
and then we went to smaller numbers
and then we went back to medium numbers.
We aligned ourselves with the time frames
we're seeing from people.
Like I said, a lot of folks have an SLA
of responding to a P0 within 10 or 15 minutes
because their point basically,
and there's a little bit of bias here
into our customer base
because our customer base is A,
fairly advanced in terms of cloud adoption
and in terms of security maturity.
And also they're heavily in,
let's say financial industries
and other industries that tend to be early adopters
of new technology.
So if you are kind of a laggard,
like you probably aren't as close
to meeting this benchmark as you would be
if you are, say, financial, right?
So we asked them
how they operate, and they basically pointed out to us
that knowing 15 minutes later
is too late because I've already lost some number
of millions of dollars if my environment is compromised
for 15 minutes, right?
So that's kind of where the 10 minutes comes from.
We took our real research data, and then
we went around and talked to folks to see what they're
experiencing and what their own expectations are
for their incident response and SOC teams.
And 10 minutes is sort of where we landed.
Got it.
When you see this happening, I guess, in various customer environments, assuming someone has missed that five minute window, is it game over?
Effectively, how should people be thinking about this?
No.
So, I mean, it's never really game over, right? Like until your company is ransomed to bits and you have to close your business, you still have many things that you
can do, hopefully to save yourself. And also I want to be very clear that 555 as a benchmark
is meant to be something aspirational, right? So you should be able to meet this benchmark for,
let's say, your top use cases if you are a fairly high maturity organization in
threat detection specifically, right? So if you're just beginning your threat detection journey,
like tomorrow, you're not going to be close. Like you're going to be not at all close. The point
here though, is that you should aspire to this level of greatness and you're going to have to
create new processes and adopt new tools to get there. Now, before you get there, I would argue that if you
can do like 10, 10, 10, or like whatever number you start with, you're on a mission to make that
number smaller, right? So if today you can detect a crypto miner in 30 minutes, that's not great
because crypto miners are pretty detectable these days. But give yourself a goal of like getting
that 30 minutes down to 20 or getting that 30 minutes down to 10, right? Because we are so
obsessed with like measuring ourselves against our peers and all this other stuff that we
sometimes lose track of what actually is improving our security program. So yes,
compare yourself first. But ultimately, if you can meet the 5-5-5 benchmark, then you are doing
great. You are faster than the attackers in theory. So that's the dream.
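In that spirit of benchmarking against yourself first, a tiny sketch like this (with placeholder incident data and an arbitrary target) is enough to track whether the number is actually shrinking:

```python
from datetime import timedelta

# Detection durations (detected_at minus event_at) pulled from your incident
# tracker; these values are placeholders.
detection_times = [
    timedelta(minutes=28),
    timedelta(minutes=31),
    timedelta(minutes=19),
]

# This quarter's goal; ratchet it down over time toward the 5/5/5 budgets.
target = timedelta(minutes=20)

mean_mttd = sum(detection_times, timedelta()) / len(detection_times)
hit_rate = sum(1 for d in detection_times if d <= target) / len(detection_times)

print(f"Mean time to detect: {mean_mttd}, {hit_rate:.0%} of incidents within target")
```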
So I have to ask, and I suspect I might know the answer to this, but given that it seems
very hard to move this quickly, especially at scale, is there an argument to be made
that effectively prevention obviates the need for any of this?
Where if you don't misconfigure things in ways that should be obvious, if you practice
defense in depth to a point where you can effectively catch the things that the first layer meets with successive layers, as opposed to, well, we have a firewall.
Once we're inside of there, well, it's game over for us.
Is prevention sufficient in some ways to obviate this?
I think there are a lot of people that would love to believe that that's true.
Oh, I sure would.
It's such a comforting story. I think one of my opening points on this, in the benchmark description actually, is that we've done a pretty
good job of advertising prevention in cloud as an important thing and getting people to actually
like start configuring things more carefully or like checking how those things have been configured
and then changing that configuration should they discover that it is not compliant with some
mundane standard that everyone should know, right? So we've made great progress in thinking about cloud prevention, but as usual, like prevention
fails, right?
Like I still have smoke detectors in my house, even though I have done everything possible
to prevent it from catching fire and I don't plan to set it on fire, right?
But like threat detection is one of these things that you're always going to need because
no matter what you do, A, you will make a mistake because you're a human being
and there are too many things
and you'll make a mistake.
And B, the bad guys are literally in the business
of figuring ways around your prevention
and your protective systems.
So I am full-on on defense in depth.
I think it's a beautiful thing.
We should all obviously do that.
And I do think that prevention
is your first step to a
holistic security program. Otherwise, what even is the point? But threat detection is always going
to be necessary. And like I said, even if you can't go 5-5-5, you don't have threat detection
at that speed, you need to at least be able to know what happened later so you can update your
prevention system. This might be a dangerous question to get into,
but why not?
That's what I do here.
It's potentially an argument against cloud,
by which I mean that
if I compromise someone's cloud account
and any of the major cloud providers,
once I have access of some level,
I know where everything else in the environment is
as a general rule.
I know that you're using S3 or its equivalent
and what those APIs look like and the rest.
Whereas as an attacker, if I am breaking into someone's crappy data center hosted environment,
everything is going to be different. Maybe they don't have a SAN at all, for example. Maybe they
have one that hasn't been patched in five years. Maybe they're just doing local disk for some
reason. There's a lot of discovery that has to happen that is almost always removed from cloud.
I mean, take the open S3 bucket problem that we've seen as a scourge for five, six, seven years now,
where it's not that S3 itself is insecure,
but once you make a configuration mistake,
you are now in line with a whole bunch of other folks
who may have much more valuable data
living in that environment.
Where do you land on that one?
This is the leave cloud to rely on security
through obscurity argument. Exactly, which I'm not a fan of,
but it's also hard to argue against from time to time.
My other way of phrasing it is the attackers are moving up the stack argument.
Yeah, so there's some sort of truth in that, right? Part of the reason that attackers can
move that fast, and I think we say this a lot when we talk about the threat report data too,
because we literally see them execute this behavior, right? Is they know what the cloud
looks like, right? They have access to all the API documentation. They kind of know what all
the constructs are that you're all using. And so they literally can practice their attack and create
all these scripts ahead of time to perform their reconnaissance because they know exactly what
they're looking at, right? On premise, you're right. Like they're going to get into,
even if they get through my firewall, whatever,
they're getting into my data center.
They do not know what disaster I've configured,
what kinds of servers I have, where,
and like what the network looks like.
They have no idea, right?
In cloud, this is kind of all gifted to them
because it's so standard,
which is a blessing and a curse.
It's a blessing because, well, for them,
I mean, because they can just programmatically
go through this stuff, right?
It's a curse for them
because it's a blessing for us in the same way, right?
The defenders, A,
have a much easier time knowing what they even
have available to them, right?
The days of there's a server in a closet I've never heard of
are kind of gone, right? You know what's in your cloud account
because, frankly, AWS tells you.
So I think there is a trade-off there.
The other thing is
the moving up the stack thing, right?
No matter what you do,
they will come after you
if you have something worth exploiting you for, right?
So by moving up the stack, I mean, listen,
we have abstracted all the physical servers,
all the, like, stuff we used to have to manage the security of
because the cloud just does that for us, right?
Now we can argue about whether or not they do a good job,
but I'm going to be generous to them
and say they do a better job than most companies did before.
So in that regard, we say thank you and we move on to fighting this battle at a higher level of the stack, which is now the workloads and the cloud control plane and, you name it, whatever's going on after that.
So I don't actually think you can sort of trade apples for oranges here.
It's just bad in a different way.
Do you think that this benchmark is going to be used by various companies who learn about it?
And if so, how do you see that playing out?
I hope so.
My hope when we created it was it would sort of serve as a goalpost or a way to measure.
It won't just be marketing words on a page, never mentioned again anywhere.
That's our dream here.
Right.
I was bored, so I wrote some.
I had a word minimum I needed to get out the door.
So there we are.
It's how we work.
Right.
As you know, I used to be a Gartner analyst.
So my desire is always to create things
that are useful for people to figure out
how to do better in security.
And my tenure at the vendor is just a way to fund that more effectively.
I keep forgetting you're ex-Gartner.
Yeah, it's one of those fun areas of, oh, yeah, we just want to basically talk about all kinds of things because we have a chart to fill out here.
Let's get after it.
I did not invent an acronym, at least.
Yeah, so my goal was the following.
People are always looking for a benchmark or a goal or a standard to be like, hey, am I doing a good job? Whether I'm like a SOC analyst or director, and I'm just looking at
my little SOC empire, or I'm a full-on CISO and I'm looking at my entire security program to kind
of figure out risk, I need some way to know whether what is happening in my organization is like
sufficient or on par or anything. Is it good? Is it bad? Happy face? Sad face? Like I need some
benchmark, right? So normally the Gartner answer to this typically is like, you can only come up
with benchmarks that are like, only you know what is right for your company, right? It's like,
you know, standard, it depends answer, which is true, right? Because I can't say that like, oh,
a huge multinational bank should follow the same benchmark as like a donut shop, right? Like that's
unreasonable. So this is also why I say that our benchmark is probably more tailored to the more advanced
organizations that are dealing with kind of high maturity phenomena and are more cloud
native.
But the donut shops should kind of strive in this direction, right?
So I hope that people will think of it this way, that they will kind of look at their
process and say, hey, like, what are the things that would be really bad if they
happened to me in terms of threat detection? Like, what are the threats I'm afraid of,
where if I saw this in my cloud environment, I would have a really bad day? And can I detect
those threats in 555? Because if I can, then I'm actually doing quite well. And if I can't,
then I need to set like some sort of roadmap for myself on how I get from where I am now to 555,
because that implies you would be doing a good job. So that's sort of my hope for the benchmark is that people think of it
as something to aspire to. And if they're already able to meet it, then they'll tell us how exactly
they're achieving it, because I really want to be friends with them. Yeah, there's a definite
lack of reasonable ways to think about these things, at least in ways that can be communicated
to folks outside the bounds of the security team.
I think that's one of the big challenges
currently facing the security industry
is that it is easy to get so locked
into the domain-specific acronyms,
philosophies, approaches, and the rest
that even coming from, well, I'm a cloud engineer
who ostensibly needs to know about these things.
Yeah, wander around the RSA floor with that as your background, and you get lost very quickly.
Yeah, I think that's fair.
I mean, it is a very, let's say, dynamic and rapidly evolving space.
And by the way, like, it was really hard for me to pick these numbers, right?
Because I very much am on that whole it depends bandwagon of, I don't know what the right answer is.
Who knows what the right answer is? I say 5-5-5 today, like tomorrow
the attack takes five minutes and now it's two and a half, two and a half, right?
Whatever, you have to pick a number and go for it. So I think to some extent we just have
to try to make sense of the insanity and choose some
best practices to anchor ourselves in or some kind of sound logic to start with and then go from there.
So that's sort of what I go for.
So as I think about
the actual reaction times needed for 555
to actually be realistic,
people can't reliably
get a hold of me on the phone within five minutes.
So it seems like this is not something
you can have humans in the loop for.
How does that interface
with the idea of automating things versus
giving automated systems
too much power to take your site down as a potential failure mode? Yeah, I don't even answer
the phone anymore, so that wouldn't work at all. That's a really, really good question. And probably
the question that gives me the most, I don't know, I don't want to say lost sleep at night, because
it's actually, it's very interesting to think about, right? I don't think you can remove humans
from the loop in the SOC. Like, certainly there will be things you can auto-respond to, to
some extent but there better be a human being in there because there are too many things at stake
right some of these actions could take your entire business down for far more hours or days than
whatever the attacker was doing before and that trade-off of like is my response to this attack
actually hurting the business more than the attack itself, is a question that's really hard to answer, especially for most of us technical folks who
don't necessarily know the business impact of any given thing. So first of all, I think we have to
embrace auto-response actions. Back to our favorite crypto miners, right? There is no reason
to not automatically shut them down. There is no reason, right? Just build in a detection and an
auto-response
every time you see a crypto miner,
kill that process, kill that container,
kill that node.
I don't care, kill it.
Like, why is it running?
This is crazy, right?
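As a deliberately blunt sketch of that auto-response, assuming the detection already tells you which Kubernetes namespace and pod the miner is in, and assuming you have accepted the blast radius of killing it:

```python
from kubernetes import client, config

def kill_mining_pod(namespace: str, pod_name: str) -> None:
    """Delete a pod flagged as running a crypto miner; its controller will replace it."""
    config.load_incluster_config()  # or config.load_kube_config() when run off-cluster
    core = client.CoreV1Api()
    core.delete_namespaced_pod(name=pod_name, namespace=namespace, grace_period_seconds=0)
    print(f"Killed {namespace}/{pod_name} in response to a crypto miner detection")
```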
I do think it gets nuanced very quickly, right?
So again, in Scarlet Eel,
there are essentially like five or six detections
that occur, right?
And each of them theoretically has a potential auto-response
that you could have executed,
depending on your sort of appetite for that level of intervention.
Right. Like when you see somebody assuming a role, that's perfectly normal activity most of the time.
In this case, I believe they actually assumed a machine role, which is less normal.
Like, that's kind of weird. And then what do you do?
Well, you can just like remove the role.
You can remove that person's ability to do anything or remove that role's ability to do anything.
But that could be very dangerous because we don't necessarily know what the full scope of that role is as this is happening.
So you could take a more mitigated auto-response action and add a restrictive policy to that role, for example,
to just prevent activity from that IP address that you just saw.
Because we're not sure about this IP address, but we're sure about this role, right? So you have to get into these sort of risk tiered response actions where you say, okay,
this is always okay to do automatically.
And this is like sometimes okay, and this is never okay.
And as you develop that muscle, it becomes much easier to do something rather than doing
nothing and just kind of like analyzing it in forensics and being like, oh, what an interesting
attack story, right?
So that's step one is just start taking these different response actions.
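One possible shape for that kind of mid-tier action, sketched with boto3; the role name, policy name, and the single-IP deny condition are illustrative, and IAM policy evaluation should be reviewed carefully before automating anything like this:

```python
import json

import boto3

def quarantine_role_from_ip(role_name: str, suspect_ip: str) -> None:
    """Attach an inline policy denying all actions made directly from a suspect IP."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"IpAddress": {"aws:SourceIp": f"{suspect_ip}/32"}},
        }],
    }
    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName="ir-quarantine-suspect-ip",
        PolicyDocument=json.dumps(policy),
    )
```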
And then step two is more long-term, and it's that you have to embrace the cloud-native
way of life, right?
Like this immutable, ephemeral, distributed religion that we've been selling.
It actually works really well if you go all in on the religion.
I sound like a real cult leader.
If you just go all in, it's going to be great.
But it's true, right? So if your workloads are immutable,
that means they cannot change as they're running,
then when you see them drifting from their original configuration,
you know that it's bad.
So you can immediately know that it's safe to take an auto-response,
well, it's relatively safe to take an auto-response action
to kill that workload because you are 100% certain
it is not doing the right things, right?
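A rough sketch of acting on drift, simplified here to "a pod's image no longer matches what its Deployment declares"; real runtime drift (say, a new process appearing in a container) would come from a runtime tool, but the kill-and-let-the-controller-recreate step looks much the same:

```python
from kubernetes import client, config

def kill_drifted_pods(namespace: str, deployment_name: str) -> None:
    """Delete pods whose containers run a different image than the Deployment declares."""
    config.load_kube_config()
    apps, core = client.AppsV1Api(), client.CoreV1Api()

    deploy = apps.read_namespaced_deployment(deployment_name, namespace)
    declared = {c.name: c.image for c in deploy.spec.template.spec.containers}
    selector = ",".join(f"{k}={v}" for k, v in deploy.spec.selector.match_labels.items())

    for pod in core.list_namespaced_pod(namespace, label_selector=selector).items:
        for c in pod.spec.containers:
            if declared.get(c.name) != c.image:
                core.delete_namespaced_pod(pod.metadata.name, namespace)
                print(f"Deleted drifted pod {pod.metadata.name}")
                break
```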
And then, furthermore, if all of your deployments are defined as code,
which they should be,
then it is approximately, though not entirely, trivial
to get that workload back, right?
Because you should push a button,
and it just generates that same Kubernetes cluster
with those same nodes doing all those same things, right?
So in the on-premise world,
where shooting a server was potentially a fireable offense, because if that server was running something critical and you couldn't get it back, you were done.
In the cloud, this is much less dangerous because there is an infinite quantity of servers that you could bring back.
And hopefully, infrastructure as code and configuration as code in some wonderful registry, version controlled, for you to rely on to rehydrate all that stuff, right?
So again, to sort of TL;DR, get used to doing auto-response actions, but do this carefully.
Define a scope for those actions that make sense, not just like something bad happened, burn it all down, obviously.
And then as you become more cloud native, which sometimes requires refactoring of entire applications, by the way,
so this could take years, just embrace the joy of everything is code.
That's a good way of thinking about it. I just, I wish there were an easier path to get there for an awful lot of folks who otherwise don't find a clear way to unlock that.
There is not, unfortunately. I mean, again, the upside on that is like, there are a lot of people
that have done it successfully, I have to say. I couldn't have said that to you like six, seven years ago when we were just getting started on this journey.
But especially for those of you who were just at KubeCon however long ago, before this airs, you see a pretty robust ecosystem around Kubernetes, around containers, around cloud in general. And so even if you feel like your organization's behind, there are a lot of
folks you can reach out to, to learn from, to get some help, to just sort of start joining the masses
of cloud native types. So it's not nearly as hopeless as before. And also one thing I like to
say always is almost every organization is going to have some technical debt and some legacy workload
that they can't convert to the religion of cloud.
And so you're not going to have a 555 threat detection SLA on those workloads.
Probably.
I mean, maybe you can, but probably you're not.
And you may not be able to take auto-response actions.
You may not have all the same benefits available to you.
But that's okay.
That's okay.
Hopefully, whatever that thing is running is worth keeping alive.
But set this new standard for your new workload.
So when your team is building a new application
or if they're refactoring an application for the new world,
set the standard on them
and don't torment the legacy folks
because it doesn't necessarily make sense.
They're going to have different SLAs for different workloads.
I really want to thank you for taking the time
to speak with me yet again
about the stuff you folks are coming out with.
If people want to learn more, where's the best place for them to go?
Thanks, Corey. It's always a pleasure to be on your show.
If you want to learn more about the 5/5/5 benchmark, you should go to sysdig.com/555.
And we will, of course, put links to that in the show notes.
Thank you so much for taking the time to speak with me today.
As always, it's appreciated. Anna Belak, Director of the Office of Cybersecurity
Strategy at Sysdig. I'm cloud economist Corey Quinn, and this has been a promoted guest episode
brought to us by our friends at Sysdig. If you've enjoyed this podcast, please leave a five-star
review in your podcast platform of choice. Whereas if you've hated this podcast, please leave a
five-star review on your podcast platform of choice, along with an angry, insulting comment that I will read nowhere even approaching
within five minutes. If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill
by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business
and we get to the point.
Visit duckbillgroup.com to get started.