PurePerformance - Using Observability to Prioritize CrowdStrike Remediation with Josh Wood
Episode Date: August 5, 2024
When thousands of systems show a blue screen, which ones do you fix first to quickly bring up your most critical systems? For that you need to know which systems are impacted, which mission-critical applications run on them, and which dependent systems are also impacted by something like the recent CrowdStrike incident!
We have invited Josh Wood, Principal Solutions Engineer at Dynatrace, who was one of the first responders helping organizations leverage observability data to identify which systems to fix first to bring critical apps such as ATMs, self-service terminals, POS (point of sale), ... back up again quickly.
In this special episode Josh walks us through the technical details of the CrowdStrike BSOD (Blue Screen of Death), what caused it, how to leverage observability to get a prioritized list of systems to fix first, and what organizations can do to prevent impactful software issues in the future.
Here are the links we discussed in the episode:
Josh on LinkedIn: https://www.linkedin.com/in/joshuadwood/
Josh's blog on CrowdStrike BSOD: https://www.dynatrace.com/news/blog/crowdstrike-bsod-quickly-find-machines-impacted-by-the-crowdstrike-issue/
CrowdStrike Incident Takeaway Blog: https://www.dynatrace.com/news/blog/crowdstrike-incident-revisiting-vendor-quality-control/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance.
You can probably tell this is not the voice of Brian Wilson,
who typically does the intro to a new episode of Pure Performance.
This is Andy Grabner.
But I'm just trying to do as good a job as Brian would when opening up a new episode. However, what I will not do, like Brian typically does,
is start with a bad joke
or by rephrasing some strange dreams that he had.
Instead, I want to go straight into introducing
or having our guest introduce himself.
Josh Wood, thank you so much for being on the show today.
And also thanks for doing this at a very busy time.
Welcome to the show. Maybe Josh, a quick round of introduction. Who are you? What do you do?
And why do you think you are here on this podcast? Yeah, thank you, Andy. Pleasure to be
here. I'm Josh Wood. I work with Dynatrace. I've been with Dynatrace a little over five and a half
years now. Broad expertise in a number of different things.
I get interested in a variety of different topics in the observability space.
And I was fortunate to be able to help some of our customers during this CrowdStrike incident
be able to get their systems operational more quickly.
I produced a couple of different examples using the Dynatrace platform that
were instrumental in being able to get those systems up and online. And so I'm happy to
be here today to kind of, A, walk through what exactly had occurred, B, walk through what
observability solutions can do to help in similar situations if they were to occur again,
and then also kind of outline some future steps as to being able to help customers get
in front of situations like this if they were to occur.
Yeah, thank you so much for doing this.
And, you know, there's obviously, I'm not sure how it is in your life, but if you talk
with people that are not in IT, it's sometimes hard to really explain what observability is,
why it is important,
and what we do in order to keep systems running, because systems just work.
But then, I remember, obviously that Friday, I all of a sudden
get text messages from friends that are not in IT asking me how my
day is, because there's this thing going on
and they hope that we're not impacted
and not just have crazy days.
And it was actually my last day of vacation
and I had to open up the news feed
and then I was informed about CrowdStrike.
And so it's one of those events where all of a sudden
IT, especially when it doesn't work as expected,
really overshadows everything else
that happens in the world
and makes it into the newsfeeds of people
that are typically not reading up
on the latest in IT.
So it was that Friday, like a week ago.
Can you, for those people
that have not had a chance to look into this closely,
quickly highlight what happened, or just recap it for them?
Yeah, sure. And I also kind of similarly figured it out, not through my newsfeed, but because one of my kids had a babysitter and I needed to get some cash to pay the babysitter.
And none of the ATMs worked across the entire city.
So I had to say, well, we're going to have to wait maybe until a little later today.
Hopefully things come online.
I drove by a couple of ATMs and saw the blue screen of death.
And I said, this is unusual.
I don't expect a BSOD coming up on the ATM like that.
But to walk through what exactly had occurred.
So CrowdStrike itself, it has a very popular
platform called Falcon. And this
is a security tool
that is used to prevent certain
bad actors from
accessing systems, as most
of these security offerings are. And it has
two key components to make it work.
It has a sensor or an
engine and also
a piece called the Rapid Response content or the RRC.
And if you've read a couple of blog posts in this area, the two kind of work hand in hand.
The engine itself acts on the rules, the definitions that come from the RRC, that rapid response content.
And the rapid response content defines exactly what they need to be looking at.
You can think of it in some terms of like an antivirus definition.
Maybe not quite exactly how that works, but at least loosely so.
And that those definitions are always continuously updating.
So there are lots of updates that happen fast and frequently.
And they're more incremental in nature.
This allows their customers to get the best security posture they can whenever CrowdStrike identifies a particular vulnerability or threat.
They can pass that information through that RRC,
and then it goes on and gets processed by that engine.
Now, the engine portion of Falcon is pretty robust.
It's more static, doesn't change as often.
It goes through, as CrowdStrike terms it, a very extensive QA process
as automated testing, manual testing, validation, rollout.
So standard performance engineering kind of things.
The RRC, by contrast, has those frequent small updates.
From what we see and what we hear from different folks at CrowdStrike,
it appears that that process might be a little less robust than the engine version.
And unfortunately, what happened, we had a confluence of two unfortunate events. One,
the RRC part of that, the fast updates, that had a bug. And then the testing process itself, it also had a bug. So there were two bugs that occurred simultaneously. And within RRC, they unfortunately do not allow their customers to opt out of those updates, so those updates went out to everyone, all at around 04:09 UTC on Friday, July 19th.
And as a result of that global definition change happening at that moment,
that put all these different things into motion.
Okay?
And we'll get to what they could have done maybe differently, or different what's-next options, here in a little bit.
But that at its core was what exactly happened from the software
piece of the equation. Now, on the actual RRC bug, what was the nature of the bug? Why did it affect
things like ATMs or airport kiosks or point-of-sale devices? Why did it end up with a blue screen of death? Well, the Falcon platform itself has access
at a privileged level to the Windows underlying kernel.
So the operating system being able to make the thing work,
it has that access,
and it's had that access for a good reason.
It has that access to prevent you from being attacked
by these different risks that are coming in.
The problem with that, however, is that in this scenario, they ended up getting an
out-of-bounds memory problem.
And that out-of-bounds memory problem, in the world of software, that's an exception.
That out-of-bounds memory problem caused an unhandled exception to occur inside of the protected
memory space of the Windows kernel itself.
And then when that page fault occurs, you get a blue screen of death.
And that's what ultimately caused this cascading effect that we saw across the globe on Friday, July 19th.
So I'll just kind of pause there for a moment just to kind of highlight that. Maybe, Andy, if you had any questions for our listeners to see if we can maybe clarify
any pieces on how that exactly occurred. Thank you so much for the
explanation. I think it was pretty clear. And also, just like before
we actually hit the recording button, you do an amazing job of explaining things in simple terms.
So to recap, to make sure that I understood everything correctly,
there is an engine. It's
basically deep in the Windows operating system in the kernel because it needs to detect malicious
activities. And in order to understand if there's a new malicious behavior, it's getting updates
from a central service from CrowdStrike with new behavior that was detected.
And obviously, I think one of the reasons why they are doing this,
and we can debate about this if it's the right approach,
but why they're rolling out these updates constantly all the time,
because assuming they detect a new malicious behavior,
they write this definition and they only send it
to a fraction of the people, which is what we normally do with progressive rollouts,
all the others that are not getting it
and are then getting impacted in the time
when the rollout still takes time until it reaches them,
they could say,
why didn't you protect me but protected the others?
So I guess it's a challenging debate as well.
But basically, there was a bug, I guess, in parsing these definitions
or in the delivery, which then caused this particular component to crash.
And because it lives in the Windows kernel, it actually causes the blue screen
and therefore critical machines like ATMs, kiosks at the airport.
And I also heard about hospitals.
I heard about many different organizations and industries around the world being majorly impacted.
So definitely, you know, an amazingly impactful change that also showed us what happens if software doesn't work perfectly, which is something very critical.
Yeah, and this kind of highlights the next thing I want to outline here with CrowdStrike, which is the remediation of this issue.
Oftentimes, it's about getting our systems back operational, and the speed at which one can
do that, organizations can do that.
The challenge with this particular style of failure is that it cannot be easily automated at all.
The remediation of this overall blue screen of death,
this boot loop that occurs on your Windows servers
or your Windows clients,
is that the machine has to be forcibly put into a lower-tier mode like safe mode
and then have critical files removed
for the CrowdStrike Falcon platform.
Alternatively, you could then maybe flash your image
with a clean version of Windows
that had the correct sensors for CrowdStrike.
Either approach, we're talking, it takes a while;
ideally you can get it done in maybe an hour for a server or two. Maybe you can get some economies of scale if you can parallelize a lot of it.
But a lot of these steps are you going manually to those servers,
either in the data center or to your cloud counterparts to be able to get them fixed.
It's not an easy process.
And as a result of that, getting these systems operational took longer than maybe other incidents that have occurred historically.
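Purely as an illustration of the file-removal step described here, a minimal Python sketch follows. It assumes the machine has already been booted into safe mode or the Windows Recovery Environment and that any BitLocker-encrypted drive has been unlocked first; the directory and filename pattern follow CrowdStrike's published remediation guidance, and in practice this step was mostly done by hand from a command prompt rather than with a script.

```python
# Hypothetical sketch of the manual CrowdStrike remediation step described above.
# Assumes the host is already booted into safe mode / the recovery environment and
# that any BitLocker-encrypted drive has been unlocked first.
from pathlib import Path

# Directory and file pattern per CrowdStrike's published remediation guidance.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_CHANNEL_FILE_PATTERN = "C-00000291*.sys"

def remove_faulty_channel_files() -> list[Path]:
    """Delete the faulty channel files and return the list of removed paths."""
    removed = []
    for channel_file in CROWDSTRIKE_DIR.glob(FAULTY_CHANNEL_FILE_PATTERN):
        channel_file.unlink()          # remove the offending definition file
        removed.append(channel_file)
    return removed

if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print(f"Removed {path}")
    print("Reboot the machine normally to complete the remediation.")
```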
And this is probably why this incident will go down as one of the most severe, if not the most severe, outage in IT history, just by virtue of the nature of the remediation being so painful and manual.
On how to fix this thing, do you happen to know,
because, you know, we talked about these ATMs and these kiosks at airports, are these, I would still
assume, all virtualized desktops? Or, do you happen to know, are
these real desktop machines?
It was interesting. I can use the ATM example. I ended up having to go back later to get the cash that I needed for my babysitter. And I saw a Brinks truck. So in the States, these are heavily armored trucks that they have there. And you could see there was the guy who takes the money and is very secure. And then there was what looked like to be a tech guy with him sitting in the truck. And you could tell he pulled off the ATM front and it looked like what he had is a USB flash drive
to be able to boot the thing into a safe mode and then be able to remove. Either he had an ISO for
an image to cleanly restore. That would have been a physical device and not virtualized
at the edge. But for the most part, most of these servers would be virtualized.
There might be some opportunities to be able to use that, maybe do like a vMotion in the world
of VMware, or potentially stand up a new clean image with a new ISO. It depends exactly on the
circumstance. But just that one situation, where you saw this armored truck and a tech guy sitting in
the truck to try to fix the issue with the ATM kiosk. That was pretty telling.
Yeah, and especially if you think about the scale, as you mentioned earlier,
if you talk about one ATM, yes, that's okay.
If you talk about hundreds and thousands of ATMs, where you potentially have to
bring in somebody that is, from a security perspective, allowed
to open up that ATM, accompanied
by somebody that actually has the IT skills to then safely reboot that machine
or remove that one file that had to be removed. It's crazy.
Oh, absolutely crazy. And I wouldn't be remiss in
saying, I think the weekend of the 19th, a lot of IT people didn't get
a lot of sleep, to put it mildly.
So maybe we can move on. That's kind of
what happened exactly,
and how you can fix it
when you had that problem.
Now maybe we talk about the how-to-fix-it
issue, like what are some opportunities
for observability
offerings or
observability postures to be able to help
in those given circumstances. And, you know, this doesn't have to be specific to Dynatrace,
though, admittedly, you and I both work for Dynatrace. It's that this can be any observability
tool that does a good job of being able to consolidate metrics, events, logs, traces, user sessions, being able to
understand all those different pieces of the observability puzzle and then connect the dots
between those signals and establishing not only context, but also impact as to what those signals
truly mean. Now, of course, in the world of Dynatrace, a lot of this is very simple and straightforward.
We built the platform specifically to have that, but you could do this either through a DIY
approach. You could do it through a number of different other offerings in the marketplace as
well. So it's not necessarily unique to a Dynatrace offering per chance, but it's something
that you could do with any observability posture
provided that it was advanced enough.
Now, in the world of Dynatrace,
the reason why we
built it to have that context
at the core was to
make situations like this easier
to remediate.
So Dynatrace has
its observability technology,
call it the OneAgent, sitting on top of those virtualized servers, or at the edge on that ATM, or at servers back in the on-premise data center or in the cloud, wherever it might be.
And that one agent is looking at all these different signals, metrics, events, traces, logs, user sessions, and so forth. Looking at all these different signals and understanding, hey, this server is now offline.
And these are the events I saw on that server.
And these are the logs generated by that server when it went down.
And these are all connected in this way.
And by the way, this server talks to 10 other servers, or 15 other
servers. Or if you're the bank, it's the one that talks to the ATMs, and it's able to handle the
transactions that go from a user trying to get cash from the ATM back to the on-premise data
center to handle that API. Those contextual pieces, that's what Dynatrace is unique at doing and being able to do that
natively and out of the box.
And in the case of CrowdStrike, we were fortunate enough to be able to provide our customers
who use OneAgent a quick little how-to in a blog post that I had authored that shows a few statements using Dynatrace's
entity model, its data lakehouse that has that context built into it, something we call
Grail, and then say, alright, this server has
CrowdStrike running on it, maybe it has BitLocker running on it, so one of the
other challenges in remediating it is that this underlying server has
BitLocker.
The drives will be encrypted, and so you have to unlock, basically decrypt the drive before you can be able to fix it.
So being able to understand not only does that have a CrowdStrike, but it does also have this encryption service running too.
And then take that information and say, here's a list of servers.
And by the way, these servers have been recently restarted in the last 24 hours, or are still offline. Take that list and say, all right, now I know which ones
to go after. I have the ability to not only have that manifest, but also a way to sit here
and prioritize the remediation of these ones for business-critical activity. That was a
very helpful moment, and it felt good from the Dynatrace standpoint to be able
to provide that level of insight to our clients, where in this process, especially given how
manual the remediation task was, getting that list was crucial, and being able to then position
your folks to fix the ones that are most mission-essential
and then work your way down the line.
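To make the kind of query-and-prioritize workflow described here a bit more concrete, here is a minimal, hypothetical Python sketch. It assumes you have already exported a host inventory from your observability tool into simple records; every field name and threshold is an assumption chosen for illustration, not Dynatrace's actual entity model or DQL.

```python
# Hypothetical sketch: turn observability inventory data into a prioritized fix list.
# The Host record and its fields are illustrative assumptions, not a real API.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class Host:
    name: str
    has_crowdstrike: bool
    has_bitlocker: bool              # encrypted drives need unlocking before the fix
    is_offline: bool                 # no agent heartbeat right now
    last_boot: datetime              # last observed boot time
    critical_dependents: list[str] = field(default_factory=list)  # e.g. ["atm-gateway"]

def is_likely_impacted(host: Host, now: datetime) -> bool:
    """Impacted = runs CrowdStrike and is offline, or was rebooted in the last 24 hours."""
    recently_rebooted = (now - host.last_boot) < timedelta(hours=24)
    return host.has_crowdstrike and (host.is_offline or recently_rebooted)

def prioritize(hosts: list[Host], now: datetime) -> list[Host]:
    impacted = [h for h in hosts if is_likely_impacted(h, now)]
    # Fix hosts that critical services depend on first; BitLocker hosts need extra steps.
    return sorted(impacted, key=lambda h: (-len(h.critical_dependents), h.has_bitlocker))

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fleet = [
        Host("atm-backend-01", True, True, True, now - timedelta(days=3), ["atm-gateway", "payments"]),
        Host("back-office-07", True, False, False, now - timedelta(hours=5), []),
    ]
    for host in prioritize(fleet, now):
        print(host.name, "-> critical dependents:", host.critical_dependents)
```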
Yeah, and I think this is exactly the point that you made.
You put it very nicely earlier, right?
I mean, we're talking about things we can learn
through observability.
So obviously, we can look at the logs.
The logs will tell us, you know,
does CrowdStrike even run?
Was it impacted by that update? Is there also a BitLocker installed? I think this is an
additional piece of evidence, but what I really like is that we have the context, like to your
ATM example, just fixing an ATM or a thousand ATMs will not solve the problem if you do not also fix
the machines behind the scenes that actually communicate with the ATMs. So like having this
information and a prioritized list allows you to then fix the machines in the right order and fix
those machines that need to be fixed to bring your business critical systems up. Because I assume
many Windows machines that are out there are important, but may not be business-critical to
get airplanes back up and running, or to get ATMs up and running. And so you want to send your
technicians, of which you have a limited amount, to the right machines
fast, and then maybe later on focus on those that are not that critical.
100% agree.
You can even look at the ones that unfortunately were impacted at the hospitals, where there are certain things that can technically be life or death
with accessing certain medical records for a patient.
And the inability to do that quickly could be really that deciding factor. So having that easy way to understand what is the impact of this overall situation and
what do the systems do, then take that list and take your team, your finite number of
resources, and provision them accordingly to that task, that mandate.
Yeah.
I mean, this also reminds me,
and we talked about this in the preparation of this call,
this reminds me a lot about what happened back then
with Log4Shell.
Actually, it was already more than two and a half years ago
that the same thing happened, right?
Log4Shell all of a sudden hit us.
And then there were two options.
You could either fix everything
or you fix those things that you know were business-critical and vulnerable.
I think from a security perspective, it's also about vulnerability, because not every
system that was impacted by the Log4Shell issue
was accessible from the outside world,
and so nobody could exploit that security hole there. And in this case here
with CrowdStrike, obviously you can say, let's bring the critical systems and all the systems that are
connected to it back up that we need from hospitals, ATMs, the backend systems on ATMs,
and then focus maybe on some back office machines that are not business essential later on.
Absolutely. And again, this is not necessarily something
from the observability standpoint
that has to be done by Dynatrace.
Certainly, there are ways to pull these signals in
and establish context and prioritize context.
I will just remark on certain things,
comparing what Dynatrace offers vis-a-vis CrowdStrike.
CrowdStrike has ways to visualize or know that certain servers are there.
They have different dashboards.
But the challenge there is what do those servers mean?
Are they servers that I care about?
Maybe I see a list of 100 servers that are down or not reporting CrowdStrike data,
and maybe you can say, all right, the 100 servers that I have on my
CrowdStrike dashboard, those are the ones that I
care about.
Or you can say, ah,
of the 100 servers, these are the top
10, and the top 10 ones talk to
my banking service or
to my patient record service
or whatever. And that's what makes, in this particular scenario,
Dynatrace effective versus, say, other offerings
in the observability space. Yeah, and Joshua, what I've also seen,
maybe you've seen it as well, as we both work, obviously,
for Dynatrace, and we work with a lot of clients. Over the years,
I've seen and also advised customers when they are deploying apps
and bringing up new systems to provide additional metadata that then tells an observability platform like Dynatrace,
hey, is this a business-critical system or a non-business-critical system?
Is this a business-critical app or a non-business-critical app?
And with this additional information, you then provide easier and faster answers to exactly that question.
Hey, show me, based on my observability data,
what are my most business-critical systems
that run the most business-critical apps?
Show me how they connect with each other
and show me how to bring them up,
like in which sequence we should bring them up again.
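As one illustration of the metadata-enrichment idea described here, a minimal sketch using the OpenTelemetry Python SDK follows; many observability backends, Dynatrace included, can ingest such resource attributes. The attribute names like business.criticality and the service name are assumptions chosen for the example, not an official schema.

```python
# Minimal sketch: attach business metadata to all telemetry a service emits,
# so the observability backend can answer "which impacted hosts run business-critical apps?"
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attribute names below are illustrative assumptions, not a standard convention.
resource = Resource.create({
    "service.name": "atm-transaction-service",
    "deployment.environment": "production",
    "business.criticality": "mission-critical",   # vs. "back-office"
    "business.owner": "retail-banking",
})

trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("withdraw-cash"):
    # Every span emitted by this service now carries the business metadata,
    # so later queries can filter and prioritize on it.
    pass
```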
Exactly.
And that business intelligence is also crucially important here
in building an observability practice
where having that business context established in concert with the other signals that you would have, I think is equally,
if not more important, because again, it gives you that opportunity in these critical situations
to prioritize based upon what the business needs. Cool. So Josh, we learned about, first of all,
what happened. Thanks for the recap on what happened and going into some of the technical details
on why this blue screen of death, I think it's called.
What's the acronym?
BSOD, yeah.
So folks, for those of you that are not up to date with every acronym out there in IT space,
B-S-O-D, blue screen of death.
You explained all the technical details.
We talked about how observability can help.
We talked about the importance
that observability is great,
but that observability,
I think having a good observability strategy,
which is, I think, the term that you actually used
in the preparation of this call,
where you said it's not just about pulling in metrics, logs, and traces, but it's really
connecting them and then having a better view on the system so that we know what's business
critical and what's not, and what talks to each other.
I think that we just want to stress this fact that having data is great.
Having data in context is better because then you can get better answers.
Based on your experience,
since we've been working with organizations
that were impacted,
are we back to normal,
or are people still struggling with this?
You mentioned Log4J, and not to speak too ill,
but I still see Log4J at customers two years later.
And I know a few servers for a few of my customers that are still struggling to come back online
where maybe an image, when they tried to do this manual remediation, did not take.
And so they have to reflash it.
Thankfully, a lot of those servers, from what I hear from my customers, are not mission essential.
So they have the opportunity to sit here and say, okay, we can get to it as time permits.
Something like a SEV3 or SEV4, like a very low criticality kind of thing.
But I expect this to linger.
And especially understanding the long tail effects of this phenomenon.
It's maybe less about the immediate
blast radius of CrowdStrike, but perhaps the knock-on effects that it has on the rest of the
systems. These IT environments are becoming increasingly more complex. They have back
calls to on-prem. They have connections to cloud. They've got containers. They've got monoliths. They've
got a little bit of everything. Maybe there's a mainframe, maybe there's IBM iSeries. And so you have everything under the sun.
And so understanding exactly the consequences, or maybe even the unintended consequences, of this
is still to be seen. But I thankfully can say, for most of the customers that I've talked to
thus far, they are, I would say, 98% operational.
Maybe there's some lingering stuff hanging around.
And that in the case of those clients,
Dynatrace was pretty crucial
in being able to help them get back online.
That's great to hear.
I saw, obviously, your blog post.
And folks, if you're listening to this
and you want to see Josh's blog post,
we also have a GitHub repository with
the dashboards that we built both for our SaaS customers that are
already on Grail that can use DQL, or Dynatrace Query Language. But I think
you also built dashboards for our managed customers, where you
provided these dashboards as well. So folks, if you want to
follow up on these links, just check
out the description of this podcast and you will find all the details. The question is,
what's next? What can we learn from this? I mean, what can other people learn about this, right?
Because, you know, while it's easy to now talk about one incident for one company, as sad as it is, this can also happen to other organizations.
Absolutely.
And I think CrowdStrike even admitted
that they may have missed the mark
on their updates for their rapid response content offering,
one of the two pieces for the Falcon platform,
and that those deployments need to be done using a canary-style deployment.
So for those of you, many of the folks who listen in may have an awareness, but a canary-style
deployment comes from the term a canary in a coal mine.
Back in the old days when people would be mining deep into the earth, they would
put a canary down there, and if the canary would die for whatever reason, then they would
know that there were toxic gases present and that the miners needed to escape.
Same concept with the canary deployment, in that we produce a small segment,
some percentage, of the overall software deployment in production.
And as a result of having that segment put in production,
then we can see what happens. Does that canary indeed sing in the mine, or does it start to die?
And that ability to use real-world traffic and real-world data to assess that health of the
software is a standard practice
within the world of CICD.
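To make the canary idea concrete, here is a minimal, hypothetical sketch of the control loop: ship the new version to a small slice of the fleet, watch an error-rate signal from real traffic, and only then roll out to everyone or roll back. The deploy and metric functions are placeholders you would wire to your own delivery tooling and observability backend; the thresholds are assumptions for illustration.

```python
# Hypothetical canary-rollout sketch: the deploy/metric functions are placeholders.
import random
import time

CANARY_FRACTION = 0.05        # ship to 5% of hosts first
ERROR_RATE_THRESHOLD = 0.01   # abort if more than 1% of canary requests fail
OBSERVATION_WINDOW_SEC = 10   # kept short for the sketch

def deploy(version: str, fraction: float) -> None:
    print(f"Deploying {version} to {fraction:.0%} of the fleet")

def rollback(version: str) -> None:
    print(f"Rolling back {version} everywhere")

def canary_error_rate() -> float:
    # Placeholder: in reality this would query your observability backend
    # for the failure rate of requests served by the canary hosts.
    return random.uniform(0.0, 0.02)

def canary_release(version: str) -> bool:
    deploy(version, CANARY_FRACTION)
    time.sleep(OBSERVATION_WINDOW_SEC)          # let real traffic hit the canary
    observed = canary_error_rate()
    if observed > ERROR_RATE_THRESHOLD:
        print(f"Canary unhealthy ({observed:.2%} errors), aborting")
        rollback(version)
        return False
    print(f"Canary healthy ({observed:.2%} errors), promoting")
    deploy(version, 1.0)
    return True

if __name__ == "__main__":
    canary_release("falcon-content-v2")
```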
One can argue, and I think, Andy, you alluded to earlier
in this podcast, that if you're dealing in security for software,
that producing a canary-style deployment
might have its disadvantages, right?
That a canary deployment might then cause you to have customers
that miss out on the latest and greatest release.
And then as a result of that, missing that latest and greatest release,
then they are exposed to a threat vector
that they were not previously hoping to be exposed to
and potentially could lead to a breach, right?
Something malicious to occur.
There are pros and cons to both.
Potentially in a different style,
and I'm not here to advocate for what
CrowdStrike should or should not do,
but potentially a blue-green
style deployment where it is very much
like-for-like what they have, or a shadow
deployment, or using feature flags
to kind of turn things on and off.
There are a number of different vehicles
for them to perhaps consider.
We actually put out a number of these different ones
for third-party postures
on another recent blog post for Dynatrace.
One thing that we can do from the Dynatrace perspective
that helps on a CICD standpoint,
and this is not just for DevOps,
but DevSecOps as well,
is to put different gates in
the process. So as code is moving, progressing from left to right, from low-stage environments
and dev to production, we can gate against certain criteria. And I'm sure, Andy, you've talked about
this in this podcast before, but using different indicators, whether it's maybe the number of vulnerabilities in the code, or the overall response time of a given piece of code, or the
number of failures that are generated on that code, or maybe it has a memory issue. And that's
probably the one we can maybe use in this example, where this did have an
out-of-bounds memory issue, which caused this unexpected exception
to occur in the kernel for Windows.
These different indicators
that could be introduced earlier
in the software lifecycle
and then be used to enhance
and make the software release process
more robust.
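As a rough sketch of the release gates described here, the small Python example below evaluates a build's indicators against thresholds before promoting it to the next stage. The indicator names and threshold values are assumptions chosen for illustration; in a real pipeline they would come from your observability and security tooling rather than being hard-coded.

```python
# Hypothetical quality-gate sketch: gate a build on indicators before promotion.
from dataclasses import dataclass

@dataclass
class BuildIndicators:
    critical_vulnerabilities: int
    p95_response_time_ms: float
    failure_rate: float                 # fraction of failed requests in the test stage
    memory_growth_mb_per_hour: float    # crude indicator of a memory problem

# Thresholds are illustrative assumptions, not recommended values.
GATE_CRITERIA = {
    "critical_vulnerabilities": lambda v: v == 0,
    "p95_response_time_ms": lambda v: v < 500,
    "failure_rate": lambda v: v < 0.01,
    "memory_growth_mb_per_hour": lambda v: v < 5,
}

def evaluate_gate(indicators: BuildIndicators) -> bool:
    """Check every indicator against its criterion and report the result."""
    passed = True
    for name, check in GATE_CRITERIA.items():
        value = getattr(indicators, name)
        ok = check(value)
        print(f"{name}={value} -> {'pass' if ok else 'FAIL'}")
        passed = passed and ok
    return passed

if __name__ == "__main__":
    build = BuildIndicators(
        critical_vulnerabilities=0,
        p95_response_time_ms=320.0,
        failure_rate=0.002,
        memory_growth_mb_per_hour=12.0,   # would fail the gate, much like a memory bug
    )
    if evaluate_gate(build):
        print("Gate passed: promote to the next stage")
    else:
        print("Gate failed: block the promotion")
```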
Really depends on what you're trying to do.
There are obviously benefits and detriments
to taking different positions, but Dynatrace and other observability tools out there have ways to measure and then drive automation and intelligence in those processes so that they can be handled not only more robustly, but more consistently too, in a way that keeps the way we're
testing the same.
The last thing you want to do in a lower tier test is change the variables of the test.
The one benefit of testing a small segment of data in production is that you're getting
real world testing conditions.
And therefore, you're never going to get a better QA environment testing in production. But of course, the risk of testing in production is
that you either break something or that your customers get exposed to undue threats. So there
are ways for you to triangulate this, and you can use other observability offerings to make this
better as you shift left in the overall software lifecycle.
Andy, you're kind of the expert on that sort of thing. I might defer to you if you wanted to add anything to what I might have missed
on the CICD side of things.
I feel really proud that all the things that you just said perfectly hit home
and it's stuff that I've been talking about for many years.
And I also agree with you,
right?
We don't want to bash on anybody.
It is, there's just a lot of things we can do,
a lot of different options.
Sometimes it's easy to miss certain things that are possible.
Sometimes things happen because of time pressure, because of simple mistakes.
But I also agree with you, things like a canary deployment, even for a security product,
should be possible, even though you may just make a very small canary for a very short time,
but at least test it out first on a small subset of users
and then roll it out in stages to the rest.
Or, you know, try it out internally.
I think one of the things we see a lot of software organizations do
is, you know, using their own products.
You know, the term dogfooding comes to mind,
or we like to call it drink your own champagne.
Funnily enough, dogfooding, have you,
I'm pretty sure you've heard about that term? Yeah, oh yes. Do you know where it comes from?
Because I learned this last night. Oh, I always thought it was eat your own dog food. Yeah, and drink
your own champagne. So, the origin, it's interesting where it comes from. I actually
learned it from a TED talk that my wife was watching yesterday.
And it seems, right, the story was told that it was a CEO of a company who was producing pet food.
And in order to make a point, he went into the boardroom with a can of pet food.
He opened it.
He ate it.
Like I say, dog food.
He ate the dog food to show that if it is good for
me, it must also be good for our customers, who then feed this to their dogs. So this is where the term
dogfooding comes in. Eat your own dog food, because you want to be confident that you're
not killing anybody. And it's the same thing with using your own software, to prove that your software
lives up to the standards that your customers expect from
your software. So it's the same thing. Oh, 100% agree. You know, we at Dynatrace have been
dogfooding or drinking your own champagne for a while. And it's a fun anecdote. I'm going to
look that one up later and read up on that. And my wife will be like, why are you reading about
dogfood at 9.30 at night? And I'm like, don't ask me questions, dear.
So put simply, our customers and software companies in general
need to have this, and this quote I'm stealing
directly from one of our blogs.
I think it was really well said,
is that we have to have a holistic approach
to third-party risk management
that ensures that vulnerabilities in vendor
applications don't become that weak link in the overall security posture.
So being able to use that quality control process to understand your vendors, and Dynatrace
would not be unique.
It would be any software vendor, for that matter, and saying, give me your risk process.
Give me an understanding of your testing
process. And then from there, if I'm a company, I can understand if I rely on this SaaS vendor
for this particular part of my overall IT portfolio, what am I getting myself into?
So risk and risk remediation and risk mitigation are becoming ever more important in the complex
IT enterprise.
And that's honestly as important, if not more important, than other portions of it,
as this incident has shown us.
Josh, I would love to say I want to have you back for another episode on the next disaster,
but I actually hope I will not have you back because of the next IT disaster.
I'd rather like to have you back for some other insights
on more joyful things,
because I know you've
been working and helping
our customers over the years
to really implement
a holistic approach to stability, which is
a topic very dear to my heart, to all of our
hearts.
Thank you so much for coming
on the show and explaining, giving us a recap
on CrowdStrike, giving us some insight, especially folks, for those of you that have
maybe not yet heard about the details and you're interested. So I'm sure this was insightful.
I think the importance of observability was a really great topic to touch upon what we can do
when we connect the data.
So that's not just identifying which machines are impacted, but which ones are the ones that are running the most critical apps.
So, folks, if you currently don't have this information in your observability practice, meaning if you don't know what is the difference between machine A, machine B, or pod A or pod B, then this is something you should think about.
How can you implement an observability-driven engineering organization
where every time some new infrastructure gets deployed,
a new application gets deployed, you want to enrich this with metadata
that enriches your observability so that you can ask questions like, show me the most critical infrastructure based on the most critical apps that are running on it.
Because if you cannot answer this question, it will be very tough for you to deal with the next CrowdStrike.
Agreed.
Andy, I would love to come back.
I knock on wood.
I'm knocking on myself, too, just because, you know, the last name Wood.
I hope that it's not underneath certain circumstances
or underneath duress like this particular incident.
Happy to come back anytime.
And thank you for having me.
And thanks for the,
I should have thought about knocking on wood, of course.
I'm sure you've heard this too many times in your life.
Oh, indeed.
All right, indeed.
All right, folks.
Thanks for listening in.
And as I said, this was a special episode without Brian.
Brian will be back the next time because he's currently on a well-deserved vacation where he hopefully was not impacted by any of this.
And yeah, see you next time.
Thank you.
All right.
Thank you all.