PurePerformance - Using Observability to Prioritize CrowdStrike Remediation with Josh Wood

Episode Date: August 5, 2024

When thousands of systems show a blue screen, which ones do you fix first to quickly bring up your most critical systems? For that you need to know which systems are impacted, which mission-critical applications run on them, and which dependent systems are also impacted by something like the recent CrowdStrike incident!

We have invited Josh Wood, Principal Solutions Engineer at Dynatrace, who was one of the first responders helping organizations leverage observability data to identify which systems to fix first to bring critical apps such as ATMs, self-service terminals, and POS (point of sale) systems back up again quickly.

In this special episode Josh walks us through the technical details of the CrowdStrike BSOD (Blue Screen of Death), what caused it, how to leverage observability to get a prioritized list of systems to fix first, and what organizations can do to prevent software-impacting issues in the future.

Here are the links we discussed in the episode:

Josh on LinkedIn: https://www.linkedin.com/in/joshuadwood/
Josh's blog on the CrowdStrike BSOD: https://www.dynatrace.com/news/blog/crowdstrike-bsod-quickly-find-machines-impacted-by-the-crowdstrike-issue/
CrowdStrike Incident Takeaway Blog: https://www.dynatrace.com/news/blog/crowdstrike-incident-revisiting-vendor-quality-control/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Welcome everyone to another episode of Pure Performance. You can probably tell this is not the voice of Brian Wilson, who typically does the intro to a new episode of Pure Performance. This is Andy Grabner. But I'm just trying to do as good of a job as if Brian would open up a new episode. However, what I will not do like Brian typically does, start with a bad joke or with some,
Starting point is 00:00:48 rephrasing some strange dreams that he had. Instead, I want to go straight into introducing, or having our guest introduce, himself. Josh Wood, thank you so much for being on the show today. And also thanks for doing this at a very busy time. Welcome to the show. Maybe, Josh, a quick round of introduction: who are you, what do you do, and why, do you think, are you here on this podcast? Yeah, thank you, Andy. Pleasure to be here. I'm Josh Wood. I work with Dynatrace. I've been with Dynatrace a little over five and a half years now. I have broad expertise in a number of different things and get interested in a variety of different topics in the observability space. And I was fortunate to be able to help some of our customers during this CrowdStrike incident get their systems operational more quickly. I produced a couple of different examples using the Dynatrace platform that were instrumental in being able to get those systems up and online. And so I'm happy to be here today to kind of, A, walk through what exactly had occurred, B, walk through what observability solutions can do to help in similar situations if they were to occur again,
Starting point is 00:02:03 and then also kind of outline some future steps as to being able to help customers get in front of situations like this if they were to occur. Yeah, thank you so much for doing this. And, you know, there's obviously, I'm not sure how it is in your life, but if you talk with people that are not in IT, it's sometimes hard to really explain what is observability, why is it important that we keep systems, what do we do in order to keep systems running, because systems just work. But then, I remember, obviously that Friday, I all of a sudden
Starting point is 00:02:39 get text messages from friends that are not in IT asking me how my day is, because there's this thing going on and they hope that we're not impacted and not just have crazy days. And it was actually my last day of vacation and I had to open up the news feed and then I was informed about CrowdStrike. And so it's one of those events where all of a sudden
Starting point is 00:03:01 IT, especially when it doesn't work as expected, really overshines everything else that happens in the world and makes it to people into their newsfeed that are typically not reading up on the latest in IT. So it was that Friday, like a week ago. Can you, for those of people
Starting point is 00:03:21 that have not had a chance to look into this closely, do you want to quickly highlight what happened or just recap for those? Yeah, sure. And I also kind of similarly figured it out, not through my newsfeed, but I had to end up paying one of my kids had a babysitter and I need to get some cash to pay the babysitter. And none of the ATMs worked across the entire city. So I had to say, well, we're going to have to wait maybe until a little later today. Hopefully things come online. I drove by a couple of ATMs and saw the blue screen of death. And I said, this is unusual.
Starting point is 00:03:56 I don't expect a VSOD coming up to the ATM like that. But to walk through what exactly had occurred. So CrowdStrike itself, it has a very popular platform called Falcon. And this is a security tool that is used to prevent certain bad actors from accessing systems, as most
Starting point is 00:04:16 of these security offerings are. And it has two key components to make it work. It has a sensor or an engine and also a piece called the Rapid Response content or the RRC. And if you've read a couple of blog posts in this area, the two kind of work hand in hand. The engine itself acts on the rules, the definitions that come from the RRC, that rapid response content. And the rapid response content defines exactly what they need to be looking at.
Starting point is 00:04:45 You can think of it in some terms of like an antivirus definition. Maybe not quite exactly how that works, but at least loosely so. And that those definitions are always continuously updating. So there are lots of updates that happen fast and frequently. And they're more incremental in nature. This allows their customers to get the best security posture they can whenever CrowdStrike identifies a particular vulnerability or threat. They can pass that information through that RRC, and then it goes on and gets processed by that engine.
Starting point is 00:05:19 Now, the engine portion of Falcon is pretty robust. It's more static, doesn't change as often. It goes through, as CrowdStrike terms it, a very extensive QA process as automated testing, manual testing, validation, rollout. So standard performance engineering kind of things. The RRC, by contrast, has those frequent small updates. From what we see and what we hear from different folks at CrowdStrike, it appears that that process might be a little less robust than the engine version.
Starting point is 00:05:50 And unfortunately, what happened, we had a confluence of two unfortunate events. One, the RRC part of that, the fast updates, that had a bug. And then the testing process itself, it also had a bug. So there's two bugs that occurred simultaneously. And within RRC, they do not unfortunately allow their customers to opt out of those updates. all at around 409 UTC on Friday, July 19th. And as a result of that global definition change happening at that moment, that put all these different things into motion. Okay? And we'll go again to what they could have done maybe differently or different what's next options here in a little bit. But that at its core was what exactly happened from the software piece of the equation. Now, on the actual RRC bug, what was the nature of the bug? Why did it result
Starting point is 00:06:56 in things like ATMs or airport kiosks or point of sale devices? Why did it end up with a blue screen of death? Well, the Falcon platform itself has access at a privileged level to the Windows underlying kernel. So the operating system being able to make the thing work, it has that access, and it's had that access for a good reason. It has that access to prevent you from being attacked by these different risks that are coming in. The problem with that, however, though, is in this scenario, they ended up getting an
Starting point is 00:07:30 out-of-memory problem. And that out-of-memory problem that in the world of software, that's an exception. That out-of-memory problem caused an unhandled exception to occur inside of the protected memory space of the Windows kernel itself. And then when that page fault occurs, you get a blue screen of death. And that's what ultimately caused this cascading effect that we saw across the globe on Friday, July 19th. So I'll just kind of pause there for a moment just to kind of highlight that. Maybe, Andy, if you had any questions for our listeners to see if we can maybe clarify any pieces on how that exactly occurred. Thank you so much for the
Starting point is 00:08:12 explanation. I think pretty clear. And also, as you mentioned before, we actually hit the recording button. You do an amazing job in explaining things in simple terms. So to recap, to make sure that I understood everything correctly, there is an engine. It's basically deep in the Windows operating system in the kernel because it needs to detect malicious activities. And in order to understand if there's a new malicious behavior, it's getting updates from a central service from CrowdStrike with new behavior that was detected. And obviously, I think one of the reasons why they are doing this,
Starting point is 00:08:49 and we can debate about this if it's the right approach, but why they're rolling out these updates constantly all the time, because assuming they detect a new malicious behavior, they write this definition and they only send it to a fraction of the people, which is what we normally do with progressive rollouts, all the others that are not getting it and are then getting impacted in the time when the rollout still takes time until it reaches them,
Starting point is 00:09:15 they could say, why didn't you protect me but protected the others? So I guess it's a challenging debate as well. But basically, there was a bug, I guess, in parsing these definitions or in the delivery, which then caused this particular component to crash. And because it lives in the Windows kernel, it actually causes the blue screen and therefore critical machines like ATMs, kiosks at the airport. And I also heard about hospitals.
Starting point is 00:09:45 I heard about many different organizations and industry around the world to be majorly impacted. So definitely, you know, an amazingly impactful change that also showed us what happens if software doesn't work perfectly, which is something very critical. Yeah, even to this kind of highlights the next thing I want to outline here with CrowdStrike is the remediation of this issue. Oftentimes, it's getting our systems back operational and that speed at which one can do that, organizations can do that. The challenge with this particular style of failure is that it cannot be automated easily for it all. The remediation of this overall blue screen of death, this boot loop that occurs on your Windows servers
Starting point is 00:10:35 or your Windows clients, is that it has to be forcibly put into a lower tier mode like safe mode and then have the removal of critical files for the CrowdStrike Falcon platform. Alternatively, you could then maybe flash your image with a clean version of Windows that had the correct sensors for CrowdStrike. Either approach, we're talking, it takes many, many,
Starting point is 00:11:08 ideally you can get it done in maybe an hour for a server or two. Maybe you can get a couple of economies to scale if you can parallelize a lot of it. But a lot of these steps are you going manually to those servers, either in the data center or to your cloud counterparts to be able to get them fixed. It's not an easy process. And as a result of that, getting these systems operational took longer than maybe other incidents that have occurred historically. And this is probably why this incident will go down as one of the most severe, if not the most severe, outage in IT history, just by virtue of the nature of remediation being so painful and manual and how to fix this thing do you happen to know because you know we talked about these atms and these kiosks at airports are these i would still assume these are all virtualized desktops or are they do you happen to know virtualized or are
Starting point is 00:11:58 these real desktop machines i can use that it was interesting i can use use the ATM example. I ended up having to go back later to get the cash that I needed for my babysitter. And I saw a Brinks truck. So in the States, these are heavily armored trucks that they have there. And you could see there was the guy who takes the money and is very secure. And then there was what looked like to be a tech guy with him sitting in the truck. And you could tell he pulled off the ATM front and it looked like what he had is a USB flash drive to be able to boot the thing into a safe mode and then be able to remove. Either he had an ISO for an image to cleanly restore. That would have been a physical device and not virtualized at the edge. But for the most part part most of these servers would be virtualized there might be some opportunities to be able to use that maybe do like a v motion in the world of vmware um or potentially stand up a new clean image with a new iso depends exactly on the circumstance but just the one situation where you saw this armored truck and a tech guy sitting in the truck to try to fix the issue with the ATM kiosk. That was pretty telling.
Starting point is 00:13:08 Yeah, and especially if you think about the scale, as you mentioned earlier, if you talk about one ATM, yes, that's okay. If you talk about hundreds and thousands of ATMs, where you potentially have to bring somebody that is, from a security perspective, allowed to open up that ATM and then being accompanied by somebody that actually has the IT skills to then safely reboot that machine or clean that one file that had to be removed. It's crazy. Oh, absolutely crazy. And I wouldn't be remiss in
Starting point is 00:13:40 saying, I think the weekend of the 19th, not IT people didn't get a lot of sleep, to put it mildly. So maybe we can, that's kind of what happened exactly. How can you fix it when you had that problem? Now maybe we walk about on the how to fix it issue, like what are some opportunities
Starting point is 00:14:00 for observability offerings or observability postures to be able to help in those given circumstances. And, you know, this doesn't have to be specific to Dynatrace, though, admittedly, you and I both work for Dynatrace. It's that this can be any observability tool that does a good job of being able to consolidate metrics, events, logs, traces, user sessions, being able to understand all those different pieces of the observability puzzle and then connect the dots between those signals and establishing not only context, but also impact as to what those signals
Starting point is 00:14:39 truly mean. Now, of course, in the world of Dynatrace, a lot of this is very simple and straightforward. We built the platform specifically to have that, but you could do this either through a DIY approach. You could do it through a number of different other offerings in the marketplace as well. So it's not necessarily unique to a Dynatrace offering per chance, but it's something that you could do with any observability posture provided that it was advanced enough. Now, in the world of Dynatrace, the reason why we
Starting point is 00:15:12 built it to have that context at the core was to make situations like this easier to remediate. So if one Dynatrace has its observability technology and call it the one agent sitting on top of those virtualized servers or at the edge on that ATM or at a server's back in the on-premise data center or in the cloud, wherever it might be. And that one agent is looking at all these different signals, metrics, events, traces, logs, user sessions, and so forth. Looking at all these different signals and understanding, hey, this server is now offline.
Starting point is 00:15:48 And these are the events I saw on that server. And these are the logs generated by that server when it went down. And these are all connected in this way. And by the way, this server talks to 10 other servers, or 15 other servers. Or if you're the bank, it's the one that talks to the ATMs, and it's able to handle the transactions that go from a user trying to get cash from the ATM back to the on-premise data center to handle that API. Those contextual pieces, that's what Dynatrace is unique at doing and being able to do that natively and out of the box.
Starting point is 00:16:31 And in the case of CrowdStrike, we were fortunate enough to be able to provide our customers who use OneAgent a quick little how-to in a blog post that I had authored that shows a few statements using Dynatrace's entity model, its data lakehouse that has that context built into it, something we call Grail, and then say, alright, this server has CrowdStrike running on it, maybe it has BitLocker running on it, so one of the other challenges in remediating it is that this underlying server has BitLocker. The drives will be encrypted, and so you have to unlock, basically decrypt the drive before you can be able to fix it.
Starting point is 00:17:11 So being able to understand not only does that have a CrowdStrike, but it does also have this encryption service running too. And then take that information and say, here's a list of servers. And by the way, these servers have been recently restarted in the last 24 hours or still or are still offline and take that list and say all right now i know which ones to go after right i have the ability to not only have that that manifest but also a way to sit here and prioritize the ability to remediate these ones for business critical activity um that was a a very helpful moment and it felt good from the dietary standpoint to be able to provide that level of insight to our clients, where on this process, especially given how manual the remediation task was, getting that list was crucial, and being able to then position
Starting point is 00:18:02 your folks to fix the ones that are most mission essential and then work your way down the line. Yeah, and I think this is exactly the point that you made. You put it very nicely earlier that you said, right? I mean, we're talking about things we can learn through observability. So we can obviously, we can look at the logs. The logs will tell us, you know,
Starting point is 00:18:23 was the malicious, does CrowdStrike even run? Was it impacted by that update? Is there also a BitLocker installed? I think this is an additional piece of evidence, but what I really like is that we have the context, like to your ATM example, just fixing an ATM or a thousand ATMs will not solve the problem if you also not fix the machines behind the scenes that actually communicate with the ATMs. So like having this information and a prioritized list allows you to then fix the machines in the right order and fix those machines that need to be fixed to bring your business critical systems up. Because I assume many Windows machines that are out there, you out there are important, but may not be business-critical to
Starting point is 00:19:09 get airplanes back up and running, you get ATMs up and running. And so you want to send your technicians that you have the limited amount to the right people fast, or to the right machines fast, and then maybe later on focus on those that are not that critical. 100% agree. You can even look at the ones that unfortunately were impacted at the hospitals. or do the right machines fast, and then maybe later on focus on those that are not as critical. 100% agree. You can even look at the ones that unfortunately were impacted at the hospitals, where there are certain things that can technically be life or death with accessing certain medical records for a patient. And the inability to do that quickly could be really that deciding factor. So having that easy way to understand what is the impact of this overall situation and
Starting point is 00:19:51 what do the systems do, then take that list and take your team, your finite number of resources, and provision them accordingly to that task, that mandate. Yeah. I mean, this also reminds me, and we talked about this in the preparation of this call, this reminds me a lot about what happened last November with Lock4Shell. Actually, it was already a year and a half ago
Starting point is 00:20:14 where the same thing happened, right? Lock4Shell all of a sudden hit us. And then there were two options. You could either fix everything or you fix those things that you know were business critical and business critical and vulnerable. I think from a security perspective, it's also about vulnerability because not every system that was impacted by lock for shell, the lock for shell issue was maybe accessible from the outside world
Starting point is 00:20:40 and nobody could exploit that security thing. And in this case here with CrowdStrike, obviously you can say, let's bring the critical systems and all the systems that are connected to it back up that we need from hospitals, ATMs, the backend systems on ATMs, and then focus maybe on some back office machines that are not business essential later on. Absolutely. And again, this is not necessarily something from the observability standpoint that has to be done by Dynatrace. Certainly, there are ways to pull these signals in
Starting point is 00:21:14 and establish context and prioritize context. I will just remark that certain things, even things that Dynatrace offers vis-a-vis CrowdStrike. CrowdStrike has ways to visualize or know certain servers are there. They have different dashboards. But the challenge there is what do those servers mean? Are they servers that I care about? Maybe I see a list of 100 servers that are down or not reporting CrowdStrike data,
Starting point is 00:21:43 and maybe you can say, all right, the 100 servers that I have on my CrowdStrike dashboard, those are the ones that I care about. Or you can say, ah, of the 100 servers, these are the top 10, and the top 10 ones talk to my banking service or to my patient record service
Starting point is 00:21:59 or whatever. And that's how, that's what makes, in this particular scenario, DynTrace effective versus, say, other offerings in the observability space. Yeah, and Joshua, what I've also seen, maybe you've seen it as well, as we both work, obviously, for Dynatrace, and we work with a lot of clients. Over the years, I've seen and also advised customers when they are deploying apps and bringing up new systems to provide additional metadata that then tells an observability platform like Dynatrace,
Starting point is 00:22:30 hey, is this a business-critical system or a non-business-critical system? Is this a business-critical app or a non-business-critical app? And with this additional information, you then provide easier and faster answers to exactly that question. Hey, show me, based on my observability data, what are my most business-critical systems that run the most business-critical apps? Show me how to connect with each other and show me how to bring them up,
Starting point is 00:22:54 like in which sequence we should bring them up again. Exactly. And that business intelligence is also crucially important here in building an observability practice where having that business context established in concert with the other signals that you would have, I think is equally, if not more important, because again, it gives you that opportunity in these critical situations to prioritize based upon what business needs. Cool. So Josh, we learned about, first of all, what happened. Thanks for the recap on what happened and going into some of the technical details
Starting point is 00:23:28 on why this blue screen of death, I think it's called. What's the acronym? ESOD, yeah. So folks, for those of you that are not up to date with every acronym out there in IT space, B-S-O-D, blue screen of death. You explained all the technical details. We talked about how observability can help. We talked about the importance
Starting point is 00:23:53 that observability is great, but that observability, I think having a good observability strategy, which is, I think, the term that you actually used in the preparation of this call, where you said it's not just about pulling in metrics, logs and traces, but it's really connecting it and then having a better view on the system so that we know what's business critical and what not, what talks to each other.
Starting point is 00:24:17 I think that we just want to stress this fact that having data is great. Having data in context is better because then you can get better answers. Are we still, based on you, we've been working with organizations that are being impacted. Are we back to normal or are people still struggling with? You mentioned Log4J and not to speak too ill,
Starting point is 00:24:41 but I still see Log4J and customers two years later. And I know a few servers for a few of my customers that are still struggling to come back online that maybe an image they tried to do this manual remediation did not take. And so they have to reflash it. Thankfully, a lot of those servers, from what I hear from my customers, are not mission essential. So they have the opportunity to sit here and say, okay, we can get to it as time permits. Something like a SEV3 or SEV4, like a very low criticality kind of thing. But I expect this to linger.
Starting point is 00:25:19 And especially understanding the long tail effects of this phenomenon. It's maybe less about the immediate blast radius or crowd strike, but perhaps the knock-on effects that it has the rest of the systems. These IT environments are becoming increasingly more complex. They have back calls to on-prem. They have connections to cloud. They've got containers. They got monolith. They got a little bit of everything. Maybe there's a main maybe there's ibm i series and so you have everything under the sun and so understanding exactly the consequences or maybe even the unintended consequences of this are still to be seen but i will i thankfully can say for most of the customers that i've talked to
Starting point is 00:26:00 thus far they are i would say would say, 98% operational. Maybe there's some lingering stuff hanging around. And that in the case of those clients, Dynatrace was pretty crucial in being able to help them get back online. That's great to hear. I saw, obviously, your blog post. And folks, if you're listening to this
Starting point is 00:26:21 and you want to see Josh's blog post, we also have a GitHub repository with the dashboards that we built both for our SaaS customers that are already on Grail that can use DQL or Dynatrace query language. But I think you also build dashboards for our managed customers where you provided these dashboards as well. So folks, if you want to follow up on these links, just check out the description of this podcast and you will find all the details. The question is,
Starting point is 00:26:51 what's next? What can we learn from this? I mean, what can other people learn about this, right? Because, you know, while it's easy to now talk about one incident for one company. As sad as it is, this can also happen in other organizations. Absolutely. And I think CrowdStrike even admitted that they may have missed the mark on their updates for their rapid response content offering, one of the two pieces for the Falcon platform, and that those deployments need to be done using a canary-style deployment.
Starting point is 00:27:28 So for those of you, many of the folks who listen in may have an awareness, but a canary-style deployment comes from the term a canary in a coal mine. Back in the old days when people wouldn't be mining deep into the earth, they would put a canary down there, and if the canary would die for whatever reason, then they would know that there were toxic gases present and that the miners need to escape. Same concept with the canary deployment in that we produce a small segment, some percentage of the overall software deployment in production. And as a result of having that segment of the data put in production, and as a result of having that segment of the data put in production,
Starting point is 00:28:06 then we can see what happens. Does that canary indeed sing in the mind, or does it start to die? And that ability to use real-world traffic and real-world data to assess that health of the software is a standard practice within the world of CICD. One can argue, and I think, Andy, you alluded to earlier in this podcast, that if you're dealing in security for software, that producing a canary-style deployment might have its disadvantages, right?
Starting point is 00:28:42 That a canary deployment might then cause you to have customers that miss out on the latest and greatest release. And then as a result of that, missing that latest and greatest release, then they are exposed to a threat vector that they were not previously hoping to be exposed to and potentially could lead to a breach, right? Something malicious to occur. There are pros and cons to both.
Starting point is 00:29:06 Potentially in a different style, and I'm not here to advocate to what CrowdTrade should or should not do, but potentially a blue-green style deployment where it is very much like-for-like what they have, or a shadow deployment, or using feature flags to kind of turn things on and off.
Starting point is 00:29:22 There are a number of different vehicles for them to perhaps consider. We actually put out a number of these different ones for third-party postures on another recent blog post for Dynatrace. One thing that we can do from the Dynatrace perspective that helps on a CICD standpoint, and this is not just for DevOps,
Starting point is 00:29:42 but DevSecOps as well, is to put different gates in the process. So as code is moving, progressing from left to right, from low-stage environments and dev to production, we can gate against certain criteria. And I'm sure, Andy, you've talked about this in this podcast before, but using different indicators, whether it's maybe the number of vulnerabilities in the code or the overall response time of given piece of code or the number of failures that are generated on that code, or maybe it has a memory leak. And that's probably the one here where we can maybe use that one in this example, where this did have an out of out of bounds memory leak, which caused this unexpected exception
Starting point is 00:30:25 to occur in the kernel for Windows. These different indicators that could be introduced earlier in the software lifecycle and then be used to enhance and make the software release process more robust. Really depends on what you're trying to do.
Starting point is 00:30:42 There are obviously benefits and detriments to doing different positions, but Dynatrace and other observability tools out there have ways to measure and then drive automation and intelligence in those processes so that they can be handled not only more robustly, but more consistently too, in a way that has the same ways we're testing. The last thing you want to do in a lower tier test is change the variables of the test. The one benefit of testing a small segment of data in production is that you're getting real world testing conditions. And therefore, you're never going to get a better QA environment testing in production. But of course, the risk of testing in production is that you either break something or that your customers get exposed to undue threats. So there
Starting point is 00:31:35 are ways for you to triangulate this, and you can use other observability offerings to make this better as you shift left in the overall software lifecycle. Andy, you're kind of the expert on that sort of thing. I might defer to you if you wanted to add anything to what I might have missed on the SCICD side of things. I feel really proud that all the things that you just said perfectly hit home and it's stuff that I've been talking about for many years. And I also agree with you, right?
Starting point is 00:32:09 We don't want to bash on anybody. It is, there's just a lot of things we can do, a lot of different options. Sometimes it's, it's, it's easy to miss certain things that are possible. Sometimes things happen because of,
Starting point is 00:32:23 of time pressure because of simple mistakes. But I also agree with you, things like a canary deployment, even for a security product, should be possible, even though you may just make a very small canary for a very short time, but at least test it out first on a small subset of users and then roll it out in stages to the rest. Or, you know, try it out internally. I think one of the things we see a lot of software organizations do is, you know, using their own products.
Starting point is 00:32:58 You know, the term dogfooding comes to mind, or we like to call it drink your own champagne. Funnily enough, dog fooding have you i'm pretty sure you've heard about that term yeah oh yes do you know what is where it comes from because i learned this last night oh i always thought it was your own dog food yeah and drink your own champagne so um the origin it's interesting where it comes from so i i actually i learned from a from a ted talk that my wife was watching yesterday. And it seems, right, the story was told that it was a CEO of a company who was producing pet food.
Starting point is 00:33:36 And in order to make a point, he went into the boardroom with a can of pet food. He opened it. He ate it. Like I say, dog food. He ate the dog food to show if it is good for me it must also be good for our customers who then feed this to the dog so this is where the term dog fooding comes in so eat your own dog food because you want to be confident that you're not killing anybody and it's the same thing with using your own software to prove that your software
Starting point is 00:34:02 lives up to the standards that your customers expect from your software. So it's the same thing. Oh, 100% agree. You know, we at Dietrace have been dogfooding or drinking your own champagne for a while. And it's a fun anecdote. I'm going to look that one up later and read up on that. And my wife will be like, why are you reading about dogfood at 9.30 at night? And I'm like, don't ask me questions, dear. So put simply, our customers and software companies in general need to have this, and this quote I'm stealing is directed from one of our blogs.
Starting point is 00:34:37 I think it was really well said, is that we have to have a holistic approach to that third-party risk management that ensures that vulnerabilities in vendor applications don't become that wink-link in the overall security posture. So being able to use that quality control process to understand your vendors, and Dynatrace would not be unique. It would be any software vendor, for that matter, and saying, give me your risk process.
Starting point is 00:35:04 Give me an understanding of your testing process. And then from there, if I'm a company, I can understand if I rely on this SaaS vendor for this particular part of my overall IT portfolio, what am I getting myself into? So risk and risk remediation and risk mitigation is becoming ever more important in the complex IT enterprise. And that's honestly as important, becoming ever more important in the complex IT enterprise. And that's honestly as important, if not more important than other portions of it, as this incident has shown us. Josh, I would love to say I want to have you back for another episode on the next disaster,
Starting point is 00:35:38 but I actually hope I will not have you back because of the next IT disaster. I'd rather like to have you back for some other insights on more joyful things or more because I know you've been working and helping our customers over the years to really implement the holistic approach to stability, which is
Starting point is 00:35:58 a very dear topic to mine, all of our hearts. Thank you so much for coming on the show and explaining, giving us a recap on CrowdStrike, giving us some insight, especially folks, for those of you that have maybe not yet heard about the details and you're interested. So I'm sure this was insightful. I think the importance of observability was a really great topic to touch upon what we can do when we connect the data.
Starting point is 00:36:25 So that's not just identifying which machines are impacted, but which ones are the ones that are running the most critical apps. So, folks, if you currently don't have this information in your observability practice, meaning if you don't know what is the difference between machine A, machine B, or pod A or pod B, then this is something you should think about. How can you implement an observability-driven engineering organization where every time some new infrastructure gets deployed, a new application gets deployed, you want to enrich this with metadata that enriches your observability so that you can ask questions like, show me the most critical infrastructure based on the most critical apps that are running on it. Because if you cannot answer this question, it will be very tough for you to deal with the next crowd strike. Agreed.
Starting point is 00:37:19 Andy, I would love to come back. I knock on wood. I'm knocking on myself, too, just because, you know, the last name Wood. I hope that it's not underneath certain circumstances or underneath duress like this particular incident. Happy to come back anytime. And thank you for having me. And thanks for the,
Starting point is 00:37:36 I should have thought about knocking on wood, of course. I'm sure you've heard this too many times in your life. Oh, indeed. All right, indeed. All right, folks. Thanks for listening in. And as I said, this was a special episode without Brian. Brian will be back the next time because he's currently on a well-deserved vacation where he hopefully was not impacted by any of this.
Starting point is 00:37:59 And yeah, see you next time. Thank you. All right. Thank you all.
