Tech Brew Ride Home - (BNS) The Crowdstrike Thing (With Overmind.tech)

Episode Date: August 24, 2024

Breaking down the CrowdStrike outage with Overmind.tech.

Transcript
Starting point is 00:00:00 Welcome to another bonus episode of the Techmeme Ride Home, another portfolio profile episode. This is sort of a hybrid because we're going to talk to a company that Ride Home Fund is invested in, but it's also going to be sort of a newsy explainer episode.
Starting point is 00:00:54 We're going to talk a lot about that big CrowdStrike outage because today we're going to talk to Dylan Ratcliffe of Overmind. And what Overmind does is help people prevent things like this from happening. So Dylan, thanks for coming on. Thank you for having me, Brian. So let's just start there. Tell me what Overmind does, you know, in sort of a 30,000-foot sense, and then we'll drill into what happened with CrowdStrike. I mean, you're exactly right in saying that we're trying to prevent these sorts of outages, outages where you make a config change that you think is a good idea and it turns out to be a terrible idea, hopefully not as terrible as the CrowdStrike one. We calculate the blast radius, work out the dependencies and do a risk analysis in advance so that people can know that, you know, pressing this button could potentially ground all the flights in America before they press the button. I love that term blast radius, and obviously, you know, there was a huge blast radius with the CrowdStrike thing, but go into more detail about being able to identify those risks ahead of time,
Starting point is 00:02:11 because that's the key here. Once you've hit the button, it's too late, as we've seen. Yeah, the only way that we can do that is by understanding the blast radius and the context. What we've found, and it's the case in the CrowdStrike outage and in every outage that I've studied, whether I was involved or just studied it from what has been put out on the web, as in the CrowdStrike example, is that the change that was being made was not bad on its own. It was only bad when combined with other latent factors that already existed within the environment. And so when we're doing a risk analysis like this, it's not possible to look at a change and say that is a bad idea.
Starting point is 00:02:52 You have to take the change in context, understand all of the dependencies, understand how they are currently set up. Is this thing actually in use? What uses it? What is that thing doing? You need all of that in order to work out whether a change that might be fine in one environment might cause a massive outage in another environment. All of these things are entirely context dependent. And so working that out has previously been done by people with loads and loads of experience who have like a model in their head of how all this stuff fits together and what depends on what. And we're trying to sort of augment that by
Starting point is 00:03:28 building the model dynamically, doing the risk analysis dynamically and helping them. But context is key because the changes themselves are never bad on their own. And this, you know, Overmind works where people are working. So this is mapping your dependencies on like AWS and Kubernetes. This is essentially a layer where you're working that is just this added sort of insurance policy, I suppose. Yeah, and we deliver it to where you're working. So if you're working in GitHub and you're using GitHub Actions to run your
Starting point is 00:04:08 Terraform into AWS or into Kubernetes or whatever you're doing, we deliver the blast radius and the risks straight into, say, GitHub as a comment or into Terraform Cloud or into wherever it is that you're currently working today. All right, so let's talk about what happened with CrowdStrike. I'm going to caveat, obviously, that this is beyond my ken. But so CrowdStrike actually has released, I think, a couple of post-incident reviews at this point.
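As a rough illustration of the blast radius idea discussed above, the sketch below walks a dependency graph outward from a changed resource and collects everything that directly or indirectly depends on it. This is a minimal sketch, not Overmind's implementation: the function name, the graph shape, and the resource names are all hypothetical and chosen only for illustration.

```python
from collections import deque


def blast_radius(dependents, changed):
    """Return every resource that directly or indirectly depends on `changed`.

    `dependents` maps a resource name to the resources that depend on it.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            # Skip resources already visited so cycles cannot loop forever.
            if dep not in seen and dep != changed:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical environment: each key is depended on by the listed resources.
dependents = {
    "load-balancer": ["web-service"],
    "web-service": ["checkout-api", "search-api"],
    "checkout-api": ["payments-worker"],
}

affected = blast_radius(dependents, "load-balancer")
print(sorted(affected))
```

The same change can have a different result in each environment simply because the `dependents` map differs, which is the point being made about context.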
Starting point is 00:04:43 On a high level, for a dum-dum like me, but also for the non-dum-dums out there, tell me what happened. What was the fatal error here? So the fatal error that actually caused everything to break was an out-of-bounds read exception. There was an array with 20 things in it and it tried to load the 21st thing and everything fell over. Trying to read the 21st thing in a 20-element list wouldn't normally be a terribly catastrophic problem, unless you are running in kernel mode, which the CrowdStrike driver needs to do in order to do the work that it does.
Starting point is 00:05:24 So when you're writing software that runs as a driver in kernel mode like CrowdStrike, you can't make those sorts of mistakes. You can't afford to. There is nothing to catch you. If you try to read something that isn't there, the whole computer needs to restart because there is no way to recover from it. And unfortunately, in this instance, the situation that caused it to do that action basically happened immediately.
Starting point is 00:05:50 This wasn't like a 1% chance where, you know, the stars have to align and then it reads this 21st element in a 20 element list. It basically reads it straight away, which means the computer crashes, it starts back up, and it crashes immediately again, which sent these computers into the blue screen of death loop. That was the thing that actually fundamentally caused it, which sounds very simple, but how we got there is probably the more interesting part. Well, right, because, I mean, this, we assume these are, you know, professionals. There's all sorts of automated and manual testing that I'm sure happened.
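To make the "21st thing in a 20-element list" concrete, here is a minimal sketch of a bounds-checked read. It is written in Python purely for illustration, and the helper name and field values are hypothetical. In Python an unchecked `inputs[20]` raises an `IndexError` that ordinary code can catch; the point made above is that a kernel-mode driver has no such safety net, so the check has to happen before the read.

```python
from typing import Optional, Sequence


def read_field(values: Sequence[str], index: int) -> Optional[str]:
    """Bounds-checked read: return None instead of reading past the end."""
    if 0 <= index < len(values):
        return values[index]
    return None


# Hypothetical input: 20 elements, so the valid indexes are 0 through 19.
inputs = [f"field-{i}" for i in range(20)]

last = read_field(inputs, 19)     # the 20th element exists
missing = read_field(inputs, 20)  # the "21st thing" does not
```

Here `missing` is `None`, so the caller can reject the input rather than crash.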
Starting point is 00:06:27 So again, what was the thing that they missed? Well, my caveat is that I'm getting all this information from these post-incident reviews. I don't have an internal source either. So there is a reasonable amount of reading between the lines that needs to be done in order to find out what was missed, because they don't just say here's the thing that we missed. What they do say, and it's kind of interesting, is that in the first preliminary post-incident review, they start off by talking about how the testing works for the Falcon sensor. So the Falcon sensor is the thing that you actually install.
Starting point is 00:07:04 It's the thing that goes and inspects network traffic for suspicious activity and stuff like that. And they go into quite a lot of detail about how that gets deployed. They do automated testing. They do manual testing. They roll it out internally first. Then they roll it out to early adopters. Then it becomes generally available. Even when it is generally available,
Starting point is 00:07:24 users can select which sections of their infrastructure get the upgrade first. So you could upgrade the less important stuff first. And so that's a pretty normal deployment process, to be honest, and it's pretty well explained. Yeah, here's the diagram. Yes, if you're watching the video on YouTube, this is a diagram that Dylan, did you take this
Starting point is 00:07:48 from the after-incident report, or did you draw this up yourself? I drew it up myself after trying to wrap my brain around what they were trying to explain to me. The diagrams took a lot of effort. I would recommend having a look at them because it helps to explain the jargon. But what stands out with that process is, well, you've got automated testing. You've got manual testing. How did it happen then?
Starting point is 00:08:11 Like, how did it get to the point where it broke all of this stuff? And so in the next step in the post-incident review, they end up talking about what are called rapid response content updates, which is a separate type of update, which follows a separate process. And that update process is way more complicated. I'm not going to go into depth. You can read the blog post if you're interested in depth
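The sensor release process described above (automated testing, manual testing, internal rollout, early adopters, then general availability) can be sketched as a simple gated pipeline. This is an illustrative sketch only: the stage names follow the description in the conversation, while the function and the health-check callback are hypothetical.

```python
from typing import Callable, List

# Stage names taken from the process described in the conversation.
STAGES = [
    "automated tests",
    "manual tests",
    "internal",
    "early adopters",
    "general availability",
]


def staged_rollout(stages: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Advance through the stages in order and stop at the first failure.

    Returns the list of stages that completed successfully.
    """
    completed = []
    for stage in stages:
        if not is_healthy(stage):
            break  # halt the rollout; later stages never receive the update
        completed.append(stage)
    return completed


# A hypothetical update that only starts failing at the internal rollout.
completed = staged_rollout(STAGES, lambda stage: stage != "internal")
```

With this input the rollout stops after the two testing stages, so early adopters and general availability are never reached.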
Starting point is 00:08:38 about how they explain it here. But basically, they go on to explain the architecture of how these rapid response updates get delivered to you, but not how they decide to push something out. There's a server that delivers it and all that sort of stuff, and they explain that in detail, but they don't explain how they get the confidence to press the button to send the rapid response update to people.
Starting point is 00:09:05 Whereas in the previous example, they did explain, we do this testing, we do this testing, we do this testing. But when it comes to rapid response content, we get a lot of detail about the architecture and essentially no detail about the process. How do they get the confidence to press the button? And I think you just have to read between the lines there to work out why. Certainly, if I was writing it and there was a huge amount of testing being done,
Starting point is 00:09:32 I would have mentioned it in that situation, but I don't think we're ever going to get confirmation that there wasn't testing. For probably liability reasons and things like that. But you suggest that there had to be some level of awareness that there could be a problem with this? Is it one of those where it's like, well, we think it's possible, but it's probably not going to happen, so let's just go ahead. What do you think the thought process was? It reminds me of like, you know, the part in Oppenheimer where it's like, there's a small percent chance that all the oxygen on the planet could burn up. But yeah,
Starting point is 00:10:10 we're going ahead anyway. Push the button. I don't think that there was that, especially because with this particular update, they're a bit cagey about what it was, and fair enough, like it might have been a really important update to address a zero day that was happening right now. And so maybe there was a lot of pressure to get it out. They don't say that, but they probably wouldn't say that. So maybe there was a lot of pressure. It reads to me as if there wasn't any semblance of risk. Like it seemed perfectly normal. One thing that's really interesting is they speak about the timeline of what happened. And the important events in the timeline are they did a whole bunch
Starting point is 00:10:54 of testing on this new type of, it's called a template instance. Basically, it's a new way of discovering suspicious activity. They did a whole bunch of testing back in March when that was first released. And then they did three more deployments in April. And then, oh, yeah, if we can get the timeline view up, which is down towards the bottom of the blog. The first blog, sorry. They do three more deployments of this particular way of looking for suspicious activity, and then it comes to July the 19th,
Starting point is 00:11:38 which is the fateful day where they make the decision. Now, they actually talk about, in the post-incident review, what gave them the confidence to press the button, which is pretty rare, that you would speak about how you felt, like emotionally, why you thought that this was a good idea, in a post-incident review. And I think that they should be celebrated for putting that in. It gives a lot of color, which I think is really interesting.
Starting point is 00:12:05 And to answer the question of, like, did they think there was a 1% chance? I don't think so. The things that gave them confidence, as quoted in the review, were the fact that they did a bunch of testing in March, and the fact that they had already deployed similar configuration to this same feature three times before in April, and the fact that there was supposed to be a validator that caught any of that.
Starting point is 00:12:31 Now, in hindsight, the fact that something was tested a little bit over four months ago is probably not going to give me confidence that it's going to work in production. Similarly, the fact that I deployed three similar but not the same pieces of configuration over the previous month is also not going to give me confidence that something is going to work, to be perfectly honest. If I'm deploying configuration, I want to know that that exact same thing has been deployed somewhere else and tested somewhere else, not just a similar thing or a thing that uses the same feature set, which is what happened here. So I think in hindsight, it doesn't make sense. But at the time, I believe that this was normal. Certainly by the way it's written, they're not saying that people went outside the process.
Starting point is 00:13:21 They're not saying that people did anything that they shouldn't have. It seems like doing it this way, testing it once when it's first released and then getting confidence by just keeping on using it, was absolutely the norm. And it must have been working for a long time, since otherwise it would have happened earlier. One of the things in your piece that you maybe suggest is the idea that maybe people assume that changing configuration is safer than changing code. And in reality, as you point out, there's been a lot of outages, large outages recently, where the root cause was configuration changes. So is this sort of suggesting that people need to realign their thinking in terms of the risk profile of doing config?
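One way to express "that exact same thing has been tested", as opposed to "something similar was tested", is to fingerprint the configuration artifact and only allow deployment of fingerprints that were actually seen in testing. The sketch below is a hypothetical illustration of that idea, not a description of CrowdStrike's or Overmind's tooling; the function names and the sample payloads are assumptions.

```python
import hashlib


def fingerprint(config_bytes: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact configuration."""
    return hashlib.sha256(config_bytes).hexdigest()


def safe_to_deploy(config_bytes: bytes, tested_fingerprints: set) -> bool:
    """Allow deployment only if this exact artifact was previously tested."""
    return fingerprint(config_bytes) in tested_fingerprints


# Hypothetical payloads: v1 went through testing, v2 is merely similar to it.
tested = {fingerprint(b"rule-set-v1")}

print(safe_to_deploy(b"rule-set-v1", tested))  # the exact artifact that was tested
print(safe_to_deploy(b"rule-set-v2", tested))  # similar, but never tested itself
```

A single changed byte produces a different digest, so a similar configuration that uses the same feature set does not inherit the earlier test result.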
Starting point is 00:14:15 I think so. I think that a lot of people understand the risk of doing configuration. Not everybody, and I think that almost all massive outages like this end up being configuration changes. I think specifically in the case of CrowdStrike, I doubt that it was seen as configuration. It was this proprietary binary file, and then there's all these other layers of proprietary stuff
Starting point is 00:14:39 that's interpreting it and things like that. But when you zoom out and look at it, it is a sort of modular thing that gets installed, which is the sensor, that takes configuration, that teaches it what suspicious stuff to be looking for. And so even though it's hidden under many, many layers of proprietary in-house stuff, it effectively is a config file. But I don't think it was being treated that way. I think that not many people would deploy a config file to production without testing it, But because it was wrapped in so many layers of application-specific stuff,
Starting point is 00:15:15 it was sort of seen as not really a config file, not really something that could possibly cause a breakage. So I think that a lot of people understand the potential huge impact of configuration. But I think in this particular instance, it was somewhat covered up by lots and lots of layers of application-specific stuff. You certainly wouldn't go and change a config file directly in prod. But that's kind of essentially what was happening here if you really strip it back.
Starting point is 00:15:48 Real quick, and this is purely opinion, but what do you think of their response? Because there's been back and forth between certain customers like Delta, like, you didn't help us enough. And they're saying, well, we did reach out to help you. Or just in a broad sense, how do you think that they responded to this? Given that, again, this is one of the biggest in history. So I don't know how you can, you know, get an A plus for any of this. Yeah.
Starting point is 00:16:17 Yeah, I mean, the Delta stuff has been hilarious to watch go back and forth. It's very entertaining. The Reddit comments as well were good. I think overall it wasn't too bad. You really did have to read between the lines to see what actually happened, which it would have been nice to see less of. It would have been nice to not have to work quite so hard. The amount of jargon as well, like, had I not written this blog post,
Starting point is 00:16:46 I wrote it mostly for my own understanding because there was so much jargon in there. I understand that they kind of had to, but I think it could have been more simplified. And then when they released the full root cause analysis, the first like two paragraphs being marketing talk about the powerful on-sensor AI and how each sensor correlates context from its local graph store and stuff like that, I think that was a bit of a slap in the face, given what had happened. I didn't need to read two paragraphs of marketing stuff at the beginning of that root cause analysis. I think that was in very poor taste. But to be honest, the mitigations
Starting point is 00:17:26 that they're putting in place are reasonable. Once again, the mitigations sort of confirm that my reading between the lines was correct, in that they are implementing testing. They don't say testing for the first time, but they are implementing testing in this particular workflow, which I think is certainly going to help with something like this, where you have a bug that is so completely catastrophic that it breaks every single thing that it touches immediately. Any amount of testing will catch that. Will it catch bugs that are like a 1% chance thing? Maybe, maybe not. That's where you need to be doing staged deployments, and it does
Starting point is 00:18:07 seem that they are going to give customers the ability to control it. Frankly, I don't think that any security vendor will be able to go to their customers anymore and say, hey, we just push updates out to you and you don't have control over it. I think that while I'm sure a lot of security vendors do the same thing, those days are done. And then if you're having a renewal conversation, I think that will be the heart of the renewal conversation: how can we control the updates that get pushed out? Because it's not reasonable. I think that what they've done will definitely stop something this big because, you know, just running it on your local laptop probably would have caught something this likely to occur,
Starting point is 00:18:45 but hopefully giving customers more control over the way it's rolled out will help for the, you know, the 1% things that only affect certain configurations, which, you know, might be really detrimental to a single customer because everything they have fits into that 1%, but are not going to take down everything in quite such a spectacular manner. So let's bring this back to Overmind and let's imagine that someone listening out there, maybe they're not working at CrowdStrike, but they're going to push out something similar with their product. In a little bit more detail, can you describe to me tangible ways that Overmind would help prevent a CrowdStrike-like disaster for the listener?
Starting point is 00:19:36 I mean, in fairness to my own customers, not many people are skipping a test environment before they deploy to production. So even without Overmind, they have at least a somewhat representative example that does help you to get some degree of confidence. The problem is that deploying things into production is never quite exactly the same as testing. There's always more dependencies, things are larger, there's things that have existed, you know, for a long time that people have forgotten what they do and are usually not documented, and those dependencies are not well understood. And so what we really specialize in is, especially when you're going to production, being able to see what the dependencies are
Starting point is 00:20:27 in that specific environment that we've captured in real time. So we go out, we find what your AWS looks like. We find the dependencies in real time, looking at it right now. So for example, and this is something that happened to one of our customers somewhat recently, they had used a security group in AWS for something and they'd managed it with Terraform and done everything by the book, but they'd given it a name like internet access or some really generic name. And they were cleaning up after this project and deleting this security
Starting point is 00:21:00 group and they deleted it in all of the other environments and it was fine. And when they deleted it in production, a huge amount of their fleet just stopped working. And it was because other teams were not using Terraform, they were doing things manually. And because it had such an incredibly generic name, people had just selected that security group. And so it meant that by changing the rules on that security group, they were changing the rules for a huge amount of internal stuff because everyone had just been
Starting point is 00:21:30 using it because they thought they were supposed to because it had such a good name. And so you need to look at things in production, not base it on how it worked in test, because in test, it didn't really matter. People weren't doing as much stuff manually. These other teams were not creating this dependency, but then in production they were. And so it meant that without actually doing the risk analysis again, doing the blast radius calculation again in prod, it would not have been possible to catch something like that. In the case of things like CrowdStrike, it probably would have been possible to catch it with testing, but the more common outages are not things that are caught in testing, because mostly
Starting point is 00:22:18 people are already doing it. They happen because of discrepancies between testing and prod, discrepancies between dependencies and things like that, and that's sort of what we specialize in. I don't think I've mentioned yet that you can find out more at overmind.tech, O-V-E-R-M-I-N-D dot T-E-C-H. I'm going to link to the blog post that we've been discussing here, but also if anyone listening is interested in finding out more about Overmind, how should they get in touch? What do you want people to know about what you all are doing right now?
Starting point is 00:22:58 The easiest way to find out more about Overmind is just install our CLI and run overmind terraform plan. It's just like a normal Terraform plan, but you get a blast radius and you get risks. That's certainly the easiest way to get started. If you want to speak to me about it, hit me up on the website. There's a contact form. It goes straight to me. I'd love to speak to you about it as well. But certainly the easiest way to get started: install the CLI, run overmind terraform plan.
Starting point is 00:23:24 Beautiful. Again, that's Overmind.tech. Dylan, thanks for giving us an explainer and giving us possible solutions so that this doesn't happen to you. Thank you very much, Brian.
