The Changelog: Software Development, Open Source - The BSOD CrowdStrikes back (Friends)

Episode Date: July 26, 2024

Robert Ross joins us in CrowdStrike's wake to dissect the largest outage in the history of information technology... and what it means for the future of the (software) world....

Transcript
Starting point is 00:00:00 Welcome to Changelog and Friends, a weekly talk show about Citrix thin clients. Big thanks to our partners at Fly.io, the home of changelog.com. Over 3 million apps have launched on Fly. You can too. Learn more at Fly.io. Okay, let's talk. Hey friends, I'm here with a new friend of mine, Shane Harder, the founder of Cronitor. Check him out, cronitor.io. It lets you keep tabs on your cron jobs, Linux, Kubernetes, Apache Airflow, Sidekiq, and more. With over 12 open source integrations, you can instrument all your jobs no matter where you're running them. So, Shane, for me, I'm using Linux and Linux cron jobs are by far
Starting point is 00:01:06 the most popular in my opinion, right? But there's so many other cron-like things, Kubernetes, Airflow, Sidekiq. Help me understand the full spectrum of background jobs and cron jobs beyond Linux cron. Yeah, Linux cron jobs are massively popular. They are still, 40 years later, the tool that most developers will go to first when they need to start scheduling something in the background. But when you get into a team environment or an enterprise environment, there are a lot of other constraints at play and there are other considerations. And whether it's simply, like, redundancy that you're not going to get from crontab itself or, you know, more complex orchestration stories like you can get with Airflow, we see companies eventually outgrowing cron.
Starting point is 00:01:51 And what we wanted to be sure of is that, first of all, like, migrating from cron to anything else is a complicated thing. So we wanted to give you tools to help you monitor that transition and make sure your jobs are working well as you do that migration. And then second, we wanted to give you a way to unify all these different job platforms, because seldom do you have just, like, platform A and you migrate cleanly to platform B. Probably, in a real-world scenario, you're running both side by side for a while. You don't want to have different monitoring tools or different monitoring strategies for every different platform that you deploy. So our goal is, anywhere you're running a background job, you can use Cronitor. The number one way that we ensured that was possible is by having, like, a really simple API that you can just use with an HTTP request yourself, which is pretty abnormal for monitoring tools. But that works in a lot of cases. But to make it
Starting point is 00:02:44 easier, for every popular job platform out there, like Linux cron jobs, Kubernetes CronJobs, Windows, Sidekiq, Airflow, you name it, we have a Cronitor SDK that you can install that will automatically configure your monitoring, run in the background, and sync all your jobs with Cronitor the same way your Linux cron jobs will be synced. Okay, friends, join more than 50,000 developers using Cronitor. I'm one of them. You can start for free and they have a pay-as-you-grow pricing plan. Setup is too easy with more than 20 SDKs. Check them out at cronitor.io.
Starting point is 00:03:19 That's C-R-O-N-I-T-O-R dot I-O. Again, cronitor.io. Well, friends, we're here to discuss an outage, a disaster that made history. And we have a good friend of ours here, Robert Ross, the founder and CEO of FireHydrant, to help us dig into what exactly happened and, maybe more pertinently, how to prevent incidents at large or just deal with them. What do you think, Robert? Well, I'll do my best without wearing a monocle and thinking about exactly how this went down. But yeah, I've read every news source about it, I think, at this point.
Starting point is 00:04:09 I think everyone's heard about it, so excited to dive in. What are you guys talking about? I'm not even sure what we're referring to. Yeah, right, Jared. Did something happen? You know what I kept thinking every time I read CrowdStrike? I kept thinking of AC/DC's Thunderstruck. I couldn't quite pull the pun across
Starting point is 00:04:25 because it's CrowdStrike, Thunderstruck, but that song has been playing in my head probably before this happened, but it just happened to align. I don't know. I'm an AC/DC fan. What can I say? The developer may have been listening to that when they wrote the code. They might have been. That could be why
Starting point is 00:04:41 you're there, potentially. I like to code to some AC/DC. Yeah, for sure. Especially that song. That'll pump you up, man. For sure. I code faster when that type of music is playing.
Starting point is 00:04:53 That's for sure. I'm sure most folks are, to some degree, primed on what happened. But who wants to nominate themselves to explain at least a primer of what happened? I think you did it pretty well in News, Jared, but you also covered some other sides of it too. But what do you think? Do you want to handle it or do you want me to handle it? Well, there was a giant outage on Friday due to CrowdStrike pushing a bad update to a billion machines.
Starting point is 00:05:18 I'm not sure the exact number, but basically every Windows-based company, organization around the world was affected, probably, somehow. Many things were down. The banking industry got hit hard. Hospitals got hit hard. Airlines got hit hard, except for Southwest,
Starting point is 00:05:38 which I discussed in news. The reasoning, by the way, quick update on that, I put in news was that they are allegedly still running old versions of Windows 95, 3.1. Could be true. Might not be true. Those are actually rumors.
Starting point is 00:05:52 I thought that was a joke when I saw that. Maybe that's true, actually. It kind of was a... It duped Jared. It got him. It might be fake news. I updated our ChangeLog newsletter to make sure that it's accurate now because I thought it was funny, too, which is why I put it in there.
Starting point is 00:06:06 And it's true that Southwest was unaffected. And of course, Southwest famously was down, was it two years ago? For 10 days. Yeah. Because they couldn't. The holiday outage. Yeah, the holiday outage. And back then, the reasoning was that they were on really old
Starting point is 00:06:23 versions of Windows and they couldn't do stuff. And so I think those two stories combined to say perhaps their old versions of Windows have actually saved them this time. But allegedly, not necessarily the case. But funny either way. Yeah, man. I guess the way I would summarize it is the blue screen of death made an epic comeback and took over the world. Total world domination last week. Wouldn't you say that this affected
Starting point is 00:06:48 CrowdStrike customers, not just simply Windows users? Yeah, but I guess, here's what's weird about it. I had never even heard of CrowdStrike, but it sounds like, who's not a CrowdStrike customer? Robert, were you familiar with CrowdStrike prior to this?
Starting point is 00:07:03 Yeah, we used CrowdStrike at FireHydrant. Okay, so what do you use them for? Endpoint security. We run their Falcon daemon on all of the employee laptops. We don't use it for the services we provide, but it is running on every FireHydrant laptop. And these laptops are Windows, Linux, macOS? All Mac, yeah. So we weren't impacted by it, thankfully.
Starting point is 00:07:32 Just the Windows CrowdStrike world. Yeah, that's what it seems like. And it seems like there was a change that was in the new sensor that runs silently. I think a lot of people don't even know that they have CrowdStrike on their laptop. And that's by design, right? I would say a good product.
Starting point is 00:07:53 You don't even know it's there until it gives you a blue screen of death. It's a bad way to find out about it, but before then, brilliant. It's like you had a bunch of stuff in your walls and then eventually it falls out of the wall and you're like, oh, that's been rotting behind there for a long time. I think that the change is always the biggest cause of incidents. We see it all the time. Google
Starting point is 00:08:15 even has a stat that 80% of their incidents are caused by a change. So it's not exactly shocking that a change caused this. I think what's shocking to people is the scale of the incident. And when you had AC/DC's Thunderstruck playing in your head, I kind of had Jeff Goldblum in my head where he's like, flap your wings and a hurricane happens across the ocean. That's kind of what it felt like. The butterfly effect.
Starting point is 00:08:44 Yeah, the butterfly effect, exactly. That's kind of what it felt like to me. A very simple attempt to access memory that wasn't there grounded flights, and there are still grounded flights. Delta has canceled hundreds of flights every single day for the last five days. And I think we're just going to keep hearing about problems for the next few weeks from this thing.
Starting point is 00:09:04 Yeah, it would be interesting if somebody could somehow, some way come up with a global economic impact of this event. But it has to be measured in billions, maybe trillions of dollars. I think so. We had employees and teammates at FireHydrant that had to cancel trips. I had friends that were at the airport that had to cancel their weekend plans
Starting point is 00:09:24 that they were flying somewhere for. So it wasn't only the places like airports and hospitals that were impacted. It was local economies that were impacted by this as well. Friends going to the Dominican Republic that couldn't go. And it's hard to reschedule those types of plans. So it's kind of like, you know, that loss is probably not coming back. That money, yeah. Well, not to mention just labor.
Starting point is 00:09:56 Pure labor costs of mitigation or remediation, because this, unfortunately, does require, I think, direct contact with each machine affected, meaning you can't just remotely reboot these machines, is what I read. You have to actually go touch each machine and, I don't know, boot into safe mode, or maybe you know, Robert or Adam, exactly the process. But it's relatively straightforward, unless you have an encrypted hard drive, then it's slightly less straightforward. But we're talking about people walking around data centers, going to each computer, or walking around hospitals,
Starting point is 00:10:28 going to each computer. I mean, the amount of highly paid individuals effectively doing a mass reboot this week is probably measured in large numbers. Yeah, and even parts of the country in the US that had issues probably don't have, you know, a big workforce capable of doing this work. You think of a, you know, a giant airline, they have a massive IT team that can go and do that labor and that work. But Alaska, in rural Alaska, 911 wasn't working. People couldn't call 911.
Starting point is 00:11:04 Really? And at one point, even Portland's mayor declared a state of emergency on Friday. And there's parts of the impact area that just don't have a response unit that can go solve those problems. So I do think we're going to keep hearing about it. There's going to be inquiries by the government. I think I saw today that CrowdStrike's CEO is going to be
Starting point is 00:11:28 called upon by Congress. That was news of like 16 hours ago or so. AP had that out there. The Washington Post had it out there. House committee calls on CrowdStrike CEO to testify on the global outage. And not surprising. And he went on
Starting point is 00:11:44 air pretty quickly. He was like, this is our fault. We're fixing it. And I have to commend the confidence to just go and own it that quickly. But, you know, I have questions. I think everyone does. Even my aunts and uncles in their late 60s
Starting point is 00:12:00 who don't quite understand this type of world like we all do were asking me questions. I mean, everyone felt this, I think, in some way, shape or form. Well, Windows only. There's a lot of details, so I caught up with Dave Plummer. That's literally his name. He is on YouTube. He runs a channel called Dave's Garage. He's, from what I understand, an ex-Microsoft operating system developer. And so he knows a lot about this stuff. And I will link it in the show notes, but he was my source of literally what really happened
Starting point is 00:12:39 on the inside. There's also the code report from I think it's Fireside or Firesomething on Fireship on YouTube that also summarizes some things that I pay attention to as well as part of like researching this topic. So there's some theories that this is just simply bad quality code. This could be sabotage or this could be planned. Now those are obviously theories, not truth at this point. But I think it's important to look at, you know, Robert, you said change is what affects things and what causes incidents. We're not sure when exactly this code got pushed, but what happened was, or at least from my understanding, and thanks to Dave for
Starting point is 00:13:21 explaining this, is that this software, Falcon, as you all run as well, runs in what they call kernel mode. And stop me if you've heard this one before, but there's two lands to live in, basically, in the operating system world: you've got user mode and you've got kernel mode. And kernel mode has, you know, higher privilege. And when an application crashes in kernel mode, it crashes the system, and it does it by design because it's protecting the system. It's better to crash than to actually boot up. Something else worse could happen if that was the case. And CrowdStrike, their software called Falcon, lives and runs in kernel mode. And that's, I guess, by design.
Starting point is 00:14:00 I'm not sure why it has to. And then there's this lab that Microsoft has called Windows Hardware Quality Labs. Drivers that live in kernel mode, or run in kernel mode, that are third party have to go through this process to get deployed. And so it gets tested by Microsoft through this WHQL system to be able to be deployed, to get signed and used by the operating system, etc. But the way they bypassed this was, because, in Dave's words, they want to be agile, ambitious, and aggressive to get the latest protection, and so as a way to deploy this latest protection faster to Windows users, and I guess it's not the case for Mac or other systems because it didn't happen to you all, Robert,
Starting point is 00:14:46 is that they have these things called definition files that the kernel reads from. So when the kernel wakes up, if it's a new boot, it wakes up, it enumerates a folder, and looks for this other code, this dynamic code that gets deployed outside the kernel delivery system. So essentially you have unsigned code that runs in kernel mode. That's bad stuff. From what I understand, thanks to Dave, that's a rough version of the mechanics of how this works on the Windows system. I think it's a game of trade-offs, and that's a hard thing to feel now, right?
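Editor's sketch: the mechanism described here (a trusted component enumerating a folder and consuming whatever files it finds) is the kind of place where validating input before use matters. The Python below is only an illustration under stated assumptions. The file layout (`DEFS` magic, entry count, 8-byte entries), the `.def` extension, and the names `parse_definition`, `load_definitions`, and `demo` are all invented for this example and say nothing about CrowdStrike's actual format or code.

```python
from __future__ import annotations

import struct
import tempfile
from pathlib import Path

# Hypothetical file layout, invented for this sketch:
# 4-byte magic, 4-byte little-endian entry count, then `count` 8-byte entries.
MAGIC = b"DEFS"
HEADER = struct.Struct("<4sI")
ENTRY = struct.Struct("<Q")


def parse_definition(data: bytes) -> list[int]:
    """Parse one definition blob, rejecting anything malformed."""
    if len(data) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, count = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    expected = HEADER.size + count * ENTRY.size
    if len(data) != expected:
        # The header promises more (or fewer) entries than the file holds.
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return [
        ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size)[0]
        for i in range(count)
    ]


def load_definitions(folder: Path) -> tuple[dict[str, list[int]], list[str]]:
    """Enumerate a folder; load valid files and report invalid ones."""
    loaded: dict[str, list[int]] = {}
    rejected: list[str] = []
    for path in sorted(folder.glob("*.def")):
        try:
            loaded[path.name] = parse_definition(path.read_bytes())
        except ValueError:
            rejected.append(path.name)  # skip the bad file, keep running
    return loaded, rejected


def demo() -> tuple[dict[str, list[int]], list[str]]:
    """Write one valid and one truncated file to a temp folder, then load both."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        good = HEADER.pack(MAGIC, 2) + ENTRY.pack(7) + ENTRY.pack(9)
        truncated = HEADER.pack(MAGIC, 3) + ENTRY.pack(7)  # claims 3, holds 1
        (folder / "good.def").write_bytes(good)
        (folder / "truncated.def").write_bytes(truncated)
        return load_definitions(folder)
```

In this sketch a malformed file is rejected and reported instead of being read past its end. Python would raise an exception in either case; in kernel-mode code, the unchecked version of the same read is what takes the whole machine down.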
Starting point is 00:15:22 Like, people's flights got canceled, you know, hospital surgeries got canceled. Like, it's a big deal. But at the end of the day, it's easy to say this was the worst thing that could happen, instead of looking at the sum of the parts of all the things that were maybe prevented in the past. And we just have no idea. I don't even think that CrowdStrike would probably know. But how many things, via CrowdStrike
Starting point is 00:15:48 or another locking system, security system running, have been prevented: mass credit card theft or identity theft or other things going on? It's hard to say. No one's going to buy that now, though, because no one's going to look at a trade-off right now. They're going to be like, my flight got canceled. I don't care what my trade-offs were in the past. Right now, the other thing that I think that
Starting point is 00:16:14 is going to be, we're just going to have to see if CrowdStrike posts a public retrospective. But this code, the crime scene of this code base, could have been in there, we don't know, for 10 years. We have no idea. And another piece of code was deployed 10 miles away, or so they thought, from that code base or that line of code
Starting point is 00:16:42 calling that memory address, and then that caused it, right? I think it's one of the challenges with building software now. Like we were kind of saying earlier, the butterfly effect: software is so complex now and so vast that you can deploy a change in what you think is a different country of your code base, but it has an impact across the ocean somewhere else. And I would wager that's what happened here. I would wager there's just no way that CrowdStrike doesn't have a crazy test suite, and Microsoft is probably running tests for them. Because it does run in kernel mode, they have to get that approval, it sounds like. I just have a really hard time believing that this very simple line of code
Starting point is 00:17:27 just got deployed and took everything down. I could be totally wrong and totally off-base. I have no idea. But whenever I've taken down production, and it's been many times, it wasn't explicitly because of the one piece of code that I wrote. Because I tested that. I put that through its paces. I wrote unit tests.
Starting point is 00:17:44 It was the combination of that and something else. When you add chlorine and vinegar, what's that potent combination they say never to do because it's super toxic all of a sudden? That's what it feels like happened to me in this outage specifically. Yeah. I mean, for me, it seems like some of our most ingrained premonitions coming to fruition in terms of being down in the mucky
Starting point is 00:18:14 muck as a developer. We just know, and I've said many times, it just feels like we're building a house of cards. Because it's so complicated. And it's so intertwined. And it's effectively, especially with web development, we're talking about a worldwide distributed system which has things that happen.
Starting point is 00:18:34 Of course, there's an explanation in retrospect for all of these things, but when you build a house of cards, eventually it's going to just topple. And sometimes it topples in ways that you don't know why or when or how and what will be the downstream effects. And, of course, this isn't web development in this case. This is operating system code. But still, network to machines, being able to remotely update. Every once in a while, just a house of cards topples,
Starting point is 00:19:01 and we have to start over to a certain extent, rethink things, try to adjust, clean up the mess, and move forward. I mean, I even think of, for every person listening to this, like, think about the mechanics of what is going on as you're listening to this podcast. If you're using headphones right now, your headphones have software in them that is going to a Bluetooth chip that has software on it, that's part of an operating system that's translating that to go over the air to a cell phone tower that's running software, that's going to a network switch that Cisco probably built that's running software. And it just goes on and on, and then eventually hits an Apple Music server or some app, a Spotify server, that goes through a CDN. That's software. It's just software all the way down. I mean, it is thousands of touch
Starting point is 00:19:54 points of software for you to hear this stupid analogy that I'm making. Like, that's what it's like, and you had to go through that grueling exercise through that much software. And that's just the world now. That's the way it is. It's not going back. We can't unwind this anytime soon. Right. That's why I said sometimes you just have to clean up the mess and then obviously do a retrospective.
Starting point is 00:20:17 And one thing we can do is make sure that particular thing doesn't happen ever again. But that's just one of the things. That's what regression tests are for. I'm not going to let this particular bug bite me and my billion customers again. And I'm sure CrowdStrike, after they go through the PR process,
Starting point is 00:20:38 I mean, not pull requests, but public relations, because their stock was down 23%, I think. I mean, they are massively hurt by this. Their reputation is just in the mud. So they're going to go through all that and maybe there'll be people fired. Who knows what's going to happen. But then hopefully they sit down and say, okay, let's do our analysis.
Starting point is 00:21:00 Let's do our postmortem. Let's figure out how we can make this particular aspect of our business not hurt people again. But that's just one thing. Similarly, it goes back to the conversation on information security that we had with Jacob DePriest from GitHub's security team. The challenge of the defender is you have to defend the entire system and the attacker only has to find one hole. Bugs work the same way, only it's just accidental and not malicious, you know? And so in that conversation, I said, I feel like to a certain extent, resistance is futile. I mean, the defender does all they can, but you're still going to have the attacker succeed sometimes.
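Editor's sketch: the regression-test idea raised a moment ago ("I'm not going to let this particular bug bite me ... again") can be shown in a few lines. Everything here is hypothetical: `read_field` stands in for whatever code failed, and the "incident input" is made up. The point is only the pattern of pinning the exact failing input in a test once the fix is in.

```python
import unittest


def read_field(fields: list[str], index: int, default: str = "") -> str:
    """Return fields[index], or a default when the index is out of range.

    Hypothetical fix: an earlier version indexed without checking and
    failed on inputs shorter than expected.
    """
    if 0 <= index < len(fields):
        return fields[index]
    return default


class ReadFieldRegressionTest(unittest.TestCase):
    def test_index_past_end_returns_default(self):
        # Pins the (made-up) input that triggered the original incident.
        self.assertEqual(read_field(["a", "b"], 20), "")

    def test_valid_index_still_works(self):
        # The fix must not change behavior for well-formed input.
        self.assertEqual(read_field(["a", "b"], 1), "b")


def run_regression_suite() -> bool:
    """Run the suite programmatically and report whether it passed."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ReadFieldRegressionTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

As noted in the conversation, a test like this guards against one specific bug recurring. It does nothing for the next, different bug, which is why it is "just one of the things."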
Starting point is 00:21:39 And it seems like with software systems, the bugs are going to be there. I mean, we haven't found a way of eliminating all bugs. And so how do we build around, fortify, defense in depth, react, respond? I don't know. I think in one case, this is an advertisement for heterogeneous systems. What's the word? Not a monoculture, just like in biological systems, right?
Starting point is 00:22:08 Like you want to have... Yeah, regenerative farming where you have, you know, you plant two crops in the same plot of land and they help each other. Yeah, exactly. Just diversity inside of our software systems
Starting point is 00:22:19 so that when we have a problem in one particular system, aka Windows machines running CrowdStrike, that's not a worldwide global outage. That's like a regional, you know, 20% of the internet was down today, guys, versus what it actually, like that whole, let's have multiple operating systems,
Starting point is 00:22:41 not just worldwide, but even in our own organizations, which can be a huge burden, a huge pain. And we tend to want to normalize and streamline and formalize a specific stack of software because it's easier to maintain and manage. But then you just are vulnerable to
Starting point is 00:23:00 attacks at, like, a 100% scale of your organization. So, I mean, I think one takeaway we can have is like, hey, I'm really happy I'm running macOS today. Now maybe tomorrow, all the Windows users will be happy that they're running Windows and not macOS because something will attack macOS.
Starting point is 00:23:16 But the Linux users are having the best time of their life right now. Oh yeah, the memes are strong right now. What is the year of the Linux desktop, as you know, Jared? I've heard that the last 15 years of my life, and it has not come to fruition. Here's the through line to all this, though. The through line is massively deployed software.
Starting point is 00:23:37 That's it. Or massively dependent upon software in a different scenario, like a dependency. It's that this was everywhere, right? And then I think, very specifically to this scenario, there are some layers that may have been not thought through so well. Dave covered this in his description of how they bypassed the WHQL, which is the Windows Hardware Quality Labs system that is there to sign these drivers to run in kernel mode. Because, like, what runs in kernel mode is so limited because of its power.
Starting point is 00:24:17 And here they are able to run there, which is okay, fine. If you have to, then Windows and that team blesses you and they put you in the WHQL system to have this signed certificate to say, okay, your driver is blessed. We've tested it to the absolute best of our knowledge. We put it through all the paces. But at scale, the driver essentially is an engine that runs code that has not been signed or not gone through these paces. That alone, I'd imagine, Robert, as you look at what you do and how you help folks look at incidents, it's like, when we look at what we've done here, we have to examine the system we built. Maybe it's, you know, anti the Windows way to have this sidecar, this folder of definitions that the driver enumerates over and sucks in, and the driver essentially is an engine that runs unsigned code. That could be true if Dave's accurate.
Starting point is 00:25:15 And if that is true, it's by the relationship formed between CrowdStrike, the Falcon software, and the Windows team that has WHQL to allow this to live in kernel land and not user land. That's one thing. And then you've got just the ability to deploy at scale and for the system to do what it should have done. So, you know, when an app crashes, an app crashes. When the kernel mode
Starting point is 00:25:49 crashes, the system crashes. And it crashes because it has to. Like, it did what it should have done. There was a bug in the kernel driver that, when it booted up, for whatever reason, caused an exception at the kernel level. And when the kernel crashes, the whole system crashes.
Starting point is 00:26:08 And that's by design. So effectively it was preventative on purpose, but triggered by a bug or faulty code. Yeah. I think as software engineers, and I feel qualified to say this because it is a criticism, we love thinking that we, you know, have invented new things. And every once in a while you just kind of have to take a step back and think, oh, actually, we've gone through all this process without software. We've already done it. And the example I use all the time is, like, buildings and building codes and structures.
Starting point is 00:26:48 And when was the last time you heard of a building catching on fire? I live in New York City. There's a lot of opportunities for buildings to catch on fire. And it does happen. It does happen. But not nearly at the rate that it used to happen. If you think about the London fire, if you think about the San Francisco fire, like all of these events that occurred really just triggered new ways of thinking because of
Starting point is 00:27:17 catastrophe. And this will do the same thing. We've been perfectly fine for however long this sidecar technology has been running in production. We've been perfectly fine with that. And then now we're not, right? Or maybe now, maybe now we're not. The same thing has happened. I mean, we have sprinklers in our buildings because of fires. We didn't put sprinklers there as a preventative measure. We had to have a lot of fires before we said, maybe we should have sprinklers in buildings, or maybe we should put concrete as the center of the building so it doesn't fall when it becomes structurally unsound. And because of the hundreds of years that we've had of retrospectives and all of these
Starting point is 00:27:59 learnings from these types of things, we have safe buildings now. Same thing with cars. You were saying the kernel panic is a preventative measure. Cars have the same thing. They have crumple zones to protect the driver. It's designed to collapse in certain ways. And we're getting to that point with software more and more. I think the challenge we have for software is it's much easier to do new things with software than it is to do new things with cars. I can go write a crazy random piece of code and put it in production today to all the FireHydrant customers. I swear I won't do that, but I could do that and it would cost nothing.
Starting point is 00:28:38 There'd be virtually no labor. But with these other systems, it's expensive to do new things like that. So the problem, I think, is we're kind of getting ahead of our skis now with software. It's happening more and more that we're hearing about these global outages, because the system is changing constantly and we're introducing change at the fastest, most rapid rate that we possibly could. Like you were saying, it's a bit of a house of cards.
Starting point is 00:29:03 This is probably just the beginning. We're probably going to have another massive outage before we really start to learn, oh, maybe we should scale back how much we're actually changing these really complicated systems. Yeah, and the technical details of that hypothetical future outage could be wildly different than this. And so, you know, whereas maybe you can say, what was the cause of the fire? Well, it was a gas leak. Well, it was a person who was doing something. You know, there's these different reasons, but they're all kind of like, eh, something combusted where it shouldn't have. We didn't have preventative measures in place. With software, so much of it's wildly different. I think it could be very hard. Now, we have had some motion in the direction of, I think it was the United States White House recently promoted memory-safe languages, for instance.
Starting point is 00:29:53 Rust being, I think, named perhaps, but definitely the Rustaceans were very excited about that particular note. So we have kind of nudges happening by governments. I know the EU is what I would call more heavy-handed with their regulation around the things you can and cannot do with software. But gosh, it just seems like because of the diversity in software systems, you can't just put fire suppression in the building and be done. There's going to be so many different things, I think. So many different regulations and rules and details in order to actually harness up some sort of protection that would be effective against
Starting point is 00:30:34 an 80% solution even. I hear what you're saying. It's a crazy thought and I really hope we don't end up in this world. Buildings have regulated materials that they can be built with now, and, like, children's toys can't have certain chemicals in them. These are all very regulated industries. And, you know, could software eventually get to that point
Starting point is 00:31:00 where governments are like, you can't use any memory-unsafe language. It has to be blessed by the US government if it's being used for public distribution. Period. Could we get there? I don't know. Maybe. We've gotten there in almost everything else. People that have cabins in the woods have regulations
Starting point is 00:31:20 that they still have to abide by. It's a wild thought. I've never really had it until you started saying that, Jared. Well, what you're saying, though, is we get to the future innovations through past failure and retrospectives and learning. That's how we get to the future,
Starting point is 00:31:34 is deploying what we think is the best solution, it not being the best solution. There's some sort of catastrophe on a small or large scale. We examine that. We retrospective. We policy. We regulate.
Starting point is 00:31:50 We redeploy. And we try again. Well, the only other answer is to predict the future. Yeah. Yeah. And I think that's, to some degree, what developers are trying to do. They're at least tasked with trying to
Starting point is 00:32:05 solve the present problem that is future proof. That has a version of future proof in it. You hear that all the time, right? This is future proof code. I've never said that about my code. Maybe not, but somebody's like, this will future proof us. Somebody's definitely said that.
Starting point is 00:32:20 And I have always regretted it. Say feature proof, maybe. Maybe my code's feature proof. Yeah, not future proof. Yeah, feature free. What's up, friends? I'm here in the breaks with David Hsu, founder and CEO of Retool. So David, Retool has definitely cornered the market on internal tool software development, but zoom out for me. What's the big idea? Why did you start Retool? What is the big idea with internal software? Yeah, so Retool started at this point seven years ago. And when we started Retool, the core idea was that internal software is a giant, giant category that no one really thinks about. And what's surprising to most people is that internal software represents something like 50 to 60% of all the code written in the world, which might sound pretty surprising.
Starting point is 00:33:09 But if you think about it, most of us in Silicon Valley, we work at software companies, whether it's like an Airbnb, a Google, a Meta. These are all companies that are software companies selling software. And so most engineers in these companies are working on external-facing software. But if you think about most software engineers in the world, most software engineers in the world actually don't work at these software companies. There's not that many of them. There's maybe 10, 20 of them, big ones at least. Most of the companies in the world are actually non-software companies. So if you think about a company like an LVMH, for example, or like a Coca-Cola, for example, or like a Zara. Zara's not selling any software. They actually have a lot of software
Starting point is 00:33:45 engineers, actually. And all their software engineers, all they do day in and day out, is basically build internal software. So that's, I think, one reason we started Retool. The second reason we started Retool is if you look at all this internal software that people are building, it is remarkably similar. So if you take a look at, you know, like a Zara, for example, versus Coca-Cola, two very different companies, obviously. One a clothing company, one a beverage company. But if you actually look at the software they're building internally to go run their operations, it is remarkably similar. It's basically forms, buttons, tables, all these sort of pretty common building blocks, basically, that come together in different ways. But then if you think about, you know, not just the UI, but also what's the logic behind a lot of this stuff,
Starting point is 00:34:28 they're pretty much just hitting API endpoints, hitting databases. You care about authentication, you care about authorization. There are sort of a lot of common building blocks, if you will, to internal tools. And so for us, the insight was, wow, internal software is a ginormous category, and it's all so similar, and developers hate building it. And so could we create a sort of higher-level framework, if you will, for building all this software? And that would be really cool. That would be really cool. Okay, so listeners, Retool is built for everyone: built for enterprise, built for scale, built for developers, and that's you. And if you found yourself nodding your head to what David was saying, then check out Retool at retool.com slash changelog. It's the fastest way to build internal software.
Starting point is 00:35:09 Do yourself a favor. Get a demo or start for free today. Again, retool.com slash changelog. I really come back to this at-scale situation. I think, you know, when we have the larger catastrophes, outages, etc., it's because of widely deployed code, which is a great thing because that code is somehow widely useful. But then you've got to have certain things in place that once you're maybe at that level, certain things that have to take place to instantiate change. Because like you said earlier, Robert, it's usually change, and not so much that specific change, it's that change plus something else that's the unintended consequence of those two together. And I did look up, by the way, just because I was like, what actually happens when you combine chlorine bleach with vinegar? It produces chlorine gas, which is highly toxic, so don't do that. And the reaction is just, I couldn't remember. Not good at all. Very bad.
Starting point is 00:36:09 Yeah, it's not good at all. I mean, it will damage your eyes, your respiratory system, like your breathing. It's just not good at all. So never. We learned that the hard way, you know. Somebody did it. Yeah, see, exactly. But now we know. Someone did it. I like noticing obscure signs in public places because they're always indicative of some sort of incident. Every sign has a story. Yeah, I remember I was at a hotel one time and I was hanging out in the pool or maybe the hot tub. And there's a sign that said,
Starting point is 00:36:39 this pool is not for defecation purposes. Yeah, which was a very strange sign. And that might not be verbatim. And I can't remember if it was defecation or really, you know, it was very formal though. So it probably did say that. And I thought, yeah, somebody pooped in this pool at one point. And there was an incident where they said,
Starting point is 00:36:57 we got to put a sign up. Or someone watched Caddyshack and was just terrified. Just a Baby Ruth. Yeah. So yeah, we learn the hard way most of the time because we can't predict what will happen when we combine those two elements until somebody does it. And sometimes what happens is we go too far, honestly. We, governments, teams, whatever it is,
Starting point is 00:37:22 the reaction can almost be too much. And I really do hope that, I mean, this is such a big outage that governments are getting involved, that I really hope there's some restraint in what comes out of this. I do, because I can see a world where it does get more restrictive in the next few years because of this. Like, a good example is the TSA. You know, horrible, tragic event, 9/11, but the TSA has been proven time and time again to be security theater, and we spend billions upon billions of dollars on it every single year. And I think that's an example of, like, you know, we overdid it. We went too far, reactionary. I don't think the TSA should be gone entirely. I think, you know, there is purpose to it, but
Starting point is 00:38:13 there are plenty of examples of things in the world where we just go too far. For example, moratoriums on code. It's pretty often that you have a couple incidents in a row. And then what happens? Everyone says, don't deploy anymore. Stop deploying. And then you realize that you have a memory leak and your system dies anyways because you're not deploying and not restarting that process. And it dies anyways. So I just hope that we don't go too far with this, that we don't overreact to this massive outage. I want an appropriate reaction to it. Right. Just to add some layers to this
Starting point is 00:38:53 and going back to something you said, Jared, and it's kind of a sidetrack, but I kind of get the information now. I texted my friend. So I had lunch with a friend of mine yesterday. I won't say where they work, but they work at a bank. And he said they were down for four hours, which I think is a short time frame compared to other scenarios we've heard of.
Starting point is 00:39:11 I don't know if that was literally only exactly four hours or some coworkers were only down for four hours or the specifics. But let's just say at least a 10,000 plus organization when it comes to having laptops and distributed employees and branches and regional HQs and state HQs, whatever, and all these things. So at least a day, and those who did not have their laptop booted down and have to boot up were safe because there was no reboot required. But for those, Jared, you would love this because if you're a freaking multi-year streaker, what was the number of years for your laptop? I was listening back to our podcast recently and I can't remember which one. Yeah, my old, my very first MacBook Pro laptop.
Starting point is 00:39:57 I didn't reboot it for over a year. I just was trying to see how long it could go. Oh, did you do like uptime in terminal? Yep, uptime. Well, I also had iStat Menus, which will show that to you, which I've used for many years. So it's very cool. And I'd just close it and open it, and I refused to reboot it because I just wanted to see how long. I called it a server. Right. Yeah, and you'd have been safe. So the people that, you know, had your ambitions, I suppose, on
Starting point is 00:40:20 boot time were safe. But for those who booted down and booted back up the next day, which is a large majority of the people, right, they had that issue. And they were told to reboot and see if it fixed it. Obviously it didn't. And if that didn't work, they literally had to go to the local IT center for them to have a person, like you had said, Jared, touch the machine, do something to it, and then it was, you know, good to go again, you know? But could you imagine, like, could you imagine the cost of that enumerated across all the scenarios across
Starting point is 00:41:02 the entire globe that was affected by this. Was it 8.5 million Windows computers were actually affected in a single day? Where there was a larger deployment, but 8.5 million, I think, is the current number, if it's accurate. That's it? I think that was just one section of it, wasn't it? Well, I think that was the crash.
Starting point is 00:41:20 Like, there was, like, that many Windows computers that crashed. I don't know if those are the only computers that were affected, necessarily, but those are the ones that were, like, in the critical sphere of should be up but not up. So yeah. Well, you know, and one of those servers was a SQL Server 2000 that, right, 500 other servers were... Right. Yeah, the cascading failure is massive. I just feel like Nick Burns had his best day of work ever. Do you guys remember Nick Burns from Saturday Night Live? This is your company's...
Starting point is 00:41:54 Your company's... Your company's computer guy. Your company's computer guy. Nick, the computer guy. He'll fix your computer. Then he's going to make fun of you. Because he's Nick Burns, the company's computer guy. Yeah, it's a Jimmy Fallon character.
Starting point is 00:42:12 It's one of his better characters. Not a huge Fallon fan myself, but this was a good one where he was just the most obnoxious computer guy stereotype ever. And nobody wanted to go ask him for help because he was going to just denigrate them. And I think his catch line was like, move, move. Was that so hard?
Starting point is 00:42:35 So I think Nick Burns had a great day. He gets to go around to everybody's computer and, get out of the way, I'm going to reboot this thing. The heroes, honestly. I mean, the amount of patience that you would have to have on that day. Saturday, Sunday, today. Yes. You know, oh gosh. Oh my gosh. Could you imagine this? Safe booting everything into safe mode and fix, I just couldn't even. And just to have a list of
Starting point is 00:43:02 like hundreds of computers you have to do next. You're like, all right, just one by one. Bam, bam. Oh my gosh. Yeah, that's true. It was a Friday event that happened over the weekend. I mean, not even just those affected by obviously the downtime and their travel and their plans or their work. It's now like, wow, IT has a big job to do. I was just watching like the first 30 seconds of, well, I'll link it up in the show notes, Nick Burns, your computer guy, or your company's computer guy.
Starting point is 00:43:31 He's like, something about a virus and he's not going to be able to reboot. Like, he just almost described what happened. You've got to go and fix it. So I'll drop that in the notes, or maybe even the audio. We'll see. I mean, this outage, this CrowdStrike outage, really hit every trope. It really did. Deployed on a Friday. Right. Global outage.
Starting point is 00:43:55 Windows. Yeah, I mean, the whole... It brought in the operating system wars. It really hit so many checkboxes. Memory unsafety, of course. There was a lot of C++ versus Rust conversations. Yeah, I saw a lot of flaming of C++. That's what kind of irks me, because I'm like,
Starting point is 00:44:14 I don't know, the stuff that you probably tweeted this tweet from is probably running C++ in some way, shape, or form. Certainly somewhere in the stack, yes. I can even think of it. I think that Twitter/X runs Envoy, which is written in C++, right? I don't know. Stuff. I was thinking about this actually from an incident standpoint, and, Robert, you know a thing or two about incidents, right? You know one or two things about them at least. Like, yeah, I think so. Would you think so? I mean, test me out here.
Starting point is 00:44:45 Just checking. It's like, so specifically to my friend in the bank situation, their team had to raise an incident company-wide that wasn't even their fault. It wasn't like their IT department messed up. So can you describe what you hypothesize for how the incident in a well managed IT slash technology stack organization would and should react when it's not even their problem? Like it's their problem, obviously, but they didn't do it. And the fix is not clear because it's upstream. How do you think this percolated inside? What's your hypothesis? So it's a good question.
Starting point is 00:45:28 I mean, for an incident like this, like you're saying, it's on the outside of your controlled world. It's challenging. So your job at that point for whatever these teams, the banks, the call centers, all these places that were down because of this outage, the first job is going to be containment and workarounds. You're going to try to find a workaround as fast as humanly possible. And those teams, what they're going to do is they're going to work
Starting point is 00:45:58 within their controlled world. So an IT team at a bank probably is going to tell everyone at the bank impacted, own the communications like, it's not a bug that we're causing. Here's the news that I'm sure everyone probably knew at that point. Here's what you can do to try to fix it, right? Here's how you boot into safe mode. Here's how you do X. And the incident responders at that point, they're just going to be trying to create a perimeter where it doesn't get worse and they can do things a little bit better. A good example is like, if you think of a wildfire, there are firefighters that are
Starting point is 00:46:37 fighting the fire, that's CrowdStrike. And then there are firefighters down or rather up the hill, chopping down brush, cutting down trees, like trying to stop it from going any further. That's kind of what those teams are going to do. That's the mode that they're going to go into. I can't say for sure, but in the situations where I've had a vendor outage, that's the first thing we do: we try to look for another route. This happened recently. I mean, we actually,
Starting point is 00:47:05 our CDN provider, you know, incidents are natural, so I won't name them. It's not, not blaming them, but they had an incident like a week and a half ago, only impacted Newark, pretty small. And we can't control that. And we had to own that. And we had an incident opened internally because all of the East Coast users were going through this point of presence and they were getting 502s. So what we did is we actually just rerouted traffic. We just took our CDN out of the loop
Starting point is 00:47:34 and that's how we got around it. That was the only thing we could do. And I think teams are going to have to start thinking about these emergency routes more and more, especially because of this CrowdStrike outage. They're going to be like, what is our risk surface area? If we use this vendor and that vendor goes down, are we screwed? I think a lot of companies are going to start thinking that now, just because of this one outage. It's going to be pretty present in people's minds. And the management
Starting point is 00:48:01 process is going to have to change. You're going to have to create like your go bag of incident management when it's out of your control. I remember doing these practices back when I was in school, which was a MIS degree with a CS minor I was going to school for, which is, you know, management information systems. I probably haven't said that phrase since I graduated, but I remember them doing these practice routines, business continuity planning. I'm starting to remember the acronyms as well. Disaster recovery. Like you would actually write down
Starting point is 00:48:31 what are all the things that could possibly go wrong, which is a fool's errand, by the way. But you'd still try. You'd do your best, right? There's the predict the future part. You can get close. There's your predict the future part. And then you'd have to come up with a game plan
Starting point is 00:48:43 for each of these situations. Like, how are we going to mitigate the impact? How are we going to continue to run our business? What are the workarounds? What are the next steps, etc.? And I did enjoy those processes, except for the writing part, of course, because I was in school. Nobody wants to write. I thought it was very useful to think like, what are a list of things that are likely to happen? Do you remember any of them? A lot of them were, well, they're completely made up businesses, of course. So it's all kind of just arbitrary because we didn't actually have any businesses.
Starting point is 00:49:21 And so we were like, you're the CTO of X corp that does Y thing. And now what could happen? And so you had to kind of like make up, here's our technology stack, here's what we're doing. And then if X, then Y. And no, I don't remember any of those particular details, but I did recently visit a nuclear power plant here in Nebraska and the amount of things they've thought through and the amount of planning that they've done and building hedges, so to speak, around almost every possible thing that could go wrong at a power plant. It's actually, it's laudable. It's amazing how thorough these folks have gone through and prepared for umpteen potential things
Starting point is 00:50:09 and it made me realize, like, oh, in software we just kind of fly by the seat of our pants, don't we? You know, of course they move way slower. I mean, that's the trade-off, right? Like everything moves super slow at a nuclear power plant. It has to because the consequences of disaster are so large. And maybe the fairytale we've told ourselves, and maybe it's gotten less and less true over time, is the consequence of software disasters isn't that big. We even had the phrases for it. I don't think we were pretending at all.
Starting point is 00:50:41 What was it? Move fast and break things. How many times was that said in Silicon Valley? Right. That got abused though. I mean, I think that at the time that began at Facebook, so that was a Facebook-born ideation. And I think it was a culture because they were in an innovation state. They were not in a, I mean, I guess they were becoming more and more widely deployed, but they were also a web service. So it wasn't like, well, it's installed and it's going to crash something. So I think there's scenarios. Now, obviously, it's a social network and there's a lot of people out there that are affected by,
Starting point is 00:51:15 you know, abuse, harm, et cetera, that can happen in social media, which I fully agree to. That's like, that's just how it kind of just sucks. And so the move fast and break things mantra, to a lot of people, is just like not a good thing, obviously. But to a technologist who's trying to innovate, it's a very admirable thing. Like, yeah, let's move fast and break things, because what happens is the iteration cycle to learning happens faster, right? This cycle you described with the sprinklers, well, it doesn't happen... is the danger zone, right? Places like CrowdStrike should not deploy this idea of move fast and break things. And maybe they did move fast
Starting point is 00:52:14 and break things. Well, it's interesting in that particular context because they are fighting adversaries who are also moving fast in order to break things. And so this goes back to the trade-offs that Robert was discussing. I mean, I can understand the ethos that said, we need a way to deploy to these machines outside of going through the entire process with Microsoft and the kernel stuff and the signing. We need a way to get our fixes out there before they attack all of our customers. That's what they're paying us for. And so I can see that trade-off of like, well, how can we do that? Well, let's develop a system where we're going to just side
Starting point is 00:52:48 load some rules and we'll try to make it innocuous. And we'll have, or I'm sure there's CI/CD and there's test suites. I mean, this is a publicly traded company. I'm sure they have infrastructure around the code they're rolling out. I'm giving them too much credit. I don't think I am. I would be shocked if we learned that they didn't, like this code went out when one person wrote it and nobody else looked at it. And I doubt that's the case. The anxiety of that code review, Jared.
Starting point is 00:53:13 Right. A little throwback. Yes. And so I can understand that push and pull. I mean, we have this even inside of like the app store where it takes forever in software terms to roll out an app update. But if you have your logic server-side
Starting point is 00:53:28 and you can push even web components into a view, you can actually update your app throughout the day. You can basically do what they're doing with CrowdStrike, with Falcon. Over-the-air updates are exactly what you're saying. Apple restricts them pretty heavily for their platform. But I like what you're saying, that CrowdStrike, this is an advantage. This is probably something they have bragged about in their sales cycle. Like, you don't
Starting point is 00:53:56 ever need to do an update of this agent it just will update itself this is how i understand how it works and when new vulnerabilities come out, we will cover you and protect you. That's a huge selling point. Why would you want to get rid of that? Come on, Adam. Why would you want to get rid of that? Don't take it away from us.
Starting point is 00:54:15 No, and I agree with that. I think, I don't think, so the question comes back to, what can we do to learn from this? I've heard, I think, was it, did you mention this in news, Jared? I'm like, I've read and listened to several things. eBPF. And how this could
Starting point is 00:54:32 be, this, the way that eBPF works, and I'm loosely, I mean, I'm steeped in it to some degree, but also very, like, beyond even novice. Like, no, I'm a green person when it comes to what eBPF is and how to describe it.
Starting point is 00:54:46 But from what I understand, this could be a different architecture that could prevent this. Well, what's interesting is that CrowdStrike is actually using eBPF in their Linux client, is what I read from Brendan Gregg's article about eBPF. And so they're very well aware of it. It's a way to do this that's safer. And it's in
Starting point is 00:55:06 development inside of Microsoft to provide eBPF support for Windows. This was you then. Thank you. I love Changelog News, by the way. Hey, y'all listen to this. changelog.com slash news. Subscribe today. If you're not, you're just missing out. You're missing out. So Brendan Gregg has this post, which was in Changelog News, called No More Blue Fridays, and it's his writing of why eBPF will be potentially another tool in our toolbox, right? In order to achieve what they're trying to achieve
Starting point is 00:55:36 without some of the dangers latent in the current Windows-based rollout. However, the in-development version of eBPF will not have all the features it has in Linux. And so could CrowdStrike immediately use it in order to replace their current rollout? Survey says probably not. It has to be much more full-featured
Starting point is 00:55:57 in order for that to be a thing they could start using as soon as it's shipped. But it's a direction. Well, what better way to get R&D budget to make that go faster than what just happened, right? Well, there you go. That was kind of Brendan Gregg's point at the end. And of course, I think he has a dog in the hunt.
Starting point is 00:56:16 He's very much invested in eBPF, which is open source and all that, but there's businesses built around it. But he said like, hey, here's your great moment. If you are paying for computer security software and you are a paid customer of these entities, you could push them to make this eBPF path happen faster and better
Starting point is 00:56:35 because you're their customer. So that was his call to action at the end of that post. And what would happen is that is at the kernel level. Do you know much about this, to describe what would happen if this hypothesized world existed, this future development, how it would work to prevent this kernel from
Starting point is 00:56:55 crashing the system or booting without it or being safer? No. Okay. Well, that's what I was thinking of. How can we, I guess, and I'm not a Windows developer, so by all means, just slap me in the face after this one, but I'm just thinking you have a crash dump whenever the blue screen of death comes up.
Starting point is 00:57:16 And the system knows probably what crashed it, at least if it's a driver in kernel mode, what's crashing it. Could you not just offer the user the option to boot sans that third-party, especially if it's third-party software, temporarily? Now, I get that this is cybersecurity software. What do you mean? Well, I'm just thinking if the kernel driver of CrowdStrike, a third-party, not a first-party, native operating system kernel driver,
Starting point is 00:57:42 is crashing the system. So by moniker, it's a third party. Could you not say, well, this system knows that this third party driver is crashing the system. Do you want to boot without it? And maybe that's what safe mode does, but I mean, why couldn't that be a non-safe mode thing?
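The idea being floated here, counting crashes blamed on a third-party driver and then offering to boot without it, can be sketched as a small decision rule. This is only an illustration of the proposal in the conversation, not how Windows or any real boot loader behaves: the names Driver, BootState, CRASH_THRESHOLD and drivers_to_offer_skipping are all hypothetical, and the sketch simply assumes the crash history can be saved somewhere that is readable early in boot, which is the hard part.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical policy value: consecutive crashes before a skip is offered.
CRASH_THRESHOLD = 2


@dataclass
class Driver:
    name: str
    first_party: bool  # True for drivers that ship with the operating system


@dataclass
class BootState:
    """Crash history, assumed (for this sketch) to survive reboots."""
    consecutive_crashes: Dict[str, int] = field(default_factory=dict)

    def record_crash(self, driver_name: str) -> None:
        # Called when a crash dump blames a specific driver.
        count = self.consecutive_crashes.get(driver_name, 0)
        self.consecutive_crashes[driver_name] = count + 1

    def record_clean_boot(self) -> None:
        # A successful boot clears the history.
        self.consecutive_crashes.clear()


def drivers_to_offer_skipping(drivers: List[Driver], state: BootState) -> List[str]:
    """Names of third-party drivers that crashed often enough to offer 'boot without it'."""
    return [
        d.name
        for d in drivers
        if not d.first_party
        and state.consecutive_crashes.get(d.name, 0) >= CRASH_THRESHOLD
    ]
```

In this sketch a first-party driver is never offered for skipping, no matter how often it is blamed, and skipping a security agent would still be a policy decision for whoever administers the machine.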
Starting point is 00:57:57 I don't know. Because maybe those systems could have just been booted by everyday people. It's about UX and user friendliness. Now, I don't know if that's secure. Robert's shaking his head a little bit. Are you saying the system knows that the system is crashing? It's a layer on a layer.
Starting point is 00:58:12 You're throwing another layer that doesn't currently exist in there? Is that what you're saying, Robert? I think. I mean, I'm not even going to try to pretend I know how these kernel I'm going to call it an add-on. That's how inexperienced I am with it. Like, plugins.
Starting point is 00:58:28 I don't want to pretend to know. But I think that what Adam is saying, I think the challenge with that is just more complexity. And is the risk worth the reward? And can the system... Think about the amount of trial and error you would have to go through
Starting point is 00:58:44 for that to work really well. And where does the operating system even store that knowledge that that plugin is borked? You're at the point of it booting. That's my point. It's crashing currently. You might not even have file system access yet. That's how early in the ones and zeros we are. So I think that's the challenge is you got to put it somewhere. So let's zoom back out one layer then. My thought is not literally
Starting point is 00:59:15 how we deploy the fix. Literally, this is how we solve it. But from a user experience standpoint, the reason why the outage perpetuated to its length was because everyday people could not solve their own problem with the system. And I'm just suggesting, is there a path where you can provide everyday users of their computer some version of bypassing this crash. That's all. And I don't know that answer. I'm just hypothesizing that the reason why it perpetuated
Starting point is 00:59:50 was because people who, like IT basically, people smarter than the end user from a technical level, in most cases standpoint, could not solve, they had to come in and be deployed to literally open up the laptop
Starting point is 01:00:03 or could you imagine trucking in a workstation? Like, not everybody uses laptops these days. Some people use workstations. But like, you had to take the thing in to the people. They had to plug a monitor into it and a keyboard into it, and somebody else had to touch it. I'm just thinking, is there another way where the end user could have done more of this in line too, rather than simply waiting? I don't think Nick Burns wants the end user to do it. No? Well, I remember the days of Windows
Starting point is 01:00:31 where it was remote PCs and the only thing that that station was responsible for was basically connecting to something else that was doing the compute. Maybe that comes back, right? Maybe that's a world that... Client-side computing was thin clients. That was Citrix, and that's my roots, man.
Starting point is 01:00:51 I grew up in IT in the early 2000s, worked at an IT company that deployed Citrix and VMware intensely. We had our own co-location system at a data center. You were talking about the power plant, Jared. Data centers are similarly, if not equally, thought through. Not equally. Not equally.
Starting point is 01:01:10 Yeah, I'm going to say maybe not all the way. Nuclear power plants are so regulated. Well, that's why I said similarly, if not equally. There's a version of the thoughtfulness, let's just say. I'm going to say I hope they're not. I hope that nuclear power plants have more thought. Okay, I would give you that. I came out feeling much safer about nuclear power
Starting point is 01:01:29 through this tour because of how stinking serious they are about safety. But anyways. Yeah. Well, just the point was that I agree, Robert. Maybe thin clients or remote. I mean, but. What's old is new again.
Starting point is 01:01:40 Maybe, you know, I think. Well, you know what the web is? Jared was talking about that. It's like a widely deployed operating system. Most of us are on web apps these days anyways. You know, most of what we do is through the browser. Like right now, we're having this discussion
Starting point is 01:01:52 through the browser. Video, audio, recorded locally, streamed back up. In most cases, doesn't fail. Really good software, but it's web software. We have to use a special browser, which is a whole different fight.
Starting point is 01:02:07 Web software goes down. I'm just not sure exactly what we're solving with this moving the furniture around. So what I had in my head is, I saw a picture through all the news cycles of this CrowdStrike outage was, it was actually, it was a gate agent's computer. It was at the gate where you board the plane and it had the blue screen of death.
Starting point is 01:02:26 And in that situation, does that computer need a CrowdStrike kernel agent running on it? Maybe it does, maybe it doesn't. I don't know. But I think where I'm going with this is, does that computer just need a screen, a mouse, and a keyboard that's hooked up to something else down the hallway? You know, that's one station that's powering 20 gates and it's much easier. It's smaller surface area. You know, I think we're getting to that point. Like networks are getting fast enough to do that type of thing. Maybe it's too far. I'm not sure, honestly. I mean, some companies have tried to do this with like gaming, for example. I don't remember if, you know, it all failed so far. It failed so fast. Yeah. But maybe that was too far, right? Like, that's hard to do. That's like, you need super low latency video feeds, right? And it was Google. It was Google trying to do
Starting point is 01:03:18 it. It wasn't some fly-by-night. I mean, they have the resources. If anybody could accomplish it, you'd think Google. Yeah, and Microsoft Xbox is trying to do it too. I forget the name. Yeah, yeah, true. But maybe it's like that type of world, right? Where it's just a keyboard, a mouse, and a screen, and it's hooked up somewhere else. Maybe that's where we go to. You reduce the surface area, therefore you reduce the amount of potential outage. I think in this case that hypothesis has merit only because we know what we know. It's not because we know what we knew or know what we know prior to, and that's the plan. Because I think even in that scenario, you
Starting point is 01:03:56 have now a single machine dependency of many dependencies. And now it's like, well, when that one machine is down, it's not just one person. The outage affects many because of the design of, you know, dependency. I am pro thin client, though. I'm pro what Citrix did back in those days. It was a very cool thing. I mean.
Starting point is 01:04:18 I hated it. Well, so for certain workers, for certain tasks, it was perfect. I hated it too, Jerod, because I... Why were you for it then? Well, in my scenario, I was for it for everybody else though. Oh, for everybody else. Oh, yeah. Oh, I'm for it for everybody else, yeah.
Starting point is 01:04:36 Yeah, I think it's cool tech. The ergonomics of it were terrible. Yes. Yeah. I agree, the tech was cool. And for certain scenarios, I helped out. I ran network administration for a company that did commodity trading. And so they had machines in silos, you know, grain silos. And those places are dirty, nasty, corn, chaff, etc. Like, it's not the place where you're going to have a server farm. Or you wouldn't even want a PC, because eventually that tower is going to get all kinds of stuff into it. It's going to break down. And so in those cases, like, the thinnest client possible
Starting point is 01:05:09 with a Citrix connection was the answer. Made tons of sense. Yeah. But in many other use cases, you got your employees sitting in their office and they're Citrixing into somewhere else, you know, to run with this latency, and it was slow and they didn't have access to local resources. Okay. In those contexts, I was like, this is ridiculous. I have a beefy computer
Starting point is 01:05:30 sitting here. It's connected to a remote machine. The grain silo didn't have a good internet connection. Well, that was another problem. We had to create,
Starting point is 01:05:36 a lot of times we had to create internet connection for them in order for them to actually connect back to Citrix. And so that was,
Starting point is 01:05:43 I mean, it was, you're trying to do remote computing in a grain silo. It's not going to be easy no matter how you do it. Right. What's up, friends? I'm here with Feross Aboukhadijeh, founder and CEO of Socket. Socket helps to protect the best engineering teams out there with their developer-first security platform.
Starting point is 01:05:59 And so Feross, speaking of developer first, Socket is developer first. What does that mean? What do you mean by being developer first? Most security software is typically sold to executives. So it tends to suck to actually use it. So the company, the vendor goes in and makes a sale. The executive thinks it looks good, but they don't actually care at all what the developer experience is of the tool.
Starting point is 01:06:20 So I think that's where I would start. The first problem with security tools is they're sold to executives. In the best case, those tools get purchased and they just sit around on the shelf bothering nobody and protecting nobody. But in the worst case, they get rolled out and they prevent developers from getting things done. And they just get all up in your face with alerts and pointless noise that isn't actionable. And if you actually go and fix those alerts, you're not even improving security because a lot of the time those vulnerabilities are super low impact. That's like the dirty secret of vulnerabilities is most of them are low impact. They're either in dev dependencies, so they're never going to run in production or they're really difficult to exploit. Or if you exploit them, there's nothing really there. It's like a, you know, a denial of service in some random component. And in reality, like that's just such a low risk in terms of just your priorities of things you need to work on as a developer. I would actually say probably 90 or 95 percent of the vulnerability alerts that developers are used to seeing from other tools are just completely pointless. They're just fake work.
Starting point is 01:07:14 And fixing them doesn't even meaningfully improve security at all. There you have it. Protect yourself, your team, and your software from the threats that really matter. Don't do fake work. Use Socket. Socket.dev. Book a demo. Install the GitHub app.
Starting point is 01:07:28 Install the Socket CLI. Whatever it takes to take the next step, do it. Go to Socket.dev. Again, Socket.dev. Well, Intel Innovation 2024 Accelerate the Future is right around the corner. It takes place September 24th and 25th in San Jose, California. This event is all about you, the developer, the community, and the critical role you play in tackling the toughest challenges across the industry.
Starting point is 01:07:57 Ignite your passion for AI and beyond, grow your skills to maximize your impact, and network with your peers as they unleash the next wave of advancements in technology. Understand the emerging innovation and trends in dev tools, languages, frameworks, and technologies in AI and beyond. Join on-site hands-on labs, workshops, meetups, and hackathons to collaborate and solve real problems in real time. Collab with experts, learn and have fun, engage in interactive sessions, connect, grow your network, gain a unique idea and perspective, and build lasting networks. And of course, have fun. You'll hear from leading experts in the industry, technologists, startup entrepreneurs, and fellow developers, along with Intel leadership CEO Pat Gelsinger and CTO Greg Lavender as they take you through the latest advancements in technology.
Starting point is 01:08:52 Don't miss out on the chance to be at the forefront of innovation. Take advantage of their early bird pricing from now until August 2nd. Register using the link in the show notes, or to learn more, go to intel.com/innovation. When you're at scale, like CrowdStrike was, and you deploy bad code, regardless of which theory you go with, bad code, done on purpose, rogue whatever. I mean, there's people saying like this was planned.
Starting point is 01:09:26 I haven't read any of that stuff, but I'm sure it's out there. Well, you know, anytime something like this happens at a scale like this, you got to wonder, like we live in a simulation lately. Like there is strange things happening every single day that has been basically unprecedented every single day. So like the new precedent is unprecedented, you know? Right. And I just, I don't want to hypothesize here
Starting point is 01:09:50 because that's not what we're trying to do or not what I'm trying to do. But when you're at scale like this, it's obviously an attack surface of some sort, whether it's bad code, an incident, or just simply, you know, a bad day, a bad Friday, a bad weekend. And how can we give CrowdStrike the ability to do what they want to do and have the sales pitch they want to have without having the opportunity for outage like this?
Starting point is 01:10:19 And then all the others, they're going to follow in their footsteps. Who else will have software that will be at scale and be an attack surface, whether it's bad code, planned, intended, rogue, whatever? They're all similar scenarios, just a matter of how the incident
Starting point is 01:10:35 percolates. I mean, there's just the surface area of which software can be impacted now, either just through sheer outage or security, is staggering. I mean, there was, I don't know, maybe a month and a half ago, two months ago, there was, it was newsworthy enough for the New York Times. I saw the word Postgres on the front page of the New York Times. I was like, what is this? And you go and read it, and there, it all boils
Starting point is 01:11:07 down to there was a state actor that gained the trust of the core team for Postgres, and they started submitting patches that fixed real things, and then they submitted something that was very subtle that was caught on accident by another engineer years later. And they eventually figured it out. They were like, holy crap, this person just gained our trust by submitting real stuff and then snuck something in. And how do you defend that? You just, you just can't. I don't think you can. And that sounds a lot like the XZ thing. Is this in addition to that? I think that's what I'm talking about. Yeah, I couldn't remember the exact name of it. But yeah. So I don't remember the Postgres part, but certainly
Starting point is 01:11:55 this XZ backdoor was placed by a state actor. I think it was someone working on Postgres who was like... and then they got, like, down to that level. Maybe that's how I misremembered it. Fair enough. Well, XZ is a dependency of many software packages and was close to being actually distributed via Apt and other package registries prior to
Starting point is 01:12:18 it getting found out on accident by a developer. So yeah, crazy times for sure. Definitely not tinfoil hat, Adam, to say, you know, was this, to ask the question of, was this mere incompetence or was this actually an attack?
Starting point is 01:12:32 Because attacks happen and they are happening and they will continue to. And so those questions do have to be asked. I think in this particular case, I jumped immediately to incompetence, you know, Occam's razor style, because I know how complex software systems are to roll out updates.
Starting point is 01:12:48 You know, I was like, oh gosh, somebody had a really bad day, but that could be a wrong conclusion to jump to. Well, I think in the case that you're talking about, Robert, with Postgres, if this is accurate, is code analysis. You have to analyze, especially in open source, but when it's closed source like CrowdStrike and a definition update, all you can do is rely upon that team, that company, to be mature enough to have protections in place.
Starting point is 01:13:16 When it's proprietary closed source, there's nothing you can do from a scale point to analyze the code. From a different route with open source, you could do a lot of things. You could pay attention to where the patches are coming from. You know, I guess in this case here, if the patch was, you know, hey, Robert, here's the patch.
Starting point is 01:13:35 I'm Adam. Let's just say it's you as the core committer and I'm the friend who's trying to be friendly. I've solved this problem. Here you go, Robert, and you just take my code and maybe you actually deploy it to Postgres. So it's coming in signed. Maybe that's an example where you really can't analyze very well. But if you had to say, Robert is signing this commit, but it's being the location or the source of the commit is from an outside source helping out because it's open source,
Starting point is 01:14:05 then you could at least have a waypoint to begin to track if you're doing code analysis. I think that's the area where I'm really confident and looking forward to more and more being done. Because when you can analyze the Git repository and the graph of things happening in a code base, there's a lot you can pull out when it's like, okay, that's a smell.
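As a rough illustration of the kind of repository analysis being described here, the sketch below flags commits whose author differs from the committer, or whose author has not been seen before. Git does record author and committer separately, but everything else is an assumption for illustration: the `Commit` record, the `flag_commits` name, and the two heuristics are hypothetical, and a flag is only a prompt for a closer look, not evidence of anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified view of one commit's metadata."""
    sha: str
    author: str      # who wrote the change
    committer: str   # who applied it to the repository


def flag_commits(commits, known_authors):
    """Return (sha, reasons) pairs for commits worth a closer look.

    `commits` is expected oldest first; `known_authors` is the set of
    author names already trusted before this batch.
    """
    seen = set(known_authors)
    flagged = []
    for commit in commits:
        reasons = []
        # The change was written by someone other than the person who landed it.
        if commit.author != commit.committer:
            reasons.append("author differs from committer")
        # This is the first time this author shows up in the history.
        if commit.author not in seen:
            reasons.append("first commit from this author")
            seen.add(commit.author)
        if reasons:
            flagged.append((commit.sha, reasons))
    return flagged
```

In the scenario from the conversation, a patch written by Adam and landed by Robert would be flagged on both counts the first time, and only on the author/committer mismatch afterwards.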
Starting point is 01:14:28 You got a brand new committer. You got somebody being nurtured or whatever you want to call it to kind of get their trust over multiple years even. There's layers of anomaly that can be identified because of the way open source works if you do specific code analysis. So that's where I'm hopeful. I'm hopeful that we can keep open source going the way it is for
Starting point is 01:14:54 longer. I do think that some of these risks that are coming up with state actors infiltrating through years of building trust and accidental attack vectors coming through, like, over time, I think that people are going to start to get skeptical. Yeah. And that's going to be a tough moment. We're going to have to kind of start thinking about that. I'm starting to hear more and
Starting point is 01:15:16 more about people, like, don't want to use third-party libraries for common things just because of the risk. For example, attacking a JavaScript NPM package that's widely used. That does a pretty simple thing. Candidly, it's less risky to just do it yourself sometimes. And that's a calculus that companies are going to have to start thinking about. Yes. I mean, I think every developer should make that calculation every
Starting point is 01:15:46 time they're going to pull in a dependency. And I'm not saying don't pull the dependency in, but I think you do have to think through that. I think we're learning that, and hopefully our collective immune system will react. I do think that these state actors being outed every once in a while at least will boost our immune system as open source maintainers to be like, let's kind of be a little more leery of the contributors who are coming around. And like, just, you know, that whole kumbaya, open, open, open, we're all friends worldwide thing that was going on when open source began is, like, it's gone. It's just not the same world anymore. And so maybe we just won't be fooled next time, hopefully,
Starting point is 01:16:28 by somebody who's trying to butter us up in order to take advantage of us. Do you think there's a way to label software at scale, like an XZ? If you're a contributor to XZ, do you know how much is deployed and you understand how crucial your core role is to that software?
Starting point is 01:16:48 Yes and no. Yes and no, right? Probably hard to feel the actual gravity of it. Right. Right. I'm just wondering, is there a way to, and I'm literally asking the question without having put any thought into it. So if it's naive, you know, slap me around if you have to. As we do.
Starting point is 01:17:02 Yeah. I'm just wondering, is there a way to elevate certain software without maybe even by analysis to understand its deployment or its dependency levelness, I suppose? Its scale. Like I'm sure CrowdStrike knew how at scale they were. This was not unknown to them. So this is
Starting point is 01:17:26 not an example. But XZ and the folks behind that who are being, you know, groomed, for lack of better terms, over a year or more, a very long, patient amount of time, do they understand how crucial the software is that they're in control of so that they can have that position you just said? I'm just thinking, is there a way to label something, hey, you're a scaled software, you're widely deployed, and there's some way to elevate them to a different level, at least by label, so that there's an awareness
Starting point is 01:18:16 that if there's a malicious attack on that code base, it has effects. I feel like GitHub could own that. Honestly. They know how many times a repository is committed. They know how many times it's even looked at, just page views
Starting point is 01:18:31 in general. They know the number of stars on it. And maybe it's not GitHub. Maybe it's some other program. Maybe it's government sponsored. That goes to these maintainers and says, just FYI, you're on our list. You just made the list.
Starting point is 01:18:52 And in a way, it's like, congratulations, you've built such valuable software. It's now a national security threat. But I hear what you're saying. I think it's hard. I think it's hard because it takes the steam out of it. It takes the altruism out of it sometimes too for some people that just want to do a good thing. When the barrier is high, then people won't do it.
Starting point is 01:19:15 And I think that's challenging. I think the maintainers of scaled software know. I think that they're just wildly under-resourced and exhausted and can't possibly sometimes care enough anymore because they've cared so much for so long, for so little. So I think for the rest of us, I did not know how big XZ was in terms of its dependency graph, the other way around, how many dependency graphs it was in, which was many, but I'm sure that the author of XZ has an idea. Like, that's why I said yes and no.
Starting point is 01:19:51 He may not know exactly how big his software is, but at a certain point when your package is deployed across all these distributions and stuff, yeah, you understand that like, wow, this thing is really reaching lots of places. And so I think there's some of that gravity there. But for the rest of us, that might be useful to have that list of softwares that are considered national security importance or whatever it is.
Starting point is 01:20:20 They aren't the threat, but they are of potential threat because of their situation. I think one example of a developer who just built an open source something and took it down, not realizing the true scale of this thing, was LeftPad. Oh yeah. 2016. That one was wild. That was, so many packages couldn't be installed and deploys, like, stopped for hours because of that. And it was just some, I forget the exact context, but I think it was like some dispute, and out of, he was like, I'm gonna take down the package you're using. Wasn't it political? Yeah, I don't remember exactly. I don't think LeftPad was political. LeftPad was a long time ago.
Starting point is 01:21:06 It was a political one. You just deleted it off of NPM package registry and then chaos ensued. I think LeftPad might have been the one where they had another package called Kik or SideKik and another company, a company, not another company. This might not be LeftPad either, but this definitely happened. There's a company, a startup
Starting point is 01:21:26 called Kik, K-I-K, I believe. And there's a package called Kik, I think owned by the LeftPad owner, if it's coming back to me. And the Kik company contacted NPM and wanted the name, but didn't have the package name. And I think NPM granted them
Starting point is 01:21:42 access to the Kik package name, basically kicking it off the LeftPad owner. And then they got mad and just pulled LeftPad and all their stuff. I think they pulled all their stuff. I'm pretty sure that's LeftPad. That may be a different one because there's been so many at this point,
Starting point is 01:21:53 but that definitely happened. I have the, there's a Wikipedia page for it. Is there? NPM LeftPad incident that I just found. And yeah, you're right on the money with what you just said. But you know what's kind of crazy about that? And it kind of goes back to what I was saying
Starting point is 01:22:08 about own your software a little bit more. LeftPad was not a thing that needed to go out over a network and download a package and pull it down. Any engineer should be able to write what LeftPad did. Absolutely. Or copy-paste the function.
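To make that point concrete, here is a minimal sketch of what a left-padding helper does. It is written in Python rather than JavaScript, the name `left_pad` and its parameters are assumptions for illustration, and it is not the original package's code.

```python
def left_pad(text: str, length: int, fill: str = " ") -> str:
    """Pad `text` on the left with `fill` until it is `length` characters long."""
    if len(fill) != 1:
        raise ValueError("fill must be exactly one character")
    missing = length - len(text)
    # Already long enough: return the text unchanged rather than truncating it.
    if missing <= 0:
        return text
    return fill * missing + text
```

For example, `left_pad("5", 3, "0")` gives `"005"`. Python's built-in `str.rjust` covers the same ground, which is part of the argument: a helper this small is cheap to own or to copy into your own repository.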
Starting point is 01:22:24 It was like a... Or that, yeah. Because I mean, you can use somebody else's code Absolutely. Or copy paste the function. It was like a... Or that, yeah. Because I mean, you can use somebody else's code with a little copy paste and remove that dependency. And because, not because you can't trust the author,
Starting point is 01:22:35 but because we cannot trust the network. Right? That's the problem with NPM. We can trust the authors in most cases, but we cannot trust the network into the future. You can maybe trust it today, but you cannot trust the network tomorrow. And so You can maybe trust it today, but you cannot trust the network
Starting point is 01:22:46 tomorrow. And so, copy paste that sucker. Vendor it. I mean, that's what we used to call it in the real world, vendor it. Which is to pull it into your repo, check it in, and leave it there. I remember doing that. Did you see that one? It was a couple weeks ago that a domain expired that was hosting
Starting point is 01:23:02 a JavaScript package. Polyfill. Someone else bought the domain, put something not good there. Same domain path and all these websites that were resolving that domain to the new source
Starting point is 01:23:17 were impacted. It was like 100,000 websites. You can't trust the network. Yeah, so that's a good way. You can't trust the network. Especially over time. Because that's a good way. You can't trust the network. I think it's a good way. Especially over time. Yeah. Because that's what we think of today,
Starting point is 01:23:27 but over time the network changes. In ways that we wouldn't expect. Like nobody expected polyfill.io to change ownership. Yeah. Or CDN, whatever the CDN that was hosting polyfill.io. Right. We put some stuff through proxy, basically. And that kind of does it.
Starting point is 01:23:44 You proxy yourselves, the gems and some stuff, and that way it's kind of an "if it's there, we trust it" kind of thing, right? You know, if you try to pull something else in, a bundle install, yarn install, whatever it is, go get, it goes through
Starting point is 01:24:00 there, and if it's not there, then it kind of triggers a, well, why are you trying to get something that isn't in this? You know, it's not blessed yet. It's a proxy that you guys run? Uh, yes. Is this like an Artifactory kind of thing where you pull? Yeah, some other, I forget the exact tech, if I'm honest. But yeah, similar to Artifactory, or JFrog Artifactory. Yeah. Yeah, it's a great idea. Just get yourself layers in between you and the unknown. I mean, that's one of the wise practices
Starting point is 01:24:27 for sure. Well, that's like the, I guess, rich man's version or rich person's version of vendoring. It's like the same idea except for it's
Starting point is 01:24:35 Yeah, you vendor it to a server. It's vendoring. I mean, this has been the tale as old as time, basically. Ruby had it first. Well, like I said,
Starting point is 01:24:43 what's old is new again. Yeah, exactly. We're going to go back to all these ideas in some way, shape, or form, I think. We're going back to thin clients, apparently. So, I mean, I think even that too, you have to have an incident like this to have a discussion like this that says these older ideas that were probably pretty good, you know, maybe at the time it was like less modern to do it. Now it's more modern. So maybe there's, but I suppose to your point, Jerod, with your meme, like I deployed software today,
Starting point is 01:25:07 so it's modern, right? Like when you have a meme out there somewhere, there's like. Oh yeah. Just mostly a gripe. Like people always advertise their software as modern,
Starting point is 01:25:15 which just literally means that it's just a newer thing. You know, like it's not a feature. It's just that you started coding it six months ago. Right. You know. At some point,
Starting point is 01:25:23 someone's going to start bragging about how much their software hasn't changed. Yeah, I think vintage software should make a move. This is classic, this is vintage. When I was a young gun engineer and I heard about these banks using COBOL still, I was like, ah, what losers. And now I'm like, hey, whatever.
Starting point is 01:25:44 If it works, I can look at my balances and I've never had an issue and I can always charge my card. You do you. Maybe calcified software has a purpose in the world where it just gets rarely touched and we're just happy about that. I'm leaning that way more and more.
Starting point is 01:25:59 Do we need to keep changing the software? I don't know. That's not really good for your business though, Robert. I mean, if you advocate for that. Robert's out there. More incidents. We need more incidents. My investors, my board hears that.
Starting point is 01:26:11 They'll be like, what are you doing? What are you saying, Robert? Stop right now. Well, I think even if you have unchanged software, there's still bound to be incidents of some sort. I mean, there's still going to be... No one's going to listen to you, Robert. No one's going to do that, right?
Starting point is 01:26:25 I recommend that. Yeah. Well, this has been fun digging into the details, I think. You know, it's fun to speculate out. You know, I do want to, again, mention I love Dave Plummer and his channel on YouTube. He's a great resource. I always appreciate what he shares. I probably listened to his video twice, just making sure I kind of understood some of the mechanics behind it, because I really want to understand, like, to what degree does this software actually operate on Windows, you know, how this incident propagated. You know, we don't know if it was really bad code
Starting point is 01:27:05 or if it was sabotage or if it was some sort of plan. That's all speculation that we're not trying to really go through here. But sort of like, hey, if you're out there and you've been affected by this or you're just curious, you know, go out there and do your own investigations. Pay attention to what's happening out there. And I guess we can look forward to George Kurtz, the CEO, current CEO of CrowdStrike, who was there at the helm during this incident
Starting point is 01:27:28 to stand before Congress and explain exactly what happened. And maybe then we'll know. Talking about security theater. Right. Until then, all we can do is speculate what may have happened. We can, you know, use the, they're not called dumps. What are they called? Are they called dumps whenever it's a
Starting point is 01:27:43 kernel panic? Well, you dump the stack. Yeah. It's not a stack trace, because that's like an application kind of thing. Kernel panic. Yeah, exactly. You can examine that. And there's lots of folks,
Starting point is 01:27:55 there was a famous tweet out there that made the rounds explaining that, you know, this one file was updated, and while it should have had the needed definition in there, instead it just contained zeros because of a null pointer. There's all these things like why this actually happened. But I think in the end, we can just say at scale, software can have massive effects. And we got to do something about that.
Starting point is 01:28:17 It's a good thing to have scale software, but at the same time, we have to do updates responsibly. Or in this scenario where you have a kernel-level driver, how do you do what CrowdStrike wants to do with Falcon but not bypass the security systems? That's the real question here, specifically for this incident. I think for others, it's just love your maintainers if it's open source. If it's not open source, drag them through Congress and make them explain it. You know, and slap them around a little bit. You know, otherwise, just do what you can to stay safe.
Starting point is 01:28:54 You know, scrutinize your dependencies, your third parties, etc. And that's about it for me. And run Linux on your desktop. I mean, that's the way. This is the way. Write Rust, run Linux, and you'll be good to go. And then let on your desktop. I mean, that's the way. This is the way. Write Rust, run Linux, and you'll be good to go. And then let all of us know about it.
Starting point is 01:29:09 Once they figure out their audio drivers to come on this show, it'll be great to hear their experience. Well, every time we have a Linux user, we're always happy, obviously, and then sad. Because we expect to have some version of issue because of drivers.
Starting point is 01:29:26 It's almost unanimous. Almost unanimous. Well, thanks so much for having me. This was a blast. I think it was a fun topic to talk about and super interesting. For sure. Thanks for joining us. Yeah, Robert.
Starting point is 01:29:35 It's been fun. Bye, friends. Bye, Robert. Well, friends, here we are again at the end of a busy and interesting week in the software world, which more and more is the whole world. Do you have thoughts? Do you have opinions? I know you do.
Starting point is 01:29:56 We would love to hear them. Sound off in the comments. Link in the show notes. Oh, and stick around, Changelog++ members. This is yet another extended episode. We love doing these for our most loyal supporters. Oh, and by the way, if you are a Changelog++ member, maybe sign in to changelog.com using your plus plus email address and see if you see anything new on your homepage. I won't say more than that for now, but we'll talk details soon enough,
Starting point is 01:30:25 probably on the next Kaizen. Okay, quick thanks again to our partners at Fly.io, to Breakmaster Cylinder, to Sentry (use code CHANGELOG), and to you, of course, for listening along. Seriously, we appreciate it. Next week on the Changelog, news on Monday, Joseph Jacks from OSS Capital on Wednesday, and Adam is flying solo on Friday, but he has a very special guest, the author of his favorite book series, The Bobiverse. Yes, Dennis E. Taylor joins the show. Have a great weekend. Leave us a five-star review if you dig our work,
Starting point is 01:30:59 and let's talk again real soon. So during the main show, I did not ask you about this, nor did we directly reference it, but it was a reference point for me. You wrote something the same day as this incident, I think, is July 19th, 2024. Beyond the headlines, the unsung art of Software Outage Management And rather than It's better
