The Changelog: Software Development, Open Source - The BSOD CrowdStrikes back (Friends)
Episode Date: July 26, 2024
Robert Ross joins us in CrowdStrike's wake to dissect the largest outage in the history of information technology... and what it means for the future of the (software) world.
Transcript
Welcome to Changelog and Friends, a weekly talk show about Citrix thin clients.
Big thanks to our partners at Fly.io, the home of changelog.com. Over 3 million
apps have launched on Fly. You can too. Learn more at Fly.io. Okay, let's talk.
Hey friends, I'm here with a new friend of mine, Shane Harter, the founder of Cronitor.
Check him out, cronitor.io.
It lets you keep tabs on your cron jobs: Linux, Kubernetes, Apache Airflow, Sidekiq, and more.
With over 12 open source integrations, you can instrument all your jobs no matter where you're running them.
So, Shane, for me, I'm using Linux, and Linux cron jobs are by far the most popular, in my opinion, right? But there are so many other cron-like things: Kubernetes, Airflow, Sidekiq. Help me understand the full spectrum of background jobs and cron jobs beyond Linux cron.
Yeah, Linux cron jobs are massively popular. They are still, 40 years later,
the tool that most developers will go to first when they need to start scheduling something in
the background. But when you get into a team environment or an enterprise environment,
there are a lot of other constraints at play, and there are other considerations. Whether it's simply redundancy that you're not going to get from crontab itself, or more complex orchestration stories like you can get with Airflow, we see companies eventually outgrowing cron.
And what we wanted to be sure of is that, first of all, migrating from cron to anything else is a complicated thing. So we wanted to give you tools to help you monitor that transition and make sure your jobs keep working as you do that migration. And then second, we wanted to give you a way to unify all these different job platforms, because seldom do you have just platform A and you migrate cleanly to platform B. Probably, in a real-world scenario, you're running both side by side for a while, and you don't want a different monitoring tool or a different monitoring strategy for every platform that you deploy. So our goal is: anywhere you're
running a background job, you can use Cronitor. The number one way that we ensured that was possible is by having a really simple API that you can use with an HTTP request yourself, which is pretty abnormal for monitoring tools. That works in a lot of cases. But to make it even easier, for every popular job platform out there, Linux cron jobs, Kubernetes CronJobs, Windows, Sidekiq, Airflow, you name it, we have a Cronitor SDK that you can install that will automatically configure your monitoring, run in the background, and sync all your jobs with Cronitor the same way your Linux cron jobs will be synced.
Okay, friends, join more than 50,000 developers using Cronitor.
I'm one of them.
You can start for free and they have a pay-as-you-grow pricing plan.
Setup is too easy with more than 20 SDKs.
Check them out at cronitor.io.
That's C-R-O-N-I-T-O-R dot I-O.
Again, cronitor.io.
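To make the monitoring pattern Shane describes concrete, here's a minimal sketch of wrapping a scheduled job with HTTP pings. The ping URL, monitor key, and job script are placeholders invented for illustration, not Cronitor's documented API; the point is just the run/complete/fail telemetry shape:

```python
import subprocess
import urllib.request

# Placeholder endpoint and key -- an assumption for illustration,
# not Cronitor's actual API.
PING_URL = "https://monitoring.example.com/ping/MONITOR_KEY"

def ping(state: str) -> None:
    # Report job state; monitoring must never break the job itself.
    try:
        urllib.request.urlopen(f"{PING_URL}?state={state}", timeout=5)
    except OSError:
        pass

def run_job() -> None:
    ping("run")
    try:
        # The actual scheduled work a crontab entry would invoke,
        # e.g.: 0 3 * * * /usr/bin/python3 /opt/jobs/wrapper.py
        subprocess.run(["/opt/jobs/nightly_backup.sh"], check=True)
    except subprocess.CalledProcessError:
        ping("fail")
        raise
    else:
        ping("complete")

if __name__ == "__main__":
    run_job()
```

The same wrapper works whether the scheduler is crontab, a Kubernetes CronJob, or Airflow, which is the unification point being made above.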
Well, friends, we're here to discuss an outage, a disaster that made history.
And we have a good friend of ours here, Robert Ross, the founder and CEO of FireHarton, to help us dig into what exactly happened and maybe more pertinently how to prevent incidents at large or just deal with them.
What do you think, Robert?
Well, I'll do my best without wearing a monocle and thinking about exactly how this went down.
But yeah, I've read every news source about it,
I think, at this point.
I think everyone's heard about it, so excited to dive in.
What are you guys talking about?
I'm not even sure what we're referring to.
Yeah, right, Jared.
Did something happen?
You know what I kept thinking every time I read CrowdStrike?
I kept thinking of AC/DC's Thunderstruck.
I couldn't quite pull the pun across, because it's CrowdStrike, Thunderstruck...
but that song has been playing in my head, probably before this happened, but it just happened to align.
I don't know. I'm an AC/DC fan.
What can I say? The developer may have been listening to that when they wrote the code.
They might have been. That could be why we're here, potentially.
I like to code to some AC/DC.
Yeah, for sure.
Especially that song.
That'll pump you up, man.
For sure.
I code faster when that type of music is playing.
That's for sure.
I'm sure most folks are, to some degree,
primed on what happened.
But who wants to nominate themselves to explain at least a primer of what happened?
I think you did it pretty well in News, Jared, but you also covered some other sides of it too.
But what do you think?
Do you want to handle it or do you want me to handle it?
Well, there was a giant outage on Friday due to CrowdStrike pushing a bad update to a billion machines.
I'm not sure the exact number, but basically every Windows-based company,
organization around the world
was affected probably somehow.
Many things were down.
The banking industry got hit hard.
Hospitals got hit hard.
Airlines got hit hard,
except for Southwest,
which I discussed in news.
The reasoning, by the way,
quick update on that,
that I put in News, was that they are allegedly still running
old versions of Windows: 95, 3.1.
Could be true.
Might not be true.
Those are actually rumors.
I thought that was a joke when I saw that.
Maybe that's true, actually.
It kind of was a...
It duped Jared.
It got him.
It might be fake news.
I updated our Changelog newsletter to make sure that it's accurate now
because I thought it was funny, too, which is why I put it in there.
And it's true that Southwest was unaffected.
And of course, Southwest famously was down, was it two years ago?
For 10 days.
Yeah.
Because they couldn't.
The holiday outage.
Yeah, the holiday outage.
And back then, the reasoning was that they were on really old
versions of Windows and they couldn't do stuff.
And so I think those two stories combined to say perhaps their old versions of Windows have actually saved them this time.
But allegedly, not necessarily the case.
But funny either way.
Yeah, man.
I guess the way I would summarize it is the blue screen of death made an epic comeback and took over the world.
Total world domination last week.
Wouldn't you say that this affected
CrowdStrike customers?
Not just simply
Windows users.
Yeah, but I guess, here's what's weird about it.
I had never even heard of CrowdStrike,
but it sounds like who's not a CrowdStrike customer.
Robert, were you familiar with CrowdStrike
prior to this?
Yeah, we used CrowdStrike at FireHydrant.
Okay, so what do you use them for?
Endpoint security. We run their Falcon
daemon on all of the employee laptops. We don't use it for the
services we provide, but it is running on every
FireHydrant laptop.
And these laptops are Windows, Linux, macOS?
All Mac, yeah. So we weren't impacted by it, thankfully.
Just the Windows CrowdStrike world.
Yeah, that's what it seems like.
And it seems like there was a change that was in the new sensor
that runs silently.
I think a lot of people don't even know
that they have CrowdStrike on their laptop.
And that's by design, right?
I would say that's a good product.
You don't even know it's there
until it gives you a blue screen of death.
It's a bad way to find out about it,
but before then, brilliant.
It's like you had a bunch of stuff in your walls
and then eventually it falls out of the wall and you're like, oh, that's been rotting
behind there for a long time. I think that the change is always
the biggest cause of incidents. We see it all the time. Google
even has a stat that 80% of their incidents are caused by a change.
So it's not exactly shocking that a change caused this.
I think what's shocking to people is the scale of the incident.
And when you had AC/DC's Thunderstruck playing in your head,
I kind of had Jeff Goldblum in my head where he's like,
flap your wings and a hurricane happens across the ocean.
That's kind of what it felt like.
The butterfly effect.
Yeah, the butterfly effect, exactly.
That's kind of what it felt like to me.
A very simple attempt to access memory that wasn't there.
And the grounded flights? There are still grounded flights.
Delta has canceled hundreds of flights
every single day for the last five days.
And I think we're just going to keep hearing
about problems for the next few weeks from this thing.
Yeah, it would be interesting if somebody could somehow,
some way come up with a global economic impact of this event.
But it has to be measured in billions, maybe trillions of dollars.
I think so.
We had employees and teammates at FireHydrant
that had to cancel trips.
I had friends that were at the airport
that had to cancel their weekend plans
that they were flying somewhere.
So it wasn't only the places like airports and hospitals that were impacted.
It was local economies that were impacted by this as well.
Friends going to the Dominican Republic that couldn't go.
And it's hard to reschedule those types of plans.
So it's kind of, you know, probably not coming back, that loss.
That money, yeah.
Well, not to mention just labor.
Pure labor costs of mitigation or remediation. Because this, unfortunately, does require, I think,
direct contact with each machine affected,
meaning you can't just remotely reboot
these machines, is what I read. You have to actually go touch each machine and, I don't know,
boot in a safe mode, or maybe you know, Robert or Adam, exactly the process. But it's relatively
straightforward, unless you have an encrypted hard drive, then it's slightly less straightforward.
But we're talking about people walking around data centers,
going to each computer, or walking around hospitals,
going to each computer.
I mean, the amount of highly paid individuals effectively doing a mass reboot this week
is probably measured in large numbers.
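For the record, the widely reported manual remediation was exactly that hands-on: boot the machine into Safe Mode, then delete the bad channel file from the CrowdStrike driver directory. A rough sketch of that one step, assuming the publicly reported path and filename pattern are right; verify against official guidance before touching a real machine:

```python
# Sketch of the reported manual fix, run from Safe Mode on an affected
# Windows host. The path and "C-00000291*.sys" pattern are as reported
# in public write-ups of the incident -- treat them as assumptions.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

for bad_file in DRIVER_DIR.glob("C-00000291*.sys"):
    print(f"deleting {bad_file}")
    bad_file.unlink()  # reboot normally afterwards
```

Simple as the script looks, it can't be pushed remotely to a machine that blue-screens during boot, which is why the fix scaled with people, not software.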
Yeah, and even parts of the country in the US
that had issues probably don't have, you know, a big workforce capable of doing this work.
You think of a, you know, a giant airline, they have a massive IT team that can go and do that labor and that work.
But in rural Alaska, 911 wasn't working.
People couldn't call 911.
Really?
And at one point, even Portland's mayor declared a state of emergency on Friday.
And there's parts of the impact area
that just don't have a response unit
that can go solve those problems.
So I do think we're going to keep hearing about it.
There's going to be inquiries by the government. I think I saw today
that CrowdStrike CEO is going to be
called upon by Congress.
That was news of like 16
hours ago or so. AP had that out there.
The Washington Post had it out there.
House committee calls on CrowdStrike CEO
to testify on
the global outage. And not surprising.
And he went on
air pretty quickly.
It was like, this is our fault.
We're fixing it.
And I have to commend the confidence
to just go and say, own it that quickly.
But, you know, I have questions.
I think everyone does.
Even my aunts and uncles in their late 60s
who don't quite understand this type of world
like we all do, were asking me questions. I mean, everyone felt this, I think, in some way, shape, or form.
Well, Windows only.
There's a lot of details. So I caught up with Dave Plummer, that's literally his name. He is on YouTube; he runs a channel called Dave's Garage. He's, from what I understand, an ex-Microsoft operating system developer. And so he knows a lot about this stuff. And I will link it in the show notes, but he was my source of literally what really happened on the inside. There's also The Code Report from Fireship on YouTube that also summarizes some things that I paid attention to as part of researching this topic. So there are some theories that this is just simply bad quality code.
This could be sabotage or this could be planned. Now those are obviously theories,
not truth at this point. But I think it's important to look at, you know, Robert, you said
change is what affects things and what causes incidents. We're not sure when exactly this code
got pushed, but what happened was, or at least from my understanding, and thanks to Dave for
explaining this, is that this software Falcon as
you all run as well it runs in what they call kernel mode and stop me if you've heard this one
before but there's two lands to live in basically in the operating world you've got user mode and
you got kernel mode and kernel mode has you know higher priority and when an application crashes
in kernel mode it crashes a system and it does it by design because it's protecting the system. It's better to crash than to actually boot up.
Something else worse could happen if that was the case.
And CrowdStrike, their software called Falcon, lives
and runs in kernel mode. And that's, I guess, by design.
I'm not sure why it has to. And then there's this labs that
Microsoft has called
Windows Hardware Quality Labs that drivers that live in kernel mode or run in kernel mode that
are third party, they have to go through this process to get deployed. And so it gets tested
by Microsoft through this WHQL labs system to be able to be deployed to get signed and used by the operating system etc but the way
they bypassed this was because in dave's words they want to be they want to be agile ambitious
and aggressive to get the latest protection and so as a way to deploy this latest protection
more fastly to windows users and i guess it's not the case for Mac or other systems because it didn't happen to you all, Robert,
is that they have these things called definition files that the kernel reads from.
So when the kernel wakes up, if it's a new boot, it wakes up, it enumerates a folder,
and looks for this other code, this dynamic code that gets deployed outside the kernel delivery system.
So essentially you have unsigned code that runs in kernel mode.
That's bad stuff.
From what I understand, thanks to Dave,
that's a rough version of the mechanics of how this works on the Windows system.
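To picture the pattern Adam is describing, here's a loose sketch, in Python rather than kernel C, of a signed loader that enumerates a folder of sidecar definition files. Every path, name, and the HMAC scheme here is made up for illustration; it is not CrowdStrike's implementation. The point is the missing step: content that effectively drives kernel-mode behavior being consumed without verification:

```python
import hashlib
import hmac
from pathlib import Path

# Illustrative stand-ins for the sidecar definition files a signed
# driver might enumerate at boot. Paths and formats are assumptions.
DEFINITIONS_DIR = Path("/opt/sensor/definitions")
SIGNING_KEY = b"placeholder-key"

def is_signed(blob: bytes, signature: bytes) -> bool:
    # The check that signing gives you for the driver binary itself,
    # but which (per Dave's description) these sidecar files skipped.
    expected = hmac.new(SIGNING_KEY, blob, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

def load_definitions(verify: bool = True) -> list[bytes]:
    loaded = []
    for path in sorted(DEFINITIONS_DIR.glob("*.sys")):
        blob = path.read_bytes()
        sig = path.with_suffix(".sig")
        if verify and not (sig.exists() and is_signed(blob, sig.read_bytes())):
            continue  # refuse unverified content instead of crashing on it
        loaded.append(blob)
    return loaded
```

With verify=False you have, in miniature, the situation described above: signed machinery happily executing whatever shows up in the folder.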
I think it's a game of trade-offs, and that's a hard thing to feel now, right?
Like people's flights got canceled, you know, hospital surgeries got canceled.
Like it's a big deal.
But at the end of the day, it's easy to say this was the worst thing that could
happen, instead of weighing it against the sum of all the things that were maybe prevented in
the past.
And we just have no idea.
I don't even think that CrowdStrike would probably know.
But how many things has CrowdStrike,
or another security system running,
prevented? Mass credit card theft,
or identity theft, or other things going on?
It's hard to say.
No one's going to buy that now, though.
Because no one's going to look at a trade-off right now. It's like, my flight got canceled; I don't care what my trade-offs were in the past.
Right.
Now, the other thing, and we're just going to have to see if CrowdStrike posts a public retrospective, but this code, the crime scene of this code base, could have been in there, we don't know, for 10 years. We have no idea. And another piece of code was deployed 10 miles away, or so they thought, from that code base, or that line of code calling that memory address, and then that caused it. Right?
I think it's one of the challenges with building software now. Like we were kind of saying earlier, the butterfly effect: software is so complex now, and so vast, that you can deploy a change in what you think is a different country of your code base, and it impacts across the ocean, somewhere else. And I would wager that's what happened here. I would wager there's just no way that CrowdStrike doesn't have a crazy test suite, that Microsoft is probably running tests for them, because it does run in kernel mode; they have to get that approval, it sounds like.
i just have a really hard time believing that this very simple line of code
just got deployed and took everything down.
I could be totally wrong and totally off-base.
I have no idea.
But whenever I've taken down production, and it's been many times,
it wasn't explicitly because of the one piece of code that I wrote.
Because I tested that.
I put that through its paces.
I wrote unit tests.
It was the combination of that
and something else. When you combine bleach and vinegar, what's that potent combination they
say never to mix because it's super toxic all of a sudden? That's what it feels like happened to me
in this outage specifically. Yeah. I mean, for me, it seems like some of our
most ingrained premonitions
coming to fruition
in terms of
being down in the mucky
muck as a developer.
We just know, and I've said many times,
it just feels like we're building a house of cards.
Because it's so complicated.
And it's so intertwined.
And it's effectively, especially with web development,
we're talking about a worldwide distributed system
which has things that happen.
Of course, there's an explanation in retrospect for all of these things,
but when you build a house of cards, eventually it's going to just topple.
And sometimes it topples in ways that you don't know why or when or how
and what will be the downstream effects.
And, of course, this isn't web development in this case.
This is operating system code.
But still, network to machines, being able to remotely update.
Every once in a while, just a house of cards topples,
and we have to start over, to a certain extent, rethink things, try to adjust, and clean up the mess and move forward.
I mean, I even think, for every person listening to this: think about the mechanics of what is going on as you're listening to this podcast. If you're using headphones right now, your headphones have software in them, talking to a Bluetooth chip that has software on it, that's part of an operating system that's translating that to go over the air to a cell phone tower that's running software, that's going to a network switch that Cisco probably built, that's running software, and it just goes on, and then eventually hits an Apple Music server or some Spotify server, that goes through a CDN, that's software. It's just software all the way down. I mean, it is thousands of touch points of software for you to hear this stupid analogy that I'm making. You had to go through that grueling exercise, through that much software.
And that's just the world now.
That's the way it is.
It's not going back.
We can't unwind this anytime soon.
Right.
That's why I said sometimes you just have to clean up the mess and then obviously do a retrospective.
And one thing we can do is make sure
that particular thing doesn't happen ever again.
But that's just one of the things.
That's what regression tests are for.
I'm not going to let this particular bug
bite me and my billion customers again.
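Since the regression test is the one concrete takeaway here, a tiny sketch of what pinning a bug looks like. The parser and its crash are hypothetical, pytest-style, not anything from CrowdStrike's code base:

```python
# A hypothetical parser that once crashed on a truncated definition file,
# plus the regression test that keeps that exact bug from coming back.
import pytest

HEADER_LEN = 4  # assumed fixed-size header for this toy format

def parse_definition(blob: bytes) -> dict:
    # The fix: validate length instead of blindly indexing into blob.
    if len(blob) < HEADER_LEN:
        raise ValueError("definition too short")
    return {"version": blob[0], "payload": blob[HEADER_LEN:]}

def test_truncated_definition_is_rejected():
    # This exact input once caused an unhandled crash; pin it forever.
    with pytest.raises(ValueError):
        parse_definition(b"")
```

Cheap insurance: the test costs milliseconds on every future deploy, and that particular bug can never silently return.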
And I'm sure CrowdStrike,
after they go through the PR process,
I mean, not pull requests, but public relations,
because their stock was down 23%, I think.
I mean, they are massively hurt by this.
Their reputation is just in the mud.
So they're going to go through all that and maybe there'll be people fired.
Who knows what's going to happen.
But then hopefully they sit down and say, okay, let's do our analysis.
Let's do our postmortem. Let's figure out how we can make this particular aspect of our business not hurt people
again. But that's just one thing. Similarly, it goes back to the conversation about information
security that we had with Jacob DePriest from GitHub's security team. The challenge of
the defender is you have to defend the entire system and the attacker only has to find one
hole. Bugs work the same way, only it's just accidental and not malicious, you know?
And so in that conversation, I said, I feel like to a certain extent, resistance is futile.
I mean, the defender does all they can, but you're still going to have the attacker succeed
sometimes.
And it seems like with software systems, the bugs are going to be there.
I mean, we haven't found a way of eliminating all bugs.
And so how do we build around, fortify, defense in depth, react, respond?
I don't know.
I think in one case, this is an advertisement for heterogeneous systems.
What's the word?
Not a monoculture,
just like in biological systems, right?
Like you want to have... Yeah, regenerative farming
where you have, you know,
you plant two crops
in the same plot of land
and they help each other.
Yeah, exactly.
Just diversity inside
of our software systems
so that when we have a problem
in one particular system,
aka Windows machines running CrowdStrike,
that's not a worldwide global outage.
That's like a regional, you know,
20% of the internet was down today, guys,
versus what it actually, like that whole,
let's have multiple operating systems,
not just worldwide, but even in our own organizations which can be
a huge burden, a huge pain
and we tend to want to normalize
and streamline and
formalize a specific
stack of software because it's easier
to maintain and manage
but then you just are vulnerable to
attacks at, like, 100%
scale of your organization.
So, I mean, I think
one takeaway we can have is, like, hey, I'm
really happy I'm running macOS today.
Now maybe tomorrow, all the Windows users
will be happy that they're running Windows and
not macOS because something will attack macOS.
But the Linux users are having the best
time of their life right now.
Oh yeah, the memes are
strong right now.
It's the year of the Linux desktop, you know, Jared?
I've heard that the last 15 years of my life, and it has not come to fruition.
Here's the through line to all this, though.
The through line is massively deployed software.
That's it.
Or massively dependent upon software in a different scenario, like a dependency.
It's that this was everywhere, right?
And then, very specifically to this scenario, I think there are some layers that may have not been thought through so well. Dave covers this in his description of how they bypassed WHQL, the Windows Hardware Quality Labs system that is there to sign these drivers to run in kernel mode. Because what runs in kernel mode is so limited, because of its power. And here they are able to run there, which is okay, fine, if you have to. Windows and that team blesses you, and they put you in the WHQL system to have this signed certificate to say: okay, your driver is blessed; we've tested it to the absolute best of our knowledge; we put it through all the paces.
But then they deploy at scale, and the driver essentially is an engine that runs code that has not been signed, or not gone through those paces. That alone... I'd imagine, Robert, as you look at what you do and how you help folks look at incidents, it's like, when we look at what we've done here, we have to examine the system we built. Maybe it's, you know, anti the Windows way to have this sidecar, this folder of definitions that the driver enumerates over and sucks in, so that the driver essentially is an engine that runs unsigned code.
That could be true, if Dave's accurate.
And if that is true, it was in a sense blessed by the relationship formed between CrowdStrike,
the Falcon software, and the Windows team that has WHQL,
to allow this to live in kernel land and not user land.
That's one thing.
And then you've got just the ability to deploy at scale,
and for the system to do what it should have done.
So, you know, when an app crashes,
an app crashes. When the kernel mode
crashes, the system crashes.
And it crashes because it has to.
Like, this is how... it did
what it should have done. There was a bug
in the kernel driver that, when it booted
up, for whatever reason,
hit an exception at the kernel level.
And when the kernel crashes, the whole system crashes.
And that's by design. So effectively it was preventative on purpose,
but triggered by a bug, or faulty code.
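A user-mode analogue of that invalid memory access is easy to demonstrate, and the contrast is the whole point. This sketch deliberately crashes its own process; the OS contains the fault by killing just the offending program, which is exactly the containment a kernel-mode driver doesn't get:

```python
import ctypes

# In user mode, reading from an invalid address like 0 raises a fault
# the OS absorbs by killing only this process (a segfault / access
# violation). The same class of bug inside a kernel-mode driver has no
# outer layer to absorb it, so Windows halts the machine: the BSOD.
def crash_this_process() -> None:
    ctypes.string_at(0)  # read a C string from address 0: always invalid

if __name__ == "__main__":
    crash_this_process()  # the interpreter dies here; the OS survives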
Yeah. I think as software engineers,
and I feel qualified to say this because it is a criticism: we love thinking that we have, you know, invented new things. And every once in a while you just kind of have to take a step back and think, oh, actually, we've gone through all this process without software; we've already done it. And the example I use all the time is buildings, and building codes, and structures.
I live in New York City.
There's a lot of opportunities for buildings to catch on fire.
And it does happen.
It does happen.
But not nearly at the rate that it used to happen.
If you think about the London fire, if you think about the San Francisco fire, like all
of these events that occurred really just triggered new ways of thinking because of
catastrophe.
And this will do the same thing.
We've been perfectly fine for however long this sidecar technology has been running in production. And now we're not, right? Or maybe
now we're not. The same thing has happened. I mean, we have sprinklers in our
buildings because of fires. We didn't put sprinklers there as a preventative measure.
We had to have a lot of fires before we said, maybe we should have sprinklers in buildings, or maybe we
should put concrete as the center of the building so it doesn't fall when it becomes structurally
unsound. And because of the hundreds of years that we've had of retrospectives and all of these
learnings from these types of things, we have safe buildings now. Same thing with cars. You
were saying the kernel panic is a preventative measure. Cars have the same thing. They have crumple zones to protect the
driver. It's designed to collapse in certain ways. And we're getting to that point with software
more and more. I think the challenge we have for software is it's much easier to do new things
with software than it is to do new things with
cars. I can go write a crazy random piece of code and put it in production today to all the
FireHydrant customers. I swear I won't do that, but I could do that and it would cost nothing.
There'd be no labor virtually, but with these other systems, it's expensive to do new things
like that. So the problem, I think, is we're kind of getting ahead of our skis now with software. It's happening more and more that we're hearing about these global outages, because the system is changing constantly, and we're introducing change at the fastest, most rapid rate that we possibly could. Like you were saying, it's a bit of a house of cards.
This is probably just the beginning. We're probably going to have another massive outage before we really start to learn: oh, maybe we should scale back how much we're actually changing these really complicated systems.
Yeah. And the technical details of that hypothetical future outage could be wildly different than this. And so, you know, whereas maybe you can say, what was the cause of the fire? Well, it was a gas leak. Well, it was a person who was doing something. There's these different reasons, but they're all kind of like, eh, something combusted where it shouldn't have. We didn't have preventative measures in place. With software, so much of it's wildly different; I think it could be very hard. Now, we have had some motion in the direction of... I think it was the United States White House that recently promoted memory-safe languages, for instance.
Rust being, I think, named perhaps, but definitely the Rustaceans were very excited about that particular note.
So we have kind of nudges happening by governments.
I know the EU is what I would call more heavy-handed with their regulation around the things you can and cannot do with software.
But gosh, it just seems like because of the diversity in software systems, you can't just put fire suppression in the building and be done. There's going to be so many different things, I think. So many
different regulations and rules and
details in order to actually
harness up some sort of protection
that would be effective,
even as an 80% solution.
I hear what you're saying. It's a crazy
thought and I really hope we don't
end up in this world.
Buildings have regulated materials that they can be built with now. Even children's toys can't have certain chemicals in them. These are all very regulated industries. And, you know, could software eventually get to that point, where governments are like: you can't use any memory-unsafe language; it has to be blessed by the US government if it's being used for public distribution?
Period.
Could we get there? I don't know.
Maybe. We've gotten there in almost everything else; people that have cabins in the woods have regulations that they still have to abide by.
It's a wild thought. I've never really had it until you started saying that, Jared.
Well, what you're saying, though,
is we get to the future innovations
through past failure
and retrospectives and learning.
That's how we get to the future,
is deploying what we think is the best solution,
it not being the best solution.
There's some sort of catastrophe
on a small or large scale.
We examine that.
We retrospective.
We policy.
We regulate.
We redeploy.
And we try again.
Well, the only other answer is to predict the future.
Yeah.
Yeah.
And I think that's, to some degree,
what developers are trying to do.
They're at least tasked with trying to
solve the present problem
that is future proof.
That has a version of future proof in it.
You hear that all the time, right? This is future proof code.
I've never said that about my code.
Maybe not, but somebody's like,
this will future proof us.
Somebody's definitely said that.
And I have always regretted it.
Say feature proof, maybe. Maybe my code's feature proof.
Yeah, not future proof.
Yeah. Feature free.
What's up, friends? I'm here in the breaks with David Hsu, founder and CEO of Retool. So David, Retool has definitely cornered the market on internal tool software development, but zoom out for me. What's the big idea? Why did you start Retool? What is the big idea with internal software?
Yeah, so Retool started at this point seven years ago. And when we started Retool,
the core idea was that internal software is a giant, giant category that no one really thinks about. And what's surprising to most people is that internal software represents something like
50 to 60% of all the code written in the world, which might sound pretty surprising.
But if you think about it, most of us at Silicon Valley, we work at software companies, whether it's like an Airbnb, a Google, a Meta.
These are all companies that are software companies selling software.
And so most engineers in these companies are working on external-facing software. But if you think about most software engineers in the world,
most software engineers in the world actually don't work at these software companies.
There's not that many of them. There's maybe 10, 20 of them, big ones at least.
Most of the companies in the world are actually non-software companies.
So if you think about a company like an LVMH, for example, or like a Coca-Cola, for example, or like a Zara.
Zara's not selling any software. They actually have a lot of software
engineers, actually. And all their software engineers, all they do day in and day out,
is basically build internal software. So that's, I think, one reason we started Retool.
The second reason we started Retool is if you look at all this internal software that people
are building, it is remarkably similar. So if you take a look at, you know, like a Zara,
for example, versus Coca-Cola, two very different companies, obviously.
One a clothing company, one a beverage company.
But if you actually look at the software they're building internally to go run their operations, it is remarkably similar.
It's basically forms, buttons, tables, all these sort of pretty common building blocks, basically, that come together in different ways. But then if you think about, you know, not just the UI, but also what's the logic behind a lot of this stuff,
they're pretty much just hitting API endpoints, hitting databases. You care about authentication,
you care about authorization. There are sort of a lot of common building blocks, if you will,
to internal tools. And so for us, the insight was, wow, internal software is a ginormous category,
and it's all so similar, and developers hate building it. And so, could we create a sort of higher-level framework, if you will, for building all this software? That would be really cool.
That would be really cool. Okay, so listeners: Retool is built for everyone. Built for enterprise, built for scale, built for developers, and that's you. And if you found yourself nodding your head to what David was saying, then check out Retool at retool.com/changelog. It's the fastest way to build internal software. Do yourself a favor: get a demo or start for free today. Again, retool.com/changelog.
I really come back to this at-scale situation. I think, you know, when we have the larger catastrophes, outages, etc., it's because of widely deployed code, which is a great thing, because that code is somehow widely useful.
But then you've got to have certain things in place that once you're maybe at that level, certain things that have to take place to instantiate change. Because like you said earlier, Robert, it's usually change,
and not so much that specific change, it's that change plus something else
that's the unintended consequence of those two together.
And I did look up, by the way, just because I was like,
what actually happens when you combine chlorine bleach with vinegar?
It produces chlorine gas, which is highly toxic, so don't do that. And the reaction is... I just couldn't remember.
Not good at all.
Yeah, it's not good at all. I mean, it will damage your eyes, your respiratory system, your breathing. It's just not good at all. So never... We learned that the hard way, you know.
Somebody did it.
Yeah, see, exactly. But now we know. Someone did it.
I like noticing obscure signs in public places because they're always indicative of some sort of incident.
Every sign has a story.
Yeah, I remember I was at a hotel one time
and I was hanging out in the pool or maybe the hot tub.
And there's a sign that said,
this pool is not for defecation purposes.
Yeah, which was a very strange sign. And that might not be verbatim; I can't remember if it was defecation or really... it was very formal, though. So it probably did say that.
And I thought, yeah, somebody pooped in this pool at one point.
And there was an incident where they said,
we got to put a sign up.
Or someone watched Caddyshack and was just terrified.
It's just a Baby Ruth.
Yeah.
So yeah, we learn it the hard way most of the time, because we can't predict what will happen
when we combine those two elements until somebody does it.
And sometimes what happens is we go too far, honestly.
We, governments, teams, whatever it is,
the reaction can almost be too much. And I really do hope that,
I mean, this is such a big outage that governments are getting involved that I really hope there's
some restraint in what comes out of this. I do, because I can see a world where it does get more
restrictive in the next few years because of this. A good example is the TSA. You know, the horrible, tragic event of 9/11... but the TSA has been proven time and time again to be security theater, and we spend billions upon billions of dollars on it every single year. And I think that's an example of, you know, we overdid it; we went too far, reactionary. I don't think the TSA should be gone entirely; I think, you know, there is purpose to it. But there are plenty of examples of things in the world where we just go too far. For example, moratoriums on code. It's pretty often that you have a couple incidents in a row.
And then what happens? Everyone says, don't deploy anymore. Stop deploying. And then you realize that you have a memory leak and your system dies anyways because you're not deploying
and not restarting that process. And it dies anyways. So I just hope that we don't go
too far with this, that we don't overreact to this massive outage.
I want an appropriate reaction to it.
Right.
Just to add some layers to this
and going back to something you said, Jared,
and it's kind of a sidetrack,
but I kind of get the information now.
I texted my friend.
So I had lunch with a friend of mine yesterday.
I won't say where they work,
but they work at a bank.
And he said they were down for four hours, which I think is a short time frame compared to other scenarios we've heard of.
I don't know if that was literally only exactly four hours or some coworkers were only down for four hours or the specifics.
But let's just say at least a 10,000 plus organization when it comes to having laptops and distributed employees and branches and regional HQs and state HQs, whatever, and all these things.
So at least a day, and those who did not have their laptop booted down and have to boot up were safe because there was no reboot required. But for those, Jared, you would love this
because if you're a freaking multi-year streaker,
what was the number of years for your laptop?
I was listening back to our podcast recently
and I can't remember which one.
Yeah, my old, my very first MacBook Pro laptop.
I didn't reboot it for over a year.
I just was trying to see how long it could go.
Oh, did you do, like, uptime in the terminal?
Yep, uptime. Well, I also had iStat Menus, which will show that to you, which I've used for many years. So it's very cool; I'd just close it and open it, and I refused to reboot it, because I just wanted to see how long... I called it a server.
Right. Yeah. And you'd have been safe. So the people that, you know, had your ambitions, I suppose, on boot time were safe. But for those who booted down and booted back up the next day,
which is a large majority of the people, right, they had that issue.
And they were told to reboot and see if it fixed it.
Obviously it didn't.
And if that didn't work, they literally had to go to the closest
IT center for them to have a person, like you had said, Jared, touch the machine,
do something to it, and then it was, you know, good to go again, you know? But could you imagine,
like, could you imagine the cost of that enumerated across all the scenarios across
the entire globe that was affected by this? Was it 8.5 million Windows computers
that were actually affected in a single day?
Where there was a larger deployment,
but 8.5 million, I think, is the current number,
if it's accurate.
That's it?
I think that was just one section of it, wasn't it?
Well, I think that was the crash.
Like, there was, like, that many Windows computers
that crashed.
I don't know if those are the only computers that were affected, necessarily, but those are the ones that were, like, in the critical
sphere of should-be-up-but-not-up. So, yeah.
Well, you know, and one of those servers was a SQL Server 2000 that 500 other servers were relying on, right?
Right. Yeah, the cascading failure is massive.
I just feel like Nick Burns had his best day of work ever.
Do you guys remember Nick Burns from Saturday Night Live?
This is your company's...
Your company's...
Your company's computer guy.
Nick, the computer guy.
He'll fix your computer.
Then he's going to make fun of you.
Because he's Nick Burns, the company's computer guy.
Yeah, it's a Jimmy Fallon character.
It's one of his better characters.
Not a huge Fallon fan myself, but this was a good one
where he was just the most obnoxious computer guy stereotype ever.
And nobody wanted to go ask him for help
because he was going to just denigrate them.
And I think his catchphrase was like,
move! Move!
Was that so hard?
So I think Nick Burns had a great day.
He gets to go around to everybody's computer and
get out of the way, I'm going to reboot this thing.
The heroes, honestly.
I mean, the amount of patience
that you would have to have
on that day, Saturday, Sunday, today... yes, you know? Oh gosh. Oh my gosh. Could you imagine? Safe
booting, everything into safe mode, and fixing... I just couldn't even. And just to have a list of,
like, hundreds of computers you have to do next. You're like, all right, just one by one.
Bam, bam.
Oh my gosh.
Yeah, that's true.
It was a Friday event that happened over the weekend.
I mean, not even just those affected by obviously the downtime and their travel and their plans or their work.
It's now like, wow, IT has a big job to do.
I was just watching, like, the first 30 seconds of, I'll link it up in the show notes, Nick Burns, your company's computer guy. He's like... something about a virus, and he's not going to be able to reboot. Like, he just almost described what happened. You've got to go and fix it. So I'll drop that in the notes, or maybe even the audio, we'll see.
I mean, this outage, this CrowdStrike outage,
really hit every trope.
It really did.
Deployed on a Friday.
Right.
Global outage.
Windows.
Yeah, I mean, the whole...
It brought in the operating system wars.
It really hit so many checkboxes.
Memory safety, of course.
There was a lot of C++ versus Rust conversations.
Yeah, I saw a lot of flaming of C++.
That's what kind of irks me, because I'm like,
I don't know, the stuff that you probably tweeted this tweet from
is probably running C++ in some way, shape, or form.
Certainly somewhere in the stack, yes.
I can even think of it. I think that Twitter/X runs Envoy, which is written in C++. Right? I don't know.
I was thinking about this actually from an incident standpoint. And Robert, you know a thing or two about incidents, right? You know one or two things about them, at least?
Yeah, I think so.
Would you think so? I mean, test me out here.
Just checking. It's like, so specifically to my friend in the bank situation, their team had to
raise an incident company-wide that wasn't even their fault. It wasn't like their IT department
messed up. So can you describe what you hypothesize for how a well-managed
IT slash technology stack organization would and should react when it's not even their
problem? Like it's their problem, obviously, but they didn't do it. And the fix is not clear
because it's upstream. How do you think this percolated inside?
What's your hypothesis?
So it's a good question.
I mean, for an incident like this, like you're saying, it's on the outside of your controlled
world.
It's challenging.
So your job at that point for whatever these teams, the banks, the call centers, all these
places that were down because of this outage,
the first job is going to be containment and workarounds.
You're going to try to find a workaround as fast as humanly possible.
And those teams, what they're going to do is they're going to work
within their controlled world.
So an IT team at a bank probably is going to tell everyone
at the bank impacted,
own the communications like, it's not a bug that we're causing. Here's the news that I'm sure
everyone probably knew at that point. Here's what you can do to try to fix it, right? Here's how you
boot into safe mode. Here's how you do X. And the incident responders at that point, they're just
going to be trying to create a perimeter where it doesn't get worse and they can do things a little
bit better. A good example is like, if you think of a wildfire, there are firefighters that are
fighting the fire, that's CrowdStrike. And then there are firefighters down or rather up the hill,
chopping down brush, cutting down trees, like trying to stop it from going any further.
That's kind of where those teams are going to go.
That's the mode that they're going to go into.
I can't say for sure, but that's what happens in the situations where I've had a vendor outage.
That's the first thing we do is we try to look for another route.
This happened recently.
I mean, we actually,
our CDN provider, you know, incidents are natural, so I won't name them. I'm
not blaming them, but they had an incident like a week and a half ago. It only impacted Newark,
pretty small. And we can't control that, and we had to own that. And we had an incident
opened internally because all of the East Coast users
were going through this point of presence
and they were getting 502s.
So what we did is we actually just rerouted traffic.
We just took our CDN out of the loop
and that's how we got around it.
That was the only thing we could do.
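The reroute Robert describes is conceptually just a health check plus a traffic flip. A hand-wavy sketch of that emergency route, with the actual traffic switch stubbed out, since every CDN and DNS provider's API differs and none is assumed here:

```python
import urllib.request

CDN_HEALTH_URL = "https://cdn.example.com/healthz"  # placeholder
ORIGIN_HOST = "origin.example.com"                  # placeholder

def cdn_healthy() -> bool:
    # Treat timeouts and 5xx responses (like those 502s) as unhealthy.
    try:
        with urllib.request.urlopen(CDN_HEALTH_URL, timeout=3) as resp:
            return resp.status < 500
    except OSError:
        return False

def point_traffic_at(host: str) -> None:
    # Stub: in practice this is a DNS or load-balancer API call,
    # which varies by provider.
    print(f"routing traffic directly to {host}")

if __name__ == "__main__":
    if not cdn_healthy():
        # Take the CDN out of the loop -- the workaround described above.
        point_traffic_at(ORIGIN_HOST)
```

The hard part isn't the code; it's having decided in advance that the origin can take the load, which is the risk-surface-area thinking that comes next.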
And I think teams are going to have to start thinking
about these emergency routes more and more,
especially because it's CrowdStrike outage,
they're going to be like, what is our risk surface area? If we use this vendor and that vendor goes
down, are we screwed? I think a lot of companies are going to start thinking that now, just because
of this one outage, it's going to be pretty present in people's minds. And the management
process is going to have to change. You're going to have to create like your go bag of incident management when it's out of your control. I remember doing these
practices back when I was in school, which was a MIS degree with a CS minor I was going to school
for, which is, you know, management information systems. I probably haven't said that phrase
since I graduated, but I remember them doing these practice routines,
business continuity planning.
I'm starting to remember the acronyms as well.
Disaster recovery.
Like you would actually write down
what are all the things that could possibly go wrong,
which is a fool's errand, by the way.
But you'd still try.
You'd do your best, right?
There's the predict the future part.
You can get close.
There's your predict the future part.
And then you'd have to come up with a game plan
for each of these situations like how are we
going to mitigate the the impact how are we going to continue to run our business what are the
workarounds what are the next steps etc and i did enjoy those processes except for the writing part
Of course, because I was in school; nobody wants to write. But I thought it was very useful to think, like,
what are a list of things that are likely to happen?
Do you remember any of them?
A lot of them were, well, they're completely made up businesses of course.
So it's all kind of just arbitrary because we didn't actually have any businesses.
And so we were like, you're the CTO of X corp that does Y thing. And now what could
happen? And so you had to kind of like make up, here's our technology stack, here's what we're
doing. And then if X, then Y. And no, I don't remember any of those particular details, but
I did recently visit a nuclear power plant here in Nebraska and the amount of things they've thought through and the amount of planning
that they've done and building hedges, so to speak, around almost every possible thing
that could go wrong at a power plant.
It's actually, it's laudable.
It's amazing how thoroughly these folks have gone through and prepared for umpteen potential things.
And it made me realize, like, oh, in software we just kind of fly by the seat of our pants, don't we?
You know, of course, they move way slower. I mean, that's the trade-off, right? Like,
everything moves super slow at a nuclear power plant. It has to, because the consequences of disaster are so large.
And maybe the fairytale we've told ourselves,
and maybe it's gotten less and less true over time,
is the consequence of software disasters isn't that big.
We even had the phrases for it.
I don't think we were pretending at all.
What was it? Move fast and break things.
How many times was that said in Silicon Valley? Right. That got abused though. I mean, I think that at the time that began at
Facebook, so that was a Facebook-born ideation. And I think it was a culture because they were
in an innovation state. They were not in a, I mean, I guess they were becoming more and more widely deployed,
but they were also a web service.
So it wasn't like, well, it's installed and it's going to crash something.
So I think there's scenarios.
Now, obviously, it's a social network and there's a lot of people out there that are affected by,
you know, abuse, harm, et cetera, that can happen in social media, which I fully agree with.
That's just how it is; it kind of just sucks.
And so move fast and break things is going to occur to a lot of people as just not a good thing, obviously. But to a technologist who's trying to innovate, it's a very admirable thing. Like, yeah, let's move fast and break things, because what happens is the iteration cycle to learning happens faster.
Right. This cycle you described with the sprinklers, well, where it doesn't happen is the danger zone, right? Places like CrowdStrike should not deploy this idea of move fast and break things. And maybe they did move fast and break things.
Well, it's interesting in that particular context, because they are fighting
adversaries who are also moving fast in order to break things. And so this goes back to the trade-offs that Robert was discussing.
I mean, I can understand the ethos that said, we need a way to deploy to these machines
outside of going through the entire process with Microsoft and the kernel stuff and the
signing.
We need a way to get our fixes out there before they attack all of our customers.
That's what they're paying us for.
And so I can see that trade-off of like, well, how can we do that? Well, let's develop a system where we're going to just side
load some rules, and we'll try to make it innocuous. And we'll have... I'm sure there's CI/CD and
there's test suites. I mean, this is a publicly traded company. I'm sure they have infrastructure
around the code they're rolling out. Am I giving them too much credit? I don't think I am.
I would be shocked if we learned that they didn't,
like this code went out when one person wrote it
and nobody else looked at it.
And I doubt that's the case.
The anxiety of that code review, Jared.
Right.
A little throwback.
Yes.
And so I can understand that push and pull.
I mean, we have this even inside of like the app store
where it takes forever in software terms
to roll out an app update.
But if you have your logic server-side
and you can push even web components into a view,
you can actually update your app throughout the day.
You can basically do what they're doing with CrowdStrike,
with Falcon.
Over-the-air updates are exactly what you're saying.
Apple restricts them
pretty heavily for their platform. But I like what you're saying: for CrowdStrike, this is an
advantage. This is probably something they have bragged about in their sales cycle. Like, you don't
ever need to do an update of this agent; it will just update itself. This is how I understand how
it works. And when new vulnerabilities come out,
we will cover you and protect you.
That's a huge selling point.
Why would you want to get rid of that?
Come on, Adam.
Why would you want to get rid of that?
Don't take it away from us.
No, and I agree with that.
I think, I don't think,
so the question comes back to,
what can we do to learn from this?
I've heard... I think, was it, did you mention this in News, Jared? I've read and listened to several things. eBPF, and how this could be... the way eBPF works. And I'm loosely, I mean, I'm steeped in it to some degree, but also very, like, beyond even novice. I'm a green person when it comes to what eBPF is and how to describe it. But from what I understand, this could be a different architecture that could prevent this.
Well, what's interesting is that CrowdStrike
is actually using eBPF in their Linux client,
is what I read from Brendan Gregg's article about eBPF.
And so they're very well aware of it.
It's a way to do this that's safer. And it's in
development inside of Microsoft to provide eBPF support for Windows.
This was you then. Thank you. I love Changelog News, by the way. Hey, y'all, listen to this:
changelog.com/news. Subscribe today. If you're not, you're just missing out.
You're missing out. So Brendan Gregg has this post, which was in Changelog News,
called No More Blue Fridays,
and it's his writing of why eBPF
will be potentially another tool in our toolbox, right?
In order to achieve what they're trying to achieve
without some of the dangers
latent in the current Windows-based rollout.
However, the in-development version of eBPF
will not have all the features it has in Linux.
And so could CrowdStrike immediately use it
in order to replace their current rollout?
Survey says probably not.
It has to be much more full-featured
in order for that to be a thing they could start using
as soon as it's shipped.
But it's a direction.
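For a flavor of why eBPF is the safer path, here's the classic hello-world from the bcc project on Linux. The C snippet is compiled and then analyzed by the in-kernel verifier before it's allowed to run; a program that could dereference invalid memory or loop forever is rejected at load time, which is exactly the containment the Windows kernel-driver model lacked here. Assumes Linux, root privileges, and the bcc package installed:

```python
# Classic bcc example: trace every clone() syscall from Python.
# Requires Linux, root, and bcc (https://github.com/iovisor/bcc).
from bcc import BPF

program = """
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

# The kernel's verifier checks this program before loading it; unsafe
# memory access or unbounded loops fail here, not at boot on a Friday.
b = BPF(text=program)
b.trace_print()  # stream trace output until interrupted
```

The verifier is the sprinkler system of this story: misbehaving code is refused up front instead of panicking a machine it's already resident in.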
Well, what better way to get R&D budget
to make that go faster than what just happened, right?
Well, there you go.
That was kind of Brendan Gregg's point at the end.
And of course, I think he has a dog in the hunt.
He's very much invested in eBPF,
which is open source and all that,
but there's businesses built around it.
But he said like, hey, here's your great moment.
If you are paying for computer security software
and you are a paid customer of these entities,
you could push them to make this eBPF path
happen faster and better
because you're their customer.
So that was his call to action at the end of that post.
And what would happen is... that is at the kernel level. Do you know much about this, to describe
what would happen if this hypothesized world existed,
this future development? How it would work
to prevent this kernel driver from
crashing the system, or booting without it, or
being safer?
No. Okay.
Well, that's what I was thinking of.
How can we... I guess, and I'm not a Windows developer,
so by all means, just slap me in the face after this one, but I'm just thinking:
you have a crash dump whenever the blue screen of death comes up.
And the system probably knows what crashed it, at least if it's a driver
in kernel mode, what's crashing it. Could you not just offer
the user the option to boot sans that third party, especially if it's third-party software, temporarily?
Now, I get that this is cybersecurity software.
What do you mean?
Well, I'm just thinking if the kernel driver of CrowdStrike,
a third-party, not a first-party, native operating system kernel driver,
is crashing the system.
So by moniker, it's a third party.
Could you not say, well,
this system knows that
this third party driver is crashing the system.
Do you want to boot without it?
And maybe that's what safe mode does,
but I mean, why couldn't that be a non-safe mode thing?
I don't know.
Because maybe those systems could have just been booted
by everyday people.
It's about UX and user friendliness.
Now, I don't know if that's secure. Robert's shaking his head a little bit.
Are you saying the system knows that the system
is crashing?
It's a layer on a layer.
You're throwing another layer that doesn't currently exist
in there? Is that what you're saying, Robert?
I think... I mean, I'm not even
going to try to pretend I know how
these kernel, I'm going to call it an add-on,
that's how inexperienced I am with it.
Like, plugins.
I don't want to pretend to know.
But I think that what Adam is saying,
I think the challenge with that is
just more complexity.
And is the risk worth the reward?
And can the system...
Think about the amount of trial and error
you would have to go through
for that to work really well.
And where does the operating system even store that knowledge
that that plugin is borked?
You're at the point of it booting.
That's my point. It's crashing currently.
You might not even have file system access yet.
That's how early in the ones and zeros we are.
So I think that's the challenge: you've got to put it somewhere. So let's zoom back out one layer then. My thought is not literally
how we deploy the fix, like this is literally how we solve it. But from a user experience standpoint,
the reason why the outage perpetuated to its length was because
everyday people could not solve their own problem with the system. And I'm just suggesting,
is there a path where you can provide everyday users of their computer some version of
bypassing this crash. That's all.
And I don't know that answer.
I'm just hypothesizing that
the reason why it perpetuated
was because people who,
like IT basically,
people smarter than the end user
from a technical standpoint,
in most cases,
could not solve it;
they had to come in and be deployed to
literally open up the laptop.
Or could you imagine trucking in
a workstation? Like, not everybody uses laptops these days, some people use workstations. But, like,
you had to take the thing in to the people, they had to plug a monitor into it and a keyboard into it,
and somebody else had to touch it. I'm just thinking, is there another way where the end user could
have done more of this inline too, rather than simply waiting?
I don't think Nick Burns wants the end user to do it.
No?
Well, I remember the days of Windows
where it was remote PCs
and the only thing that that station was responsible for
was basically connecting to something else
that was doing the compute.
Maybe that comes back, right?
Maybe that's a world that...
Client-side computing was thin clients.
That was Citrix, and that's my roots, man.
I grew up in IT in the early 2000s,
worked at an IT company that deployed Citrix
and VMware intensely.
We had our own co-location system at a data center.
You were talking about the power plant, Jared.
Data centers are similarly, if not equally, thought through.
Not equally.
Not equally.
Yeah, I'm going to say maybe not all the way.
Nuclear power plants are so regulated.
Well, that's why I said similarly, if not equally.
There's a version of the thoughtfulness, let's just say.
I'm going to say I hope they're not.
I hope that nuclear power plants have more thought.
Okay, I would give you that.
I came out feeling much safer about nuclear power
through this tour because of how stinking serious
they are about safety.
But anyways.
Yeah.
Well, just the point was that I agree, Robert.
Maybe thin clients or remote.
I mean, but.
What's old is new again.
Maybe, you know, I think.
Well, you know what the web is?
Jared was talking about that.
It's like a widely deployed operating system.
Most of us are on web apps these days
anyways. You know, most
of what we do is through the browser.
Like right now, we're having this discussion
through the browser.
Video, audio, recorded
locally, streamed back up.
In most cases, doesn't fail.
Really good
software, but it's web software. We have to use
a special browser,
which is a whole different fight.
Web software goes down.
I'm just not sure exactly what we're solving with this moving the furniture around.
So what I had in my head is,
I saw a picture through all the news cycles
of this CrowdStrike outage was,
it was actually, it was a gate agent's computer.
It was at the gate where you board the plane
and it had the blue screen of death.
And in that situation, does that computer need a CrowdStrike kernel agent running on it? Maybe it
does, maybe it doesn't, I don't know. But I think where I'm going with this is, does that computer
just need a screen, a mouse, and a keyboard that's hooked up to something else down the hallway? You know, that's one station that's powering 20 gates, and it's much
easier. It's a smaller surface area. You know, I think we're getting to that point. Like, networks
are getting fast enough to do that type of thing. Maybe it's too far. I'm not sure, honestly. I mean,
some companies have tried to do this with, like, gaming, for example. I don't remember if, you know, it all
failed so far. It failed so fast. Yeah, but maybe that was too far, right? Like, that's hard to do.
That's like, you need super low-latency video feeds, right? And it was Google, it was Google trying to do
it. It wasn't some fly-by-night. I mean, they have the resources. If anybody could accomplish it, you'd
think Google. Yeah, and Microsoft Xbox is
trying to do it too, I forget the name. Yeah, yeah, true. But maybe it's like that type of world, right,
where it's just a keyboard, a mouse, and a screen, and it's hooked up somewhere else. Maybe that's
where we go. You reduce the surface area, therefore you reduce the amount of potential
outage. I think in this case that hypothesis has merit only because we know
what we know now, not because we knew it
prior to, and that's the point. Because I think even in that scenario, you
have now a single machine dependency
of many dependencies. And now it's like, well, when that one machine is down,
it's not just one person.
The outage affects many because of the design of, you know, dependency.
I am pro thin client, though.
I'm pro what Citrix did back in those days.
It was a very cool thing.
I mean.
I hated it.
Well, so for certain workers, for certain tasks, it was perfect.
I hated it too, Jared, because I...
Why were you for it then?
Well, in my scenario, I was for it for everybody else though.
Oh, for everybody else.
Oh, yeah.
Oh, I'm for it for everybody else, yeah.
Yeah, I think it's cool tech.
The ergonomics of it were terrible.
Yes.
Yeah.
I agree, the tech was cool.
And for certain scenarios, I helped out.
I ran network administration for a company that did commodity trading. And so they had machines in silos, you know, grain silos. And those places are dirty, nasty, corn, chaff, etc. Like, it's not the place where you're going to have a server farm. Or you wouldn't even want a PC, because eventually that tower is going to get all kinds of
stuff into it and it's going to break down. And so in those cases, like, the thinnest client possible
with a Citrix connection was the answer. Made tons of sense. Yeah, but in many other use cases you got
your employees sitting in their office and they're Citrixing into somewhere else, you know, to run
with this latency, and it was slow, and they didn't have access to local resources.
Okay.
In those contexts,
I was like,
this is ridiculous.
I have a beefy computer
sitting here.
It's connected
to a remote machine.
The grain silo
didn't have a good
internet connection.
Well, that was another problem.
We had to create,
a lot of times
we had to create
internet connection
for them
in order for them
to actually connect back
to Citrix.
And so that was,
I mean, it was,
you're trying to do remote computing in a grain silo.
It's not going to be easy no matter how you do it.
Right.
What's up, friends?
I'm here with Feross Aboukhadijeh, founder and CEO of Socket.
Socket helps to protect the best engineering teams out there with their developer first
security platform.
And so Feross, speaking of developer first, Socket is developer first.
What does that mean?
What do you mean by being developer first?
Most security software is typically sold to executives.
So it tends to suck to actually use it.
So the company, the vendor goes in and makes a sale.
The executive thinks it looks good, but they don't actually care at all what the developer
experiences of the tool.
So I think that's where I would start.
The first problem with security tools is they're sold to executives.
In the best case, those tools get purchased and they just sit around on the shelf bothering nobody and protecting nobody. But in the worst case, they get rolled out and they prevent developers from getting things done. And they just get all up in your face with alerts and pointless noise that isn't actionable. And if you actually go and fix those alerts, you're not even improving security because a lot of the time those vulnerabilities are super low impact. That's like the dirty secret of vulnerabilities is most of them are low impact. They're either in
dev dependencies, so they're never going to run in production or they're really difficult to
exploit. Or if you exploit them, there's nothing really there. It's like a, you know, a denial of
service in some random component. And in reality, like that's just such a low risk in terms of just
your priorities of things you need to work on as a developer. I would actually say probably 90 or 95 percent of the vulnerability alerts that developers are used to seeing from other tools are just completely pointless.
They're just fake work.
And fixing them doesn't even meaningfully improve security at all.
There you have it.
Protect yourself, your team, and your software from the threats that really matter.
Don't do fake work.
Use Socket.
Socket.dev. Book a demo. Install the GitHub app. Install the So that really matter. Don't do fake work. Use Socket. Socket.dev.
Book a demo.
Install the GitHub app.
Install the Socket CLI.
Whatever it takes to take the next step, do it.
Go to Socket.dev.
Again, Socket.dev.
Well, Intel Innovation 2024 Accelerate the Future is right around the corner.
It takes place September 24th and 25th in San Jose, California.
This event is all about you, the developer, the community,
and the critical role you play in tackling the toughest challenges across the industry.
Ignite your passion for AI and beyond, grow your skills to maximize your impact,
and network with your peers as they unleash the next wave of advancements in technology.
Understand the emerging innovation and trends in dev tools, languages, frameworks, and technologies in AI and beyond.
Join on-site hands-on labs, workshops, meetups, and hackathons to collaborate and solve real problems in real time. Collab with experts, learn and have fun, engage in interactive sessions, connect, grow your
network, gain a unique idea and perspective, and build lasting networks.
And of course, have fun.
You'll hear from leading experts in the industry, technologists, startup entrepreneurs, and
fellow developers, along with Intel leadership CEO Pat Gelsinger and CTO Greg Lavender as they take you through the latest advancements in technology.
Don't miss out on the chance to be at the forefront of innovation.
Take advantage of their early bird pricing from now until August 2nd.
Register using the link in the show notes, or to learn more,
go to intel.com slash innovation.
When you're at scale, like CrowdStrike was,
and you deploy bad code, regardless of which theory you go with,
bad code, done on purpose, rogue whatever.
I mean, there's people saying like this was planned.
I haven't read any of that stuff, but I'm sure it's out there.
Well, you know, anytime something like this happens at a scale like this,
you've got to wonder, like, we live in a simulation lately.
Like, there are strange things happening every single day
that have been basically unprecedented, every single day.
So, like, the new precedent is unprecedented, you know?
Right.
And I just, I don't want to hypothesize here
because that's not what we're trying to do
or not what I'm trying to do.
But when you're at scale like this,
it's obviously an attack surface of some sort,
whether it's bad code, an incident,
or just simply, you know, a bad day, a bad Friday, a bad weekend.
And how can we give CrowdStrike the ability to do what they want to do and have the sales
pitch they want to have without having the opportunity for outage like this?
And then all the others, they're going to follow in their footsteps.
Who else?
Well, the software will be at scale
and be an attack surface, whether it's
bad code, planned,
intended, rogue, whatever.
They're all similar scenarios,
just a matter of how the incident
percolates.
I mean, the surface area
of software that can be impacted
now, either just through sheer outage or security, is staggering.
I mean, there was, I don't know, maybe a month and a half ago, two months ago, it was
newsworthy enough for the New York Times. I saw the word Postgres on the front page of the New York Times.
I was like, what is this? And you go and read it, and it all boils
down to: there was a state actor that gained the trust of the core team for Postgres, and they
started submitting patches that fixed real things. And then they submitted something that was
very subtle that was caught by accident by another
engineer years later. And they eventually figured it out. They were like, holy crap, this person just
gained our trust by submitting real stuff and then snuck something in. And how do you defend that?
You just can't. I don't think you can. And that sounds a lot like the xz thing. Is this
in addition to that? I think that's what I'm talking about, yeah. I couldn't remember
the exact name of it. But yeah, so I don't remember the Postgres part, but certainly
this xz backdoor was placed by a state actor. I think it was someone working on Postgres who found it, like
they got down to that level, maybe. That's how I misremembered it.
Fair enough. Well, XZ is a dependency
of many software packages
and was close to being
actually distributed via
Apt and other package
registries prior to
it getting found out on accident
by a developer. So yeah,
crazy times for sure.
Definitely not tinfoil hat, Adam,
to say, you know, was this,
to ask the question of,
was this mere incompetence
or was this actually an attack?
Because attacks happen
and they are happening
and they will continue to.
And so those questions do have to be asked.
I think in this particular case,
I jumped immediately to incompetence,
you know, Occam's razor style,
because I know how complex software systems are to roll out updates.
You know, I was like, oh gosh, somebody had a really bad day,
but that could be a wrong conclusion to jump to.
Well, I think in the case that you're talking about, Robert,
with Postgres, if this is accurate, the answer is code analysis.
You have to analyze, especially in open source,
but when it's closed source like CrowdStrike and a definition update,
all you can do is rely upon that team, that company,
to be mature enough to have protections in place.
When it's proprietary closed source,
there's nothing you can do, at scale, to analyze the code.
From a different route with open source,
you could do a lot of things.
You could pay attention to where the patches are coming from.
You know, I guess in this case here,
if the patch was, you know,
hey, Robert, here's the patch.
I'm Adam.
Let's just say it's you as the core committer
and I'm the friend who's trying to be friendly.
I've solved this problem.
Here you go, Robert, and you just take my code
and maybe you actually deploy it to Postgres.
So it's coming in signed. Maybe that's an example where you really can't analyze very well.
But if you could say, Robert is signing this commit, but the location or the source of the commit is from an outside contributor helping out, because it's open source,
then you could at least have a waypoint to begin to track
if you're doing code analysis.
I think that's the area where I'm really confident
and looking forward to more and more being done.
Because when you can analyze the Git repository
and the graph of things happening in a code base,
there's a lot you can pull
out when it's like, okay, that's a smell.
You got a brand new committer.
You got somebody being nurtured or whatever you want to call it to kind of get their trust
over multiple years even.
There's layers of anomaly that can be identified because of the way open source works
if you do specific code analysis.
So that's where I'm hopeful.
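To ground that idea, here's a minimal sketch of one such "smell" detector, assuming Node.js and a local git checkout. The file patterns flagged as sensitive are invented for illustration; a real analysis would use far richer signals than this.

```typescript
// A minimal sketch of the "brand new committer" smell, assuming Node.js 18+
// and a local git checkout. The sensitive-file patterns are made up.
import { execFileSync } from "node:child_process";

// "author email<TAB>commit hash" pairs, oldest commit first.
const log = execFileSync("git", ["log", "--reverse", "--format=%ae\t%H"], {
  encoding: "utf8",
});

const commitsByAuthor = new Map<string, number>();

for (const line of log.trim().split("\n")) {
  const [author, hash] = line.split("\t");
  if (!author || !hash) continue;

  const count = (commitsByAuthor.get(author) ?? 0) + 1;
  commitsByAuthor.set(author, count);

  // A first-ever commit from an author that touches build or release
  // plumbing (the xz backdoor hid in build scripts) is worth a human look.
  if (count === 1) {
    const files = execFileSync(
      "git",
      ["show", "--name-only", "--format=", hash],
      { encoding: "utf8" }
    )
      .trim()
      .split("\n");

    const sensitive = files.filter((f) =>
      /(\.m4$|configure|Makefile|build|release)/i.test(f)
    );
    if (sensitive.length > 0) {
      console.log(`new committer ${author} touched: ${sensitive.join(", ")}`);
    }
  }
}
```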
I'm hopeful that we can keep
open source going the way it is for
longer. I do think that some of
these risks that are coming up with
state actors infiltrating through
years of building trust and
accidental attack
vectors coming through, like, over time, I
think that people are going to start to get skeptical. Yeah. And that's going to be a tough
moment. We're going to have to kind of start thinking about that. I'm starting to hear more and
more about people, like, not wanting to use third-party libraries for common things, just because of the risk. For example, attacking a JavaScript npm package
that's widely used
that does a pretty simple thing.
Candidly, it's less risky to just do it yourself sometimes.
And that's a calculus that companies
are going to have to start thinking about.
Yes.
I mean, I think every developer should make that calculation every
time they're going to pull in a dependency. And I'm not saying don't pull the dependency in, but I think
you do have to think through that. I think we're learning that, and hopefully our collective immune
system will react. I do think that these state actors being outed every once in a while, at least,
will boost our immune system as open source
maintainers, to be like, let's kind of be a little more leery of the contributors who are coming
around. And, like, just, you know, that whole kumbaya, open open open, we're-all-friends-worldwide thing
that was going on when open source began, it's gone. It's just not the same world anymore. And so maybe we just won't be
fooled next time, hopefully,
by somebody who's trying to butter us up
in order to take advantage of us
Do you think there's a way to
label software at scale?
Like xz: if you're a
contributor to xz, do you know how
much is deployed, and do you understand how crucial your core role is to that software?
Yes and no.
Yes and no, right?
Probably hard to feel the actual gravity of it.
Right.
Right.
I'm just wondering, is there a way to, and I'm literally asking the question without having put any thought into it.
So if it's naive, you know, slap me around if you have to.
As we do.
Yeah. I'm just wondering, is there a way to elevate certain software,
maybe even by analysis,
to understand its deployment
or its dependency level, I suppose?
Its scale.
Like I'm sure CrowdStrike knew
how at scale they were.
This was not unknown to them, so this is
not an example. But xz, and the folks behind that, who were being, you know, groomed, for lack of a better
term, over a year or more, a very long, patient amount of time: do they understand how crucial
the software is that they're in control of, so that they can have that position you just said? I'm just thinking, is there a way to
label something, hey, you're a scaled software,
you're widely deployed,
and there's some way to elevate them
to a different level, at least by label,
so that there's an awareness
that if there's a
malicious attack on that code base,
it has effects.
I feel like GitHub could own that.
Honestly.
They know how many times a repository
is committed. They know how many times
it's even looked at, just page views
in general. They know the number
of stars on it.
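For a concrete picture of the signals being described, here's a hedged sketch against GitHub's public REST API (GET /repos/{owner}/{repo}). The repository name is a placeholder, and anything beyond these public counters, like page views or clone counts, would require an authenticated token with push access.

```typescript
// A sketch of the public per-repo signals GitHub already exposes,
// via the real REST endpoint GET /repos/{owner}/{repo}.
// "some-org/some-critical-lib" is a made-up placeholder name.
const res = await fetch(
  "https://api.github.com/repos/some-org/some-critical-lib",
  { headers: { Accept: "application/vnd.github+json" } }
);
if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);

const repo = await res.json();
console.log({
  stars: repo.stargazers_count,
  forks: repo.forks_count,
  watchers: repo.subscribers_count,
  openIssues: repo.open_issues_count,
});
```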
And maybe it's not GitHub.
Maybe it's some other program. Maybe it's government
sponsored. That goes
to these maintainers
and says, just FYI, you're on our list.
You just made the list.
And in a way, it's like, congratulations, you've built such valuable software.
It's now a national security threat.
But I hear what you're saying.
I think it's hard.
I think it's hard because it takes the steam out of it.
It takes the altruism out of it sometimes too
for some people that just want to do a good thing.
When the barrier is high, then people won't do it.
And I think that's challenging.
I think the maintainers of scaled software know.
I think that they're just wildly under-resourced
and exhausted and can't possibly
sometimes care enough anymore because they've cared so much for so long, for so little.
So I think for the rest of us, I did not know how big XZ was in terms of its dependency
graph, the other way around, how many dependency graphs it was in, which was many, but I'm sure that the author of xz has an idea.
Like, that's why I said yes and no.
He may not know exactly how big his software is, but at a certain point when your package
is deployed across all these distributions and stuff, yeah, you understand that like,
wow, this thing is really reaching lots of places.
And so I think there's some of that gravity there.
But for the rest of us, that might be useful
to have that list of softwares
that are considered of national security importance
or whatever it is.
They aren't the threat, but they are of potential threat
because of their
situation. I think one example of a developer who just built an open source something
and took it down, not realizing the true scale of this thing, was LeftPad. Oh yeah, 2016. That one was
wild. So many packages couldn't be installed, and deploys, like, stopped for hours
because of that. And it was just some, I forget the exact context, but I think it was like some
dispute, and out of it he was like, I'm going to take down the package you're using. Wasn't it a political...
Yeah, I don't remember exactly. I don't think LeftPad was political. LeftPad was a long time ago.
It was a political one.
He just deleted it off of the npm package registry and then chaos ensued.
I think LeftPad might have been the one
where they had another package called Kik or SideKik
and another company, a company, not another company.
This might not be LeftPad either.
But this definitely happened.
There's a company, a startup called Kik, K-I- either, but this definitely happened. There's a company, a startup
called Kik, K-I-K, I believe.
And there's a package called Kik, I think
owned by the LeftPad owner,
if it's coming back to me.
And the Kik company contacted
NPM and wanted the name,
but didn't have the package name.
And I think NPM granted them
access to the Kik package name,
basically kicking it off the LeftPad owner.
And then they got mad and just pulled LeftPad
and all their stuff.
I think they pulled all their stuff.
I'm pretty sure that's LeftPad.
That may be a different one
because there's been so many at this point,
but that definitely happened.
I have the, there's a Wikipedia page for it.
Is there?
NPM LeftPad incident that I just found.
And yeah, you're right on the money
with what you just said.
But you know what's kind of crazy about that?
And it kind of goes back to what I was saying
about own your software a little bit more.
LeftPad was not a thing
that needed to go out over a network
and download a package and pull it down.
Any engineer should be able to write
what LeftPad did.
Absolutely.
Or copy-paste the function.
It was like a...
Or that, yeah.
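For reference, the entire job left-pad did fits in a few lines. This is a sketch of the behavior, not the npm package's actual source:

```typescript
// A sketch of what left-pad provided: the handful of lines any
// engineer could write (or copy-paste) instead of taking a dependency.
function leftPad(input: string | number, length: number, padChar = " "): string {
  let result = String(input);
  while (result.length < length) {
    result = padChar + result;
  }
  return result;
}

console.log(leftPad(42, 5, "0")); // "00042"
```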
Because I mean,
you can use somebody else's code
with a little copy paste
and remove that dependency.
And because,
not because you can't trust the author,
but because we cannot trust the network.
Right?
That's the problem with NPM.
We can trust the authors in most cases,
but we cannot trust the network
into the future.
You can maybe trust it today,
but you cannot trust the network
tomorrow. And so, copy
paste that sucker. Vendor it. I mean, that's
what we used to call it in the real world, vendor it.
Which is to pull it into your repo,
check it in, and leave it there.
I remember doing that.
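One lightweight version of that practice, sketched here with a made-up file path and a placeholder digest: check a hash in alongside the vendored file and fail the build if the bytes ever drift.

```typescript
// A minimal sketch of "check it in and leave it there": pin a hash of the
// vendored file in your repo and fail loudly if it ever changes. Both the
// file path and the pinned digest below are made-up placeholders.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

const PINNED_SHA256 =
  "0000000000000000000000000000000000000000000000000000000000000000"; // placeholder

const bytes = readFileSync("vendor/left-pad.js");
const actual = createHash("sha256").update(bytes).digest("hex");

if (actual !== PINNED_SHA256) {
  throw new Error(
    `vendor/left-pad.js drifted: expected ${PINNED_SHA256}, got ${actual}`
  );
}
console.log("vendored dependency matches its pinned hash");
```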
Did you see that one? It was a couple weeks ago that a domain
expired that was hosting
a JavaScript package.
Polyfill.
Someone else bought the domain, put something
not good there.
Same domain
path
and all these websites that were resolving
that domain to the new source
were impacted. It was like 100,000 websites.
You can't trust the network.
Yeah, so that's a good way.
You can't trust the network.
I think it's a good way.
Especially over time.
Yeah.
Because that's what we think of today,
but over time the network changes.
In ways that we wouldn't expect.
Like nobody expected polyfill.io to change ownership.
Yeah.
Or CDN, whatever the CDN that was hosting polyfill.io.
Right.
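That polyfill.io swap is exactly what Subresource Integrity is designed to catch. Here's a small sketch, with a placeholder filename, that computes the SRI value for a file; a browser given that integrity attribute will refuse the script if its bytes ever change:

```typescript
// A sketch of computing a Subresource Integrity (SRI) value. SRI is the
// standard defense against a CDN or domain owner silently swapping a file:
// the browser rejects the script if it stops matching the hash in the tag.
// The filename is a made-up placeholder.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

const body = readFileSync("polyfill.min.js");
const digest = createHash("sha384").update(body).digest("base64");

// Use it like:
// <script src="https://cdn.example.com/polyfill.min.js"
//         integrity="sha384-..." crossorigin="anonymous"></script>
console.log(`sha384-${digest}`);
```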
We put some stuff through proxy, basically.
And that kind of does it.
You proxy yourselves,
the gems and some stuff,
and that way it's kind of an
if-it's-there-we-trust-it
kind of thing, right? You know, if you try to pull
something else in, a bundle
install, yarn install, whatever it is,
go get, it goes through
there, and if it's not there, then it kind of
triggers a, well, why are you trying to get something that
isn't in this, you know, it's not blessed yet. It's a proxy that you guys run? Yes. Is this like
an Artifactory kind of thing, where you pull... Yeah, some other, I forget the exact tech, if I'm
honest, but similar to JFrog Artifactory. Yeah, yeah. It's a great idea,
just get yourself layers in between you and the unknown.
I mean, that's
one of the wise practices
for sure.
Well, that's like the,
I guess,
rich man's version
or rich person's version
of vendoring.
It's like the same idea
except for it's
Yeah, you vendor it
to a server.
It's vendoring.
I mean, this has been
the tale as old as time,
basically.
Ruby had it first.
Well, like I said,
what's old is new again.
Yeah, exactly. We're going to go back to all these ideas in some way, shape, or form, I think. We're going back to thin clients,
apparently. So, I mean, I think even that too, you have to have an incident like this to have a
discussion like this, that says these older ideas were probably pretty good. You know, maybe at
the time it was, like, less modern to do it; now it's more modern. So maybe there's... But I suppose, to your
point, Jared, with your meme,
like I deployed software today,
so it's modern, right?
Like when you have a meme
out there somewhere,
there's like.
Oh yeah.
Just mostly a gripe.
Like people always advertise
their software as modern,
which just literally means
that it's just a newer thing.
You know, like it's not a feature.
It's just that you started
coding it six months ago.
Right.
You know.
At some point,
someone's going to start bragging
about how much their software hasn't changed.
Yeah, I think vintage software should make a comeback.
This is classic, this is vintage.
When I was a young gun engineer
and I heard about these banks using COBOL still
and I was like, ah, what losers.
And now I'm like, hey, whatever.
If it works, I can look at my balances
and I've never had an issue and I can always charge my card.
You do you.
Maybe calcified software
has a purpose in the world
where it just gets rarely touched
and we're just happy about that.
I'm leaning that way more and more.
Do we need to keep changing the software? I don't know.
That's not really good for your business
though, Robert.
I mean, if you advocate for that.
Robert's out there.
More incidents.
We need more incidents.
My investors, my board hears that.
They'll be like, what are you doing?
What are you saying, Robert?
Stop right now.
Well, I think even if you have unchanged software,
there's still bound to be incidents of some sort.
I mean, there's still going to be...
No one's going to listen to you, Robert.
No one's going to do that, right?
I recommend that.
Yeah.
Well, this has been fun digging into the details, I think.
You know, it's fun to speculate out loud.
You know, I do want to, again, mention I love Dave Plummer and his channel on YouTube.
He's a great resource.
I always appreciate what he shares.
I probably listened to his video twice, just making sure I kind of understood some of the mechanics behind it, because I really wanted to understand to what degree this software actually operates on Windows, you know, how this incident propagated. You know, we don't know if it was really bad code
or if it was sabotage or if it was some sort of plan.
That's all speculation that we're not trying to really go through here.
But sort of like, hey, if you're out there and you've been affected by this
or you're just curious, you know, go out there and do your own investigations.
Pay attention to what's happening out there.
And I guess we can look forward to George Kurtz, the CEO,
current CEO of CrowdStrike, who was there
at the helm during this incident
to stand before Congress
and explain exactly
what happened. And maybe then we'll know.
Talk about security theater. Right. Until then,
all we can do is speculate what may have happened.
We can, you know,
use the, they're not called dumps. What are they
called? Are they called dumps whenever it's a
kernel panic? Well, you dump the stack.
Yeah.
It's not a stack trace,
because that's like an application kind of thing.
Kernel panic.
Yeah, exactly.
You can examine that.
And there's lots of folks,
there was a famous tweet out there
that made the rounds explaining that,
you know, this one file was updated,
and while it should have had the needed definition in there,
instead it just contained zeros because of a null pointer.
There's all these things like why this actually happened.
But I think in the end, we can just say at scale, software can have massive effects.
And we got to do something about that.
It's a good thing to have scale software, but at the same time, we have to do updates responsibly. Or in this scenario where you have a kernel-level driver,
how do you do what CrowdStrike wants to do with Falcon
but not bypass the security systems?
That's the real question here, specifically for this incident.
I think for others, it's just love your maintainers if it's open source.
If it's not open source, drag them through Congress and make them explain it.
You know, and slap them around a little bit.
You know, otherwise, just do what you can to stay safe.
You know, scrutinize your dependencies, your third parties, etc.
And that's about it for me.
And run Linux on your desktop.
I mean, that's the way.
This is the way.
Write Rust, run Linux,
and you'll be good to go.
And then let all of us know about it.
Once they figure out their audio drivers to come on
this show, it'll be great to hear their experience.
Well, every time we have
a Linux user, we're always happy,
obviously, and then sad.
Because we expect to have
some version of issue because
of drivers.
It's almost unanimous.
Almost unanimous.
Well, thanks so much for having me.
This was a blast.
I think it was a fun topic to talk about and super interesting.
For sure.
Thanks for joining us.
Yeah, Robert.
It's been fun.
Bye, friends.
Bye, Robert.
Well, friends, here we are again at the end of a busy and interesting week in the software world,
which more and more is the whole world.
Do you have thoughts?
Do you have opinions?
I know you do.
We would love to hear them.
Sound off in the comments.
Link in the show notes.
Oh, and stick around, ChangeLog++ members.
This is yet another extended
episode. We love doing these for our most loyal supporters. Oh, and by the way, if you are a
Changelog++ member, maybe sign in to changelog.com using your plus plus email address and see if you
see anything new on your homepage. I won't say more than that for now, but we'll talk details soon enough,
probably on the next Kaizen. Okay, quick thanks again to our partners at Fly.io,
to Breakmaster Cylinder, to Sentry (use code CHANGELOG), and to you, of course, for listening along.
Seriously, we appreciate it. Next week on the Changelog, news on Monday, Joseph Jaxx from OSS
Capital on Wednesday, and Adam is flying solo on Friday, but he has a very special guest,
the author of his favorite book series, the Bobiverse.
Yes, Dennis E. Taylor joins the show.
Have a great weekend.
Leave us a five-star review if you dig our work,
and let's talk again real soon.
So during the main show, I did not ask you about this, nor did we directly reference it, but it was a reference point for me.
You wrote something the same day as this incident, I think, July 19th, 2024:
"Beyond the Headlines: The Unsung Art of Software Outage Management." And rather than
It's better