Command Line Heroes - The One About DevSecOps: Evolving Security and Reliability
Episode Date: November 6, 2018
Bad security and reliability practices can lead to outages that affect millions. It's time for security to join the DevOps movement. And in a DevSecOps world, we can get creative about improving security. Discovering one vulnerability per month used to be the norm. Now, software development moves quickly thanks to agile processes and DevOps teams. Vincent Danen tells us how that's led to a drastic increase in what's considered a vulnerability. Jesse Robbins, the former master of disaster at Amazon, explains how companies prepare for catastrophic breakdowns and breaches. And Josh Bressers, head of product security at Elastic, looks to the future of security in tech. We can't treat security teams like grumpy boogeymen. Hear how DevSecOps teams bring heroes together for better security. These changes mean different things for everyone involved, and we'd love to hear your take. Drop us a line at redhat.com/commandlineheroes, we're listening...
Transcript
On June the 26th, 1991, Washington, D.C., much of Maryland and West Virginia, major portions of my home state, were paralyzed by a massive failure in the public telephone network.
And yet, as technology becomes more sophisticated and network systems more interdependent, the likelihood of recurrent failures increases.
It's not as though there wasn't warning that this would happen. In the early 1990s, 12 million Americans were hit with massive phone network failures.
People couldn't call the hospital.
Businesses couldn't call customers.
Parents couldn't call their daycares.
It was chaos.
And it was also a serious wake-up call. A wake-up call for a country whose infrastructure relied heavily on the computer systems that networked everything.
Those computer networks were growing larger and larger.
And then, when they failed, yeah, they failed big time.
A computer failure caused that phone system crash, this tiny one-line bug in the code.
And today, the consequences of little bugs like that are higher than ever.
I'm Saran Yitbarek, and this is Command Line Heroes, an original podcast from Red Hat.
So, software security and reliability matter more than ever. The old waterfall approach
to development, where security was just tacked on to the end of things, that doesn't cut it anymore.
We're living in a DevOps world where everything is faster, more agile, and scalable in ways they couldn't even imagine back when that phone network crashed. That means our security and reliability standards have to evolve to meet those challenges.
In this episode, we're going to figure out how to integrate security into DevOps, and we're also exploring new approaches to building reliability and resilience into operations. Even after covering all that, we know there's so much more we could talk about.
Because in a DevSecOps world, things are changing fast for both developers and operations.
These changes mean different things, depending on where you're standing.
But this is our take.
We'd love to hear yours too.
So don't be shy if you think we've missed something.
Hit us up online.
All right, let's dig in
and start exploring this brand new territory.
Here's the thing.
Getting security and reliability up to speed,
getting it ready for a DevOps world,
means we have to make a couple key adjustments to the way we work.
Number one, we have to embrace automation.
I mean, think about the logistics of, say, two-factor authentication.
Think of the impossibly huge task that poses.
It's pretty obvious you're not going to solve things by just adding more staff. So that's
number one, embracing automation. And then number two, and this one's maybe less obvious.
It's really changing the culture. So security isn't a boogeyman anymore. I'm going to explain what I mean by
changing the culture later on, but let's tackle these two steps one at a time.
First up, embracing automation.
Once upon a time, app deployment involved a human-driven security review before every single release.
And I don't know if you've noticed, but humans, they can be a little slow.
That's why automation is such a key part of building security into DevOps.
Take, for example, this recent data breach report from Verizon. They found that 81% of all hacking-related breaches involve stolen or weak passwords.
And that's, on the face of it, such a simple problem.
But it's a simple problem at a huge scale.
Like I mentioned before, you're not going to staff your way out of 30 million password issues, right?
The hurdle is addressing that problem of scale, the huge size of it.
And the answer is the same every time.
It's automation, automation.
If you wait for a human being to get involved, it's not going to scale.
Vincent Danen is the director of product security at Red Hat. And over the 20
years he's been at this, he's watched as DevOps created a faster and faster environment. Security
teams have had to race to keep up. When I started, it was a vulnerability per month. And then it
started becoming every other week and then every week. And now we're into, you know, literally finding hundreds of these things every day.
What's interesting here is that Vincent says there are actually more vulnerabilities showing up as security teams evolve, not fewer.
We'll never get to the point where we say, oh, we're secure now, we're done.
Our job is over. It'll always be there.
It's just something that has to be as normal as breathing now.
It turns out what counts as an issue for security and reliability teams
is becoming more and more nuanced.
As we're looking for these things, we're finding more.
And this trend is going to continue as you find new classes of vulnerabilities
and things we maybe didn't think were important or didn't even know they existed before.
We're finding out about these things much faster and there's more of them.
And so the scale kind of explodes, you know, it's knowledge, it's volume of software,
it's number of consumers, all of these things contribute to the growth of security in this
area and the vulnerabilities that we're finding.
Once you see security as an evolving issue, rather than one that gets quote-unquote
solved over time, the case for automation, well, it gets even stronger.
Well, I think with automation, you can integrate this stuff into your development pipelines in a way that is very fast, for one. For two, you don't require human beings to do this effort, right? Computers don't need to sleep, so you can churn through code as fast as your processors will allow, rather than waiting for a human to pore through some maybe rather tedious lines of code to go looking for vulnerabilities.
And then with pattern matching and heuristics, you can actually determine what's vulnerable even at the time of writing the code to begin with. So if you have, like, a plugin for your IDE or your tool that you're using to write your code, it can tell you as you're writing it, like, hey, maybe this looks a little fishy, or you've just introduced a vulnerability, and you can correct these things before you even commit the code.
Security on the move. That's a huge bonus.
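To make that concrete, here's a minimal sketch of the kind of pattern-matching heuristic Vincent is describing. The rules below are hypothetical examples, not anything from Red Hat's tooling; a real scanner or IDE plugin ships far more, and far smarter, checks.

```python
import re
import sys

# Hypothetical heuristics -- a real scanner ships many more, far smarter rules.
RULES = [
    (re.compile(r"\beval\("), "eval() on untrusted input can lead to code execution"),
    (re.compile(r"password\s*=\s*['\"]"), "possible hard-coded credential"),
    (re.compile(r"shell\s*=\s*True"), "shell=True invites command injection"),
]

def scan_file(path: str) -> list:
    """Return human-readable findings for one source file."""
    findings = []
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for lineno, line in enumerate(handle, start=1):
            for pattern, message in RULES:
                if pattern.search(line):
                    findings.append(f"{path}:{lineno}: {message}")
    return findings

if __name__ == "__main__":
    results = [finding for path in sys.argv[1:] for finding in scan_file(path)]
    print("\n".join(results) or "no findings")
    sys.exit(1 if results else 0)
```

An editor plugin runs checks like these on every save, so the feedback arrives while the code is still in front of you; the non-zero exit code means the same script can also run later in a pipeline.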
There's just so much that's coming out every day, every hour, even with continuous integration and continuous delivery.
You write code and it's deployed 10 minutes later. So it's really critical to get that validation of that code automatically prior to it being pushed out.
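One common way to wire that validation in is a small gate script that the CI job runs before anything ships, failing the pipeline if any check fails. This is only a sketch: the scanner names below are assumptions, so substitute whatever tools your pipeline actually uses.

```python
import subprocess
import sys

# Assumed checks -- swap in the scanners and test runners your project really uses.
CHECKS = [
    ["bandit", "-r", "src"],   # static security analysis for Python, if installed
    ["pip-audit"],             # flag dependencies with known vulnerabilities, if installed
    ["pytest", "--quiet"],     # the ordinary test suite still has to pass
]

def run_gate() -> int:
    """Run every check in order; the first non-zero exit blocks the deployment."""
    for cmd in CHECKS:
        print(f"--> {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"gate failed on: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_gate())
```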
A whole breadth of tools are available so we can actually get this done.
Whether it's static code analysis or plugins for your IDE or a whole bunch of other options. We'll share some of our favorites in the show
notes for this episode over at redhat.com slash command line heroes. Once we've got those tools
in place, they help keep security top of mind. The result? DevOps gets reimagined as DevSecOps: security gets baked into the process.
In the same way that developers and operations kind of combined, you took those two disciplines to generate one. Now you have DevOps. Plugging security in with development and operations, I think, is really important because having security
as the afterthought is what makes security so reactive, so expensive, so damaging or
potentially damaging to consumers.
And when you plug that in right at the beginning, you have development being done, security
is in there from start to finish, and the operations work.
Of course, like we mentioned at the top of the episode,
automation is really just one half of a bigger pie. And Vincent gets that. It's not just one
piece. You can't just, you know, throw a tool in your CI/CD pipeline and expect everything to be
okay. There's a whole gamut of different techniques and technologies and behaviors that
are required to produce those ultimate beneficial results that we want to see.
Automation does get us partway there, but we've got to remember the other piece,
that slightly fuzzier piece. Say it with me. The culture piece.
Getting developers and ops both on board so that these issues aren't boogeymen anymore.
We have to change the culture.
And some folks are learning to do that
in the least painful way possible.
With games.
Let's take a swing over to the ops side of things now.
It's so easy to stand up huge infrastructure these days.
But that doesn't mean we should be doing shoddy work.
We should still be hammering on our systems, ensuring reliability, figuring out how to prepare for the unexpected.
That's the mindset Jesse Robbins is working to bring about.
Today, Jesse is the CEO of Orion Labs.
But before that, he was known as the master of disaster over at Amazon.
During his time there, Jesse was pretty much a wizard at getting everybody at least aware of these issues.
And he did it with
something called Game Day. These can involve thousands of employees running through disaster
scenario drills, getting used to the idea of things breaking, and getting intimate with the why and
the how. Here's Jesse and me talking it over, looking especially at how reliability and resilience get built into the operations side.
Very cool.
Okay, so you are known for many things, but one of those things is the exercise Game Day, which you did at Amazon.
What is that?
What's Game Day?
Game Day was a program that I created to test the operational readiness of the most vulnerable systems by breaking things at massive scale.
So if you're a fan of what's called Chaos Monkey now, by the Netflix people and others, Game Day was the name for my program that definitely preceded all of that.
It was really heavily focused on building a culture
of operational excellence,
building the capability to test systems at massive scale
when they're breaking,
learn how they break to improve them,
and then also to build a culture that is capable of responding to
and recovering from incidents and situations.
And it was all modeled and is all modeled after the incident command system,
which is what fire departments use around the world for dealing with incidents of any size.
It was sort of born from...
Crazy side note, Jesse trained to be a firefighter back in 2005.
And that's where he learned this incident command system that ended up inspiring Game Day.
So all the developers doing these disaster scenarios out there,
you've got Jesse's passion for firefighting and emergency management to thank for that.
Okay, back to our chat.
Resilience is the ability of a system, and that includes people and the things that those people build,
to adapt to change, to respond to failures and disturbances.
And one of the best ways to build that, to build a culture that can respond to those types of environments
and really understands how those work is to provide people training exercises. And those exercises can be as simple as something like, you know, rebooting a server or as complicated
as injecting massive scale faults by, you know, turning off entire data centers and
kind of everything in between. And so what a game day is, is first of all a process where you prepare for something by getting an entire organization together and kind of talking about how systems fail and thinking about what human beings know about how they expect failure to happen.
And that exercise by itself is often one of the most valuable parts of kind of the beginning of a game day.
But then you actually run an exercise where you break something.
Could be something big, could be something small, could be something that breaks all
the time.
And when you do that, you're able to study how everyone responds, where things can move
to.
You can see the system breaking, and that might be something that is safe to break, you know, a well-understood component, or it might be something that exposes what we call a latent defect.
Those are those problems hiding in software or in technology or in a system at scale that
we only can find out about when you have an extreme or an unexpected event.
It's really designed to train people and to build systems that you understand
how they're going to work under stress and under strain. And so when I hear game day, it makes me
think, was this a response to something very specific that happened that inspired it? Where
did it come from? So game day started during a period of time where I knew because of my role and because of my unique background as a firefighter and emergency manager that it was important to change the cultural approach from focusing on the idea of preventing failure to instead embracing failure, accepting that failure happens. And part of what inspired that was both my own experience, understanding systems,
how buildings fail and how civic infrastructure fails and how disasters happen and the strain
that that puts on people. And saying, well, if we look at the complexity and operational scale that
we have at the place of employment that I was at, the only way that we're really going to build and change
and become a high-reliability, always-on environment
is truly to embrace the fire service approach,
where we know that failures will happen.
It's not a question of if, it's a question of when.
And then, as my old fire chief would say,
you don't choose the moment.
The moment chooses you.
You only choose how prepared you are when it does.
Oh, that's a good one.
So when you first started doing the game days and thinking about how to be prepared for disaster scenarios,
was everyone on board with this or did you get any pushback?
Everyone thought I was crazy.
So definitely there were people that resisted it. And it's interesting, because there was a really simple approach: you're able to say, look, let's just measure how many minutes of outage there is, how many minutes of downtime my team has, that has this training and operates this way, versus, I don't know, your team, who does not have that and who seems to think that doing this type of training and exercise isn't valuable or isn't important.
And once you do that kind of thing, you basically end up with what I call a compelling event.
So often there'll be an outage or some other thing where the organization suddenly and starkly
realizes, oh my goodness, we can't keep doing things the way that we've been doing them before.
And that becomes the method you use to overcome the skeptics. You use a combination of data and performance information on the one hand,
coupled with metrics and then great storytelling. And then you wait for the big one or the scary
incident that happens. And you say, you know, their whole organization needs this ability if
we're going to operate at web scale or internet scale.
So what I love about this is that it didn't just stay within Amazon.
It spread. A lot of other companies are doing it.
A lot of people have ended up embracing this knowledge and this process to be prepared.
What is next?
How do we continue carrying on the knowledge from game day into
future projects and future companies? I like to talk about it as convergent evolution. So
every large organization that operates on the web has now adopted a version of both the incident
management foundation that I certainly advocated for and has created their own game day testing.
Netflix calls it the Chaos Monkey.
Google has their DiRT program.
Okay.
So what are your hopes and dreams for game day in the future?
What I am excited about, first of all,
is that we are seeing this evolution now from
thinking of silos and thinking of systems as being disconnected to systems being fundamentally
interconnected, interdependent, and built and run by smart people around the world that are
trying to do great and big things. Years ago, when I got my start, caring about operations was a backwater.
It was not an interesting place.
And suddenly we found ourselves being able to propagate the idea that developers and
operations people working together are the only way that meaningful technology gets built
and run in a connected world.
And so my hope for the future
is, number one, we're seeing more and more people embracing these ideas and learning about them,
understanding that when you build something that people depend on, you have an obligation to make
sure that it's reliable, it's usable, it's dependable. It's something that people can use as part of their daily lives. But also, we're seeing a new discipline emerge. It's being studied. There's PhD theses being written on it.
It's being built out constantly. There's books being written. There's all these new resources
that aren't just a couple of people talking at a conference about how they think the world should work. And so my sort of inspirational hope is, one, understand that if you're building software
and technology that people use, you're really becoming part of the civic infrastructure.
And so the set of skills that I've tried to contribute as a firefighter to technology
and the skills that are now emerging that are taking that
so much farther are part of the foundation for, you know, building things that people depend on
every day. Very nice. Oh, that's a great way to end. Thank you so much, Jesse, for your time.
Yeah. Thank you.
In Jesse's vision, exercises like game day or Chaos Monkey are a crucial part of our tech culture growing up.
But they're also crucial for society at large.
And I love that he's putting the stakes that high.
Because he's right.
Our world depends on the work we do.
That much was obvious back in the 90s when telephone networks started crashing.
Modern life as we know it almost ground to a halt.
And there's a duty that goes along with that.
A duty to care about security and reliability.
About the resilience of the things we build.
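For anyone who wants to try a scaled-down drill along the lines Jesse describes, here's a minimal sketch. It assumes a service that exposes an HTTP health endpoint and uses a hypothetical systemd unit name; it breaks one thing on purpose and times how long recovery takes.

```python
import subprocess
import time
import urllib.request

# Hypothetical targets -- point these at a service you are allowed to break.
HEALTH_URL = "http://localhost:8080/healthz"
SERVICE_UNIT = "demo-app.service"  # assumed systemd unit name

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def inject_fault(unit: str) -> None:
    """Break something on purpose: kill the service's processes."""
    subprocess.run(["systemctl", "kill", unit], check=True)

def run_drill() -> float:
    """Inject one fault and measure seconds until the system recovers."""
    assert is_healthy(HEALTH_URL), "service should be healthy before the drill starts"
    inject_fault(SERVICE_UNIT)
    start = time.monotonic()
    while not is_healthy(HEALTH_URL):
        time.sleep(1)
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"recovered in {run_drill():.1f}s")
```

The tooling is the least important part; what matters is running drills like this regularly, watching how people and systems respond, and writing down what you learn before a real failure chooses the moment for you.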
Of course, when it comes to building security into DevOps, we're just getting started.
Security is one of those things. It's so in its infancy as an industry.
That's Josh Bressers. He's the head of product security at a data search software startup called Elastic.
For Josh, even though the computer industry has been maturing for a half century or so,
the kind of security we've been talking about here feels like it just came into its own.
Practically speaking, as what I would say may be a profession,
security is still very new and there's a lot of things we don't understand.
Here's what we do understand, though.
In a DevSecOps world, there are some pretty sweet opportunities to get creative about what security can achieve.
I was recently talking to somebody about a concept where they're using user behavior to decide if a user should be able to access a system.
Everybody has certain behaviors, be it where they're coming from, time of day they're accessing a system, the way they
type, the way they move their mouse. And so that's actually one of those places that I think could
have some very powerful results if we can do it right, where we can pay attention to what someone's
doing. And then let's say I'm acting weird and, you know, I'm weird because I just sprained my
wrist, but, you know, the other end doesn't know that.
And so it might say, all right, something's weird.
We want you to log in with your two-factor auth.
And we're going to also send you a text message or something, right?
And so we've just gone from essentially username and password to something more interesting.
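Here's a toy sketch of that idea, with made-up features, baselines, and thresholds: score how far a login attempt drifts from a user's usual behavior, and ask for a second factor when the score crosses a line instead of blocking outright. The sprained-wrist case shows up here as a changed typing rhythm.

```python
from dataclasses import dataclass

@dataclass
class LoginAttempt:
    hour_of_day: int           # 0-23, local to the user
    country: str               # coarse geolocation of the request
    typing_interval_ms: float  # average delay between keystrokes

# Hypothetical per-user baseline -- in practice this is learned from history.
BASELINE = LoginAttempt(hour_of_day=10, country="CA", typing_interval_ms=120.0)

def risk_score(attempt: LoginAttempt, baseline: LoginAttempt) -> float:
    """Crude anomaly score: each behavioral mismatch adds risk."""
    score = 0.0
    if attempt.country != baseline.country:
        score += 0.5
    if abs(attempt.hour_of_day - baseline.hour_of_day) > 6:
        score += 0.3
    if abs(attempt.typing_interval_ms - baseline.typing_interval_ms) > 60:
        score += 0.3
    return score

def requires_step_up(attempt: LoginAttempt, threshold: float = 0.5) -> bool:
    """Above the threshold, ask for a second factor instead of rejecting outright."""
    return risk_score(attempt, BASELINE) >= threshold

if __name__ == "__main__":
    odd_login = LoginAttempt(hour_of_day=3, country="CA", typing_interval_ms=220.0)
    print("ask for 2FA:", requires_step_up(odd_login))
```

In production, the baseline would be learned per user from history, and the scoring would be a real model rather than three hand-tuned rules.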
And so I think looking at a lot of these problems in new and unique ways is really going to be key.
And in many instances, we're just not
there yet. Getting there requires those two big steps we've been describing. Step one, it's that
automation. So crucial because humans are terrible at doing the same thing over and over again. Fair.
And then we've got step two, the culture. All of us having a stake in security and reliability,
no matter what our job title might say. When most people think of the security team,
they don't think of happy, nice people, right? It's, generally speaking, terrible, grumpy, annoying people who, if they show up, are going to ruin your day. And nobody wants that, right?
IT infrastructure is growing larger and more powerful. Put those two truths together,
and you better live in a world where security gets embraced.
A very DevSecOps world,
where developers and operations are upping their security games,
upping their reliability games.
What I'm talking about is a future
where automation is integrated into every stage,
and everybody's attitudes toward these issues become more holistic.
That's how we're going to keep tomorrow's systems safe.
That's how we're going to keep the phones ringing, the lights on,
all of modern life healthy and strong.
If you pull up Forbes' Global 2000 list, that's the top 2,000 public companies, it turns out a full quarter of them have embraced DevOps.
Integrated, agile workplaces are becoming the rule of the land.
And in a few years, thinking in terms of DevSecOps might become second nature.
We want to go as fast as possible.
But the long game is actually faster when every part of the team is in the race together.
Next episode, we're getting hit by the data explosion.
Humans have entered the zettabyte era.
By 2020, we'll be storing about 40 zettabytes of information on servers that mostly don't even exist yet.
But how are we supposed to make all that data useful?
How do we use high-performance computing and open-source projects to get our data working for us?
We find out in Episode 6 of Command Line Heroes.
And a reminder, all season long, we're working on Command Line Heroes, the game.
It's our very own open-source project, and we've loved watching it all come together.
But we need you to help us finish.
If you hit up redhat.com slash command line heroes,
you can discover how to contribute.
And you can also dive deeper
into anything we've talked about in this episode.
Command Line Heroes is an original podcast from Red Hat.
Listen for free on Apple Podcasts, Google Podcasts, or wherever you do your thing.
I'm Saran Yitbarek.
Until next time, keep on coding.
Hi, I'm Mike Ferris, Chief Strategy Officer and longtime Red Hatter.
I love thinking about what happens next with generative AI.
But here's the thing.
Foundation models alone don't add up to an AI strategy.
And why is that?
Well, first, models aren't one-size-fits-all.
You have to fine-tune or augment these models with your own data,
and then you have to serve them for your own use case.
Second, one-and-done isn't how AI works.
You've got to make it easier
for data scientists, app developers, and ops teams to iterate together. And third, AI workloads demand
the ability to dynamically scale access to compute resources. You need a consistent platform, whether
you build and serve these models on-premise, or in the cloud, or at the edge. This is complex stuff,
and Red Hat OpenShift AI is here to help. Head to redhat.com to see how.