Command Line Heroes - The One About DevSecOps: Evolving Security and Reliability
Episode Date: November 6, 2018
Bad security and reliability practices can lead to outages that affect millions. It's time for security to join the DevOps movement. And in a DevSecOps world, we can get creative about improving security. Discovering one vulnerability per month used to be the norm. Now, software development moves quickly thanks to agile processes and DevOps teams. Vincent Danen tells us how that's led to a drastic increase in what's considered a vulnerability. Jesse Robbins, the former master of disaster at Amazon, explains how companies prepare for catastrophic breakdowns and breaches. And Josh Bressers, head of product security at Elastic, looks to the future of security in tech. We can't treat security teams like grumpy boogeymen. Hear how DevSecOps teams bring heroes together for better security. These changes mean different things for everyone involved, and we'd love to hear your take. Drop us a line at redhat.com/commandlineheroes, we're listening...
Transcript
On June the 26th, 1991, Washington, D.C., much of Maryland and West Virginia, major portions of my home state, were paralyzed by a massive failure in the public telephone network.
And yet, as technology becomes more sophisticated and network systems more interdependent, the likelihood of recurrent failures increases.
It's not as though there wasn't warning that this would happen. In the early 1990s, 12 million Americans were hit with massive phone network failures.
People couldn't call the hospital.
Businesses couldn't call customers.
Parents couldn't call their daycares.
It was chaos.
And it was also a serious wake-up call. A wake-up call for a country whose infrastructure relied heavily on the computer systems that networked everything.
Those computer networks were growing larger and larger.
And then, when they failed, yeah, they failed big time.
A computer failure caused that phone system crash, this tiny one-line bug in the code.
And today, the consequences of little bugs like that are higher than ever.
I'm Saran Yitbarek, and this is Command Line Heroes, an original podcast from Red Hat.
So, software security and reliability matter more than ever. The old waterfall approach
to development, where security was just tacked on to the end of things, that doesn't cut it anymore.
We're living in a DevOps world where everything is faster, more agile, and scalable in ways they couldn't even imagine back when that phone network crashed. That means our security and reliability standards have to evolve to meet those challenges.
In this episode, we're going to figure out how to integrate security into DevOps, and we're also exploring new approaches to building reliability and resilience into operations. Even after covering all that, we know there's so much more we could talk about.
Because in a DevSecOps world, things are changing fast for both developers and operations.
These changes mean different things, depending on where you're standing.
But this is our take.
We'd love to hear yours too.
So don't be shy if you think we've missed something.
Hit us up online.
All right, let's dig in
and start exploring this brand new territory.
Here's the thing.
Getting security and reliability up to speed,
getting it ready for a DevOps world,
means we have to make a couple key adjustments to the way we work.
Number one, we have to embrace automation.
I mean, think about the logistics of, say, two-factor authentication.
Think of the impossibly huge task that poses.
It's pretty obvious you're not going to solve things by just adding more staff. So that's
number one, embracing automation. And then number two, and this one's maybe less obvious.
It's really changing the culture. So security isn't a boogeyman anymore. I'm going to explain what I mean by
changing the culture later on, but let's tackle these two steps one at a time.
First up, embracing automation.
Once upon a time, app deployment involved a human-driven security review before every single release.
And I don't know if you've noticed, but humans, they can be a little slow.
That's why automation is such a key part of building security into DevOps.
Take, for example, this recent data breach report from Verizon. They found that 81% of all hacking-related breaches involve stolen or weak passwords.
And that's, on the face of it, such a simple problem.
But it's a simple problem at a huge scale.
Like I mentioned before, you're not going to staff your way out of 30 million password issues, right?
The hurdle is addressing that problem of scale, the huge size of it.
And the answer is the same every time.
It's automation, automation.
If you wait for a human being to get involved, it's not going to scale.
Vincent Danen is the director of product security at Red Hat. And over the 20
years he's been at this, he's watched as DevOps created a faster and faster environment. Security
teams have had to race to keep up. When I started, it was a vulnerability per month. And then it
started becoming every other week and then every week. And now we're into, you know, literally finding hundreds of these things every day.
What's interesting here is that Vincent says there are actually more vulnerabilities showing up as security teams evolve, not fewer.
We'll never get to the point where we say, oh, we're secure now, we're done.
Our job is over. It'll always be there.
It's just something that has to be as normal as breathing now.
It turns out what counts as an issue for security and reliability teams
is becoming more and more nuanced.
As we're looking for these things, we're finding more.
And this trend is going to continue as you find new classes of vulnerabilities
and things we maybe didn't think were important or didn't even know they existed before.
We're finding out about these things much faster and there's more of them.
And so the scale kind of explodes, you know, it's knowledge, it's volume of software,
it's number of consumers, all of these things contribute to the growth of security in this
area and the vulnerabilities that we're finding.
Once you see security as an evolving issue, rather than one that gets quote-unquote
solved over time, the case for automation, well, it gets even stronger.
Well, I think with automation, you can integrate this stuff into your development pipelines in a way that is very fast, for one. For two, you don't require human beings to do this effort, right? Computers don't need to sleep, so you can churn through code as fast as your processors will allow, rather than waiting for a human to pore through some maybe rather tedious lines of code to go looking for vulnerabilities.
And then with pattern matching and heuristics, you can actually determine what's vulnerable even at the time of writing the code to begin with. So if you have, like, a plugin for your IDE or your tool that you're using to write your code, it can tell you as you're writing it, like, hey, maybe this looks a little fishy, or you've just introduced a vulnerability, and you can correct these things before you even commit the code.
Security on the move. That's a huge bonus.
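To make that concrete, here's a minimal sketch of the kind of pattern-matching heuristic Vincent is describing. The rules below are hypothetical examples, not anything from Red Hat's tooling; a real scanner or IDE plugin ships far more, and far smarter, checks.

```python
import re
import sys

# Hypothetical heuristics -- a real scanner ships many more, far smarter rules.
RULES = [
    (re.compile(r"\beval\("), "eval() on untrusted input can lead to code execution"),
    (re.compile(r"password\s*=\s*['\"]"), "possible hard-coded credential"),
    (re.compile(r"shell\s*=\s*True"), "shell=True invites command injection"),
]

def scan_file(path: str) -> list:
    """Return human-readable findings for one source file."""
    findings = []
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for lineno, line in enumerate(handle, start=1):
            for pattern, message in RULES:
                if pattern.search(line):
                    findings.append(f"{path}:{lineno}: {message}")
    return findings

if __name__ == "__main__":
    results = [finding for path in sys.argv[1:] for finding in scan_file(path)]
    print("\n".join(results) or "no findings")
    sys.exit(1 if results else 0)
```

An editor plugin runs checks like these on every save, so the feedback arrives while the code is still in front of you; the non-zero exit code means the same script can also run later in a pipeline.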
There's just so much that's coming out every day, every hour, even with continuous integration and continuous delivery.
You write code and it's deployed 10 minutes later. So it's really critical to get that validation of that code automatically prior to it being pushed out.
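One common way to wire that validation in is a small gate script that the CI job runs before anything ships, failing the pipeline if any check fails. This is only a sketch: the scanner names below are assumptions, so substitute whatever tools your pipeline actually uses.

```python
import subprocess
import sys

# Assumed checks -- swap in the scanners and test runners your project really uses.
CHECKS = [
    ["bandit", "-r", "src"],   # static security analysis for Python, if installed
    ["pip-audit"],             # flag dependencies with known vulnerabilities, if installed
    ["pytest", "--quiet"],     # the ordinary test suite still has to pass
]

def run_gate() -> int:
    """Run every check in order; the first non-zero exit blocks the deployment."""
    for cmd in CHECKS:
        print(f"--> {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"gate failed on: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_gate())
```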
A whole breadth of tools are available so we can actually get this done.
Whether it's static code analysis or plugins for your IDE or a whole bunch of other options. We'll share some of our favorites in the show
notes for this episode over at redhat.com slash command line heroes. Once we've got those tools
in place, they help keep security top of mind. The result? DevOps gets reimagined as DevSecOps: security gets baked into the process.
In the same way that developers and operations kind of combined, you took those two disciplines to generate one. Now you have DevOps. Plugging security in with development and operations, I think, is really important because having security
as the afterthought is what makes security so reactive, so expensive, so damaging or
potentially damaging to consumers.
And when you plug that in right at the beginning, you have development being done, security
is in there from start to finish, and the operations work.
Of course, like we mentioned at the top of the episode,
automation is really just one half of a bigger pie. And Vincent gets that. It's not just one
piece. You can't just, you know, throw a tool in your CI/CD pipeline and expect everything to be
okay. There's a whole gamut of different techniques and technologies and behaviors that
are required to produce those ultimate beneficial results that we want to see.
Automation does get us partway there, but we've got to remember the other piece,
that slightly fuzzier piece. Say it with me. The culture piece.
Getting developers and ops both on board so that these issues aren't boogeymen anymore.
We have to change the culture.
And some folks are learning to do that
in the least painful way possible.
With games.
Let's take a swing over to the ops side of things now.
It's so easy to stand up huge infrastructure these days.
But that doesn't mean we should be doing shoddy work.
We should still be hammering on our systems, ensuring reliability, figuring out how to prepare for the unexpected.
That's the mindset Jesse Robbins is working to bring about.
Today, Jesse is the CEO of Orion Labs.
But before that, he was known as the master of disaster over at Amazon.
During his time there, Jesse was pretty much a wizard at getting everybody at least aware of these issues.
And he did it with
something called Game Day. These can involve thousands of employees running through disaster
scenario drills, getting used to the idea of things breaking, and getting intimate with the why and
the how. Here's Jesse and me talking it over, looking especially at how reliability and resilience get built into the operations side.
Very cool.
Okay, so you are known for many things, but one of those things is the exercise Game Day, which you did at Amazon.
What is that?
What's Game Day?
Game Day was a program that I created to test the operational readiness of the most vulnerable systems by breaking things at massive scale.
So if you're a fan of what's called Chaos Monkey now, by the Netflix people and others, Game Day was the name for my program that definitely preceded all of that.
It was really heavily focused on building a culture
of operational excellence,
building the capability to test systems at massive scale
when they're breaking,
learn how they break to improve them,
and then also to build a culture that is capable of responding to
and recovering from incidents and situations.
And it was all modeled and is all modeled after the incident command system,
which is what fire departments use around the world for dealing with incidents of any size.
It was sort of born from...
Crazy side note, Jesse trained to be a firefighter back in 2005.
And that's where he learned this incident command system that ended up inspiring Game Day.
So all the developers doing these disaster scenarios out there,
you've got Jesse's passion for firefighting and emergency management to thank for that.
Okay, back to our chat.
Resilience is the ability of a system, and that includes people and the things that those people build,
to adapt to change, to respond to failures and disturbances.
And one of the best ways to build that, to build a culture that can respond to those types of environments
and really understands how those work is to provide people training exercises. And those exercises can be as simple as something like, you know, rebooting a server or as complicated
as injecting massive scale faults by, you know, turning off entire data centers and
kind of everything in between. And so what a game day is, is first of all a process where you prepare for something by getting an entire organization together and kind of talking about how systems fail and thinking about what human beings know about how they expect failure to happen.
And that exercise by itself is often one of the most valuable parts of kind of the beginning of a game day.
But then you actually run an exercise where you break something.
Could be something big, could be something small, could be something that breaks all
the time.
And when you do that, you're able to study how everyone responds, where things can move
to.
You can see the system breaking, and that might be something that is safe to break, you know, a well-understood component, or it might be something that exposes what we call a latent defect.
Those are those problems hiding in software or in technology or in a system at scale that
we only can find out about when you have an extreme or an unexpected event.
It's really designed to train people and to build systems that you understand
how they're going to work under stress and under strain. And so when I hear game day, it makes me
think, was this a response to something very specific that happened that inspired it? Where
did it come from? So game day started during a period of time where I knew because of my role and because of my unique background as a firefighter and emergency manager that it was important to change the cultural approach from focusing on the idea of preventing failure to instead embracing failure, accepting that failure happens. And part of what inspired that was both my own experience, understanding systems,
how buildings fail and how civic infrastructure fails and how disasters happen and the strain
that that puts on people. And saying, well, if we look at the complexity and operational scale that
we have at the place of employment that I was at, the only way that we're really going to build and change
and become a high-reliability, always-on environment
is truly to embrace the fire service approach,
where we know that failures will happen.
It's not a question of if, it's a question of when.
And then, as my old fire chief would say,
you don't choose the moment.
The moment chooses you.
You only choose how prepared you are when it does.
Oh, that's a good one.
So when you first started doing the game days and thinking about how to be prepared for disaster scenarios,
was everyone on board with this or did you get any pushback?
Everyone thought I was crazy.
So definitely there were people that resisted it. And it's interesting, because there was a really simple approach: you're able to say, look, let's just measure how many minutes of outage there is, how many minutes of downtime my team has, that has this training and operates this way, versus, I don't know, your team, who does not have that and who seems to think that doing this type of training and exercise isn't valuable or isn't important.
And once you do that kind of thing, you basically end up with what I call a compelling event.
So often there'll be an outage or some other thing where the organization suddenly and starkly
realizes, oh my goodness, we can't keep doing things the way that we've been doing them before.
And that becomes the method you use to overcome the skeptics. You use a combination of data and performance information on the one hand,
coupled with metrics and then great storytelling. And then you wait for the big one or the scary
incident that happens. And you say, you know, their whole organization needs this ability if
we're going to operate at web scale or internet scale.
So what I love about this is that it didn't just stay within Amazon.
It spread. A lot of other companies are doing it.
A lot of people have ended up embracing this knowledge and this process to be prepared.
What is next?
How do we continue carrying on the knowledge from game day into
future projects and future companies? I like to talk about it as convergent evolution. So
every large organization that operates on the web has now adopted a version of both the incident
management foundation that I certainly advocated for and has created their own game day testing.
Netflix calls it the Chaos Monkey.
Google has their DiRT program.
Okay.
So what are your hopes and dreams for game day in the future?
What I am excited about, first of all,
is that we are seeing this evolution now from
thinking of silos and thinking of systems as being disconnected to systems being fundamentally
interconnected, interdependent, and built and run by smart people around the world that are
trying to do great and big things. Years ago, when I got my start, caring about operations was a backwater.
It was not an interesting place.
And suddenly we found ourselves being able to propagate the idea that developers and
operations people working together are the only way that meaningful technology gets built
and run in a connected world.
And so my hope for the future
is, number one, we're seeing more and more people embracing these ideas and learning about them,
understanding that when you build something that people depend on, you have an obligation to make
sure that it's reliable, it's usable, it's dependable. It's something that people can use as part of their daily lives. But also, we're seeing a new discipline emerge. It's being studied. There's PhD theses being written on it.
It's being built out constantly. There's books being written. There's all these new resources
that aren't just a couple of people talking at a conference about how they think the world should work. And so my sort of inspirational hope is, one, understand that if you're building software
and technology that people use, you're really becoming part of the civic infrastructure.
And so the set of skills that I've tried to contribute as a firefighter to technology
and the skills that are now emerging that are taking that
so much farther are part of the foundation for, you know, building things that people depend on
every day. Very nice. Oh, that's a great way to end. Thank you so much, Jesse, for your time.
Yeah. Thank you.
In Jesse's vision, exercises like game day or Chaos Monkey are a crucial part of our tech culture growing up.
But they're also crucial for society at large.
And I love that he's putting the stakes that high.
Because he's right.
Our world depends on the work we do.
That much was obvious back in the 90s when telephone networks started crashing.
Modern life as we know it almost ground to a halt.
And there's a duty that goes along with that.
A duty to care about security and reliability.
About the resilience of the things we build.
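For anyone who wants to try a scaled-down drill along the lines Jesse describes, here's a minimal sketch. It assumes a service that exposes an HTTP health endpoint and uses a hypothetical systemd unit name; it breaks one thing on purpose and times how long recovery takes.

```python
import subprocess
import time
import urllib.request

# Hypothetical targets -- point these at a service you are allowed to break.
HEALTH_URL = "http://localhost:8080/healthz"
SERVICE_UNIT = "demo-app.service"  # assumed systemd unit name

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def inject_fault(unit: str) -> None:
    """Break something on purpose: kill the service's processes."""
    subprocess.run(["systemctl", "kill", unit], check=True)

def run_drill() -> float:
    """Inject one fault and measure seconds until the system recovers."""
    assert is_healthy(HEALTH_URL), "service should be healthy before the drill starts"
    inject_fault(SERVICE_UNIT)
    start = time.monotonic()
    while not is_healthy(HEALTH_URL):
        time.sleep(1)
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"recovered in {run_drill():.1f}s")
```

The tooling is the least important part; what matters is running drills like this regularly, watching how people and systems respond, and writing down what you learn before a real failure chooses the moment for you.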
Of course, when it comes to building security into DevOps, we're just getting started.
Security is one of those things. It's so in its infancy as an industry.
That's Josh Bressers. He's the head of product security at a data search software startup called Elastic.
For Josh, even though the computer industry has been maturing for a half century or so,
the kind of security we've been talking about here feels like it just came into its own.
Practically speaking, as what I would say may be a profession,
security is still very new and there's a lot of things we don't understand.
Here's what we do understand, though.
In a DevSecOps world, there are some pretty sweet opportunities to get creative about what security can achieve.
I was recently talking to somebody about a concept where they're using user behavior to decide if a user should be able to access a system.
Everybody has certain behaviors, be it where they're coming from, time of day they're accessing a system, the way they
type, the way they move their mouse. And so that's actually one of those places that I think could
have some very powerful results if we can do it right, where we can pay attention to what someone's
doing. And then let's say I'm acting weird and, you know, I'm weird because I just sprained my
wrist, but, you know, the other end doesn't know that.
And so it might say, all right, something's weird.
We want you to log in with your two-factor auth.
And we're going to also send you a text message or something, right?
And so we've just gone from essentially username and password to something more interesting.
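Here's a toy sketch of that idea, with made-up features, baselines, and thresholds: score how far a login attempt drifts from a user's usual behavior, and ask for a second factor when the score crosses a line instead of blocking outright. The sprained-wrist case shows up here as a changed typing rhythm.

```python
from dataclasses import dataclass

@dataclass
class LoginAttempt:
    hour_of_day: int           # 0-23, local to the user
    country: str               # coarse geolocation of the request
    typing_interval_ms: float  # average delay between keystrokes

# Hypothetical per-user baseline -- in practice this is learned from history.
BASELINE = LoginAttempt(hour_of_day=10, country="CA", typing_interval_ms=120.0)

def risk_score(attempt: LoginAttempt, baseline: LoginAttempt) -> float:
    """Crude anomaly score: each behavioral mismatch adds risk."""
    score = 0.0
    if attempt.country != baseline.country:
        score += 0.5
    if abs(attempt.hour_of_day - baseline.hour_of_day) > 6:
        score += 0.3
    if abs(attempt.typing_interval_ms - baseline.typing_interval_ms) > 60:
        score += 0.3
    return score

def requires_step_up(attempt: LoginAttempt, threshold: float = 0.5) -> bool:
    """Above the threshold, ask for a second factor instead of rejecting outright."""
    return risk_score(attempt, BASELINE) >= threshold

if __name__ == "__main__":
    odd_login = LoginAttempt(hour_of_day=3, country="CA", typing_interval_ms=220.0)
    print("ask for 2FA:", requires_step_up(odd_login))
```

In production, the baseline would be learned per user from history, and the scoring would be a real model rather than three hand-tuned rules.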
And so I think looking at a lot of these problems in new and unique ways is really going to be key.
And in many instances, we're just not
there yet. Getting there requires those two big steps we've been describing. Step one, it's that
automation. So crucial because humans are terrible at doing the same thing over and over again. Fair.
And then we've got step two, the culture. All of us having a stake in security and reliability,
no matter what our job title might say. When most people think of the security team,
they don't think of happy, nice people, right? It's, generally speaking, terrible, grumpy, annoying people who, if they show up, are going to ruin your day. And nobody wants that, right?
IT infrastructure is growing larger and more powerful. Put those two truths together,
and you better live in a world where security gets embraced.
A very DevSecOps world,
where developers and operations are upping their security games,
upping their reliability games.
What I'm talking about is a future
where automation is integrated into every stage,
and everybody's attitudes toward these issues become more holistic.
That's how we're going to keep tomorrow's systems safe.
That's how we're going to keep the phones ringing, the lights on,
all of modern life healthy and strong.
If you pull up Forbes' Global 2000 list, that's the top 2,000 public companies, it turns out a full quarter of them have embraced DevOps.
Integrated, agile workplaces are becoming the rule of the land.
And in a few years, thinking in terms of DevSecOps might become second nature.
We want to go as fast as possible.
But the long game is actually faster when every part of the team is in the race together.
Next episode, we're getting hit by the data explosion.
Humans have entered the zettabyte era.
By 2020, we'll be storing about 40 zettabytes of information on servers that mostly don't even exist yet.
But how are we supposed to make all that data useful?
How do we use high-performance computing and open-source projects to get our data working for us?
We find out in Episode 6 of Command Line Heroes.
And a reminder, all season long, we're working on Command Line Heroes, the game.
It's our very own open-source project, and we've loved watching it all come together.
But we need you to help us finish.
If you hit up redhat.com slash command line heroes,
you can discover how to contribute.
And you can also dive deeper
into anything we've talked about in this episode.
Command Line Heroes is an original podcast from Red Hat.
Listen for free on Apple Podcasts, Google Podcasts, or wherever you do your thing.
I'm Saran Yitbarek.
Until next time, keep on coding.
Hi, I'm Mike Ferris, Chief Strategy Officer and longtime Red Hatter.
I love thinking about what happens next with generative AI.
But here's the thing.
Foundation models alone don't add up to an AI strategy.
And why is that?
Well, first, models aren't one-size-fits-all.
You have to fine-tune or augment these models with your own data,
and then you have to serve them for your own use case.
Second, one-and-done isn't how AI works.
You've got to make it easier
for data scientists, app developers, and ops teams to iterate together. And third, AI workloads demand
the ability to dynamically scale access to compute resources. You need a consistent platform, whether
you build and serve these models on-premise, or in the cloud, or at the edge. This is complex stuff,
and Red Hat OpenShift AI is here to help. Head to redhat.com to see how.