Screaming in the Cloud - The Darth Vader of AWS with Eric Brandwine

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. couldn't find the tools they needed, so they built one. Sounds easy enough. No one's ever tried that before, except they're good at it. Their platform allows teams to create consistency for the entire incident response lifecycle so that your team can focus on fighting fires faster, from alert handoff to retrospectives and everything in between. Things like, you know,

Starting point is 00:01:01 tracking, communicating, reporting, all the stuff no one cares about. FireHydrant will automate processes for you so you can focus on resolution. Visit firehydrant.io to get your team started today and tell them I sent you, because I love watching people wince in pain. This episode is sponsored in part by ChaosSearch. As basically everyone knows, trying to do log analytics at scale with an ELK stack is expensive, unstable, time-sucking, demeaning, and just basically all-around horrible. So why are you still doing it, or even thinking about it, when there's ChaosSearch? ChaosSearch is a fully managed, scalable log analysis service that lets you add new workloads in minutes and easily retain weeks, months, or years of data. With ChaosSearch, you store, connect, and analyze, and you're done.

Starting point is 00:01:53 The data lives and stays within your S3 buckets, which means no managing servers, no data movement, and you can save up to 80% versus running an ELK stack the old-fashioned way. It's why companies like Equifax, HubSpot, Klarna, Alert Logic, and many more have all turned to ChaosSearch. So, if you're tired of your ELK stack falling over before it suffers, or of having your log analytics data retention squeezed by the cost, then try ChaosSearch today and tell them I sent you. To learn more, visit chaossearch.io. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Eric Brandwine, who's a distinguished engineer and VP at AWS.

Starting point is 00:02:37 Eric, welcome to the show. Hi, Corey. Thanks for having me. So what is it you actually do at AWS? Every time I've mentioned your name to folks in passing, they get this sort of stricken look. And all I can assume is that you're basically Darth Vader. Darth Vader, I think, is a slightly unfair characterization,

Starting point is 00:03:00 perhaps not wholly unfair. Because he had a redemption arc? Well, there were only three movies. Exactly. If they'd made a prequel or a sequel, it would have probably been a really good movie. Shame they never did. There were three Indiana Jones movies. There were three Star Wars movies. And yeah, at the end of the third movie, it kind of sort of worked its way out. Every year at re:Invent, Andy Jassy gets up on stage and he says security is job zero. And I love it when he says this because one, he's counting from zero, which is how all good computer scientists count. But two, he's very publicly saying how important security is to what we do. And this isn't just Andy getting

Starting point is 00:03:40 up on stage at re:Invent. When he comes back to Seattle, this is the behavior that he models for his leaders. And so, unfortunately, my first interaction with a lot of our employees is during a security event. And so, I have met a good number of my coworkers via Sev2 tickets. Sev2s are pager tickets. And it's not the best way to make friends and influence people. And so you've got to put a lot of work into building those relationships and making sure you reach out and contact people after the dust has settled. But I would say that my primary job is making sure that we hold the security bar high, relentlessly so. The idea of security being job zero, when I first heard of that,

Starting point is 00:04:28 my instinctive reaction was, okay, we made a list of all the things we have to do. Oh, crap, we forgot security. So we'll do what everyone does with security, bolt it on at the end and put it at the top so we don't have to renumber anything. And it's a funny joke, and it's great to make the cheap shots and whatnot, but let's be very clear here. It is blindingly apparent to everyone who has used AWS in depth that security is baked in. You cannot bolt it on after the fact and expect to see the level of success that AWS has from a security perspective. I want to be explicitly clear on this. I have a laundry list of grievances around AWS, mostly around service naming, but I have never had a problem with how seriously you folks take security. That is excellent to hear. I'm glad that that's coming through.

Starting point is 00:05:17 It's a no-win game when it comes to security, because either it's in the way or it's invisible if it's done well, but no one ever stops and says, I really like the security there. In fact, the only time people really seem to talk about it is after they've had a data breach in public and they're saying, we take security seriously, right after it was exquisitely clear that they did not take security seriously. Well, I've heard that perspective many times, and I disagree with it because you're unsure of your security when you're surrounded by ambiguity. It's deeply unsettling. And the effect of that, the materialization of that in the business is friction and low

Starting point is 00:05:57 velocity. And when you work with someone, whether it's one of our service teams or one of our customers, and you give them the data that they need and the tools to manage that data so that they can understand the security risks that they're facing and so that they can make data-driven informed decisions about how quickly they want to move, about which risks they need to mitigate, about which risks they can accept, they become way more comfortable. And that leads to greater velocity for the business. It leads to greater confidence for the leadership, and it leads to greater delivery to customers,

Starting point is 00:06:32 which is the reason that the business is there. And so over my time at AWS, I've seen security go from something that nobody talked about to something that could only be a deficit to something that is actually an enabler for us and for our customers. I think you're one of the first people I've spoken to in my life who ever pushed back on the idea of, well, obviously security wasn't the right answer here, yada, yada, yada.

Starting point is 00:06:58 I think you're onto something, though. It's always a spectrum between usability and security, and there are trade-offs that have to get made. And on some level, given what AWS is and who your customers are, you can't ever get it wrong in a serious, big way because you don't get a second bite at that apple. Every cloud doubter on the planet is going to come back with saying, see, see, I told you, I told you.

Starting point is 00:07:24 And that's kind of weird. It's a very high risk story. And so far you've delivered. There haven't been these horrifying nightmare things that the on-prem sysadmin grumpy type, of which I used to be one, was long predicting. It's a track record due to, from what I can only imagine, a thorough defense-in-depth position. What am I missing? Absolutely. It's something we spend a tremendous amount of time and energy on. It does not happen by accident. And we're constantly looking for the maximum leverage we can get out of any defensive mechanism. But I don't think that the world is as Boolean as you're describing here.

Starting point is 00:08:08 When security events happen, they're absolutely serious. We take them very seriously. We respond immediately. But if you look at the things that have happened in the world at large, even some of the large newsmaking security issues that we've had recently, it's never a complete company extinction. It's never the end of a line of business. It's definitely disruptive to roadmaps. It is damaging to customer trust. It's not something to be taken casually, but it's not like you make a single misstep and

Starting point is 00:08:37 it's all over. And I think it's really important to reinforce that. I'm regularly humbled by the amount of customer trust that we've earned. And I don't take that lightly. I'm not going to play casually with it. But if you're caught up in the belief that a single misstep is going to lead to business extinction, then you're going to be paralyzed. You're going to be unable to move forward. You're going to be unable to objectively consider the risks. And I see security as highly parallel to availability. Availability is something that every service provider thinks about and has deep experience in. And we all think about the risks that we face. It is possible to build a system that is incredibly resilient. You run it in multiple availability zones.

Starting point is 00:09:26 You run it in multiple regions. You build two completely separate implementations of it using two different languages and two different runtimes. And you completely don't share fate across anything. And you can build an incredibly robust system. Almost no one does that because it's expensive. And the business, either implicitly or explicitly, is making decisions about how much they're willing to spend on availability

Starting point is 00:09:50 and which availability risks they're willing to take. And all of these services have some availability risks, and sometimes they have availability events. And security is exactly the same. We are surrounded by security risks. No business is without security risk. And so the way to succeed here is to as objectively as possible think about those risks, mitigate the ones that aren't acceptable, prepare to mitigate the ones that are acceptable if it turns out that your analysis was wrong, and to move the business forward. The problem that I see is that what you've just said is first accurate. I disagree with absolutely none of it, but it's also nuanced. It doesn't fit in easy soundbites. It doesn't fit in tweets, which is my primary form of shit posting. It requires a level of maturity on the part of the listener to understand the nuances. Why is it such a hard concept to convey repeatedly and well? I think there are two things that make security difficult. If you look at availability,

Starting point is 00:11:00 we have models for availability. What are the odds that a backhoe is going to cut this fiber? What are the odds that a bit is going to flip in this DIMM? What are the odds that a human is going to violate operational procedures and push code that wasn't completely tested? And we have models for that. And you build this chain of events, and you've got some idea of the likelihood of each of these events, and you multiply through, and you come up with some level of assurance that this is an acceptable risk to take or this is a risk that we need to mitigate this much. We need to drive the likelihood of this

Starting point is 00:11:34 event down below this threshold. And when you're dealing with security, you're dealing with a motivated human adversary. There's some reason, and it may be, you know, just kids out for the lulz. They're rattling all the doors down the hallway. And if they happen to find yours, you might have an issue. But in general, you're dealing with a motivated human adversary. And at that point, probabilities go out the window. It's wildly unlikely that this event followed by this event followed by this event are going to happen unless there's a human at the keyboard making them happen. And I've found a lot of engineers shy away from that kind of thinking. And I don't have an explanation for that. I don't know why. But the idea that you're basically playing a blind game of chess with an unknown adversary is unsettling to them.
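
[Editor's sketch] The availability model described above (estimate the likelihood of each event in a chain, multiply through, compare the result against a threshold) might look roughly like the following. It assumes the events are independent, which is exactly the assumption that stops holding once a human adversary is driving them. The event names, likelihoods, and threshold are invented for illustration and do not come from the conversation.

```python
from math import prod


def chain_probability(event_probabilities):
    """Probability that every event in an independent chain occurs."""
    return prod(event_probabilities)


def needs_mitigation(event_probabilities, threshold):
    """True when the chained likelihood exceeds the acceptable threshold."""
    return chain_probability(event_probabilities) > threshold


# Hypothetical yearly likelihoods, invented for illustration only.
chain = {
    "backhoe cuts the fiber": 0.01,
    "redundant path is down at the same time": 0.05,
    "operator pushes untested failover code": 0.1,
}

likelihood = chain_probability(chain.values())
print(f"chained likelihood: {likelihood:.6f}")
print("mitigate" if needs_mitigation(chain.values(), 1e-4) else "accept")
```

Lowering any single factor lowers the product, which is the sense in which a team can "drive the likelihood of this event down below this threshold."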

Starting point is 00:12:22 And of course, there's more than one adversary and only one of them. And despite what you say, there is the perception that security is a, you only get to fail once, and then it's all over. So you have to be confident in what you're doing. And this is one of the things I love about the culture at AWS. It is an incredibly important thing. If you ask anyone in AWS security what my favorite word is, they will immediately respond, escalate. We have a culture of aggressive escalation in Amazon in general, but definitely within AWS. And an escalation is not a vote of no confidence. It's not me saying that you're bad at your job and I don't trust you and I'm going to go get

Starting point is 00:13:04 a second opinion. I'm going to grab your boss because clearly you're incompetent. Yeah, that is in some cultures how it's perceived. Not when AWS does it, when people in those environments do it. That is correct. That is not what we're doing. We're saying, I don't think we have the right decision makers in the room. And rather than getting caught around the axle and having a repetitive conversation that doesn't converge or having a groundhog day meeting where we have the same document and the same argument again and again and again, we're going to get the right decision makers in the room and we're going to make high quality, high velocity decisions. And so if I'm uncomfortable

Starting point is 00:13:41 with something, I know that I can pick up the phone and I can get a hold of literally any leader in the company. And they trust me. You know, I'm not going to call Andy Jassy because something went bump in the night and I'm scared. But if I need to get his attention, I know that I can get his attention. And I know that he will listen to what I have to say. And so given that, I know that if there's a decision that I'm uncomfortable with, if there's a path forward that's unclear, I can go get high-judgment people that I trust to help me with that decision. And then when we make that decision, it's made with much higher confidence

Starting point is 00:14:17 and that enables me to continue to do my job, to continue to stare into the ambiguity of security. The other thing I think that makes security different from other disciplines is availability events happen much more frequently than security events. And so we just have a larger data set, a larger training set. And so the humans that have to deal with availability issues have dealt with them way more often than the humans that need to deal with large-scale security issues. And it's a much harder problem to quantify. And I think that's one of the things that the cloud makes uniquely possible. I've spent a lot of time in security in multiple positions, and I have never had access to the data and the tools that I have access to here at AWS. Between DNS logging and flow logging and CloudTrail and all of the other data sources that we have, the amount of visibility that I have, the ability to reconstruct the past, to set up alarming, and then the tools to deal

Starting point is 00:15:25 with this data, not just, you know, S3 to host it and all of the machine learning and analytics tools, but things like Lambda, where setting up alarming on a new condition is the job for an engineer for an hour, not a major system design. It has completely changed the way that I think about security and the way the team thinks about security. Something to emphasize is you're able to do all of that and have that visibility from the hypervisor and network perspective, but not from within the customer environment. And the fact that you could achieve all of this without effectively forcing your customers to make a privacy or data security trade-off is sort of its own minor miracle,

Starting point is 00:16:07 from where I sit. So I am very happy with how far we've gotten with the data sources that we have. I'm very impressed with the team and what they've managed to accomplish. One of the things that we think about, we're surrounded by constraints. I mean, that's the nature of all human endeavors. And so we have limited time, we have limited money, we have limited human resources, and the human resources are the biggest constraint. Clueful engineers are a hot commodity, and so every engineer hour is precious. And making sure that we allocate those, not optimally, because then you wind up spending a lot of time optimizing and not actually delivering, but acceptably optimally is really

Starting point is 00:16:52 important. And you look at the leverage, at the coverage you're going to get for an invested engineer hour. And something like flow logs was expensive to build, and analyzing flow logs is expensive to build as well. But every single thing in AWS talks IP. You can't get into or out of an EC2 instance without talking IP. And so flow logs gives us ubiquitous coverage, literally 100% coverage. Every packet is accounted for. And that's huge. It doesn't matter what version of the kernel you're running. It doesn't matter what operating system you're running. It doesn't

Starting point is 00:17:30 matter if you're playing with the latest container micro operating system that we don't have support for. Eventually, it's going to turn into IP packets and it's going to wind up in the flow logs. And so that's one of the things that we consider when we decide where to invest. And that sort of ubiquitous coverage is incredibly valuable. Those of us who are doing things that are, how do I put it, not particularly serious in an AWS environment where, for example, I'm building a Lambda function to wind up taking the status page and make it sarcastic and worse. And I'm having trouble with it. It's irritating on some level where I'm not able to push a button and grant support access into the environment

Starting point is 00:18:10 to look at these things because it's a toy app and I don't care. And it's easy to lose sight of the fact that, yeah, it doesn't matter if it's a toy app that's doing some nonsense like that or a bank that is doing something that is incredibly sensitive and valuable and regulated, I get the same level of protection as those workloads. And that's a powerful thing,

Starting point is 00:18:33 though it's, I admit, easy to lose sight of that when it's two o'clock in the morning and I just want the funny joke to work. I hear you. And for me, this is one of the most enticing challenges of working at AWS. We don't have grades of service. We don't have different levels of complexity. We have a single suite of services that we offer to our customers. The novice customer that reads a blog post and wants to try something out is going to use the same EC2, the same Lambda, the same S3, the same IAM as our most sophisticated government or financial services customers. And in fact, that novice customer may themselves work for one of these very demanding large customers. And this may be their first foray into AWS. And so today,

Starting point is 00:19:28 they're a one instance, one Lambda, one bucket kind of customer, but they're going to evolve over time into one of these very sophisticated, very demanding customers. And so there's this continuum here. And you can't tell the customer, I'm sorry, that was great. I'm so happy that you liked that. In order to move to the next level, you need to shut everything down, pack it up, and move it over here to the much more richly featured, complex cloud. You have to be able to accommodate the getting started use case and the mildly more complicated use case and the early production use case and all the way on through full corporate governance, multiple accounts, organizations, security audits, compliance audits, et cetera, in a single suite of services. And I don't think we've got it perfect. I don't

Starting point is 00:20:15 think we'll ever get it perfect. But figuring out how to accommodate that entire spectrum of use cases in a single service and to grow with your customers and to enable them to tackle complexity incrementally as it becomes meaningful to them is honestly my favorite part of designing a service. The thing that continually eludes me is I accept as fact, because you've clearly demonstrated it, that you can handle, for example, the security in all of its sharp and difficult edges around things like an EC2 instance, talking to RDS, and then storing something in an S3 bucket. That makes sense to me. I don't know how you did it, but you clearly have done it. But then you wind up with the almost Cambrian explosion of higher level AWS services that are

Starting point is 00:21:02 in machine learning. And, hey, we have this thing that talks to satellites in orbit. And, oh, there's this other thing that's Lookout for Equipment, which is apparently named after a sign on the factory floor somewhere. And all of those things in all those different directions have the same level of security guarantee, despite what is in many cases a nearly completely alien workflow compared to what the historical expertise has been aimed at. At least that's what it seems like from the outside. Is that accurate? Is there something fundamental that I'm missing? Or is this just another demonstration of Amazon doing its operational excellence thing? This is my favorite thing about security as opposed to designing an AWS service is you have someone come to you and, for example, they say, we would like to have a farm of iOS and Android

Starting point is 00:21:56 devices that mobile developers can use to test their applications. And they're going to be awesome because they're going to be located right in our data centers, right next to the EC2 instances that they're using for their development work. And you go to the bookshelf and you pull down the big binder of policy, ask anyone in AWS security what my least favorite word is, and they'll tell you policy. And the policy is you're not allowed to have mobile devices in the data center. You're not allowed to have cameras. You're not allowed to have Bluetooth. You're not allowed to have Wi-Fi. And so you run the flowchart that's in the policy,

Starting point is 00:22:29 and the answer is clearly no. That is obviously the wrong answer. The right answer is, wow, that sounds cool. I bet our customers would love that. Let's figure out how to do it. Which leads to the next question, which is how? I have no idea. I have never built a device farm before, but we're going to figure it out. And so we go and we find people that have the specific

Starting point is 00:22:53 expertise that's necessary. But there are patterns that crop up over and over and over again. Multi-tenancy is really challenging, but it's an acquirable skill. Capacity management is really hard, but it's something that you can build expertise in. And so we have a whole bunch of the fundamental building blocks lying around in different parts of the organization. It's just a matter of getting the specific knowledge necessary to apply to that domain, whether it's Device Farm or Ground Station or whatever absolutely insane idea our service teams are going to come up with next that's going to delight customers. And it's these crazy ideas, the ones that prima facie seem absolutely ludicrous,

Starting point is 00:23:36 that wind up being really, really valuable to our customers and totally feasible. I would be remiss if I didn't make a feature request while I have you in a circumstance in which you can't possibly say no. Now, let me preface this with I have never yet come to AWS with a feature request and gotten a response of, holy crap, we never thought of that. The answer is always, the reason we can't do it, quite like you're thinking, is nuanced and complicated. And a couple of times I've been taken down that path, and yeah, there are dragons everywhere, and computers are awful, is what I take away from it. But IAM is one of those really, how do I put this, esoteric things for an awful lot of people. It's easier to just grant access to everything, and then in turn, later we'll go

Starting point is 00:24:26 back and fix that. Yeah, 10 years later, it doesn't work that way. We all write terrible things, and we lie to ourselves and others about what we're going to be able to come back and do. It feels like there's an opportunity to build almost a warn-if-reject style IAM approach, where in a test environment, and please only use this in test environments, you could have run a Lambda function, for example, through its paces, and it looks at what function it was able to use. It's allowed to do basically everything, and then it spits out a narrowly scoped down approach. This is a sort of thing that people have been asking for for a long time, but to my understanding, the closest we've gotten is the IAM Access Analyzer. Is that a reasonable customer request?

Starting point is 00:25:07 Is there something that winds up getting missed somewhere when people are asking for this? Or is this one of the ridiculously rare, wow, no one ever mentioned that to us. We'll get right on it moments. I hate to disappoint you, Corey, but this is not the first time we've had this conversation with a customer. Well, I am reassured by that, if it helps. So I think that things like IAM Access Analyzer are our preferred path here. And I think that over time, IAM Access Analyzer will evolve to be more closely that kind of shrink wrap that you describe.

Starting point is 00:25:46 But what we've often found is that in order to get the right shrink wrap policy, you have to exercise all of the functionality of that Lambda function or whatever resource it is that you're attempting to shrink wrap. And if you miss any branches, and in particular, you often miss the error branches, and there are actions that your code takes when things aren't working well that are incredibly important to the survivability of your application. And so it turns out that everything's running fine for a long time. Then there's some sort of failure. It's a failure that didn't occur while you were running in test mode to generate the shrink-wrap policy.
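
[Editor's sketch] The shrink-wrap idea, and the way it misses error branches, can be illustrated with a small example. This is not how IAM Access Analyzer works, and the exact-match check below is a deliberate simplification of real IAM evaluation. The recorded actions are hypothetical, and the helper names `shrink_wrap_policy` and `is_allowed` are invented for this sketch.

```python
import json


def shrink_wrap_policy(observed_actions):
    """Build an allow-only policy from the API actions seen during a test run."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(set(observed_actions)),
                # Simplification: a real policy would scope resources too.
                "Resource": "*",
            }
        ],
    }


def is_allowed(policy, action):
    """Exact-match check only; real IAM evaluation is far richer."""
    return any(
        stmt["Effect"] == "Allow" and action in stmt["Action"]
        for stmt in policy["Statement"]
    )


# Hypothetical calls recorded while only the happy path ran in test.
observed = ["s3:GetObject", "s3:PutObject", "s3:GetObject"]
policy = shrink_wrap_policy(observed)
print(json.dumps(policy, indent=2))

# The error branch never ran in test, so its action is absent from the policy.
print(is_allowed(policy, "s3:PutObject"))
print(is_allowed(policy, "sns:Publish"))
```

The second check is the latent failure: the generated policy looks correct until the first real error, at which point the failure notification itself is denied.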

Starting point is 00:26:25 And your code, following exactly what you wrote, says, oh, no, I have to post to this SNS topic in order to let them know that I've had a failure. And it can't because that wasn't included in the policy. And that kind of latent failure is in some ways worse than an overscoped policy. And so there's a balancing act here, and it winds up, as you said, being nuanced and complicated in practice. And this is one of the philosophies that we try and help our customers and our service teams understand, is that you want to do successive refinement here. The tighter you make the policy, the closer to least privilege you get, the more work you're going to have to do with that policy. You're

Starting point is 00:27:11 going to have to spend more time. And, you know, in the fully realized corporate governance version of this, there's going to be some other team that has to review your policy changes and approve them. And if you've got a really, really tight policy that allows exactly and only the things that you need, and then you add a feature and that feature happens to use a new SQS queue or take some new feature of S3 and requires yet another API call that's not currently allowed, then you have to go through this whole process of getting this approval and doing the review and making sure that it's acceptable. And so as your applications mature, you want the policies to get tighter and tighter.
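
[Editor's sketch] One way to picture successive refinement is to line up a few stages of the same application's policy, loosest first, and ask which stages would force a policy change (and therefore a review) when a new feature needs a new API call. The stage names and action lists are invented for illustration. The wildcard matching uses Python's `fnmatch`, which only approximates IAM's own matching rules.

```python
from fnmatch import fnmatchcase

# Hypothetical stages of one application's allowed actions, loosest first.
STAGES = {
    "prototype": ["*"],
    "early production": ["s3:*", "sns:Publish"],
    "mature": ["s3:GetObject", "s3:PutObject", "sns:Publish"],
}


def allows(patterns, action):
    """Simplified wildcard match; IAM treats action names case-insensitively."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def stages_needing_review(new_action):
    """Stages where a feature calling new_action forces a policy change."""
    return [
        name for name, patterns in STAGES.items()
        if not allows(patterns, new_action)
    ]


print(stages_needing_review("s3:DeleteObject"))
print(stages_needing_review("sqs:SendMessage"))
```

The tighter the stage, the more new features trip the review process, which is the cost side of the trade-off described above.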

Starting point is 00:27:51 You want the restrictions on changes to have a higher and higher bar, not just for security reasons, but for availability reasons. The thing that you're playing around with on your own personal time, if it has a complete outage, no one's even going to notice. You might not notice. That production app that your customers are depending on, if it has an outage, everyone's going to notice. And so you want to perform successive refinement here where you keep making the policies tighter, you keep making the operations tighter until you get to a level that's appropriate for your current level of maturity, your current scale of operations, the criticality of the data you're currently dealing with. And so I'm not a huge fan of going all the way to least privilege right off the bat. Forget dozens of visualization tools

Starting point is 00:28:36 and view your entire system in one place with New Relic Explorer, the latest addition to New Relic One. See your system-wide health at a glance with a dense hex view that has your hosts, services, containers, and everything else you probably shouldn't be monitoring but are anyway. And get in a statewide view of sudden changes, so you can theoretically catch issues before they impact customers. But let's be serious, you aren't checking your dashboards until 20 minutes into an incident that has been impacting customers for half an hour beforehand. So go to newrelic.com, sign up for free, and start exploring your system today. Be sure to tell them I sent you so that they can facepalm mightily.

Starting point is 00:29:16 Like everything, security feels like more of a journey than it is a destination. But that does change, for example, when you find yourself on the expo floor of RSA, at which point security is then transformed into something people are attempting to sell you. And my question across the board around that, I think, is do you see that there's a place in the security space for third-party offerings to thrive in the context of, I assume, a pure AWS environment along with a spherical cow? That's great. Is there a place for partners in that space? Absolutely. One of the things that we say all the time is that we're not as smart as the aggregate of our customers. If you're building an AWS service, one of the ways you know that you got it right is when you learn of some customer that's doing something with your service that you never anticipated, that's absolutely glorious and clever and enabling for their business, and you got out of their way. You never even thought about this use case, and they managed to do something that stunned you, even though you helped build this service. And so we're also not as smart as the aggregate of our partners or as the aggregate

Starting point is 00:30:31 of the internet as a whole. And we want to make sure that all of these people that have something to offer, that have these differentiating ideas that can make our customers' experience in the cloud better, have an opportunity to do so. There's a set of fundamental building blocks that we have to own, things like EC2 itself or S3 or IAM or CloudTrail. There's a set of things that customers expect us to offer. For example, GuardDuty. The feedback from our customers was overwhelmingly clear that as the owners of AWS and as the owners of CloudTrail, they expected us to have a service that would perform security analysis over those logs. And one of the data sources used by GuardDuty, one of our external

Starting point is 00:31:17 security services, is CloudTrail. And so that was in response to direct customer feedback. But we have a very rich ecosystem of partners that help customers out in all sorts of places. And some of these are born in the cloud partners. Some of these are partners that have been working with our customers for years and have made the journey with them from their on-premises data centers into the cloud. And there is a long and bright future there. It seems on some level like there's a bit of a series of terms of art or its own unique dialect in the security space, where compared to almost every other line of cloud offerings or SaaS offerings or developer tool offerings, that it feels like it speaks in a much more enterprise-style focus way, even when marketing to startups. Is that just because it's so difficult to message

Starting point is 00:32:10 that everyone is going from the same playbook? Or is there a cultural aspect of InfoSec done properly at a lot of these companies that means that I'm just not in that target market, so it's a language that isn't speaking to me? You asked me a marketing question? Oh yeah, I'm trying to understand. You started off once upon a time as an engineer-focused type. I mean, you don't generally become a distinguished engineer without writing at least a couple lines of code. And you used to be hands-on keyboard, and now you're talking to exactly those folks. And every time I talk to someone in the security space who does speak that dialect, they come away impressed at having

Starting point is 00:32:45 spoken with you. So that tells me that whether you know it or not, you do speak it. I'm just hoping you can sort of act as my security translator. I do think that we've been very clear in our messaging, however. My boss, Steve Schmidt, who's the chief information security officer of AWS, has talked a lot very publicly about how we think about security and how we treat security as something that's baked in from the beginning. How our messaging with our customers is around helping them move forward, helping them move forward with confidence, not about sowing fear, uncertainty and doubt. It's about making the pie larger and enabling more people to succeed, not in scaring people off from doing things. And so I think to a large extent, our security marketing, if not our security product marketing, is very much in our own voice. And I think it does

Starting point is 00:33:39 a good job of conveying the message that we want to convey. So a challenge that I have to imagine is frustrating, if nothing else, is that the reality of AWS and the perception of AWS have some significant gaps, where on the one hand, it's the idea of two pizza teams and people iterating rapidly and a bunch of small service teams each building something as part of a collective whole. And on the other, you take a step back and it's, you're Amazon, your market cap is measured in the trillions. Why is, insert whatever thing annoys you today, such a bad experience, or whatever it is? How does that tension wind up manifesting in your world? So it's true that the security team has gotten to be reasonably large. And you look across all of AWS, and I've been with the company now for 13 years, and it is dramatically larger

Starting point is 00:34:35 than it was when I started. But the job that we're taking on is also dramatically larger than it was when I started. It does not feel like our budgets have gotten any richer. It's just customer expectations have gone up. The expectations in terms of compliance, in terms of security, in terms of availability, in terms of operational excellence have all gone up at the same time that we've been launching more and more services and features. And so we're still incredibly parsimonious with our engineer time.

Starting point is 00:35:07 And a lot of our best security tools are things that an engineer was tired of dealing with. And they went off, and in the space of a couple of days, they made an absolutely horrendous prototype. Like, this code should never even have been typed into the computer in the first place. But it made their lives better, it made their job easier. And so another engineer contributed some code, and it became a little bit less eye-searing. And over the course of a couple of months, we wind up with a system that's actually really useful. And at some point you have a discussion, and you're like, wow, this thing is no longer just really useful. This thing is essential to our operations. We've hit another level of scale. And if we didn't have this automation, we wouldn't be able to keep up anymore.

Starting point is 00:35:52 And so then you build a team around it. And when we say build a team, there's the whole two pizza team thing. And we don't really talk about buying pizzas and, you know, thinking in that term. But these tend to be very, very small teams, you know, handful of engineers, software development manager. And now that team owns that thing and they evolve that thing. And all of the security tools that I see that I really like are things that started off as small tactical answers to an actual problem that we had that accreted functionality over years.

Starting point is 00:36:26 And it usually means that they're not beautiful, that there isn't some grand design that some architect sat down and sketched out and thought about all of the future scaling concerns. It means that they tend to be kind of patched together and evolved and as built. But the reality is that the grand designs that the architect sits down to sketch out usually don't take into account the future that actually happens. And so you wind up with patches and changes and emergent feature requests anyway. And it is incredible how quickly that value accretes. Oh, absolutely. People are familiar on some level with the idea of The Mythical Man-Month. It feels like this is almost a parallel to that. The mythical, just throw $5 billion at it and wait, where it's

Starting point is 00:37:10 throwing additional resources doesn't lead to better outcomes and in many cases can lead to materially worse ones. Absolutely. And so when I'm talking to customers and they want the tools that we have, one of the reasons that our tools are valuable is that they're tightly integrated with the way we do things. At Amazon, we have a ticketing system and everything is a ticket. If your laptop needs more memory, it's a ticket.

Starting point is 00:37:37 If you want to bring your dog to work, it's a ticket. If the website is down, it's a ticket. If your parking token doesn't work, it's a ticket. Everything is a ticket. And so all of our security tooling is integrated with the ticketing system. We even have security tooling that monitors the ticketing system to make sure that the tickets we've already cut are in a healthy state and to take metrics on that so we can report on it, so we can understand if we're spending our time in the right places. And none of that integration translates. And so what I tell

Starting point is 00:38:04 customers that are looking to get started on this journey, customers that want the kinds of tooling and none of that integration translates. And so what I tell customers that are looking to get started on this journey, customers that want the kinds of tooling that we have, I tell them to just get started. Rather than writing a catalog of all the things you'd like to check and all the lambdas you'd like to write, just write one, just pick one.

Starting point is 00:38:20 Check a single thing, write a quick three-liner that'll do it and see how it goes. Yes. Yes. Yeah. And the most important thing is not that you have that check. It's that you have the feedback loop. It's that the next time something goes wrong, you think, why did this go wrong?
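Editor's illustration: the "just write one check, then keep the feedback loop going" advice could be sketched roughly as below. This is a minimal sketch under stated assumptions, not AWS's tooling. The `CHECKS` registry, the `check` decorator, the check name, and the two policy strings are all hypothetical stand-ins; a real check would fetch the deployed policy from the account rather than use a constant.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry: check name -> function returning True when the check passes.
CHECKS: Dict[str, Callable[[], bool]] = {}


def check(name: str):
    """Register a function as a named check (illustrative helper, not a real API)."""
    def register(fn: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = fn
        return fn
    return register


def policies_match(actual: str, expected: str) -> bool:
    """Compare two JSON policy documents, ignoring whitespace and key order."""
    return json.loads(actual) == json.loads(expected)


# Stand-in data (assumption): in practice the deployed policy comes from the account.
EXPECTED_POLICY = '{"Version": "2012-10-17", "Statement": []}'
DEPLOYED_POLICY = '{"Statement": [], "Version": "2012-10-17"}'


@check("deploy-role-policy-matches-reference")
def deploy_role_policy_is_unchanged() -> bool:
    # The first and only check; more get added each time something goes wrong.
    return policies_match(DEPLOYED_POLICY, EXPECTED_POLICY)


def run_checks() -> List[str]:
    """Run every registered check and return the names of those that failed."""
    return [name for name, fn in CHECKS.items() if not fn()]


if __name__ == "__main__":
    print("failed checks:", run_checks())
```

The comparison here parses both documents so that formatting differences do not cause false alarms; a byte-for-byte comparison would be stricter and is equally easy to write. The registry is the part that carries the feedback loop: each new incident adds one more decorated function.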

Starting point is 00:38:37 What can I check that would prevent this from going wrong? And then you add that check. And so over time, you're going to create this library of validations. And the way we think about this is in terms of invariants. We call them security invariants. These are statements that should always be true. And they can be incredibly simple, like this IAM policy matches this text document exactly. Or they can be incredibly nuanced, like there is no path from the internet through any combination of nodes to any host that's tagged blue. And so the validators can be very simple. They can be very complicated. But you build this library of invariants. And

Starting point is 00:39:17 every time something happens that you don't like, or during the application security process ahead of time, you come up with invariants and you just keep building this library of invariants. And every single time we've done this, the library of invariants that we've wound up with is very different from the library of invariants we thought we needed. And because it's driven by things that have actually happened or things that we specifically identified in our threat models, they're the things we actually need. And that value accretes incredibly quickly. It's a matter of taking a bunch of little things and composing them into something fantastic at the end. It's almost like the microservices story, or some of the architectural diagrams that list a borderline sarcastic number of services,

Starting point is 00:40:01 but the outcome is really neat. Absolutely. And over time, you will learn that past you was not as smart as current you. But that's fine. The principal engineer community has a set of tenets. And one of the tenets is respect what came before. And it's incredibly important to me as an engineer, I've been around long enough that I've seen things where I've said, Oh, my gosh, what idiot did that? And you look in the source repo, it's git blame these days, but it was CVS blame back in the day. And your name is next to that line. And then you immediately fire up git blame someone else.

Starting point is 00:40:42 No, no, you own it. Like I made this decision. Yes, that's why you use the tool that rewrites history. So it's someone else's fault and not your own. Oh yeah, I'm right there with you. So the idiots that built the systems of the past weren't idiots. In fact, they're the ones that got us to where we are today. Those systems are what enabled our current business,

Starting point is 00:41:04 our current success. Now, every single thing I've ever worked on, we've outgrown. You get a couple orders of magnitude scaling out of your design, and then you've got to go back to the drawing board. But you do so making sure that you respect what came before, that you value the systems that got you to where you are, even though they've scaled beyond their utility, even though you think they're old and broken. They embody lessons. They're wise. They're battle-tested. And you make sure that you take as many of the lessons as you can from the systems that got you to where you are, and you treat them with respect, even as you turn them

Starting point is 00:41:40 off in favor of the new shiny thing. And after you've been through that cycle a couple of times, you know the new shiny thing is going to be one of those legacy systems someday soon. I tend to view legacy through a lens of being a disparaging engineering term for it makes money. It turns out that unlike what we learned in conference talks, you can't generally throw the entire banking system away and replace it with something you built in a weekend off of Hacker News. So I have an awful lot of sympathy for not just the greenfield stuff, but how you get what exists today into an environment that is better tomorrow. And there's no easy answer. So I want to thank you for taking so much time to speak with me about what you're up to and how you folks view these things. If people want to learn more about what you're up to, where can they find you?

Starting point is 00:42:28 So I am on Twitter at @ebrandwine. I'm not very good at the whole social media thing, so caveat emptor. We also have a wealth of material on the AWS security blog. And a lot of the stuff that I've talked about here, about how we think about security and about making incremental progress, is well covered there. Excellent. We will, of course, throw links to that in the show notes. Thanks so much for taking the time.

Starting point is 00:42:56 I really appreciate it. One never knows what one's reputation is with different groups at Amazon. There's no unified single opinion Amazon has. So it's nice to know that at least some people will still take my calls and it's very much appreciated. I think that your taste is terrible and the fact that you had me on just confirms that. And the fact that anyone wants to listen to this is mind-boggling to me. One person's trash is another person's treasure and I'm

Starting point is 00:43:23 generally the trash. Thanks so much. I appreciate it. Thank you, Corey. It's a pleasure. Eric Brandwine, Distinguished Engineer and VP at AWS. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment saying that actually there's a job negative one and tell me what it is. This has been this week's episode of Screaming in the Cloud. You can also find more Corey

Starting point is 00:43:59 at screaminginthecloud.com or wherever fine snark is sold. This has been a HumblePod production. Stay humble.


About Eric: https://aws.amazon.com/blogs/security/aws-security-profiles-eric-brandwine-vp-and-distinguished-engineer/
Links:
Twitter: https://twitter.com/ebrandwine
AWS Security Blog: https://aws....amazon.com/blogs/security/
