Screaming in the Cloud - The Darth Vader of AWS with Eric Brandwine
Episode Date: March 16, 2021
About Eric: https://aws.amazon.com/blogs/security/aws-security-profiles-eric-brandwine-vp-and-distinguished-engineer/
Links:
Twitter: https://twitter.com/ebrandwine
AWS Security Blog: https://aws....amazon.com/blogs/security/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
... couldn't find the tools they needed, so they built one. Sounds easy enough. No one's ever
tried that before, except they're good at it. Their platform allows teams to create consistency
for the entire incident response lifecycle so that your team can focus on fighting fires faster,
from alert handoff to retrospectives and everything in between. Things like, you know,
tracking, communicating, reporting, all the stuff no one cares about. Firehydrant will automate processes for you so you can focus on resolution. Visit
firehydrant.io to get your team started today and tell them I sent you, because I love watching
people wince in pain.
This episode is sponsored in part by ChaosSearch. As basically everyone
knows, trying to do log analytics at scale with
an ELK Stack is expensive, unstable, time-sucking, demeaning, and just basically all-around horrible.
So why are you still doing it, or even thinking about it, when there's ChaosSearch? ChaosSearch
is a fully managed, scalable log analysis service that lets you add new workloads in minutes and easily retain weeks, months, or years of data.
With ChaosSearch, you store, connect, and analyze, and you're done.
The data lives and stays within your S3 buckets, which means no managing servers, no data movement, and you can save up to 80% versus running an ELK Stack the old-fashioned way. It's why companies
like Equifax, HubSpot, Klarna, Alert Logic, and many more have all turned to ChaosSearch. So,
if you're tired of your ELK Stack falling over before it suffers, or of having your log analytics
data retention squeezed by the cost, then try ChaosSearch today and tell them I sent you.
To learn more, visit chaossearch.io.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Eric Brandwine, who's a distinguished engineer and VP at AWS.
Eric, welcome to the show.
Hi, Corey.
Thanks for having me.
So what is it you actually do at AWS?
Every time I've mentioned your name to folks in passing,
they get this sort of stricken look.
And all I can assume is that you're basically Darth Vader.
Darth Vader, I think, is a slightly unfair characterization,
perhaps not wholly unfair.
Because he had a redemption arc?
Well, there were only three movies.
Exactly. If they'd made a prequel or a sequel, it would have probably been a really good movie.
Shame they never did. There were three Indiana Jones movies. There were three Star Wars movies.
And yeah, at the end of the third movie, it kind of sort of worked its way out.
Every year at re:Invent, Andy Jassy gets up on stage and he says security is job zero. And I love it when he says this because one, he's counting from zero, which is how all good computer scientists count. But two,
he's very publicly saying how important security is to what we do. And this isn't just Andy getting
up on stage at re:Invent. When he comes back to Seattle, this is the behavior that he
models for his leaders. And so, unfortunately, my first interaction with a lot of our employees
is during a security event. And so, I have met a good number of my coworkers via Sev2 tickets.
Sev2s are pager tickets. And it's not the best way to make friends and
influence people. And so you've got to put a lot of work into building those relationships and
making sure you reach out and contact people after the dust has settled. But I would say
that my primary job is making sure that we hold the security bar high, relentlessly so.
When I first heard the idea of security being job zero,
my instinctive reaction was, okay, we made a list of all the things we have to do. Oh,
crap, we forgot security. So we'll do what everyone does with security, bolt it on at the
end and put it at the top so we don't have to renumber anything. And it's a funny joke,
and it's great to make the cheap shots and whatnot, but let's be very clear here.
It is blindingly apparent to everyone who has used AWS in depth that security is baked in.
You cannot bolt it on after the fact and expect to see the level of success that AWS has from a security perspective. I want to be explicitly clear on this.
I have a laundry list of grievances around AWS, mostly around service naming, but I have never had a problem with how seriously you folks take security.
That is excellent to hear. I'm glad that that's coming through.
It's a no-win game when it comes to security, because either it's in the way or it's invisible if it's done well, but no one ever stops and says,
I really like the security there. In fact, the only time people really seem to talk about it
is after they've had a data breach in public and they're saying, we take security seriously,
right after it was exquisitely clear that they did not take security seriously.
Well, I've heard that perspective many times, and I disagree with it because you're unsure of
your security when you're surrounded by ambiguity.
It's deeply unsettling.
And the effect of that, the materialization of that in the business is friction and low
velocity.
And when you work with someone, whether it's one of our service teams or one of our customers,
and you give them the data that they need and the tools to manage that data so that they can understand the security
risks that they're facing and so that they can make data-driven informed decisions about how
quickly they want to move, about which risks they need to mitigate, about which risks they can
accept, they become way more comfortable. And that leads to greater velocity for the business.
It leads to greater confidence for the leadership,
and it leads to greater delivery to customers,
which is the reason that the business is there.
And so over my time at AWS,
I've seen security go from something
that nobody talked about
to something that could only be a deficit
to something that is actually an enabler for us and for our customers.
I think you're one of the first people I've spoken to in my life who ever pushed back on
the idea of, well, obviously security wasn't the right answer here, yada, yada, yada.
I think you're onto something, though. It's always a spectrum between usability and security,
and there are trade-offs that have to get made. And on some level, given what AWS is
and who your customers are,
you can't ever get it wrong in a serious, big way
because you don't get a second bite at that apple.
Every cloud doubter on the planet
is going to come back with saying,
see, see, I told you, I told you.
And that's kind of weird. It's a very high
risk story. And so far you've delivered. There haven't been these horrifying nightmare things
that the grumpy on-prem sysadmin type, which I used to be, was long predicting. It's a track
record that comes from what I can only imagine to be a thorough defense-in-depth position. What am I
missing?
Absolutely. It's something we spend a tremendous amount of time and energy on. It does not happen by accident. And we're constantly looking for the maximum leverage
we can get out of any defensive mechanism. But I don't think that the world is as Boolean as
you're describing here.
When security events happen, they're absolutely serious.
We take them very seriously.
We respond immediately.
But if you look at the things that have happened in the world at large, even some of the large newsmaking security issues that we've had recently, it's never a complete company extinction.
It's never the end of a line of business.
It's definitely disruptive to roadmaps.
It is damaging to customer trust.
It's not something to be taken casually, but it's not like you make a single misstep and
it's all over.
And I think it's really important to reinforce that.
I'm regularly humbled by the amount of customer trust that we've
earned. And I don't take that lightly. I'm not going to play casually with it. But if you're
caught up in the belief that a single misstep is going to lead to business extinction, then you're
going to be paralyzed. You're going to be unable to move forward. You're going to be unable to objectively consider the risks.
And I see security as highly parallel to availability. Availability is something that every service provider thinks about and has deep experience in. And we all think about the risks
that we face. It is possible to build a system that is incredibly resilient. You run it in multiple availability zones.
You run it in multiple regions.
You build two completely separate implementations of it using two different languages and two
different runtimes.
And you completely don't share fate across anything.
And you can build an incredibly robust system.
Almost no one does that because it's expensive.
And the business, either implicitly
or explicitly, is making decisions about how much they're willing to spend on availability
and which availability risks they're willing to take. And all of these services have some
availability risks, and sometimes they have availability events. And security is exactly
the same. We are surrounded by security risks. No business is without security risk. And so the way to succeed here is to as objectively as possible think about those risks, mitigate the ones that aren't acceptable, prepare to mitigate the ones that are acceptable if it turns out that your analysis was wrong, and to move the business forward.
The problem that I see is that what you've just said
is first accurate. I disagree with absolutely none of it, but it's also nuanced. It doesn't fit in
easy soundbites. It doesn't fit in tweets, which is my primary form of shit posting.
It requires a level of maturity on the part of the listener to understand the nuances.
Why is it such a hard concept to convey repeatedly and well?
I think there are two things that make security difficult. If you look at availability,
we have models for availability. What are the odds that a backhoe is going to cut this fiber?
What are the odds that a bit is going to flip in this DIMM?
What are the odds that a human is going to violate operational procedures and push code
that wasn't completely tested?
And we have models for that.
And you build this chain of events, and you've got some idea of the likelihood of each of
these events, and you multiply through, and you come up with some level of assurance that this is an acceptable risk
to take or this is a risk that we need to mitigate this much. We need to drive the likelihood of this
event down below this threshold. And when you're dealing with security, you're dealing with a
motivated human adversary. There's some reason, and it may be, you know, just kids out for the
lulz. They're rattling all the doors down the hallway. And if they happen to find yours,
you might have an issue. But in general, you're dealing with a motivated human adversary. And at
that point, probabilities go out the window. It's wildly unlikely that this event followed by this
event followed by this event are going to happen unless there's a human at the keyboard making them happen. And I've found a lot of engineers shy away from that
kind of thinking. And I don't have an explanation for that. I don't know why. But the idea that
you're basically playing a blind game of chess with an unknown adversary is unsettling to them.
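
A minimal sketch of the availability arithmetic described above: give each event in a chain a likelihood, multiply through, and compare the result with a threshold. The function names, the probabilities, and the threshold are hypothetical and chosen only for illustration. The sketch assumes the events are independent, which is exactly the assumption that stops holding once a motivated human adversary is making the events happen.

```python
from math import prod


def chain_probability(step_probabilities):
    """Probability that every event in a chain of independent events occurs."""
    for p in step_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(step_probabilities)


def needs_mitigation(step_probabilities, threshold):
    """True when the chained likelihood is above the acceptable threshold."""
    return chain_probability(step_probabilities) > threshold


# Hypothetical yearly likelihoods: a fiber cut, the backup path also being
# down, and the failover procedure not working.
chain = [0.01, 0.05, 0.1]
print(chain_probability(chain))
print(needs_mitigation(chain, threshold=1e-3))
```

Driving the likelihood of any single event down lowers the product, which is the "mitigate this much" decision in the availability case.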
And of course, there's more than one adversary and only one of you.
And despite what you say, there is the perception that security is a, you only get to fail once,
and then it's all over. So you have to be confident in what you're doing. And this is
one of the things I love about the culture at AWS. It is an incredibly important thing.
If you ask anyone in AWS security what my favorite word
is, they will immediately respond, escalate. We have a culture of aggressive escalation
in Amazon in general, but definitely within AWS. And an escalation is not a vote of no confidence.
It's not me saying that you're bad at your job and I don't trust you and I'm going to go get
a second opinion. I'm going to grab your boss because clearly you're incompetent. Yeah, that is in some
cultures how it's perceived. Not when AWS does it, when people in those environments do it.
That is correct. That is not what we're doing. We're saying, I don't think we have the right
decision makers in the room. And rather than getting caught around the axle and having a
repetitive conversation that
doesn't converge or having a groundhog day meeting where we have the same document and the same
argument again and again and again, we're going to get the right decision makers in the room
and we're going to make high quality, high velocity decisions. And so if I'm uncomfortable
with something, I know that I can pick up the phone and I can get a hold of literally any leader in the company.
And they trust me.
You know, I'm not going to call Andy Jassy because something went bump in the night and I'm scared.
But if I need to get his attention, I know that I can get his attention.
And I know that he will listen to what I have to say.
And so given that, I know that if there's a decision that I'm uncomfortable with, if there's a
path forward that's unclear, I can go get high-judgment people that I trust to help me with
that decision. And then when we make that decision, it's made with much higher confidence,
and that enables me to continue to do my job, to continue to stare into the ambiguity of security. The other thing I think
that makes security different from other disciplines is availability events happen
much more frequently than security events. And so we just have a larger data set, a larger training
set. And so the humans that have to deal with availability issues have dealt with them way
more often than the humans that need to deal with large-scale security issues. And it's a much
harder problem to quantify. And I think that's one of the things that the cloud makes uniquely
possible. I've spent a lot of time in security in multiple positions, and I have never had access to the data and the tools that I have access to here at AWS.
Between DNS logging and flow logging and CloudTrail and all of the other data sources that we have, the amount of visibility that I have, the ability to reconstruct the past, to set up alarming, and then the tools to deal
with this data, not just, you know, S3 to host it and all of the machine learning and analytics
tools, but things like Lambda, where setting up alarming on a new condition is the job for an
engineer for an hour, not a major system design. It has completely changed the way that I think
about security and the way the team thinks
about security.
Something to emphasize is you're able to do all of that and have that visibility
from the hypervisor and network perspective, but not from within the customer environment.
And the fact that you could achieve all of this without effectively forcing your customers to make
a privacy or data security trade-off is sort of its own minor miracle,
from where I sit.
So I am very happy with how far we've gotten with the data sources that we have.
I'm very impressed with the team and what they've managed to accomplish.
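
The earlier remark that alarming on a new condition with Lambda is about an hour of work can be illustrated with a small handler. This is a sketch under stated assumptions, not how AWS security does it: the `event["Records"]` shape is a simplification (a real delivery path wraps CloudTrail records differently), and the choice of events to alarm on is hypothetical. The field names `eventSource`, `eventName`, `eventTime`, and `userIdentity` are standard CloudTrail record fields.

```python
# Hypothetical condition: someone turns off or deletes a CloudTrail trail.
ALARM_EVENTS = {"StopLogging", "DeleteTrail"}


def find_alarms(records):
    """Return a summary for each CloudTrail record that matches the condition."""
    alarms = []
    for record in records:
        if (record.get("eventSource") == "cloudtrail.amazonaws.com"
                and record.get("eventName") in ALARM_EVENTS):
            alarms.append({
                "event": record["eventName"],
                "time": record.get("eventTime"),
                "who": record.get("userIdentity", {}).get("arn"),
            })
    return alarms


def handler(event, context):
    """Lambda-style entry point. Assumes event["Records"] already holds
    decoded CloudTrail records, which is a simplification."""
    alarms = find_alarms(event.get("Records", []))
    for alarm in alarms:
        # A real function would page or notify someone here.
        print("ALARM", alarm)
    return {"alarm_count": len(alarms)}
```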
One of the things that we think about, we're surrounded by constraints. I mean, that's the nature of
all human endeavors. And so we have limited time, we have limited money, we have limited
human resources, and the human resources are the biggest constraint. Clueful engineers are a hot
commodity, and so every engineer hour is precious. And making sure that we allocate those,
not optimally, because then you wind up
spending a lot of time optimizing and not actually delivering, but acceptably optimally is really
important. And you look at the leverage, at the coverage you're going to get for an invested
engineer hour. And something like flow logs was expensive to build, and analyzing flow logs is expensive to build as well.
But every single thing in AWS talks IP.
You can't get into or out of an EC2 instance without talking IP.
And so flow logs give us ubiquitous coverage, literally 100% coverage.
Every packet is accounted for.
And that's huge. It doesn't matter what version
of the kernel you're running. It doesn't matter what operating system you're running. It doesn't
matter if you're playing with the latest container micro operating system that we don't have support
for. Eventually, it's going to turn into IP packets and it's going to wind up in the flow logs.
And so that's one of the things that we consider when we decide where to invest.
And that sort of ubiquitous coverage is incredibly valuable.
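
As an illustration of why flow logs give that kind of coverage regardless of operating system, here is a minimal sketch that parses records in the default VPC Flow Logs format (version 2) and counts rejected traffic by destination port. The function names are hypothetical, and the analysis is deliberately trivial; it is not a description of AWS's internal tooling.

```python
from collections import Counter

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
]


def parse_record(line):
    """Split one space-separated flow log record into a dict of strings."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_by_destination_port(lines):
    """Count REJECT records per destination port, skipping NODATA/SKIPDATA."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record["log-status"] != "OK":
            continue  # no traffic recorded in this capture window
        if record["action"] == "REJECT":
            counts[int(record["dstport"])] += 1
    return counts
```

Nothing in the record depends on what is running inside the instance, which is the point being made about kernels, operating systems, and containers.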
Those of us who are doing things that are, how do I put it, not particularly serious in an AWS environment where, for example, I'm building a Lambda function to wind up taking the status page and make it sarcastic and worse.
And I'm having trouble with it. It's irritating on some level
where I'm not able to push a button
and grant support access into the environment
to look at these things
because it's a toy app and I don't care.
And it's easy to lose sight of the fact that,
yeah, it doesn't matter if it's a toy app
that's doing some nonsense like that
or a bank that is doing something
that is incredibly sensitive and valuable and regulated,
I get the same level of protection as those workloads. And that's a powerful thing,
though it's, I admit, easy to lose sight of that when it's two o'clock in the morning and I just
want the funny joke to work.
I hear you. And for me, this is one of the most enticing challenges of working at AWS.
We don't have grades of service.
We don't have different levels of complexity.
We have a single suite of services that we offer to our customers. The novice customer that reads a blog post and wants to try something out is going to use the same EC2,
the same Lambda, the same S3, the same IAM as our most sophisticated government or financial
services customers. And in fact, that novice customer may themselves work for one of these
very demanding large customers. And this may be their first foray into AWS. And so today,
they're a one instance, one lambda, one bucket kind of customer, but they're going to evolve
over time into one of these very sophisticated, very demanding customers. And so there's this
continuum here. And you can't tell the customer, I'm sorry, that was great. I'm so happy that you
liked that. In order to move to the next level, you need to shut everything down, pack it up,
and move it over here to the much more feature-rich, complex cloud. You have to be able
to accommodate the getting started use case and the mildly more complicated use case and the early
production use case and all the way on through full corporate governance, multiple accounts, organizations, security audits, compliance
audits, et cetera, in a single suite of services. And I don't think we've got it perfect. I don't
think we'll ever get it perfect. But figuring out how to accommodate that entire spectrum of use
cases in a single service and to grow with your customers and to
enable them to tackle complexity incrementally as it becomes meaningful to them is honestly my
favorite part of designing a service. The thing that continually eludes me is I accept as fact,
because you've clearly demonstrated it, that you can handle, for example, the security in all of its sharp and difficult edges around
things like an EC2 instance, talking to RDS, and then storing something in an S3 bucket.
That makes sense to me. I don't know how you did it, but you clearly have done it.
But then you wind up with the almost Cambrian explosion of higher level AWS services that are
in machine learning. And, hey, we have this thing that talks to satellites in orbit. And, oh, there's this other thing that's Lookout for Equipment,
which is apparently named after a sign on the factory floor somewhere. And all of those things
in all those different directions have the same level of security guarantee, despite what is in
many cases a nearly completely alien workflow compared to what the historical expertise has
been aimed at. At least that's what it seems like from the outside. Is that accurate? Is there
something fundamental that I'm missing? Or is this just another demonstration of Amazon doing
its operational excellence thing?
This is my favorite thing about security as opposed to designing an AWS service is you have someone
come to you and, for example, they say, we would like to have a farm of iOS and Android
devices that mobile developers can use to test their applications.
And they're going to be awesome because they're going to be located right in our data centers,
right next to the EC2 instances that they're using for their development work.
And you go to the bookshelf and you pull down the big binder of policy, ask anyone in AWS security
what my least favorite word is, and they'll tell you policy. And the policy is you're not allowed
to have mobile devices in the data center. You're not allowed to have cameras. You're not allowed
to have Bluetooth. You're not allowed to have Wi-Fi.
And so you run the flowchart that's in the policy,
and the answer is clearly no.
That is obviously the wrong answer.
The right answer is, wow, that sounds cool.
I bet our customers would love that.
Let's figure out how to do it.
Which leads to the next question, which is how?
I have no idea. I have never built a device farm
before, but we're going to figure it out. And so we go and we find people that have the specific
expertise that's necessary. But there are patterns that crop up over and over and over again. Multi-tenancy
is really challenging, but it's an acquirable skill. Capacity management is really hard, but it's something that you can build expertise in.
And so we have a whole bunch of the fundamental building blocks lying around in different
parts of the organization.
It's just a matter of getting the specific knowledge necessary to apply to that domain,
whether it's the device farm or ground station or whatever absolutely
insane idea our service teams are going to come up with next that's going to delight customers.
And it's these crazy ideas, the ones that prima facie seem absolutely ludicrous,
that wind up being really, really valuable to our customers and totally feasible.
I would be remiss if I didn't make a feature request while I have you in a
circumstance in which you can't possibly say no. Now, let me preface this with I have never yet
come to AWS with a feature request and gotten a response of, holy crap, we never thought of that.
The answer is always, the reason we can't do it, quite like you're thinking, is nuanced and complicated. And a couple of times I've been taken down that path, and yeah, there are dragons everywhere,
and computers are awful, is what I take away from it. But IAM is one of those really, how do I put
this, esoteric things for an awful lot of people. It's easier to just grant access to everything,
and then in turn, later we'll go
back and fix that. Yeah, 10 years later, it doesn't work that way. We all write terrible things, and
we lie to ourselves and others about what we're going to be able to come back and do.
It feels like there's an opportunity to build almost a warn-if-reject style IAM approach,
where in a test environment, and please only use this in test environments,
you could have run a Lambda function, for example, through its paces, and it looks at what function
it was able to use. It's allowed to do basically everything, and then it spits out a narrowly
scoped down approach. This is a sort of thing that people have been asking for for a long time,
but to my understanding, the closest we've gotten is the IAM Access Analyzer. Is that a reasonable customer request?
Is there something that winds up getting missed somewhere when people are asking for this?
Or is this one of the ridiculously rare, wow, no one ever mentioned that to us.
We'll get right on it moments.
I hate to disappoint you, Corey, but this is not the first time we've had this conversation
with a customer.
Well, I am reassured by that, if it helps.
So I think that things like IAM Access Analyzer are our preferred path here. And I think that over time, IAM Access Analyzer will evolve to be more closely that kind of shrink wrap that you describe.
But what we've often found is that in order to get the right shrink wrap policy,
you have to exercise all of the functionality of that Lambda function
or whatever resource it is that you're attempting to shrink wrap.
And if you miss any branches, and in particular, you often miss the
error branches, there are actions that your code takes when things aren't working well that are
incredibly important to the survivability of your application. And so it turns out that everything's
running fine for a long time. Then there's some sort of failure. It's a failure that didn't occur
while you were running in test mode to generate the shrink-wrap policy.
And your code, following exactly what you wrote, says, oh, no, I have to post to this SNS topic in order to let them know that I've had a failure.
And it can't because that wasn't included in the policy.
And that kind of latent failure is in some ways worse than an overscoped policy.
And so there's a balancing act here,
and it winds up, as you said, being nuanced and complicated in practice.
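
The latent failure described above can be sketched in a few lines. This is a toy model under loud assumptions: `shrink_wrap_policy` and `is_allowed` are hypothetical helpers, the ARNs are made up, and the exact-match check is not IAM's real evaluation logic (no wildcards, conditions, or denies). It only shows how a policy generated from a test run omits a call that happens solely on an error branch.

```python
def shrink_wrap_policy(observed_calls):
    """Build an allow-only policy from (action, resource) pairs seen in a test run."""
    by_resource = {}
    for action, resource in observed_calls:
        by_resource.setdefault(resource, set()).add(action)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
            for resource, actions in sorted(by_resource.items())
        ],
    }


def is_allowed(policy, action, resource):
    """Toy check: exact match on action and resource only."""
    return any(
        statement["Effect"] == "Allow"
        and action in statement["Action"]
        and resource == statement["Resource"]
        for statement in policy["Statement"]
    )


# Calls observed while everything was healthy (hypothetical ARNs).
happy_path = [
    ("s3:GetObject", "arn:aws:s3:::example-bucket/*"),
    ("s3:PutObject", "arn:aws:s3:::example-bucket/*"),
]
policy = shrink_wrap_policy(happy_path)

# The error branch never ran in test, so its SNS publish is not in the policy.
failure_topic = "arn:aws:sns:us-east-1:123456789012:example-failures"
print(is_allowed(policy, "s3:GetObject", "arn:aws:s3:::example-bucket/*"))
print(is_allowed(policy, "sns:Publish", failure_topic))
```

The second check fails only when the application is already in trouble, which is why this kind of gap can be worse than an overscoped policy.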
And this is one of the philosophies that we try and help our customers and our service teams
understand, is that you want to do successive refinement here. The tighter you make the policy, the closer to
least privilege you get, the more work you're going to have to do with that policy. You're
going to have to spend more time. And, you know, in the fully realized corporate governance version
of this, there's going to be some other team that has to review your policy changes and approve
them. And if you've got a really, really tight policy that allows exactly
and only the things that you need, and then you add a feature and that feature happens to use
a new SQS queue or take some new feature of S3 and requires yet another API call that's not
currently allowed, then you have to go through this whole process of getting this approval and
doing the review and making sure that it's acceptable.
And so as your applications mature, you want the policies to get tighter and tighter.
You want the restrictions on changes to have a higher and higher bar,
not just for security reasons, but for availability reasons.
The thing that you're playing around with on your own personal time,
if it has a complete outage, no one's even going to notice. You might not notice. That production app that your customers are depending on, if it
has an outage, everyone's going to notice. And so you want to perform successive refinement here
where you keep making the policies tighter, you keep making the operations tighter until you get
to a level that's appropriate for your current level of maturity, your current scale of operations, the criticality of the data you're currently dealing with. And so I'm not a huge
fan of going all the way to least privilege right off the bat.
Forget dozens of visualization tools
and view your entire system in one place with New Relic Explorer, the latest addition to New
Relic One. See your system-wide health at a glance with a dense hex view that has your hosts,
services, containers, and everything else you probably shouldn't be monitoring but are anyway.
And get in a statewide view of sudden changes,
so you can theoretically catch issues before they impact customers.
But let's be serious, you aren't checking your dashboards until 20 minutes into an
incident that has been impacting customers for half an hour beforehand. So go to newrelic.com, sign up for free, and start exploring
your system today. Be sure to tell them I sent you so that they can facepalm mightily.
Like everything, security feels like more of a journey than it is a destination.
But that does change, for example, when you find yourself on the expo floor of RSA, at which point security is then transformed into something people are attempting to sell you.
And my question across the board around that, I think, is do you see that there's a place in the security space for third-party offerings to thrive in the context of, I assume, a pure AWS environment along
with a spherical cow? That's great. Is there a place for partners in that space?
Absolutely.
One of the things that we say all the time is that we're not as smart as the aggregate of our
customers. If you're building an AWS service, one of the ways you know that you got it right is when you learn of some customer that's doing something with your service that you never anticipated, that's absolutely glorious and clever and enabling for their business, and you got out of their way.
You never even thought about this use case, and they managed to do something that stunned you, even though you helped build this
service. And so we're also not as smart as the aggregate of our partners or as the aggregate
of the internet as a whole. And we want to make sure that all of these people that have something
to offer, that have these differentiating ideas that can make our customers' experience in the
cloud better, have an opportunity to do so.
There's a set of fundamental building blocks that we have to own, things like EC2 itself or S3 or
IAM or CloudTrail. There's a set of things that customers expect us to offer. For example,
GuardDuty. The feedback from our customers was overwhelmingly clear that as the owners of AWS
and as the owners of CloudTrail, they expected us to have a service that would perform security
analysis over those logs. And one of the data sources used by GuardDuty, one of our external
security services, is CloudTrail. And so that was in response to direct customer feedback.
But we have a very rich ecosystem of partners that help customers out in all sorts of places.
And some of these are born in the cloud partners.
Some of these are partners that have been working with our customers for years and have made the journey with them from their on-premises data centers into the cloud.
And there is a long and bright future there.
It seems on some level like there's a bit of a series of terms of art or its own unique
dialect in the security space, where compared to almost every other line of cloud offerings
or SaaS offerings or developer tool offerings, that it feels like it speaks in a much more
enterprise-focused way, even when marketing to startups. Is that just because it's so difficult to message
that everyone is going from the same playbook? Or is there a cultural aspect of InfoSec done
properly at a lot of these companies that means that I'm just not in that target market, so it's
a language that isn't speaking to me?
You asked me a marketing question?
Oh yeah, I'm trying to
understand. You started off once upon a time as an engineer-focused type. I mean, you don't generally
become a distinguished engineer without writing at least a couple lines of code. And you used to
be hands-on keyboard, and now you're talking to exactly those folks. And every time I talk to
someone in the security space who does speak that dialect, they come away impressed at having
spoken with you. So that tells me that whether you know it or not, you do speak it. I'm just
hoping you can sort of act as my security translator.
I do think that we've been very
clear in our messaging, however. My boss, Steve Schmidt, who's the chief information security
officer of AWS, has talked a lot very publicly about how we think about security and how we treat security as something that's baked in from the beginning.
How our messaging with our customers is around helping them move forward, helping them move forward with confidence, not about sowing fear, uncertainty and doubt.
It's about making the pie larger and enabling more people to succeed,
not in scaring people off from doing things. And so I think to a large extent, our security
marketing, if not our security product marketing, is very much in our own voice. And I think it does
a good job of conveying the message that we want to convey.
So a challenge that I have to imagine is frustrating, if nothing else,
is that the reality of AWS and the perception of AWS have some significant gaps,
where on the one hand, it's the idea of two pizza teams and people iterating rapidly
and a bunch of small service teams each building something as part of a collective whole.
And on the other, you take a step back and it's, you're Amazon, your market cap is measured in the trillions. Why is, insert whatever thing annoys you today, such a bad
experience or whatever it is? How does that tension wind up manifesting in your world?
So it's true that the security team has gotten to be reasonably large. And you look
across all of AWS, and I've been with the company now for 13 years, and it is dramatically larger
than it was when I started. But the job that we're taking on is also dramatically larger than it was
when I started. It does not feel like our budgets have gotten any richer.
It's just customer expectations have gone up.
The expectations in terms of compliance, in terms of security,
in terms of availability, in terms of operational excellence
have all gone up at the same time that we've been launching
more and more services and features.
And so we're still incredibly parsimonious with our engineer time.
And a lot of our best security tools are things that an engineer was tired of dealing with.
And they went off and, in the space of a couple of days, made an absolutely horrendous prototype.
Like, this code should never even have been typed into the computer in the first place. But it made their lives better, it made their job easier, and so another engineer contributed some
code and it became a little bit less eye-searing. And over the course of a couple of months, we wind
up with a system that's actually really useful. And at some point you have a discussion, you're
like, wow, this thing is no longer just really useful. This thing is essential to our operations.
We've hit another level of scale.
And if we didn't have this automation, we wouldn't be able to keep up anymore.
And so then you build a team around it.
And when we say build a team, there's the whole two pizza team thing.
And we don't really talk about buying pizzas and, you know, thinking in those terms.
But these tend to be very, very small teams,
you know, handful of engineers, software development manager. And now that team owns
that thing and they evolve that thing. And all of the security tools that I see that I really like
are things that started off as small tactical answers to an actual problem that we had that
accreted functionality over years.
And it usually means that they're not beautiful, that there isn't some grand design
that some architect sat down and sketched out and thought about all of the future scaling concerns.
It means that they tend to be kind of patched together and evolved and as built. But the
reality is that the grand designs that the architect sits down to
sketch out usually don't take into account the future that actually happens. And so you wind up
with patches and changes and emergent feature requests anyway. And it is incredible how quickly
that value accretes.
Oh, absolutely. People are familiar on some level with the idea of the
Mythical Man-Month. It feels like this is almost a parallel to that: the mythical "just throw $5 billion at it and wait," where
throwing additional resources doesn't lead to better outcomes and in many cases can lead to
materially worse ones.
Absolutely. And so when I'm talking to customers and they want the tools that
we have, one of the reasons that our tools are valuable
is that they're tightly integrated
with the way we do things.
At Amazon, we have a ticketing system
and everything is a ticket.
If your laptop needs more memory, it's a ticket.
If you want to bring your dog to work, it's a ticket.
If the website is down, it's a ticket.
If your parking token doesn't work, it's a ticket.
Everything is a ticket. And
so all of our security tooling is integrated with the ticketing system. We even have security
tooling that monitors the ticketing system to make sure that the tickets we've already cut are in a
healthy state and to take metrics on that so we can report on it, so we can understand if we're
spending our time in the right places. And none of that integration translates. And so what I tell customers
that are looking to get started on this journey,
customers that want the kinds of tooling that we have,
I tell them to just get started.
Rather than writing a catalog
of all the things you'd like to check
and all the Lambdas you'd like to write,
just write one, just pick one.
Check a single thing,
write a quick three-liner that'll do it
and see how it goes.
Yes. Yes.
Yeah.
And the most important thing is not that you have that check.
It's that you have the feedback loop.
It's that the next time something goes wrong, you think, why did this go wrong?
What can I check that would prevent this from going wrong?
And then you add that check.
And so over time, you're going to create this library of validations.
And the way we think about this is in terms of invariants. We call them security invariants.
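As an illustration only, here is a minimal Python sketch of what a small library of such checks could look like. Nothing in it is an AWS API or Amazon's internal tooling: the function names, the policy strings, and the network and tag data are all hypothetical. It pairs one very simple check (a policy text equals a reference document exactly) with one more nuanced check (a graph search for any path from the internet to a host carrying a given tag).

```python
from collections import deque
from typing import Callable, Dict, List, Set

# Hypothetical data model: everything here is illustrative, not an AWS API.
Invariant = Callable[[], bool]


def policy_matches(actual: str, expected: str) -> bool:
    """Simple invariant: the deployed policy text equals the reference exactly."""
    return actual == expected


def reachable_from(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Breadth-first search: every node reachable from `start`, itself included."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def no_path_to_tagged(graph: Dict[str, List[str]], tags: Dict[str, Set[str]],
                      source: str, tag: str) -> bool:
    """Nuanced invariant: no node reachable from `source` carries `tag`."""
    return not any(tag in tags.get(host, set())
                   for host in reachable_from(graph, source))


def run_invariants(invariants: Dict[str, Invariant]) -> List[str]:
    """Return the names of the invariants that do not currently hold."""
    return [name for name, check in invariants.items() if not check()]


# Made-up example state; a real check would load this from the environment.
EXPECTED_POLICY = '{"Version": "2012-10-17", "Statement": []}'
deployed_policy = '{"Version": "2012-10-17", "Statement": []}'
network = {
    "internet": ["load-balancer"],
    "load-balancer": ["web-1"],
    "web-1": ["db-1"],
    "bastion": ["db-1"],
}
host_tags = {"db-1": {"blue"}, "web-1": {"green"}}

invariants = {
    "policy-matches-reference":
        lambda: policy_matches(deployed_policy, EXPECTED_POLICY),
    "no-internet-path-to-blue":
        lambda: no_path_to_tagged(network, host_tags, "internet", "blue"),
}
violations = run_invariants(invariants)
print(violations)  # the blue-tagged db-1 is reachable, so that invariant fails
```

Adding a check after each incident then means adding one more entry to the `invariants` dictionary, which is how a library like this could grow one validation at a time.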
These are statements that should always be true. And they can be incredibly simple, like
this IAM policy matches this text document exactly. Or they can be incredibly nuanced, like there is no path from the internet
through any combination of nodes to any host that's tagged blue. And so the validators can
be very simple. They can be very complicated. But you build this library of invariants. And
every time something happens that you don't like, or during the application security process ahead
of time, you come up with invariants and you just keep building this library of invariants. And every single time we've done this,
the library of invariants that we've wound up with is very different from the library of invariants
we thought we needed. And because it's driven by things that have actually happened or things
that we specifically identified in our threat models, they're the things we actually
need. And that value accretes incredibly quickly.
It's a matter of taking a bunch of little things
and composing them into something fantastic at the end. It's almost like the microservices story,
or some of the architectural diagrams that list a borderline sarcastic number of services,
but the outcome is really neat.
Absolutely. And over time, you will learn that past you was not
as smart as current you. But that's fine. The principal
engineer community has a set of tenets. And one of the tenets is
respect what came before. And it's incredibly important to me
as an engineer. I've been around long enough that I've seen
things where I've said, oh my gosh, what idiot did that? And you look in the
source repo, it's git blame these days, but it was CVS blame back in the day,
and your name is next to that line.
And then you immediately fire up git blame someone else.
No, no, you own it. Like, I made this decision.
Yes, that's why you use the tool that rewrites history.
So it's someone else's fault and not your own.
Oh yeah, I'm right there with you.
So the idiots that built the systems of the past
weren't idiots.
In fact, they're the ones that got us to where we are today.
Those systems are what enabled our current business,
our current success.
Now, every single thing I've ever worked on, we've outgrown. You get a couple orders of
magnitude scaling out of your design, and then you've got to go back to the drawing board.
But you do so making sure that you respect what came before, that you value the systems that got
you to where you are, even though they've scaled beyond
their utility, even though you think they're old and broken. They embody lessons. They're wise.
They're battle-tested. And you make sure that you take as many of the lessons as you can from the
systems that got you to where you are, and you treat them with respect, even as you turn them
off in favor of the new shiny thing. And after you've been through that cycle a couple
of times, you know the new shiny thing is going to be one of those legacy systems someday soon.
I tend to view legacy through a lens of being a disparaging engineering term for it makes money.
It turns out that unlike what we learned in conference talks, you can't generally
throw the entire banking system away and replace it with something you built in a weekend off of Hacker News. So I have an awful lot of sympathy for not just the greenfield stuff,
but how you get what exists today into an environment that is better tomorrow. And there's
no easy answer. So I want to thank you for taking so much time to speak with me about what you're
up to and how you folks view these things. If people want to learn more about what you're up to, where can they find you?
So I am on Twitter at @ebrandwine. I'm not very good at the whole social media thing,
so caveat emptor. We also have a wealth of material on the AWS Security Blog.
And a lot of the stuff that I've talked about here, about how we think about security
and about making incremental progress
is well covered there.
Excellent.
We will, of course, throw links to that in the show notes.
Thanks so much for taking the time.
I really appreciate it.
One never knows what one's reputation is
with different groups at Amazon.
There's no unified single opinion Amazon has.
So it's nice to know that
at least some people will still take my calls, and it's very much appreciated.
I think that your taste
is terrible and the fact that you had me on just confirms that. And the fact that anyone wants to
listen to this is mind-boggling to me.
One person's trash is another person's treasure, and I'm
generally the trash. Thanks so much. I appreciate it.
Thank you, Corey. It's a pleasure.
Eric Brandwine, Distinguished Engineer and VP at AWS.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment
saying that actually there's a job negative one and tell me what it is.
This has been this week's episode of Screaming in the Cloud. You can also find more Corey
at screaminginthecloud.com or wherever fine snark is sold.
This has been a HumblePod production.
Stay humble.