Screaming in the Cloud - Firewalls, Zombies, and Cloud Permissions Security with Sandy Bird
Episode Date: May 2, 2024

On this Featured Guest episode of Screaming in the Cloud, Corey is joined by Sandy Bird, Co-Founder and CTO of Sonrai Security. The two discuss the current state of cloud permissions security..., and Sandy details the company’s breakthrough Cloud Permissions Firewall which promises fast and scalable cloud least privilege all with one click. Corey and Sandy also talk about bunk AWS tools in this space, the insanely high “zombie” population in the cloud, and how Sonrai works for companies of all sizes.

Highlights:
(00:00) Welcome to Screaming in the Cloud with Corey Quinn
(00:50) Sponsored Ad
(01:32) Exploring Sonrai Security's Mission and Challenges
(03:38) Introducing the Cloud Permissions Firewall Concept
(05:59) Comparing Cloud Providers' Permissions Models
(09:49) Sponsored Ad
(10:12) Addressing the Zombie Identity Problem
(16:44) Scaling Solutions for Different Company Sizes
(20:10) Navigating Cloud Security Challenges
(23:38) Innovative Approaches to Permission Management
(25:27) Optimizing Permission Requests with Statistics
(27:04) Improving Cloud Security with Permissions on Demand
(35:15) Concluding Thoughts and Contact

About Sandy: Sandy Bird is the co-founder and CTO of Sonrai Security, helping enterprises protect their data by securing cloud identities and access. Sandy was the co-founder and CTO of Q1 Labs, which was acquired by IBM in 2011. At IBM, Sandy became the CTO for the global security business and worked closely with research, development, marketing and sales to develop new and innovative solutions to help the IBM Security business grow to ~$2B in annual revenue. He is a trusted and experienced cloud security expert.

Links referenced:
Sonrai Security Website: https://sonrai.co/screaming-cloud
Free 14-Day Trial: https://sonrai.co/screaming-trial
Sandy’s LinkedIn: https://www.linkedin.com/in/sandy-bird-835b5576/
* Sponsor Sonrai Security: https://sonrai.co/screaming-cloud
Transcript
Our existing customer base, the people that really cared about least privilege,
were like large financials. They actually had the staff to put in place to kind of
monitor that you're actually getting to least privilege and cut the tickets, and they could
afford the extra developer time to do it right. And so those customers, we found as a pattern,
not only cared about least privilege, they were really good at writing,
we use the example of AWS, SCPs, Azure Policy, things like that, to basically
block the undesired activity.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
And this promoted guest episode is brought to us by our friends at Sonrai Security.
Also brought to us is their own co-founder and CTO,
Sandy Bird. Sandy, thank you for joining me.
Thanks for having me, Corey.
Do you know what's more old school than blowing on a Nintendo cartridge to make it work?
Manually creating individual policies to achieve least privilege in your cloud.
Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click.
With a cloud permissions firewall, you can easily restrict excessive permissions from human and machine identities,
quarantine unused identities, and restrict specific regions and unused services with the click of a button.
Start a 14-day free trial for Sonrai's cloud permissions firewall at sonrai.co slash screaming.
That's S-O-N-R-A-I dot C-O slash screaming.
So starting from the top, I suppose.
I don't believe I'd heard of Sonrai before you had reached out.
What is it you folks do over there?
Yeah, for the last five years, four to five years, we have focused on getting
identities that are in AWS, Azure, GCP to least privilege. So you can think about that as looking
at the history of what they do, generating a better policy, applying that policy to that
particular identity, and now it's at least privilege. We've learned a lot in four years.
That probably is, in some ways, and I hate to say this, a fool's errand,
because you have so many identities
that doing them one at a time,
unless you have some way to completely automate that
and trust the automation, is almost impossible.
So effectively you take existing permission sets
in various cloud accounts
and then prune them down
to a minimum viable privilege,
so that people can do their roles,
but they don't just have casual access to things that they don't need. Is that directionally correct?
That was, again, what I always call, you know, as you build these companies, Sonrai 1.0, right?
And it was our thesis, and I had this great thesis. It was that because cloud logged
everything (it doesn't actually log everything, but let's pretend it logs most things), we would
be able to look at every resource and get these perfect policies for it.
And then over time, we adapted those policies to make them a little less restrictive.
It's really annoying when you're using the console and somebody's taken away every single thing
you've never done before and you browse around the console and everything is broken when you
get there.
That's not such a great experience.
So we made those, we'll call them least restrictive, least privileged policies.
But we came to this conclusion about a year ago as we monitored our customers.
We had this great customer that was super successful at this.
They had built it into their thing.
They put Jira tickets in for people.
They fixed their Terraform.
They would test it in UAT and then they would roll it to production and be like, oh, we
were super successful.
We measured that timing and it was like over a 10-month period,
they fixed 2,000 or 3,000 identities.
And that's pretty successful until you realize they generated
more than 2,000 identities in that same period.
And then you're like, oh, this isn't working.
We're getting more and more efficient
at pushing this boulder up the hill continuously.
It is, right?
And so a year ago,
we took this kind of flip-it-on-its-head model
and said, there has to be a better way to do this. And so we created this thing called a cloud permissions firewall. I'm curious, Corey,
what do you think of that name? Not knowing even exactly what it does yet, what do you think of the
name?
Oh, I think it's brilliant marketing, because you're not going to be able to get into RSA unless
you have a firewall to sell someone. I mean, that's basically their entire shtick. So great.
I'd call basically anything a firewall if it gets me access to people I need to market to. It's great. It also, based upon what I'm thinking off the top of my
head, helps explain something sort of esoteric, which is effectively
identity as perimeter, which is what we're talking about here, to people who still
think in terms of firewalls. See again, RSA.
And you've kind of hit it on the head.
It was a really touchy topic around here as we were naming it, because part of it, as you say, is this very old school name,
which, by the way, is even older than networks, right?
We have firewalls between apartment buildings and we have firewalls in our car.
But the reality is it's a very old kind of term.
And we didn't know if people would be able to make this bridge into this,
as you say, identity as the new firewall world. But when we started thinking about it more, as we kind of built this new model,
it really is flipped on its head to be a deny first model for identity, but only for the most
sensitive permissions. If you actually took every single identity, there's, I think it's up to like
43,000 permissions across those three main cloud providers. Now it's insane. And it grows every
day, literally every day. There's more permissions.
If you did that and tried to protect all 43,000 of them,
everybody would be using something new
at some point in time,
and it would just be super annoying.
However, if you took,
and we actually did this piece of work
to find all of the really sensitive stuff,
create a new internet gateway,
create a pre-signed URL,
copy a snapshot to another place,
the things that actually leak data, poke holes in your world,
destroy the cloud, these types of things,
we got it down to about 3,000 permissions across those three clouds,
plus or minus a few.
And when we looked at it that way, we could flip the model.
We could say, now we can build deny first for those 3,000 permissions.
And if you have the other ones, they're not as restrictive. You should go back to our old model and build least privilege policies for them. But if
you don't get to it, we can take most of the risk out of this by protecting the 3,000 centrally.
So it's a different model, super effective, super fast to get done. You can get it done in
a week versus 10 months.
I have a lot of thoughts on the idea of permissions in cloud and least
privilege.
There are two almost diametrically opposed philosophies, at least the last time I dug into this in any depth:
AWS and GCP. By default, nothing in AWS can talk to anything, full stop. Whereas in GCP,
everything within a project generally can speak to everything within that project until you start isolating things down. And security purists love to turn up their nose at the Google approach,
but I think it is the better way to start.
Otherwise, you wind up with what everyone does in AWS.
You try and just give it the permissions it needs,
and then something doesn't work,
and you expand it a bit, and it still doesn't work,
and you try and expand it yet again,
and it still doesn't work,
and then you just give it full access to do things,
with a "to-do: fix this later,"
and the to-do hangs around longer
than any five employees in your company.
I think you've nailed it on the head, and
I'll bleed Azure in here too,
just to really mess the world up.
Right?
So in AWS,
you have this expanding-of-the-wildcard problem.
You don't know what the permissions are underneath them.
And so people just,
as an example,
give it EC2 star,
Lambda star,
whatever they need to get the thing done.
Then they find out they need a PassRole.
So they add,
you know,
IAM.
You also need to be able to talk to CloudTrail logs, or they won't be able to charge you out the
wazoo.
That's right. Exactly. So you have all of this massive permission set in AWS.
One thing that's neat about the AWS model, though, is it is a deny-first model. And so if you can get
a deny somewhere in that path of your identity, you can deny something. And no matter
how many times other things grant it, it will still be denied. As you say, GCP is a little
different than that. It has these kind of very open projects, right? And we always pick on
people for their service accounts that can act as anything else, including all the other service
accounts in the project. But it is still a deny-first model. And about a year and a half ago, maybe a little
longer ago, GCP put a binding in that's very special that allows you to create a deny on a
permission. And you can actually build exemptions around that using different principals that they
have. And so you can actually get a pretty good deny-first model in GCP. But as you say,
at least it starts in a usable form where things aren't open to the entire
GCP cloud. They're at least limited to that project. So it's a little bit better in some
ways. Although sometimes I equate a project to an account in AWS. And again, we can talk about
how open those are. Azure is really backwards. Azure is an allow-first model. So no matter how
many denies you have and how many policies you've written,
if anywhere in Azure there's one statement that says you're allowed to do it, you're allowed to do it. And so you have to think completely differently in Azure when you go to correct these things,
because you can still create a deny-first model, but you have to understand
all the inheritance and everything for doing that. And depending on where you're putting the rules,
there are other things that can override them. So anyway, we could spend a lot of time on that.
I've been beating the drum for ages
that Azure security is deeply flawed across
a variety of different levels. I wasn't even aware of this
and just add it to the pile at this point.
Although I will give them credit. They're the
most cost-effective cloud, just because of how easy
it is to run your stuff on someone else's account.
Yeah, well, there you go.
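The two evaluation styles contrasted in this exchange can be sketched in a few lines. This is a toy model only: the `Statement` type and both function names are made up for illustration, `deny_overrides` mirrors the "one deny anywhere in the path wins" behavior described for AWS, and `allow_overrides` models the other style only as it is characterized in this conversation, not as a verified account of any provider's engine.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    """Hypothetical, simplified policy statement (no resources or conditions)."""
    effect: str  # "Allow" or "Deny"
    action: str  # e.g. "ec2:*" or "ec2:CreateInternetGateway"


def _matches(pattern: str, action: str) -> bool:
    # Case-insensitive wildcard match, so "ec2:*" covers every EC2 action.
    return fnmatchcase(action.lower(), pattern.lower())


def deny_overrides(statements, action: str) -> bool:
    """A single matching Deny wins, no matter how many Allows also match."""
    matched = [s for s in statements if _matches(s.action, action)]
    if any(s.effect == "Deny" for s in matched):
        return False
    # With no matching Allow, the request is implicitly denied.
    return any(s.effect == "Allow" for s in matched)


def allow_overrides(statements, action: str) -> bool:
    """A single matching Allow wins, regardless of any Deny statements."""
    return any(s.effect == "Allow" and _matches(s.action, action)
               for s in statements)
```

Under the first function a central deny is enough to protect a sensitive permission; under the second, every allow in the inheritance chain has to be found and accounted for.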
We could spend a lot of time
on it, but I'm going to go on a side tangent
in Eric's discovery of Azure.
So we were spending time building this particular project, and we were looking for ways to basically, I think we're going to talk about zombies later.
I love zombies.
We're going to talk about zombies and cleaning zombies up.
But we were trying to find ways to make sure that we would know if something happened in Azure that was denied.
What we discovered was almost nothing in Azure that is denied is logged
to their centralized logging. It shows up in the screen of the person who is denied in their
console, you'll get a deny or in the SDK, you'll get the deny. But then when you go to look at the
activity logs, no matter what you turn on, the diagnostic logs, all these things, no deny log.
And it's not every permission, but it's huge numbers of them, which is really
interesting in Azure. Anyway, side tangent, we could go down that one for a while.
I want to go to the zombie thing that you're talking about, because I suspect I may have a
real-world story that is germane to this, if you're going the same place I think you're
going with it. But please, tell me more.
Yeah, so we were doing our research in building this
flip-the-model-on-its-head approach and doing this cloud permissions firewall
instead of this, you know, let's fix every identity.
And one of the statistics we started looking at
was how many of these identities are completely unused.
So they have permissions attached to them.
Some of them are really sensitive.
Some of them are benign,
but they just have something attached to them
and they're sitting in the cloud
and they're completely unused.
And we took a large chunk of our customers, big, big enterprise customers that have thousands of accounts and then little small
customers that have 10 or 15. And there was this interesting stat that the longer you were in cloud,
the more of these identities that you had, which we nicknamed zombies, that were sitting there with
all these permissions that weren't used. And it's really scary when you started looking at companies
that were in cloud for more than five years.
So they had history.
It was like 75% of the identities kicking around were unused.
That high?
It was insane how high it was.
Some were worse, actually. That was an average.
So it's pretty bad, actually.
And all that stuff, of course, opens up risk in the environment.
Well, so does closing them. And that's the challenge I have around this,
because depending on what your sampling window is, there are things that only run
once a quarter, for example. So if it's not at least 90 days, you're going to catch some of
those things out. And then you have some very frantic, very upset business people wondering
why something isn't working. But the one that I care about the most from the old world IT ops side of the world is
the break glass scripts, the things that you have sitting somewhere that don't normally run in the
course of business. I have one now in my personal account where everything for my dev box
is on my Tailscale network. On the off chance that that isn't working for whatever reason,
I can hit a Lambda endpoint with a pre-stored key. And all that does is it changes
the security group to open up port 22. So I can SSH into the thing with an actual credential and
continue from there. That is something that I don't think I've ever used other than when I
built it and tested it. It's easy for something like this to view that as, oh, you don't need this
around, and you're right, until suddenly I very much need that in some weird networking circumstance, and it won't work. How do you avoid that trap?
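The concern here, that a short sampling window flags quarterly jobs and break-glass paths as unused, can be made concrete with a small sketch. Everything in it is an assumption for illustration: the `find_zombies` name, the shape of the `last_used` mapping, and the 90-day default are not taken from any product.

```python
from datetime import datetime, timedelta


def find_zombies(last_used, now, window_days=90, exempt=frozenset()):
    """Return identities with no recorded activity inside the sampling window.

    last_used maps an identity name to its last activity time, or None if it
    has never been seen. Identities listed in `exempt` (for example known
    break-glass roles) are never reported, however idle they are.
    """
    cutoff = now - timedelta(days=window_days)
    zombies = []
    for identity, seen in last_used.items():
        if identity in exempt:
            continue
        if seen is None or seen < cutoff:
            zombies.append(identity)
    return sorted(zombies)


if __name__ == "__main__":
    now = datetime(2024, 5, 2)
    activity = {
        "role/nightly-backup": datetime(2024, 5, 1),
        "role/quarterly-report": datetime(2024, 2, 20),  # 72 days ago
        "role/open-port-22": None,                       # built, tested, never run
    }
    # A 30-day window wrongly flags the quarterly job; 90 days does not.
    print(find_zombies(activity, now, window_days=30))
    print(find_zombies(activity, now, window_days=90,
                       exempt={"role/open-port-22"}))
```

The window length and the exemption list do the real work: the break-glass role will always look idle, so it has to be named explicitly rather than inferred from logs.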
Look, Corey, I think you nailed the last four years of my life. We have this great CIEM solution,
Cloud Infrastructure Entitlements Management, another acronym by Gartner. And we've been trying
to get people to clean up these zombie identities forever. And there's really kind of, you said, two ways, and there's actually a bit of a third, which is part of your first solution,
which is the break glass accounts are never supposed to be used, as you said,
and we should never get rid of them.
That's also a small handful of them,
though, to be clear, as opposed to the huge amount of things that got spun up as detritus
of other things.
Exactly. And I would argue you really should know what they are.
The second part is more like your solution, though, where you have another team that's built something. It might be a yearly report. It's named really weird. And you as the cloud ops person at the top of the configuration setup where if I hit this Lambda
endpoint, then that will do something which changes this. And there may be a resource group
on that that trusts that Lambda function. And so it's this encompassing workload. It can get worse
if it's an IAM user, which maybe you shouldn't be doing, but it has an access key cut on it.
And you delete any of those things. What happens is not only do you delete the identity or the IAM
user, you delete
the access key. So you've lost the key material. You've removed all of the permissions. And now
that identity that's trusted through some trust relationship on some other resource doesn't exist
anymore. And so if you had to put it back, you wouldn't even know how to do it. You wouldn't
have the original state.
This is the guidance I give customers when they're talking about,
we don't think this thing's being used, but we're not sure.
How do we find out?
And it's like, well, if you turn it off and no one knows what it is and something breaks,
that's going to be challenging.
And not because there's really no warn if reject on a lot of these things.
Great.
Let's change security groups so nothing can talk to it and leave it there for some period
of time.
Check the instance role.
Is it doing anything during that sampling period?
And at some point, then go ahead and stop it without terminating it and let it go another
period. And then there's the scream test. When you block access to it, who screams?
That, on some level, sounds like what you're talking about.
It is exactly it. And so what we did was we basically said, that's fine. Let's leave all
the permissions intact and then basically short-circuit it using a deny star, so it doesn't
work anymore. And what we did was we have this second part of our product, which we call
permissions on demand. And what that does is it listens for the wake-up. So if it sees an attempt
to be used after nine months, it sends a message via ChatOps: Slack, Teams, email, if you're into
that sort of thing, which maybe I am, but everyone else likes Slack. You get this message that says, hey, this thing just tried to wake up.
Do you want to reanimate the zombie?
And if you do, you hit yes, I want to reanimate it.
The thing tries again, and it's going to work.
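The wake-up flow just described (blocked attempt, chat message, approval, retry succeeds) can be modeled as a tiny state machine. This is an illustrative sketch, not the product's implementation: the `Quarantine` class, its method names, and the `notify` callback standing in for Slack, Teams, or email are all hypothetical.

```python
class Quarantine:
    """Toy model of quarantined identities with approve-to-reanimate."""

    def __init__(self, quarantined, notify):
        self.quarantined = set(quarantined)
        self.pending = set()   # identities with an open approval request
        self.notify = notify   # callable taking one message string

    def attempt(self, identity: str, action: str) -> bool:
        """Return True if the call is allowed; otherwise raise one request."""
        if identity not in self.quarantined:
            return True
        if identity not in self.pending:
            # Only the first blocked attempt sends a message, so a retry
            # loop does not flood the channel.
            self.pending.add(identity)
            self.notify(f"{identity} just tried to wake up ({action}). "
                        "Reanimate the zombie?")
        return False

    def approve(self, identity: str) -> None:
        """A human said yes: lift the quarantine so the retry succeeds."""
        self.pending.discard(identity)
        self.quarantined.discard(identity)
```

Because the original permissions were never deleted, approval only has to remove the block; nothing needs to be rebuilt from memory.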
You could interrupt something, as you say, by turning it off.
Who screamed?
But you give this person screaming the ability to approve and turn the thing back on.
And then after some period of time, hopefully you do become comfortable and say,
this thing is really not used.
You should move it away.
But you do have to put in the exemptions
for like the break glass accounts, right?
You know what those are.
So we have in our product this way
that you can actually put them in as exemptions.
And of course they will never get blocked.
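As a rough sketch of "leave the permissions intact and short-circuit with a deny star, except for exemptions," the function below builds a policy document in the general shape of an AWS service control policy that denies every action to a list of principal ARNs. The function name, the `Sid`, and the example ARNs are invented for illustration, and this is not presented as what the product actually emits.

```python
import json


def build_quarantine_policy(zombie_arns, exempt_arns=()):
    """Build a deny-everything document for unused identities.

    Exempted ARNs (break-glass roles, say) are dropped before the policy is
    built, so they can never be blocked by it. Returns None when nothing is
    left to quarantine.
    """
    exempt = set(exempt_arns)
    targets = sorted(arn for arn in set(zombie_arns) if arn not in exempt)
    if not targets:
        return None
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "QuarantineUnusedIdentities",  # hypothetical name
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            # The deny applies only when the caller is one of the targets.
            "Condition": {"ArnLike": {"aws:PrincipalArn": targets}},
        }],
    }


if __name__ == "__main__":
    policy = build_quarantine_policy(
        ["arn:aws:iam::111122223333:role/old-report",
         "arn:aws:iam::111122223333:role/break-glass"],
        exempt_arns=["arn:aws:iam::111122223333:role/break-glass"],
    )
    print(json.dumps(policy, indent=2))
```

Reanimating an identity then means removing its ARN from the list, which is why the attached permissions and key material survive the quarantine.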
But I actually think one of the most powerful parts
of the product is being able to remove that.
Because what we find is
that they show up in these lateral movement chains.
So this identity can get to this identity, which then can get to this unused identity,
and then it can do all kinds of havoc. And by actually short-circuiting them,
they no longer laterally move through them.
Do you know what's more old school than blowing
on a Nintendo cartridge to make it work? Manually creating individual policies to achieve least
privilege in your cloud.
Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click.
With a cloud permissions firewall, you can easily restrict excessive permissions
from human and machine identities, quarantine unused identities,
and restrict specific regions and unused services with the click of a button.
Start a 14-day free trial for Sonrai's cloud permissions firewall at sonrai.co
slash screaming. That's S-O-N-R-A-I dot C-O slash screaming.
It seems like it's one of those fun places that you can get lost in if you're not careful. It
feels like this is something that works super well for certain scales of company. This sounds great, even on my own test account, which is awesome.
I can see it working at small to medium scale. What I start to wonder is, at enterprise scale,
where in some cases I have clients spending hundreds of millions a year upon thousands of
accounts. And at that point, it's so diffuse that it becomes difficult to reason about any of these things in
any holistic way. Is there a sweet spot that you've found where this is best resonating,
or is this one of those rarities that actually does apply to theoretically every cloud customer?
How we came up with the solution is kind of interesting, and sometimes you have to get beat
up a lot to figure out where you need to be in these things. And so we had our existing customer base, the people that really cared about least privilege
were like large financials. They actually had the staff to put in place to kind of
monitor that you're actually getting to least privilege and cut the tickets, and they could
afford the extra developer time to do it right. And so those customers, we found as a pattern,
not only cared about least privilege, they were really good at
writing, we use the example of AWS, SCPs, Azure Policy, things like that, to basically block the
undesired activity. But they probably had a team of people doing that. When we went to our customer
base that was, we'll call them large-scale cloud, but not as highly governed or as highly mature,
it was typically a team of four people
that ran the whole cloud infrastructure
and they were responsible for everything end-to-end.
They didn't have the cycles to put into monitoring
to get to least privilege.
They didn't have the people to write SCPs.
They didn't have that.
And so cloud was kind of a mess,
a growing mess as it went.
And so when we were building the solution,
we were trying not to build it for that highly governed seven people writing SCPs. They just knew what to do when they
were doing it well. We were trying to write it for that team that was, man, we're understaffed.
We've got to get to least privilege from whatever compliance regime we're under. We're supposed to
get to least privilege, but we can't do it. This gave them a way to get there fast and easy and didn't disrupt anything. Because we have this
option where we find all of the exemptions based on the history, we put those in automatically,
and then you really only have to worry about day plus one where you use permissions on demand.
It's been interesting actually building the product and exposing it back to some of those
larger, highly governed companies.
And what we found was they too struggle with SCPs because if you look at SCPs, there's
SCP space limits.
There's the number of them you can attach.
There's all these weird constraints you have to do.
And some of the stuff we had to do to solve those problems is actually even applicable
to them.
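The size constraint mentioned here is one reason a single deny policy does not scale on its own. A minimal sketch of working within it, assuming a per-document character limit (the 5,120 default reflects the documented SCP size limit as I understand it and should be checked against current AWS quotas), is to pack principal ARNs greedily into as many documents as needed. The function names and `Sid` are hypothetical, and how the resulting documents get attached, given the separate cap on attachments, is out of scope.

```python
import json


def render(arns):
    """Serialize one compact deny document covering the given ARNs."""
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "Quarantine",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"ArnLike": {"aws:PrincipalArn": list(arns)}},
        }],
    }
    # Compact separators avoid spending the size budget on whitespace.
    return json.dumps(doc, separators=(",", ":"))


def split_into_policies(arns, max_chars=5120):
    """Greedily pack ARNs into documents that each fit within max_chars."""
    batches, current = [], []
    for arn in sorted(set(arns)):
        if len(render(current + [arn])) <= max_chars:
            current.append(arn)
            continue
        if len(render([arn])) > max_chars:
            raise ValueError(f"{arn} cannot fit in a single policy")
        # The current batch is full: close it and start a new one.
        batches.append(current)
        current = [arn]
    if current:
        batches.append(current)
    return [render(batch) for batch in batches]
```

Greedy packing is not optimal in general, but with ARNs of similar length it wastes very little space and keeps each identity in exactly one document.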
So by no means is this everybody's solution.
If you're the purist and
you can afford the staff to get to least privilege, I would agree. You should do that. That's the
perfect way to do it. However, for the people that can't do that and can't achieve it, this is a much
better solution. On scale, what's neat about this is you can start in one account. You can monitor the
whole org and say, I'm just going to start in development in this one area, and then you can
kind of work your way up through it. You don't have to do it all on day one. And we've built the SCP scaling such that it
works across thousands of accounts or across 10 accounts, whichever you happen to have.
That's a neat approach. On some level, on paper, it sounds like if you use just the lens
of AWS, they have a few offerings that make what you do irrelevant. They have the IAM Access Analyzer,
which in turn now can generate policies
based upon what you actually use.
And that would be awesome and would basically be like,
well, why would I ever need to use what you built?
Except for the fact it doesn't freaking work
or it works, but it doesn't go far enough.
Where, oh, we saw that this role
used the DynamoDB write table option.
Okay, great.
Can you tell me what table you're up to?
No, go guess.
Then what's the point?
Like, you don't get to be specific enough.
Like, what I would love to see is something that it auto-generates a policy of,
okay, based upon our observed behavior during the capture window,
you're able to write to the following S3 keys.
Like, okay, great.
Let's back that up a little bit.
Give it a prefix or a bucket or something.
But yeah, that's the direction.
Let me broaden it.
Because otherwise you wind up in the hell
that I'm still in with one of my CodeBuild roles
that does deployments where it has full access
to spin things up in a given account.
To be clear, this is for my newsletter stuff.
This is not for my production stuff touching client
data. Different universes here.
But yeah, it still has
full access because every time I've tried to dial it in,
it's a problem because first
it has the ongoing updates
to things when it does deployments, its permission
set, but it needed a separate permission set
entirely to provision those things in the first
place the first time I ran it. So there's
a question of, great, how do I dial those in? It's okay to discard those extra permissions now,
but every time I thought I had it working, I'd make one small change and boom, I'm back to square
one. So I gave up.
Yeah. And it's a common pattern. Again, I lived the last four
years of my life before this particular new product thinking I wanted to be that purist too,
right? I want to get everything absolutely perfect.
And then after looking at these customers struggle with,
you know, some of these accounts are huge, right?
You know, 50,000 plus identities
that have sensitive permissions
that haven't used them in the last whatever.
We failed and we had to get a better model.
And the only way to do that
was to start with a smaller subset.
We couldn't do every single permission, right?
But you could do the sensitive ones.
And by doing the sensitive ones, you could remove that.
Everything you have, the important stuff gets buried under an avalanche of random trivia.
And I also think what's interesting about, you know, you look at your problem and when
you're looking at, you know, the DynamoDB tables or the S3 keys or the prefixes, I don't
even think half the people know, like you have to turn on extra auditing even to see
that stuff.
And not every service in Amazon even supports that auditing. So, you know, doing it
is super hard.
Oh, and data events, to get those logged in CloudTrail. I've done the numbers on
this. The API call to read an object in S3 will show up there in the CloudTrail data events,
and it will cost 20 times as much as the API call for the read, which, okay, but that's not going to solve every problem
for everyone. I understand that there's value in security and some things should be paid for,
but I firmly believe that providers should not be charging extra for things that only they can
provide. If they want to go head to head on, we'll ingest syslog and do these analytics, yeah,
by all means, charge away for that, because I have a half dozen options, and honestly I still like awk with occasional grep tied to it, and that gets me surprisingly far,
especially if I sprinkle in some Perl, and that's great. Other times you can send it to Datadog,
or Splunk if you have a spare princess lying around you can ransom back to whatever kingdom
she's from. Awesome. That's an open field. But I've got to pay for these audit CloudTrail events
because there's no second option
for me to pay someone for that.
No, it's amazing when you look at the amount
and, you know, there's other quirks in there too
as we're talking to your audience, right?
If you run two CloudTrails,
now you really, really get burned
because you only get one for free
and the second one costs more
and then you're storing the data twice.
You ever see CloudTrail paid events?
It's usually a sign that something's misconfigured. Very occasionally, especially at finance
institutions, I see security teams want an unadulterated CloudTrail that no one else can
see for whatever reason, and they refuse to share it onward. Cool. I can tell you down to the penny
what that cost you last month. Great. Make your own decisions. I'm here to advise. I'm not here
to make decisions for you. You clearly have context I don't. That's the nature of respecting your customers' businesses.
But it's frustrating to see that misconfiguration. It feels like a tax on not knowing this one weird
piece of trivia about AWS.
Yeah. Anyway, again, this all led us down this path to say,
if we're going to try to fix this for the average customer, right,
that doesn't have the team from those large financials that can justify four CloudTrails
for some reason, we had to do it in a way that was, you know, click a button, make the thing
happen. And the way that that, you know, worked for us was we did a lot of statistics across these
sensitive permissions. So we did a lot of work figuring out what those 3000 sensitive permissions
were. When we looked across our customers,
we did throw a few of them out that were,
they were called a lot
and they were called by disparate identities.
So it would have been a lot of different identities
doing this thing.
And we said, well, that's too many permissions
on demand requests.
You'd have to approve it too many times.
And we kind of graded ourselves to say,
anything we're going to put in this list
has to be something that can be called a lot by a single identity,
but the number of unique identities that call it in a period of time needs to be somewhat
small. It's not that it's never called at all, but it can't have a hundred different identities in one account
calling it. And that was kind of this guiding light to us to say, okay, well,
you know what? You create your internet gateway one time and you never really touch it again.
So that's a sensitive permission. You do things like, you would say, well, decrypt must be
sensitive. Decrypt gets called by everything in your cloud. So we can't use decrypt as a sensitive
permission, right? And so you use this as your guiding light to figure out what these are. And Azure has some crazy ones, by the way. There's stuff in Azure that
allows you to take like a file system off a running VM and make it a URL on the internet.
New database just dropped, I guess.
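The screening rule described a moment ago (a permission may be called often by one identity, but only a handful of distinct identities should call it) comes down to counting unique callers per permission. The sketch below is an assumption-laden illustration: the event shape, the function names, and the threshold of five are invented, and the real analysis presumably weighs more than this one statistic.

```python
from collections import defaultdict


def unique_callers(events):
    """Count distinct identities per permission.

    events is an iterable of (identity, permission) pairs, one per call.
    """
    callers = defaultdict(set)
    for identity, permission in events:
        callers[permission].add(identity)
    return {perm: len(ids) for perm, ids in callers.items()}


def screen_candidates(candidates, events, max_unique_callers=5):
    """Keep candidates whose distinct-caller count stays at or under the cap.

    Call volume is deliberately ignored: one identity calling a permission a
    thousand times needs one exemption, while a hundred identities calling
    it once each would mean a hundred approval requests.
    """
    counts = unique_callers(events)
    return sorted(p for p in candidates
                  if counts.get(p, 0) <= max_unique_callers)


if __name__ == "__main__":
    events = [("role/net-admin", "ec2:CreateInternetGateway")] * 1000
    events += [(f"role/app-{i}", "kms:Decrypt") for i in range(100)]
    print(screen_candidates(
        ["ec2:CreateInternetGateway", "kms:Decrypt"], events))
```

With this data the internet gateway permission passes the screen and decrypt does not, matching the reasoning given for leaving decrypt off the list.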
Yeah, exactly. And so, you know, you have to look at these permissions and know what's
sensitive and what's not. And anyway, so we spent a lot of time on that. It was a fun exercise for
sure.
I imagine it would have to be. How do you wind up then handling the provisioning of
permissions that need to exist all the time? Because an aspect of what you do,
to my understanding, is the concept of permissions on demand.
And so what we do is, and so this is back to those statistics, which are so interesting across this.
So when we looked at what gets provisioned that has sensitive permissions,
and we'll use your AWS example because we've used them before, like EC2 star, Lambda star,
like I couldn't figure out how to get it to work. So I gave it a bunch of services with star,
it started to work and I moved on. So in those scenarios, in every one of those services,
whether it's Lambda or EC2 or whatever it happened to be, CloudTrail, there were some number of
sensitive permissions there. And EC2 has 40
some of them. You can go down to Lambda. It has, I think, 15. Every one of these has some number
of them. And so we said, okay, we are trying to solve the 92% that don't use them. What are we
going to do with the 8% that do use them? And so what we did was, when we initially onboard into
the account, we find that centralized
CloudTrail org trail.
We read backwards in time, just like IAM Access Analyzer does, find all of the identities
that would use it and have used it.
And we suggest those as exemptions.
But we tell you the last time they were used.
Was it used three months ago or was it used yesterday?
Right.
So you get some history in that.
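As a rough illustration of the onboarding step described here, the following Python sketch scans a list of historical trail events and proposes exemptions together with the last time each one was used. It is a minimal sketch under stated assumptions, not Sonrai's implementation: the `TrailEvent` type, the `suggest_exemptions` function, and the two-entry `SENSITIVE_ACTIONS` set are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical, tiny stand-in for a real catalogue of sensitive permissions.
SENSITIVE_ACTIONS = {"ec2:CreateInternetGateway", "lambda:AddPermission"}


@dataclass(frozen=True)
class TrailEvent:
    """Simplified view of one historical API call (assumed shape)."""
    identity_arn: str
    action: str
    event_time: datetime


def suggest_exemptions(events):
    """Return {(identity, sensitive action): most recent time it was used}."""
    last_used = {}
    for ev in events:
        if ev.action not in SENSITIVE_ACTIONS:
            continue  # ordinary permissions never need an exemption
        key = (ev.identity_arn, ev.action)
        if key not in last_used or ev.event_time > last_used[key]:
            last_used[key] = ev.event_time
    return last_used


history = [
    TrailEvent("role/terraform", "ec2:CreateInternetGateway", datetime(2024, 1, 5)),
    TrailEvent("role/terraform", "ec2:CreateInternetGateway", datetime(2024, 4, 30)),
    TrailEvent("role/app", "s3:GetObject", datetime(2024, 4, 29)),
]
exemptions = suggest_exemptions(history)
```

An identity that only ever made non-sensitive calls, like `role/app` above, produces no suggestion, which matches the idea that the 92% keep working without any exemption.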
And we build that exemption list so that when you hit the protect button and it removes
the 92% and leaves the eight, the eight are already there. So you don't have to go
and approve those ones. You previously approved them by giving them EC2 star. We just said they
can continue to do it, but the 92% are now off. They don't get to use sensitive permissions
anymore, but they can continue to work like they always did because they don't use sensitive
permissions anyway. So all the regular workloads work. However, as soon as they try to use one, so something in the 92% all of a sudden tries to
create an internet gateway, which is suspicious in itself, but it does it. We hook on that and
we know that that deny just happened. And we have this approval tree, which basically says you can
set up for any different zone. We'll use accounts in AWS and projects in GCP. Like who's the owner of that that has to approve this?
That team gets notified in a Slack app or a Teams app.
Hey, this Terraform role just woke up
and tried to create an internet gateway.
Do you want to allow that to happen?
They hit approve.
We make a slight change in the cloud.
Think about ABAC access.
All of a sudden that
now can do it. And if they run the Terraform again, it'll work. And the idea is, is the team
that's doing the notifications and the approvals can be the same team for self-approval or it can
be escalated up one level. In your dev account, you should be able to approve yourself to do almost
anything. There should be like larger SCPs that stop you from things. But other than that, yeah.
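The approval tree described above could be modeled roughly as follows. This is a hedged sketch only: the zone names, channel names, `ApprovalRule`, `route_denial`, and `can_approve` are all hypothetical, and a real system would check team membership rather than simply comparing requester and approver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRule:
    channel: str         # chat channel to notify (hypothetical names below)
    self_approval: bool  # may the requester approve their own request?


# Hypothetical approval tree keyed by zone (an AWS account or a GCP project).
APPROVAL_TREE = {
    "dev-account": ApprovalRule(channel="#dev-team", self_approval=True),
    "prod-account": ApprovalRule(channel="#platform-leads", self_approval=False),
}
DEFAULT_RULE = ApprovalRule(channel="#security", self_approval=False)


def route_denial(zone, identity, action):
    """Build the notification for a denied sensitive action in a zone."""
    rule = APPROVAL_TREE.get(zone, DEFAULT_RULE)
    return {
        "notify": rule.channel,
        "message": f"{identity} tried {action} in {zone}. Allow it?",
        "self_approval": rule.self_approval,
    }


def can_approve(zone, requester, approver):
    """Self-approval zones accept anyone; others need a different person."""
    rule = APPROVAL_TREE.get(zone, DEFAULT_RULE)
    return rule.self_approval or requester != approver


note = route_denial("dev-account", "role/terraform", "ec2:CreateInternetGateway")
```

Keeping the rule per zone is what lets a dev account stay low-friction while production escalates up one level.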
Whereas in production, it's, yeah,
your ability to do anything
should be highly constrained
in most typical scaled out companies.
Like it's going to be a bit different at Twitter for pets,
the two person startup versus the, you know, a large bank.
Who can do what and the risk blast radius
is going to be somewhat distinct,
but you know, begin as you mean to go on.
So again, it gives you this great starting point.
You get everything kind of locked down in a hurry. And then because you can get the permissions back
very quickly, literally, and if it's in self-approval, it's literally
Slack message approved, run the thing again.
It doesn't create much friction for the dev team, so they kind of like it.
It's unlike the, we had one customer as a design partner
that was like, I love this story. Everybody here has contributor
in Azure because the process is the same for getting contributor as it is for getting any least
privilege role. So why would you ask for anything less? Right. You know, and you, so you created this
friction for getting any access. And so now it's so hard. Everybody just asks for more than they
need. And what this does is allows you, you can provision it with more, but until you get that
really low friction approval, you won't be able to use it. I might've accidentally discovered upon a source of confirming
some anecdata I was curious about. Someone attempts to do something in an AWS account.
Their role does not let them do it. The approval pops up. Well, first off, what is the time lag
on that rejection hitting? Because historically CloudTrail was racing the fossil record. And it's
gotten better, but not perfect.
And you cannot use, if you're
like as an example, let's say you're writing to a
centralized bucket somewhere and you were to look
in that CloudTrail for these events, it's way
too delayed. They say it can be as bad
as 20 minutes. It's not that bad,
but it is bad. It used to be. It's not anymore.
Yeah, it's still bad.
And so you have to use other mechanisms in the cloud to hook on these things.
It used to be CloudWatch.
There's EventBridge.
And so there's ways that you can hook onto these very special events earlier in the cycle
before they're ever written.
And so you have to find other ways to hook them.
You can't actually do it using the standard CloudTrail mechanisms.
Otherwise, that delay is way too long.
We, again, when all of the stars align, they happen in like four seconds.
When all the stars don't align, it's still under a minute. So it's very fast.
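One way to hook denied calls through EventBridge, rather than waiting for trail files in a bucket, is a rule whose event pattern matches CloudTrail-sourced API events carrying an access-denied error code. The sketch below is an assumption about how such a hook might look, not a description of Sonrai's mechanism; `build_event_pattern` and `summarize_denial` are hypothetical helpers, and the list of error codes is illustrative rather than exhaustive.

```python
# Error codes commonly seen on denied calls; EC2 reports its own variant.
# This list is illustrative, not exhaustive.
DENIED_CODES = ["AccessDenied", "Client.UnauthorizedOperation"]


def build_event_pattern():
    """EventBridge pattern matching API calls that failed authorization."""
    return {
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {"errorCode": DENIED_CODES},
    }


def summarize_denial(event):
    """Pull identity and attempted action out of a matched event.

    Returns None when the event is not a denial. Deriving the action from
    the eventSource prefix is a simplification: some services use an IAM
    prefix that differs from their event source name.
    """
    detail = event.get("detail", {})
    if detail.get("errorCode") not in DENIED_CODES:
        return None
    service = detail.get("eventSource", "").split(".")[0]
    return {
        "identity": detail.get("userIdentity", {}).get("arn"),
        "action": f"{service}:{detail.get('eventName')}",
    }


sample = {
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "CreateInternetGateway",
        "errorCode": "Client.UnauthorizedOperation",
        "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/terraform/run"},
    },
}
denial = summarize_denial(sample)
```

The pattern dictionary would be serialized to JSON and attached to a rule; the handler then turns each matched event into the identity and action that an approver needs to see.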
Which is fair. Otherwise, you have a 10-minute cycle time every time someone thinks that it's
the permissions thing. But no, no, they just hit the wrong endpoint or something.
The second question I have for you is, okay, they get denied. They click the approve button, which I assume hits the API
more or less
synchronously, and then
it winds up enabling that on that role.
From that being time zero,
how long does it take until the change is actually
reflected in the role and
the thing can go through? Is it an
atomic transaction? Is there a
replication delay? And if so, how long is
that delay? That actually happens super, super quick. You know, think about things going through SQS
queues and stuff. It's super fast. It happens in 10 seconds.
Okay. So 10, there is a delay, but it's not massive. Okay.
Not massive.
Because I've often wondered when I do things that look like they should be working and they're not,
it's okay. Well, maybe there's some IAM replication lag going on here.
And usually I have never found that to be true that I'm aware of.
I'm sure there's been once or twice,
especially in far flung regions,
but it's a,
but yeah,
the big problem has been,
no,
no,
I'm just bad at computering.
It is.
It's interesting in AWS.
We have these in the development community.
There's a lot of talking about eventually consistent.
It puts the eventually in the eventual consistency.
Yeah, I like that.
And so there is some of that in AWS.
Generally, it happens within that 10 to 15 seconds, right?
I've had scenarios where it's slightly outside that range.
It's generally not with things like, you know,
we'll say adding a policy to a role and attaching it.
It usually doesn't take more than 10 seconds ever for that to be effective.
Right.
So, you know, in some really crazy busy account,
maybe it hits 15 seconds or something, but they're pretty good.
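Given the 10 to 15 second propagation window mentioned here, a caller that has just been approved might simply retry for a short, bounded period. The following is a minimal sketch under that assumption: `AccessDenied` is a hypothetical stand-in for whatever permission error a real SDK raises, and the 20 second budget and 2 second interval are illustrative defaults, not recommended values.

```python
import time


class AccessDenied(Exception):
    """Hypothetical stand-in for a provider's permission error."""


def retry_after_approval(call, budget_seconds=20.0, interval=2.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Retry `call` while it is denied, until the time budget runs out.

    `sleep` and `clock` are injectable so the loop can be exercised
    without actually waiting.
    """
    deadline = clock() + budget_seconds
    while True:
        try:
            return call()
        except AccessDenied:
            # Give up if another wait would push past the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

Bounding the retry matters because a denial that outlasts the budget is probably a real missing permission, or the wrong endpoint, rather than replication lag.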
I suspect there would probably be a larger latency delay
if you're using this to manage IAM roles on an AWS Outpost.
Oh, yeah, I would think so. Yeah.
There's got to be a sync and caching storage.
If you yank the cable out of the back of it,
it still needs to be able to authenticate and do its thing. Whether that's constant online
or batched updates, I don't know. They haven't given me one of them to play with yet because
one of the requirements is enterprise support, which I might be able to talk around. And the
other is a loading dock attached to my house, which I'm having some trouble with.
Yeah, exactly. You'll need to roll a rack in there somewhere. So we used to put, you know,
racks, small racks under our desk, didn't we? That was, that was something from 10 years ago. I don't
do that anymore. Oh yeah. People still do with Mac minis for the build servers. They usually
call them Bruno because when the auditor comes around, you do not talk about Bruno,
but the benefit now is that with their, with the EC2 Mac instances, yes, it winds up being
hundreds of bucks a month, but okay, fine for a build server that I can now treat like everything else.
Cheap at twice the price.
Security is one of those fun things.
I guess I have to wonder, though, how do you avoid being the inherent scapegoat for every
time something doesn't work in a cloud account, which is all the time?
Because you're now the department of literal no here where you say, no, you cannot do that thing. How do you avoid being the constant blame target? Yeah, there was, when we were going
through this, there were two parts of that. One is we were, you know, the person putting least
privilege policies in for years anyway. So we learned a lot about, as an example, if you were
to compare our perfect least privilege policy with the AWS Access Analyzer one,
we are not quite so restrictive as they are because the reality is we know that even humans and workloads have things that are somewhat similar that they do that they don't do all the
time. So you should put that stuff together. So we got better at not being the no as often,
but that doesn't solve the problem. You're still the department of no because you're absolutely
denying something that somebody is now trying to do and they need to do to get their job done.
And so when we did this, the reason this permissions on demand uses this chat ops method where it's communicating back to the team that's doing the work instantly.
Within seconds, you're getting notified.
The thing you just did, we said no to.
And if you are the approver, press this button and you can continue.
Or if you're not the approver, this is who it went to.
You need to talk to Joe.
A huge part of the solution was making sure that that whole cycle end to end from the
time that you were actually denied in the cloud and you're now sitting there staring
at an error message to the point of you getting notified in your other channel and somebody
hitting approve could happen in, we set a goal for ourselves, less than one minute.
Has to be, that whole cycle must be less than one minute interruption in your day.
And now, again, if you talk about that big bang with the large level approval in the prod account,
I agree that the approver may not press the button in a minute, but again, we'll,
you'll probably have to ask some questions, but the actual software time end to end really is
less than a minute. And so that's how we
got out of it. You have to say no sometimes. We are security people. That's what we do, and we do
it for the right reasons. But if you can get the workflow where the work happens quickly and they
can get out of jail quickly, it doesn't become an impediment to them. And we did a few other tricks
too in the sensitive permissions to do some groupings to say, if you're using this one,
you're going to use these other four too.
So we might as well give them back to you
at the same time.
So we did some tricks there
where you didn't keep running
into the same block
over and over and over
in these workloads.
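The grouping trick described above can be pictured as a lookup that expands one approved permission into the set that usually travels with it. This is a sketch only: the groups listed are invented for illustration and are not Sonrai's actual groupings, and `expand_approval` is a hypothetical helper.

```python
# Hypothetical groupings of sensitive permissions that tend to be used together.
PERMISSION_GROUPS = [
    {"ec2:CreateInternetGateway", "ec2:AttachInternetGateway", "ec2:CreateRoute"},
    {"lambda:AddPermission", "lambda:CreateFunctionUrlConfig"},
]


def expand_approval(action):
    """Return the approved action plus any permissions grouped with it."""
    granted = {action}
    for group in PERMISSION_GROUPS:
        if action in group:
            granted |= group
    return granted


granted = expand_approval("ec2:CreateInternetGateway")
```

Approving the first action in a workflow then also covers the follow-on calls, so the same run does not hit a second and third denial.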
I really like the story
about what you've built.
If people want to learn more,
where's the best place
for them to find you?
Look, we have,
and this is another big change
for Sonrai Security.
We were typically selling
to these large banks and financials and we were very much an enterprise security sale at that point.
But now it has you start at $100 a month, so that's not unreasonable.
Completely different. We decided to say, look, we want to help people with 10 accounts. And so pricing's on the website. There's a free 14-day trial for anyone that wants to try it. And by the way, 14 days gives you enough to onboard it.
We will see all of your history that you've done before.
We'll find your exceptions.
So it's in monitor mode.
You can see it and you can try it in a dev account
and get the permissions on demand working
all in that 14 days.
You'll know if it works.
It's awesome.
And so super easy to do that from the website.
There's a click-through demo on the website.
I always say the sales guys have a block in it.
Somewhere in the middle where they make you put your email in. And it's
around the 14th click, which is sort of annoying, but well, they're salespeople,
so that's what they should do. That's what tagged email addresses are for.
Yeah, exactly. Exactly. Plus something on your Google thing. So it's super easy. Just come to
the website, all the great stuff there. There's some good blog content there on the sensitive
permissions and how we did that and lots of identity stuff. Awesome. And we'll of course,
put a link to that in the show notes.
Thank you so much for taking the time
to speak with me about this.
I really appreciate it.
Thank you, Corey.
It's been great.
Sandy Bird, co-founder and CTO of Sonrai Security.
I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud.
If you enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you hated this podcast,
please leave a five-star review on your podcast platform of choice,
along with an angry, insulting comment that I will no doubt use as a database of sorts
because your podcast platform of choice almost certainly did not pay attention to least privilege.