Screaming in the Cloud - Firewalls, Zombies, and Cloud Permissions Security with Sandy Bird

Episode Date: May 2, 2024

On this Featured Guest episode of Screaming in the Cloud, Corey is joined by Sandy Bird, Co-Founder and CTO of Sonrai Security. The two discuss the current state of cloud permissions security, and Sandy details the company’s breakthrough Cloud Permissions Firewall which promises fast and scalable cloud least privilege all with one click. Corey and Sandy also talk about bunk AWS tools in this space, the insanely high “zombie” population in the cloud, and how Sonrai works for companies of all sizes.

Highlights:
(00:00) Welcome to Screaming in the Cloud with Corey Quinn
(00:50) Sponsored Ad
(01:32) Exploring Sonrai Security's Mission and Challenges
(03:38) Introducing the Cloud Permissions Firewall Concept
(05:59) Comparing Cloud Providers' Permissions Models
(09:49) Sponsored Ad
(10:12) Addressing the Zombie Identity Problem
(16:44) Scaling Solutions for Different Company Sizes
(20:10) Navigating Cloud Security Challenges
(23:38) Innovative Approaches to Permission Management
(25:27) Optimizing Permission Requests with Statistics
(27:04) Improving Cloud Security with Permissions on Demand
(35:15) Concluding Thoughts and Contact

About Sandy:
Sandy Bird is the co-founder and CTO of Sonrai Security, helping enterprises protect their data by securing cloud identities and access. Sandy was the co-founder and CTO of Q1 Labs, which was acquired by IBM in 2011. At IBM, Sandy became the CTO for the global security business and worked closely with research, development, marketing and sales to develop new and innovative solutions to help the IBM Security business grow to ~$2B in annual revenue. He is a trusted and experienced cloud security expert.

Links referenced:
Sonrai Security Website: https://sonrai.co/screaming-cloud
Free 14-Day Trial: https://sonrai.co/screaming-trial
Sandy’s LinkedIn: https://www.linkedin.com/in/sandy-bird-835b5576/
Sponsor Sonrai Security: https://sonrai.co/screaming-cloud

Transcript
Starting point is 00:00:00 Our existing customer base, the people that really cared about least privilege, were like large financials. They actually had the staff to put in place to kind of monitor that you're actually getting to least privilege and cut the tickets, and they could afford the extra developer time to do it right. And so those customers, we found as a pattern, not only cared about least privilege, they were really good at writing, we use the example of AWS SCPs, Azure Policy, things like that, to basically block the undesired activity. Welcome to Screaming in the Cloud.
Starting point is 00:00:35 I'm Corey Quinn. And this promoted guest episode is brought to us by our friends at Sonrai Security. Also brought to us is their own co-founder and CTO, Sandy Bird. Sandy, thank you for joining me. Thanks for having me, Corey. Do you know what's more old school than blowing on a Nintendo cartridge to make it work? Manually creating individual policies to achieve least privilege in your cloud. Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click. With a cloud permissions firewall, you can easily restrict excessive permissions from human and machine identities,
Starting point is 00:01:14 quarantine unused identities, and restrict specific regions and unused services with the click of a button. Start a 14-day free trial for Sonrai's cloud permissions firewall at sonrai.co slash screaming. That's S-O-N-R-A-I dot C-O slash screaming. So, taking it from the top, I suppose. I don't believe I'd heard of Sonrai before you had reached out. What is it you folks do over there? Yeah, for the last five years, four to five years, we have focused on getting identities that are in AWS, Azure, GCP to least privilege. So you can think about that as looking
Starting point is 00:01:51 at the history of what they do, generating a better policy, applying that policy to that particular identity, and now it's at least privilege. We've learned a lot in four years. That probably is in some ways, and I hate to say this fool's errand because you have so many identities that doing them one at a time, unless you have some way to completely automate that and trust the automation is almost impossible. So effectively you take existing permission sets
Starting point is 00:02:14 in various cloud accounts and then prune them down to a minimum viable privilege so that people can do their roles, but they don't just have casual access to things that they don't need. Is that directionally correct? That was, again, I always call that, you know, as you build these companies, Sonrai 1.0, right? And it was our thesis, which was, and I had this great thesis. It was because cloud logged
Starting point is 00:02:39 everything, it doesn't actually log everything, but let's pretend it logs most things. We would be able to look at every resource and get these perfect policies for it. And then over time, we adapted those policies to make them a little less restrictive. Really annoying when you're using the console and somebody's taken away every single thing you've never done before and you browse around the console and everything is broke when you get there. That's not such a great experience. So we made those, we'll call them least restrictive, least privileged policies.
Starting point is 00:03:05 But we came into this conclusion about a year ago that we would monitor our customers. We had this great customer that was super successful at this. They had built it into their thing. They put Jira tickets in for people. They fixed their terraform. They would test it in UAT and then they would roll it to production and be like, oh, we were super successful. We measured that timing and it was like over a 10-month period,
Starting point is 00:03:27 they fixed 2,000 or 3,000 identities. And that's pretty successful until you realize they generated more than 2,000 identities in that same period. And then you're like, oh, this isn't working. We're getting more and more efficient at pushing this boulder up the hill continuously. It is, right? And so a year ago,
Starting point is 00:03:40 we took this kind of flip it on its head model and said, there has to be a better way to do this. And so we created this thing called a cloud permissions firewall. I'm curious, Corey, what do you think of that name? Not knowing even exactly what it does yet. What do you think of the name? Oh, I think it's brilliant marketing because you're not going to be able to get into RSA unless you have a firewall to sell someone. I mean, that's basically their entire shtick. So great. I'd call basically anything a firewall if it gets me access to people I need to market to. It's great. It also explains, based upon what I'm thinking off the top of my head, that that's something that helps explain something sort of esoteric, which is effectively identity as perimeter, which is what we're talking about here, and explaining it to people who still
Starting point is 00:04:19 think in terms of firewalls. See again, RSA. And you've kind of hit it on the head. It was a really touchy topic around here as we were naming it, because part of it, as you say, is this very old school name, which, by the way, is even older than networks, right? We have firewalls between apartment buildings and we have firewalls in our car. But the reality is it's a very old kind of term. And we didn't know if people would be able to make this bridge into this, as you say, identity as the new firewall world. But when we started thinking about it more, as we kind of built this new model, it really is flipped on its head to be a deny first model for identity, but only for the most
Starting point is 00:04:54 sensitive permissions. If you actually took every single identity, there's, I think it's up to like 43,000 permissions across those three main cloud providers. Now it's insane. And it grows every day, literally every day. There's more permissions. If you did that and tried to protect all 43,000 of them, everybody would be using something new at some point in time, and it would just be super annoying. However, if you took,
Starting point is 00:05:14 we actually did this piece of work to find all of the really sensitive stuff: create a new internet gateway, create a pre-signed URL, copy a snapshot to another place. The things that actually leak data, poke holes in your world, destroy the cloud, these types of things. We got it down to about 3,000 permissions across those three clouds,
Starting point is 00:05:34 plus or minus a few. And when we looked at it that way, we could flip the model. We could say, now we can build deny first for those 3,000 permissions. And if you have the other ones, they're not as restrictive. You should go back to our old model, build least privileged policies for it. But if you don't get it, we can take most of the risk out of this by protecting the 3,000 centrally. So it's a different model, super effective, super fast to getting it done. You can get it done in a week versus 10 months. I have a lot of thoughts on the idea of permissions in cloud and least privilege.
Starting point is 00:06:09 Two almost diametrically opposed philosophies, at least the last time I dug into this in any depth, AWS and GCP. By default, nothing in AWS can talk to anything full stop. Whereas in GCP, everything within a project generally can speak to everything within that project until you start isolating things down. And security purists love to turn up their nose at the Google approach, but I think it is the better way to start. Otherwise, you wind up with what everyone does in AWS. You try and just give it the permissions it needs, and then something doesn't work, and you expand it a bit, and it still doesn't work,
Starting point is 00:06:34 and you try and expand it yet again, and it still doesn't work, and then you just give it full access to do things. With a to-do, fix this later, and the to-do hangs around longer than any five employees in your company. I think you've nailed it on the head and it will, I'll bleed Azure in here too,
Starting point is 00:06:48 just to really mess the world up. Right? So in AWS, you have this expanding of the wildcard problem. You don't know what the permissions are underneath of them. And so people just, as an example, give it EC2 star,
Starting point is 00:06:58 Lambda star, whatever they need to get the thing done. Then they find out they need a PassRole. So they add, you know, IAM. You also need to be able to talk to CloudTrail logs, or they won't be able to charge you out the wazoo. That's right. Exactly. So you have all of this massive permission set in AWS.
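As an editorial aside, the wildcard-expansion problem described here can be made concrete with a minimal Python sketch. The action list is a tiny hand-picked sample (the real EC2, Lambda and IAM action lists are far longer), and `expand_actions` is a hypothetical helper written for illustration, not an AWS API.

```python
from fnmatch import fnmatchcase

# Tiny illustrative sample; real services expose hundreds of actions.
SAMPLE_ACTIONS = [
    "ec2:DescribeInstances",
    "ec2:RunInstances",
    "ec2:CreateInternetGateway",
    "ec2:ModifySnapshotAttribute",
    "lambda:InvokeFunction",
    "lambda:AddPermission",
    "iam:PassRole",
]


def expand_actions(patterns, known_actions=SAMPLE_ACTIONS):
    """Return every known action matched by IAM-style patterns such as 'ec2:*'."""
    matched = []
    for action in known_actions:
        # IAM action names are case-insensitive, so compare in lower case.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            matched.append(action)
    return matched


# Three short policy entries quietly grant every action in the sample.
granted = expand_actions(["ec2:*", "lambda:*", "iam:PassRole"])
print(len(granted), "actions granted by 3 policy entries")
```

Even in this toy sample, the person who wrote `ec2:*` to launch an instance also handed out the ability to create an internet gateway and change snapshot sharing, which is the kind of sensitive permission the conversation keeps returning to.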
Starting point is 00:07:13 One thing that's neat about the AWS model, though, is it is a deny first model. And so if you can get a deny somewhere in that path of your identity, you can deny something. And no matter how many times other things grant it, it will still be denied. As you say, GCP is a little different than that. It has these kind of very open projects, right? And we always, we pick on people for their service accounts that can act as anything else, including all the other service accounts in the project. But it is still a deny first model. And about a year and a half ago, maybe a little longer ago, GCP put a binding in that's very special that allows you to create a deny on a permission. And you can actually build exemptions around that using different principals that they
Starting point is 00:07:57 have. And so you can actually get a pretty good deny-first model in GCP. But as you say, at least it starts in a usable form where things aren't open to the entire GCP cloud. They're at least limited to that project. So it's a little bit better in some ways. Although sometimes I equate a project to an account in AWS. And again, we can talk about how open those are. Azure is really backwards. Azure is an allow-first model. So no matter how many denies you have and how many policies you've written, if anywhere in Azure there's one statement that says you're allowed to do it, you're allowed to do it. And so you have to think completely differently in Azure when you go to correct these things, because you can't, you can still create a deny first model, but you have to understand
Starting point is 00:08:37 all the inheritance and everything for doing that. And depending on where you're putting the rules, there are other things that can override them. So anyway, we could spend a lot of time on that. I've been beating the drum for ages that Azure security is deeply flawed across a variety of different levels. I wasn't even aware of this and just add it to the pile at this point. Although I will give them credit. They're the most cost-effective cloud just because how easy it is to run your stuff on someone else's account.
Starting point is 00:08:58 Yeah, well, there you go. We could spend a lot of time for it, but I'm going to go on a side tangent in Eric's discovery of Azure. So we were spending time building this particular project, and we were looking for ways to basically think we're going to talk about zombies later. I love zombies. We're going to talk about zombies and cleaning zombies up. But we were trying to find ways to make sure that we would know if something happened in Azure that was denied.
Starting point is 00:09:21 What we discovered was almost nothing in Azure that is denied is logged to their centralized logging. It shows up in the screen of the person who is denied in their console, you'll get a deny or in the SDK, you'll get the deny. But then when you go to look at the activity logs, no matter what you turn on, the diagnostic logs, all these things, no deny log. And it's not every permission, but it's huge numbers of them, which is really interesting in Azure. Anyway, side tangent, we could go down that one for a while. I want to go to the zombie thing that you're talking about, because I suspect I may have a real-world story that is germane to this. If you're going the same place, I think you're
Starting point is 00:09:58 going with it. But please, tell me more. Yeah, so we were doing our research in building this flip-the-model-on-its-head approach and doing this cloud permissions firewall instead of this, you know, let's fix every identity. And one of the statistics we started looking at was how many of these identities are completely unused. So they have permissions attached to them. Some of them are really sensitive. Some of them are benign,
Starting point is 00:10:18 but they just have something attached to them and they're sitting in the cloud and they're completely unused. And we took a large chunk of our customers, big, big enterprise customers that have thousands of accounts and then little small customers that have 10 or 15. And there was this interesting stat that the longer you were in cloud, the more of these identities that you had, which we nicknamed zombies, that were sitting there with all these permissions that weren't used. And it's really scary when you started looking at companies that were in cloud for more than five years.
Starting point is 00:10:47 So they had history. It was like 75% of the identities kicking around were unused. That high. It was insane how high it was. Some were worse, actually. That was an average. So it's pretty bad, actually. And all that stuff, of course, opens up risk in the environment. Well, so does closing them. And that's the challenge I have around this, because depending on what your sampling window is, there are things that only run
Starting point is 00:11:12 once a quarter, for example. So if it's not at least 90 days, you're going to catch some of those things out. And then you have some very frantic, very upset business people wondering why something isn't working. But the one that I care about the most from the old world IT ops side of the world is the break glass scripts, the things that you have sitting somewhere that don't normally run in the course of business. I have one now in my personal account where everything for my dev box, everything is on my tailscale network. On the off chance that that isn't working for whatever reason, I can hit a Lambda endpoint with a pre-stored key. And all that does is it changes the security group to open up port 22. So I can SSH into the thing with an actual credential and
Starting point is 00:11:51 continue from there. That is something that I don't think I've ever used it other than when I built it and tested it. Easy for something like this to view that as, oh, you don't need this around and you're right until suddenly I will very much need that in some weird networking circumstance, and it won't work. How do you avoid that trap? Look, Corey, I think you nailed the last four years of my life. We have this great CIEM solution, Cloud Infrastructure Entitlements Management, another acronym by Gartner. And we've been trying to get people to clean up these zombie identities forever. And there's really kind of, you said, two ways, and there's actually a bit of a third, which is part of your first solution, which is the break glass accounts are never supposed to be used, as you said, and we should never get rid of them. That's also a small handful of them,
Starting point is 00:12:35 though, to be clear, as opposed to the huge amount of things that got spun up as detritus of other things. Exactly. And I would argue you really should know what they are. The second part is more like your solution, though, where you have another team that's built something. It might be a yearly report. It's named really weird. And you as the cloud ops person at the top of the configuration setup where if I hit this Lambda endpoint, then that will do something which changes this. And there may be a resource group on that that trusts that Lambda function. And so it's this encompassing workload. It can get worse if it's an IAM user, which maybe you shouldn't be doing this, but has an access key cut on it. And you delete any of those things. What happens is not only do you delete the identity or the IAM user, you delete
Starting point is 00:13:25 the access. So you've lost the key material. You've removed all of the permissions. And now that identity that's trusted through some trust relationship on some other resource doesn't exist anymore. And so if you had to put it back, you wouldn't even know how to do it. You wouldn't have the original state. This is the guidance I give customers when they're talking about, we don't think this thing's being used, but we're not sure. How do we find out? And it's like, well, if you turn it off and no one knows what it is and something breaks, that's going to be challenging.
Starting point is 00:13:51 And not because there's really no warn if reject on a lot of these things. Great. Let's change security groups so nothing can talk to it and leave it there for some period of time. Check the instance role. Is it doing anything during that sampling period? And at some point, then go ahead and stop it without terminating it and let it go another period. And then there's the scream test. When you block access to it, who screams?
Starting point is 00:14:13 That's on some level sounds like what you're talking about. It is exactly it. And so what we did was we basically said, that's fine. Let's leave all the permissions intact and then basically short circuit it using a deny star. So it doesn't work anymore. And what we did was we have this second part of our product, which we call permissions on demand. And what that does is it listens for the wake up. So if it sees an attempt to be used after nine months, it sends a message via chat ops, Slack, Teams, email, if you're into that sort of thing, which maybe I am, but everyone else likes Slack. You get this message that says, hey, this thing just tried to wake up. Do you want to reanimate the zombie?
Starting point is 00:14:49 And if you do, you hit yes, I want to reanimate it. The thing tries again, and it's going to work. You could interrupt something, as you say, by turning it off. Who screamed? But you give this person screaming the ability to approve and turn the thing back on. And then after some period of time, hopefully you do become comfortable and say, this thing is really not used. You should move it away.
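As an editorial aside, the "leave the permissions intact and short-circuit with a deny star" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sonrai's implementation: it builds an AWS service control policy (SCP) style document that denies every action to a list of quarantined principals, using the `aws:PrincipalArn` condition key. The function names, the `Sid`, and the example ARNs are hypothetical.

```python
import json


def build_quarantine_scp(quarantined_arns):
    """Build an SCP-style document that denies every action to the listed principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "QuarantineUnusedIdentities",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # The deny only applies when the caller is one of the quarantined ARNs.
                "Condition": {
                    "ArnLike": {"aws:PrincipalArn": sorted(quarantined_arns)}
                },
            }
        ],
    }


def reanimate(quarantined_arns, arn):
    """Return a new quarantine set without `arn`, as if an approver clicked yes."""
    return set(quarantined_arns) - {arn}


# Hypothetical zombie identities in a made-up account.
zombies = {
    "arn:aws:iam::111122223333:role/yearly-report",
    "arn:aws:iam::111122223333:role/old-migration",
}
policy = build_quarantine_scp(zombies)

# Someone approves the wake-up request for the yearly report role.
zombies = reanimate(zombies, "arn:aws:iam::111122223333:role/yearly-report")
print(json.dumps(build_quarantine_scp(zombies), indent=2))
```

Because the identity, its attached policies, and any trust relationships are never deleted, undoing the quarantine is just a matter of dropping one ARN from the condition list, which is what makes this safer than the delete-and-see approach discussed earlier.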
Starting point is 00:15:08 But you do have to put in the exemptions for like the break glass accounts, right? You know what those are. So we have in our product this way that you can actually put them in as exemptions. And of course they will never get blocked. But I actually think it's one of the most powerful parts of the product is being able to remove that.
Starting point is 00:15:20 Because what we find is, is that they show up in these lateral movement chains. So this identity can get to this identity, which then can get to this unused identity, and then it can do all kinds of havoc. And by actually short-circuiting them, they no longer laterally move through them. Do you know what's more old school than blowing on a Nintendo cartridge to make it work? Manually creating individual policies to achieve least privilege in your cloud. Leave old habits in the past and lock down access to sensitive permissions and services without disrupting DevOps with a single click.
Starting point is 00:15:52 With a cloud permissions firewall, you can easily restrict excessive permissions from human and machine identities, quarantine unused identities, and restrict specific regions and unused services with the click of a button. Start a 14-day free trial for Sonrai's cloud permissions firewall at sonrai.co slash screaming. That's S-O-N-R-A-I dot C-O slash screaming. It seems like it's one of those fun places that you can get lost in if you're not careful. It feels like this is something that works super well for certain scales of company. This sounds great, even on my own test account, which is awesome. I can see it working at small to medium scale. What I start to wonder is, at enterprise scale,
Starting point is 00:16:35 where in some cases I have clients spending hundreds of millions a year across thousands of accounts, and at that point it's so diffuse that it becomes difficult to reason about any of these things in any holistic way. Is there a sweet spot that you've found this is best resonating with, or is this one of those rarities that actually does apply to theoretically every cloud customer? How we came up with the solution is kind of interesting, and sometimes you have to get beat up a lot to figure out where you need to be in these things. And so we had our existing customer base, the people that really cared about least privilege were like large financials. They actually had the staff to put in place to kind of monitor you're actually getting the least privilege and cut the tickets and they could
Starting point is 00:17:16 afford the extra developer time to do it right. And so those customers we found as a pattern, not only cared about least privilege, they were really good at writing, we use the example of AWS SCPs, Azure Policy, things like that, to basically block the undesired activity. But they probably had a team of people doing that. When we went to our customer base that was, we'll call them large-scale cloud, but not as highly governed or as highly mature. And it was typically a team of four people that ran the whole cloud infrastructure and they were responsible for everything end-to-end.
Starting point is 00:17:53 They didn't have the cycles to put into monitoring to get to least privilege. They didn't have the people to write SCPs. They didn't have that. And so cloud was kind of a mess, a growing mess as it went. And so when we were building the solution, we were trying not to build it for that highly governed seven people writing SCPs. They just knew what to do when they
Starting point is 00:18:12 were doing it well. We were trying to write it for that team that was, man, we're understaffed. We've got to get to least privilege from whatever compliance regime we're under. We're supposed to get to least privilege, but we can't do it. This gave them a way to get there fast and easy and didn't disrupt anything. Because we have this option where we find all of the exemptions based on the history, we put those in automatically, and then you really only have to worry about day plus one where you use permissions on demand. It's been interesting actually building the product and exposing it back to some of those larger, highly governed companies. And what we found was they too struggle with SCPs because if you look at SCPs, there's
Starting point is 00:18:51 SCP space limits. There's the number of them you can attach. There's all these weird constraints you have to deal with. And some of the stuff we had to do to solve those problems is actually even applicable to them. So by no means is this a for-everybody solution. If you're the purist and you can afford the staff to get to least privilege, I would agree. You should do that. That's the
Starting point is 00:19:09 perfect way to do it. However, for the people that can't do that and can't achieve it, this is a much better solution. Scale, what's neat about this is you can start in one account. You can monitor the whole org and say, I'm just going to start in development in this one area, and then you can kind of work your way up through it. You don't have to do it on day one. And we've built the SCP scaling such that it works across thousands of accounts or across 10 accounts, whichever you happen to have. That's a neat approach. On some level, on paper, it sounds like if you use just the lens of AWS, they have a few offerings that make what you do irrelevant. They have the IAM Access Analyzer, which in turn now can generate policies
Starting point is 00:19:47 based upon what you actually use. And that would be awesome and would basically be like, well, why would I ever need to use what you built? Except for the fact it doesn't freaking work or it works, but it doesn't go far enough. Where, oh, we saw that this role used the DynamoDB write table option. Okay, great.
Starting point is 00:20:07 Can you tell me what table you're up to? No, go guess. Then what's the point? Like, you don't get to be specific enough. Like, what I would love to see is something that it auto-generates a policy of, okay, based upon our observed behavior during the capture window, you're able to write to the following S3 keys. Like, okay, great.
Starting point is 00:20:25 Let's back that up a little bit. Give it a prefix or a bucket or something. But yeah, that's the direction. Let me broaden it. Because otherwise you wind up in the hell that I'm still in with one of my code build roles that does deployments where it has full access to spin things up in a given account.
Starting point is 00:20:41 To be clear, this is for my newsletter stuff. This is not for my production stuff touching client data. Different universes here. But yeah, it still has full access because every time I've tried to dial it in, it's a problem because first it has the ongoing updates to things when it does deployments, its permission
Starting point is 00:20:57 set, but it needed a separate permission set entirely to provision those things in the first place the first time I ran it. So there's a question of, great, how do I dial those in? It's okay to discard those extra permissions now, but every time I thought I had it working, I'd make one small change and boom, I'm back to square one. So I gave up. Yeah. And it's a common pattern. It is, again, I lived the last four years of my life before this particular new product thinking I wanted to be that purist too, right? I want to get everything absolutely perfect.
Starting point is 00:21:25 And then after looking at these customers struggle with, you know, some of these accounts are huge, right? You know, 50,000 plus identities that have sensitive permissions that haven't used them in the last whatever. We failed and we had to get a better model. And the only way to do that was to start with a smaller subset.
Starting point is 00:21:40 We couldn't do every single permission, right? But you could do the sensitive ones. And by doing the sensitive ones, you could remove that. Everything you have, the important stuff gets buried under an avalanche of random trivia. And I also think what's interesting about, you know, you look at your problem and when you're looking at, you know, the DynamoDB tables or the S3 keys or the prefixes, I don't even think half the people know, like you have to turn on extra auditing even to see that stuff.
Starting point is 00:22:04 And not every service in Amazon even supports that auditing. So, you know, doing it is super hard. Oh, and you need data events to get those logged in CloudTrail. I've done the numbers on this. The API call to read an object in S3 will show up there in the CloudTrail data events, and it will cost 20 times as much as the API call for the read, which, okay, but that's not going to solve every problem for everyone. I understand that there's value in security and some things should be paid for, but I firmly believe that providers should not be charging extra for things that only they can provide. If they want to go head to head of, we'll ingest syslog and do these analytics stuff, yeah, by all means, charge away for that, because I have a half dozen options. And honestly, I still like awk with occasional grep tied to it, and that gets me surprisingly far,
Starting point is 00:22:49 especially if I sprinkle in some Perl, and that's great. Other times you can send it to Datadog or Splunk, if you have a spare princess lying around you can ransom back to whatever kingdom she's from. Awesome. That's an open field. But I've got to pay for these audit CloudTrail events because there's no second option for me to pay someone for that. No, it's amazing when you look at the amount and, you know, there's other quirks in there too as we're talking to your audience, right?
Starting point is 00:23:14 If you run two CloudTrails, now you really, really get burned because you only get one for free and the second one costs more and then you're storing the data twice. You ever see CloudTrail paid events? It's usually, usually a sign that something's misconfigured. Very occasionally, especially at finance institutions, I see security teams want an unadulterated CloudTrail that no one else can
Starting point is 00:23:34 see for whatever reason, and they refuse to share it onward. Cool. I can tell you down to the penny what that cost you last month. Great. Make your own decisions. I'm here to advise. I'm not here to make decisions for you. You clearly have context. I don't. That's the nature of respecting your customers' businesses. But it's frustrating to see that misconfiguration. It feels like a tax on not knowing this one weird piece of trivia about AWS. Yeah. Anyway, again, this all led us down this path to say, if we're going to try to fix this for the average customer, right, that doesn't have the team from those large financials that can justify for cloud trails, for some reason, we had to do it in a way that was, you know, click a button, make the thing
Starting point is 00:24:14 happen. And the way that that, you know, worked for us was we did a lot of statistics across these sensitive permissions. So we did a lot of work figuring out what those 3000 sensitive permissions were. When we looked across our customers, we did throw a few of them out that were, they were called a lot and they were called by disparate identities. So it would have been a lot of different identities doing this thing.
Starting point is 00:24:37 And we said, well, that's too many permissions-on-demand requests. You'd have to approve it too many times. And we kind of graded ourselves to say, anything we're going to put in this list has to be something that's called, it can be called a lot by a single identity, but the number of unique identities that call it in a period of time, it needs to be somewhat small. It's not called at all, but it can't have a hundred different identities in one account
Starting point is 00:25:00 calling it. And that was kind of this kind of guiding light to us to say, okay, well, you know what? You create your internet gateway one time and you never really touch it again. So that's a sensitive permission. You do things like, you would say, well, decrypt must be sensitive. Decrypt gets called by everything in your cloud. So we can't use decrypt as a sensitive permission, right? And so you use this as your guiding light to figure out what these are. And Azure has some crazy ones, by the way. There's stuff in Azure that allows you to take like a file system off a running VM and make it a URL on the internet. New database just dropped, I guess. Yeah, exactly. And so, you know, you have to look at these permissions and know what's
Starting point is 00:25:39 sensitive and what's not. And anyway, so we spent a lot of time on that. It was a fun exercise for sure. I imagine it would have to be. How do you wind up then handling the provisioning of permissions that need to exist all the time? Because an aspect of what you do, to my understanding, is the concept of permissions on demand. And so what we do is, and so this is back to those statistics, which are so interesting across this. So when we looked at what gets provisioned that has sensitive permissions, and we'll use your AWS example because we've used them before, like EC2 star, Lambda star, like I couldn't figure out how to get it to work. So I gave it a bunch of services with star,
Starting point is 00:26:12 it started to work and I moved on. So in those scenarios, in every one of those services, whether it's Lambda or EC2 or whatever it happened to be, CloudTrail, there were some number of sensitive permissions there. And EC2 has 40-some of them. You can go down to Lambda. It has, I think, 15. Every one of these has some number of them. And so we said, okay, we are trying to solve the 92% that don't use them. What are we going to do with the 8% that do use them? And so what we did was, when we initially onboard into the account, we find that centralized cloud org trail.
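The grading rule described earlier (a permission may be called often by one identity, but the number of unique identities calling it per account must stay small) can be sketched roughly as follows. This is only an illustration under stated assumptions: the threshold value, the function name pick_sensitive, and the event tuple shape are hypothetical and do not come from the episode.

```python
from collections import defaultdict

# Hypothetical threshold: the episode only says the count must be "somewhat small".
MAX_UNIQUE_IDENTITIES = 5


def pick_sensitive(events, candidates, max_identities=MAX_UNIQUE_IDENTITIES):
    """Keep candidate permissions that few distinct identities call per account.

    events: iterable of (account, identity, permission) tuples from one time window.
    candidates: set of permissions being considered for the sensitive list.
    """
    callers = defaultdict(set)  # (account, permission) -> identities that called it
    for account, identity, permission in events:
        if permission in candidates:
            callers[(account, permission)].add(identity)

    # Drop any permission that too many identities call in at least one account.
    too_common = {
        permission
        for (_, permission), identities in callers.items()
        if len(identities) > max_identities
    }
    return sorted(set(candidates) - too_common)


# Decrypt is called by a hundred identities; the gateway call by a single one.
events = [("acct-1", f"role-{i}", "kms:Decrypt") for i in range(100)]
events += [("acct-1", "terraform", "ec2:CreateInternetGateway")] * 3
print(pick_sensitive(events, {"kms:Decrypt", "ec2:CreateInternetGateway"}))
```

Call volume from a single identity does not matter here; only the count of distinct callers does, which is why a widely used call like decrypt drops out of the list.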
Starting point is 00:26:47 We read backwards in time, just like IAM Access Analyzer does, find all of the identities that would use it and have used it. And we suggest those as exemptions. But we tell you the last time they were used. Was it used three months ago or was it used yesterday? Right. So you get some history in that. And we build that exemption list so that when you hit the protect button and it removes
Starting point is 00:27:04 the 92% and leaves the eight, the eight are already there. So you don't have to go and approve those ones. You previously approved them by giving them EC2 star. We just said they can continue to do it, but the 92% are now off. They don't get to use sensitive permissions anymore, but they can continue to work like they always did because they don't use sensitive permissions anyway. So all the regular workloads work. However, as soon as they try to use one, so something in the 92%, all of a sudden tries to create an internet gateway, which is suspicious in itself, but it does it. We hook on that and we know that that deny just happened. And we have this approval tree, which basically says you can set up for any different zone. We'll use accounts in AWS and projects in GCP. Like who's the owner of that that has to approve this?
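The exemption flow just described (read the trail history, record who used a sensitive permission and when, then deny sensitive calls for everyone else) might look something like the sketch below. It is a simplified model, not the product's implementation: the SENSITIVE set, the function names, and the record shape are all assumptions made for illustration.

```python
from datetime import datetime

# Illustrative sensitive list; the real list discussed in the episode has about 3,000 entries.
SENSITIVE = {"ec2:CreateInternetGateway"}


def build_exemptions(history):
    """Return {identity: {permission: last_used}} for sensitive permissions only.

    history: iterable of (identity, permission, timestamp) read back from the trail.
    """
    exemptions = {}
    for identity, permission, when in history:
        if permission not in SENSITIVE:
            continue
        used = exemptions.setdefault(identity, {})
        # Keep the most recent use so a reviewer can see how stale the exemption is.
        if permission not in used or when > used[permission]:
            used[permission] = when
    return exemptions


def is_allowed(identity, permission, exemptions):
    """Non-sensitive calls always pass; sensitive calls need an exemption."""
    if permission not in SENSITIVE:
        return True
    return permission in exemptions.get(identity, {})


history = [
    ("network-admin", "ec2:CreateInternetGateway", datetime(2024, 1, 10)),
    ("network-admin", "ec2:CreateInternetGateway", datetime(2024, 4, 1)),
    ("app-role", "ec2:DescribeInstances", datetime(2024, 4, 2)),
]
exemptions = build_exemptions(history)
print(is_allowed("network-admin", "ec2:CreateInternetGateway", exemptions))
print(is_allowed("app-role", "ec2:CreateInternetGateway", exemptions))
```

Storing the last-used timestamp alongside each exemption mirrors the point made above: an identity that last used a permission three months ago is a different review decision than one that used it yesterday.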
Starting point is 00:27:48 That team gets notified in a Slack app or a Teams app. Hey, this Terraform role just woke up and tried to create an internet gateway. Do you want to allow that to happen? They hit approve. We make a slight change in the cloud. Think about ABAC access. All of a sudden that
Starting point is 00:28:05 now can do it. And if they run the Terraform again, it'll work. And the idea is the team that's doing the notifications and the approvals can be the same team for self-approval or it can be escalated up one level. In your dev account, you should be able to approve yourself to do almost anything. There should be like larger SCPs that stop you from things. But other than that, yeah. Whereas in production, it's, yeah, you're willing to do anything, should be highly constrained in most typical scaled out companies.
Starting point is 00:28:30 Like it's going to be a bit different at Twitter for pets, the two person startup versus the, you know, a large bank. Who can do what and the risk blast radius is going to be somewhat distinct, but you know, begin as you mean to go on. So again, it gives you this great starting point. You get everything kind of locked down in a hurry. And then because you can get the permissions back very quickly, literally, and if it's in self-approval, it's literally
Starting point is 00:28:52 Slack message approved, run the thing again. It doesn't create much friction for the dev team, so they kind of like it. It's unlike the, we had one customer as a design partner that was like, I love this story. Everybody here has contributor in Azure because the process is the same for getting contributor as it is for getting any least privilege role. So why would you ask for anything less? Right. You know, and you, so you created this friction for getting any access. And so now it's so hard. Everybody just asks for more than they need. And what this does is allows you, you can provision it with more, but until you get that
Starting point is 00:29:23 really low friction approval, you won't be able to use it. I might've accidentally stumbled upon a source of confirming some anecdata I was curious about. Someone attempts to do something in an AWS account. Their role does not let them do it. The approval pops up. Well, first off, what is the time lag on that rejection hitting? Because historically CloudTrail was racing the fossil record. And it's gotten better, but not perfect. And you cannot use, if you're like as an example, let's say you're writing to a centralized bucket somewhere and you were to look
Starting point is 00:29:54 in that CloudTrail for these events, it's way too delayed. They say it can be as bad as 20 minutes. It's not that bad, but it is bad. It used to be. It's not anymore. Yeah, it's still bad. And so you have to use other mechanisms in the cloud to hook on these things. It used to be CloudWatch. There's EventBridge.
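As a rough illustration of hooking denied calls through EventBridge instead of waiting for trail delivery, an event pattern along these lines could select them. This is a sketch under assumptions: the episode does not describe the actual rule, the two error codes are examples only (services report denials with different codes), and the matches function is a simplified local stand-in for pattern matching, not EventBridge's own implementation.

```python
import json

# Assumed pattern for denied API calls surfaced to EventBridge via CloudTrail.
# The error codes are examples; they vary by service.
DENIED_CALL_PATTERN = {
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "errorCode": ["AccessDenied", "Client.UnauthorizedOperation"],
    },
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must exist in the event, leaf lists
    mean "value is one of these", and nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical event for a denied internet gateway creation.
denied = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventName": "CreateInternetGateway",
        "errorCode": "Client.UnauthorizedOperation",
    },
}
print(json.dumps(DENIED_CALL_PATTERN, indent=2))
print(matches(DENIED_CALL_PATTERN, denied))
```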
Starting point is 00:30:09 And so there's ways that you can hook onto these very special events earlier in the cycle before they're ever written. And so you have to find other ways to hook them. You can't actually do it using the standard CloudTrail mechanisms. Otherwise, that delay is way too long. We, again, when all of the stars align, they happen in like four seconds. When all the stars don't align, it's still under a minute. So it's very fast. Which is fair. Otherwise, you have a 10-minute cycle time every time someone thinks that it's
Starting point is 00:30:35 the permissions thing. But no, no, they just have the wrong endpoint or something. The second question I have for you is, okay, they get denied. They click the approve button, which I assume hits the API more or less synchronously, and then it winds up enabling that on that role. From that being time zero, how long does it take until the change is actually reflected in the role and
Starting point is 00:30:57 the thing can go through? Is it an atomic transaction? Is there a replication delay? And if so, how long is that delay? That actually happens super, super quick. You know, think about things going through SQS queues and stuff. It's super fast. It happens in 10 seconds. Okay. So 10, there is a delay, but it's not massive. Okay. Not massive. Because I've often wondered when I do things that look like they should be working and they're not,
Starting point is 00:31:22 it's okay. Well, maybe there's some IAM replication lag going on here. And usually I have never found that to be true that I'm aware of. I'm sure there's been once or twice, especially in far flung regions, but it's a, but yeah, the big problem has been, no,
Starting point is 00:31:36 no, I'm just bad at computering. It is. It's interesting in AWS. We have this in the development community. There's a lot of talking about eventually consistent. It puts the eventually in the eventual consistency. Yeah, I like that.
Starting point is 00:31:49 And so there is some of that in AWS. Generally, it happens within that 10 to 15 seconds, right? I've had scenarios where it's slightly outside that range. It's generally not with things like, you know, we'll say adding a policy to a role and attaching it. It usually doesn't take more than 10 seconds ever for that to be effective. Right. So, you know, in some really crazy busy account,
Starting point is 00:32:11 maybe it hits 15 seconds or something, but they're pretty good. I suspect there would probably be a larger latency delay if you're using this to manage IAM roles on an AWS outpost. Oh, yeah, I would think so. Yeah. There's got to be a sync and caching storage. If you yank the cable out of the back of it, it still needs to be able to authenticate and do its thing. Whether that's constant online or batched updates, I don't know. They haven't given me one of them to play with yet because
Starting point is 00:32:33 one of the requirements is enterprise support, which I might be able to talk around. And the other is a loading dock attached to my house, which I'm having some trouble with. Yeah, exactly. You'll need to roll a rock in there somewhere. So we used to put, you know, rock, small rocks under our desk, didn't we? That was, that was something from 10 years ago. I don't do that anymore. Oh yeah. People still do with Mac minis for the build servers. They usually call them Bruno because when the auditor comes around, you do not talk about Bruno, but the benefit now is that with their, with the EC2 Mac instances, yes, it winds up being hundreds of bucks a month, but okay, fine for a build server that I can now treat like everything else.
Starting point is 00:33:09 Cheap at twice the price. Security is one of those fun things. I guess I have to wonder, though, how do you avoid being the inherent scapegoat for every time something doesn't work in a cloud account, which is all the time? Because you're now the department of literal no here where you say, no, you cannot do that thing. How do you avoid being the constant blame target? Yeah, there was, when we were going through this, there were two parts of that. One is we were, you know, the person putting least privilege policies in for years anyway. So we learned a lot about, as an example, if you were to compare our least privilege policy with the AWS Access Analyzer one,
Starting point is 00:33:49 we are not quite so restrictive as they are because the reality is we know that even humans and workloads have things that are somewhat similar that they do that they don't do all the time. So you should put that stuff together. So we got better at not being the no as often, but that doesn't solve the problem. You're still the department of no because you're absolutely denying something that somebody is now trying to do and they need to do to get their job done. And so when we did this, the reason this permissions on demand uses this chat ops method where it's communicating back to the team that's doing the work instantly. Within seconds, you're getting notified. The thing you just did, we said no to. And if you are the approver, press this button and you can continue.
Starting point is 00:34:23 Or if you're not the approver, this is where it went to. You need to talk to Joe. A huge part of the solution was making sure that that whole cycle end to end from the time that you were actually denied in the cloud and you're now sitting there staring at an error message to the point of you getting notified in your other channel and somebody hitting approve could happen in, we set a goal for ourselves, less than one minute. Has to be, that whole cycle must be less than one minute interruption in your day. And now, again, if you talk about that big bang with the large level approval in the prod account,
Starting point is 00:34:55 I agree that the approver may not press the button in a minute, but again, we'll, you'll probably have to ask some questions, but the actual software time end to end really is less than a minute. And so that's how we got out of it. You have to say no sometimes. We are security people. That's what we do, and we do it for the right reasons. But if you can get the workflow where the work happens quickly and they can get out of jail quickly, it doesn't become an impediment to them. And we did a few other tricks too in the sensitive permissions to do some groupings to say, if you're using this one, you're going to use these other four too.
Starting point is 00:35:26 So we might as well give them back to you at the same time. So we did some tricks there where you didn't keep running into the same block over and over and over in these workloads. I really like the story
Starting point is 00:35:35 about what you've built. If people want to learn more, where's the best place for them to find you? Look, we have, and this is another big change for Sonrai Security. We were typically selling
Starting point is 00:35:44 to these large banks and financials and we were very much an enterprise security sell at that point. But now it has you start at $100 a month, so that's not unreasonable. Completely different. We decided to say, look, we want to help people with 10 accounts. And so pricing's on the website. There's a free 14-day trial for anyone that wants to try it. And by the way, 14 days gives you enough to onboard it. We will see all of your history that you've done before. We'll find your exception. So it's in monitor mode. You can see it and you can try it in a dev account and get the permissions on demand working
Starting point is 00:36:15 all in that 14 days. You'll know if it works. It's awesome. And so super easy to do that from the website. There's a click-through demo on the website. I always say the sales guys have a block in it. Somewhere's in the middle where they make you put your email in. And if you're on the 14th click, which is sort of annoying, but well, they're salespeople,
Starting point is 00:36:29 so that's what they should do. That's what tagged email addresses are for. Yeah, exactly. Exactly. Plus something on your Google thing. So it's super easy. Just come to the website, all the great stuff there. There's some good blog content there on the sensitive permissions and how we did that and lots of identity stuff. Awesome. And we'll of course, put a link to that in the show notes. Thank you so much for taking the time to speak with me about this. I really appreciate it.
Starting point is 00:36:49 Thank you, Corey. It's been great. Sandy Bird, co-founder and CTO of Sonrai Security. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Starting point is 00:37:03 Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that I will no doubt use as a database of sorts because your podcast platform of choice almost certainly did not pay attention to least privilege.
