Screaming in the Cloud - Getting the Basics Right in Cloud Security with Fouad Matin

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud,

Starting point is 00:00:40 creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere. And that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them that I sent you. Welcome to Screaming in the Cloud.

Starting point is 00:01:33 I'm Corey Quinn. One of the things I find myself screaming in the cloud about entirely too frequently is the concept or idea or implementation of the high-level principle of least privilege. I figure I'm probably not the only person who has strong opinions on this, so I went searching, and sure enough, I found a friend who feels the same way. Fouad Matin is the co-founder and CEO over at Indent.com, colloquially known as Indent. Fouad, thank you for joining me. Thank you for having me. You are one of those weird friends that continually surprises me in weird ways.

Starting point is 00:02:13 I threw a drinkup when I was out visiting the New York Summit that AWS threw, and you showed up for it, which was great and flattering. It was a wonderful surprise. Don't get me wrong. But buddy, we both live in San Francisco. What gives? Yeah, we were just out there for the AWS Summit to meet some of our customers. And I saw that you were in town and we had exchanged messages, maybe on Twitter or over email or something. And I figured, you know what, nothing better than in person. So sure enough, I just showed up. One of these days, we're going to hang out together in the same city without there having to be a big conference around. I think RSA was the first time we wound up hanging out in that sense. Yeah, that's right. So to give a little context on the idea of where I stand on security,

Starting point is 00:02:59 and then I'll let you run with it. I've been getting a bit of flak lately for saying that Google Cloud is the number one cloud when it comes to security, and AWS is number two. Azure is a distant third place, eating crayons and telling you which flavor tastes the best, confidently but wrongly, because that's what the AI said. Now, the reason I believe that is not because I think the security fundamentals or the primitives are in any way, shape, or form better in Google than they are in AWS. But the user experience, the sense that they absolutely understand what the customer is trying to do and balance it accordingly, is awesome. With AWS, it feels like I am perpetually

Starting point is 00:03:38 blocked by default from doing anything. Add a permission. Doesn't work. Add another one. Still doesn't work. Mm-mm, still doesn't work. Hell with it, I'll add a star. You can do everything. Now it works, and I put a to-do in to go back and fix this later, and that's a lie. We know that's never going to happen.

Starting point is 00:03:54 So things wind up massively over-permissioned in perpetuity. It drives me nuts, and I haven't found a good way around it. Please, agree, disagree, rebut, or turn into a flagrant sales pitch. Your call. Well, I feel like the idea of everyone just needs administrator access just to do the most basic things. If you need to create a bucket, you need admin access. You need to restart an RDS server,

Starting point is 00:04:17 well, here's some admin access. And I think it all stems from people just trying to unblock their team. And so no one really wants to stand in the way of other people just trying to help out or trying to get work done. And I think that's the source of all of the frustration, is that people are just trying to do their work, and then they're hit with these 403s or some sort of error page that doesn't really tell them what exactly they should do, and that results in the snowballing of all this admin

Starting point is 00:04:44 and typically very sensitive admin access. In fact, if you wind up scoping down permissions and then log into the AWS console, even with fairly broad scoping, you're still going to find that as you maneuver your way through the joy slash terror slash pain that is the AWS console, you wind up encountering a whole bunch of giant red banners yelling at you and saying, oh, nope, that's not the way in the light, or winding up just trying to view something innocuous and not being allowed to do it. And the way that that UX works, you feel on some level like, oh, your account is broken because you don't have the permissions to do what you think is

Starting point is 00:05:21 fairly innocuous. It just becomes a very limiting and broken user experience that you can tell was designed from the perspective of, oh, anyone using the console, of course they're going to have full admin. Exactly. And I think it actually, what we've seen is the reverse happen where people just end up sharing these admin keys

Starting point is 00:05:40 across the team through the CLI. And so it's not just the console users, it's really anyone who ends up running some script. Well, it turns out our script doesn't work anymore because we tried to scope it down a little bit. But it really has to do with how people are setting up and architecting their identity interactions within AWS, but also just even how they set up their accounts.

Starting point is 00:06:02 I think we see a lot of teams start with one account that has everything. And so naturally, if you need to create a bucket or you need to take some sort of operation in RDS, you end up needing more permissions than you need just for that one operation because you're just doing a lot of different things. And as teams grow, they start to have multiple accounts. And maybe it's okay to have a production account where there's a few people who have access. So sure, they all get admin access. But that approach doesn't actually really work in practice

Starting point is 00:06:30 because inevitably everyone ends up needing to do one operation and that one operation requires admin access as a result of how it was architected. I think that's the kind of root issue is how you set up that foundation, how you set up that structure. And it's really hard to change behavior when people are used to just having standing admin access. One of the areas that you and I bonded over, which, you know, this should really say a lot about the kind of conversations we have in person and the company that we keep, is the idea that you should never have long-lived credentials hanging around on disk. I mean,

Starting point is 00:07:01 on my systems, the only things you'll find in the AWS credentials file, other than the occasional, effectively, credential process that grabs out to some sort of secured key store for some legacy implementation of something or other, is a set of credentials that is strictly a canary token. Someone grabs it, and immediately I get alerts that, hey, this particular box has been compromised.
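As an illustration of the credential process idea just mentioned, here is a minimal Python sketch of a helper that the AWS CLI or an SDK could invoke instead of reading static keys from disk. The JSON shape (Version, AccessKeyId, SecretAccessKey, SessionToken, Expiration) follows the documented credential_process contract; the function fetch_from_key_store and every value it returns are hypothetical placeholders for whatever secured key store is actually in use.

```python
import json
from datetime import datetime, timedelta, timezone


def fetch_from_key_store():
    """Hypothetical stand-in: a real helper would ask a secured key store
    or SSO broker for short-lived credentials instead of returning fakes."""
    expires = datetime.now(timezone.utc) + timedelta(hours=1)
    return {
        "access_key_id": "ASIAEXAMPLEEXAMPLE",
        "secret_access_key": "example-secret",
        "session_token": "example-session-token",
        "expiration": expires,
    }


def credential_process_output(creds):
    """Render credentials in the JSON shape that credential_process expects."""
    return json.dumps({
        "Version": 1,
        "AccessKeyId": creds["access_key_id"],
        "SecretAccessKey": creds["secret_access_key"],
        "SessionToken": creds["session_token"],
        # ISO 8601 timestamp so the caller knows when to ask again.
        "Expiration": creds["expiration"].strftime("%Y-%m-%dT%H:%M:%SZ"),
    })


if __name__ == "__main__":
    print(credential_process_output(fetch_from_key_store()))
```

A profile in the AWS config file would point at such a script with a credential_process line (the script path is an assumption), so nothing long-lived has to sit in the credentials file.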

Starting point is 00:07:23 Perhaps do something about it before it gets oh so very much worse. And that is the right approach. I'm a huge believer in having things work automatically from a credentialing perspective and also with credentials that expire in relatively short order. So for humans, that means SSO or something like that. And for instances and things like it, it means using instance roles. And for things like CI/CD, it means using something like the OIDC permissions dance,

Starting point is 00:07:52 which I don't pretend to fully understand despite having gotten it working with GitHub Actions. But now there's no long-term credential source where attackers can get to it. And I think that's pretty cool. Absolutely, yeah. I think the improvements to tooling, I mean, just looking a couple years ago, back when we were all using the aws-vault CLI as a way of kind of vending credentials compared to now, where you're not just assuming that you have this credentials file that has not just one credential, but usually at the time had many different accounts' credentials all just sitting in one place. And thankfully, to make it a little bit easier for anyone trying to compromise your

Starting point is 00:08:28 computer, it was just the same name on every single device. And where we sit today, where things are driven by identity, and I think one of the key differences around how people are even managing their accounts, you used to have, you know, you would have Fouad Dev, you would have Fouad Prod, and those were different accounts. And we try to keep them separated. But in reality, it should just be tied to my email address. I should just be able to log in with Fouad at indent.com. And that then lists, here's the different roles you can assume. And if I, let's say, need to go and restart the database, then I just go and get a role

Starting point is 00:09:00 that has just the permissions I need to restart a database. I can't just go and start deleting buckets or viewing objects that are in buckets. I don't need to do that. I just need the one role that I need it for. And so I think having that really kind of limited focus where you're kind of entering this privileged session rather than I just kind of open up the admin console and the world's my oyster,

Starting point is 00:09:20 but also I might step on some barbed wire or kind of enter a Home Alone situation and start having paint dropped on my head as a result. Does your position change based upon the nature of the AWS account you're getting into? Whether it's production, whether it has sensitive data or not, whether it's a development account, whether it's something you're just using to kick the tires on a new service, etc.? I think there's definitely a spectrum, like with any kind of risk posture or security posture. I think when you're dealing with development and things are pretty low risk, I think it's fine for people to just have these kind of elevated permissions on an ongoing basis. With the caveat that, and this was some experience I've had before where one day I needed, I didn't need, I wanted a GPU instance.

Starting point is 00:10:04 I wanted to train some model once upon a time. The unfortunate thing is I forgot to stop said instance. And as you probably know, and maybe this was just a great ad for your business, it was quite expensive. And I found out a year later. So that was a little bit of a bullet to have to bite and say, okay, yep, that was definitely my fault. No one really to blame other than myself for that one. And so it's not just about necessarily just security per se, as much as it is, there are sensitive operations, regardless of what level of account it is. I think just acknowledging what the sensitivity is. So in that case, it might be cost sensitivity in development. And you want to be mindful of and have some sort of

Starting point is 00:10:43 review practice around, okay, are we having long-running expensive instances that are running? And maybe we want to manage that a little bit better. But then when it comes to production, I have much stronger opinions about when you're storing customer data or any kind of sensitive data or confidential data for that matter, you really do want to protect it as if it were your own. And I think one of the analogies that I've heard that I really like is this idea of handling it like hazardous materials, where you don't collect it. If you have to collect it, don't store it. And if you have to store it, then don't keep it. And so taking that approach to how you handle your own customer data or even access to that customer data, I think really goes a long way in improving just the kind of standing operating procedures at a company of any size, regardless of

Starting point is 00:11:29 if it's production or even staging instances where sometimes inevitably customer data lands for better or worse. One of the things that I find that has also worked out super well for me and would have caused less excitement at some jobs in previous years is these days with trusted computing being what it is to some extent, I use a program called Secretive on the Mac that winds up generating an SSH key pair, but the private portion only lives inside the secure enclave and can never leave, which means you cannot export what that is. All it could do is sign individual requests. And as a result, when I SSH into a node, I have set it up so that I have to authenticate with Touch ID, but you can disable that part. But at that point, it just means that there is nothing

Starting point is 00:12:13 sensitive living on disk to wind up getting compromised, which is a nice way to live. Oh, it's a huge improvement. I think being able to push more of the, especially when it comes to encryption, but in general, pushing more of the security best practices into the hardware itself, where there isn't any kind of software kind of workaround, where I think we kind of saw this when we worked on some voting projects back in 2016, 2018. We were looking at using client-side encryption

Starting point is 00:12:40 and the consistent requirements from some teams that we were working with, where they just wanted us to send them plain text ballot information or voter information that we just couldn't even do. And it was so validating to just say, it is literally impossible for us to decrypt this without the customer's key that is only on their device. We just don't have access to it. And I think that approach where you're really pushing the security protections, not just within a couple of if statements that you have on your back end, but directly to user devices really goes a long way. To be clear, though, we're talking consumer tech. I would love it if people would stop using the same password of kitty on everything that they have. I've been tracking with, we'll call it depression, I suppose, the unfolding trickle-truth thing

Starting point is 00:13:35 of the various customers slash victims of LastPass, where there was not an incident. Okay, there was an incident. Okay, there was an incident and data was breached. Okay, and it was everything. And it just gets worse and worse and worse. Well, I migrated off of LastPass in 2017. And for better or worse, every account that I use has its own unique password,

Starting point is 00:13:55 even back then. In most cases, I use a tagged email address for most things. I have a few wildcarded domains for just that. And this was an excuse to spend an afternoon changing the probably 200 passwords I hadn't rotated since then. Because when you have a 40-character password, what does it matter if you rotate it often or not? But it was reassuring, just from a personal perspective, that there were some high-value accounts hidden in there that had never been compromised because I would have noticed. So at least that was useful. But I wish

Starting point is 00:14:24 people would use a password manager, whichever one that they happen to pick that isn't LastPass. I wish that people would enable MFA on high-value accounts. I wish that people would stop sharing passwords back and forth and try to get them to do that on the one hand.

Starting point is 00:14:39 It feels like it is a million miles away from trying to talk to companies about, oh yeah, now let's talk more about least privilege and you know you're doing it when working with AWS becomes actively painful. Yeah, absolutely. I think that kind of idea of getting people to do, quote, the basics, that's really what it's all about. I think that there's a lot of new and interesting patterns that are emerging and technologies that are emerging. But ultimately, a lot of the breaches that we see all stem from a lack of the basics. And so I think one thing just that you had mentioned

Starting point is 00:15:12 briefly around 2FA that I think is really interesting is that it's not just around what people consider high value. So people might think their bank account where you can't even log into your bank with a lot of providers now without having 2FA, at least in some form, even if it is just text-based. But I think what's interesting is people are actually experiencing this not just with their most sensitive accounts, but even what they saw as mundane, like their Facebook or their Instagram or their Twitter, where people are having their account compromised because they were using an insecure password and a lack of 2FA.

Starting point is 00:15:44 That's then what opens the door to their account getting breached. And I think that comes down to getting the basics right and explaining why 2FA matters and how to do it and making it as obvious as possible. And kudos to teams like Epic Games who, beyond just their own product and their own games, are getting people into the habit of setting up 2FA. Having that across every consumer tech that people use is really important, and it ultimately stems back to what we were talking about with least privilege in AWS. Because I think as engineers, we know, in theory, or in principle, you should be doing

Starting point is 00:16:19 these things. But we don't actually because it's inconvenient or because it slows things down. But if done correctly, it shouldn't. And I think that's the kind of balancing act that is important to strike here, where it's not about using one piece of technology or a product or anything like that. There's many different ways to kind of get the job done. But ultimately, focusing on the basics, in this case, maybe not everyone should have access to production, and maybe not everyone should be able to just log into the database whenever they want. I think that's the kind of analogy that I think about most often is no one would ever suggest that

Starting point is 00:16:53 every engineer should just have an open network connection to the database all day. But if you have admin access or even just kind of standing database access within AWS, you could. You could just have an open connection to the database all day and just poke around whenever you want and just see what's going on in there. And that's not good. We would not feel good if we saw that one of the products that we used, their engineers were doing that, even if they had some reason for doing it. It just doesn't really make sense. And I think making sure we're doing the things that we'd expect of others in our own homes and in our own companies, I think is really important. I wonder if there are, I guess, market pressures on these sorts of things. And what I

Starting point is 00:17:38 mean by that is not that AWS is going to yell at you for these things, but I still periodically will be kicking the tires on a variety of different vendor solutions that do all kinds of things by reaching into my AWS account. And I still see, oh yeah, generate long-lived IAM credentials and upload them into our web form here. It's no, no, no, no, no.

Starting point is 00:17:58 And then, okay, the smart ones are build a role that has the following permissions. Then you check and they're using IAM credentials that are long-lived on their side to go ahead and access that, which, okay, still not terrific. And my least favorite of all of them are, despite the fact that I use all of these tools

Starting point is 00:18:16 in a dedicated test account that has nothing sensitive whatsoever inside of it, but there are still a couple of them where, wait, I need to roll my own completely separate AWS account just for you. Because yeah, you wind up, for example, restricting all kinds of permissions around what you can and can't do with the role, but then you allow yourself to attach arbitrary IAM policies to things. So no, you can just give yourself admin rights and this all becomes security theater. I don't think it's malicious.
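The escalation path described here, a role that is restricted on paper but allowed to attach arbitrary IAM policies, can be checked for mechanically. The following Python sketch scans a policy document for a handful of real IAM actions that let a principal rewrite its own permissions. It is only an illustration: the action list is not exhaustive, the sample vendor_policy is invented, and the check ignores partial wildcards such as iam:Attach*, NotAction, conditions, and permission boundaries.

```python
# Actions that let a principal rewrite its own permissions (not exhaustive).
ESCALATION_ACTIONS = {
    "iam:attachrolepolicy",
    "iam:attachuserpolicy",
    "iam:putrolepolicy",
    "iam:putuserpolicy",
}


def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def escalation_risks(policy):
    """Return the allowed actions in a policy that permit self-escalation."""
    risky = set()
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            lowered = action.lower()  # IAM action names are case-insensitive
            if lowered in ("*", "iam:*") or lowered in ESCALATION_ACTIONS:
                risky.add(action)
    return sorted(risky)


# Hypothetical policy of the kind a vendor might ask you to create.
vendor_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": "iam:AttachRolePolicy", "Resource": "*"},
    ],
}

print(escalation_risks(vendor_policy))  # prints ['iam:AttachRolePolicy']
```

Any non-empty result suggests that the other restrictions in the policy can be undone by the role itself.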

Starting point is 00:18:46 I think it's just not a whole lot of, I guess, decent thinking. Yeah, I think that kind of concept of making sure that you're not just doing things for the sake of showing that you're doing it and performing it, I think is really important. I think that's, we've heard from a lot of teams who end up, quote, implementing least privilege where, yeah, we did it, we checked the box for SOC 2, but actually everyone ends up having access. Or if you go through this process, you can get it anyway, but yeah, you're supposed to submit a ticket or something. And I think those flows, those kind of logical inconsistencies are really where, one, you lose trust internally, because then what was

Starting point is 00:19:23 the point of this exercise where we added some friction, even though you can circumvent it anyway? But also, it doesn't really make you any more secure than you were before if people can still go in and run a script that can just grant them admin access. And in a pinch, they might just use that anyway. Or if you happen to know the IP address of your production database, and you happen to have already created a user account, even before you lock things down, well, now I can still just do whatever I need to do. So I can work around this issue for now. And it, one, I think reinforces this kind of seniority bias where being a senior engineer just means that you just know where the bodies are buried. But also, I think on the new team member side, it's just really hard to just get your basic work done. Regardless of what tool or what vendor you're using, you end up just reaching every single time for the super admin, for whatever kind of escape hatch that you can possibly use. You just put your finger on the pulse of the real problem here, from my perspective, which is you're trying to get work done. Well, everyone's job is security. No, it's not, because I assure you I am measured on what features have I shipped, not how secure were those

Starting point is 00:20:45 features. And as a result, I'm trying to do my job. And I want to do that job as quickly and efficiently as possible. So any solution that winds up enforcing or creating least privilege has definitionally got to be easier and more straightforward and lower friction than just using admin for everything. And that's a tall order. It is hard, but I think if set up correctly, it is possible. And I think this kind of distinction of how do people get access the first time and how do people, and I think kind of to your point, not just how do people get access, but how do people do their work? And

Starting point is 00:21:20 sometimes that can require getting access to something. Like maybe I'm on an integration team and I just deployed a service and I want to confirm that it works. Focusing on what that workflow is and understanding the critical paths within a business and within a team, whether that's, you know, I need to access Rails console. And so really focusing on the critical path to how do we make it as easy as possible and as fast as possible for people to get production Rails console access, but in a time-limited and really audited way where you can still record the logs of what's happening on those servers, but also make sure that every time that someone's requesting that access, they're providing a reason. And I think that piece of providing a reason, while it can

Starting point is 00:22:00 seem like it's just for a security or a compliance reason. It actually provides a lot of value for the team overall because if, let's say, I needed access to run a migration and I get that access only for the next couple hours, which is how long it should take for me to run that migration, and then I request three more times to get access for running a migration, that's probably a good indicator that maybe my migration wasn't well-tested enough to actually go directly onto production and maybe there's an issue here. And I think that kind of distinction of, I probably need some help as opposed to,

Starting point is 00:22:32 let me just keep getting a little bit more access and keep poking around until I find out what's going on. I'm at least less likely to drop anything in that event compared to if I'm requesting and I know that it's visible for everyone, I'm more likely to actually both reach out for help and also at the very least, it's more visible to everyone else. So someone's going to come and say, hey, why do you need this much access? As opposed to the default standing where everyone has admin and it's impossible to tell; you kind of need a needle-in-a-haystack finder to figure out what's even happening within your team.
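The workflow described above (time-limited grants, a required reason, and repeated requests as a signal that someone needs help) could be sketched roughly as follows in Python. This is not how Indent or any particular product implements it; the AccessGrant record, the function names, and the threshold of three requests are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AccessGrant:
    user: str
    role: str
    reason: str
    granted_at: datetime
    duration: timedelta

    def is_active(self, now):
        """A grant is only valid inside its time window."""
        return self.granted_at <= now < self.granted_at + self.duration


def request_access(log, user, role, reason, hours, now):
    """Record a time-limited grant; refuse requests that give no reason."""
    if not reason.strip():
        raise ValueError("a reason is required for every access request")
    grant = AccessGrant(user, role, reason, now, timedelta(hours=hours))
    log.append(grant)  # the log doubles as the audit trail
    return grant


def repeat_requests(log, threshold=3):
    """Return (user, role) pairs requested at least `threshold` times."""
    counts = Counter((g.user, g.role) for g in log)
    return [pair for pair, n in counts.items() if n >= threshold]


log = []
start = datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)
for attempt in range(3):
    request_access(log, "dev@example.com", "prod-migration",
                   "run schema migration", 2, start + timedelta(hours=3 * attempt))

print(log[0].is_active(start + timedelta(hours=1)))  # prints True
print(log[0].is_active(start + timedelta(hours=2)))  # prints False
print(repeat_requests(log))  # prints [('dev@example.com', 'prod-migration')]
```

Because every grant carries a reason and an expiry, the same log that satisfies an audit also shows when a migration keeps needing another window.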

Starting point is 00:23:03 This episode is sponsored in part by our friends at Strata. Are you struggling to keep up with the demands of managing and securing identity in your distributed enterprise IT environment? You're not alone, but you shouldn't let that hold you back. With Strata's Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP, so your IT teams will never

Starting point is 00:23:25 have to refactor for identity again. Imagine modernizing app identity in minutes instead of months, deploying passwordless on any tricky old app, and achieving business resilience with always-on identity, all from one lightweight and flexible platform. Want to see it in action? Share your identity challenge with them on a discovery call and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out. Visit strata.io slash screaming cloud. That's strata.io slash screaming cloud.

Starting point is 00:23:56 So I have to ask, since you've built a company around the entire approach, how do you get to least privilege in practice? Because even AWS has their IAM Access Analyzer, which is designed as a native offering to let you build least privilege policies. So this is great. I've run this on things I have in production that are overscoped and it'll come back and say things like, ah, it was reading and writing to DynamoDB tables. Okay, great. Will you tell me which ones? Nuh-uh. Guess.

Starting point is 00:24:30 Okay. You were making some S3 operations. Will you tell me more than that? I will not. I know that it's doing this. I can look at that from static code analysis around which Boto calls it's making. The end. I'm trying to be a little bit more granular here. I'm sure that I am missing oceans of nuance on this,

Starting point is 00:24:50 but I cannot escape the feeling that there is a way for it to build out an absolutely incredibly scoped policy that I then loosen. Okay, great. Maybe I don't just put that one key and that one value into Dynamo, but maybe I just let myself have access

Starting point is 00:25:04 to that entire table or expand it out beyond just the single S3 object to an entire prefix or an entire bucket, but I still won't need to build the entire thing from scratch. It's easier to broaden than it is to tighten in, especially when you're not entirely sure what some of the syntax stuff looks like. But no, it just feels like it is not built for humans who are not steeped in security. As someone who's been working on security for a while, I still find the analyzer to be a little bit difficult to use and end up having to just look at logs to find out what's going on. But I think the core issue that you were getting at is exactly right, which is, how do I get the work that I'm trying to get done with the least amount of access to do it? And I think it all comes down to this core issue, which is you kind of have to start

Starting point is 00:25:49 from zero. You actually just pointed this out a moment ago, where it's much easier to broaden than it is to tighten. And it's easiest when you just don't have unlimited access. And so I think starting from zero, like when you show up on day one where you don't even have an account, you're actually better off, and the entire organization is more secure, if no one has access. But obviously, that's not realistic. That just can't work. People do need access. And what you can do is start from what are people trying to accomplish, define the role based on

Starting point is 00:26:19 that. So let's say I need to go in and we rely on Dynamo in this scenario. And I have this table that is our most critical. This is all of our customer data that is really sensitive in table A. In table B, we have some product analytics that a couple different teams are going to need access to. And then table C is just something that we kind of just store kind of as a cache. It typically is not going to have anything that's even remotely sensitive, but we still want to manage it in a somewhat sane way.

Starting point is 00:26:50 That way, you can actually say, I'm going to have three different roles, maybe even two different roles, but at the very least, I'm going to have one role that is hyper-locked down because table A has the most sensitive of all of our data. And rather than having just by default, people are used to just having access to all three tables. Instead, I'm already in the mindset that this is a sensitive table.
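To make the three-table scenario concrete, here is a small Python sketch that generates one read-only IAM policy per role, with the sensitive table isolated in its own role. The DynamoDB action names and the table ARN format are real; the account ID, region, table names, and role names are hypothetical, and grouping tables B and C into a single role is just one of the options mentioned above.

```python
import json

ACCOUNT_ID = "123456789012"  # hypothetical account
REGION = "us-east-1"         # hypothetical region

READ_ACTIONS = ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan"]

# Role name -> tables it may read. Table A (customer data) stands alone.
ROLES = {
    "customer-data-read": ["table-a"],
    "analytics-and-cache-read": ["table-b", "table-c"],
}


def table_arn(name):
    """Build the ARN for a DynamoDB table in the assumed account and region."""
    return f"arn:aws:dynamodb:{REGION}:{ACCOUNT_ID}:table/{name}"


def read_policy(table_names):
    """Return a policy document allowing reads on only the named tables."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": READ_ACTIONS,
            "Resource": [table_arn(name) for name in table_names],
        }],
    }


policies = {role: read_policy(tables) for role, tables in ROLES.items()}
print(json.dumps(policies["customer-data-read"], indent=2))
```

Starting from explicit table lists like this keeps the default at zero: a table nobody has named in a role is a table nobody can read.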

Starting point is 00:27:12 So when I need to go in and perform operations, I'm a little bit more cautious about what am I doing inside of that table. And maybe that means you also separate that and say, let's have a write and a read. And I think that's the most common pattern that we've seen people implement is just separating those two out is already half the battle. But really, it's not just about giving people one or the other. It's typically having three categories

Starting point is 00:27:33 where people have the kind of most minimal, just kind of log only access. They have read access to whatever system and then they have write access in the system, which people should practically never have unless they have a very good reason. And they should only have it for 30 minutes or a couple hours at a time. You don't need it for longer than a day. You don't need to just go in and change whatever you want in our Dynamo table. There's no good reason for that. And one analogy that I heard from an engineer that I really liked is this idea of thinking of changes to access as a migration in their own form. And so the easiest way to perform a migration is if you already have everything in a structured way, where you already know what are you actually changing, as opposed to,

Starting point is 00:28:13 well, there's this unbounded mess of things we've already created. We've accumulated and snowballed a lot of different resources and identities and everything in place. And that's really hard to manage. And unfortunately, that's where most companies sit today. And so some of these ideas of, well, just start out with the right approach is just not really going to be helpful. And that's kind of where I would go back to start from zero, not from the perspective of your account, but from people's relative access. And think about what the bare minimum, rather than the bare maximum, which is what most people have,

Starting point is 00:29:06 but the bare minimum viable access to perform a given task, whether that's within RDS or that's within Dynamo, or let's take S3 as an example, where you might have one bucket that stores all of your customer data, with IAM policies that limit what prefixes you have access to, as opposed to a different scenario where you have a bucket per customer, which is what we would recommend, so that you're limiting access on a per-bucket level. But in either case, the least-privilege control would be that people should not just have GetObject on every single bucket. That should be limited and really looked at, not as a, oh, well, it's read-only, so it's fine. But actually, that's an anti-pattern, because read-only in that context could be looking at all the customer data. I'm a firm believer that this is not that hard of a problem to get to

Starting point is 00:29:40 if you're able to start from a place of building completely fresh in a greenfield scenario. Unfortunately, we don't have that available to us in most cases. I am staring at a pile of legacy stuff, and I've only been in business here for six years. And I look at some of the stuff I built early on, it's like, what moron did this? And I don't have to ask. I know, because it was only me for the first two of them. How do you wind up approaching the idea of migrating people from whatever hellacious scenario they're currently in that's also load-bearing and getting them to a point where this makes sense for them across the board? Knife switch cutovers suck and don't work. Yeah, so most of our customers actually did not start at Greenfield, as you can probably imagine. We worked with some companies that had a lot of, what do you call it, legacy, or really just

Starting point is 00:30:27 critical infrastructure that there was no option to say, oh, we'll just spin up a new account and start anew. That was not an option. And so it's really about looking at what are people already doing today? And I think it stems to what we were talking about earlier of, well, what's the workflows that people are already doing? What are the reasons why people need access to production? It's really hard to figure that out when everyone already has it because no one's going around asking for it per se. But there's really two sides of the equation where either it's open by default where everyone already has access, in which case you have to try to unwind people's natural instinct and say, well, I am an admin. It's one of my core personality traits. I'm an administrator access in our production AWS account. Or on the other side, you have

Starting point is 00:31:09 everything locked down and only a handful of people get to be the kind of curmudgeons deciding who gets bestowed access in the context of an incident or for a given project, and hopefully remembers to revoke that access. But ultimately, it all stemmed from access to sensitive data, whether that's in a blob storage or some sort of structured SQL database. They're running servers where we're actually computing with that data. And then typically, there's some set of pipelining that, whether it's SQS or any of the many different kind of messaging systems within AWS and some of the managed products, there's really three categories that we saw companies

Starting point is 00:31:50 try to approach this problem. So step one was, let's secure the thing where the data is actually being processed. So the servers, where if you were to make a change in there, you just go SSH into that server, you download something onto it, that would be really bad, we would look really embarrassed. And so locking that down, where there's not really that much good reason for you just SSHing into production servers willy-nilly. And so that was the first place that we see most companies start and limit that access because people still need it, but now they can just get it when they need. And that kind of on-demand nature to people's access, I think is really important. And then for the others, it's really about understanding the workflows before you revoke it. And I think a key part of that is

Starting point is 00:32:29 you can shift to on-demand access where people are requesting it and they don't have it by default, but it can be automatically approved. So you get people to start saying why they need access, and they'll just tell you, hey, I need it to do this workflow or to do this ticket, or I'm debugging this integration. And then you can start to ratchet up the controls and say, OK, for these really sensitive buckets, which we now know are extremely sensitive and have very sensitive workflows but are really time sensitive. Let's allow this data science team to be able to self-approve and everyone else has to go through their manager. And then on the flip side, there might be some resources where it's like CloudWatch or ECS cluster info is a common one where, well, it turns out if you look at the quotas, it's pretty easy if you have a growing team to hit those quotas.
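[Editor's sketch] The ratcheting approach described above (everyone requests access with a reason, most requests are approved automatically, and only known-sensitive resources get stricter routing) might look roughly like this in Python. This is a minimal sketch under stated assumptions, not Indent's implementation: the resource name, team name, and the `route_request` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy table: sensitive resource -> teams allowed to self-approve.
# Anything not listed here is treated as low-risk and auto-approved.
SENSITIVE_RESOURCES = {
    "customer-data-bucket": {"data-science"},
}


@dataclass
class AccessRequest:
    user: str
    team: str
    resource: str
    reason: str            # requesters must always say why they need access
    duration_minutes: int  # access is requested for a bounded time


def route_request(req: AccessRequest) -> str:
    """Return who approves the request: 'auto', 'self', or 'manager'."""
    if not req.reason.strip():
        raise ValueError("a reason is required for every access request")
    if req.resource not in SENSITIVE_RESOURCES:
        return "auto"      # stage one: on-demand, automatically approved
    if req.team in SENSITIVE_RESOURCES[req.resource]:
        return "self"      # trusted team for a time-sensitive workflow
    return "manager"       # everyone else goes through their manager
```

The point of the design is that the reason field is collected from day one, even while everything is auto-approved, so the data needed to tighten the policy later already exists.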

Starting point is 00:33:18 And then all of a sudden, no one can see cluster info anymore. And there are workflows where people just start looking at different pages in the AWS dashboard because they're curious, as opposed to they actually need to do that thing. And then the people who need to do whatever that workflow is are unable to do it anymore because of either a quota getting hit or someone accidentally takes an action that they thought they were in staging instead of production. And I think just separating out those kinds of accidental mistakes or moments of curiosity that maybe should not have been run in production and shifting that to a process where people are saying, I need it for this reason, I need it for this amount of time. And given that, people are much more cautious about what

Starting point is 00:33:59 are they actually doing within that system. And I think that change of behavior is really the first place to start. Speaking of changing behaviors, one position that you have staked out publicly lately that I'm a fan of, and you started quoting me, I'll even be charitable and say in context on, has been that nobody should have production access. I like the pattern, and I know people are going to yell at me for it, so I will, of course, caveat it. Maybe you'll disagree on this one. Yes, fine. Build yourself a break glass option to get in when things are completely hosed.

Starting point is 00:34:32 Fine, of course. But no one should be able to do that without a whole bunch of other people knowing. Now, if you accept that you can always get in if you need to. Now, how do you build tooling and operational practices so that you don't need to? Exactly, exactly. And I think there will always be a situation where you need that break glass backup. I think when we look at incidents like the Uber incident from last year, where the unfortunate reality when you don't have the right tooling in place to temporarily elevate people's access,

Starting point is 00:35:03 what ends up happening is that break glass account, this, you know, only break in case of emergency, ends up getting used constantly to the point where it gets put in a Google Drive folder with credentials in plain text. And that was the root cause of not the initial incident and the initial compromise, but the escalation from this is kind of bad to catastrophic, where an outsider now had full admin, super admin across everything. And I think that distinction of if something is a break glass account, it should do just that. It should break glass and set off alarms. And everyone should be notified and alerts should be triggered because that should be the sensitivity of that account. Otherwise, people should be going through a process that's not quite break glass as much as it is an escalation.
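[Editor's sketch] The idea that a break glass account should "break glass and set off alarms" can be illustrated with a small Python wrapper that refuses to proceed without a reason, records the event, and notifies everyone. This is only a sketch: `use_break_glass`, the `notify` callback, and the in-memory audit log are hypothetical stand-ins for whatever alerting and logging systems a team actually uses.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List


def use_break_glass(
    user: str,
    reason: str,
    notify: Callable[[str], None],
    audit_log: List[Dict[str, str]],
) -> Dict[str, str]:
    """Record and loudly announce every use of the emergency account."""
    if not reason.strip():
        raise ValueError("break glass access requires a stated reason")
    event = {
        "user": user,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(event)  # durable record for later review
    notify(f"BREAK GLASS used by {user}: {reason}")  # alert everyone
    return event
```

In this sketch the alert is not optional: there is no code path that returns an event without also calling `notify`, which is the property the speaker is asking for.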

Starting point is 00:35:51 And I think this kind of idea of an escalation where it's not necessarily in the sense of an incident escalation where this is a kind of one-off thing, because these end up happening constantly. It's escalation in the context of elevating someone's access from viewer to S3 reader to RDS backup recovery. Understanding the workflows that people are trying to accomplish and AWS, to their credit, has made some improvements around Identity Center to make defining these job-specific functions a little bit easier when you have an accounting team who needs access to AWS. That doesn't mean they need access to production continuously. It just means they need the billing account info. That's really all they're looking for.
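[Editor's sketch] Escalation as a routine, time-boxed elevation (for example from viewer to S3 reader for 30 minutes, and never for longer than a day) could be modeled along these lines in Python. This is an illustrative sketch only: the `Grant` type, the `elevate` helper, and the role string are hypothetical, and the one-day cap is taken from the guidance earlier in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Upper bound on any single elevation, per "you don't need it for longer than a day".
MAX_DURATION = timedelta(days=1)


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        """A grant is valid only until its expiry; no standing access."""
        return now < self.expires_at


def elevate(user: str, role: str, duration: timedelta, now: datetime) -> Grant:
    """Create a temporary grant for a job-specific role."""
    if duration <= timedelta(0) or duration > MAX_DURATION:
        raise ValueError("duration must be positive and at most one day")
    return Grant(user=user, role=role, expires_at=now + duration)
```

Because expiry is part of the grant itself, revocation does not depend on someone remembering to remove the access afterwards.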

Starting point is 00:36:35 They don't want to go look in S3, unless maybe they're Duckbill, but then they do need to go into S3. And I think that having that approach... We have to go into everything on some level. That's where the bills live. Exactly. But I think that having the right procedure and the right tooling to enable that procedure is really important because it all comes down to what is the easiest way to compromise the system. And if the easiest way is there's this account that people can just go into or there's a handful of people who have these accounts that the inevitable results of people just sharing credentials, or I ask you if

Starting point is 00:37:08 you have admin access, I just ask you to go run this query for me, and you're just trying to help me out to help me ship this feature faster, we're more likely to make a mistake compared to when we have a process in place that is as lightweight as possible. I think that's the key part of process can really add a lot of friction and it can be really this kind of terrible plight when it comes to engineering because you're just going through these hoops for not really that much reason. But if the process is more of a byproduct of you just kind of doing your work and saying, hey, I need just like you need access to a Google Doc or something. You just click a button and say, I need it. And then it goes to the right person. I don't need to go find who in the SRE team or on the IT team I need to ask. I can just click a

Starting point is 00:37:49 button and say, hey, I need this thing, and here's why I need it. It makes it so much smoother for people to get that access, to just get on with their lives and just keep doing work. And I think setting up the other half of that, setting up auto-approvals for cases where these people will almost certainly always be approved and just encoding that into a policy. I think that's the key distinction that a lot of people don't kind of take into account is that you can just connect your approval process to your PagerDuty or to your Opsgenie or whatever incident management system you're using to make sure that people aren't just stuck around waiting. If it is, let's say, 2 a.m. and people need to, quote,

Starting point is 00:38:24 break glass, that shouldn't really be a break glass moment because that person's already on call or they're already responding to an incident. So understanding the situations in which people are going to have to use these kind of escalations and just accounting for it as part of your process. And I think that's where it is hard to do. There are companies, and credit to Segment where I used to work, where they actually built, they spent many months and some really smart engineers building an access service in-house to solve this problem because they had a lot of data, as you can probably imagine. And credit to them because they really invested and they took it seriously. A lot of companies just don't have that capacity, don't have that bandwidth, especially now they don't have that capacity. And so I think finding the right solution for you, whether that's some amount of tooling that kind of removes people's need to have production access, or it's using products,

Starting point is 00:39:09 there's a lot of different options, but ultimately getting to the right process for people who don't just have that standing admin access, whether it's not keeping production access or candidly, even as one of the co-founders, I don't have admin access on our own systems. I can get it when I need it. But I think it's a really important procedure to make sure it's typically going to be the kind of senior leadership in a company or senior engineers in a company who are going to be so confident that they're going to do the right thing and kind of just cut once and be so confident that they're not going to make a mistake. And we heard about this recently where someone was debugging an incident and the CTO hops on, tries to help, and accidentally drops the production index. They thought they were on staging because they wanted to test something to see would that solve the problem.

Starting point is 00:39:54 But they actually made the incident much, much worse because now not only were they already debugging this incident, but now all the other alerts started going off because everything started becoming so much slower. And I think that kind of confidence can actually hurt you in the moments like that. And having a little bit more procedure and process can help. I really want to thank you for taking so much time to explain how you feel about these unfortunately polarizing topics. If people want to learn more, where can they find you? You can check out Indent at indent.com. And we're on Twitter, LinkedIn. I'm also on Twitter and LinkedIn, Fouad Matin. A little bit harder to spell than Indent.

Starting point is 00:40:31 But if anyone disagrees or feels strongly about this topic, definitely willing to scream into the cloud some more and hash it out. But thanks again for having me. Yes, you're encouraged to fight me in real life. I felt like I was getting pretty close to that. Exactly. Thanks once again for your time. I really do appreciate it. Thank you. Fouad Matin, co-founder and CEO at Indent. I'm cloud economist Corey Quinn, and this is Screaming

Starting point is 00:40:59 in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment that I will then log in as you and remove, because you use the same password for everything, including production and your podcast platform of choice. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point.

Starting point is 00:41:45 Visit duckbillgroup.com to get started.
