Screaming in the Cloud - Transparency in Cloud Security with Gafnit Amiga

Episode Date: June 21, 2022

Full Description / Show Notes

Gafnit explains how she found a vulnerability in RDS, an Amazon database service (1:40)
Gafnit and Corey discuss the concept of not being able to win in cloud security (7:20)
Gafnit talks about transparency around security breaches (11:02)
Corey and Gafnit discuss effectively communicating with customers about security (13:00)
Gafnit answers the question "Did you come at the RDS vulnerability exploration from a perspective of being deeper on the Postgres side or deeper on the AWS side?" (18:10)
Corey and Gafnit talk about the risk of taking a pre-existing open source solution and offering it as a managed service (19:07)
Security measures in cloud-native approaches versus cloud-hosted (22:41)
Gafnit and Corey discuss the security community (25:04)

About Gafnit
Gafnit Amiga is the Director of Security Research at Lightspin. Gafnit has 7 years of experience in Application Security and Cloud Security Research. Gafnit leads the Security Research Group at Lightspin, focused on developing new methods to conduct research for new cloud-native services and Kubernetes. Previously, Gafnit was a lead product security engineer at Salesforce focused on their core platform and a security researcher at GE Digital. Gafnit holds a B.Sc. in Computer Science from IDC Herzliya and is a student for an M.Sc. in Data Science.

Links Referenced:
Lightspin: https://www.lightspin.io/
Twitter: https://twitter.com/gafnitav
LinkedIn: https://www.linkedin.com/in/gafnit-amiga-b1357b125/

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Let's face it. On-call firefighting at 2 a.m. is stressful.
Starting point is 00:00:35 So there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening. But the good news is that Incident.io makes incidents less stressful and a lot more valuable. Incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues, and learn from incident insights to improve site reliability and fix your vulnerabilities. Try Incident.io to recover faster and sleep more. This episode is sponsored in part by our friends at EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years, and now EnterpriseDB
Starting point is 00:01:21 has you covered wherever you deploy PostgresquSQL, on-premises, private cloud, and they just announced a fully managed service on AWS and Azure called Big Animal. All one word. Don't leave managing your database to your cloud vendor because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money. They can even help you migrate legacy applications, including Oracle, to the cloud.
Starting point is 00:01:52 To learn more, try Big Animal for free. Go to biganimal.com slash snark and tell them Corey sent you. Welcome to Screaming in the Cloud. I'm Corey Quinn. We've taken a bit of a security bent to the conversations that we've been having on this show in over the past year or so. And well, today's episode is no different. In fact, we're going a little bit deeper than we normally tend to. My guest today is Gafneet Amiga, who's the Director of Security Research at Lightsman.
Starting point is 00:02:23 Gafneet, thank you for joining me. Hey, Corey, thank you for inviting me to the show. You sort of burst onto the scene, and by scene, I of course mean the cloud space, at least the level of community awareness. Back, I want to say in April of 2022, when you posted a very in-depth blog post about exploiting RDS and some misconfigurations on AWS's side to effectively display internal service credentials for the RDS service itself.
Starting point is 00:02:56 Now, that sounds like it's one of those incredibly deep, incredibly murky things, because it is, let's be clear. At a high level, can you explain to me exactly what it is that you found and how you did it yes so rds is a database service of amazon it's a managed service where you can choose the engine that you prefer one of them is postgres there i found the vulnerability the vulnerability was in the extension in the log fdw so it's for like stands for forging data wrapper where this extension is there for reading the logs directly of the engine and then you can query it using sql queries which should be simpler and easy to use. And this extension enables you to provide a path. And there was a path traversal, but the traversal happened only
Starting point is 00:03:56 when you dropped a validation of the wrapper. And this is how I managed to read local files from the database EC2 machine, which shouldn't happen because this is a managed service and you shouldn't have any access to the underlying host. It's always odd when the abstraction starts leaking from an AWS perspective. I know that a friend of mine was on Aurora during the beta and was doing some high performance work and suddenly started seeing SQL errors about VAR temp filling up, which is, for those who are not well-versed in SQL, and even for those who are, that's not the sort of thing you tend to expect to show up on there. It feels like the
Starting point is 00:04:33 underlying system tends to leak in, particularly in an RDS sense, into what is otherwise at least imagined to be a fully managed service. Yes, because sometimes they want to give you an informative error so you will be able to realize what happened and what caused the error. And sometimes they prefer not to give you too many information because they don't want you to get to the underlying machine. This is why, for example, you don't get a regular super user. You have an RDS super user
Starting point is 00:05:07 in the database. It seems to me that this is sort of a problem of layering different security models on top of each other. If you take a cloud-native database that they designed start to finish themselves, like DynamoDB, the entire security model for Dynamo, as best I can determine, is wrapped up within IAM. So if you know IAM, spoiler, nobody knows IAM completely, it seems, but if you have that on lock, you've got it. There's nothing else that you need to think about. Whereas with RDS, you have to layer on IAM to get access to the database and what you're allowed to do with it. But then there's an entirely separate user management system in many respects of local users for other Postgres or MySQL or any of the other systems that we were using, to a point where even when they started supporting IAM for authentication
Starting point is 00:05:55 to RDS at the database user level, it was flagged in the documentation with a bunch of warnings of don't do this for high volume stuff, only do this in development style environments. So it's clear that it has been a difficult marriage, for lack of a better term. And then you have to layer on all the other stuff that if, God forbid, you're in a multi-cloud style environment or working with Kubernetes on top of all of this, and it seems like you're having to pick and choose between four or five different levels of security modeling, as well as understand how all of those things interplay together. How come we don't see things like this happening four times a day as a result? Well, I guess that there are more issues being found, but not always published.
Starting point is 00:06:37 But I think that this is what makes it more complex for both sides, creating and managing services with resources and third parties that everybody knows to make it easy for them to use requires a deep understanding of the existing permission models of the service where you want to integrate it with your permission model and how the combination works. So you actually need to understand how every change is going to affect the restrictions that you want to have. So for example, if you don't want the database users to be able to read, write, or do a network activity, so you really need to understand the permission model of Postgres itself.
Starting point is 00:07:26 So it makes it more complicated for development, but it's also good for researchers because they already know Postgres and they have a good starting point. My philosophy has always been when you're trying to secure something, you need to have at least a topical level of understanding of the entire system start to finish. One of the problems I've had with the idea of microservices, as is frequently envisioned, is that there's separation, but not real separation. So you have to hand wave over a whole bunch of the security model. If you don't understand something,
Starting point is 00:08:00 I believe it's very difficult to secure it. Let's be honest, even if you do understand something, it can be very difficult to secure it. Let's be honest, even if you do understand something, it can be very difficult to secure it. And the cloud vendors with IAM and similar systems don't seem to be doing themselves any favors given the sheer complexity and the capabilities that they're demanding of themselves, even for having one AWS service talk to another one,
Starting point is 00:08:19 but in the right way. And it's finicky and it's nuanced and debugging it becomes a colossal pain. And finally, at least those of us who are bad at these things, finally say, the hell with it, and they just grant full access from service A to service B in the confines of a test environment. I'm not quite that nuts myself most days. And then it's the biggest lie we always tell ourselves is once we have something over-scoped like that, usually for CICD, it's, oh, to-do, I'll go back and fix that later.
Starting point is 00:08:45 Yeah, I'm looking back five years ago. Yeah, it's still on my to-do list. For some reason, it's never been the number one priority. And in all likelihood, it won't be until right after it really should have been my number one priority. It feels like in cloud security, particularly, you can't win. You can only not lose. I always thought that to be something of a depressing perspective, and I didn't accept it for the longest time. But increasingly, these days, it's starting to feel like that is the state of the world. Am I wrong on that? Am I just being too dour? What do you mean by you cannot lose? There's no winning in security from my perspective, because no one is going to say,
Starting point is 00:09:22 all right, we won security, problem solved, the end. Companies don't view security as a value add. It is only about a downside risk mitigation play. It's, yay, another day of not getting breached. And the failure mode from there is, okay, well, we got breached. We found out about it ourselves immediately, internally, rather than reading about it in the New York Times in two weeks. The winning is just the steady state, the status quo. It's just all different flavors of losing beyond that. So I don't think it's quite the case because I can tell that they do do always an active work on securing the services and the structure because I went over other extensions before reaching to the log4gen data wrapper and they actually excluded high-risk functionalities that could help me to achieve privileged access to the underlying host. And they do it with other services as well, because they do always do the
Starting point is 00:10:25 security review before having it integrated externally. But you know, it's an endless zone. You can always have something. Security vulnerabilities are always a raise. So everyone, whenever they can help and to search and to give their value. It's appreciated. I feel like I need to clarify a bit of nuance. When your blog post first came out talking about this, I was, well, let's say a little irritated toward AWS on Twitter and in other places. And Twitter is not a place for nuance. It is easy to look at that and think,
Starting point is 00:11:04 oh, I was upset at AWS for having a vulnerability. I am not. I want to be very clear on that. Now, it's always a question of not blocking all risk. It's about trade-offs and what risk is acceptable. And to AWS's credit, they do say that they practice defense in depth. Being able to access the credentials for the running RDS service on top of the instance that it was running on, while that's certainly not good, isn't as if you'd suddenly had keys to everything inside of AWS and all their security model crumbles away before you. They do the right thing. And the people working on these things are incredibly good. And they work very hard at these things. My concern and my complaint is, as much as I enjoy the work that you do and reading these blog posts talking about how you did it, it bothers me that I have to learn about a vulnerability in
Starting point is 00:12:06 a service for which I pay not small amounts of money. RDS is the number one largest charge on my AWS bill every month. And I have to hear about it from a third party rather than the vendor themselves. In this case, it was a full day later where after your blog post went up and they finally had a small security disclosure on AWS's site talking about it. And that pattern feels to me like it leads nowhere good. So transparency is a key word here. And when I wrote the post, I asked if they want to add anything from their side. They told that they already reached out to the vulnerable customers and helped them to migrate to the fixed version. So from their side, they didn't feel it's necessary to add it over
Starting point is 00:12:54 there. But I did mention the fact that they did the investigation and no customer data was heard. Yeah, but I think that if there will be maybe a more organized process for any submission of any vulnerability that where all the steps are aligned, it will help everyone and anyone can be informed with everything that happens. I have always been extraordinarily impressed by people who work at AWS and handle a lot of the triaging of vulnerability reports. Zach Glick, before he left, was doing an awful lot of that. Dan Erson continues to be one of the bright lights of AWS from my perspective, just as far as customer communication and understanding exactly what the customer perspective is. And as individuals, I see nothing but stars over at AWS. To be clear,
Starting point is 00:13:47 nothing but stars is also the name of most of my IAM policies, but that's neither here nor there. It seems like on some level, there's a communications and policy misalignment on some level. Because I look at this and every conversation I ever have with AWS's security folks, they are eminently reasonable. They're incredibly intelligent and they care. There's no mistaking that. They legitimately care. But somewhere at the scale of company they're at, incentives get crossed and everyone has a different position they're looking at these things from. And it feels like that disjointedness leads to almost a misalignment as far as how to effectively communicate things like this to customers. Yes, it looks like this is the case, but
Starting point is 00:14:31 if more things will be discovered and published, I think that they will have eventually an organized process for that. Because I guess that researchers do find things over there, but they're not always being published for several reasons. But yes, they should work on that. And that is part of the challenge as well, where AWS does not have a public vulnerability disclosure program. They're not on HackerOne. They don't have a public bug bounty program. They have a vulnerability disclosure email address, and the people working behind that are some of the hardest working folks in tech. But there is no unified way of building a community of researchers around the idea of exploring this. And that is a challenge because you've reported vulnerabilities. I have reported
Starting point is 00:15:22 significantly fewer vulnerabilities, but it always feels like it's a hurry-up-and-wait scenario where the communication is not always immediate and clear. And at best, it feels like we often get a begrudging thank you versus, all right, if we just throw ethics completely out the window and decide instead that now we're going to wind up focusing on just effectively selling it to the highest bidder. The value of, for example, a hypervisor escape on EC2, for example, is incalculable. There is no amount of money that a bug bounty program could offer for something like that compared to what it is worth to the right bad actor at the right time. So the vulnerabilities that we hear about are already, we're starting from a basis of people who have a functioning sense of ethics, people who are not deeply compromised trying to do something truly nefarious. What worries me is the story of, what are the stories that we aren't seeing? What are the things that are being found where instead of fighting against the bureaucracy around disclosure and the rest, people just use them for their own ends?
Starting point is 00:16:23 And I'm gratified by the level of response I see from AWS on the things that they do find out about. But I always have to wonder, what aren't we seeing? That's a good question. And it really depends on their side, if they choose to expose it or not. Part of the challenge, too, is the messaging and the communication around it and who gets credit and the rest. And it's weird, Whenever they release some additional feature to one of their big headline services, there are blog posts, there are keynote speeches, there are customer references. They go on speaking tours and the emails. Oh, God, they never stop the emails talking about how amazing all of these things are. But whenever there's a security vulnerability or a disclosure like this, and to be fair, AWS's response to this
Starting point is 00:17:09 speaks very well of them. It's like you have to go sneak down into the dark sub-basement with the filing cabinet behind the leopard sign and the rest to even find out that these things exist. And I feel like they're not doing themselves any favors by developing that reputation or lack of transparency around these things. Well, there was no customer impact, so why would we talk about it?
Starting point is 00:17:32 Because otherwise you're setting up a myth that there never is a vulnerability on the side of what it is that you're building as a cloud provider. And when there is a problem down the road, because always is going to be nothing is perfect people are going to say hey wait a minute you didn't talk about this what else haven't you talked about and it rebounds on them with sometimes really unfortunate side effects with azure as a as a counter example here we see a number of azure exploits where yeah it turned out that we had access to other customers data and and Azure had no idea until we told them. And Azure does its statements about, oh, we have no evidence of any of this stuff being used improperly. Okay, that can mean that you've either checked your logs
Starting point is 00:18:14 and things are great or you don't have logging. I don't know that necessarily is something I trust. Conversely, AWS has said in the past, we have looked at the audit logs for this service dating back to its launch years ago and have validated that it has never been used like this. One of those responses breeds an awful lot of customer trust. The other one doesn't. And I just wish AWS knew a little bit more how good crisis communication around vulnerabilities can improve customer trust rather than erode it.
Starting point is 00:18:48 Yes, and I think that, as you said, there will always be vulnerabilities. And I think that we are expected to find more. So being able to communicate it as clearly as you can and to expose things about maybe the fix and how the investigation is being done, even in a high level, for all the vulnerabilities, can gain more trust from the customer side. DoorDash had a problem.
Starting point is 00:19:20 As their cloud-native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology and about the business, losing observability means the entire company loses their competitive edge. With Chronosphere, DoorDash is no longer losing visibility into their application suite. The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business, confidence, and peace of mind. Read the full success story at snark.cloud slash chronosphere.
Starting point is 00:19:59 That's snark.cloud slash C-H-R-O-N-o-s-p-h-e-r-e. You have experience in your background, specifically around application security and cloud security research. You've been doing this for seven years at this point. When you started looking into this, did you come at the RDS vulnerability exploration from a perspective of being deeper on the Postgres side
Starting point is 00:20:26 or deeper on the AWS side of things? So it was both. I actually came to the RDS lead from another service where there was something about in the application level, but then I reached to an RDS and thought, well, it will be really nice to find things over here and to reach the underlying machine. And when I entered to the RDS zone,
Starting point is 00:20:55 I started to look at it from the application security eyes. But you have to know the cloud as well because there are integrations with S3. You need to understand the IAM model. So you need a mix of both to exploit specifically this kind of issue. But you can also be a database expert because the payload is a pure SQL. It always seems to me that this is an inherent risk in trying to take something that is pre-existing as an open source solution.
Starting point is 00:21:32 Postgres is one example, but there are many more. And offer it as a managed service. Because I think one of the big misunderstandings is that when AWS is just going to take something like Redis and offer that as a managed service, it's okay, I accept that they will offer a thing that respects the endpoints and acts as if it were Redis. But under the hood, there is so much in all of these open source projects that is built for optionality of wherever you want to run this thing, it will run there. Whatever type
Starting point is 00:22:03 of workload you want to throw at it, it can work. Whereas when you have a cloud provider converting these things into a managed service, they are going to strip out an awful lot of those things. Easy example might be, okay, there's this thing that winds up having to calculate for the way the hard drives on a computer work and from a storage perspective. Well, all the big cloud providers already have interesting ways that they have solved storage. Every team does not re-implement that particular wheel.
Starting point is 00:22:29 They use in-house services. Chubby's file locking, for example, over on the Google side is a classic example of this that they've talked about an awful lot. So every team building something doesn't have to rediscover all of that. So the idea that,
Starting point is 00:22:40 oh, we're just going to take up this open source thing, clone it off of GitHub, fork it, and then just throw it into production as a managed service seems more than a little naive. What's your experience around seeing, as you get more under the weeds of these things, and most customers are allowed to get, what's your take on this? Do you find that this looks an awful lot like the open source version that we all use?
Starting point is 00:23:02 Or is it something that looks like it has been heavily customized to take advantage of what AWS is offering internally as underlying bedrock services? So from what I saw until now, they do want to save the functionality so you will have the same experience as you're working with the same service, but not on AWS because you are used to that.
Starting point is 00:23:25 So they are not doing dramatic changes, but they do want to reduce the risk in the security space. So there will be some functionalities that they will not let you to do. And this is because of the managed part. In areas where the full workload is deployed in your account and you can access it anyway, so they will not have the same security restrictions because you can access the workload anyway. But when it's managed, they need to prevent you from accessing the underlying host, for example. And they do do the changes, but they're reallyaked to the specific actions that can lead you to that. It also feels like RDS is something of a, I don't want to
Starting point is 00:24:14 call it a legacy service because it is clearly still very much actively developed, but it's what we'll call a classic service. When I look at a new AWS launch, I tend to momentally bucket them into two things. There's the cloud-native approach, and we've already talked about DynamoDB. That would be one example of this. And there's the cloud-hosted model, where you have to worry about things like instances and security groups and the networking stuff and so on and so forth, where it basically feels like they're running their thing on top of a pile of EC2 instances, and that abstraction starts leaking. Part of me wonders if, looking at some of these older services like RDS, they made decisions
Starting point is 00:24:50 in the design and build out of these things that they might not if they were to go ahead and build it out today. I mean, Aurora is an example of what that might look like. Have you found, as you start looking around the various security foibles of different cloud services, that the security posture of some of the more cloud-native approaches is better, worse, or the same as the cloud-hosted world? Well, so for example, in several issues that were found, and also here in the RDS where you can see credentials in a file, this is not a best practice in security space. And so definitely there are things to improve, even if it's developed on the provider side. But it's really hard to answer this question because in the managed area where you
Starting point is 00:25:41 don't have any access, it's hard to tell how it's configured and if it's configured properly. So you need to have some certification from their side. This is on some level part of the great security challenge, especially for something that is not itself open source, where they obviously have terrific security teams. Don't get me wrong. At no point do I want to ever come across as saying, oh, those AWS people don't know how security works. That is provably untrue. But there is something to be said
Starting point is 00:26:13 for the value of having a strong community in the security space, focusing on this from the outside, of looking at these things, of even helping other people contextualize these things. And I'm a little disheartened that none of the major cloud providers seem to have really embraced the idea of a cloud security community to the point where the one that I'm the most familiar with, the cloud security forum
Starting point is 00:26:36 Slack team, seems to be my default place where I go for context on things. Because I dabble, I keep a hand in when it comes to security, but I'm certainly no expert. That's what people like you are for. I make fun of clouds and I work on the billing parts of it. And that's about as far as it goes for me. But being able to get context around, is this a big deal? Is this description that a company is giving, is it accurate? For example, when your post came out, I had not heard of Lightspin in this context. So reaching out to a few people I trusted, is this legitimate? The answer is yes, it's legitimate and it's brilliant. That's a company to keep your eye on. Great. That's useful context and there's no way to buy that. It has to come from having those conversations with people
Starting point is 00:27:15 in the broader sense of the community. What's your experience been looking at the community side of the world of security? Well, so I think that the cloud security has a great community. And this is one of the things that we at Lightspin really want to increase and push forward. And we see ourselves as a security-driven company. We always do the best to publish a post, even detailed posts, not about vulnerabilities, about how things work in the cloud and how things are being evaluated, to release open source tools where you can use them to check your environment, even if you're not a customer.
Starting point is 00:27:58 And I think that the community is always willing to explain and to investigate together. It's a welcome effort. But I think that the messaging should be also for all layers, you know, also for the DevOps and the developers, because it can really help if it will start from this point, from their side as well. It needs to be baked in from start to finish.
Starting point is 00:28:27 Yeah, exactly. I really want to thank you for taking the time out of your day to speak with me today. If people want to learn more about what you're up to, where's the best place for them to find you? So you can find me on Twitter and on LinkedIn and feel free to reach out. We'll of course put links to that in the show notes. Thank you so
Starting point is 00:28:46 much for being so generous with your time today. I appreciate it. Thank you, Corey. Gevneet Amiga, Director of Security Research at Lightspin. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. And if it's on the YouTubes, smash the like and subscribe buttons, which I'm told are there. Whereas if you've hated this podcast, same story, like and subscribe on the buttons, leave a five-star review on a various platform, but also leave an insulting, angry comment about how my observation that our IAM policies are all full of stars is inaccurate. And then I will go ahead and delete that comment later because you didn't set a strong password.
Starting point is 00:29:34 If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duck Bill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started. This has been a humble pod production stay humble
