Screaming in the Cloud - Reverse Engineering the Capital One Breach with Josh Stella

Episode Date: September 11, 2019

About Josh Stella
Josh Stella is co-founder and CTO of Fugue, the company delivering autonomous cloud infrastructure security and compliance. Previously, Josh was a Principal Solutions Architect at Amazon Web Services (AWS), where he supported customers in the area of national security. Prior to Fugue, Josh served as CTO for a technology startup and, for 25 years, in numerous other IT leadership and technical roles.

Links Referenced
Twitter: @joshstella
LinkedIn: linkedin.com/in/josh-stella-949a9711
www.fugue.co

Transcript
Starting point is 00:00:00 Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored by X-Team, who make it possible to work remotely while working on first-class company environments. I've got to say, I'm pretty skeptical of remote work environments, so I got on the phone with these folks for about half an hour, and let me level with you. I've got to say, I believe in what they're doing,
Starting point is 00:00:54 and their story is compelling. If I didn't believe that, I promise you I wouldn't say it. If you'd like to work for a company that doesn't require you to live in San Francisco, take my advice and check out X-Team. They're hiring both developers and DevOps engineers. Check them out at the letter x-team.com slash cloud. That's x-team.com slash cloud to learn more. Thank you for sponsoring this ridiculous podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Josh Stella, founder and CTO of a company called Fugue. Josh, welcome to the show.
Starting point is 00:01:32 Thanks, Corey. It's great to be here. So let's start at the very beginning. What is a fugue? My awareness of the company starts and stops with a t-shirt I got at an event a couple of years back. It's glorious. People have general trouble reading it, but it's very nice once you look at the abstract design and then eventually realize it says the word fugue in a very stylistic way. There is a backstory there. A fugue is a compositional style in music, and particularly fugues are constructed out of relatively simple musical phrases that evolve over time and interleave and have produced some of the most sophisticated, beautiful pieces of music ever written. a book published in the 1980s that made an impression on me as a young person called Gertl Escher Bach, An Eternal Golden Braid by Douglas Hofstetter, which I highly recommend.
Starting point is 00:02:31 It's about the nature of complex systems arising out of simple systems. And I wasn't quite sure it would be a good company name, but my colleagues, when I brought it up, kind of insisted. So Fugue, we are. Excellent. And Fugue, you shall remain. So at a high level, what your company does, to my understanding, and please correct me loudly and energetically if I'm wrong, but you focus on cloud governance, specifically with an eye towards security and compliance. Yes, that's true.
Starting point is 00:03:03 I would say cloud governance focused primarily, as you said, on security and compliance. Yes, that's true. I would say cloud governance focused primarily, as you said, on security and compliance. The way we do that is very different than others who are thinking about security. From Fugue's perspective, the cloud is itself software-defined, which means that security is a software engineering problem, not a security analysis problem. So we actually create a complete model of the entire application or system that's being run in the cloud, and then we can compute against that. So we can compute things like, is it in compliance with certain compliance regimes like HIPAA or CIS or NIST? But we can also compute things like, has it changed over time? Has it mutated in dangerous
Starting point is 00:03:54 ways from one moment to the next? So that's the space we're in. Our approach is a little different. Understood. And security in the cloud is something that no one really pays attention to and it's not particularly interesting because nothing much ever really happens in that space. In a completely unrelated topic, as of the time of this recording,
Starting point is 00:04:15 a couple weeks back, there was a Capital One breach that puts a lie to everything that I just said. And you wrote a technical analysis of how that attack may have been pulled off that aligns almost perfectly with my own assessment of it. So can you take people through the high level of what happened and what you suspect occurred? Absolutely. So,
Starting point is 00:04:40 and I think the key word there is may, right? We only have a certain number of data. Largely, I focused on the DOJ complaint because that's an official document. And I did look at the screenshots of the attacker's Twitter feed, although only after I tried to recreate the attack. So I'm going to describe it from a technical perspective. And if you want me to, I can go back and describe it more in layman's terms. I think I've come up with a decent analogy for that. So some things we know about the attack from the DOJ complaint to the degree that that's accurate, but it's not enough to really piece the whole picture together. So I made some assumptions and I'll try to highlight those as I touch on them.
Starting point is 00:05:31 And let me start by saying I have really good friends both at AWS and Capital One that are absolutely brilliant engineers. And I think both those organizations do a phenomenal job. So this is stuff that can happen to just about anyone. I would absolutely agree with that. In fact, taking a look at what we've seen so far, there's been a lot of noise around this that doesn't necessarily seem to bear itself out. And a lot of people, of course, because it's the internet, speculating wildly and passing it off as fact. Oh, yes. Most of what I'm seeing out there, I think, misses the most
Starting point is 00:06:11 important points. And I hope your listeners get some value out of at least the things as somebody who's worked in cloud security and integrity for years now. I think there's some important things to notice that are largely getting ignored. My theory of the attack, and the way I did this is I came up with a theory, and then I recreated this attack in my own environment against my own infrastructure, not production-fugue infrastructure, it actually would not have worked there, but against my own development environment. So what we know from the DOJ complaint is there was a misconfigured firewall. We don't know what kind of firewall. We don't know what the misconfiguration was. And I think a lot of folks are conflating misconfigured firewall with the fact that an IAM role with WAF, which stands for Web Application Firewall, was used.
Starting point is 00:07:12 That may be true, but it also may not be true. In my theory, there is another firewall, a traditional IP firewall, that has a bad port. And we have heard from, I think, fairly credible sources that the attacker was scanning the internet for vulnerabilities. And that's the first point I'd like to make is kind of in the old days, attackers would target organizations specifically and go look for vulnerabilities. Now that still happens, but what's happening most of the time is attackers have automated their looking for vulnerabilities, and then they pick targets from those who are vulnerable. So in this case, it was Capital One, but it could have been Josh's auto repair. They just found a gap.
Starting point is 00:08:12 They found the back door open in a firewall and started working from there. So that might have been a security group that had a bad port open. It could have been another kind of firewall. But it appears that the attacker got access to likely an EC2 compute instance. And I say likely because there is a need to collect metadata off the metadata service in the hypervisor and to assume IAM roles. And that's most likely EC2. So I think that the attacker kind of got in the back door, looked around and found an exploitable EC2 instance. And then in her Twitter feed, she says she used assume role. And this is where things really get interesting. So assume role means taking on a different IAM identity.
Starting point is 00:09:11 So you've got this server, and maybe it has an IAM identity that only allows it to do things like, oh, I don't know, connect to a database. But once you're on that server, if that EC2 instance has itself IAM permissions to look at other IAM roles, you can kind of go shopping for an identity that allows you to do more destructive things. So let's assume for a moment, because we know Capital One are really good at this stuff, that that server's native IAM role, that EC2 instance's assigned IAM role, didn't have things like list S3 buckets, because that's a bad idea, right? You should know what S3 buckets you need to talk to, and listing them is like getting a phone directory to where the data safes are. But if you could get into that EC2 instance and shop for an IAM role that did have S3 list and then assume that role, now you get a listing of those S3 buckets. So I believe that was
Starting point is 00:10:20 the next step. And then once those buckets were identified, in the DOJ complaint, they use some interesting and specific language. They say that there were four commands executed. And in the fourth command, they describe it as a sync command. Now, S3's API does not have a sync API endpoint, but the AWS CLI has a utility function called sync for S3. So I suspect what happened is the attacker got into EC2, looked at the metadata services credentials, used those escalated identity and permissions through identity listed S3 buckets, and then synced those S3 buckets to another collection of S3 buckets.
Starting point is 00:11:17 And of course that would not send a lot of network traffic. It actually wouldn't send any network traffic over a VPC for things like VPC flow logs to catch. It would all be S3 to S3 and so largely invisible to traditional security tools. That's, in a nutshell, my theory of what happened. I would almost take it a potential step further and wonder if once you wind up assuming a new role, you still wind up having credentials that are now able to do that. There's no restriction that I'm aware of that will only permit that credential set to be used from a certain location. So running this from somewhere completely removed from the Capital One's portion of the internet, from a host
Starting point is 00:12:02 somewhere on the other side of the world, for example, would potentially have been able to do every bit as much of this without having full-on access to an instance inside the Capital One environment. Does that matter to your understanding? Yes and no. Generally true. However, let's again assume that Capital One are highly competent. We know they are. When you define S3 permissions, you can actually limit that to VPC CIDR blocks that you choose. And so assuming that they limited it to certain VPC CIDR blocks, the commands would have had to have come from those CIDR blocks. But I think the point here is that that protection didn't help.
Starting point is 00:12:44 Those S3 buckets that things were synced to could have been in another account, and that account needed nothing to do with that IAM role. It just needed public S3 buckets to shove data into. Everything you say is completely plausible, but that's more or less where your article starts and stops. That said, though, from my perspective, it feels like there are additional parts of the story. For example, 127 days elapsed between the time this data was exfiltrated and the time that Capital One was made aware of this by an external security researcher. apparently no auditing going on for strange behavior, such as listing 700 S3 buckets and then massive transfer of the contents of sensitive buckets. Am I mistaken on something there? I don't think you're mistaken, but one of the hardest problems in cloud security is kind of
Starting point is 00:13:41 finding signal in noise. So when you mentioned listing this S3, so let's break this down. What might have they detected? A listing of S3 buckets? Yeah, you'd find that command in the logs and so on. But the sync command is doing an S3 to S3 copy. So unlike a kind of traditional data exfiltration, where you have, you know, which would show up in a VPC flow log or in the old world, it would, you know, be hitting your perimeters of your network.
Starting point is 00:14:17 That doesn't hit any perimeters. That is just a command. And the data transfer happens kind of behind the scenes, not in the virtualized network that you would be monitoring. So my instinct is that, well, my educated instinct, let's say, is that this would be hard to detect. And given that she used both Tor and iPredator to cover her tracks, I strongly suspect the only reason she got caught and this got noticed is because she bragged about it. And stored the exfiltrated data and tooling
Starting point is 00:14:52 she used for this apparently in a GitHub account linked to her name. Yeah, this feels like somebody who, you know, I don't like to play psychologist, but somebody who wanted notoriety. And I think that was the undoing. She mentions that in the Twitter thread. certain of how you have configured IAM and S3, and particularly IAM permissions to S3 on production instances, I suspect there are a lot of folks with this vulnerability. I mean, it's like in the lateral movement kind of attack that we've seen or evidence of lateral movement, but the lateral movement is through identity on the cloud because identity isn't just user identity, it's system component identity. And so it almost forms a new kind of network. And I think it's that complex and
Starting point is 00:15:54 that big a problem. We're going to have to come up with security tools, and I'm biased in this because this is part of what we do, that considers IAM to be more like a network than just a set of usernames with authorizations. One other area for intercepting this, even if it didn't stop the attack but would have flagged it before someone else had brought it to their attention, wouldn't, in a relatively well-managed environment of this level of sensitivity, given that they are, in fact, a bank, isn't it sensible to say that they should have had something alarm whenever an EC2 instance role called an assume role API? Well, sure. Doing that is easier said than done unless you have that complete picture of the infrastructure and you can detect all drifts.
Starting point is 00:16:51 That's why we take that approach. Otherwise, you're you're you're tending to look through big, long logs of change of mutation and trying to pluck out from that what looks scary. And that's a really hard problem. And when you get to these, right, so let's rewind for a second, talk about the kind of fundamental benefits of cloud. It's not a data center. It's a moving surface. It is less like a bank vault, more like an aircraft carrier. You're building something that is constantly in motion. And that means there's a flood of API calls going on all the time every day if you're using it effectively. And that's a blessing because it allows us to build systems that scale automatically, that operate at high speed to compete with others in whatever your business is. You can, you know, talking to somebody the other day, a customer the other day that, you know, they do five to 10 production pushes a day. That really wasn't like that in the old days. So I think it comes
Starting point is 00:17:58 down to should things be caught? Yes. But what should be caught from that flood of information that is coming across the wire? And the old approaches of looking for things that appear scary just fail. You have to have an understanding of what the known good state of that infrastructure is and the ability to catch any drifts that occur to that infrastructure to elevate the actual changes above the normal functioning of the infrastructure so that you can put eyes on it. Or even better yet, our belief is those things should be automatically healed. That the attackers are automated. The defense needs to be automated. So that's a long-winded and wandering answer.
Starting point is 00:18:43 I apologize for that. No, no trouble at all. Ostensibly, isn't the value proposition of both Macie and GuardDuty in different ways to identify anomalous behavior, such as a EC2 instance that has always been a firewall until now that is now listing S3 buckets and causing a whole bunch of object gets and puts? Yeah, and I can't really speculate as to whether they were using those services and something got lost in the noise or if they weren't using those services. I mean, that's what those services try to do. Yeah, they could be forgiven for not using Macie.
Starting point is 00:19:18 I mean, they are a bank, but even a bank runs out of money sooner or later. And Macie is nowhere near affordable for any reasonable workload. Yeah, fair. Yeah, I think there are other approaches. But I'm going to go back to things like Macy's and others. They're still trying to infer what matters. And I personally believe these systems can be made much more deterministic than that, that you don't need like fancy logic to pluck out the signal from the noise. If you know what correct looks like, things that alter from correct are always suspicious unless they're handled through a proper CICD tool chain.
Starting point is 00:20:05 And again, I'll say the cloud is actually a big software computer. It's not a data center. And therefore you can use software engineering approaches like infrastructure as code and policy as code. And I think that's a much more successful way to deal with this kind of thing than trying to find needles in haystacks
Starting point is 00:20:27 of data flying by. This week's episode is sponsored by Chaos Search. If you've ever tried managing Elasticsearch yourself, you know that it's of the devil. You have to manage a series of instances. You have to potentially deal with a managed service, what if all of that went away? Chaos Search does that. It winds up taking the data that lives in your S3 buckets and indexing that and providing an Elasticsearch compatible API. You don't have to manage infrastructure, you don't have to play stupid slap and tickle games with various licensing arrangements, and fundamentally, you wind up dealing with a better user experience for roughly 80% less than you'll spend on managing actual Elasticsearch. ChaosSearch is one of those rare companies where I don't just advertise for them, I actively recommend them to my clients,
Starting point is 00:21:17 because fundamentally, they're hitting it out of the park. To learn more, look at ChaosSearch.io. ChaosSearch is, of course, all in capital letters because despite chaos searching, they cannot find the caps lock key to turn it off. My thanks to Chaos Search for sponsoring this ridiculous podcast. There is another position to take as far as blaming people goes, because whenever something like this happens, the first thing everyone wants to know is exactly whose fault it was and how irresponsible and terrible
Starting point is 00:21:50 they were. And I don't tend to give much credence to that. It's a natural human reaction, but having punitive responses inspires people to hide things. But there is an argument to be made that the current state of cloud security is
Starting point is 00:22:05 such that there's more than any one person can hold in their head as a full-time job, let alone the fact that most people are not security engineers. They have a thing they're trying to do and security is part and parcel of that, but it's not their core objective in what they're doing. So understanding all of the nuances of how all these things interplay feels like it's an awfully heavy lift. Maybe not as much for a bank, but given that we've seen this spate of cloud security issues and Capital One is almost certainly not the only company out there susceptible to something like this, it really does make someone wonder at what point do the
Starting point is 00:22:43 providers themselves bear some level of responsibility for simplifying the stupefying complexity that is the security model? Boy, you covered a lot of ground there. I'm going to go backwards. So why is it stupefyingly complex? It is stupefyingly complex because there are tons of features of the cloud. I remember back in the very early 90s, when I was a new Unix system administrator and I first got root access, I blew up my machine. I did. I got the arguments to a tar command backwards, and I replaced the contents of the kernel file with an empty tape. Why could I do that?
Starting point is 00:23:32 Should some microsystems have prevented me from doing that? At the time, I felt like it. But looking back, no. In fact, the beauty of these very rich and powerful systems is they allow humans to make lots of decisions and be clever. And with that comes risk. With that power comes risk. So, will the cloud providers get better at showing people the sharp edges? I'm sure they will. They have over time. But I think this is a more fundamental problem than the cloud provider should do better or cloud customers shouldn't make mistakes. I think this is actually a kind of physics and biology problem.
Starting point is 00:24:19 Human beings typically can only remember about seven discrete pieces of data. This is why phone numbers, sans area code in the US were seven digits long. The average person can remember seven things. We are bad at specificity. We are bad at detailed memory. And when you look at one of these cloud environments, let's say, you know, one of our customers might have 50,000 or so cloud resources, a resource being something like an EC2 instance or an S3 bucket. And when you look at all the ways you can configure those, each resource can be configured in thousands, tens of thousands, maybe hundreds of thousands of ways.
Starting point is 00:25:07 And so multiply those together. That is not the kind of problem humans are good at solving. It just simply isn't. But we have these handy things called computers and this 60 year old practice called programming and software engineering that is very well suited to this problem. So I think the blame lies on our kind of collective imagination to understand that the cloud is actually a big general purpose computer, and it needs to be programmed like one, and it needs to be automated, not just in terms of its scaling functions and business functions, but in terms of its security functions.
Starting point is 00:25:49 That might've been a little too geeky and down in the weeds. I'll take another shot at it if you like. No, I think that's an absolutely fair assessment. I think with great power does come great responsibility. I think that there's a responsibility to defend against sophisticated attacks like this. One thing that I've noticed across the internet in the wake of this has been, oh, she worked at Amazon back in 2016 on the S3 team,
Starting point is 00:26:20 so she must have used insider knowledge to pull this off. Well, you left Amazon, to my understanding, who used to be a principal solutions architect, and you left before 2016. And in 2017, they had their S3 apocalypse and then rebuilt the entire system from the ground up. Plus, you've laid out a very convincing and very plausible way this could have been exploited that none of it required inside knowledge. It just required deep familiarity with the platforms and publicly exposed utilities.
Starting point is 00:26:53 Yeah. You know, I can't know for sure. Apparently she worked on the S3 team, but my instinct is that that is bullshit. If that's not okay to say, I'll do something else. No, please, we'll keep it. Okay, I think it's bullshit. I think, you know, I recreated this in about five hours. My theory of how this worked.
Starting point is 00:27:19 I've been out of Amazon since 2013. I use no insider information to recreate this. I looked at APIs. I thought about it for a minute, actually for a few minutes, maybe a few hours over the course of doing it and pieced together a way to do this kind of attack. I think she was creative and you know what? So are a lot of people. So we should be worried about that. These new, the notion that an identity system and most people, when they hear identity, they think about, you know, my active directory user, you know, when I log into my machine,
Starting point is 00:28:03 that's not what this is. This is the identity of components of the system that can completely circumvent the traditional network boundaries. And so that becomes a vector again, for lateral movement. I don't think there was insider information use, I don't know, but none was needed. Absolutely, this is the sort of sophisticated and clever attack that for example i would dream up uh maybe not to the same degree certainly not with the ethical lapse but there's nothing that's
Starting point is 00:28:34 required from an inside baseball perspective on this and saying that oh it's obviously a failure of how amazon hires people that someone who worked there many years ago now did something awful. I mean, theoretically, if you were to go and turn evil and do something like this, I think there would be a whole hullabaloo made about the fact that once upon a time you worked there. So it must have been with inside knowledge you did all of these things. And I just think that's crap. I think so, too. I mean, look, people like things to have a bow on them. They like to have something to point out and blame.
Starting point is 00:29:10 Human beings, in my experience, are very uncomfortable with the idea that we're facing issues that are truly complex, that require a lot of thought, creativity and hard work to solve and instead look for some simple explanation. And I think that's one of a number I've heard. When actually for us to do something productive about this means really understanding what happened in an honest way. And I'm not claiming that my theory is perfectly correct or even, you know, mostly correct. It's the best one I could come up with. But if nothing else, my theory and experiment have shown that that is a massive attack vector and people need a solution to that. So I think it's pretty cheap to say, oh, this is an X AWS or it could have been just about anyone. It's also a bit insulting. Oh, no one could possibly understand AWS unless
Starting point is 00:30:05 they worked there for years. Nonsense. Humans can understand anything. Nothing's impossible for a person to wrap their brain around. It just requires dedication and effort more than almost anything else. Yeah, absolutely. I'll point back to the original use of the term hacker. It wasn't somebody who broke security walls down. It was somebody who did clever things in C or Lisp. You know, that's what people who are compelled to do creative things with computers get creative with computers. They use them in unintended ways. And if you are a bad actor, this is what that looks like. If you're a good actor, you might make the next great application or secure an existing one or what have you. But yeah, there's no easy answer to this.
Starting point is 00:30:53 And here's, I think, the most lingering question that we're faced with. Capital One may have their faults, but they don't hire stupid. And they care and they pay attention to this because they know what's at stake. If they were in a situation to fall victim to this, how many other companies are too? Every company that's running a digital computer, whether it's in a data center or on the cloud. Capital One does hire excellent people. One of my best friends and most brilliant programmers I know works there. We haven't spoken about this at all, by the way. We talk more about things like Rust and Haskell. So I know the quality of their team. Asking people to be perfect is unreasonable. And so I think the real question we have to be asking ourselves as an industry is how do we become resilient?
Starting point is 00:31:50 Because perfection is not an option. Exactly. And it can't be M&M security where once you break through the hard outer candy shell, everything inside is soft. Defense in depth is critical. I would even go beyond that. I completely agree. There is no perimeter. Forget about that. That's gone. It's never really been there. Security is a collection of architectural decisions. It's not a technology you can layer on. And so to get security right
Starting point is 00:32:22 means understanding the system as a whole. So I would argue that not just defense in depth, but defense at every level, all the way up and down the stack where you're doing your best to eliminate these vulnerabilities. And pretty clearly in this case, once that firewall was penetrated and the IAM role assumption was possible, that was a pretty soft middle. But I think that we need to think about this in terms of every layer of the stack and baking in security as an architectural practice. And that is directly at odds in many cases with kind of speed and efficiency. And so, again, I'm going to come back to computer science has pretty good answers for this.
Starting point is 00:33:11 People just aren't thinking about the problem that way. I think you may absolutely be onto something here, and I'm beginning to understand why it is that you started a company aimed at solving these problems. If people care more about what you have to say and want to see your thoughts, where can they find you? They can find me, our company is at fugue.co. I'm Josh Stella on Twitter. And if you want to reach out to me directly,
Starting point is 00:33:47 I'm josh at fugue.co. Thank you so much for taking the time to speak with me today. I appreciate it. Thanks, Corey. It's been fun to talk to you. Likewise. If you've enjoyed this episode, please leave us a positive review on iTunes.
Starting point is 00:33:59 If you hated this episode, please leave us a positive review on iTunes. I'm Corey Quinn. This is Screaming in the Cloud. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at Screaminginthecloud.com or wherever fine snark is sold. This has been a HumblePod production. Stay humble.
