Screaming in the Cloud - The Power of Networking in the Cloud with Tom Scholl

Episode Date: August 29, 2024

A cloud service is only as good as the team of network engineers who keep it up and running. In this episode, AWS Vice President and Distinguished Engineer Tom Scholl breaks down the importance of security and legwork needed to support the company’s massive infrastructure. Corey picks Tom’s brain while singing the praises of the AWS DDoS Protection Team, marveling at the scale of the modern internet, and looking ahead to the next generation of network engineers that could land at AWS. If you’ve ever wondered about the inner workings of the AWS cloud, then this is the discussion for you.

Show Highlights:
(0:00) Intro
(1:09) The Duckbill Group sponsor read
(1:42) The importance of a good network for AWS
(3:38) Evolution of networking
(6:03) Efficiency of the AWS DDoS Protection Team
(7:29) AWS Cloud and weathering DDoS attacks
(10:03) Policing network abuse
(12:08) Walking the SES tightrope and network attacks
(15:00) Ensuring the security of the internet
(17:53) The Duckbill Group sponsor read
(18:37) Scale of the modern internet
(20:47) Migrating the AWS network firewall
(21:54) Internal network scaling
(24:27) Preparing for DDoS disruption
(29:14) Finding the next generation of network engineers
(32:15) Where to learn more about AWS cloud security

About Tom Scholl:
Tom Scholl is a VP and Distinguished Engineer at Amazon Web Services (AWS) in the infrastructure organization. His role includes working on AWS’s global network backbone, as well as focusing on denial of service detection and mitigation systems. He has been with AWS for over 13 years. Prior to AWS, Tom was a Principal Network Engineer at nLayer and AT&T Labs (formerly SBC Telecom). He also previously held network engineering roles at OptimalPATH Digital Network and ANET Internet Services.
Links Referenced:
AWS Security Blog: https://aws.amazon.com/blogs/security/
How AWS threat intelligence deters threat actors: https://aws.amazon.com/blogs/security/how-aws-threat-intelligence-deters-threat-actors/
Using AWS Shield Advanced protection groups to improve DDoS detection and mitigation: https://aws.amazon.com/blogs/security/using-aws-shield-advanced-protection-groups-to-improve-ddos-detection-and-mitigation/
AWS re:Inforce 2024 presentation on Sonaris and MadPot: https://www.youtube.com/watch?v=38Z9csvyFDg
NANOG 2023 presentation on AWS networking infrastructure: https://www.youtube.com/watch?v=0tcR-iQce7s
AWS re:Invent 2022 presentation on AWS networking infrastructure: https://www.youtube.com/watch?v=HJNR_dX8g8c
AWS re:Invent 2022 presentation on Scaling network performance on next-gen Amazon EC2 instances: https://www.youtube.com/watch?v=jNYpWa7gf1A&t=1373s
IEEE paper on Scalable Reliable Datagram (SRD): https://ieeexplore.ieee.org/document/9167399

Sponsor
The Duckbill Group: https://www.duckbillgroup.com/

Transcript
Starting point is 00:00:00 I mean, it's definitely, you know, in the many, many terabits of capacity. And it's different layers of the network, right? Because you have to think from an availability zone, a data center, you know, how do you connect this to the rest of the world, right? So there's, you know, large amounts of capacity within a particular AWS region. And then you actually have to interconnect that too. Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is Tom Scholl, VP and Distinguished Engineer at AWS.
Starting point is 00:00:29 Tom, thanks for joining me. AWS, haven't heard of those folks. What do you do? Hey, thanks for having me. I am an engineer who focuses on our network, our overall infrastructure organization. So that includes our data centers, to our hardware engineering, to our supply chain, some of our network edge services, and our network infrastructure and things like particularly in the DDoS, anti-DDoS use case, as well as some of our CDN work as well. And more specifically, I focus on our network infrastructure, kind of our global backbone and internet transit and peering. And I spend a fair amount of my time in DDoS protection and disruption. This episode is sponsored in part by my day job,
Starting point is 00:01:11 The Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things. Predicting what it's going to be, determining what it should be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com. Remember, you can't duck the Duckbill bill. And my CEO informs me that is absolutely not our slogan. There's, I think, a lack of awareness societally around the value of the network to something like this. I mean, without this, AWS becomes probably the world's
Starting point is 00:01:50 largest collection of space heaters. Because without being able to talk to one another, computers don't tend to do a whole heck of a lot. It used to be something that was incredibly top of mind for folks because networks would break and things would stop being able to communicate clearly. But for most of the world, it's gone to the level of being a utility where when you turn on the faucet in the bathroom, you don't wonder, is water going to come out this time? It just does. If it ever doesn't, that's momentous. And networks have sort of gone the same way, at least from the business user perspective, in no small part due to people who are doing the things that you do. How did you get into this space?
Starting point is 00:02:25 Well, it all started back in the 90s. I used to dial into BBSs and started to learn a lot about Unix and telephony and those sorts of systems. And I eventually got a job at an ISP where you had to be a jack of all trades, where you had to know Unix sysadmin work. We had to run the Unix RADIUS servers, mail servers, to, hey, you have to learn some of that network stuff, too, in addition to being tech support,
Starting point is 00:02:46 too. So you had to basically kind of know it all and kind of end-to-end, right? And there was nothing that you could say no to because it wasn't your specialty. And I did that, and eventually got a job at the phone company in the Chicagoland area, which was Ameritech, which later got acquired by SBC, which had Pacific Bell and SNET in Connecticut,
Starting point is 00:03:02 Southwestern Bell, and worked on building our broadband network and our internet infrastructure and got involved in sort of the whole networking scene with NANOG and peering, the whole ecosystem. And basically it was building large networks. And then we eventually acquired AT&T, which is an even bigger network on top of that. And just did that for a fair amount of time.
Starting point is 00:03:19 And then around 2010, joined Amazon and left and then briefly came back and have been working on the Amazon side, primarily on our border network, which is basically the ISP transit provider of Amazon that connects our data centers, interregion connectivity, connectivity to and from the internet. And then the last four years, I've been spending a bit more time on the DDoS space. I had the privilege of watching your talk at NANOG in Kansas City a month or two ago before this recording. And it was interesting seeing how so much of what you do, especially these days, seems like it shies away from a lot of the
Starting point is 00:03:51 technical countermeasures for DDoS and leans much more heavily into being a human being, reaching out to network operators on the other side of the line when you start seeing bad behavior emitting from their networks. Has that been something that's always been the case, and I've just been blind to it? Or is this an evolution in networking culture? So I think in the networking culture, there's always been a strong operator community and you build a lot of relationships and friendships over time where, you know, hey, if there's a problem in another person's network, like you are that Rolodex, right? For reaching out to somebody, a particular CDN or cloud or hosting provider and saying, hey, we've got an issue and we need to troubleshoot it.
Starting point is 00:04:25 And so that transition worked really well in the DDoS space where you would see the sort of abuse that might be occurring from different parts of the world. It's like, well, who do I know there? Well, I know some of the networking side and let me go reach out to them. And depending on the nature of the issue, it is human contact to basically engage somebody to say, hey, can you route me to the right person? So I was kind of doing it for decades on the network side when it came to troubleshooting. And with the DDoS side, it's kind of a natural evolution to kind of leverage those same relationships to make progress.
Starting point is 00:04:54 It's one of those areas where it feels like there's not a lot of public awareness of the fact that all the big hyperscalers who compete with each other in cutthroat ways and in many business ways are very much working together around things like, I guess, the dark forces that will attempt to destroy the internet, around security, around abuse, around network peering. There's very much a sense of we're all in this together in every conversation I've been a part of. That's correct. There's very much a lively operator community where, you know, reaching out to people, engineer to engineer,
Starting point is 00:05:26 operator to operator, when you have a problem, it's like, hey, there's a mutual thing. It could be a mutual customer of ours, right? Or whatever it might be, but it's like, you know, we want to get the packets to flow. We're all in this together. Let's try to find a way to work the problem and drive resolution. And so a lot of that could be, you know, directly through email or other back channels or Slacks and things like that, where you need to reach out to people. We certainly found issues in other people's networks where it's like, hey, this thing is on fire. You need to take a look at it. And so in addition, we have formal ways to actually engage individual NOCs and things like that.
Starting point is 00:05:55 But definitely having those relationships pays off quite a bit when it comes to networking, abuse, DDoS, stuff like that. AWS offers a Shield product that is DDoS protection, and the basic level is rolled out to most of your endpoints. Customers benefit from that automatically. There's a Shield Advanced product that comes in at a fixed fee of $3,000 a month, which at enterprise scale is a drop in the bucket. It also does some weird economic things of changing how WAF rules wind up being charged. But what I found from customers who've had that and who have suffered from DDoS issues historically,
Starting point is 00:06:28 far and away the thing that they say that the biggest benefit of that has been being able to coordinate more closely with the AWS DDoS prevention team. Every story I've heard about those folks has been absolutely top flight. And it's rare because usually when someone is undergoing an attack,
Starting point is 00:06:45 they're not in a good mood. I'm just going to say it. They're angry. They're stressed out. They're wondering, will the website ever work again? So they're inclined to lash out. But I've heard nothing but positive stories about the team's work. That's great to hear. And I'm sure the team will be delighted to hear that. Because I assure you, if you have negative things to say, they find their ways to me. I'm sort of a negativity magnet by happenstance, I suppose. No, I mean, that team, and I work with them really closely, and they basically protect all of Amazon in addition to customers who have, let's say, Shield Advanced, where they directly engage with them, identify the attack, come up
Starting point is 00:07:19 with mitigations, and work with customers pretty closely. So it's definitely an area that we're proud to have and definitely enjoy working with them closely. My experiences with DDoS historically, and when you start a sentence like that, it sounds like it could go really negatively, but no, I was always firmly on the victim's side of it, where I was network staff for a time for the Freenode IRC network, which was an ever-popular target because, oh, well, what am I going to do today? I'm just going to give people grief on the internet because. So there were constant challenges
Starting point is 00:07:50 in dealing with SYN floods and then more sophisticated attacks as time went on. And you saw it not just in my hobbies there, but I would see it with companies where in some cases suspected competitors would wind up launching giant attacks at unprotected endpoints. And it was easier to do early on when someone had a few servers sitting in a rack in their office.
Starting point is 00:08:12 You can overwhelm links pretty easily. As hyperscaling started to be a thing and people started realizing, oh, maybe there's something to this cloud thing, at least publicly it seems like a lot of those problems kind of went away. Given that you have been talking about this for a while, including on stage to very smart network people like yourself, I get the sneaking suspicion that people just didn't give up on this. There's an awful lot of very hard work that you and people like you are putting into this. How has it evolved?
Starting point is 00:08:40 Definitely. In the last several years, when you think about DDoS, there's different types out there. There's what we call layer four DDoS, and that's basically, you know, either bandwidth saturating, bits per second heavy, or packets per second heavy, which is really there to kind of exhaust state, right? So traditionally, that's been historically how we think about DDoS. And then the last several years has also been layer seven request floods, which are basically HTTP GET-style attacks that just overwhelm from a request-per-second perspective. But what has changed is that in the last several years, there's been much more focus in actually identifying the infrastructure that's being used to launch these attacks and actually focusing on disrupting that and engaging with the actual sources of this traffic to go and get this shut down.
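The three dimensions Tom describes (bits per second, packets per second, and requests per second) can be sketched as a simple classifier. This is only an illustration: the `TrafficSample` type, the `classify` function, and all three threshold values are hypothetical and are not drawn from anything AWS has published.

```python
from dataclasses import dataclass


@dataclass
class TrafficSample:
    """One measurement window for a protected endpoint (hypothetical shape)."""
    bits_per_second: float
    packets_per_second: float
    requests_per_second: float


# Illustrative thresholds only; real systems would baseline per endpoint.
BPS_LIMIT = 100e9   # bandwidth saturation
PPS_LIMIT = 50e6    # packet rate high enough to exhaust state
RPS_LIMIT = 1e6     # application-layer request flood


def classify(sample: TrafficSample) -> list[str]:
    """Return the attack categories whose threshold the sample exceeds."""
    labels = []
    if sample.bits_per_second > BPS_LIMIT:
        labels.append("layer4-bandwidth")
    if sample.packets_per_second > PPS_LIMIT:
        labels.append("layer4-packet-rate")
    if sample.requests_per_second > RPS_LIMIT:
        labels.append("layer7-request-flood")
    return labels
```

A single sample can fall into more than one category, which is why the function returns a list rather than one label.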
Starting point is 00:09:21 And that comes in different forms, right? Where it could be if it's spoof-type traffic, which we could talk a little bit more about how we can, you know, with our global backbone and our global reach and the amount of networks we connect to gives us insight into where spoof traffic comes from. And that's a unique one
Starting point is 00:09:34 because that's been a 20 plus year issue. And that goes back to IRC and Smurf attacks and things like that that people used to do. So that was kind of a unique area where we stepped up and, in collaboration with other networks, actually chased that down. And then there's other areas where we look at things like botnets and finding the command and control servers and actually going to target them and reach out to the hosting provider to get that shut down and the domain registrars as well.
Starting point is 00:09:56 So that's some example of where we've started some of that work and pushed pretty aggressively on it. I started my career in tech running, well, what I thought were large-scale email systems compared to what you folks are doing. But I'm at the scale of, that's cute at a university. But managing a lot of the spam that was coming in was sort of a hobby horse of mine. I wound up getting dragged along fairly far down that path. But today, if I were to set up a web server somewhere on the internet, or sorry, set up an email server somewhere on the internet and start turning it into an open relay or
Starting point is 00:10:30 sending ridiculous spam out of it, it would not be very long at all before every provider within some small degree of rounding error would no longer accept traffic from that server. They would effectively black hole that. It would wind up on a bunch of block lists, and that would be the end of it. I'm curious why that pattern doesn't tend to follow a lot of these network providers who do a poor job of policing the traffic that they are emitting. Is that just because they're so big that it's difficult to wind up seeing it all from their side? Is it that they're too big to block as people are just not going to block AT&T, for example? Or is there something more to it? I mean, I think every network has their own policy of how they deal with this. I think some networks actually are proactive and they look and are we sending any abuse out? And you definitely find cases where there's other networks
Starting point is 00:11:17 that could do a better job. I know from the AWS perspective, we certainly have various different detection and mitigation capabilities if we ever see anything anomalous leaving from our network. And one of the things that in the last few years, like we've actually up leveled that to look for communication to command and control servers that might be out there on the Internet and actually block that communication, which even prevents resources from actually launching attacks in the first place, as well as reaching out to customers and say, hey, you're talking to this thing that our trust and safety team will go and engage with. So I think we do a really good job of actually doing that sort of preventative type of work, where I think a number of other networks out there just haven't gotten to that area. Maybe they just, you know, the abuse team may not be funded appropriately. I can't really speak to how other networks operate, but we definitely, it's a high priority for us for sure. Something that I do want to call out is in the
Starting point is 00:12:09 early days when, even before SES came out, the EC2 IP ranges were generally in some cases a source of abusive traffic. And this is no necessary fault of your own. It's when you wind up letting anyone start using computers with the swipe of a credit card instantly, that that's an incredibly powerful thing. Not everyone is a good actor trying to build a business. Sometimes it's just, I want everyone to see my marketing, and it devolves massively from there very quickly. And you see that tension somewhere, where people sometimes find it challenging to get out of the SES sandbox for some workloads. Having worked with the SES team enough, I am of the opinion that they make the right call most of the time. But in the early days, AWS's traffic, especially once SES launched, was viewed
Starting point is 00:12:54 in the anti-spam community with some suspicion and distrust. I think on some level, that's probably a function of scale where, well, they're too big to really be able to communicate with anyone over there. So of course, they're going to be a bad actor. I don't see that anymore. There has been a tremendous focus somewhere on stamping out that behavior. But it's also happening from the perspective of not inconveniencing legitimate customers. That feels like an impossible tightrope to walk, but somehow you folks have done it. Yeah, I don't work with the SES team that closely,
Starting point is 00:13:26 but I'm aware of some of the efforts that they've done in terms of how they control and their detection systems that they built to prevent that sort of activity. But we could follow up with you with more details on some of that. There's more to it than I believe just email. That's the one that I have the best experience with.
Starting point is 00:13:41 But I do not hear particular stories. When you hear about the various forms of novel network attacks and the rest, and you start looking at some of the traces that wind up getting published, for here are the bad actor IPs
Starting point is 00:13:52 that are helping to slam this thing, I don't see AWS represented nearly as much as I would expect relative to the sheer number, the sheer size of the IP space that you folks control. There is clearly something highly
Starting point is 00:14:05 proactive going on that is making the internet a better place. One of the things that we've talked about in the last year is the system called MadPot, which is basically our honeypot system that we developed internally several years ago, which lets us basically be a sponge to any sort of negative activity that's going out there. And so we can ingest that data, we can process it, and we can determine where is it coming from, basically. And if it's coming from internal resources, such as from EC2, we engage with our trust and safety teams directly to reach out, engage with customers, or take any other sort of mitigating action.
Starting point is 00:14:39 So we have some of the systems in place to detect that and proactively engage and take action as just one example. And that MadPot system has been used for a variety of other systems on the DDoS side, but that's just another example where that and some of the work from our trust and safety team to identify and mitigate any sort of outbound malicious abusive activity. It's been said for a long time that at AWS, security is job zero. And I've always interpreted that to mean protecting customers from external bad actors, the end. And also in many cases from hypothetical insider attacks at AWS. Here's how we guarantee that even Amazonians can't access your data when it's stored here.
Starting point is 00:15:19 And you have countless white papers on this to the point where, okay, if there's something inaccurate in here, I'm certainly not going to be the one to find it. I take that at face value just based upon the sheer amount of work you folks have done. A lot of the work that you're doing seems to be, in many respects, aimed not just at protecting existing customers, but also at the larger Internet's well-being as a whole. Is that accurate? Is that a wildly naive, Pollyanna optimistic style misreading of the situation? No, that's accurate.
Starting point is 00:15:53 And as we went down this journey around like 2020, that's when I started pivoting into the DDoS space. And, you know, it was not just, you know, protect AWS infrastructure, but protect our customers. But, you know, looking at the data
Starting point is 00:16:03 and collaborating with other external networks, it was just a few of us together that said, you know, we can actually take this further. Like, let's not just observe it and block it, but, you know, we can actually take some actions here that will be good for the internet as a whole. And so that's how we started looking at kind of those three different silos of attack traffic where we saw, hey, there's spoofing traffic coming into our network through peers. Let's go directly engage with that peer to say,
Starting point is 00:16:25 can you trace the spoofing back and go and filter it and prevent it and just make that a daily habit? And now that one's a little bit more complicated because you have to go and engage with networks externally and explain to them what spoofing is. There's a lot of networks. Networks have grown. People who might have been there back in the day aren't there anymore
Starting point is 00:16:41 who maybe are more familiar with it. So you have to also kind of get over the hump of explaining, and with pictures, like, hey, this is what spoof traffic is. It's, yes, we know that's not your IP. Can you go use, you know, your NetFlow tooling to go and figure this out? So that was kind of one area. And then when it came to botnets, it was just like, well, we've got our MadPot systems. We can find where these botnet command and control servers are and the domains that they're using. Like, we can go and actually automate and generate the notices to these hosting providers to say, here's the data about what's on here.
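As a rough illustration of the spoofing check discussed above, the sketch below tallies, per ingress peer, the flow records whose source address claims to be inside the receiving network's own address space. Traffic arriving from outside with one of your own addresses as its source cannot be genuine, so a high count points at the peer to contact. The episode does not describe how AWS actually detects spoofed traffic; the heuristic, the `OWN_PREFIXES` list (a documentation range), the peer names, and the record shape are all assumptions made for the example.

```python
import ipaddress
from collections import Counter

# Hypothetical: address space owned by the receiving network.
# 198.51.100.0/24 is a documentation range, used here as a stand-in.
OWN_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]


def spoofed_flows_by_peer(flows):
    """Count flows per ingress peer whose source claims to be in our own space.

    `flows` is an iterable of (ingress_peer, source_ip) pairs, a simplified
    stand-in for the fields a NetFlow-style export would carry.
    """
    counts = Counter()
    for peer, src in flows:
        addr = ipaddress.ip_address(src)
        if any(addr in net for net in OWN_PREFIXES):
            counts[peer] += 1
    return counts


flows = [
    ("peer-a", "198.51.100.7"),   # claims our space, arrived externally
    ("peer-a", "192.0.2.10"),     # ordinary external source
    ("peer-b", "198.51.100.9"),
    ("peer-a", "198.51.100.8"),
]
suspects = spoofed_flows_by_peer(flows)
```

The per-peer tally is the useful output here, since the conversation is about deciding which peer to ask to trace and filter the traffic on their side.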
Starting point is 00:17:08 It's issuing attack commands to however many thousands of resources around the world. Please take this down. And that also goes into the Layer 7 side where you have resources where these booters and stressors, we didn't get really too much into kind of where these attacks come from, but the booters and stressors, they set up a number of machines and they get open proxy lists and they just basically go and blast away at them.
Starting point is 00:17:34 And so you could try to mitigate all the proxies on the Internet or would it be better to really just go to actually the source that's actually generating it and focusing on it? And it was just really just a few of us together that said it wasn't anyone's roadmap, really. We're like, this is something we should just go and do. Let's get it going and measuring the impact of it that it's had, it's been pretty exciting. Here at the Duckbill Group, one of the things we do with, you know, my day job is we help negotiate AWS contracts.
Starting point is 00:18:00 We just recently crossed $5 billion of contract value negotiated. It solves for fun problems such as how do you know that your contract that you have with AWS is the best deal you can get? How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth. To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com. One thing that I continually have to remind myself of is the sheer scale of the modern internet.
Starting point is 00:18:41 You folks recently announced Direct Connect availability in some locations at 400 gigabit per second, which is just monstrously fast. Now, I can make jokes because of how I see the world in terms of data transfer means money, but ignoring entirely economic impact of that,
Starting point is 00:18:58 the sheer scale of peering between AWS and Comcast, given the disturbing proportion of the internet, and sometimes it feels like you and your peers tend to represent the sheer volume of traffic that must be. It almost beggars belief to be able to even picture that sense of scale. It feels like at some point, even as big as I think it is, the reality is almost certainly much larger than that. No, I mean, it's definitely in the many, many terabits of capacity. A lot of our teams that focus on our backbone network topology, like what are the amount of routes you need to set up backbone links to, understanding diversity when it comes to, well, cables are going to get cut, terrestrial or subsea, right?
Starting point is 00:19:51 And so how much additional capacity do you need to provision on alternate paths to plan for some of these cuts where a terrestrial cut might be short-lived, it might be a day or two, whereas a subsea could be weeks or months, right? So you have to put a lot of planning into actually having a lot of this capacity there and standing by. And then there's the internet side. Once you get to the actual edge of the network, you actually have to go and capacity plan with all these external networks, right? And one of the things that's really been helpful for us
Starting point is 00:20:15 is that in the last several years, we've taken a lot of the data center technology, network technology that we've used there, and we've actually brought that into basically the border, kind of the ISP border backbone side of it, where we've taken some of these smaller commodity chipset devices and actually used them in the internet scale, which is something that is not super common out there. And so that's really allowed us to basically get into these N by however many hundred gigs or N by however many 400 gigs. And it's been able to
Starting point is 00:20:43 allow us to scale up rapidly and stay ahead of things. You folks, I think earlier this year, had a blog post or, I don't know if it was a blog post or white paper. I know it was esoteric compared to a lot of the stuff that you folks put out, which frankly, I'm here for. It talked about migrating off of a bunch of networking appliances, from legacy vendors was the vibe that I got, onto the AWS managed firewall offering and how that wasn't just a matter of the capability of handling throughput
Starting point is 00:21:11 at scale, but the ability to get observability into what those traffic flows looked like in ways that previously had been very challenging. I'm aware of that project and know the team really well. That was an effort to basically move away off of hardware-based firewalls in a certain particular portion of the network. And it really caused the team to look at network firewall and
Starting point is 00:21:31 how are we going to leverage the capabilities of that system. And it actually, in the end, it got us to a really good spot because it gave us a level of compartmentalization that we like with VPCs. It gave us a level of visibility through flow logs and through some of the network firewall capabilities that we really like. And so it was a good success story of how like, hey, we can run these workloads on our products and it's worked really well.
Starting point is 00:21:53 Well, the question I have is about the way internal networks work there. I mean, obviously the way that you are peering with other folks, you aren't rolling out your own custom special version of BGP because as it turns out, when it comes to the internet, interoperability is kind of a big deal.
Starting point is 00:22:07 But at re:Invent two years ago, you folks talked about an internal TCP replacement protocol, SRD or something like that. And it was, this is fascinating what you talked about. It makes latency to EBS a lot lower, was I think the story that got told. It was, this is fascinating from a protocol perspective. Can you tell us more about it?
Starting point is 00:22:24 And the answer was no. Great. Awesome. We just sit here and be envious from the outside. My question is, internally at AWS, when you start getting into the large scale internal networking piece, how much of a resemblance does it bear to what you might expect from commercial offerings or someone working in a Cisco lab to pass a certification, where everything just scales up from there? Is it complete Wonderland style stuff,
Starting point is 00:22:52 or is it just the basics you would expect anywhere else writ large? I would say that at certain network interconnection points between, let's say, EC2 and sort of the border network, that's where you'll typically still find things like BGP operating. We certainly use BGP. Obviously, externally, we have to, to the internet. Within the data centers themselves, there's a mixture of different existing open standard routing protocols.
Starting point is 00:23:14 But in the last few years, there's been some effort to actually focus on, can we build additional protocols that can provide us more rapid convergence and more unique topologies? So there's definitely active work going on there to actually look at. Because some of these protocols, BGP, OSPF, they do have their own limitations. And you could modify them and twist them and turn them in certain ways. But there's also some benefits by saying, can we rethink how we do link adjacencies
Starting point is 00:23:40 and how you do path calculations? So certainly within the data center space, there's some of that innovation that's been going on there. And another part to also consider is that a lot of what we do in terms of traffic steering is through controllers, right? So we have different software based controllers when you have traffic that goes,
Starting point is 00:23:56 let's say to the internet to basically how do you, you know, routing protocols don't have a lot of things about performance, right? They don't understand latency. BGP doesn't capture that. So a lot of behind the scenes,
Starting point is 00:24:08 we have controllers that actually look at performance data from the system that feeds into CloudWatch Internet Monitor to actually steer things to say, okay, you need to move this prefix over this location. Okay, this other path's latency has gotten better. Let's shift it over there. Does it fit? So there's a lot of, it's not just the protocols themselves. It's also the controllers that actually manipulate
Starting point is 00:24:24 the routers themselves and forwarding. [...] guaranteed way to wind up beating it was to be able to throw more bandwidth at it than the attacker could summon. The problem is with malware being what it is and the scale of the internet today, they more or less wind up with infinite levels of bandwidth. So at some point, that just becomes an arms race. How have you been doing around the area of DDoS disruption? So, I mean, you're accurate in that, like, yes, the attacks get bigger and bigger. You know, it used to be, you know, hundreds of gigabits, and now you're seeing into the low terabits level of bits per second. And, you know, in order to address that, you need to have a really large front door, right? And so that is one of the things that AWS does have, you know, at our scale is that we do have those large front doors,
Starting point is 00:25:16 whether CloudFront or Application Load Balancer, where you can basically absorb some of that traffic level. So that's certainly critical in order to be able to operate in that space. Now, in terms of disruption, it really comes down to identifying, through some of our systems like MadPot, where these attacks are coming from and then engaging with those external network operators to basically say, hey, there's a C2 server
Starting point is 00:25:38 that needs to be taken down. It's clearly hosting bad things. Can you shut it down, please? This domain registrar, can you take this domain down because it is hosting a C2 that's there? With the layer seven attacks, it's interesting because it's actually typically a lot of Node.js scripts running on machines with lots of memory and a proxy list that someone imports, and it has some orchestration. So typically, a lot of these DDoS operators, they have storefronts, and those storefronts are kind of hidden a little bit further away from where the attacks actually get generated.
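Editor's sketch: the controller-based steering Tom describes a little earlier (measured latency, rather than BGP itself, deciding which location a prefix leaves through) could look roughly like the following. This is a minimal illustration under stated assumptions, not AWS's implementation; the `PathSample` type, the `choose_egress` function, the egress names, and the 5 ms threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    prefix: str        # destination prefix, e.g. "203.0.113.0/24"
    egress: str        # hypothetical name of an egress location
    latency_ms: float  # latency measured toward the prefix via that egress


def choose_egress(samples, current, min_gain_ms=5.0):
    """Return {prefix: egress} for every prefix that has measurements.

    A prefix is moved only when the best measured path beats the path it
    currently uses by at least min_gain_ms, so small fluctuations do not
    cause constant re-steering.
    """
    # Lowest-latency sample seen for each prefix.
    best = {}
    for s in samples:
        if s.prefix not in best or s.latency_ms < best[s.prefix].latency_ms:
            best[s.prefix] = s

    # Latency of each (prefix, egress) pair, to look up the current path.
    latency = {(s.prefix, s.egress): s.latency_ms for s in samples}

    decision = {}
    for prefix, candidate in best.items():
        cur = current.get(prefix)
        cur_latency = latency.get((prefix, cur))
        if cur is None or cur_latency is None:
            # No current path, or no measurement for it: take the best one.
            decision[prefix] = candidate.egress
        elif cur_latency - candidate.latency_ms >= min_gain_ms:
            decision[prefix] = candidate.egress
        else:
            decision[prefix] = cur
    return decision


samples = [
    PathSample("203.0.113.0/24", "site-a", 40.0),
    PathSample("203.0.113.0/24", "site-b", 30.0),
    PathSample("198.51.100.0/24", "site-a", 20.0),
    PathSample("198.51.100.0/24", "site-b", 18.0),
]
current = {"203.0.113.0/24": "site-a", "198.51.100.0/24": "site-a"}
decision = choose_egress(samples, current)
```

In this example the first prefix would shift to `site-b` because the measured gain is 10 ms, while the second stays on `site-a` because a 2 ms gain falls under the threshold. A real controller would also have to weigh capacity (Tom's "Does it fit?"), which this sketch leaves out.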
Starting point is 00:26:22 So a lot of the focus that we've done is looking at the actual infrastructure that can generate these and direct engagement with those networks to shut it down where possible. There was a school of thought for a while that, oh, about hackback attacks, where, oh, someone is attacking you, you just go ahead and wind up breaking into their systems and the rest. And I was a little concerned because that's always been a dicey proposition at best. So when you started talking in your talk at NANOG about the idea of disrupting these attacks, it's like, oh no, this is about to go somewhere disastrous. And no, you kept it very much in the correct direction. And I do keep a hand in the space just to make sure that people aren't increasingly suggesting debunked ideas from the early aughts again, because enough time has passed and people think that, oh, well, this time it's sure to work. But your holistic approach to it has really been something of note.
Starting point is 00:27:14 No, I think definitely on the spoofing side, there's a lot of collaboration with networks. And occasionally we do get a network where it can be difficult to deal with, right? And so we'll sometimes talk to their peers as well, or maybe their upstream provider, you know, we're not getting through to them and we'll talk to them and be like, Hey, this is coming from your downstream network. Like, what are your options here that we can, we can do? So we definitely focus on, you know, being nice and communicating through email or personal contacts to, to address whatever the issue is. And, and it's a mixture of
Starting point is 00:27:43 things of like education, right? Some of these networks just don't know. It's interesting that broadband networks have done a really good job of preventing spoofing by default, right? You get a cable or DSL line, you can't spoof on it. But it's typically kind of the hosting shops that we find that have typically, oh, if you've got a dedicated server, then you can spoof, right? So a lot of it comes down to education and saying, you should make this the default, right? Or when somebody asks, sometimes people can ask their hosting provider, say, hey, I need to spoof for whatever use case, right? Sometimes they call it IP header modification, IPHM.
Starting point is 00:28:16 They'll ask for that to be removed. It's like, okay, we've talked to hosters. We're like, oh, this customer asked for it to be removed. They're like, well, you might want to be a little skeptical about it next time, if you can, please. Yeah, once it's been removed, what is the behavior they start doing? What are you seeing going across the wire? Yeah, trust but verify.
Starting point is 00:28:31 You see all these packets per second that spikes up, right? And it's all to UDP destination port 53 or 389. Like that's a pretty good clue, right? And so that's some of the things that we do try to educate networks. So it's like, this is what it looks like. Here are the different like heuristics or things that you can look for as a network operator
Starting point is 00:28:46 to find this going on in your network. And so that's what we've been really spending a lot of time in trying to educate and be like, here's how you can use some of your off-the-shelf NetFlow tools and some of our open source that you can actually dig on this and find it on your own. And I think that's where we've had a lot of success. And there are some networks that are in that mode or they actually do find it on their own and they deal with it. By the time you reach out to them, they're like, hey, it's already taken care of. It's like, that's amazing. I'm glad we've got you in a good spot now. [...] of other people who have been doing this for a very long time, eventually parts wear out and
Starting point is 00:29:25 need to be replaced. As much as some of us might want to live forever, that is not an option that is currently available. Where does the next generation of people who will do in the future what you do today, where do they come from? Yeah, no, that's a great question because I think we struggle with that too sometimes in terms of how do you find talent and how do you, you know, one of the Amazon leadership principles I like a lot is learn and be curious, right? And I think, you know, trying to identify folks who have that learn and be curious of like, hey, I want to go deeper here. I want to understand this a little bit more, you know, don't maybe just treat this as yet another attack, but like actually understand what's going on behind
Starting point is 00:29:58 it. Like what's actually generating this, right? So a fair amount is just kind of identifying folks who are interested and, you know, presenting opportunities for them, right? And I think that is the, you know, as senior technical leaders, like you have to present opportunities for others. And sometimes it may not go the way you expect, but that's fine. You have to learn. And basically, you know, allowing people, you know, connecting them with other folks externally, right? Whether that be external forums, different trust groups, and just how do you basically like, hey, I want to get you into this. And I can, you know, serve as basically connecting them with other folks, giving the opportunity to take something and running with it, you know, talking about it after the fact. But it definitely requires like real effort, right? To actually,
Starting point is 00:30:38 you know, help and educate at the same time, which is like, hey, I'm going to have to, you know, let me try to explain this to you as best as I can. If you have any questions, let me know, no matter what, silly, good, bad, whatever it is, like I'm here to help, right? I want to make you successful. And I think certainly as senior technical folks, we definitely need to be growing other folks. And it needs, you have to carve out the time and resources for it. Do you find that those folks are matriculating into your org as having studied networking and that that was the direction they wanted to go in, or are they basically phasing in from other technical areas? I've seen all types. It's not always purely people with a networking background. I've seen people, you know, and I've had this conversation with folks before in some of these areas where they're like, well,
Starting point is 00:31:16 we're not security engineers. I'm like, neither am I. Like, this is just purely, you know, this is an area to immerse yourself in. And it was kind of my journey, too, when I got in the DDoS space. Because I've always dealt with it on the receiving end, right? When we build the network infrastructure and seeing attacks come in. But I never said, I'm going to actually try to understand this. And so I had to, myself, immerse myself in this domain. And even internally, working with other teams inside of Amazon, just understanding trust and safety or the fraud team. And I was like, hey, I'm coming in here as a newbie.
Starting point is 00:31:45 What can I learn? And I think definitely with other folks, we've seen people come in from various backgrounds where it's like, okay, I want to go and learn. Luckily, we have a lot of tools and data at our disposal where folks can pick up and go. And I think it's just really about connecting people to it.
Starting point is 00:32:02 And particularly when you're surrounded around a particular outcome, right? So, hey, like we want to address this particular issue. Like, how do we go and lean in here? And like, what are the different people that we need to bring together? So yeah, it's all types of backgrounds. I really want to thank you for taking the time
Starting point is 00:32:16 to talk to me today. If people want to learn more, where should they go? So on the AWS Security Blog, we've definitely had a number of postings about some of the things that we've built. So we've talked about things like, if you search for MadPot, a recent thing that we've talked about, which is Sonaris, that we just were public about, which is sort of this
Starting point is 00:32:33 basically service behind the scenes that actually detects people trying to go after, attack customers, right? And it actually blocks them. So I'd recommend reading some of the things that we've done on Sonaris, MadPot, Shield Advanced. We've got a number of blog posts that are out there. Yeah, that's a good starting point
Starting point is 00:32:52 to kind of learn some of the things that we've done in this domain. And we will definitely make it a point to put those in the show notes. Tom, thank you so much for speaking to me. I really appreciate it. Oh, thank you for having me. Tom Scholl, VP and Distinguished Engineer at AWS.
Starting point is 00:33:05 I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment, so then I can block that particular platform from syndication, because that's how it works.
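Editor's sketch: the spoofing clue Tom gives network operators earlier in the conversation (packets per second spiking, all toward UDP destination port 53 or 389) can be checked against flow records along these lines. This is a hedged illustration, not AWS's tooling or any particular NetFlow product's API; the record fields (`ingress`, `proto`, `dst_port`, `packets`), the function name, and the threshold are assumptions. Records are grouped by the customer-facing ingress interface rather than by source address, because a spoofed source address says nothing about who is sending.

```python
from collections import defaultdict

UDP = 17                      # IP protocol number for UDP
REFLECTION_PORTS = {53, 389}  # the two destination ports named in the interview


def suspicious_ingresses(flows, window_s, pps_threshold):
    """Return {ingress: packets_per_second} for ingress interfaces whose
    UDP traffic toward port 53 or 389 meets or exceeds pps_threshold.

    flows is an iterable of dicts with the hypothetical keys
    'ingress', 'proto', 'dst_port' and 'packets', all collected over
    the same window of window_s seconds.
    """
    packets = defaultdict(int)
    for f in flows:
        if f["proto"] == UDP and f["dst_port"] in REFLECTION_PORTS:
            packets[f["ingress"]] += f["packets"]

    rates = {ingress: n / window_s for ingress, n in packets.items()}
    return {i: r for i, r in rates.items() if r >= pps_threshold}


flows = [
    {"ingress": "cust-a", "proto": 17, "dst_port": 53, "packets": 400_000},
    {"ingress": "cust-a", "proto": 17, "dst_port": 389, "packets": 200_000},
    {"ingress": "cust-b", "proto": 17, "dst_port": 53, "packets": 600},
    {"ingress": "cust-c", "proto": 6, "dst_port": 53, "packets": 900_000},
]
flagged = suspicious_ingresses(flows, window_s=60, pps_threshold=1000)
```

Here only `cust-a` would be flagged, at 10,000 packets per second; `cust-b` is ordinary DNS volume and `cust-c` is TCP. A flagged interface is a lead to investigate, not proof of abuse, since busy resolvers can legitimately send a lot of port 53 traffic.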
