Screaming in the Cloud - How to Get 75 Gigs of Free Storage in AWS with xssfox

Episode Date: April 8, 2020

About xssfox: Just a dumb fox.

Links Referenced:
- DigitalOcean: https://www.digitalocean.com/
- CHAOSSEARCH: http://CHAOSSEARCH.io
- Big Buck AWS: https://github.com/xssfox/bigbuckaws
- Corey's talk, "Terrible Ideas in Git": https://www.lastweekinaws.com/blog/terrible-ideas-in-git-by-corey-quinn/
- Twitter: https://twitter.com/xssfox
- Screaming in the Cloud: http://ScreamingintheCloud.com

Transcript
Starting point is 00:00:00 Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored by DigitalOcean, and this is important to me: no billing surprises. With simple, predictable pricing that's flat across 12 global data center regions and a UX developers around the world love, you can control your cloud infrastructure costs and have more time for your team to focus on growing your business.
Starting point is 00:00:59 See what businesses are building on DigitalOcean and get started for free at do.co slash screaming. That's do.co slash screaming. And my thanks to DigitalOcean for their continuing support of this ridiculous podcast. This week's episode is sponsored by Chaos Search. If you've ever tried managing Elasticsearch yourself, you know that it's of the devil. You have to manage a series of instances, you have to potentially deal with a managed service. What if all of that went away? Chaos Search does
Starting point is 00:01:32 that. It winds up taking the data that lives in your S3 buckets and indexing that and providing an Elastic Search compatible API. You don't have to manage infrastructure, you don't have to play stupid slap and tickle games with various licensing arrangements. And fundamentally, you wind up dealing with a
Starting point is 00:01:50 better user experience for roughly 80% less than you'll spend on managing actual Elasticsearch. Chaos Search is one of those rare companies where I don't just advertise for them. I actively recommend them to my clients because fundamentally they're hitting it out of the park. To learn more, look at chaossearch.io. Chaos Search is, of course, all in capital letters because despite chaos searching, they cannot find the caps lock key to turn it off. My thanks to Chaos Search for sponsoring this ridiculous podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Michael, who is similar to a previous guest, an Australian code terrorist.
Starting point is 00:02:33 Michael, welcome to the show. Hey, thanks for having me. So you wrote something a while back that really sort of took some of the serverless world by storm. It's a GitHub repo that you called BigBuckAWS. What does that do exactly? So it's sort of like a tech demo, I guess, of how you can abuse some of the inner workings of AWS, specifically Lambda. And it sort of allows you to stream content so like a mp4 hls stream without really paying for for much data um and it does that by abusing the ability that you can pull
Starting point is 00:03:19 your code back out of lambda yeah so the the thing is when you upload code into Lambda, they also let you download it again to view it in like the online editor, or if you have some sort of internal tool, you can use it. But the way it works internally in AWS is when you upload it, it ends up in an S3 bucket. You can grab that data back out from the S3 bucket through a signed URL, but that bucket is Amazon's bucket, not yours. So you don't get charged for it. So that's kind of the key part of it. And that always struck me as something very strange. I'm sure there's a technical reason behind it, but you get 75 gigs of Lambda storage per account. It's an Amazon bucket. There's no requester pays or anything like that.
Starting point is 00:04:06 And it just never occurred to me that, A, you could pull data back out of it, mostly because I guess I lack the appropriate level of code terrorism style imagination, but also because it's something that has always happened under the hood. Oh, that's where the code gets uploaded there and then it magically runs as your Lambda function.
Starting point is 00:04:22 I guess, starting at the beginning, what got you thinking down this path? I guess the key thing is I was waiting for an awfully large Lambda function to upload. And I was just looking at like, how big of a Lambda function can I actually upload and what sort of limitations there are. So that got me looking at the limits in that. And that's sort of when I started thinking down the path of why is there a 75 gig sort of limit to Lambda functions? There's probably a good reason for that. And from there I sort of thought, oh, it's because, you know, it's not hosted in your bucket. Unlike like CodeDeploy or OpsWorks where you've run the infrastructure, it's actually in Amazon's bucket. So that's why they've put that sort of cap on it. And then I thought about ways that that could possibly be, I guess, abused to store your own
Starting point is 00:05:10 stuff for free. Oh, it can be abused awesomely. This is also, I guess, this is not the first time that I've looked into that 75 gigabyte free storage option for, I guess, effectively stealing resources that they didn't think anyone would actually use in order to build something horrifying. Ben Kehoe and I have talked a bit about building out PackRatDB, which is how much of a database can you actually shove into an AWS account
Starting point is 00:05:35 without paying for anything? And we're continuing to explore what that might look like. And this was absolutely one of the single largest points of data store you can get. Everything else is about free tags on resources that don't cost anything, builds an awesome key value store, etc., etc. But this one sort of blows away everything else we've found so far,
Starting point is 00:05:56 as far as, hey, what can I get massive storage-wise without having to pay for any of it? Yeah, yeah, certainly. And the other thing is thinking about transfer costs. So a lot of places you can store data, but you somehow get charged for a request to pull it back out. And so that's what was a little bit weird with the Lambda functions. And I guess the only tricky part is making it usable for an end client
Starting point is 00:06:23 without having to have a Lambda function that pulls it out, does some data mangling it to get it into the right format and then send it back to the client. So that's sort of the only tricky part with using Lambda functions like that. Okay. So let's continue through the demo. You figured out that you could have 75 gigs of data just hanging out there and you could pull it back at no charge. Where'd you go from there? Right. So I remembered a blog post, a medium post by Laurent, and that blog post was about using Google Docs to store video. And they used the HLS format to basically skip through to the section that actually contained the video content. So in their example, they were uploading the videos as sort of inside a PNG file, and then they could skip through to that.
Starting point is 00:07:18 Their purpose is they wanted to hide it away from Google. So Google thought that it was just a picture, didn't try to do any sort of copyright data matching on it. And from there, I thought, hey, I could probably use that inside Lambda to make use of the storage. So I guess the key problem with Lambda is you need to upload a zip file. So you need to somehow have the video content in the zip file, but still accessible by the client. So for that, what we do is we take the zip file and we actually compress it with zero compression. So depending on your zip utility, there's lots of ways of enabling that.
Starting point is 00:08:00 But basically you say, zip it up, but don't compress it. So the entire file is there uncompressed. You just sort of need to jump to the right sort of byte offset. And that's where the HLS stream comes in. You can say, hey, just jump to this part of the file and you can skip all the zip header stuff. Right. So you have the header itself that's there, but effectively when people are saying they're looking for a great compression algorithm, you were looking for the exact opposite of that. Correct, yeah. We want something that's not compressed that way because the video client doesn't know how to deal with the compression.
Starting point is 00:08:32 So yeah, if we have it not compressed, then that works better in our favor. Gotcha. So you wind up then effectively having one giant object sitting there, but you can also do the byte offset to tell the video player where to get it from. Is that using the byte range stuff that is in S3's get API, or is it using something else? Yeah, so if you do a normal get to S3, it uses the bytes range header as part of the request,
Starting point is 00:08:58 so it'll skip through to that. Yeah, I think that's one of those things that people aren't generally aware exists, where you don't need to pull the entire object down. You can just say, give me this very specific portion of it. Yeah, it's a very handy feature for more production-like workloads. So you wound up then putting this behind an API gateway and then hooking that up to a Lambda function. Yep. Have you looked at all into their new HTTP API option, which I think is now in beta?
Starting point is 00:09:27 They talked about it a lot at reInvent, but I haven't had the chance to play with it myself yet. Yeah, so I actually tried to, because I thought this would be a brilliant sort of demo of testing that out. And I tried to set that up and I followed all the steps and I just could not get the API to return anything but like a 403 or a 500 or something. So I clearly... That sounds like most of my early explorations with
Starting point is 00:09:51 API Gateway start to finish until I started just using something like serverless framework to wrap it for me. I feel like for a long time that most of what the stuff I was getting from API Gateway was just a comedy of errors. It was not the most intuitive thing to learn, and I'm disheartened to hear that that potentially is what we're seeing from the new version as well. Yeah, I'm not sure. I feel like I've probably done something wrong. There's probably some key part of documentation,
Starting point is 00:10:20 or possibly I just didn't wait long enough for DNS to propagate or whatnot for the new API. But I quickly sort of jumped to... Who has the patience for that? Yeah, yeah. This episode is sponsored by ExtraHop. ExtraHop provides threat detection and response for the enterprise, not the starship.
Starting point is 00:10:39 On-prem security doesn't translate well to cloud or multi-cloud environments, and that's not even counting IoT. ExtraHop automatically discovers everything inside the perimeter, including your cloud workloads and IoT devices, detects these threats up to 35% faster, and helps you act immediately. Ask for a free trial of detection and response for AWS today at extrahop.com slash trial. Okay, so once this video is up and running and you give someone a URL that's fronted by an API gateway, that's awesome. But what sort of, I guess, viewers or clients can wind up understanding that? Yeah, so one of the problems you have with this is because it's hosted in s3 um in their bucket they haven't
Starting point is 00:11:26 enabled cores for us and you know that's sort of not surprising uh but you can play in basically every sort of modern um modern hls sort of player so vlc and player um i even got it running in windows media player um the problem you have is you can't embed it into a web page because of because of course so you can link someone to it and they can open it in their player but they you can't open it in a browser forgive me if i'm remembering this backwards would it be possible to adjust that by modifying the payload response that winds up being returned maybe through the lambda function itself so yes but the problem is the only thing we're doing in the lambda response is providing a 302 and it's the s3 bucket of where all the lambda functions are stored that's where the cause needs
Starting point is 00:12:22 to be and we don't gotcha and if we want to continue to make aws eat the bill for it there's not really a great series of answers around it yeah yeah so in theory and this is where my knowledge falls apart um in theory i believe there's like a same origin or there's there's a html tag that you could potentially put in the video tags to get it to work but it seems that none of the out of the-the-box sort of HLS streaming JavaScript libraries that I looked at support doing that. And I don't know if it is possible, but it seems to imply that it is. What other limitations do you see in something like this? So you are limited in your chunk size. so you can only have like a 50 meg
Starting point is 00:13:07 file per function um and that sort of poses a problem if you're doing like 4k video because you won't be uh you won't have a very large or like a long duration of video content in that and that can cause some issues with some of the clients trying to buffer because they only sort of try to download the next couple of files, not based on time. So you end up with a sort of stuttering stuff. So for 1080 video, it seemed to work okay. I haven't tested it with 4K.
Starting point is 00:13:40 I did this all by hand by manually writing the HLS stream file. It's trivial to automate, but if I was to do a 4k one, I'd have hundreds and hundreds of lambda functions I'd need to upload. I'm guessing this is probably not intended for anything remotely resembling production use? Most certainly not. I imagine Amazon is going to do something to block this somehow. There's a few ways that I've thought of how they could do that, but I don't know sort of what approach they'll take because they do have to sort of remain having backwards compatibility with... Right.
Starting point is 00:14:21 They view APIs as promises, as they love telling us. So the question then becomes, what could they change to make this a non-workable solution? And of that list, which of those are viable without having to break existing functionality? Yeah. So the two that come to mind is they could put sort of requester pays on the S3 bucket. And that would basically just mean that you'd start paying the cost for for downloading that um i'm not sure if that's something they they would do the other one that comes to mind is if they uh did the zipping themselves as well so that it is actually compressed i'm not sure internally how that would go like if that would break things.
Starting point is 00:15:13 It would fix using this for HLS, but it could still be used for other data purposes. Yeah, it's one of those interesting areas where it would solve this particular use case, but there's nothing preventing someone from building out something like this that just grants, oh, run this one-liner and you'll get 75 gigs of storage per account. Yep. It's interesting to see how this might wind up, I guess, influencing AWS. I mean, there is always the option where they just decide that this is an acceptable loss. They've made something like, what, $32 billion in revenue last year.
Starting point is 00:15:45 Yeah, if people want to go through these kind of hoops okay they won we'll let them unless they start seeing widespread abuse of this which frankly i kind of have a hard time envisioning i don't know that this is necessarily going to be on the top of their list of things to chase down yeah i'm not certain about that because i guess one of the use cases for this is video piracy. So it could potentially be used to pirate movies and stuff as they sort of come out. And it means that the person doing the pirating or hosting the pirating, apart from having their Amazon account shut down, they don't really risk spending a huge amount of money. But the other use case I kind of thought about, I mean, I just did this for fun. I had no use for it.
Starting point is 00:16:29 But the other use case I thought of after I built this was those times where media outlets have this huge story and they want to release it to the world, but they know that they're going to take a pretty big hit in terms of hosting cost for it. They could just quickly do this and that would save them like probably millions if everyone's looking at the same video content. Right.
Starting point is 00:16:56 Effectively a freestyle CDN to some extent. The counterpoint is, is I can't really see any reputable media organization going down this path it just seems like it would be a little bit uh too far towards the what do you folks think you're doing model yes yeah yeah certainly yeah i can't imagine anyone doing it but i don't know that any way to cut costs i guess as soon as one person does it at that point i feel like they're no longer able to ignore this as just some weird proof of concept someone on the internet threw up. Yeah, certainly. I imagine this will last a
Starting point is 00:17:30 while. I imagine they'll sort of monitor it, maybe run some analytics on it, and then at some point, once it gets to a tipping point where it's worth changing, then they'll look into it. Right, and there's always the customer unfriendly approaches where they can just solve this entirely in their terms of service and after they find egregious users and start effectively turning off their AWS accounts, the message would probably get out. It feels like something that a company that isn't Amazon
Starting point is 00:17:56 would be likelier to do. Yeah, yeah. I imagine based on some of the previous sort of examples of this sort of, I guess, uh, code terrorism, uh, it,
Starting point is 00:18:10 it seems like they are more likely to just sort of eat up the, the bill, um, until it becomes a huge problem. Yeah. I feel like I need to highlight yet again, that this is not something that people should use for production use. For a while,
Starting point is 00:18:23 I was giving a talk, uh, called terrible ideas and get, something that people should use for production use. For a while, I was giving a talk called Terrible Ideas in Git. And I had a Docker container that was published and ready to be used for this just because resetting a whole bunch of Git repositories after you've mangled the hell out of them is obnoxious. Just run a Docker container every time you give the talk and things are great. The container was called Terrible Ideas. And I'm sure someone was using it for something in production because people do terribly stupid things without any rationale. Similar to the time where I started making jokes about using route 53 as a database. And I started getting people responding with, well, that's not the worst idea in the world. What if we did it like this? And it's no, no,
Starting point is 00:19:00 no, no, no. That at some point the joke takes on a life of its own but you kind of want to at least keep the sharp edges away from people who may not understand yes what it exactly is they're doing yeah and that's why i tried to put a fairly decent write-up on how it works and also the limitations sort of a disclaimer to say hey this probably won't work in the future. But I am very scared about how many stars this has gotten on GitHub. And there's apparently three forks of it already. So hopefully no one's actually using this in the production sense.
Starting point is 00:19:38 One would very much like to hope. The counter argument, though, is that people will always surprise you with what ridiculous things they're doing. Back when I was doing open source development work on SaltStack, I counter argument, though, is that people will always surprise you with what ridiculous things they're doing. Back when I was doing open source development work on SaltStack, I figured that, oh, the problem clearly is that everyone who's used Poppet or Shaft or anything like that was just, oh, they weren't very good at what they do and the tool was not adequate. We've built this thing. It's going to be amazing. And that lasted right until I saw my first customer use case where, oh, it turns out that anything is a hammer if you hold it wrong. It's difficult to get people to see the vision.
Starting point is 00:20:11 And I feel like the things that you build never survive encounters with other people's horrifying use cases. Yeah. A lot of the things I have built in the past have been, I guess, horrible, horrible things, mostly just for fun, just seeing how far you can take a tool to work. I guess an example of that is I have built in the past a Lambda function that works as a custom resource in CloudFormation that starts a Mechanical Turk instance question and provides access keys and secret keys. So you can free text, create your CloudFormation and just say, hey, can you build me an S3 bucket? And it will fire off a Mechanical Turk request
Starting point is 00:21:01 and ask someone on Mechanical Turk to build the S3 bucket for you. On some level, you have to wonder at what point they just automate a lot of these common solutions into something that is AWS solutions option or a quick start versus how much of it is something like AWS IQ, where you can effectively pay people a few bucks to do common or uncommon things as the case may be. I would not put this past being wrapped around an official AWS offering at some point. Yeah, I can imagine.
Starting point is 00:21:33 I'm assuming that this is not the sort of thing that springs fully formed from I'm going to go online today on my first day on the job and go ahead and build something like this. Where does it come from? Where did you where did you wind up, I guess, starting down the path of thinking about creative use of services like this? I always think about limits and how can I at least make use of the limits, get to the sort of boundary, like that limit is set. How do I get right up to that edge and make the most out of this? So I guess every time I look
Starting point is 00:22:06 at something, if I see a limitation, I guess the prime example here is S3. When they first released it, I'm pretty sure they only charge for data for transfer. When they first released it, they didn't have any sort of billing for get requests or head requests or options and all of that stuff. All those requests weren't billed. So I heard this story a long time ago about someone that essentially used that as a database because those requests were free. They weren't really grabbing any data out of it. And that's sort of when Amazon had to add that limitation. I'm like, at some point, I really want to be doing that. I want to be the reason why Amazon puts in that limit. So every time I look at a new service when it's released, I look at the limits
Starting point is 00:22:57 and try and work out how can I use this to its fullest potential that Amazon never actually planned on it being used that way. Right. I want to be the exception case. How do I make that possible? Yeah, exactly. And you always think the, well, no one would actually go to the trouble of doing that stuff. Well, have you met me? I mean, your Lambda function is a whopping 36 lines all in, in Python. This is not, it's not a massive amount of code. It's not anything that is overly complex. I think it just requires looking at these things from a certain point of view that very
Starting point is 00:23:29 often the people building it never considered. Exactly. And I feel like there's probably a few other cases in Amazon where this sort of approach, like not this exact code, but this approach can be applied to. I haven't, I haven't been able to see those. It's just I happen to use Lambda enough that I have worked out the inner workings a bit more. But I'm sure there's other places where you can upload data and get it back down for free that could be abused, maybe not for video streaming, but at least as a free database. You almost start to wonder, okay, what is the upper bound of data
Starting point is 00:24:07 you can attach to an AWS support ticket? Yes. Because it does have an API. Yeah. One other thing I thought was kind of neat too, right around the same time that this came out, was someone did a whole write-up about how anything outside of the handler function in Lambda
Starting point is 00:24:22 ran with the full three gigs of resources and two vCPUs and wound up not billing or something on the order of that until it entered the handler that it was either unbilled or billed at a small fraction of what it was that was being charged. Remember what I'm talking about? Yeah, yeah, I've seen that and read that. And so, yeah, I think if I remember correctly, it's just like as soon as on that cold start,
Starting point is 00:24:51 it has full power, full memory to just set everything up. You know, it's designed for big Java applications and whatnot. So it can quickly get started. So that cold start time is really short and then run. But you can sort of abuse that by running your code outside the handler and do that. And a lot of the stuff I do professionally, you know, we do a lot of stuff outside the handler for the, you know, setting everything up. But I never really thought about using that that extra time but i have wondered could you expand that to
Starting point is 00:25:26 be more useful as like a clustered like distributed computing system um it'd be really cool to see that expanded on because amazon gave that the the tick of approval to say that's fine so yeah they said have fun on multiple uh folks who are in a position to be authoritative on Twitter said, yeah, go for it. See what you can build. So, all right, challenge accepted. This is the danger of goading people on who have very little sense of, oh, I shouldn't do that. They wouldn't like it. Oh, no. At some point, you've got to get portions of that AWS bill back. Yes, yes, certainly. Amazon is definitely not losing out on this.
Starting point is 00:26:06 If you're using any of their services, they're winning. Absolutely. I've yet to see a single exploit like this that didn't result in, yes, and it winds up causing a slight discounting on my bill that is already a phone number. This is very much a rounding error, even a per user account for most of these things.
Starting point is 00:26:23 Yes. So we'll see. So if people want to discover more of your various acts of code terrorism, where can they find you? They can find me on Twitter. So at XSSFox, X-Ray, C-R-R-C-R-L, Foxtrot, Oscar, X-Ray. Excellent. Thank you so much for taking the time to speak with me today.
Starting point is 00:26:44 I appreciate it. No worries. Thank you so much for taking the time to speak with me today. I appreciate it. No worries. Thank you. Michael, code terrorist at undisclosed location for obvious reasons. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave an excellent review on Apple Podcasts. If you've hated this podcast, please leave an excellent review on Apple Podcasts. this has been a humble pod production stay humble
